Overview

Py-Chaos-Agent is configured via a YAML file that defines agent behavior and failure injection parameters. The default configuration file is config.yaml in the application root.

Configuration Structure

agent:
  # Agent-level settings

failures:
  # Individual failure type configurations

Agent Configuration

agent.interval_seconds

Type: Integer
Default: None (required)
Description: How often (in seconds) the agent checks for potential chaos injections.

agent:
  interval_seconds: 10 # Check every 10 seconds

Lower values increase chaos frequency but consume more resources. Recommended range: 5-60 seconds.

agent.dry_run

Type: Boolean
Default: false
Description: When true, the agent logs what it would do without executing actual chaos.

agent:
  dry_run: true # Test configuration without actual chaos

Use cases:

  • Testing configuration changes
  • Validating probability settings
  • Demonstrating chaos behavior safely

Failure Configurations

All failure types share common parameters:

Common Parameters

enabled

Type: Boolean
Required: Yes
Description: Whether this failure type is active.

failures:
  cpu:
    enabled: false # Disable CPU chaos

probability

Type: Float (0.0 - 1.0)
Required: Yes
Description: Chance of injection occurring each interval.

failures:
  cpu:
    probability: 0.3 # 30% chance per check

Guidelines:

  • Start with low probabilities (0.1-0.3) for testing
  • Higher probabilities create more aggressive chaos
  • Probability of 1.0 means inject every interval
  • Probability of 0.0 effectively disables the failure
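The per-interval roll can be pictured with a small sketch. This is an illustration only, not the agent's actual code: the function names should_inject and run_check are hypothetical, and it assumes one independent random draw per failure type per check, with dry_run only changing what happens after a successful roll.

```python
import random


def should_inject(probability: float, rng: random.Random) -> bool:
    """Return True with the given probability (one roll per check)."""
    # rng.random() is uniform on [0.0, 1.0), so 0.0 never fires and 1.0 always fires.
    return rng.random() < probability


def run_check(name: str, probability: float, dry_run: bool, rng: random.Random) -> str:
    """Describe what a single check would do for one failure type."""
    if not should_inject(probability, rng):
        return "skip"
    if dry_run:
        # Hypothetical log text: dry-run reports the decision without acting on it.
        return f"dry-run: would inject {name}"
    return f"inject {name}"
```

Passing the random generator in explicitly keeps the roll reproducible when a fixed seed is used, which helps when validating probability settings.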

duration_seconds

Type: Integer
Required: For CPU, Memory, Network
Description: How long the chaos effect lasts.

failures:
  cpu:
    duration_seconds: 5 # Hog CPU for 5 seconds
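A quick way to catch mistakes in the common parameters is to check the parsed configuration before starting the agent. The sketch below is a hypothetical helper (validate_config is not part of Py-Chaos-Agent) that works on an already-parsed mapping. It assumes that disabled failure types need no further fields, which matches the examples in this document but is not confirmed by the agent's own validation.

```python
def validate_config(config: dict) -> list:
    """Return a list of problems found in a parsed configuration mapping."""
    errors = []

    agent = config.get("agent") or {}
    interval = agent.get("interval_seconds")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int) or interval <= 0:
        errors.append("agent.interval_seconds must be a positive integer")

    needs_duration = {"cpu", "memory", "network"}
    for name, settings in (config.get("failures") or {}).items():
        settings = settings or {}
        enabled = settings.get("enabled")
        if not isinstance(enabled, bool):
            errors.append(f"failures.{name}.enabled must be true or false")
            continue
        if not enabled:
            continue  # assumption: disabled entries are not checked further

        prob = settings.get("probability")
        if isinstance(prob, bool) or not isinstance(prob, (int, float)) or not 0.0 <= prob <= 1.0:
            errors.append(f"failures.{name}.probability must be between 0.0 and 1.0")

        if name in needs_duration:
            duration = settings.get("duration_seconds")
            if isinstance(duration, bool) or not isinstance(duration, int) or duration <= 0:
                errors.append(f"failures.{name}.duration_seconds must be a positive integer")

    return errors
```

An empty list means no problems were found by these checks; it does not guarantee the agent will accept the file.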

CPU Failure

Spawns processes that consume CPU cycles.

Configuration

failures:
  cpu:
    enabled: true
    probability: 0.3
    duration_seconds: 5
    cores: 2

Parameters

cores

Type: Integer
Default: 1
Description: Number of CPU cores to stress simultaneously.

Examples:

cores: 1   # Light CPU stress
cores: 2   # Moderate CPU stress
cores: 4   # Heavy CPU stress

Considerations:

  • Setting cores higher than available CPU can cause severe performance degradation
  • Use dry-run to verify impact before production testing
  • CPU stress blocks the agent while it runs; avoid very long durations
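One common way to stress several cores is to run one busy-loop process per core and wait for all of them. The sketch below shows that general technique under stated assumptions; it is not taken from the agent's source. The names burn, effective_cores, and stress_cpu are hypothetical, and clamping to the host's core count is an illustrative safety choice that the agent may not make.

```python
import multiprocessing
import os
import time


def burn(duration_seconds: float) -> None:
    """Busy-loop on one core until the deadline passes."""
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        pass


def effective_cores(requested: int) -> int:
    """Clamp the requested core count to what the host reports (at least 1)."""
    available = os.cpu_count() or 1
    return max(1, min(requested, available))


def stress_cpu(cores: int, duration_seconds: float) -> None:
    """Run one busy-loop process per core and block until all have finished."""
    workers = [
        multiprocessing.Process(target=burn, args=(duration_seconds,))
        for _ in range(effective_cores(cores))
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # the caller is blocked for roughly duration_seconds
```

Because stress_cpu joins its workers, the caller does nothing else for the whole duration, which is why long durations delay everything that follows.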

Example Scenarios

Light, frequent CPU spikes:

cpu:
  enabled: true
  probability: 0.5
  duration_seconds: 3
  cores: 1

Heavy, rare CPU stress:

cpu:
  enabled: true
  probability: 0.1
  duration_seconds: 30
  cores: 4

Memory Failure

Allocates and holds memory for a specified duration.

Configuration

failures:
  memory:
    enabled: true
    probability: 0.2
    duration_seconds: 10
    mb: 200

Parameters

mb

Type: Integer
Default: 100
Description: Amount of memory to allocate in megabytes.

Examples:

mb: 50    # Light memory pressure
mb: 200   # Moderate memory pressure
mb: 1024  # Heavy memory pressure (1GB)

Considerations:

  • Memory injection runs in a background thread
  • Allocated memory is filled with data to prevent OS optimization
  • Setting mb higher than available memory may cause OOM kills
  • The chaos agent itself requires memory; account for this
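The "filled with data" point matters because many operating systems only commit memory pages once they are written. A minimal sketch of that idea follows, with hypothetical names (allocate_mb, hold_memory) and an assumed 4096-byte page size; the agent's own allocation strategy may differ.

```python
import threading
import time

PAGE_SIZE = 4096  # assumption: typical page size, used only to touch every page


def allocate_mb(mb: int) -> bytearray:
    """Allocate mb megabytes and write to every page so the memory is actually used."""
    block = bytearray(mb * 1024 * 1024)
    for offset in range(0, len(block), PAGE_SIZE):
        block[offset] = 1
    return block


def hold_memory(mb: int, duration_seconds: float) -> threading.Thread:
    """Hold the allocation in a background thread, then release it."""

    def worker() -> None:
        block = allocate_mb(mb)
        time.sleep(duration_seconds)
        del block  # drop the only reference so the memory can be freed

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Running the hold in a background thread lets the caller continue immediately, in contrast to a blocking CPU stress call.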

Example Scenarios

Gradual memory pressure:

memory:
  enabled: true
  probability: 0.4
  duration_seconds: 15
  mb: 100

Sudden memory spike:

memory:
  enabled: true
  probability: 0.1
  duration_seconds: 5
  mb: 500

Process Failure

Terminates target processes by name or command line.

Configuration

failures:
  process:
    enabled: true
    probability: 0.3
    target_name: "target-app"

Parameters

target_name

Type: String
Required: Yes
Description: Name or substring to match against process names or command lines.

Examples:

target_name: "my_app"       # Kill the my_app process
target_name: "api_server"   # Kill the API server
target_name: "target-app"   # Kill a specific app

Matching behavior:

  • Case-insensitive substring matching
  • Matches against process name (e.g., python3)
  • Also matches against command line arguments
  • First matching process is terminated

Safety mechanisms:

  • Chaos agent never kills itself
  • Protected PIDs: agent process, parent, and all children
  • Processes with "chaos" or "agent.py" in command line are excluded
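The matching and safety rules above can be expressed as a pure function over a process listing. This is a sketch under assumptions: is_match and first_match are hypothetical names, the process list is passed in as (pid, name, cmdline) tuples rather than read from the system, and the "chaos" / "agent.py" exclusion is assumed to be case-insensitive.

```python
from typing import Iterable, List, Optional, Set, Tuple


def is_match(target_name: str, process_name: str, cmdline: List[str]) -> bool:
    """Case-insensitive substring match against the name or the command line."""
    needle = target_name.lower()
    joined = " ".join(cmdline).lower()
    # Safety rule: never match anything that looks like the chaos agent itself.
    if "chaos" in joined or "agent.py" in joined:
        return False
    return needle in process_name.lower() or needle in joined


def first_match(
    target_name: str,
    processes: Iterable[Tuple[int, str, List[str]]],
    protected_pids: Set[int],
) -> Optional[int]:
    """Return the PID of the first eligible match, or None if there is none."""
    for pid, name, cmdline in processes:
        if pid in protected_pids:
            continue  # agent process, its parent, and its children
        if is_match(target_name, name, cmdline):
            return pid
    return None
```

Because matching is by substring, a short or common target_name can hit more processes than intended; checking the candidates in dry-run mode first is the safer route.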

Termination Process

  1. Send SIGTERM (graceful shutdown)
  2. Wait up to 3 seconds for termination
  3. If still running, send SIGKILL (forced termination)
  4. Wait up to 2 seconds for confirmation
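The four steps above follow a familiar terminate-then-kill pattern. The sketch below reproduces it for a child process started with subprocess, purely to keep the example self-contained; the agent targets arbitrary processes by PID, so its real implementation will differ. The function name terminate_gracefully and its return strings are hypothetical.

```python
import subprocess


def terminate_gracefully(
    proc: subprocess.Popen, term_wait: float = 3.0, kill_wait: float = 2.0
) -> str:
    """Send SIGTERM, wait, then fall back to SIGKILL if the process is still alive."""
    proc.terminate()  # step 1: SIGTERM on POSIX
    try:
        proc.wait(timeout=term_wait)  # step 2: wait up to 3 seconds
        return "terminated"
    except subprocess.TimeoutExpired:
        pass

    proc.kill()  # step 3: SIGKILL on POSIX
    try:
        proc.wait(timeout=kill_wait)  # step 4: wait up to 2 seconds
        return "killed"
    except subprocess.TimeoutExpired:
        return "still running"
```

Applications that need more than 3 seconds to shut down cleanly should expect to receive SIGKILL under this scheme.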

Security Considerations

Target Name Safety:

  • ✅ Use specific application names: "myapp", "api-server", "worker-service"
  • ❌ Avoid generic names: "python", "java", "node" (will be rejected)
  • Minimum 3 characters required for safety

Protected Processes: The chaos agent will never kill:

  • System critical processes (systemd, init)
  • Container infrastructure (dockerd, containerd, kubelet)
  • Network services (sshd, NetworkManager)
  • The chaos agent itself and its children
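The target name rules can be checked before a configuration is deployed. The following sketch is hypothetical: check_target_name is not an agent function, the generic-name set contains only the three examples named above, and exact (case-insensitive) comparison is an assumption about how rejection works.

```python
from typing import Optional

GENERIC_NAMES = {"python", "java", "node"}  # only the examples listed in this document
MIN_LENGTH = 3


def check_target_name(target_name: str) -> Optional[str]:
    """Return a reason the name is unsafe, or None if it passes these checks."""
    name = target_name.strip()
    if len(name) < MIN_LENGTH:
        return f"target_name must be at least {MIN_LENGTH} characters"
    if name.lower() in GENERIC_NAMES:
        return f"target_name '{name}' is too generic"
    return None
```

Passing these checks only means the name is not obviously dangerous; it says nothing about whether the name matches the intended process.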

Example Scenarios

High availability testing:

process:
  enabled: true
  probability: 0.5
  target_name: "api-server"

Rare catastrophic failure:

process:
  enabled: true
  probability: 0.05
  target_name: "database"

Network Failure

Injects network latency using Linux traffic control (tc).

Configuration

failures:
  network:
    enabled: true
    probability: 0.25
    duration_seconds: 10
    interface: "eth0"
    delay_ms: 300

Parameters

interface

Type: String
Default: "eth0"
Description: Network interface to apply latency to.

Common values:

  • eth0 - Primary Ethernet interface
  • wlan0 - Wireless interface
  • lo - Loopback (for testing)

Finding your interface:

# Inside the container
ip addr show

# Or
ifconfig

delay_ms

Type: Integer
Default: 100
Description: Network latency to add in milliseconds.

Examples:

delay_ms: 50    # Slight delay (good connection)
delay_ms: 200   # Noticeable delay (poor connection)
delay_ms: 1000  # Severe delay (1 second)

Real-world equivalents:

  • 10-50ms: Local network
  • 50-150ms: Cross-country
  • 150-300ms: Intercontinental
  • 300+ms: Satellite connection

Requirements

Kubernetes:

securityContext:
  capabilities:
    add: ["NET_ADMIN"]

Docker:

privileged: true # Or, more narrowly, add only the NET_ADMIN capability (cap_add)

Cleanup

Network rules are automatically cleaned up:

  • On agent shutdown (SIGTERM/SIGINT)
  • Before each new network injection
  • On startup (removes any leftover rules)
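Latency injection and cleanup with tc usually come down to adding and deleting a netem queueing discipline on the interface. The sketch below only builds the command lines and does not run them. The exact commands the agent issues are an assumption; the function names are hypothetical, while the tc syntax shown is the standard netem form.

```python
from typing import List


def netem_add_command(interface: str, delay_ms: int) -> List[str]:
    """Build the tc command that adds a fixed delay to all egress traffic."""
    return [
        "tc", "qdisc", "add", "dev", interface,
        "root", "netem", "delay", f"{delay_ms}ms",
    ]


def netem_clear_command(interface: str) -> List[str]:
    """Build the tc command that removes the root qdisc, and the delay with it."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]
```

Running either command requires the NET_ADMIN capability described under Requirements. The clear command fails when no rule is present, so cleanup code normally tolerates a non-zero exit status.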

Example Scenarios

Intermittent latency spikes:

network:
  enabled: true
  probability: 0.3
  duration_seconds: 5
  delay_ms: 200
  interface: "eth0"

Sustained poor connectivity:

network:
  enabled: true
  probability: 0.6
  duration_seconds: 30
  delay_ms: 500
  interface: "eth0"

Complete Configuration Examples

Conservative Testing

For initial resilience testing in non-critical environments:

agent:
  interval_seconds: 30
  dry_run: false

failures:
  cpu:
    enabled: true
    probability: 0.2
    duration_seconds: 5
    cores: 1

  memory:
    enabled: true
    probability: 0.15
    duration_seconds: 8
    mb: 100

  process:
    enabled: false # Disabled for initial testing

  network:
    enabled: true
    probability: 0.2
    duration_seconds: 10
    interface: "eth0"
    delay_ms: 150

Aggressive Testing

For chaos engineering in robust test environments:

agent:
  interval_seconds: 10
  dry_run: false

failures:
  cpu:
    enabled: true
    probability: 0.5
    duration_seconds: 8
    cores: 2

  memory:
    enabled: true
    probability: 0.4
    duration_seconds: 12
    mb: 300

  process:
    enabled: true
    probability: 0.3
    target_name: "target-app"

  network:
    enabled: true
    probability: 0.4
    duration_seconds: 15
    interface: "eth0"
    delay_ms: 400

Process-Only Testing

For testing application restart and recovery:

agent:
  interval_seconds: 20
  dry_run: false

failures:
  cpu:
    enabled: false

  memory:
    enabled: false

  process:
    enabled: true
    probability: 0.8
    target_name: "myapp"

  network:
    enabled: false

Network-Only Testing

For testing distributed system behavior under latency:

agent:
  interval_seconds: 15
  dry_run: false

failures:
  cpu:
    enabled: false

  memory:
    enabled: false

  process:
    enabled: false

  network:
    enabled: true
    probability: 0.6
    duration_seconds: 20
    interface: "eth0"
    delay_ms: 300

Configuration in Kubernetes

Using ConfigMaps

apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-config
data:
  config.yaml: |
    agent:
      interval_seconds: 15
      dry_run: false
    failures:
      # ... your configuration

Mount as volume:

volumes:
  - name: config
    configMap:
      name: chaos-config

volumeMounts:
  - name: config
    mountPath: /app/config.yaml
    subPath: config.yaml

Dynamic Updates

Running pods do not pick up ConfigMap changes when the file is mounted with subPath, so edit the ConfigMap and then recreate the pod:

# Edit ConfigMap
kubectl edit configmap chaos-config -n chaos-demo

# Delete pod to pick up new config
kubectl delete pod -n chaos-demo -l app=resilient-app

Best Practices

  1. Start with dry-run: Always test configuration with dry_run: true first

  2. Gradual probability increase: Begin with low probabilities (0.1-0.2) and increase gradually

  3. Monitor during testing: Watch metrics and application logs during chaos experiments

  4. Document baselines: Record normal system behavior before introducing chaos

  5. Single failure type: Test one failure type at a time initially

  6. Reasonable durations: Keep durations short (5-15 seconds) for most scenarios

  7. Consider dependencies: Account for application startup time when configuring process kills

  8. Network interface verification: Confirm correct interface name before network testing

  9. Resource awareness: Don't allocate more resources than available (especially memory)

  10. Version control: Keep configurations in git with descriptive commit messages

Troubleshooting

Configuration not loading

# Verify YAML syntax
python -c "import yaml; yaml.safe_load(open('config.yaml'))"

# Check file permissions
ls -l config.yaml

# Verify mount in Kubernetes
kubectl exec -it <pod-name> -c chaos-agent -- cat /app/config.yaml

Chaos not triggering

  • Check enabled: true for desired failure types
  • Verify probability is > 0
  • Confirm dry_run is false
  • Check agent logs for error messages
  • Verify required capabilities (NET_ADMIN for network)

Unexpected behavior

  • Review probability calculations (expected injections per minute = probability × checks per minute, where checks per minute = 60 / interval_seconds)
  • Confirm durations aren't overlapping
  • Check for resource constraints on chaos agent pod
  • Verify target names match actual processes
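The probability arithmetic is easy to get wrong by intuition, so a few lines of code can help. This sketch assumes checks happen exactly every interval_seconds and that each check is an independent roll; a blocking injection would stretch the real interval. All function names are hypothetical.

```python
def checks_per_minute(interval_seconds: float) -> float:
    """Number of checks the agent performs per minute."""
    return 60 / interval_seconds


def expected_injections_per_minute(probability: float, interval_seconds: float) -> float:
    """Average number of injections per minute for one failure type."""
    return probability * checks_per_minute(interval_seconds)


def chance_of_at_least_one(probability: float, checks: int) -> float:
    """Probability that at least one of the given number of checks injects."""
    return 1 - (1 - probability) ** checks
```

For example, probability 0.5 with interval_seconds 10 gives 6 checks per minute and an average of 3 injections per minute. If each injection lasts longer than the interval, successive injections can overlap.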