Configuration

Overview¶

Py-Chaos-Agent is configured via a YAML file that defines agent behavior and failure injection parameters. The default configuration file is config.yaml in the application root.

Configuration Structure¶

agent:
  # Agent-level settings

failures:
  # Individual failure type configurations

Agent Configuration¶

`agent.interval_seconds`¶

Type: Integer
Default: None (required)
Description: How often (in seconds) the agent checks for potential chaos injections.

agent:
  interval_seconds: 10 # Check every 10 seconds

Lower values increase chaos frequency but consume more resources. Recommended range: 5-60 seconds.

`agent.dry_run`¶

Type: Boolean
Default: false
Description: When true, the agent logs what it would do without executing actual chaos.

agent:
  dry_run: true # Test configuration without actual chaos

Use cases:

Testing configuration changes
Validating probability settings
Demonstrating chaos behavior safely
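
As a rough illustration of the dry-run behavior described above, the sketch below wraps an injection in a guard that logs instead of acting. The `run_injection` helper and its arguments are hypothetical names chosen for this example, not part of Py-Chaos-Agent's documented API.

```python
import logging

logger = logging.getLogger("chaos-sketch")


def run_injection(name, action, dry_run):
    """Run `action` unless dry_run is set; return True if chaos was executed.

    Hypothetical helper for illustration only.
    """
    if dry_run:
        # Dry run: record the decision but skip the side effect.
        logger.info("[dry-run] would inject %s", name)
        return False
    action()
    return True
```

With `dry_run=True` the callable is never invoked, so a configuration can be exercised end to end without touching the system.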

Failure Configurations¶

All failure types share common parameters:

Common Parameters¶

`enabled`¶

Type: Boolean
Required: Yes
Description: Whether this failure type is active.

failures:
  cpu:
    enabled: false # Disable CPU chaos

`probability`¶

Type: Float (0.0 - 1.0)
Required: Yes
Description: Chance of injection occurring each interval.

failures:
  cpu:
    probability: 0.3 # 30% chance per check

Guidelines:

Start with low probabilities (0.1-0.3) for testing
Higher probabilities create more aggressive chaos
Probability of 1.0 means inject every interval
Probability of 0.0 effectively disables the failure
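
The guidelines above are consistent with a simple per-interval random draw. The sketch below shows one way such a check could work; `should_inject` is a hypothetical name, and the agent's actual implementation may differ.

```python
import random


def should_inject(probability, rng=random):
    """Return True with the given probability for one interval check."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # random() returns a float in [0.0, 1.0), so 0.0 never fires
    # and 1.0 fires on every check.
    return rng.random() < probability
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which can help when validating probability settings.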

`duration_seconds`¶

Type: Integer
Required: For CPU, Memory, Network
Description: How long the chaos effect lasts.

failures:
  cpu:
    duration_seconds: 5 # Hog CPU for 5 seconds
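
The common parameters can be checked before the agent starts. The sketch below validates one already-parsed failure block (a plain dict). `validate_failure` is a hypothetical helper; it assumes that disabled blocks may omit the other keys, as the complete examples later in this document do, and that durations must be positive.

```python
DURATION_REQUIRED = ("cpu", "memory", "network")


def validate_failure(name, cfg):
    """Return a list of problems found in one failure block (empty if valid)."""
    enabled = cfg.get("enabled")
    if not isinstance(enabled, bool):
        return [f"{name}.enabled must be true or false"]
    if not enabled:
        return []  # Assumption: disabled blocks may omit the other keys.

    problems = []
    prob = cfg.get("probability")
    # bool is a subclass of int, so reject it explicitly before the range check.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)) or not 0.0 <= prob <= 1.0:
        problems.append(f"{name}.probability must be a number from 0.0 to 1.0")
    if name in DURATION_REQUIRED:
        dur = cfg.get("duration_seconds")
        # Assumption: a duration of zero or less is treated as invalid.
        if isinstance(dur, bool) or not isinstance(dur, int) or dur <= 0:
            problems.append(f"{name}.duration_seconds must be a positive integer")
    return problems
```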

CPU Failure¶

Spawns processes that consume CPU cycles.

Configuration¶

failures:
  cpu:
    enabled: true
    probability: 0.3
    duration_seconds: 5
    cores: 2

Parameters¶

`cores`¶

Type: Integer
Default: 1
Description: Number of CPU cores to stress simultaneously.

Examples:

cores: 1   # Light CPU stress
cores: 2   # Moderate CPU stress
cores: 4   # Heavy CPU stress

Considerations:

Setting cores higher than the number of available CPU cores can cause severe performance degradation
Use dry-run to verify impact before production testing
CPU stress blocks during execution; avoid very long durations
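
To make the `cores` and `duration_seconds` parameters concrete, here is a minimal sketch of a CPU hog that starts one busy-loop process per core. The function names are hypothetical, and the agent's own implementation may differ; the sketch only illustrates why the caller is blocked for the full duration.

```python
import multiprocessing
import time


def burn_cpu(duration_seconds):
    """Busy-loop on one core until the deadline passes."""
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        pass  # Spin without sleeping so the core stays busy.


def stress_cores(cores, duration_seconds):
    """Start one busy-loop process per core and wait for all of them."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(duration_seconds,))
        for _ in range(cores)
    ]
    for worker in workers:
        worker.start()
    # join() blocks the caller for roughly duration_seconds.
    for worker in workers:
        worker.join()
```

A call such as `stress_cores(cores=2, duration_seconds=5)` would match the configuration above. On platforms where `multiprocessing` uses the spawn start method, the call belongs under an `if __name__ == "__main__":` guard.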

Example Scenarios¶

Light, frequent CPU spikes:

cpu:
  enabled: true
  probability: 0.5
  duration_seconds: 3
  cores: 1

Heavy, rare CPU stress:

cpu:
  enabled: true
  probability: 0.1
  duration_seconds: 30
  cores: 4

Memory Failure¶

Allocates and holds memory for a specified duration.

Configuration¶

failures:
  memory:
    enabled: true
    probability: 0.2
    duration_seconds: 10
    mb: 200

Parameters¶

`mb`¶

Type: Integer
Default: 100
Description: Amount of memory to allocate in megabytes.

Examples:

mb: 50    # Light memory pressure
mb: 200   # Moderate memory pressure
mb: 1024  # Heavy memory pressure (1GB)

Considerations:

Memory injection runs in a background thread
Allocated memory is filled with data so the OS actually commits it (untouched allocations may not consume physical memory)
Setting mb higher than available memory may cause OOM kills
The chaos agent itself requires memory; account for this
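
The considerations above can be sketched as follows: allocate a buffer, write to every page so it is really backed by memory, and hold it in a background thread. The names `allocate_mb` and `hold_memory` and the 4096-byte page size are assumptions made for this example.

```python
import threading
import time

PAGE_SIZE = 4096  # Typical page size; an assumption, not read from the OS.


def allocate_mb(mb):
    """Allocate `mb` megabytes and write to every page so the memory is used."""
    block = bytearray(mb * 1024 * 1024)
    for offset in range(0, len(block), PAGE_SIZE):
        block[offset] = 1  # Touching each page forces the OS to back it.
    return block


def hold_memory(mb, duration_seconds):
    """Hold the allocation in a background thread, then release it."""
    def worker():
        block = allocate_mb(mb)
        time.sleep(duration_seconds)
        del block  # Drop the only reference so the memory can be freed.

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Because the work happens in a thread, `hold_memory` returns immediately; the caller can `join()` the returned thread if it needs to wait for the release.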

Example Scenarios¶

Gradual memory pressure:

memory:
  enabled: true
  probability: 0.4
  duration_seconds: 15
  mb: 100

Sudden memory spike:

memory:
  enabled: true
  probability: 0.1
  duration_seconds: 5
  mb: 500

Process Failure¶

Terminates target processes by name or command line.

Configuration¶

failures:
  process:
    enabled: true
    probability: 0.3
    target_name: "target-app"

Parameters¶

`target_name`¶

Type: String
Required: Yes
Description: Name or substring to match against process names or command lines.

Examples:

target_name: "my_app"       # Kill the my_app process
target_name: "api_server"   # Kill the API server
target_name: "target-app"   # Kill a specific app

Matching behavior:

Case-insensitive substring matching
Matches against process name (e.g., python3)
Also matches against command line arguments
First matching process is terminated

Safety mechanisms:

Chaos agent never kills itself
Protected PIDs: agent process, parent, and all children
Processes with "chaos" or "agent.py" in command line are excluded
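
The matching and safety rules above can be expressed as a small pure function. This is a sketch under stated assumptions: `ProcInfo` is a hypothetical snapshot type standing in for whatever process data the agent really inspects, and the exclusion check is applied case-insensitively, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class ProcInfo:
    """Hypothetical snapshot of one process (not the agent's real data type)."""
    pid: int
    name: str
    cmdline: str


def find_target(processes, target_name, protected_pids):
    """Return the first process matching target_name, or None."""
    needle = target_name.lower()
    for proc in processes:
        name = proc.name.lower()
        cmdline = proc.cmdline.lower()
        if proc.pid in protected_pids:
            continue  # Agent, parent, and children are never candidates.
        if "chaos" in cmdline or "agent.py" in cmdline:
            continue  # Excluded by the command-line safety rule.
        if needle in name or needle in cmdline:
            return proc  # First match wins.
    return None
```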

Termination Process¶

Send SIGTERM (graceful shutdown)
Wait up to 3 seconds for termination
If still running, send SIGKILL (forced termination)
Wait up to 2 seconds for confirmation
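
The escalation steps above can be sketched with the standard library. This example operates on a child process started with `subprocess.Popen`, which is an assumption made to keep it self-contained; the agent targets arbitrary processes and may use a different mechanism. `terminate_gracefully` is a hypothetical name.

```python
import subprocess


def terminate_gracefully(proc, term_timeout=3.0, kill_timeout=2.0):
    """Stop a child process: SIGTERM first, SIGKILL if it does not exit in time.

    Returns "terminated" or "killed". Raises subprocess.TimeoutExpired if the
    process is still alive after the kill timeout.
    """
    proc.terminate()  # Sends SIGTERM on POSIX.
    try:
        proc.wait(timeout=term_timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # Sends SIGKILL on POSIX.
        proc.wait(timeout=kill_timeout)
        return "killed"
```

The default timeouts mirror the 3-second and 2-second waits listed above.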

Security Considerations¶

Target Name Safety:

✅ Use specific application names: "myapp", "api-server", "worker-service"
❌ Avoid generic names: "python", "java", "node" (will be rejected)
Minimum 3 characters required for safety

Protected Processes: The chaos agent will never kill:

System critical processes (systemd, init)
Container infrastructure (dockerd, containerd, kubelet)
Network services (sshd, NetworkManager)
The chaos agent itself and its children
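
The target-name rules in this section could be enforced with a check like the one below. It is only a sketch: `validate_target_name` is a hypothetical helper, and the generic-name set contains just the three examples given in the text, while the agent's real list may be longer.

```python
GENERIC_NAMES = {"python", "java", "node"}  # Examples from the text only.
MIN_LENGTH = 3


def validate_target_name(target_name):
    """Return the trimmed name, or raise ValueError if it is unsafe."""
    name = target_name.strip()
    if len(name) < MIN_LENGTH:
        raise ValueError(f"target_name must be at least {MIN_LENGTH} characters")
    if name.lower() in GENERIC_NAMES:
        raise ValueError(f"target_name {name!r} is too generic")
    return name
```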

Example Scenarios¶

High availability testing:

process:
  enabled: true
  probability: 0.5
  target_name: "api-server"

Rare catastrophic failure:

process:
  enabled: true
  probability: 0.05
  target_name: "database"

Network Failure¶

Injects network latency using Linux traffic control (tc).

Configuration¶

failures:
  network:
    enabled: true
    probability: 0.25
    duration_seconds: 10
    interface: "eth0"
    delay_ms: 300

Parameters¶

`interface`¶

Type: String
Default: "eth0"
Description: Network interface to apply latency to.

Common values:

eth0 - Primary Ethernet interface
wlan0 - Wireless interface
lo - Loopback (for testing)

Finding your interface:

# Inside the container
ip addr show

# Or
ifconfig

`delay_ms`¶

Type: Integer
Default: 100
Description: Network latency to add in milliseconds.

Examples:

delay_ms: 50    # Slight delay (good connection)
delay_ms: 200   # Noticeable delay (poor connection)
delay_ms: 1000  # Severe delay (1 second)

Real-world equivalents:

10-50ms: Local network
50-150ms: Cross-country
150-300ms: Intercontinental
300+ms: Satellite connection

Requirements¶

Kubernetes:

securityContext:
  capabilities:
    add: ["NET_ADMIN"]

Docker:

privileged: true # Or add NET_ADMIN capability

Cleanup¶

Network rules are automatically cleaned up:

On agent shutdown (SIGTERM/SIGINT)
Before each new network injection
On startup (removes any leftover rules)
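
For orientation, the sketch below builds the kind of `tc` commands that add and remove a fixed delay with the netem queueing discipline. The exact commands the agent runs are not shown in this document, so treat these as standard `tc`/netem invocations rather than the agent's own; the function names are hypothetical.

```python
def netem_add_command(interface, delay_ms):
    """Build the tc command that adds a fixed delay to egress traffic."""
    return [
        "tc", "qdisc", "add", "dev", interface,
        "root", "netem", "delay", f"{delay_ms}ms",
    ]


def netem_clear_command(interface):
    """Build the tc command that removes the root qdisc, undoing the delay."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Running them requires the NET_ADMIN capability described under Requirements.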

Example Scenarios¶

Intermittent latency spikes:

network:
  enabled: true
  probability: 0.3
  duration_seconds: 5
  delay_ms: 200
  interface: "eth0"

Sustained poor connectivity:

network:
  enabled: true
  probability: 0.6
  duration_seconds: 30
  delay_ms: 500
  interface: "eth0"

Complete Configuration Examples¶

Conservative Testing¶

For initial resilience testing in non-critical environments:

agent:
  interval_seconds: 30
  dry_run: false

failures:
  cpu:
    enabled: true
    probability: 0.2
    duration_seconds: 5
    cores: 1

  memory:
    enabled: true
    probability: 0.15
    duration_seconds: 8
    mb: 100

  process:
    enabled: false # Disabled for initial testing

  network:
    enabled: true
    probability: 0.2
    duration_seconds: 10
    interface: "eth0"
    delay_ms: 150

Aggressive Testing¶

For chaos engineering in robust test environments:

agent:
  interval_seconds: 10
  dry_run: false

failures:
  cpu:
    enabled: true
    probability: 0.5
    duration_seconds: 8
    cores: 2

  memory:
    enabled: true
    probability: 0.4
    duration_seconds: 12
    mb: 300

  process:
    enabled: true
    probability: 0.3
    target_name: "target-app"

  network:
    enabled: true
    probability: 0.4
    duration_seconds: 15
    interface: "eth0"
    delay_ms: 400

Process-Only Testing¶

For testing application restart and recovery:

agent:
  interval_seconds: 20
  dry_run: false

failures:
  cpu:
    enabled: false

  memory:
    enabled: false

  process:
    enabled: true
    probability: 0.8
    target_name: "myapp"

  network:
    enabled: false

Network-Only Testing¶

For testing distributed system behavior under latency:

agent:
  interval_seconds: 15
  dry_run: false

failures:
  cpu:
    enabled: false

  memory:
    enabled: false

  process:
    enabled: false

  network:
    enabled: true
    probability: 0.6
    duration_seconds: 20
    interface: "eth0"
    delay_ms: 300

Configuration in Kubernetes¶

Using ConfigMaps¶

apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-config
data:
  config.yaml: |
    agent:
      interval_seconds: 15
      dry_run: false
    failures:
      # ... your configuration

Mount as volume:

volumes:
  - name: config
    configMap:
      name: chaos-config

volumeMounts:
  - name: config
    mountPath: /app/config.yaml
    subPath: config.yaml

Dynamic Updates¶

A ConfigMap mounted with subPath is not refreshed inside a running container, so the pod must be recreated to pick up changes:

# Edit ConfigMap
kubectl edit configmap chaos-config -n chaos-demo

# Delete pod to pick up new config
kubectl delete pod -n chaos-demo -l app=resilient-app

Best Practices¶

Start with dry-run: Always test configuration with dry_run: true first
Gradual probability increase: Begin with low probabilities (0.1-0.2) and increase gradually
Monitor during testing: Watch metrics and application logs during chaos experiments
Document baselines: Record normal system behavior before introducing chaos
Single failure type: Test one failure type at a time initially
Reasonable durations: Keep durations short (5-15 seconds) for most scenarios
Consider dependencies: Account for application startup time when configuring process kills
Network interface verification: Confirm correct interface name before network testing
Resource awareness: Don't allocate more resources than available (especially memory)
Version control: Keep configurations in git with descriptive commit messages

Troubleshooting¶

Configuration not loading¶

# Verify YAML syntax
python -c "import yaml; yaml.safe_load(open('config.yaml'))"

# Check file permissions
ls -l config.yaml

# Verify mount in Kubernetes
kubectl exec -it <pod-name> -c chaos-agent -- cat /app/config.yaml

Chaos not triggering¶

Check enabled: true for desired failure types
Verify probability is > 0
Confirm dry_run is false
Check agent logs for error messages
Verify required capabilities (NET_ADMIN for network)

Unexpected behavior¶

Review probability calculations: expected injections per minute = probability × checks per minute, where checks per minute = 60 / interval_seconds
Confirm durations aren't overlapping
Check for resource constraints on chaos agent pod
Verify target names match actual processes
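
The first two checks in this list come down to simple arithmetic. The sketch below uses hypothetical helper names, and `can_overlap` assumes the injection runs in the background rather than blocking the agent loop.

```python
def expected_injections_per_minute(probability, interval_seconds):
    """Average number of injections per minute for one failure type."""
    checks_per_minute = 60 / interval_seconds
    return probability * checks_per_minute


def can_overlap(duration_seconds, interval_seconds):
    """True if an effect can still be running when the next check happens.

    Assumes the injection does not block the agent loop.
    """
    return duration_seconds > interval_seconds
```

For example, a probability of 0.3 with a 10-second interval gives 6 checks per minute and about 1.8 expected injections per minute.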
