Overview¶
Py-Chaos-Agent is configured via a YAML file that defines agent behavior and failure injection parameters. The default configuration file is config.yaml in the application root.
Configuration Structure¶
Agent Configuration¶
agent.interval_seconds¶
Type: Integer
Default: None (required)
Description: How often (in seconds) the agent checks for potential chaos injections.
Lower values increase chaos frequency but consume more resources. Recommended range: 5-60 seconds.
agent.dry_run¶
Type: Boolean
Default: false
Description: When true, the agent logs what it would do without executing actual chaos.
Use cases:
- Testing configuration changes
- Validating probability settings
- Demonstrating chaos behavior safely
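A minimal sketch of how a dry-run switch could gate an injection. The `run_injection` helper and its log format are hypothetical, introduced only for illustration; the agent's real internals are not shown here.

```python
from typing import Callable


def run_injection(name: str, action: Callable[[], None], dry_run: bool) -> str:
    """Run `action` unless dry_run is set; return a log-style message.

    Hypothetical helper for illustration, not the agent's real API.
    """
    if dry_run:
        # Dry run: describe the injection without executing it.
        return f"[DRY RUN] would inject {name}"
    action()
    return f"injected {name}"


calls = []
print(run_injection("cpu", lambda: calls.append("cpu"), dry_run=True))
print(run_injection("cpu", lambda: calls.append("cpu"), dry_run=False))
```

With `dry_run=True` the action is never called, so the probability settings can be observed in the logs without any side effects.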
Failure Configurations¶
All failure types share common parameters:
Common Parameters¶
enabled¶
Type: Boolean
Required: Yes
Description: Whether this failure type is active.
probability¶
Type: Float (0.0 - 1.0)
Required: Yes
Description: Chance of injection occurring each interval.
Guidelines:
- Start with low probabilities (0.1-0.3) for testing
- Higher probabilities create more aggressive chaos
- Probability of 1.0 means inject every interval
- Probability of 0.0 effectively disables the failure
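The guidelines above can be read as one random trial per interval. The sketch below assumes that model; `should_inject` is a hypothetical name, and the agent's actual sampling code may differ.

```python
import random


def should_inject(probability: float, rng: random.Random) -> bool:
    """Return True with the given probability (one trial per interval)."""
    # rng.random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always fires.
    return rng.random() < probability


rng = random.Random(42)  # seeded only to make the sketch repeatable
intervals = 1000
hits = sum(should_inject(0.2, rng) for _ in range(intervals))
print(f"{hits} injections in {intervals} intervals")
```

Over many intervals the hit count settles near `probability × intervals`, which is why low values such as 0.1-0.3 are a gentle starting point.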
duration_seconds¶
Type: Integer
Required: For CPU, Memory, Network
Description: How long the chaos effect lasts.
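As a rough illustration of the common parameters, the following sketch checks one failure block that has already been loaded into a dictionary. The function name `validate_failure` is hypothetical, and two rules are assumptions rather than documented behavior: disabled blocks skip the remaining checks, and `duration_seconds` must be positive.

```python
from typing import List

# Failure types that need duration_seconds, per the parameter description.
NEEDS_DURATION = {"cpu", "memory", "network"}


def validate_failure(kind: str, cfg: dict) -> List[str]:
    """Return a list of problems found in one failure block (empty if none)."""
    problems: List[str] = []
    enabled = cfg.get("enabled")
    if not isinstance(enabled, bool):
        return [f"{kind}.enabled must be true or false"]
    if not enabled:
        # Assumption: a disabled block needs no further parameters.
        return problems
    prob = cfg.get("probability")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)) or not 0.0 <= prob <= 1.0:
        problems.append(f"{kind}.probability must be a number in [0.0, 1.0]")
    if kind in NEEDS_DURATION:
        duration = cfg.get("duration_seconds")
        if isinstance(duration, bool) or not isinstance(duration, int) or duration <= 0:
            problems.append(f"{kind}.duration_seconds must be a positive integer")
    return problems


print(validate_failure("network", {"enabled": True, "probability": 1.5}))
```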
CPU Failure¶
Spawns processes that consume CPU cycles.
Configuration¶
Parameters¶
cores¶
Type: Integer
Default: 1
Description: Number of CPU cores to stress simultaneously.
Examples:
Considerations:
- Setting cores higher than the number of available CPUs can cause severe performance degradation
- Use dry-run to verify impact before production testing
- CPU stress blocks during execution; avoid very long durations
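One way to catch an oversized cores value before running is to compare it against the CPU count Python reports. This is a sketch with a hypothetical helper, `cores_warning`; it is not part of the agent.

```python
import os
from typing import Optional


def cores_warning(requested: int, available: Optional[int] = None) -> Optional[str]:
    """Return a warning if `requested` exceeds the available CPU count, else None."""
    if available is None:
        # os.cpu_count() can return None when the count cannot be determined.
        available = os.cpu_count() or 1
    if requested > available:
        return f"cores={requested} exceeds available CPUs ({available})"
    return None


print(cores_warning(2, available=4))
print(cores_warning(8, available=4))
```

Note that `os.cpu_count()` reports the host's logical CPUs and does not reflect container CPU limits, so in Kubernetes the pod's CPU limit should be compared by hand.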
Example Scenarios¶
Light, frequent CPU spikes:
Heavy, rare CPU stress:
Memory Failure¶
Allocates and holds memory for a specified duration.
Configuration¶
Parameters¶
mb¶
Type: Integer
Default: 100
Description: Amount of memory to allocate in megabytes.
Examples:
mb: 50 # Light memory pressure
mb: 200 # Moderate memory pressure
mb: 1024 # Heavy memory pressure (1GB)
Considerations:
- Memory injection runs in a background thread
- Allocated memory is filled with data to prevent OS optimization
- Setting mb higher than available memory may cause OOM kills
- The chaos agent itself requires memory; account for this
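The considerations above (background thread, memory filled with data) could be sketched as follows. The names `allocate_block` and `hold_memory` are hypothetical, and treating one megabyte as 1024 × 1024 bytes is an assumption.

```python
import threading
import time

MIB = 1024 * 1024  # assumption: "megabyte" means 1024 * 1024 bytes


def allocate_block(mb: int) -> bytearray:
    """Allocate `mb` MiB filled with non-zero bytes so the pages are touched."""
    return bytearray(b"\x01") * (mb * MIB)


def hold_memory(mb: int, duration_seconds: float) -> threading.Thread:
    """Hold `mb` MiB in a background thread for `duration_seconds`."""

    def worker() -> None:
        block = allocate_block(mb)
        time.sleep(duration_seconds)
        del block  # drop the reference so the memory can be reclaimed

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


t = hold_memory(mb=8, duration_seconds=0.1)
t.join()
```

Filling the buffer matters because a zero-filled allocation may not be backed by physical pages until it is written to.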
Example Scenarios¶
Gradual memory pressure:
Sudden memory spike:
Process Failure¶
Terminates target processes by name or command line.
Configuration¶
Parameters¶
target_name¶
Type: String
Required: Yes
Description: Name or substring to match against process names or command lines.
Examples:
target_name: "my_app" # Kills my_app process
target_name: "api_server" # Kills the API server
target_name: "target-app" # Kill specific app
Matching behavior:
- Case-insensitive substring matching
- Matches against process name (e.g., python3)
- Also matches against command line arguments
- First matching process is terminated
Safety mechanisms:
- Chaos agent never kills itself
- Protected PIDs: agent process, parent, and all children
- Processes with "chaos" or "agent.py" in command line are excluded
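The matching and safety rules above can be sketched over a plain list of process records. `Proc` and `find_target` are hypothetical names, and the real agent presumably reads live process data rather than a hard-coded list.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Proc(NamedTuple):
    pid: int
    name: str
    cmdline: str


# Command-line markers that exclude a process from selection.
EXCLUDED_MARKERS = ("chaos", "agent.py")


def find_target(target_name: str, procs: Iterable[Proc],
                protected_pids: Set[int]) -> Optional[Proc]:
    """Return the first process matching target_name, or None."""
    needle = target_name.lower()
    for proc in procs:
        if proc.pid in protected_pids:
            continue  # agent, parent, and children are never selected
        cmdline = proc.cmdline.lower()
        if any(marker in cmdline for marker in EXCLUDED_MARKERS):
            continue
        # Case-insensitive substring match on name or command line.
        if needle in proc.name.lower() or needle in cmdline:
            return proc  # first match wins
    return None


procs = [
    Proc(10, "python3", "python3 agent.py --config config.yaml"),
    Proc(11, "python3", "python3 /srv/My_App/server.py"),
    Proc(12, "my_app", "my_app --port 8080"),
]
print(find_target("my_app", procs, protected_pids=set()))
```

Because command lines are matched too, a short target_name can select an unrelated process whose path merely contains the substring, which is a good reason to keep names specific.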
Termination Process¶
- Send SIGTERM (graceful shutdown)
- Wait up to 3 seconds for termination
- If still running, send SIGKILL (forced termination)
- Wait up to 2 seconds for confirmation
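The escalation above can be sketched with the standard library. To stay self-contained, the example terminates a child process it starts itself. The helper name `terminate_gracefully` is hypothetical, and the agent (which targets other processes by name) may implement the same steps differently.

```python
import subprocess
import sys


def terminate_gracefully(proc: subprocess.Popen,
                         term_wait: float = 3.0,
                         kill_wait: float = 2.0) -> str:
    """Send SIGTERM, then SIGKILL if the process is still alive."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=term_wait)
        return "terminated"
    except subprocess.TimeoutExpired:
        pass
    proc.kill()  # SIGKILL on POSIX
    try:
        proc.wait(timeout=kill_wait)
        return "killed"
    except subprocess.TimeoutExpired:
        return "still running"


# Demo target: a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(terminate_gracefully(child))
```

An application that traps SIGTERM and takes longer than the first wait to shut down will be force-killed, so slow shutdown hooks are worth watching for in the logs.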
Security Considerations¶
Target Name Safety:
- ✅ Use specific application names: "myapp", "api-server", "worker-service"
- ❌ Avoid generic names: "python", "java", "node" (will be rejected)
- Minimum 3 characters required for safety
Protected Processes: The chaos agent will never kill:
- System critical processes (systemd, init)
- Container infrastructure (dockerd, containerd, kubelet)
- Network services (sshd, NetworkManager)
- The chaos agent itself and its children
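A sketch of the target-name checks described above. The function name is hypothetical, the generic-name set contains only the three examples listed here (the agent's real deny list may be longer), and exact-match comparison is an assumption.

```python
from typing import Optional

# Only the examples named in this section; the real list may differ.
GENERIC_NAMES = {"python", "java", "node"}
MIN_LENGTH = 3


def target_name_problem(target_name: str) -> Optional[str]:
    """Return why target_name is unsafe, or None if it passes both checks."""
    name = target_name.strip()
    if len(name) < MIN_LENGTH:
        return f"target_name must be at least {MIN_LENGTH} characters"
    if name.lower() in GENERIC_NAMES:
        return f"target_name '{name}' is too generic"
    return None


for candidate in ("api-server", "python", "ab"):
    print(candidate, "->", target_name_problem(candidate))
```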
Example Scenarios¶
High availability testing:
Rare catastrophic failure:
Network Failure¶
Injects network latency using Linux traffic control (tc).
Configuration¶
failures:
  network:
    enabled: true
    probability: 0.25
    duration_seconds: 10
    interface: "eth0"
    delay_ms: 300
Parameters¶
interface¶
Type: String
Default: "eth0"
Description: Network interface to apply latency to.
Common values:
- eth0: Primary Ethernet interface
- wlan0: Wireless interface
- lo: Loopback (for testing)
Finding your interface:
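One option, sketched here with Python's standard library, is to list the interface names the operating system reports. The helper name `list_interfaces` is hypothetical.

```python
import socket
from typing import List


def list_interfaces() -> List[str]:
    """Return the names of the network interfaces known to the OS."""
    # socket.if_nameindex() yields (index, name) pairs.
    return [name for _index, name in socket.if_nameindex()]


print(list_interfaces())
```

Run this inside the container or pod where the agent runs, since the interface names there can differ from those on the host.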
delay_ms¶
Type: Integer
Default: 100
Description: Network latency to add in milliseconds.
Examples:
delay_ms: 50 # Slight delay (good connection)
delay_ms: 200 # Noticeable delay (poor connection)
delay_ms: 1000 # Severe delay (1 second)
Real-world equivalents:
- 10-50ms: Local network
- 50-150ms: Cross-country
- 150-300ms: Intercontinental
- 300+ms: Satellite connection
Requirements¶
Kubernetes:
Docker:
Cleanup¶
Network rules are automatically cleaned up:
- On agent shutdown (SIGTERM/SIGINT)
- Before each new network injection
- On startup (removes any leftover rules)
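For orientation, latency injection and cleanup with tc are commonly done through the netem queueing discipline. The sketch below only builds the command lines and does not run them. Whether the agent uses exactly these invocations is an assumption, and the function names are hypothetical.

```python
from typing import List


def netem_add_command(interface: str, delay_ms: int) -> List[str]:
    """Build a tc command that adds a fixed delay on `interface`."""
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


def netem_cleanup_command(interface: str) -> List[str]:
    """Build a tc command that removes the root qdisc on `interface`."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


print(" ".join(netem_add_command("eth0", 300)))
print(" ".join(netem_cleanup_command("eth0")))
```

Running the delete command before each add keeps a new injection from failing because a rule from an earlier run is still installed, which matches the cleanup points listed above.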
Example Scenarios¶
Intermittent latency spikes:
Sustained poor connectivity:
Complete Configuration Examples¶
Conservative Testing¶
For initial resilience testing in non-critical environments:
agent:
  interval_seconds: 30
  dry_run: false
failures:
  cpu:
    enabled: true
    probability: 0.2
    duration_seconds: 5
    cores: 1
  memory:
    enabled: true
    probability: 0.15
    duration_seconds: 8
    mb: 100
  process:
    enabled: false # Disabled for initial testing
  network:
    enabled: true
    probability: 0.2
    duration_seconds: 10
    interface: "eth0"
    delay_ms: 150
Aggressive Testing¶
For chaos engineering in robust test environments:
agent:
  interval_seconds: 10
  dry_run: false
failures:
  cpu:
    enabled: true
    probability: 0.5
    duration_seconds: 8
    cores: 2
  memory:
    enabled: true
    probability: 0.4
    duration_seconds: 12
    mb: 300
  process:
    enabled: true
    probability: 0.3
    target_name: "target-app"
  network:
    enabled: true
    probability: 0.4
    duration_seconds: 15
    interface: "eth0"
    delay_ms: 400
Process-Only Testing¶
For testing application restart and recovery:
agent:
  interval_seconds: 20
  dry_run: false
failures:
  cpu:
    enabled: false
  memory:
    enabled: false
  process:
    enabled: true
    probability: 0.8
    target_name: "myapp"
  network:
    enabled: false
Network-Only Testing¶
For testing distributed system behavior under latency:
agent:
  interval_seconds: 15
  dry_run: false
failures:
  cpu:
    enabled: false
  memory:
    enabled: false
  process:
    enabled: false
  network:
    enabled: true
    probability: 0.6
    duration_seconds: 20
    interface: "eth0"
    delay_ms: 300
Configuration in Kubernetes¶
Using ConfigMaps¶
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-config
data:
  config.yaml: |
    agent:
      interval_seconds: 15
      dry_run: false
    failures:
      # ... your configuration
Mount as volume:
# In the pod spec
volumes:
  - name: config
    configMap:
      name: chaos-config
# In the chaos-agent container spec
volumeMounts:
  - name: config
    mountPath: /app/config.yaml
    subPath: config.yaml
Dynamic Updates¶
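To apply a configuration change, edit the ConfigMap and then recreate the pod. A file mounted with subPath does not receive ConfigMap updates in a running container, so the pod must be replaced: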
# Edit ConfigMap
kubectl edit configmap chaos-config -n chaos-demo
# Delete pod to pick up new config
kubectl delete pod -n chaos-demo -l app=resilient-app
Best Practices¶
- Start with dry-run: Always test configuration with dry_run: true first
- Gradual probability increase: Begin with low probabilities (0.1-0.2) and increase gradually
- Monitor during testing: Watch metrics and application logs during chaos experiments
- Document baselines: Record normal system behavior before introducing chaos
- Single failure type: Test one failure type at a time initially
- Reasonable durations: Keep durations short (5-15 seconds) for most scenarios
- Consider dependencies: Account for application startup time when configuring process kills
- Network interface verification: Confirm correct interface name before network testing
- Resource awareness: Don't allocate more resources than available (especially memory)
- Version control: Keep configurations in git with descriptive commit messages
Troubleshooting¶
Configuration not loading¶
# Verify YAML syntax
python -c "import yaml; yaml.safe_load(open('config.yaml'))"
# Check file permissions
ls -l config.yaml
# Verify mount in Kubernetes
kubectl exec -it <pod-name> -c chaos-agent -- cat /app/config.yaml
Chaos not triggering¶
- Check enabled: true for desired failure types
- Verify probability is > 0
- Confirm dry_run is false
- Check agent logs for error messages
- Verify required capabilities (NET_ADMIN for network)
Unexpected behavior¶
- Review probability calculations (probability × checks per minute gives the expected number of injections per minute)
- Confirm durations aren't overlapping
- Check for resource constraints on chaos agent pod
- Verify target names match actual processes
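The probability review in the first item can be made concrete with a little arithmetic. The sketch below assumes one independent trial per interval, and both function names are hypothetical.

```python
def expected_per_minute(probability: float, interval_seconds: float) -> float:
    """Expected injections per minute: probability x checks per minute."""
    checks_per_minute = 60.0 / interval_seconds
    return probability * checks_per_minute


def chance_of_at_least_one(probability: float, checks: int) -> float:
    """Chance that at least one of `checks` independent trials fires."""
    return 1.0 - (1.0 - probability) ** checks


# interval_seconds: 10 with probability: 0.5 gives 6 checks per minute.
print(expected_per_minute(0.5, 10))               # 3.0
print(round(chance_of_at_least_one(0.2, 10), 3))  # 0.893
```

A probability that looks small per check adds up quickly at short intervals: at 0.2 there is roughly an 89% chance of at least one injection within ten checks.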