Common Issues and Solutions¶
Agent Not Starting¶
Symptom¶
kubectl get pods -n chaos-demo
# NAME READY STATUS RESTARTS AGE
# resilient-app-xxx-xxx 1/2 CrashLoopBackOff 5 3m
Diagnosis¶
# Check agent logs
kubectl logs -n chaos-demo <pod-name> -c chaos-agent
# Common errors:
# - "FileNotFoundError: config.yaml"
# - "ModuleNotFoundError: No module named 'psutil'"
# - "PermissionError: [Errno 13]"
Solutions¶
Config file not found:
# Verify ConfigMap exists
kubectl get configmap -n chaos-demo chaos-config
# Verify mount path
kubectl describe pod -n chaos-demo <pod-name> | grep -A 10 Mounts
# Fix: Ensure ConfigMap is properly mounted
Missing dependencies:
# Rebuild Docker image with all dependencies
docker build -t py-chaos-agent:latest -f docker/Dockerfile .
# Verify requirements.txt includes all packages
cat requirements.txt
Permission issues:
# Ensure container runs as root for chaos operations
securityContext:
  runAsUser: 0
  capabilities:
    add: ["NET_ADMIN"]
Chaos Not Triggering¶
Symptom¶
# Metrics show only skipped injections
curl localhost:8000/metrics | grep chaos_injections_total
# chaos_injections_total{failure_type="cpu",status="skipped"} 100
# chaos_injections_total{failure_type="cpu",status="success"} 0
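A quick way to quantify the problem is to parse the counters and compute the skip rate. A minimal sketch (the metric names match the output above; the `skip_rate` helper is ours, for illustration only):

```python
import re

def skip_rate(metrics_text, failure_type="cpu"):
    """Compute the fraction of skipped injections for one failure type."""
    counts = {}
    pattern = re.compile(
        r'chaos_injections_total\{failure_type="%s",status="(\w+)"\}\s+(\d+)'
        % failure_type
    )
    for line in metrics_text.splitlines():
        m = pattern.match(line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    total = sum(counts.values())
    return counts.get("skipped", 0) / total if total else 0.0

sample = '''chaos_injections_total{failure_type="cpu",status="skipped"} 100
chaos_injections_total{failure_type="cpu",status="success"} 0'''
print(skip_rate(sample))  # 1.0 -> every cycle was skipped
```

A skip rate near 1.0 points at configuration (dry-run, probability, disabled failures) rather than runtime errors, which would show up as `status="failed"` instead.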
Diagnosis¶
# Check configuration
kubectl get configmap -n chaos-demo chaos-config -o yaml
# Check agent logs for clues
kubectl logs -n chaos-demo <pod-name> -c chaos-agent -f
Common Causes and Fixes¶
1. Dry-run mode enabled:
# Check config
agent:
  dry_run: true  # This prevents actual chaos!

# Fix: Set to false
agent:
  dry_run: false
2. Low probability:
# With probability 0.1 and interval 30s:
# Expected: ~12 injections per hour (120 checks x 0.1)
failures:
  cpu:
    probability: 0.1  # Very low!

# Fix: Increase for testing
failures:
  cpu:
    probability: 0.5  # Much more likely
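Assuming the agent makes one independent injection roll per check interval, the expected hourly rate is just checks-per-hour times probability (the helper name is ours):

```python
def expected_injections_per_hour(probability, interval_seconds):
    """Each interval the agent rolls the dice once, so the hourly
    expectation is (checks per hour) x (probability per check)."""
    checks_per_hour = 3600 / interval_seconds
    return checks_per_hour * probability

print(expected_injections_per_hour(0.1, 30))  # 12.0
print(expected_injections_per_hour(0.5, 30))  # 60.0
```

If you need roughly N injections per hour, solve for probability: N * interval_seconds / 3600.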
3. All failures disabled:
failures:
  cpu:
    enabled: false
  memory:
    enabled: false
  process:
    enabled: false
  network:
    enabled: false
# Fix: Enable at least one
4. Process not found:
# For process failures, target must exist
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "No killable process"
# Check what processes are running
kubectl exec -n chaos-demo <pod-name> -c target-app -- ps aux
# Fix: Update target_name to match actual process
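To see which candidate processes the agent could match, you can scan /proc yourself. This is a stdlib-only, Linux-only sketch; `find_processes` is an illustrative helper, not part of the agent:

```python
import os

def find_processes(target_name):
    """Scan /proc for processes whose command name contains target_name."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited while we were scanning
        if target_name in comm:
            matches.append((int(entry), comm))
    return matches

if not find_processes("target-app"):
    print("No process matching 'target-app' is visible from this container")
```

Note that /proc/&lt;pid&gt;/comm truncates names to 15 characters, so match on a prefix of the process name if your target's name is longer.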
Network Chaos Not Working¶
Symptom¶
[NETWORK] Failed: Operation not permitted
chaos_injections_total{failure_type="network",status="failed"} increasing
Diagnosis¶
# Check if NET_ADMIN capability is granted
kubectl get pod -n chaos-demo <pod-name> -o yaml | grep -A 5 capabilities
Solution¶
# Add NET_ADMIN capability
containers:
- name: chaos-agent
  securityContext:
    capabilities:
      add: ["NET_ADMIN"]

# Or use privileged mode (less secure)
containers:
- name: chaos-agent
  securityContext:
    privileged: true
Verify Fix¶
# Recreate pod
kubectl delete pod -n chaos-demo <pod-name>
# Check logs
kubectl logs -n chaos-demo <pod-name> -c chaos-agent -f
# Should see: [NETWORK] Adding Xms latency...
Process Kills Not Working¶
Symptom¶
# Agent logs report that no matching process was found
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "No killable process"
Diagnosis¶
# Check what processes are visible
kubectl exec -n chaos-demo <pod-name> -c chaos-agent -- ps aux
# If you don't see target app processes, check shareProcessNamespace
kubectl get pod -n chaos-demo <pod-name> -o yaml | grep shareProcessNamespace
Solution¶
# Enable process namespace sharing
spec:
  shareProcessNamespace: true  # CRITICAL for process chaos
  containers:
  - name: target-app
    # ...
  - name: chaos-agent
    # ...
Verify Fix¶
# Delete and recreate pod
kubectl delete pod -n chaos-demo <pod-name>
# Verify processes are visible
kubectl exec -n chaos-demo <pod-name> -c chaos-agent -- ps aux | grep target
High Failure Rate¶
Symptom¶
chaos_injections_total{failure_type="network",status="failed"} 45
chaos_injections_total{failure_type="network",status="success"} 5
Diagnosis¶
# Get detailed error messages
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "Failed:"
# Common errors:
# - "RTNETLINK answers: File exists" (network rules conflict)
# - "Cannot allocate memory" (insufficient memory)
# - "Command not found: tc" (missing tools)
Solutions¶
Network rules conflict:
# Clean up manually
kubectl exec -n chaos-demo <pod-name> -c chaos-agent -- tc qdisc del dev eth0 root 2>/dev/null
# Or restart pod
kubectl delete pod -n chaos-demo <pod-name>
Insufficient memory:
# Reduce memory allocation or increase pod limits
failures:
  memory:
    mb: 100  # Reduced from 500

# Or increase pod resources
resources:
  limits:
    memory: 1Gi  # Increased limit
Missing tools:
# Ensure Dockerfile installs all required tools
RUN apt-get update && apt-get install -y \
    iproute2 \
    procps \
    && rm -rf /var/lib/apt/lists/*
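To catch a missing tool at build time rather than at injection time, you can add a sanity check after the install step (a suggested addition, not part of the shipped Dockerfile):

```dockerfile
# Fail the build early if the chaos tooling is missing
RUN command -v tc && command -v ps || \
    (echo "required chaos tools missing" && exit 1)
```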
Metrics Not Appearing¶
Symptom¶
curl localhost:8000/metrics
# curl: (7) Failed to connect to localhost port 8000: Connection refused
Diagnosis¶
# Check if metrics server started
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "Metrics"
# Should see: [Metrics] Prometheus exporter running on :8000
# Check if port is exposed
kubectl get pod -n chaos-demo <pod-name> -o yaml | grep -A 5 ports
Solutions¶
Port not exposed:
# Expose the metrics port on the agent container
ports:
- containerPort: 8000
Port forward not working:
# Kill existing port forwards
pkill -f "port-forward"
# Create new port forward
kubectl port-forward -n chaos-demo <pod-name> 8000:8000
# Or use service
kubectl port-forward -n chaos-demo svc/resilient-app 8000:8000
Metrics server crashed:
# Check for exceptions in metrics.py
# Increase debugging
import logging
logging.basicConfig(level=logging.DEBUG)
Docker Compose Issues¶
Symptom¶
docker-compose up
# ERROR: for chaos-agent Cannot start service chaos-agent:
# error creating network: permission denied
Solutions¶
Permission denied:
# Run with sudo (Linux)
sudo docker-compose up
# Or add user to docker group
sudo usermod -aG docker $USER
newgrp docker
Containers can't communicate:
# Check network
docker network ls
docker network inspect py-chaos-agent_default
# Verify network mode in docker-compose.yml
network_mode: "service:target-app" # Correct
Volume mount issues:
# Check file exists
ls -la config.yaml
# Verify mount in docker-compose.yml
volumes:
  - ./config.yaml:/app/config.yaml:ro
# Debug inside container
docker-compose run --rm chaos-agent ls -la /app/
Kubernetes Deployment Issues¶
Image Pull Errors¶
Solutions:
# For local images: point Docker at minikube, or load into kind
eval $(minikube docker-env)                    # minikube
kind load docker-image py-chaos-agent:latest   # kind
# For ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin <ecr-url>
# Verify image exists
docker images | grep chaos-agent
Pod Stuck in Pending¶
kubectl describe pod -n chaos-demo <pod-name>
# Events:
# FailedScheduling: 0/3 nodes available: insufficient memory
Solutions:
# Reduce resource requests
resources:
  requests:
    cpu: 100m      # Reduced
    memory: 128Mi  # Reduced
  limits:
    cpu: 500m
    memory: 512Mi
Configuration Issues¶
Invalid YAML Syntax¶
kubectl apply -f config.yaml
# error: error parsing config.yaml: yaml: line 10: did not find expected key
Solutions:
# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('config.yaml'))"
# Use online validator: https://www.yamllint.com/
# Common issues:
# - Incorrect indentation (use spaces, not tabs)
# - Missing colons
# - Unquoted special characters
Configuration Not Loading¶
# Agent keeps using old configuration
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep CONFIG
# [CONFIG] Interval: 10s # Old value!
Solutions:
# ConfigMap updates don't auto-reload pods
# Must delete pod to pick up changes
kubectl delete pod -n chaos-demo <pod-name>
# Or use a rolling restart
kubectl rollout restart deployment -n chaos-demo resilient-app
# Verify ConfigMap updated
kubectl get configmap -n chaos-demo chaos-config -o yaml
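One common pattern to avoid manual restarts (an assumption about your manifests, not something this repo ships): hash the ConfigMap contents into a pod template annotation, so any config change also changes the template and triggers a rolling restart.

```yaml
# In the Deployment's pod template:
template:
  metadata:
    annotations:
      # Update this hash from your deploy script whenever chaos-config changes
      checksum/chaos-config: "<sha256-of-configmap-contents>"
```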
Performance Issues¶
High CPU Usage¶
Diagnosis:
# Check if CPU chaos is stuck
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep CPU
# Check metrics
curl localhost:8000/metrics | grep chaos_injection_active
# chaos_injection_active{failure_type="cpu"} 1 # Still active!
Solutions:
# Reduce CPU chaos intensity
failures:
  cpu:
    cores: 1             # Reduced from 4
    duration_seconds: 5  # Reduced from 30

# Or adjust interval
agent:
  interval_seconds: 60  # Increased from 10
Memory Leak¶
kubectl top pod -n chaos-demo
# NAME CPU(cores) MEMORY(bytes)
# resilient-app-xxx-xxx 100m 950Mi # Increasing!
Solutions:
# Check for stuck memory injections
curl localhost:8000/metrics | grep memory
# Set proper resource limits
resources:
  limits:
    memory: 512Mi  # Pod will be OOM killed if exceeded
# Restart pod periodically if needed
kubectl delete pod -n chaos-demo <pod-name>
Debugging Techniques¶
Enable Debug Logging¶
# Add to src/agent.py
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='[%(asctime)s] %(levelname)s: %(message)s'
)
Interactive Debugging¶
# Access chaos agent container
kubectl exec -it -n chaos-demo <pod-name> -c chaos-agent -- /bin/bash
# Test commands manually
tc qdisc show dev eth0
ps aux | grep target
python -c "import psutil; print(psutil.cpu_percent())"
Check System Resources¶
# Inside container
free -h # Memory
df -h # Disk
ip addr show # Network interfaces
ps auxf # Process tree
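The memory check also works from Python, which is handy when the container image lacks procps (stdlib-only and Linux-specific; the `meminfo` helper is ours):

```python
def meminfo():
    """Parse /proc/meminfo into a dict of kB values."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key] = int(rest.split()[0])  # first field is the value in kB
    return info

m = meminfo()
print(f"MemTotal: {m['MemTotal'] // 1024} MiB, "
      f"MemAvailable: {m['MemAvailable'] // 1024} MiB")
```

A steadily shrinking MemAvailable between injection cycles is the clearest signal that a memory injection failed to release its allocation.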
Validate Metrics Manually¶
# Python script to check metrics
import requests

response = requests.get('http://localhost:8000/metrics', timeout=5)
for line in response.text.splitlines():
    if 'chaos_' in line and not line.startswith('#'):
        print(line)
Getting Help¶
Before Opening an Issue¶
- Check logs:
kubectl logs -n chaos-demo <pod-name> -c chaos-agent > agent.log
kubectl describe pod -n chaos-demo <pod-name> > pod-describe.txt
- Collect configuration:
kubectl get configmap -n chaos-demo chaos-config -o yaml > config.yaml
kubectl get pod -n chaos-demo <pod-name> -o yaml > pod.yaml
- Get metrics:
curl localhost:8000/metrics > metrics.txt
- Document steps:
- What you tried
- What you expected
- What actually happened
- Environment details (Kubernetes version, cloud provider)
Issue Template¶
## Environment
- Kubernetes version:
- Cloud provider:
- Chaos agent version:
## Problem Description
[Clear description of the issue]
## Steps to Reproduce
1.
2.
3.
## Expected Behavior
[What should happen]
## Actual Behavior
[What actually happens]
## Logs
[Paste relevant logs]