Common Issues and Solutions¶
Agent Not Starting¶
Symptom¶
kubectl get pods -n chaos-demo
# NAME READY STATUS RESTARTS AGE
# resilient-app-xxx-xxx 1/2 CrashLoopBackOff 5 3m
Diagnosis¶
# Check agent logs
kubectl logs -n chaos-demo <pod-name> -c chaos-agent
# Common errors:
# - "FileNotFoundError: config.yaml"
# - "ModuleNotFoundError: No module named 'psutil'"
# - "PermissionError: [Errno 13]"
Solutions¶
Config file not found:
# Verify ConfigMap exists
kubectl get configmap -n chaos-demo chaos-config
# Verify mount path
kubectl describe pod -n chaos-demo <pod-name> | grep -A 10 Mounts
# Fix: Ensure ConfigMap is properly mounted
Missing dependencies:
# Rebuild Docker image with all dependencies
docker build -t py-chaos-agent:latest -f docker/Dockerfile .
# Verify requirements.txt includes all packages
cat requirements.txt
Permission issues:
# Ensure container runs as root for chaos operations
securityContext:
  runAsUser: 0
  capabilities:
    add: ["NET_ADMIN"]
Chaos Not Triggering¶
Symptom¶
# Metrics show only skipped injections
curl localhost:8000/metrics | grep chaos_injections_total
# chaos_injections_total{failure_type="cpu",status="skipped"} 100
# chaos_injections_total{failure_type="cpu",status="success"} 0
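A quick way to quantify the problem is to parse the counters and compute the skip rate. A minimal sketch (the metric names match the output above; the `skip_rate` helper is ours, for illustration only):

```python
import re

def skip_rate(metrics_text, failure_type="cpu"):
    """Compute the fraction of skipped injections for one failure type."""
    counts = {}
    pattern = re.compile(
        r'chaos_injections_total\{failure_type="%s",status="(\w+)"\}\s+(\d+)'
        % failure_type
    )
    for line in metrics_text.splitlines():
        m = pattern.match(line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    total = sum(counts.values())
    return counts.get("skipped", 0) / total if total else 0.0

sample = '''chaos_injections_total{failure_type="cpu",status="skipped"} 100
chaos_injections_total{failure_type="cpu",status="success"} 0'''
print(skip_rate(sample))  # 1.0 -> every cycle was skipped
```

A skip rate near 1.0 points at configuration (dry-run, probability, disabled failures) rather than runtime errors, which would show up as `status="failed"` instead.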
Diagnosis¶
# Check configuration
kubectl get configmap -n chaos-demo chaos-config -o yaml
# Check agent logs for clues
kubectl logs -n chaos-demo <pod-name> -c chaos-agent -f
Common Causes and Fixes¶
1. Dry-run mode enabled:
# Check config
agent:
  dry_run: true  # This prevents actual chaos!

# Fix: Set to false
agent:
  dry_run: false
2. Low probability:
# With probability 0.1 and interval 30s:
# Expected: ~12 injections per hour (120 checks x 0.1)
failures:
  cpu:
    probability: 0.1  # Very low!

# Fix: Increase for testing
failures:
  cpu:
    probability: 0.5  # Much more likely
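Assuming the agent makes one independent injection roll per check interval, the expected hourly rate is just checks-per-hour times probability (the helper name is ours):

```python
def expected_injections_per_hour(probability, interval_seconds):
    """Each interval the agent rolls the dice once, so the hourly
    expectation is (checks per hour) x (probability per check)."""
    checks_per_hour = 3600 / interval_seconds
    return checks_per_hour * probability

print(expected_injections_per_hour(0.1, 30))  # 12.0
print(expected_injections_per_hour(0.5, 30))  # 60.0
```

If you need roughly N injections per hour, solve for probability: N * interval_seconds / 3600.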
3. All failures disabled:
failures:
  cpu:
    enabled: false
  memory:
    enabled: false
  process:
    enabled: false
  network:
    enabled: false
# Fix: Enable at least one
4. Process not found:
# For process failures, target must exist
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "No killable process"
# Check what processes are running
kubectl exec -n chaos-demo <pod-name> -c target-app -- ps aux
# Fix: Update target_name to match actual process
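To see which candidate processes the agent could match, you can scan /proc yourself. This is a stdlib-only, Linux-only sketch; `find_processes` is an illustrative helper, not part of the agent:

```python
import os

def find_processes(target_name):
    """Scan /proc for processes whose command name contains target_name."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited while we were scanning
        if target_name in comm:
            matches.append((int(entry), comm))
    return matches

if not find_processes("target-app"):
    print("No process matching 'target-app' is visible from this container")
```

Note that /proc/&lt;pid&gt;/comm truncates names to 15 characters, so match on a prefix of the process name if your target's name is longer.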
Network Chaos Not Working¶
Symptom¶
[NETWORK] Failed: Operation not permitted
chaos_injections_total{failure_type="network",status="failed"} increasing
Diagnosis¶
# Check if NET_ADMIN capability is granted
kubectl get pod -n chaos-demo <pod-name> -o yaml | grep -A 5 capabilities
Solution¶
# Add NET_ADMIN capability
containers:
- name: chaos-agent
  securityContext:
    capabilities:
      add: ["NET_ADMIN"]

# Or use privileged mode (less secure)
containers:
- name: chaos-agent
  securityContext:
    privileged: true
Verify Fix¶
# Recreate pod
kubectl delete pod -n chaos-demo <pod-name>
# Check logs
kubectl logs -n chaos-demo <pod-name> -c chaos-agent -f
# Should see: [NETWORK] Adding Xms latency...
Process Kills Not Working¶
Symptom¶
# Agent logs report that no matching process was found
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "No killable process"
Diagnosis¶
# Check what processes are visible
kubectl exec -n chaos-demo <pod-name> -c chaos-agent -- ps aux
# If you don't see target app processes, check shareProcessNamespace
kubectl get pod -n chaos-demo <pod-name> -o yaml | grep shareProcessNamespace
Solution¶
# Enable process namespace sharing
spec:
  shareProcessNamespace: true  # CRITICAL for process chaos
  containers:
  - name: target-app
    # ...
  - name: chaos-agent
    # ...
Verify Fix¶
# Delete and recreate pod
kubectl delete pod -n chaos-demo <pod-name>
# Verify processes are visible
kubectl exec -n chaos-demo <pod-name> -c chaos-agent -- ps aux | grep target
High Failure Rate¶
Symptom¶
chaos_injections_total{failure_type="network",status="failed"} 45
chaos_injections_total{failure_type="network",status="success"} 5
Diagnosis¶
# Get detailed error messages
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "Failed:"
# Common errors:
# - "RTNETLINK answers: File exists" (network rules conflict)
# - "Cannot allocate memory" (insufficient memory)
# - "Command not found: tc" (missing tools)
Solutions¶
Network rules conflict:
# Clean up manually
kubectl exec -n chaos-demo <pod-name> -c chaos-agent -- tc qdisc del dev eth0 root 2>/dev/null
# Or restart pod
kubectl delete pod -n chaos-demo <pod-name>
Insufficient memory:
# Reduce memory allocation or increase pod limits
failures:
  memory:
    mb: 100  # Reduced from 500

# Or increase pod resources
resources:
  limits:
    memory: 1Gi  # Increased limit
Missing tools:
# Ensure Dockerfile installs all required tools
RUN apt-get update && apt-get install -y \
    iproute2 \
    procps \
    && rm -rf /var/lib/apt/lists/*
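To catch a missing tool at build time rather than at injection time, you can add a sanity check after the install step (a suggested addition, not part of the shipped Dockerfile):

```dockerfile
# Fail the build early if the chaos tooling is missing
RUN command -v tc && command -v ps || \
    (echo "required chaos tools missing" && exit 1)
```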
Metrics Not Appearing¶
Symptom¶
curl localhost:8000/metrics
# curl: (7) Failed to connect to localhost port 8000: Connection refused
Diagnosis¶
# Check if metrics server started
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep "Metrics"
# Should see: [Metrics] Prometheus exporter running on :8000
# Check if port is exposed
kubectl get pod -n chaos-demo <pod-name> -o yaml | grep -A 5 ports
Solutions¶
Port not exposed:
# Expose the metrics port on the agent container
ports:
- containerPort: 8000
Port forward not working:
# Kill existing port forwards
pkill -f "port-forward"
# Create new port forward
kubectl port-forward -n chaos-demo <pod-name> 8000:8000
# Or use service
kubectl port-forward -n chaos-demo svc/resilient-app 8000:8000
Metrics server crashed:
# Check for exceptions in metrics.py
# Increase debugging
import logging
logging.basicConfig(level=logging.DEBUG)
Docker Compose Issues¶
Symptom¶
docker-compose up
# ERROR: for chaos-agent Cannot start service chaos-agent:
# error creating network: permission denied
Solutions¶
Permission denied:
# Run with sudo (Linux)
sudo docker-compose up
# Or add user to docker group
sudo usermod -aG docker $USER
newgrp docker
Containers can't communicate:
# Check network
docker network ls
docker network inspect py-chaos-agent_default
# Verify network mode in docker-compose.yml
network_mode: "service:target-app" # Correct
Volume mount issues:
# Check file exists
ls -la config.yaml
# Verify mount in docker-compose.yml
volumes:
  - ./config.yaml:/app/config.yaml:ro
# Debug inside container
docker-compose run --rm chaos-agent ls -la /app/
Kubernetes Deployment Issues¶
Image Pull Errors¶
Solutions:
# For local images: point Docker at minikube, or load into kind
eval $(minikube docker-env)                    # minikube
kind load docker-image py-chaos-agent:latest   # kind
# For ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin <ecr-url>
# Verify image exists
docker images | grep chaos-agent
Pod Stuck in Pending¶
kubectl describe pod -n chaos-demo <pod-name>
# Events:
# FailedScheduling: 0/3 nodes available: insufficient memory
Solutions:
# Reduce resource requests
resources:
  requests:
    cpu: 100m      # Reduced
    memory: 128Mi  # Reduced
  limits:
    cpu: 500m
    memory: 512Mi
Configuration Issues¶
Invalid YAML Syntax¶
kubectl apply -f config.yaml
# error: error parsing config.yaml: yaml: line 10: did not find expected key
Solutions:
# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('config.yaml'))"
# Use online validator: https://www.yamllint.com/
# Common issues:
# - Incorrect indentation (use spaces, not tabs)
# - Missing colons
# - Unquoted special characters
Configuration Not Loading¶
# Agent keeps using old configuration
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep CONFIG
# [CONFIG] Interval: 10s # Old value!
Solutions:
# ConfigMap updates don't auto-reload pods
# Must delete pod to pick up changes
kubectl delete pod -n chaos-demo <pod-name>
# Or use a rolling restart
kubectl rollout restart deployment -n chaos-demo resilient-app
# Verify ConfigMap updated
kubectl get configmap -n chaos-demo chaos-config -o yaml
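One common pattern to avoid manual restarts (an assumption about your manifests, not something this repo ships): hash the ConfigMap contents into a pod template annotation, so any config change also changes the template and triggers a rolling restart.

```yaml
# In the Deployment's pod template:
template:
  metadata:
    annotations:
      # Update this hash from your deploy script whenever chaos-config changes
      checksum/chaos-config: "<sha256-of-configmap-contents>"
```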
Performance Issues¶
High CPU Usage¶
Diagnosis:
# Check if CPU chaos is stuck
kubectl logs -n chaos-demo <pod-name> -c chaos-agent | grep CPU
# Check metrics
curl localhost:8000/metrics | grep chaos_injection_active
# chaos_injection_active{failure_type="cpu"} 1 # Still active!
Solutions:
# Reduce CPU chaos intensity
failures:
  cpu:
    cores: 1             # Reduced from 4
    duration_seconds: 5  # Reduced from 30

# Or adjust interval
agent:
  interval_seconds: 60  # Increased from 10
Memory Leak¶
kubectl top pod -n chaos-demo
# NAME CPU(cores) MEMORY(bytes)
# resilient-app-xxx-xxx 100m 950Mi # Increasing!
Solutions:
# Check for stuck memory injections
curl localhost:8000/metrics | grep memory
# Set proper resource limits
resources:
  limits:
    memory: 512Mi  # Pod will be OOM killed if exceeded
# Restart pod periodically if needed
kubectl delete pod -n chaos-demo <pod-name>
Debugging Techniques¶
Enable Debug Logging¶
# Add to src/agent.py
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='[%(asctime)s] %(levelname)s: %(message)s'
)
Interactive Debugging¶
# Access chaos agent container
kubectl exec -it -n chaos-demo <pod-name> -c chaos-agent -- /bin/bash
# Test commands manually
tc qdisc show dev eth0
ps aux | grep target
python -c "import psutil; print(psutil.cpu_percent())"
Check System Resources¶
# Inside container
free -h # Memory
df -h # Disk
ip addr show # Network interfaces
ps auxf # Process tree
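The memory check also works from Python, which is handy when the container image lacks procps (stdlib-only and Linux-specific; the `meminfo` helper is ours):

```python
def meminfo():
    """Parse /proc/meminfo into a dict of kB values."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key] = int(rest.split()[0])  # first field is the value in kB
    return info

m = meminfo()
print(f"MemTotal: {m['MemTotal'] // 1024} MiB, "
      f"MemAvailable: {m['MemAvailable'] // 1024} MiB")
```

A steadily shrinking MemAvailable between injection cycles is the clearest signal that a memory injection failed to release its allocation.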
Validate Metrics Manually¶
# Python script to check metrics
import requests

response = requests.get('http://localhost:8000/metrics', timeout=5)
for line in response.text.splitlines():
    if 'chaos_' in line and not line.startswith('#'):
        print(line)
Getting Help¶
Before Opening an Issue¶
- Check logs:
kubectl logs -n chaos-demo <pod-name> -c chaos-agent > agent.log
kubectl describe pod -n chaos-demo <pod-name> > pod-describe.txt
- Collect configuration:
kubectl get configmap -n chaos-demo chaos-config -o yaml > config.yaml
kubectl get pod -n chaos-demo <pod-name> -o yaml > pod.yaml
- Get metrics:
curl localhost:8000/metrics > metrics.txt
- Document steps:
- What you tried
- What you expected
- What actually happened
- Environment details (Kubernetes version, cloud provider)
Issue Template¶
## Environment
- Kubernetes version:
- Cloud provider:
- Chaos agent version:
## Problem Description
[Clear description of the issue]
## Steps to Reproduce
1.
2.
3.
## Expected Behavior
[What should happen]
## Actual Behavior
[What actually happens]
## Logs
[Paste relevant logs]