Chaos engineering is a powerful tool for building resilient systems, but it must be used responsibly.

Core Principles

Only Test Systems You Own

Critical

Only use Py-Chaos-Agent on systems you own or have explicit written permission to test. Unauthorized chaos testing is unethical and potentially illegal.

Never Use in Production Without Approval

Production chaos engineering requires:

  • Formal approval from leadership and stakeholders
  • Comprehensive monitoring and alerting
  • Immediate rollback procedures
  • On-call team availability
  • Clear runbook documentation
  • Gradual rollout strategy

Start Small and Controlled

Begin your chaos journey with:

  1. Dry run mode - Test configuration without injection
  2. Development environments - Low-risk testing grounds
  3. Low probability - Start with 5-10% injection rates
  4. Single failure modes - Test one thing at a time
  5. Short durations - Keep failures brief initially

Safety Features

Built-in Protections

Py-Chaos-Agent includes several safety mechanisms:

  • Self-protection: Won't terminate its own process
  • Dry run mode: Test without actual injection
  • Configurable probabilities: Control injection frequency
  • Duration limits: Automatic cleanup after timeout
  • Metrics and logging: Full observability

Configuration Safety

agent:
  interval_seconds: 10 # Don't set too low
  dry_run: true # Start with dry run

failures:
  cpu:
    probability: 0.1 # Start with low probability
    duration_seconds: 5 # Keep durations short

Ethical Guidelines

Minimize Harm

  • Test during low-traffic periods
  • Have rollback plans ready
  • Monitor system health continuously
  • Stop immediately if issues arise
  • Notify relevant teams before testing

Respect User Impact

Remember that chaos testing can affect:

  • User experience and satisfaction
  • Service availability
  • System performance
  • Team on-call burden

Balance testing goals with user needs.

Transparent Communication

  • Document all chaos experiments
  • Notify stakeholders before testing
  • Share results and learnings
  • Be honest about failures and incidents

Best Practices

Pre-Testing Checklist

Before running chaos experiments:

  • Have explicit permission to test
  • Verify monitoring and alerting works
  • Document expected behavior
  • Prepare rollback procedures
  • Notify relevant teams
  • Set up logging and metrics collection
  • Start with dry run mode
  • Test during appropriate hours

During Testing

  • Monitor metrics continuously
  • Watch for unexpected behavior
  • Be ready to stop immediately
  • Document all observations
  • Keep stakeholders informed

Post-Testing

  • Analyze results thoroughly
  • Document lessons learned
  • Share findings with team
  • Update runbooks and documentation
  • Plan improvements based on findings

Risk Management

Assess Before Testing

Consider these factors:

  • System criticality: How important is the system?
  • User impact: How many users will be affected?
  • Recovery time: How quickly can we recover?
  • Team availability: Is support available?
  • Monitoring coverage: Can we detect issues?

Gradual Rollout

Increase chaos intensity gradually:

  1. Week 1: Dry run mode, verify configuration
  2. Week 2: Enable one failure mode at 10% probability
  3. Week 3: Increase to 20-30% if stable
  4. Week 4: Add second failure mode
  5. Month 2: Fine-tune based on learnings

Know When to Stop

Stop chaos testing immediately if:

  • Customer-impacting incidents occur
  • System behavior becomes unpredictable
  • Monitoring or alerting fails
  • Team is overwhelmed
  • Stakeholders request a halt

Terms of Service

Ensure chaos testing doesn't violate:

  • Cloud provider terms of service
  • Service level agreements (SLAs)
  • Compliance requirements
  • Regulatory constraints

Liability

  • Document all testing activities
  • Maintain audit logs
  • Follow organizational policies
  • Consult legal team for production testing

Getting Help

If you have questions about safe chaos engineering:

Report Issues

If you discover a safety issue with Py-Chaos-Agent:

  • Open a security issue on GitHub
  • Contact maintainers directly
  • Don't disclose publicly until fixed

Remember: The goal of chaos engineering is to build better systems, not to cause harm. Always prioritize safety, ethics, and user well-being.