Safety & Ethics

Chaos engineering is a powerful tool for building resilient systems, but it must be used responsibly.

Core Principles¶

Only Test Systems You Own¶

Critical

Only use Py-Chaos-Agent on systems you own or have explicit written permission to test. Unauthorized chaos testing is unethical and potentially illegal.

Never Use in Production Without Approval¶

Production chaos engineering requires:

Formal approval from leadership and stakeholders
Comprehensive monitoring and alerting
Immediate rollback procedures
On-call team availability
Clear runbook documentation
Gradual rollout strategy

Start Small and Controlled¶

Begin your chaos journey with:

Dry run mode - Test configuration without injection
Development environments - Low-risk testing grounds
Low probability - Start with 5-10% injection rates
Single failure modes - Test one thing at a time
Short durations - Keep failures brief initially

Safety Features¶

Built-in Protections¶

Py-Chaos-Agent includes several safety mechanisms:

Self-protection: Won't terminate its own process
Dry run mode: Test without actual injection
Configurable probabilities: Control injection frequency
Duration limits: Automatic cleanup after timeout
Metrics and logging: Full observability

Configuration Safety¶

agent:
  interval_seconds: 10 # Don't set too low
  dry_run: true # Start with dry run

failures:
  cpu:
    probability: 0.1 # Start with low probability
    duration_seconds: 5 # Keep durations short

Ethical Guidelines¶

Minimize Harm¶

Test during low-traffic periods
Have rollback plans ready
Monitor system health continuously
Stop immediately if issues arise
Notify relevant teams before testing

Respect User Impact¶

Remember that chaos testing can affect:

User experience and satisfaction
Service availability
System performance
Team on-call burden

Balance testing goals with user needs.

Transparent Communication¶

Document all chaos experiments
Notify stakeholders before testing
Share results and learnings
Be honest about failures and incidents

Best Practices¶

Pre-Testing Checklist¶

Before running chaos experiments:

Have explicit permission to test
Verify monitoring and alerting works
Document expected behavior
Prepare rollback procedures
Notify relevant teams
Set up logging and metrics collection
Start with dry run mode
Test during appropriate hours

During Testing¶

Monitor metrics continuously
Watch for unexpected behavior
Be ready to stop immediately
Document all observations
Keep stakeholders informed

Post-Testing¶

Analyze results thoroughly
Document lessons learned
Share findings with team
Update runbooks and documentation
Plan improvements based on findings

Risk Management¶

Assess Before Testing¶

Consider these factors:

System criticality: How important is the system?
User impact: How many users will be affected?
Recovery time: How quickly can we recover?
Team availability: Is support available?
Monitoring coverage: Can we detect issues?

Gradual Rollout¶

Increase chaos intensity gradually:

Week 1: Dry run mode, verify configuration
Week 2: Enable one failure mode at 10% probability
Week 3: Increase to 20-30% if stable
Week 4: Add second failure mode
Month 2: Fine-tune based on learnings

Know When to Stop¶

Stop chaos testing immediately if:

Customer-impacting incidents occur
System behavior becomes unpredictable
Monitoring or alerting fails
Team is overwhelmed
Stakeholders request a halt

Legal Considerations¶

Terms of Service¶

Ensure chaos testing doesn't violate:

Cloud provider terms of service
Service level agreements (SLAs)
Compliance requirements
Regulatory constraints

Liability¶

Document all testing activities
Maintain audit logs
Follow organizational policies
Consult legal team for production testing

Getting Help¶

If you have questions about safe chaos engineering:

Review Principles of Chaos Engineering
Join chaos engineering communities
Consult with SRE/DevOps teams
Start small and learn incrementally

Report Issues¶

If you discover a safety issue with Py-Chaos-Agent:

Open a security issue on GitHub
Contact maintainers directly
Don't disclose publicly until fixed

Remember: The goal of chaos engineering is to build better systems, not to cause harm. Always prioritize safety, ethics, and user well-being.