Safety & Ethics
Chaos engineering is a powerful tool for building resilient systems, but it must be used responsibly.
Core Principles¶
Only Test Systems You Own¶
Critical
Only use Py-Chaos-Agent on systems you own or have explicit written permission to test. Unauthorized chaos testing is unethical and potentially illegal.
Never Use in Production Without Approval¶
Production chaos engineering requires:
- Formal approval from leadership and stakeholders
- Comprehensive monitoring and alerting
- Immediate rollback procedures
- On-call team availability
- Clear runbook documentation
- Gradual rollout strategy
Start Small and Controlled¶
Begin your chaos journey with:
- Dry run mode - Test configuration without injection
- Development environments - Low-risk testing grounds
- Low probability - Start with 5-10% injection rates
- Single failure modes - Test one thing at a time
- Short durations - Keep failures brief initially
Safety Features¶
Built-in Protections¶
Py-Chaos-Agent includes several safety mechanisms:
- Self-protection: Won't terminate its own process
- Dry run mode: Test without actual injection
- Configurable probabilities: Control injection frequency
- Duration limits: Automatic cleanup after timeout
- Metrics and logging: Full observability
Configuration Safety¶
agent:
interval_seconds: 10 # Don't set too low
dry_run: true # Start with dry run
failures:
cpu:
probability: 0.1 # Start with low probability
duration_seconds: 5 # Keep durations short
Ethical Guidelines¶
Minimize Harm¶
- Test during low-traffic periods
- Have rollback plans ready
- Monitor system health continuously
- Stop immediately if issues arise
- Notify relevant teams before testing
Respect User Impact¶
Remember that chaos testing can affect:
- User experience and satisfaction
- Service availability
- System performance
- Team on-call burden
Balance testing goals with user needs.
Transparent Communication¶
- Document all chaos experiments
- Notify stakeholders before testing
- Share results and learnings
- Be honest about failures and incidents
Best Practices¶
Pre-Testing Checklist¶
Before running chaos experiments:
- Have explicit permission to test
- Verify monitoring and alerting works
- Document expected behavior
- Prepare rollback procedures
- Notify relevant teams
- Set up logging and metrics collection
- Start with dry run mode
- Test during appropriate hours
During Testing¶
- Monitor metrics continuously
- Watch for unexpected behavior
- Be ready to stop immediately
- Document all observations
- Keep stakeholders informed
Post-Testing¶
- Analyze results thoroughly
- Document lessons learned
- Share findings with team
- Update runbooks and documentation
- Plan improvements based on findings
Risk Management¶
Assess Before Testing¶
Consider these factors:
- System criticality: How important is the system?
- User impact: How many users will be affected?
- Recovery time: How quickly can we recover?
- Team availability: Is support available?
- Monitoring coverage: Can we detect issues?
Gradual Rollout¶
Increase chaos intensity gradually:
- Week 1: Dry run mode, verify configuration
- Week 2: Enable one failure mode at 10% probability
- Week 3: Increase to 20-30% if stable
- Week 4: Add second failure mode
- Month 2: Fine-tune based on learnings
Know When to Stop¶
Stop chaos testing immediately if:
- Customer-impacting incidents occur
- System behavior becomes unpredictable
- Monitoring or alerting fails
- Team is overwhelmed
- Stakeholders request a halt
Legal Considerations¶
Terms of Service¶
Ensure chaos testing doesn't violate:
- Cloud provider terms of service
- Service level agreements (SLAs)
- Compliance requirements
- Regulatory constraints
Liability¶
- Document all testing activities
- Maintain audit logs
- Follow organizational policies
- Consult legal team for production testing
Getting Help¶
If you have questions about safe chaos engineering:
- Review Principles of Chaos Engineering
- Join chaos engineering communities
- Consult with SRE/DevOps teams
- Start small and learn incrementally
Report Issues¶
If you discover a safety issue with Py-Chaos-Agent:
- Open a security issue on GitHub
- Contact maintainers directly
- Don't disclose publicly until fixed
Remember: The goal of chaos engineering is to build better systems, not to cause harm. Always prioritize safety, ethics, and user well-being.