Monday, September 24, 2012

Some good articles about testing and monitoring

• Expect failure at any time
• Automation is key
• Dashboards are essential

• Backups are good only if you can restore them
• If it's not monitored, it's not in production
• If a protocol has an acronym, you need to learn it
• The most important skill you need to master is problem solving
• You need at least 2 of everything in production
• Keep your systems secure
• Logging is your best friend
• You need to know a scripting language
• Document everything
• Always try to be a leader
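
The first point in the list above, that a backup only counts if it can be restored, is easy to automate. Here is a minimal sketch, assuming the backup is a gzip-compressed tar archive of a single directory; the function names (`file_digests`, `backup`, `verify_restore`) are made up for illustration and do not come from any of the articles.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def backup(source: Path, archive: Path) -> None:
    """Write a gzip-compressed tar archive of the source directory."""
    # The archive must live outside the source directory.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def verify_restore(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source)
```

Running `verify_restore` right after every backup, and alerting when it returns `False`, turns "we have backups" into something that is actually checked. On a live system the source may change between backup and verification, so a real setup would probably compare against digests recorded at backup time instead.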

Step 1. Configure a good monitoring and alerting system
Step 2. Configure a good resource graphing system
Step 3. Dashboards, dashboards, dashboards
Step 4. Correlate errors with resource state and capacity
Step 5. Expect failures and recover quickly and gracefully
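
Steps 1 and 4 above can be sketched in a few lines. This is only an illustration under my own assumptions: the metric names, the thresholds, and the helpers `Check`, `run_checks` and `errors_under_pressure` are hypothetical, and a real setup would use a proper monitoring system rather than hand-rolled code. Treating a metric with no sample as critical is also my choice, not something the articles prescribe.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Check:
    metric: str      # name of the sampled metric, e.g. "disk_used_pct"
    warning: float   # value at or above which the check warns
    critical: float  # value at or above which the check is critical

    def evaluate(self, value: float) -> Status:
        if value >= self.critical:
            return Status.CRITICAL
        if value >= self.warning:
            return Status.WARNING
        return Status.OK


def run_checks(checks, samples):
    """Step 1: return {metric: Status}; a metric with no sample is CRITICAL."""
    results = {}
    for check in checks:
        if check.metric not in samples:
            results[check.metric] = Status.CRITICAL
        else:
            results[check.metric] = check.evaluate(samples[check.metric])
    return results


def errors_under_pressure(error_counts, resource_samples, threshold):
    """Step 4: timestamps where errors occurred while the resource
    was at or above the threshold."""
    return sorted(
        ts
        for ts, errors in error_counts.items()
        if errors > 0 and resource_samples.get(ts, 0.0) >= threshold
    )
```

For example, `run_checks([Check("disk_used_pct", 80, 95)], {"disk_used_pct": 97.0})` reports the disk check as `Status.CRITICAL`, and `errors_under_pressure` narrows a list of error spikes down to the ones that coincide with a saturated resource.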

• Test-infected vs. monitoring-infected
• Adding tests vs. adding monitoring checks
• Ignoring broken tests vs. ignoring monitoring alerts
• Improving test coverage vs. improving monitoring coverage
• Measure and graph everything
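
"Measure and graph everything" can start very small. Below is a rough sketch of a timing decorator that records how long each call takes. The `measured` decorator and the in-memory `timings` store are hypothetical names of mine; in practice the samples would be shipped to whatever graphing system is in use rather than kept in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: function name -> list of durations in seconds.
timings = defaultdict(list)


def measured(func):
    """Record the wall-clock duration of every call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the duration even when the call raises.
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper
```

Decorating a function with `@measured` costs one line, much like adding one more test, which keeps the barrier to improving monitoring coverage low.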