Courtesy of Adrian Colyer, who runs the kaleidoscopically illuminating The Morning Paper blog.
A checklist for designing and developing internet-scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."
The nub of the matter is summarised immediately:
- Does the design expect failures to happen regularly and handle them gracefully?
- Have we kept things as simple as possible?
- Have we automated everything?
Then some choice selections (text verbatim, some points excluded):
Overall Application Design & Development
- Can the service survive failure without human administrative interaction?
- Are failure paths frequently tested?
- Have we documented all conceivable component failure modes and combinations thereof?
- Does our design tolerate these failure modes? And if not, have we undertaken a risk assessment to determine the risk is acceptable?
- Have we avoided single points of failure?
Automatic Management and Provisioning
- Are all of our operations restartable?
- Is all persistent state stored redundantly?
- Have we automated provisioning and installation?
- Are configuration and code delivered by development in a single unit?
- Is the unit created by development used all the way through the lifecycle (test and prod. deployment)?
- Is there an audit log mechanism to capture all changes made in production?
- Have we eliminated any dependency on local storage for non-recoverable information?
- Is our deployment model as simple as it can possibly be? (Hard to beat file copy!)
- Are we using a chaos monkey in production?
Dependency Management
(How to handle dependencies on other services and components.)
- Can we tolerate highly variable latency in service calls? Do we have timeout mechanisms in place and can we retry interactions after a timeout (idempotency)?
- Are all retries reported, and have we bounded the number of retries?
- Do we have circuit breakers in place to prevent cascading failures? Do they 'fail fast'?
- Have we implemented inter-service monitoring and alerting?
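As a rough illustration of the retry and circuit-breaker items above, here is a minimal Python sketch. Everything in it is hypothetical (the names `CircuitBreaker`, `CircuitOpenError`, `call_with_retries`, and all thresholds and delays are assumptions for illustration, not taken from the paper or any library). It also assumes the wrapped call is idempotent and enforces its own timeout, since retrying is only safe under those conditions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the struggling dependency at all.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and let one trial call run.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(fn, max_attempts=3, base_delay=0.1, sleep=time.sleep, report=print):
    """Retry fn a bounded number of times with exponential backoff, reporting every failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying against an open breaker would defeat fail-fast
        except Exception as exc:
            report(f"attempt {attempt}/{max_attempts} failed: {exc!r}")
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

One way to combine the two is `call_with_retries(lambda: breaker.call(fetch))`, where `fetch` stands in for the remote call. In a real service the `report` callback would feed the monitoring system rather than `print`, so that retries cannot quietly hide a failing dependency.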
Release Cycle and Testing
- Are we shipping often enough?
- Have we defined specific criteria around the intended user experience? Are we continuously monitoring it?
- Are we collecting the actual numbers rather than just summary reports? Raw data will always be needed for diagnosis.
- Have we minimized false positives in the alerting system?
- Do we have a process in place to catch performance and capacity degradations in new releases?
- Are we running tests using real data?
- Do we have (and run) system-level acceptance tests?
Hardware Selection and Standardization
(I deviate from the Hamilton paper here, on the assumption that you'll use at least an IaaS layer).
- Do we depend only on standard IaaS compute, storage, and network facilities?
- Have we avoided dependencies on specific hardware features?
- Have we abstracted the network and naming? (For service discovery)
Operations and Capacity Planning
- Is there a devops team that takes shared responsibility for both developing and operating the service?
- Do we have a discipline of only making one change at a time?
- Is everything that might need to be configured or tuned in production able to be changed without a code change?
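One simple way to meet the last item above is to read tunables from the environment at startup. The sketch below is a minimal Python example of that idea. The setting names (`REQUEST_TIMEOUT_SECONDS`, `MAX_RETRIES`) and their defaults are invented for illustration, and a real service might use a configuration service or reloadable file instead.

```python
import os

# Hypothetical tunables with their defaults; the default's type drives parsing.
DEFAULTS = {
    "REQUEST_TIMEOUT_SECONDS": 2.0,
    "MAX_RETRIES": 3,
}


def load_tunables(environ=os.environ):
    """Return the tunables, letting environment variables override the defaults."""
    tunables = {}
    for name, default in DEFAULTS.items():
        raw = environ.get(name)
        # Convert the raw string to the same type as the default (float or int here).
        tunables[name] = type(default)(raw) if raw is not None else default
    return tunables
```

With this in place, an operator can change `MAX_RETRIES` by setting an environment variable and restarting the process, with no code change or rebuild.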
Auditing, Monitoring, and Alerting
- Are we tracking the alerts:trouble-ticket ratio (goal is near 1:1)?
- Are we tracking the number of system health issues that don't have corresponding alerts (goal is near zero)?
- Have we instrumented every customer interaction that flows through the system? Are we reporting anomalies?
- Do we have automated testing that takes a customer view of the service?
- Do we have individual accounts for everyone who interacts with the system?
- Are we tracking all fault-tolerant mechanisms to expose failures they may be hiding?
- Do we have sufficient assertions in the code base?
- Are we keeping historical performance and log data?
- Are we exposing suitable health information for monitoring?
- Do our problem reports contain enough information to diagnose the problem?
- Can we snapshot system state for debugging outside of production?
- Are we recording all significant system actions, both commands sent by users and what the system does internally?
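The item above about fault-tolerant mechanisms hiding failures can be made concrete with a small sketch: every time a fallback masks an error, count it, and surface the counts in the health information. This is only an illustration in Python, and the names `with_fallback`, `masked_failures`, and `health_snapshot` are hypothetical.

```python
from collections import Counter

# Counts how often each named fallback has masked a primary failure.
masked_failures = Counter()


def with_fallback(name, primary, fallback):
    """Run primary(); on any exception, record the masked failure and use fallback()."""
    try:
        return primary()
    except Exception:
        masked_failures[name] += 1
        return fallback()


def health_snapshot():
    """Expose masked failures so monitoring can see problems that users never notice."""
    return {
        "status": "degraded" if masked_failures else "ok",
        "masked_failures": dict(masked_failures),
    }
```

The point of the design is that the fallback still protects the customer, but the failure it absorbed stays visible to whoever is watching the health data.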
Graceful Degradation and Admission Control
- Can we meter admission to slowly bring a system back up after a failure?
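Metered admission can be sketched as a controller that admits only a fraction of requests right after recovery and ramps that fraction up to 100% over time. The Python below is a minimal sketch under assumptions of my own: the class name `RampUpAdmission`, the linear ramp, and the default 60-second window and 10% starting fraction are all illustrative choices, not something the paper prescribes.

```python
import random
import time


class RampUpAdmission:
    """Hypothetical controller that linearly ramps the admitted fraction after a recovery."""

    def __init__(self, ramp_seconds=60.0, start_fraction=0.1,
                 clock=time.monotonic, rng=random.random):
        self.ramp_seconds = ramp_seconds
        self.start_fraction = start_fraction
        self.clock = clock
        self.rng = rng
        self.recovered_at = None  # None means no ramp in progress: admit everything

    def mark_recovered(self):
        """Call when the service comes back up to start the ramp."""
        self.recovered_at = self.clock()

    def admitted_fraction(self):
        if self.recovered_at is None:
            return 1.0
        elapsed = self.clock() - self.recovered_at
        progress = min(1.0, elapsed / self.ramp_seconds)
        return self.start_fraction + (1.0 - self.start_fraction) * progress

    def admit(self):
        """Admit this request with probability equal to the current fraction."""
        return self.rng() < self.admitted_fraction()
```

Requests that are not admitted would be rejected quickly (for example with a "try again later" response), which gives caches and connection pools time to warm up before full load returns.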
Customer and Press Communication Plan
- Do we have a communications plan in place for issues such as wide-scale system unavailability, data loss or corruption, security breaches, privacy violations, etc.?