This week I went to a talk on AWS architecture from Netflix’s chief architect, Adrian Cockcroft. The talk covered a broad range of topics in a good deal of technical detail (yeesss). Adrian’s blog on performance tools, cloud architecture and capacity planning is here and he’s worth following on Twitter: @adrianco
Here are details on the Netflix tech stack, take-home quotes and information, the Netflix monkeys and what Netflix doesn’t use AWS for:
Netflix tech stack:
- Netflix is completely in the cloud apart from one datacentre (I think that is used for backup). Internally, Netflix also use cloud apps like Workday, Box and Evernote.
- They use PagerDuty for alerting developers, AppDynamics to drill down into application performance, Hive and Pig for logging, and Solr to power search.
- They use memcached for caching.
- They use a Jenkins build server, and Cassandra, which they update every 2 months; this ensures no large jump in versions.
Take home quotes and information gems:
- Good developers are the most scarce resource – optimise for this.
- Adrian hinted at no-ops at Netflix but his blog post “Ops, DevOps and PaaS (NoOps) at Netflix” expands on this.
- When setting out an architecture, Netflix use rules for “anti-architecture”. There is no dictating “you must use X” to developers, just that your service shouldn’t be Y.
- Every developer ships a RESTful service, with two developers per service.
- Code to fail fast: a service will call another service, so it must cope quickly when that call fails.
- Developers use the concept of circuit breakers so that service developers know which part fails.
- Open sourcing code on GitHub is a great way to get developers to clean up code!
- A/B testing is baked into the Netflix application as a RESTful service.
- Netflix’s release process is to spin up new machines with the new code on the same load balancer as live, gradually introduce traffic and, if things go badly, just remove the servers.
- Incident reviews (RCA) are low stress and good things.
- Auto scaling on AWS is driven by business metrics.
- Some instance IOPS performance stats: Netflix get about 500 IOPS from an m2.4xlarge and over 100,000 IOPS from a hi1.4xlarge!!
- Image bakery (for AMIs) is how they build and deploy; this is going to be open sourced.
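To make the fail-fast and circuit-breaker bullets above a bit more concrete, here is a minimal sketch of the general pattern in Python. It is not Netflix's code: the class name, the thresholds and the `flaky_ratings_service` function are all made up for illustration. The idea is that after a few consecutive failures the breaker "opens" and refuses calls straight away, rather than letting every caller wait on a dependency that is known to be down.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused immediately."""


class CircuitBreaker:
    """Hypothetical breaker; names and defaults are illustrative assumptions."""

    def __init__(self, name, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.name = name
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the dependency at all.
                raise CircuitOpenError(f"circuit '{self.name}' is open")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_ratings_service():
    # Stand-in for a call to another service that is currently down.
    raise ConnectionError("ratings service unavailable")


breaker = CircuitBreaker("ratings", max_failures=2)
for _ in range(3):
    try:
        breaker.call(flaky_ratings_service)
    except CircuitOpenError as exc:
        print("failed fast:", exc)
    except ConnectionError as exc:
        print("dependency error:", exc)
```

The breaker carries a name, so the error tells the caller which dependency is failing, which is roughly the "know which part fails" point from the talk.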
Continuous testing using the Monkeys
- Testing is continuous and runs on live, using monkeys of different types depending on the test.
- Chaos Monkey kills processes randomly to ensure the application is stateless. This runs from 9am to 3pm, Monday to Friday, on live instances.
- Latency Monkey introduces, well, latency into the application to ensure the application is coded defensively to deal with it.
- Conformity Monkey uses the API to delete stuff that doesn’t conform.
- There is also a Chaos Gorilla which kills entire AWS zones! The first time Netflix did this it was really painful, but they learned, and the second time the zone migration was a lot smoother.
- Applications can opt out, but generally they learn to opt in to the chaos.
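As a rough illustration of the Chaos Monkey rules described above (weekdays only, 9am to 3pm, with an opt-out), here is a small Python sketch. It only chooses a target and never kills anything. The function names, the target names and the use of local time for the window are all my assumptions, not how Netflix's tool is actually written.

```python
import random
from datetime import datetime, time as dtime

# Assumed window from the talk: 9am to 3pm, Monday to Friday (local time).
WINDOW_START = dtime(9, 0)
WINDOW_END = dtime(15, 0)


def in_chaos_window(now):
    """True if `now` falls on a weekday between 09:00 and 15:00."""
    return now.weekday() < 5 and WINDOW_START <= now.time() < WINDOW_END


def pick_victim(targets, opted_out, now, rng=random):
    """Pick one random target to kill, or None if nothing is eligible.

    `targets` and `opted_out` are hypothetical collections of names.
    """
    if not in_chaos_window(now):
        return None
    candidates = [t for t in targets if t not in opted_out]
    if not candidates:
        return None
    return rng.choice(candidates)


# Example: a Wednesday morning, with one application opted out.
targets = ["api-1", "api-2", "billing-1"]
victim = pick_victim(targets, opted_out={"billing-1"}, now=datetime(2013, 3, 6, 10, 30))
print("would kill:", victim)
```

Keeping the selection separate from the actual kill makes the schedule and opt-out logic easy to test on its own.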
Netflix doesn’t use AWS for streaming
They roll out a custom-built 4U server with terabytes of storage, which they orchestrate from their AWS environment.
These 4U servers are Netflix rolling out its own CDN into ISPs, to save both Netflix and the ISPs bandwidth costs. Should CDNs be worried?
Footnote #1: last year it was reported that an AWS outage took out Netflix; it didn’t. They had a bug which, once fixed, meant that the zone migration worked.
Footnote #2: in the photo at the top, that’s me in the second row, red jacket, slightly hidden!