Netflix AWS architecture – talk from Adrian Cockcroft

[Photo from the talk]

This week I went to a talk on AWS architecture from Netflix’s chief architect, Adrian Cockcroft. The talk covered a broad range of topics in a good deal of technical detail (yeesss). Adrian’s blog covers performance tools, cloud architecture and capacity planning, and he’s worth following on Twitter: @adrianco

Here are details on the Netflix tech stack, take-home quotes and information, the Netflix monkeys, and what Netflix doesn’t use AWS for:

Netflix tech stack:

  • Netflix is completely in the cloud apart from one datacentre (I think that is used for backup). Internally, Netflix use cloud apps like Workday, Box and Evernote.
  • PagerDuty for alerting developers, AppDynamics to drill down into application performance, Hive and Pig for logging, and Solr to power search.
  • They use memcached for caching.
  • They use a Jenkins build server, and Cassandra, which they update every two months to avoid large jumps between versions.

Take home quotes and information gems:

  • Good developers are the most scarce resource – optimise for this.
  • Adrian hinted at no-ops at Netflix, and his blog post “Ops, DevOps and PaaS (NoOps) at Netflix” expands on this.
  • When setting out an architecture, Netflix use rules for “anti-architecture”. Developers aren’t told “you must use X”, just that their service shouldn’t be Y.
  • Every developer ships a RESTful service, with two developers per service.
  • Code to fail fast, because a service will call another service.
  • Developers use the concept of circuit breakers, so that service developers know which part has failed.
  • Open sourcing code on GitHub is a great way to get developers to clean up code!
  • A/B testing is baked into the Netflix application as a RESTful service.
  • Netflix’s release process is to spin up new machines with the new code on the same load balancer as live, gradually introduce traffic, and if it goes badly just remove the servers.
  • Incident reviews (RCA) are low-stress and a good thing.
  • Auto scaling on AWS is driven by business metrics.
  • Some instance IOPS performance stats: Netflix get about 500 IOPS from an m2.4xlarge and over 100,000 IOPS from a hi1.4xlarge!!
  • Image bakery (for AMIs) is how they build and deploy; this is going to be open sourced.
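
To make the fail-fast and circuit breaker points a bit more concrete, here’s a minimal sketch of the general pattern in Python. This is not Netflix’s code: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `reset_after` are my own, and the thresholds are arbitrary. The idea is that after a few consecutive failures the caller stops calling the broken dependency for a while and fails immediately instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker (illustration only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: don't even try the dependency.
                raise CircuitOpenError("dependency marked as failing")
            # Cool-down has elapsed, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on too many failures, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the counter.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_recommendations, user_id)` where `fetch_recommendations` is a made-up function, and treat `CircuitOpenError` as the signal to return a fallback and to report which dependency is down.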

Continuous testing using the Monkeys

  • Testing is continuous and runs on live, using monkeys of different types depending on the test.
  • Chaos Monkey kills processes randomly to ensure the application is stateless. It runs from 9am to 3pm, Monday to Friday, on live instances.
  • Latency Monkey introduces, well, latency into the application to ensure it is coded defensively to deal with this.
  • Conformity Monkey uses the API to delete stuff that doesn’t conform.
  • There is also a Chaos Gorilla, which kills entire AWS zones! The first time Netflix did this it was really painful, but they learned, and the second time the zone migration was a lot smoother.
  • Applications can opt out, but generally they learn to opt in to the chaos.
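
As a rough illustration of the Chaos Monkey rules above (weekday 9am to 3pm window, opt-in and opt-out), here’s a toy victim picker in Python. It is only a sketch under my own assumptions: the function names, the `instances` dictionary and the instance names are all invented, and it only chooses a target rather than killing anything.

```python
import random
from datetime import datetime


def in_chaos_window(now):
    """True on Monday to Friday between 9am and 3pm, the window quoted in the talk."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victim(instances, now, rng=random):
    """Return the name of one opted-in instance to kill, or None.

    `instances` maps a (hypothetical) instance name to True if its
    application has opted in to the chaos, False if it has opted out.
    """
    if not in_chaos_window(now):
        return None
    candidates = sorted(name for name, opted_in in instances.items() if opted_in)
    if not candidates:
        return None
    return rng.choice(candidates)


if __name__ == "__main__":
    fleet = {"api-1": True, "api-2": True, "billing-1": False}
    print(pick_victim(fleet, datetime.now()))
```

Outside the window, or when every application has opted out, the function returns `None` and nothing gets touched.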

Netflix doesn’t use AWS for streaming

[Image: Netflix custom hardware beast]

They roll out a custom-built 4U server with terabytes of storage, which they orchestrate from their AWS environment.
With these 4U servers Netflix is rolling out its own CDN into ISPs, to save both Netflix and the ISPs bandwidth costs. Should CDNs be worried..?

Footnote #1: last year it was reported that an AWS outage took out Netflix. It didn’t. They had a bug which, once fixed, meant that the zone migration worked.

Footnote #2: in the photo at the top, that’s me in the second row, red jacket, slightly hidden!