Latest version of Netflix Architecture presentation, variants presented several times during October 2012

  • 1. Globally Distributed Cloud Applications at Netflix, October 2012. Adrian Cockcroft @adrianco #netflixcloud http://
  • 2. Adrian Cockcroft
      • Director, Architecture for Cloud Systems, Netflix Inc.
          – Previously Director for Personalization Platform
      • Distinguished Availability Engineer, eBay Inc. 2004-7
          – Founding member of eBay Research Labs
      • Distinguished Engineer, Sun Microsystems Inc. 1988-2004
          – 2003-4 Chief Architect High Performance Technical Computing
          – 2001 Author: Capacity Planning for Web Services
          – 1999 Author: Resource Management
          – 1995 & 1998 Author: Sun Performance and Tuning
          – 1996 Japanese Edition of Sun Performance and Tuning
              • SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)
      • More
          – Twitter @adrianco
          – Blog http://
          – Presentations at http://
  • 3. The Netflix Streaming Service: Now in USA, Canada, Latin America, UK, Ireland, Sweden, Denmark, Norway and Finland
  • 4. US Non-Member Web Site: Advertising and Marketing Driven
  • 5. Member Web Site: Personalization Driven
  • 6. Streaming Device API: Netflix Ready Devices, From: May 2008 To: May 2010
  • 7. Content Delivery Service: Distributed storage nodes controlled by Netflix cloud services
  • 8. Abstract
      • Netflix on Cloud – What, Why and When
      • Globally Distributed Architecture
      • Open Source Components
  • 9. Why Use Cloud?
  • 10. Things we don’t do
  • 11. What Netflix Did
      • Moved to SaaS
          – Corporate IT – OneLogin, Workday, Box, Evernote…
          – Tools – Pagerduty, AppDynamics, EMR (Hadoop)
      • Built our own PaaS
          – Customized to make our developers productive
          – Large scale, global, highly available, leveraging AWS
      • Moved incremental capacity to IaaS
          – No new datacenter space since 2008 as we grew
          – Moved our streaming apps to the cloud
  • 12. Keeping up with Developer Trends (year first in production at Netflix)
      • Big Data/Hadoop: 2009
      • AWS Cloud: 2009
      • Application Performance Management: 2010
      • Integrated DevOps Practices: 2010
      • Continuous Integration/Delivery: 2010
      • NoSQL: 2010
      • Platform as a Service; Fine grain SOA: 2010
      • Social coding, open development/github: 2011
  • 13. AWS specific feature dependence…
  • 14. Portability vs. Functionality
      • Portability – the Operations focus
          – Avoid vendor lock-in
          – Support datacenter based use cases
          – Possible operations cost savings
      • Functionality – the Developer focus
          – Less complex test and debug, one mature supplier
          – Faster time to market for your products
          – Possible developer time/cost savings
  • 15. Functional PaaS
      • IaaS base - all the features of AWS
          – Very large scale, mature, global, evolving rapidly
          – ELB, Autoscale, VPC, SQS, EIP, EMR, etc, etc.
          – E.g. Large files (TB) and multipart writes in S3
      • Functional PaaS – Netflix added features
          – Continuous build/deploy, SOA, HA patterns
          – Asgard console, Monkeys, Big data tools
          – Cassandra/Zookeeper data store automation
  • 16. How Netflix Works [Diagram labels: Customer Device (PC, PS3, TV…); Consumer Electronics; AWS Cloud Services; Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; CDN Management and Steering; Content Encoding; CDN Edge Locations; OpenConnect CDN Boxes]
  • 17. Component Services (Simplified view using AppDynamics)
  • 18. Web Server Dependencies Flow (Home page business transaction as seen by AppDynamics) [Diagram labels: Start Here; Web service; Cassandra; memcached; S3 bucket]
  • 19. One Request Snapshot (captured because it was unusually slow)
  • 20. Current Architectural Patterns for Availability
      • Isolated Services – Resilient Business logic
      • Three Balanced Availability Zones – Resilient to Infrastructure outage
      • Triple Replicated Persistence – Durable distributed Storage
      • Isolated Regions – US and EU don’t take each other down
  • 21. Isolated Services: Test With Chaos Monkey, Latency Monkey
  • 22. Three Balanced Availability Zones: Test with Chaos Gorilla [Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each zone holding Cassandra and Evcache Replicas]
  • 23. Triple Replicated Persistence: Cassandra maintenance affects individual replicas [Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each zone holding Cassandra and Evcache Replicas]
  • 24. Isolated Regions [Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of Zone A, Zone B and Zone C, each zone holding Cassandra Replicas]
  • 25. Failure Modes and Effects
      Failure Mode | Probability | Mitigation Plan
      Application Failure | High | Automatic degraded response
      AWS Region Failure | Low | Wait for region to recover
      AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
      Datacenter Failure | Medium | Migrate more functions to cloud
      Data store failure | Low | Restore from S3 backups
      S3 failure | Low | Restore from remote archive
  • 26. Netflix Deployed on AWS
      • 2009 Content: Content Management; EC2 Encoding; S3 Petabytes
      • 2009 Logs: S3 Terabytes; EMR; Hive & Pig; Business Intelligence
      • 2010 Play: DRM; CDN routing; Bookmarks; Logging
      • 2010 WWW: Sign-Up; Search Solr; Movie Choosing; Ratings
      • 2010 API: Metadata; Device Config; TV Movie Choosing; Social Facebook
      • 2011 CS: International CS lookup; Diagnostics & Actions; Customer Call Log; CS Analytics
      • Also shown: CDNs; ISPs; Terabits; Customers
  • 27. Cloud Architecture Patterns: Where do we start?
  • 28. Datacenter to Cloud Transition Goals
      • Faster
          – Lower latency than the equivalent datacenter web pages and API calls
          – Measured as mean and 99th percentile
          – For both first hit (e.g. home page) and in-session hits for the same user
      • Scalable
          – Avoid needing any more datacenter capacity as subscriber count increases
          – No central vertically scaled databases
          – Leverage AWS elastic capacity effectively
      • Available
          – Substantially higher robustness and availability than datacenter services
          – Leverage multiple AWS availability zones
          – No scheduled down time, no central database schema to change
      • Productive
          – Optimize agility of a large development team with automation and tools
          – Leave behind complex tangled datacenter code base (~8 year old architecture)
          – Enforce clean layered interfaces and re-usable components
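The "Faster" goal on slide 28 is measured as mean and 99th percentile latency. A minimal sketch of those two measurements follows; the sample latencies and the nearest-rank percentile method are assumptions chosen for illustration, not figures or tooling from the talk.

```python
import math


def mean(samples):
    """Arithmetic mean of a non-empty list of latencies."""
    return sum(samples) / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds (not Netflix data).
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 15]

summary = {
    "mean": mean(latencies_ms),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

One slow request barely moves the median but dominates the 99th percentile, which is presumably why both figures are tracked.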
  • 29. Netflix Datacenter vs. Cloud Arch
      Datacenter | Cloud
      Central SQL Database | Distributed Key/Value NoSQL
      Sticky In-Memory Session | Shared Memcached Session
      Chatty Protocols | Latency Tolerant Protocols
      Tangled Service Interfaces | Layered Service Interfaces
      Instrumented Code | Instrumented Service Patterns
      Fat Complex Objects | Lightweight Serializable Objects
      Components as Jar Files | Components as Services
  • 30. Cassandra on AWS: A highly available and durable deployment pattern
  • 31. Cassandra Service Pattern [Diagram labels: Service REST Clients; Data Access REST Service; Astyanax Cassandra Client; Cassandra Cluster Managed by Priam, Between 6 and 72 nodes; Datacenter Update Flow; Appdynamics Service Flow Visualization]
  • 32. Production Deployment: Totally Denormalized Data Model
      • Over 50 Cassandra Clusters
      • Over 500 nodes
      • Over 30TB of daily backups
      • Biggest cluster 72 nodes
      • 1 cluster over 250K writes/s
  • 33. Astyanax - Cassandra Write Data Flows: Single Region, Multiple Availability Zone, Token Aware
      1. Client writes to local coordinator
      2. Coordinator writes to other zones
      3. Nodes return ack
      4. Data written to internal commit log disks (no more than 10 seconds later)
      • If a node goes offline, hinted handoff completes the write when the node comes back up.
      • Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
      • SSTable disk writes and compactions occur asynchronously.
      [Diagram: Token Aware Clients writing to Cassandra nodes, each with Disks, in Zone A, Zone B and Zone C]
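Slide 33 says a request can wait for one node, a quorum, or all nodes to acknowledge a write. The sketch below works out how many acks each choice needs and whether a write can still succeed with a replica down. The level names ONE, QUORUM and ALL are Cassandra's; the function names and the up/down replica list are hypothetical helpers, and this is an illustration of the rule rather than Astyanax client code.

```python
def acks_required(consistency_level, replication_factor):
    """Number of replica acks a write must collect before it returns."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of replicas
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {consistency_level}")


def write_succeeds(consistency_level, replica_up):
    """replica_up holds one boolean per replica (True = node reachable)."""
    available_acks = sum(replica_up)
    return available_acks >= acks_required(consistency_level, len(replica_up))


# Replication factor 3, one replica per zone, Zone C offline.
one_zone_down = [True, True, False]
results = {
    level: write_succeeds(level, one_zone_down)
    for level in ("ONE", "QUORUM", "ALL")
}
print(results)
```

With one of three replicas down, ONE and QUORUM writes still succeed and only ALL fails; hinted handoff then brings the missing replica up to date when it returns.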
  • 34. Data Flows for Multi-Region Writes: Token Aware, Consistency Level = Local Quorum
      1. Client writes to local replicas
      2. Local write acks returned to Client, which continues when 2 of 3 local nodes are committed
      3. Local coordinator writes to remote coordinator
      4. When data arrives, remote coordinator node acks and copies to other remote zones
      5. Remote nodes ack to local coordinator
      6. Data flushed to internal commit log disks (no more than 10 seconds later)
      • If a node or region goes offline, hinted handoff completes the write when the node comes back up.
      • Nightly global compare and repair jobs ensure everything stays consistent.
      [Diagram: US Clients and EU Clients, each region with Cassandra nodes and Disks in Zone A, Zone B and Zone C; the link between regions is labeled 100+ms latency]
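Under Local Quorum (slide 34) the client only waits for 2 of 3 local replicas, so the 100+ms cross-region hop does not add to the latency the client sees. A rough model is sketched below; the per-replica ack times are invented for illustration and the function names are hypothetical.

```python
def local_quorum(local_replicas):
    """Majority of the replicas in the client's own region."""
    return local_replicas // 2 + 1


def client_write_latency_ms(local_ack_ms):
    """The client unblocks once the quorum-th fastest local ack arrives."""
    needed = local_quorum(len(local_ack_ms))
    return sorted(local_ack_ms)[needed - 1]


def fully_replicated_ms(local_ack_ms, remote_ack_ms):
    """Time until every replica in both regions has acknowledged."""
    return max(local_ack_ms + remote_ack_ms)


# Hypothetical ack times in milliseconds, measured from the client's write.
us_acks = [2, 3, 9]          # three local zones
eu_acks = [105, 108, 112]    # remote region, includes the 100+ms hop

seen_by_client = client_write_latency_ms(us_acks)
seen_globally = fully_replicated_ms(us_acks, eu_acks)
print(seen_by_client, seen_globally)
```

The gap between the two numbers is the window in which the regions can disagree, which the hinted handoff and nightly repair jobs on the slide are there to close.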
  • 35. ETL for Cassandra
      • Data is de-normalized over many clusters!
      • Too many to restore from backups for ETL
      • Solution – read backup files using Hadoop
      • Aegisthus
          – http://-bulk-data-pipeline-out-of.html
          – High throughput raw SSTable processing
          – Re-normalizes many clusters to a consistent view
          – Extract, Transform, then Load into Teradata
  • 36. Benchmarks and Scalability
  • 37. Cloud Deployment Scalability
      • New Autoscaled AMI – zero to 500 instances from 21:38:52 - 21:46:32, 7m40s
      • Scaled up and down over a few days, total 2176 instance launches, m2.2xlarge (4 core 34GB)
      Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
      41.0 | 104.2 | 149.0 | 171.8 | 215.8 | 562.0
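The timing on slide 37 can be checked with a few lines of arithmetic. This sketch uses only the two timestamps and the instance count quoted on the slide; the variable names are illustrative.

```python
from datetime import datetime

# Timestamps quoted on the slide for the zero-to-500 scale-up.
start = datetime.strptime("21:38:52", "%H:%M:%S")
end = datetime.strptime("21:46:32", "%H:%M:%S")

elapsed_s = (end - start).total_seconds()
minutes, seconds = divmod(int(elapsed_s), 60)
instances_per_second = 500 / elapsed_s

print(f"{minutes}m{seconds}s, {instances_per_second:.2f} instances/s")
```

The elapsed time agrees with the 7m40s on the slide and works out to a little over one instance launched per second.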
  • 38. Scalability from 48 to 288 nodes on AWS
      http://-cassandra-scalability-on.html
      [Chart: Client Writes/s by node count – Replication Factor = 3; data labels 174373, 366828, 537172 and 1099837; x-axis 0 to 350, y-axis 0 to 1200000]
      • Used 288 of m1.xlarge: 4 CPU, 15 GB RAM, 8 ECU
      • Cassandra 0.86
      • Benchmark config only existed for about 1hr
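One way to read the chart on slide 38 is to divide each throughput figure by its node count. The four writes/s values are the data labels from the slide; the slide text only gives the 48 and 288 endpoints, so the intermediate node counts of 96 and 144 used here are an assumption.

```python
# Data labels from the slide, paired with node counts.
# 48 and 288 come from the slide title; 96 and 144 are ASSUMED.
writes_per_second = {
    48: 174373,
    96: 366828,
    144: 537172,
    288: 1099837,
}

per_node = {nodes: total / nodes for nodes, total in writes_per_second.items()}

# Scaling efficiency: throughput growth divided by node-count growth.
# A value of 1.0 would be perfectly linear scaling.
efficiency = (writes_per_second[288] / writes_per_second[48]) / (288 / 48)

for nodes, rate in per_node.items():
    print(f"{nodes:>3} nodes: {rate:,.0f} writes/s per node")
print(f"48 -> 288 scaling efficiency: {efficiency:.2f}")
```

Under that assumption every cluster size lands between roughly 3,600 and 3,900 writes/s per node, which is what near-linear scaling looks like.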
  • 39. Cassandra on AWS
      Attribute | The Past | The Future
      Instance | m2.4xlarge | hi1.4xlarge
      Storage | 2 drives, 1.7TB | 2 SSD volumes, 2TB
      CPU | 8 Cores, 26 ECU | 8 HT cores, 35 ECU
      RAM | 68GB | 64GB
      Network | 1Gbit | 10Gbit
      IOPS | ~500 | ~100,000
      Throughput | ~100Mbyte/s | ~1Gbyte/s
      Cost | $1.80/hr | $3.10/hr
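The table on slide 39 can be reduced to a price/performance comparison. The sketch below uses only the approximate IOPS and hourly cost figures from the slide; the "IOPS per dollar-hour" metric is an illustrative choice, not one the talk uses.

```python
# Approximate figures as quoted on the slide.
instances = {
    "m2.4xlarge": {"iops": 500, "cost_per_hr": 1.80},
    "hi1.4xlarge": {"iops": 100_000, "cost_per_hr": 3.10},
}

iops_per_dollar = {
    name: spec["iops"] / spec["cost_per_hr"] for name, spec in instances.items()
}

cost_ratio = (
    instances["hi1.4xlarge"]["cost_per_hr"] / instances["m2.4xlarge"]["cost_per_hr"]
)
iops_gain = iops_per_dollar["hi1.4xlarge"] / iops_per_dollar["m2.4xlarge"]

print(f"hourly cost ratio: {cost_ratio:.2f}x")
print(f"IOPS per dollar-hour gain: {iops_gain:.0f}x")
```

The SSD instance costs under twice as much per hour but offers over a hundred times the IOPS per dollar, so an I/O-bound cluster can plausibly be served by fewer nodes.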
  • 40. Cassandra Disk vs. SSD Benchmark: Same Throughput, Lower Latency, Half Cost
  • 41. Availability and Resilience
  • 42. Chaos Monkey
      http://-monkey-released-into-wild.html
      • Computers (Datacenter or AWS) randomly die
          – Fact of life, but too infrequent to test resiliency
      • Test to make sure systems are resilient
          – Allow any instance to fail without customer impact
      • Chaos Monkey hours
          – Monday-Friday 9am-3pm random instance kill
      • Application configuration option
          – Apps now have to opt-out from Chaos Monkey
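The rules on slide 42 (weekday 9am-3pm window, random instance, opt-out per application) can be sketched as a small selection function. This is not the released Chaos Monkey code: the function names, the data layout and the instance IDs are hypothetical, and the sketch only chooses a victim without terminating anything.

```python
import random
from datetime import datetime


def in_chaos_hours(now):
    """True on Monday-Friday from 9am up to (not including) 3pm."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and 9 <= now.hour < 15


def pick_victim(instances_by_app, opted_out, now, rng=random):
    """Return a random (app, instance) pair to kill, or None if no kill is due."""
    if not in_chaos_hours(now):
        return None
    candidates = [
        (app, instance)
        for app, app_instances in sorted(instances_by_app.items())
        if app not in opted_out  # apps are included unless they opt out
        for instance in app_instances
    ]
    return rng.choice(candidates) if candidates else None


# Hypothetical fleet; instance IDs are made up.
fleet = {"api": ["i-0001", "i-0002"], "billing": ["i-0003"]}
monday_10am = datetime(2012, 10, 15, 10, 0)
victim = pick_victim(fleet, opted_out={"billing"}, now=monday_10am,
                     rng=random.Random(0))
print(victim)
```

Restricting kills to working hours means engineers are present when an instance dies, and making opt-out the default keeps coverage high.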
  • 43. Responsibility and Experience
      • Make developers responsible for failures
          – Then they learn and write code that doesn’t fail
      • Use Incident Reviews to find gaps to fix
          – Make sure it’s not about finding “who to blame”
      • Keep timeouts short, fail fast
          – Don’t let cascading timeouts stack up
      • Make configuration options dynamic
          – You don’t want to push code to tweak an option
  • 44. Resilient Design – Circuit Breakers
      http://-tolerance-in-high-volume.html
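Slide 44 names the circuit breaker pattern without showing it. Below is a generic, minimal sketch of that pattern, not Netflix's implementation: the class, its parameters (`failure_threshold`, `reset_after_s`) and the fallback convention are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: skip the dependency entirely
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None       # success closes the circuit
        return result


# Usage with a fake clock and a dependency that is down.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0,
                         clock=lambda: now[0])
attempts = []


def flaky_dependency():
    attempts.append(1)
    raise RuntimeError("dependency down")


for _ in range(3):
    breaker.call(flaky_dependency, lambda: "degraded response")
print(len(attempts))  # the third call never reached the dependency
```

Returning a fallback instead of raising matches the "automatic degraded response" mitigation listed for application failures on slide 25.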
  • 45. Distributed Operational Model
      • Developers
          – Provision and run their own code in production
          – Take turns to be on call if it breaks (pagerduty)
          – Configure autoscalers to handle capacity needs
      • DevOps and PaaS (aka NoOps)
          – DevOps is used to build and run the PaaS
          – PaaS constrains Dev to use automation instead
          – PaaS puts more responsibility on Dev, with tools
  • 46. Culture
  • 47. Unconventional Culture (See culture deck at http://)
      • Brave/Aggressive from the top down
      • Focus on talent density above everything
      • Reduce process, remove complexity
      • Freedom and Responsibility
      • One product focus for the whole company
      • (almost) full information sharing across co.
      • Simplified manager’s role
  • 48. Manager’s Role
      • Hiring, Architecture, Project Management
      • No vacation policy to track
      • (Almost) no remote employees or contractors
      • No bonuses to allocate
      • No expenses to approve
      • Pay mark to market handled at VP level
  • 49. Netflix Organization: DevOps Org Reporting into Product Group, not ITops
      • CEO – Reed Hastings
      • CPO – Chief Product Officer – Neil Hunt
      • VP - Cloud and Platform Engineering - Yury
      • Teams: Platform and Persistence Engineering; Cloud Ops Reliability Engineering; Personalization Platform and Performance Eng; Membership and Billing; Data Science Platform; Architecture; Cloud Solutions
      • Responsibilities and tools shown under the teams: Future planning; Base Platform; Monitoring; Metadata; Alert Routing; Data sources; Business Intelligence; Security Arch; Zookeeper; Monkeys; Benchmarking; Incident Lifecycle; Vault processing; Efficiency; Cassandra Ops; Build Tools; Memcached; AWS VPC; AWS Instances; Hyperguard; PagerDuty; Cassandra; Hadoop on EMR; AWS API; Powerpoint :)
  • 50. Build Your Own PaaS
  • 51. Components
      • Continuous build framework turns code into AMIs
      • AWS accounts for test, production, etc.
      • Cloud access gateway
      • Service registry
      • Configuration properties service
      • Persistence services
      • Monitoring, alert forwarding
      • Backups, archives
  • 52. Netflix Open Source Strategy
      • Release PaaS Components git-by-git
          – Source at – we build from it…
          – Intros and techniques at
          – Blog post or new code every few weeks
      • Motivations
          – Give back to Apache licensed OSS community
          – Motivate, retain, hire top engineers
          – “Peer pressure” code cleanup, external contributions
  • 53. Instance creation [Diagram labels: Bakery & Build tools; Base AMI; Application Code; Asgard; Autoscaling scripts; Odin; Instance. Stages: Image baked; ASG / Instance started; Instance Running]
  • 54. Application Launch [Diagram labels: Governator (Guice); Eureka; Archaius; Async logging; Entrypoints; Servo. Stages: Application initializing; Registering, configuration]
  • 55. Runtime [Diagram labels: Astyanax; Priam; Curator; Exhibitor; NIWS LB; REST client; Cass JMeter; Explorers; Dependency Command; Chaos Monkey; Latency Monkey; Janitor Monkey. Groups: Calling other services; Managing service; Resiliency aids]
  • 56. Open Source Projects
      • Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
      • Priam: Cassandra as a Service
      • Exhibitor: Zookeeper as a Service
      • Servo and Autoscaling Scripts
      • Astyanax: Cassandra client for Java
      • Curator: Zookeeper Patterns
      • Honu: Log4j streaming to Hadoop
      • CassJMeter: Cassandra test suite
      • EVCache: Memcached as a Service
      • Circuit Breaker: Robust service pattern
      • Cassandra Multi-region EC2 datastore support
      • Eureka / Discovery: Service Directory
      • Asgard: AutoScaleGroup based AWS console
      • Aegisthus: Hadoop ETL for Cassandra
      • Archaius: Dynamics Properties Service
      • Chaos Monkey: Robustness verification
      • Latency Monkey: Server-side latency/error injection
      • Governator: Library lifecycle and dependency injection
      • Odin: Workflow orchestration
      • Also listed: Explorers; EntryPoints; Janitor Monkey; REST Client + mid-tier LB; Bakeries and AMI; Async logging; Configuration REST endpoints; Build dynaslaves
  • 57. Roadmap for 2012
      • More resiliency and improved availability
      • More automation, orchestration
      • “Hardening” the platform, code clean-up
      • Lower latency for web services and devices
      • IPv6 – now running in prod, rollout in process
      • More open sourced components
      • See you at AWS Re:Invent in November…
  • 58. Takeaway
      Netflix has built and deployed a scalable global Platform as a Service.
      Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
      http://
      http://
      http://
      http://
      @adrianco #netflixcloud
  • 59. Amazon Cloud Terminology Reference
      See
      This is not a full list of Amazon Web Service features
      • AWS – Amazon Web Services (common name for Amazon cloud)
      • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
      • EC2 – Elastic Compute Cloud
          – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
          – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
          – Reserved Instances – pre-paid to reduce cost for long term usage
          – Availability Zone – datacenter with own power and cooling hosting cloud instances
          – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
      • ASG – Auto Scaling Group (instances booting from the same AMI)
      • S3 – Simple Storage Service (http access)
      • EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
      • RDS – Relational Database Service (managed MySQL master and slaves)
      • DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
      • SQS – Simple Queue Service (http based message queue)
      • SNS – Simple Notification Service (http and email based topics and messages)
      • EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
      • ELB – Elastic Load Balancer
      • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
      • VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
      • DirectConnect – secure pipe from AWS VPC to external datacenter
      • IAM – Identity and Access Management (fine grain role based security keys)