Video and slides synchronized; mp3 and slide download available at http://bit.ly/2l2Wrs8.
Tom Gianos and Dan Weeks discuss Netflix's overall big data platform architecture, focusing on storage and orchestration, and how they use Parquet on AWS S3 as their data warehouse storage layer. Filmed at qconsf.com.
Daniel Weeks manages the Big Data Compute team at Netflix and is responsible for integrating and enhancing open source big data processing technologies, including Spark, Presto, and Hadoop. Tom Gianos is a Senior Software Engineer, Big Data Platform at Netflix. He leads development of Genie and has a passion for merging web and big data technologies to solve interesting distributed systems problems.
Petabyte Scale Analytics Infrastructure @Netflix
1. Daniel C. Weeks
Tom Gianos
Netflix: Petabyte Scale Analytics Infrastructure in the Cloud
2. InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site worldwide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and minibooks
• Over 40 new content items per week
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/netflix-big-data-infrastructure
3. Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
4. Overview
● Data at Netflix
● Netflix Scale
● Platform Architecture
● Data Warehouse
● Genie
● Q&A
27. Parquet File Format
Column Oriented
● Store column data contiguously
● Improve compression
● Column projection
Strong Community Support
● Spark, Presto, Hive, Pig, Drill, Impala, etc.
● Works well with S3
28. Footer
(slide diagram: a Parquet file is a sequence of RowGroups followed by a Footer)
● Each RowGroup holds one Column Chunk per column; a Column Chunk is an optional Dict Page followed by Data Pages
● The Footer holds the file schema and version, RowGroup metadata (row count, size, etc.), and per-Column-Chunk metadata [encoding, size, min, max]
41. Problems the Netflix Data Platform Faces
• For Administrators
– Coordination of many moving parts
• ~15 clusters
• ~45 different client executables and versions for those clusters
– Heavy load
• ~45-50k jobs per day
– Hundreds of users with different problems
• For Users
– Don’t want to know details
– All clusters and client applications need to be available for use
– Need tools that make doing their jobs easy
43. An administrator wants a tool to…
• Simplify configuration management and deployment
• Minimize the impact of changes to users
• Track and respond to problems with the system quickly
• Scale client resources as load increases
44. Genie Configuration Data Model
• Metadata about a cluster
– [sched:sla, type:yarn, ver:2.7.1]
• Executable(s)
– [type:spark-submit, ver:1.6.0]
• Dependencies for an executable
(diagram: Cluster 1 → 0..* Command 1 → 0..* Application)
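As a minimal sketch, the Cluster → Command → Application model above could be written as plain dataclasses. Field names here are illustrative, not Genie's actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Application:
    """Binaries and dependencies backing an executable."""
    name: str
    dependencies: List[str] = field(default_factory=list)


@dataclass
class Command:
    """An executable, e.g. spark-submit, described by tags."""
    tags: Dict[str, str]
    applications: List[Application] = field(default_factory=list)  # 1 -> 0..*


@dataclass
class Cluster:
    """A physical cluster, described by tags like sched/type/ver."""
    tags: Dict[str, str]
    commands: List[Command] = field(default_factory=list)          # 1 -> 0..*


# The examples from the slide:
prod = Cluster(
    tags={"sched": "sla", "type": "yarn", "ver": "2.7.1"},
    commands=[Command(tags={"type": "spark-submit", "ver": "1.6.0"})],
)
```

The key design point is that everything is addressed by tags rather than by hostnames, which is what makes the cluster-swap and load-balancing tricks on the following slides possible.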
47. Updating a Cluster
• Start up a new cluster
• Register Cluster with Genie
• Run tests
• Move tags from old to new cluster in Genie
– New cluster begins taking load immediately
• Let old jobs finish on old cluster
• Shut down old cluster
• No downtime!
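The upgrade is downtime-free because jobs resolve clusters by tags at submission time. A toy model of that routing (illustrative only, not Genie's implementation):

```python
def pick_cluster(clusters, required_tags):
    """Return the first registered cluster whose tags cover the request."""
    for name, tags in clusters.items():
        if required_tags <= tags:
            return name
    raise LookupError("no cluster matches %s" % required_tags)


# Old production cluster carries the routing tag; new cluster is
# registered and tested but not yet tagged for production traffic.
clusters = {
    "h2prod-old": {"sched:sla", "type:yarn"},
    "h2prod-new": {"type:yarn"},
}
assert pick_cluster(clusters, {"sched:sla"}) == "h2prod-old"

# Move the tag from old to new: new submissions immediately route to the
# new cluster, while jobs already running on the old one finish in place.
clusters["h2prod-old"].discard("sched:sla")
clusters["h2prod-new"].add("sched:sla")
assert pick_cluster(clusters, {"sched:sla"}) == "h2prod-new"
```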
48. Load Balance Between Clusters
• Different loads at different times of day
• Copy tags from one cluster to another to split load
• Remove tags when done
• Transparent to all clients!
49. Update Application Binaries
• Copy new binaries to a central download location
• The Genie cache invalidates old binaries on the next invocation and downloads the new ones
• Instant change across the entire Genie cluster
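The cache behavior described above can be sketched as follows (an illustration of the idea, not Genie's code): each node keys cached binaries by a cheap version marker from the central download location, so when the central copy changes, the next invocation drops the stale entry and fetches the new one.

```python
class BinaryCache:
    def __init__(self, fetch, version_of):
        self._fetch = fetch            # downloads binaries from central store
        self._version_of = version_of  # cheap version probe (e.g. ETag/mtime)
        self._entries = {}             # name -> (version, binaries)

    def get(self, name):
        current = self._version_of(name)
        cached = self._entries.get(name)
        if cached is None or cached[0] != current:  # missing or stale
            self._entries[name] = (current, self._fetch(name))
        return self._entries[name][1]


# Hypothetical central download location modeled as a dict.
central = {"spark": ("v1", b"spark-1.6.0")}
cache = BinaryCache(
    fetch=lambda n: central[n][1],
    version_of=lambda n: central[n][0],
)
assert cache.get("spark") == b"spark-1.6.0"

central["spark"] = ("v2", b"spark-1.6.1")   # push new binaries centrally
assert cache.get("spark") == b"spark-1.6.1" # next invocation picks them up
```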
51. A user wants a tool to…
• Discover a cluster to run the job on
• Run the job client
• Handle all dependencies and configuration
• Monitor the job
• View history of jobs
• Get job results
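Putting the user's wishlist together: a Genie job submission names the work and the tags it needs, never a specific cluster. The sketch below builds such a request payload; the field names follow Genie 3's job request as commonly documented, but treat the endpoint and exact schema as assumptions and check the Genie docs for the real contract.

```python
import json

# Hypothetical job: names, tags, and arguments are all illustrative.
job_request = {
    "name": "nightly-ratings-rollup",
    "user": "dataeng",
    # Ordered fallbacks: prefer the SLA YARN cluster, else any YARN cluster.
    "clusterCriterias": [
        {"tags": ["sched:sla", "type:yarn"]},
        {"tags": ["type:yarn"]},
    ],
    # Which registered executable to run.
    "commandCriteria": ["type:spark-submit"],
    "commandArgs": "--class com.example.Rollup rollup.jar 2016-11-01",
}

# This JSON body would be POSTed to the Genie jobs endpoint.
body = json.dumps(job_request)
```

Because the request only carries tags, the administrator tricks above (swapping or copying cluster tags) change where this job runs without the user touching anything.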
59. Data Warehouse
• S3 for Scale
• Decouple Compute & Storage
• Parquet for Speed
60. Genie at Netflix
• Runs the OSS code
• Runs ~45k jobs per day in production
• Runs on ~25 i2.4xl instances at any given time
• Keeps ~3 months of jobs (~3.1 million) in history
61. Resources
• http://netflix.github.io/genie/
– Work in progress for 3.0.0
• https://github.com/Netflix/genie
– Demo instructions in README
• https://hub.docker.com/r/netflixoss/genie-app/
– Docker Container