SlideShare une entreprise Scribd logo
1  sur  21
Big Data @

         Using Big Data to Grow our Business
               & Retain our Customers.

                         Jerome Boulon
           Lead Architect, Hadoop Big Data Infrastructure

                         February 15, 2012
jboulon@netflix.com
Big Data @ Netflix
Offline analysis:
•  Honu: Scalable log analysis system to gain business
   insights:
   –  Errors logs (unstructured logs)
   –  Statistical logs & Performance logs
   –  Etc

Online analysis:
•  Cassandra for all online activities and user facing
   data
   –  A/B testing (test allocation, metadata)
   –  Service level Configuration
   –  etc
                                  2
Overview
                            Data collection pipeline


Applica'on	
                                           Collectors	
  




                 Hive	
                                    M/R	
  




                            Data processing pipeline
                                       3
Honu - Structured Log API
Using	
  Annota+ons	
           Using the Key/Value API
•  Convert Java Class to Hive   •  Produce the same result as
   Table dynamically               Annotation
•  Add/Remove column            •  Avoid unnecessary object
•  Supported java types:           creation
        •  All primitives       •  Fully dynamic
        •  Map                  •  Thread Safe
        •  Object using the
           toString method
Honu, What you get:




log.logEvent(myObject)
                                        Hive table
                         movieId customerId timestamp hostname



      Select customerId, count(1) from MyTable group by customerId;
December 2009
                                                                          Collectors	
  
–    POC for Streaming analysis                Applica'on	
  
–    Single AWS zone
–    1 application
–    60 Millions events/Day
–    50 clients
–    Small Hadoop cluster         Oracle	
  
–    1 Map/Reduce
–    1 Table
                                                                M/R	
  
Feb 2012
                                                40+ Billion events/Day
                                                 8+ tables with 1+TB/Day
                                                100+ smaller tables
                                                Self-serve:
                                                à No DBA
                                                à No Pre-provisioning


                                 	
            	
  
                                                à Fully integrated with Hive
- Multi Regions deployments
- Transparent to our engineers
- Streaming based solution
- Zero configuration
- 7000+ clients
- Built-in:                                           Netflix Hive warehouse
  - Fail-Over
  - Load balancing
                                        	
       à One central Data warehouse
                                                 à Hourly/Daily reports
                                                 à Data retention/expiration
Traceability & Performance
              analysis
•  Track service level call
   –  Instrument low level HTTP client
   –  Calls graph
   –  Request processing vs Perceive latency
   –  Payload marshalling/unmarshalling
      - duration, size, etc
   –  Service Result
      - Status, Error code, Exception, etc
Diagnostic Information
•  Collect latency information for all external
   operations
•  If Latency > threshold log to Honu:
    –  AWS Region & Zone
    –  Instance
    –  Service details
•  Open Jira/Ticket & Attach diagnostic info
Mix Offline and Online Data
Offline data                             Specific conditions
- Fire & forget                          - Online Data availability is not mandatory
- Scale to very large volumes            - If exist, data could be useful online
- Cost effective                         - Only a subset useful Online
                                         - Ready to pay a little bit more




 Special collectors                        Customer support
 - All data goes to Hive                   - Browsing history
 - A subset goes to a real-time system     - Historical & non-critical actions
 - Still cost effective                    Debug
                                           - Push validation
                                           - Root cause analysis
Honu Realtime usages
•  Movie playback experience        •  Customer Support
   –  Video quality                     –  Historical usage
   –  Network issue                     –  Last activity



•  Errors Summary                   •  Launch Reports
    –  Error tracking per service       –  Push validation
    –  Error tracking per device        –  Root cause analysis
Honu Realtime - Architecture
                 Realtime Data collection pipeline


Applica'on	
                                         Collectors	
  




   Real'me	
  
    Access	
  
                             Realtime
                             System                         M/R	
  
A/B Testing
 Test: An experiment where several
 competing behaviors are
 implemented and compared.

 Cell: different experiences within a
 test that are being compared against
 each other.

 Allocation: a customer-specific
 assignment to a cell within a test

Online data:                            Tracking       1 M customers per Test
- Cell Allocation > 1 Billion records   information    8 tracking events per Day
- Test config: 1 entry/test/customer    (example)     ------------------------------------
                                        100 Tests =   800 M events/ Day
                                        3 Months =       72 B events
Movie Presentation A/B Test
A/B Testing - Architecture
         Online Data            Offline Data




- Customer test allocation   - Test tracking
- Metadata about the test    Ex:
Ex:                          - Retention
- Start/End date             - Engagement metrics
- UI directives
- Logging directives
Beacon Server

User behavior
- Client side interactions
- Search/Play/Stop/Pause
                                           Ajax calls
Device monitoring
- Heartbeat
- Status & Key metrics        Beacon	
      Beacon	
     Beacon	
  
BI Integration
Three main technologies

•  Teradata (Data center)
•  Hive (Cloud)
•  Cassandra (Cloud)
Hive ß à BI
–  Dimension tables (daily export from Teradata)
–  Hourly/Daily Hive summary queries
–  Hourly/Daily export from Hive to BI
  •  Queries runs in the cloud
  •  Aggregated result goes back to our BI solution
Hive Reports
Cassandra à BI

•  Use Cassandra backups to run analytics
•  Export SSTable to Hadoop
•  Pig to:
  –  Parse SSTable
  –  Extract/Group required information
•  Load the result back to Teradata
jboulon@gmail.com
www.linkedin.com/in/jboulon

Contenu connexe

Tendances

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience Martin Zapletal
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Spark Summit
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...HostedbyConfluent
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewMonal Daxini
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiBrian Olsen
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...Amazon Web Services
 
The Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownThe Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownData Con LA
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Timothy Spann
 
Netflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiNetflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiKevin McEntee
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesSingleStore
 
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...confluent
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at AirbnbHao Wang
 
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaA unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaHostedbyConfluent
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...HostedbyConfluent
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...HostedbyConfluent
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechHostedbyConfluent
 

Tendances (20)

ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overview
 
Instrumenting your Instruments
Instrumenting your Instruments Instrumenting your Instruments
Instrumenting your Instruments
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel Pedreschi
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
 
The Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownThe Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt Brown
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020
 
Netflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiNetflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwiki
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming Architectures
 
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaA unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
 

En vedette

The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programminghye-jin-lee
 
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESAdrija Chowdhury
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScaleDataWorks Summit
 
Mobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionMobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionnagarajc007
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsBlake Irvine
 

En vedette (6)

The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
 
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
 
Mobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionMobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detection
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of Analytics
 

Similaire à Cloud Connect 2012, Big Data @ Netflix

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...SL Corporation
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupSri Ambati
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaKai Wähner
 
Resistance is futile, resilience is crucial
Resistance is futile, resilience is crucialResistance is futile, resilience is crucial
Resistance is futile, resilience is crucialHristo Iliev
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...Amazon Web Services
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...SL Corporation
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networkspbelko82
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroGaurav "GP" Pal
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsDataWorks Summit
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Michael Hiskey
 

Similaire à Cloud Connect 2012, Big Data @ Netflix (20)

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Resistance is futile, resilience is crucial
Resistance is futile, resilience is crucialResistance is futile, resilience is crucial
Resistance is futile, resilience is crucial
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 

Dernier

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

Cloud Connect 2012, Big Data @ Netflix

  • 1. Big Data @ Using Big Data to Grow our Business & Retain our Customers. Jerome Boulon Lead Architect, Hadoop Big Data Infrastructure February 15, 2012 jboulon@netflix.com
  • 2. Big Data @ Netflix Offline analysis: •  Honu: Scalable log analysis system to gain business insights: –  Errors logs (unstructured logs) –  Statistical logs & Performance logs –  Etc Online analysis: •  Cassandra for all online activities and user facing data –  A/B testing (test allocation, metadata) –  Service level Configuration –  etc 2
  • 3. Overview Data collection pipeline Applica'on   Collectors   Hive   M/R   Data processing pipeline 3
  • 4. Honu - Structured Log API Using  Annota+ons   Using the Key/Value API •  Convert Java Class to Hive •  Produce the same result as Table dynamically Annotation •  Add/Remove column •  Avoid unnecessary object •  Supported java types: creation •  All primitives •  Fully dynamic •  Map •  Thread Safe •  Object using the toString method
  • 5. Honu, What you get: log.logEvent(myObject) Hive table movieId customerId timestamp hostname Select customerId, count(1) from MyTable group by customerId;
  • 6. December 2009 Collectors   –  POC for Streaming analysis Applica'on   –  Single AWS zone –  1 application –  60 Millions events/Day –  50 clients –  Small Hadoop cluster Oracle   –  1 Map/Reduce –  1 Table M/R  
  • 7. Feb 2012 40+ Billion events/Day 8+ tables with 1+TB/Day 100+ smaller tables Self-serve: à No DBA à No Pre-provisioning     à Fully integrated with Hive - Multi Regions deployments - Transparent to our engineers - Streaming based solution - Zero configuration - 7000+ clients - Built-in: Netflix Hive warehouse - Fail-Over - Load balancing   à One central Data warehouse à Hourly/Daily reports à Data retention/expiration
  • 8. Traceability & Performance analysis •  Track service level call –  Instrument low level HTTP client –  Calls graph –  Request processing vs Perceive latency –  Payload marshalling/unmarshalling - duration, size, etc –  Service Result - Status, Error code, Exception, etc
  • 9. Diagnostic Information •  Collect latency information for all external operations •  If Latency > threshold log to Honu: –  AWS Region & Zone –  Instance –  Service details •  Open Jira/Ticket & Attach diagnostic info
  • 10. Mix Offline and Online Data Offline data Specific conditions - Fire & forget - Online Data availability is not mandatory - Scale to very large volumes - If exist, data could be useful online - Cost effective - Only a subset useful Online - Ready to pay a little bit more Special collectors Customer support - All data goes to Hive - Browsing history - A subset goes to a real-time system - Historical & non-critical actions - Still cost effective Debug - Push validation - Root cause analysis
  • 11. Honu Realtime usages •  Movie playback experience •  Customer Support –  Video quality –  Historical usage –  Network issue –  Last activity •  Errors Summary •  Launch Reports –  Error tracking per service –  Push validation –  Error tracking per device –  Root cause analysis
  • 12. Honu Realtime - Architecture Realtime Data collection pipeline Applica'on   Collectors   Real'me   Access   Realtime System M/R  
  • 13. A/B Testing Test: An experiment where several competing behaviors are implemented and compared. Cell: different experiences within a test that are being compared against each other. Allocation: a customer-specific assignment to a cell within a test Online data: Tracking 1 M customers per Test - Cell Allocation > 1 Billion records information 8 tracking events per Day - Test config: 1 entry/test/customer (example) ------------------------------------ 100 Tests = 800 M events/ Day 3 Months = 72 B events
  • 15. A/B Testing - Architecture Online Data Offline Data - Customer test allocation - Test tracking - Metadata about the test Ex: Ex: - Retention - Start/End date - Engagement metrics - UI directives - Logging directives
  • 16. Beacon Server User behavior - Client side interactions - Search/Play/Stop/Pause Ajax calls Device monitoring - Heartbeat - Status & Key metrics Beacon   Beacon   Beacon  
  • 17. BI Integration Three main technologies •  Teradata (Data center) •  Hive (Cloud) •  Cassandra (Cloud)
  • 18. Hive ß à BI –  Dimension tables (daily export from Teradata) –  Hourly/Daily Hive summary queries –  Hourly/Daily export from Hive to BI •  Queries runs in the cloud •  Aggregated result goes back to our BI solution
  • 20. Cassandra à BI •  Use Cassandra backups to run analytics •  Export SSTable to Hadoop •  Pig to: –  Parse SSTable –  Extract/Group required information •  Load the result back to Teradata