Cloud Connect 2012, Big Data @ Netflix
1. Big Data @ Netflix
Using Big Data to Grow our Business
& Retain our Customers.
Jerome Boulon
Lead Architect, Hadoop Big Data Infrastructure
February 15, 2012
jboulon@netflix.com
2. Big Data @ Netflix
Offline analysis:
• Honu: scalable log analysis system to gain business insights
– Error logs (unstructured logs)
– Statistical logs & performance logs
– etc.
Online analysis:
• Cassandra for all online activities and user-facing data
– A/B testing (test allocation, metadata)
– Service-level configuration
– etc.
3. Overview
[Diagram: data collection pipeline (Application → Collectors) feeding the data processing pipeline (Hive, M/R)]
4. Honu - Structured Log API
Using Annotations:
• Convert a Java class to a Hive table dynamically
• Add/remove columns
• Supported Java types:
– All primitives
– Map
– Object, using the toString method

Using the Key/Value API:
• Produces the same result as annotations
• Avoids unnecessary object creation
• Fully dynamic
• Thread safe
5. Honu, What you get:
log.logEvent(myObject)
Hive table: movieId | customerId | timestamp | hostname
Select customerId, count(1) from MyTable group by customerId;
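A minimal sketch of both client styles from the previous slide, assuming hypothetical @Table/@Column annotation and KeyValueSerializer names (only log.logEvent appears in the deck):

  // Annotation style: the class is converted to a Hive table dynamically;
  // adding or removing a field adds or removes a column.
  @Table(name = "MyTable")                  // illustrative annotation names
  public class PlaybackEvent {
      @Column public long movieId;
      @Column public long customerId;
      // timestamp and hostname are presumably stamped by the pipeline itself
  }
  log.logEvent(new PlaybackEvent());

  // Key/Value style: same resulting table, fully dynamic, thread safe,
  // and avoids creating an intermediate event object per call.
  KeyValueSerializer kv = new KeyValueSerializer("MyTable");
  kv.add("movieId", 1234L);
  kv.add("customerId", 42L);
  log.logEvent(kv);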
6. December 2009
– POC for streaming analysis
– Single AWS zone
– 1 application
– 60 million events/day
– 50 clients
– Small Hadoop cluster
– 1 Map/Reduce
– 1 table
[Diagram: Application → Collectors → Oracle, M/R]
7. Feb 2012
– 40+ billion events/day
– 8+ tables with 1+ TB/day
– 100+ smaller tables
– Multi-region deployments
– Transparent to our engineers
– Streaming-based solution
– Zero configuration
– 7000+ clients
– Built-in: Netflix Hive warehouse
– Fail-over
– Load balancing
Self-serve:
→ No DBA
→ No pre-provisioning
→ Fully integrated with Hive
→ One central data warehouse
→ Hourly/daily reports
→ Data retention/expiration
9. Diagnostic Information
• Collect latency information for all external operations
• If latency > threshold, log to Honu:
– AWS Region & Zone
– Instance
– Service details
• Open a Jira ticket & attach the diagnostic info
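A sketch of this pattern, reusing the assumed key/value style from above; the threshold constant and field names are illustrative:

  long start = System.currentTimeMillis();
  Response response = externalService.call(request);   // any external operation
  long latencyMs = System.currentTimeMillis() - start;
  if (latencyMs > LATENCY_THRESHOLD_MS) {
      KeyValueSerializer kv = new KeyValueSerializer("DiagnosticInfo");
      kv.add("region", "us-east-1");          // AWS region & zone
      kv.add("zone", "us-east-1a");
      kv.add("instance", instanceId);
      kv.add("service", serviceDetails);      // service details
      kv.add("latencyMs", latencyMs);
      log.logEvent(kv);                       // attach this record to the ticket
  }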
10. Mix Offline and Online Data
Offline data:
- Fire & forget
- Scales to very large volumes
- Cost effective

Specific conditions:
- Online data availability is not mandatory
- If it exists, the data could be useful online
- Only a subset is useful online
- Ready to pay a little bit more

Special collectors (sketched below):
- All data goes to Hive
- A subset goes to a real-time system
- Still cost effective

Customer support:
- Browsing history
- Historical & non-critical actions

Debug:
- Push validation
- Root cause analysis
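A sketch of the special-collector idea under the same assumed API; isUsefulOnline and realtimeStore are illustrative names:

  // Every event goes to the Hive pipeline (fire & forget, cheap at volume);
  // only the subset worth paying extra for also goes to the real-time system.
  public void collect(Event event) {
      honuLog.logEvent(event);               // offline path: always
      if (event.isUsefulOnline()) {          // e.g. customer-support actions
          realtimeStore.write(event);        // online path: subset only
      }
  }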
11. Honu Realtime usages
• Movie playback experience
– Video quality
– Network issues
• Errors summary
– Error tracking per service
– Error tracking per device
• Customer support
– Historical usage
– Last activity
• Launch reports
– Push validation
– Root cause analysis
12. Honu Realtime - Architecture
[Diagram: realtime data collection pipeline — Application → Collectors → Realtime System (realtime access) and M/R]
13. A/B Testing
Test: an experiment where several competing behaviors are implemented and compared.
Cell: the different experiences within a test that are compared against each other.
Allocation: a customer-specific assignment to a cell within a test.

Online data:
- Cell allocation information: > 1 billion records
- Test config: 1 entry/test/customer

Tracking volume (example): 1M customers per test, 8 tracking events per day
100 tests = 800M events/day
3 months = 72B events
15. A/B Testing - Architecture
Online data:
- Customer test allocation
- Metadata about the test, e.g.:
– Start/end date
– UI directives
– Logging directives

Offline data:
- Test tracking, e.g.:
– Retention
– Engagement metrics
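An illustrative data model for the online side; all class and field names are assumptions (java.util.Date and Map implied):

  public class Allocation {                  // one record per customer per test
      public long customerId;
      public int testId;
      public int cellId;                     // which experience the customer sees
  }

  public class TestConfig {                  // metadata about the test
      public int testId;
      public Date startDate, endDate;
      public Map<String, String> uiDirectives;       // what the UI should render
      public Map<String, String> loggingDirectives;  // what to track offline
  }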
16. Beacon Server
User behavior:
- Client-side interactions
- Search/Play/Stop/Pause

Device monitoring:
- Heartbeat
- Status & key metrics

[Diagram: clients send Ajax calls to a pool of Beacon servers]
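An illustrative beacon payload carried by those Ajax calls; field names are assumptions:

  public class BeaconEvent {
      public String deviceId;
      public String action;                  // Search / Play / Stop / Pause
      public long timestamp;
      public Map<String, String> keyMetrics; // heartbeat status & key metrics
  }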
18. Hive ↔ BI
– Dimension tables (daily export from Teradata)
– Hourly/Daily Hive summary queries
– Hourly/Daily export from Hive to BI
• Queries run in the cloud
• Aggregated results go back to our BI solution
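A sketch of the hourly summary step over Hive's 2012-era JDBC client (HiveServer1); the table, query, and biLoader are illustrative:

  // uses java.sql.* (DriverManager, Connection, ResultSet)
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  Connection conn = DriverManager.getConnection(
      "jdbc:hive://hive-gateway:10000/default", "", "");
  ResultSet rs = conn.createStatement().executeQuery(
      "SELECT customerId, count(1) AS plays FROM MyTable GROUP BY customerId");
  while (rs.next()) {
      // push the aggregated rows back to the BI solution (illustrative loader)
      biLoader.insert(rs.getLong("customerId"), rs.getLong("plays"));
  }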
20. Cassandra → BI
• Use Cassandra backups to run analytics
• Export SSTable to Hadoop
• Pig to:
– Parse SSTable
– Extract/Group required information
• Load the result back to Teradata