Providing a great media consumption experience is crucial to maximizing audience engagement. To do that, it is important to make content available for consumption anytime, anywhere, on any device, with a personalized and interactive experience. This session explores the power of big data log analytics (real-time and batch), using technologies like Spark, Shark, Kafka, Amazon Elastic MapReduce, Amazon Redshift, and other AWS services. Such analytics are useful for content personalization, recommendations, personalized dynamic ad insertions, interactivity, and streaming quality.
This session also includes a discussion from Netflix, which explores personalized content search and discovery with the power of metadata.
2. Consumers Want …
• To watch content that matters to them
– From anywhere
– For free
• Content to be easily accessible
– Nicely organized according to their taste
– Instantly accessible from all different devices anywhere
• Content to be of high quality
– Without any “irrelevant” interruptions
– Personalized ads
• To multitask while watching content
– Interactivity, second screens, social media, share
14. Data Sources
• Content discovery
– Metadata
– Session logs
• Content delivery
– Video logs
– Page-click event logs
– CDN logs
– Application logs
• Computed along several high cardinality dimensions
• Very large datasets for a specific time frame
15. Mountains of Raw Data …
Some numbers from Netflix
– Over 40 million customers
– More than 50 countries and territories
– Translates to hundreds of billions of events in a short period of time
– Over 100 billion metadata operations a day
– Over 1 billion viewing hours per month
16. … To Useful Information ASAP
• Historical data
– Batch Analysis
• Live data
– Real-time Analysis
17. Historic vs. Real-time Analytics
100% Dynamic (Content Delivery)
• Always computing on the fly
• Flexible but slow
• Scale is very hard
100% Pre-computed (Content Discovery)
• Superfast lookups
• Rigid
• Does not cover all the use cases
19. Agenda
• Ultimate Content Discovery
– How Netflix creates personalized content and the power of metadata
• Ultimate Content Delivery
– The toolset for real-time big data processing
21. Why Is Personalization Important?
• Key streaming metrics
– Median viewing hours
– Net subscribers
• Personalization consistently improves these
• Over 75% of what people watch comes from recommendations
22. What Is Video Metadata?
• Genre
• Cast
• Rating
• Contract information
• Streaming and deployment information
• Subtitles, dubbing, trailers, stills, actual content
• Hundreds of such attributes
24. Our Solution
Batch processing turns mountains of video metadata into relevant, organized data for consumption:
• Data snapshots to Amazon S3 (~10)
• Metadata publishing engine (one EC2 instance per country) generates Amazon S3 facets (~10 per country)
• Metadata cache reads Amazon S3 periodically and serves high-availability apps deployed on EC2 (~2000 m2.4xl and m2.2xl instances)
25. Metadata Platform Architecture
Various metadata generation and entry tools (com.netflix.mpl.test.coldstart.services)
→ Put snapshot files (one per source) to Amazon S3
→ Publishing Engine (netflix.vms.blob.file.instancetype.region) gets the snapshot files
→ Puts facets (10 per country per cycle) to Amazon S3
→ Playback, Devices, API, and Algorithms apps (EC2 instances, m2.2xl or m2.4xl) get blobs (~7 GB files, 10 gets per instance refresh)
→ Offline metadata processing
26. Data Entry and Encoding Tools → Persistent Storage (Amazon S3) → Publishing Engine → In-memory Cache → Apps
• Metadata is entered via the data entry and encoding tools; hourly data snapshots land in Amazon S3
• The publishing engine checks for snapshots, then generates and writes artifacts
– Efficient resource utilization
– Quick data propagation
• The in-memory cache refreshes periodically from Amazon S3
– Low footprint
– Quick startup/refresh
• Apps read the cache through Java API calls
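As an illustration of this flow, here is a minimal sketch of a periodic in-memory cache refresh from Amazon S3 using the AWS SDK for Java. The bucket, key, and snapshot format are assumptions for illustration; Netflix's actual cache is NetflixGraph (mentioned on slide 34).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class MetadataCache {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private volatile Map<String, String> cache = new ConcurrentHashMap<>();

    // Called periodically; assumed bucket/key and key<TAB>value snapshot format
    void refresh() throws Exception {
        S3Object obj = s3.getObject("metadata-artifacts", "latest/snapshot.tsv");
        Map<String, String> fresh = new ConcurrentHashMap<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(obj.getObjectContent()))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) fresh.put(kv[0], kv[1]);
            }
        }
        cache = fresh; // atomic swap: readers never observe a partial refresh
    }

    String get(String key) { return cache.get(key); } // always served from memory
}

The atomic reference swap is what keeps 100% of metadata access in memory, the availability property quantified on the next slides.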
28. Target Application Scale
• File size 2 GB–15 GB
• ~10 per country (20 total)
• ~2000 instances (m2.2xl or m2.4xl) accessing these files once an hour via cache refresh from Amazon S3
• Availability goal: streaming 99.99%, sign-ups 99.9%
– 100% of metadata access in memory
– Autoscaling to manage capacity efficiently, which makes startup time important
30. Target Application Scale
• File size 2 GB–15 GB
• ~10 per country (growing from 20 to 500 total)
• ~2000 growing to 6000 instances (m2.2xl or m2.4xl) accessing via cache refresh from Amazon S3
• 100% of access in-memory to achieve high availability
• Autoscaling to manage capacity efficiently; startup time matters
31. Effects
• Slower file writes
• Longer publish time
• Slower startup and cache refresh
32. Amazon S3 Tricks That Helped
• Fewer writes
– Region-based publishing engine instead of per-country
– Blob images rather than facets
– 10 Amazon S3 writes per cycle (down from 500)
• Smaller file sizes
– Deduping moved to prewrite processing
– Compression: zipped data snapshot files from source
• Multipart writes
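For the multipart-writes bullet, here is a minimal sketch using the AWS SDK for Java's TransferManager, which automatically splits large objects (like the multi-GB blob images above) into parallel part uploads. The bucket, key, and file path are assumptions.

import java.io.File;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class BlobUploader {
    public static void main(String[] args) throws InterruptedException {
        TransferManager tm = TransferManagerBuilder.standard().build();
        // Large files are uploaded as multiple parts in parallel
        Upload upload = tm.upload("metadata-artifacts", "blobs/us-east-1/image.blob",
                new File("/tmp/image.blob"));
        upload.waitForCompletion(); // blocks until every part has been uploaded
        tm.shutdownNow();
    }
}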
33. Results
• Significant reduction in average memory
footprint
• Significant reduction in application startup times
• Shorter publish times
34. What We Learned
• In-memory cache (NetflixGraph) effective for high availability
• Startup time important when using autoscaling
• Use Amazon S3 best practices
• Circuit breakers
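The circuit-breaker bullet refers to failing fast when a dependency is unhealthy and serving the last good cached data instead; Netflix open-sourced Hystrix for this. Below is a minimal sketch of the pattern itself, not Netflix's implementation.

public class CircuitBreaker {
    private final int threshold;         // consecutive failures before opening
    private final long retryAfterMillis; // how long to stay open before a probe
    private int failures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int threshold, long retryAfterMillis) {
        this.threshold = threshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    public synchronized boolean allowRequest() {
        if (failures < threshold) return true; // closed: pass through
        if (System.currentTimeMillis() - openedAt > retryAfterMillis) {
            failures = threshold - 1; // half-open: let one probe through
            return true;
        }
        return false; // open: fail fast, serve cached data instead
    }

    public synchronized void recordSuccess() { failures = 0; }

    public synchronized void recordFailure() {
        if (++failures >= threshold) openedAt = System.currentTimeMillis();
    }
}

A caller checks allowRequest() before each dependency call and falls back to the in-memory cache when it returns false.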
39. Amazon Kinesis
Enabling real-time ingestion and processing of streaming data:
• User data sources PUT records into Amazon Kinesis through an AWS VIP
• Amazon Kinesis-enabled applications GET records and process them: User App.1 [Aggregate & DeDuplicate], User App.2 [Metric Extraction], User App.3 [Sliding Window]
• Processed results flow to Amazon S3, DynamoDB, and Amazon Redshift
• A control plane manages the service
40. A Quick Intro to Amazon Kinesis
Producers
• Generate a stream of data
• Data records from producers are put into a stream using a developer-supplied partition key, which places each record within a specific shard
Kinesis cluster
• A managed service that captures and transports data streams
• Multiple shards; each shard supports 1 MB/sec
• The developer controls the number of shards; all records are stored for 24 hours
• High availability and durability through 3-way replication across three Availability Zones
• Each data record has a Kinesis-assigned sequence number
Workers
• Each shard is processed by a worker running on EC2 instances that the developer owns and controls
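To make the producer side concrete, below is a minimal put sketch using the AWS SDK for Java. The stream name, partition key, and payload are illustrative assumptions, not part of the original talk.

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class SessionLogProducer {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();
        // Assumed stream name and event payload; the partition key decides the shard
        PutRecordRequest req = new PutRecordRequest()
            .withStreamName("video-session-logs")
            .withPartitionKey("session-42")
            .withData(ByteBuffer.wrap("BUFFER event at t=120s".getBytes()));
        PutRecordResult res = kinesis.putRecord(req);
        // Kinesis assigns each record a sequence number within its shard
        System.out.println("shard=" + res.getShardId() + " seq=" + res.getSequenceNumber());
    }
}

Records that share a partition key land in the same shard, so one worker sees every event for a given session.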
42. A Quick Intro to Storm
• Similar to a Hadoop cluster
• Topologies vs. jobs (Storm vs. Hadoop)
– A topology runs forever (unless you kill it)
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
• Streams – unbounded sequence of tuples
– A stream of tweets into a stream of trending topics
– Spout: Source of streams (e.g., connect to a log API and emit a stream of logs)
– Bolt: Consumes any number of input streams, does some processing, and emits new streams (e.g., filters, unions, computations)
*Source: https://github.com/nathanmarz/storm/wiki/Tutorial
43. A Quick Intro to Storm
Example: Get the count of ads that were clicked on and watched in a stream
// Create the topology
LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
// Get the stream that showed an ad; transforms a stream of [id, ad] to [id, stream]
builder.addBolt(new GetStream(), 3);
// Get the viewers for ads; transforms [id, stream] to [id, viewer]
builder.addBolt(new GetViewers(), 12).shuffleGrouping();
// Group the viewer stream by viewer id; unique count of subset viewers
builder.addBolt(new PartialUniquer(), 6).fieldsGrouping(new Fields("id", "viewer"));
// Compute the aggregate count of unique viewers
builder.addBolt(new CountAggregator(), 2).fieldsGrouping(new Fields("id"));
*Adapted from: https://github.com/nathanmarz/storm/wiki/Tutorial
44. Putting it together …
User data sources PUT records into Amazon Kinesis through an AWS VIP. Amazon Kinesis-enabled applications GET and process the records (User App.1 [Aggregate & DeDuplicate], User App.2 [Metric Extraction], User App.3 [Sliding Window]), coordinated by the control plane.
45. A Quick Intro to Spark
• Language-integrated interface in Scala
• General-purpose programming interface that can be used for interactive data mining on clusters
• Example (count buffer events from a streaming log):
val lines = spark.textFile("hdfs://...")           // define a data structure (RDD)
val errors = lines.filter(_.startsWith("BUFFER"))
errors.persist()                                   // persist in memory
errors.count()
errors.filter(_.contains("Stream1")).count()
errors.cache()                                     // cache datasets in memory to speed up reuse
*Source: Resilient Distributed Datasets: A Fault tolerant Abstraction for In-memory Cluster Computing
47. Logistic Regression Performance
• 30 GB dataset on an 80-core cluster
• Hadoop: 127 s per iteration
• Spark: 174 s for the first iteration, 6 s for further iterations
• Up to 20× faster than Hadoop for interactive jobs
• Scans a 1 TB dataset with 5–7 s latency
*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
48. Conviva GeoReport
[Chart: GeoReport query time in hours]
• Aggregations on many keys w/ same WHERE clause
• 40× gain comes from:
– Not rereading unused columns or filtered records
– Avoiding repeated decompression
– In-memory storage of deserialized objects
50. Batch Processing
• Amazon EMR
– Hive on EMR
– Custom UDFs (user-defined functions) for data warehouse needs
• Amazon Redshift
– More traditional data warehousing workloads
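To illustrate the custom-UDF bullet, here is a minimal sketch of a Hive UDF in Java. The device-type semantics and the jar/table names in the usage comment are assumptions for illustration.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: maps a raw user-agent string to a coarse device type
public final class DeviceType extends UDF {
    public Text evaluate(Text userAgent) {
        if (userAgent == null) return null;
        String ua = userAgent.toString().toLowerCase();
        if (ua.contains("roku")) return new Text("roku");
        if (ua.contains("android")) return new Text("android");
        if (ua.contains("iphone") || ua.contains("ipad")) return new Text("ios");
        return new Text("other");
    }
}

// In Hive on EMR (assumed jar location and table):
//   ADD JAR s3://my-bucket/udfs.jar;
//   CREATE TEMPORARY FUNCTION device_type AS 'DeviceType';
//   SELECT device_type(user_agent), COUNT(*) FROM video_logs
//   GROUP BY device_type(user_agent);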
53. What’s next …
• On-the-fly adaptive bit rate; in the future, adaptive frame rate and resolution
• Dynamic personalized ad insertions
• Session analytics for more monetization opportunities
• Social media chat
54. Pick up the remote
Start watching
Ultimate entertainment experience…