Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge, both because of the sheer volume of traffic and because of the dynamic nature of deployments. In this talk, we'll first discuss why Netflix chose Amazon Kinesis Streams over other streaming data solutions like Kafka to address these challenges at scale. We'll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you'll take away techniques and processes that you can apply to your own large-scale network to derive real-time, actionable insights.
5. Capture all of the value from your data with Amazon Kinesis
[Architecture diagram: Ingest → Process → React → Persist pipeline with latency markers 0 ms, 200 ms, and 1–2 s; components include Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis–enabled apps, AWS Lambda, Amazon Kinesis Firehose, Amazon S3, Amazon Redshift, and Amazon QuickSight]
6. Amazon Kinesis customer base diversity
● 1 billion events/wk from connected devices | IoT
● 17 PB of game data per season | Entertainment
● 80 billion ad impressions/day, 30 ms response time | Ad Tech
● 100 GB/day click streams from 250+ sites | Enterprise
● 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
● 10 million events/day | Retail
● Amazon Kinesis as databus; migrated from Kafka to Kinesis | Enterprise
● Funnel all production events through Amazon Kinesis | High Tech
7. Why are these customers choosing Amazon Kinesis?
● Lower costs
● Performant without heavy lifting
● Scales elastically
● Increased agility
● Secure and visible
● Plug and play
8. “I don't know how we could have made our clickstream data pipeline work without Amazon Kinesis. It would have involved many weeks of engineering. Kinesis Streams and Firehose make the entire process extremely simple and reliable.”
Peter Jaffe, Data Scientist, Hearst Corporation
9. Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real Time
27. Queries
● OLAP-style (Online Analytical Processing)
● Roll-up
○ e.g., all apps deployed to the same region roll up to that region
● Drill-down
○ e.g., which apps deployed to a region generate the most traffic?
● Slicing and dicing
○ e.g., which apps generate the most traffic in a region by day?
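The roll-up, drill-down, and slice-and-dice operations above all reduce to grouping enriched flow records by different sets of dimensions. A toy sketch (the record schema and values are illustrative, not Netflix's actual data model):

```python
from collections import defaultdict

# Toy enriched flow records; the (app, region, day, bytes) schema is
# illustrative only.
flows = [
    {"app": "foo", "region": "us-east-1", "day": "mon", "bytes": 100},
    {"app": "bar", "region": "us-east-1", "day": "mon", "bytes": 300},
    {"app": "foo", "region": "eu-west-1", "day": "tue", "bytes": 50},
]

def aggregate(records, *dims):
    """Sum bytes grouped by the given dimensions (OLAP-style aggregation)."""
    totals = defaultdict(int)
    for r in records:
        totals[tuple(r[d] for d in dims)] += r["bytes"]
    return dict(totals)

# Roll up: all apps in the same region roll up to that region.
by_region = aggregate(flows, "region")
# Drill down: which apps in a region generate the most traffic?
by_region_app = aggregate(flows, "region", "app")
# Slice and dice: traffic per app per region by day.
by_region_app_day = aggregate(flows, "region", "app", "day")
```

Each query is the same aggregation over a coarser or finer set of dimensions, which is why a multi-dimensional store with fast group-bys fits this workload.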
28. New source for network analytics
● Large dataset (billions of events per day)
● Multiple dimensions and metrics
● Ad-hoc OLAP queries
● Fast aggregations
● Real-time
29. Dredge
● Ingest: network data from the entire system
● Enrich: traffic logs with application metadata
● Aggregate: multi-dimensional metrics
34. Given a VpcFlowLogEvent
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, …}
Enriched with application metadata
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar}, …}
Aggregated and indexed
App foo sent 426718 bytes to app bar today
35. Given a VpcFlowLogEvent
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, …}
Enriched with application metadata
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar}, …}
Aggregated and indexed
App bar received 8278392 bytes from apps foo and baz in the last week
36. Given a VpcFlowLogEvent
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, …}
Enriched with application metadata
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar}, …}
Aggregated and indexed
App baz has outbound network dependencies on apps foo, bar, etc.
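The enrich-then-aggregate step shown in slides 34–36 can be sketched as a lookup join against an IP-to-application table, followed by a keyed sum. This is a minimal illustration, not the actual Dredge implementation; the metadata table, the `bytes` field, and all values are made up for the example:

```python
from collections import defaultdict

# Hypothetical IP -> application metadata table (illustrative values).
metadata = {
    "172.31.16.139": {"app": "foo"},
    "10.13.67.49": {"app": "bar"},
}

def enrich(event):
    """Attach src/dst application metadata to a VPC flow log event."""
    enriched = dict(event)
    enriched["srcMetadata"] = metadata.get(event["srcIP"], {})
    enriched["dstMetadata"] = metadata.get(event["dstIP"], {})
    return enriched

def aggregate(events):
    """Sum bytes per (source app, destination app) pair."""
    totals = defaultdict(int)
    for e in events:
        key = (e["srcMetadata"].get("app"), e["dstMetadata"].get("app"))
        totals[key] += e["bytes"]
    return dict(totals)

events = [enrich({"srcIP": "172.31.16.139", "dstIP": "10.13.67.49", "bytes": 426718})]
print(aggregate(events))  # {('foo', 'bar'): 426718}
```

Once flows are keyed by application pairs rather than IPs, queries like "app foo sent N bytes to app bar today" fall out of a simple group-by.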
48. Unbundle the database
● Separation of concerns for reading and writing
● Changelog stream is a first-class citizen
● Consume and join streams instead of querying the DB
● Maintain materialized views
● Pre-computed cache
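One way to read "unbundle the database": instead of querying a database on every lookup, a consumer replays a changelog stream into an in-memory materialized view that serves as a pre-computed cache. A minimal sketch under assumed event shapes (the key/value record format is not from the talk):

```python
class MaterializedView:
    """Pre-computed cache built by consuming a changelog stream."""

    def __init__(self):
        self.state = {}

    def apply(self, change):
        # Each changelog record upserts a key; a None value deletes it.
        if change["value"] is None:
            self.state.pop(change["key"], None)
        else:
            self.state[change["key"]] = change["value"]

# Hypothetical changelog of IP -> app assignments.
changelog = [
    {"key": "10.0.0.1", "value": {"app": "foo"}},
    {"key": "10.0.0.2", "value": {"app": "bar"}},
    {"key": "10.0.0.1", "value": None},  # IP released
]

view = MaterializedView()
for change in changelog:
    view.apply(change)

# Readers hit the pre-computed view instead of querying the DB.
print(view.state)  # {'10.0.0.2': {'app': 'bar'}}
```

Separating the write path (the stream) from the read path (the view) is what lets the enrichment job join streams at high throughput without a database in the hot path.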
52. Kinesis over Kafka
● Integration with AWS services
● Kinesis Client Library (KCL)
● Auto scaling for elastic throughput
● Total Cost of Ownership (TCO)
54. Kinesis Client Library
● Worker per EC2 instance
○ Multiple record processors per worker
○ Record processor per shard
● Load balancing between workers
● Checkpointing (with DynamoDB)
● Stream- and shard-level metrics
55. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per hour for an example account and region over 1 week]
56. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per minute for an example account and region over 3 hours]
57. TCO
● Very little operational overhead
○ Monitor stream metrics and the DynamoDB checkpoint table
○ Run and manage the auto-scaling utility
● No consultation needed from the internal Kafka team for:
○ Capacity planning
○ Monitoring, failover, and replication
58. Limitations
● Per-shard limits
○ Increase shard count or fan out to other streams
● No log compaction
○ Up to 7-day max retention
○ Manual snapshots add complexity
○ Not ideal for changelog joins
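The "fan out to other streams" workaround for per-shard limits can be sketched as deterministic routing by a stable hash of the partition key, so per-key ordering is preserved while total throughput scales with the number of streams. The stream names below are made up for illustration:

```python
import hashlib

# Hypothetical stream names; in practice these would be real Kinesis streams.
STREAMS = ["flowlogs-0", "flowlogs-1", "flowlogs-2"]

def pick_stream(partition_key: str) -> str:
    """Deterministically route a record to one of several streams.

    A stable hash keeps every record for a given key on the same stream,
    so ordering per key survives the fan-out.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return STREAMS[int(digest, 16) % len(STREAMS)]

# Same key always maps to the same stream.
assert pick_stream("10.13.67.49") == pick_stream("10.13.67.49")
```

Note that Python's built-in `hash()` would not work here, since it is randomized per process; a content hash like MD5 gives the same routing on every producer.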
59. Ingest: Lessons
● Kinesis enables us to focus
● Cross-account log sharing simplifies the system
● The KCL does the boring stuff
● Auto scaling improves efficiency
● Lower TCO
66. Address Metadata Changelog
● Hash table of sorted lists
● Key is IP; value is metadata sorted by timestamp
● Look up recent updates (within the capture window) or the last known entry
● Join with the flow log events stream