Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge, both because of the sheer volume of traffic and because of the dynamic nature of deployments. In this talk, we'll first discuss why Netflix chose Amazon Kinesis Streams over other streaming data solutions like Kafka to address these challenges at scale. We'll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you'll take away techniques and processes that you can apply to your own large-scale network to derive real-time, actionable insights.
5. Capture all of the value from your data with Amazon Kinesis
[Architecture diagram: Ingest → Process → React → Persist pipeline with latency markers 0 ms, 200 ms, and 1–2 s; components include Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis–enabled apps, AWS Lambda, Amazon Kinesis Firehose, Amazon S3, Amazon Redshift, and Amazon QuickSight]
6. Amazon Kinesis customer base diversity
● 1 billion events/wk from connected devices | IoT
● 17 PB of game data per season | Entertainment
● 80 billion ad impressions/day, 30 ms response time | Ad Tech
● 100 GB/day click streams from 250+ sites | Enterprise
● 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
● 10 million events/day | Retail
● Amazon Kinesis as databus; migrated from Kafka to Kinesis | Enterprise
● Funnel all production events through Amazon Kinesis | High Tech
7. Why are these customers choosing Amazon Kinesis?
● Lower costs
● Performant without heavy lifting
● Scales elastically
● Increased agility
● Secure and visible
● Plug and play
8. “I don't know how we could have made our clickstream data pipeline work without Amazon Kinesis. It would have involved many weeks of engineering. Kinesis Streams and Firehose make the entire process extremely simple and reliable.”
Peter Jaffe, Data Scientist, Hearst Corporation
9. Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real Time
27. Queries
● OLAP-style (Online Analytical Processing)
● Roll-up
○ e.g., all apps deployed to the same region roll up to that region
● Drill-down
○ e.g., which apps deployed to a region generate the most traffic?
● Slicing and dicing
○ e.g., which apps generate the most traffic in a region by day?
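The roll-up, drill-down, and slice-and-dice operations above all reduce to grouping enriched flow records by different sets of dimensions. A toy sketch (the record schema and values are illustrative, not Netflix's actual data model):

```python
from collections import defaultdict

# Toy enriched flow records; the (app, region, day, bytes) schema is
# illustrative only.
flows = [
    {"app": "foo", "region": "us-east-1", "day": "mon", "bytes": 100},
    {"app": "bar", "region": "us-east-1", "day": "mon", "bytes": 300},
    {"app": "foo", "region": "eu-west-1", "day": "tue", "bytes": 50},
]

def aggregate(records, *dims):
    """Sum bytes grouped by the given dimensions (OLAP-style aggregation)."""
    totals = defaultdict(int)
    for r in records:
        totals[tuple(r[d] for d in dims)] += r["bytes"]
    return dict(totals)

# Roll up: all apps in the same region roll up to that region.
by_region = aggregate(flows, "region")
# Drill down: which apps in a region generate the most traffic?
by_region_app = aggregate(flows, "region", "app")
# Slice and dice: traffic per app per region by day.
by_region_app_day = aggregate(flows, "region", "app", "day")
```

Each query is the same aggregation over a coarser or finer set of dimensions, which is why a multi-dimensional store with fast group-bys fits this workload.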
28. New source for network analytics
● Large dataset (billions of events per day)
● Multiple dimensions and metrics
● Ad-hoc OLAP queries
● Fast aggregations
● Real-time
29. Dredge
● Ingest: network data from the entire system
● Enrich: traffic logs with application metadata
● Aggregate: multi-dimensional metrics
34. Given a VpcFlowLogEvent
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, …}
Enriched with application metadata
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar}, …}
Aggregated and indexed
App foo sent 426718 bytes to app bar today
35. Given a VpcFlowLogEvent
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, …}
Enriched with application metadata
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar}, …}
Aggregated and indexed
App bar received 8278392 bytes from apps foo and baz in the last week
36. Given a VpcFlowLogEvent
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, …}
Enriched with application metadata
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar}, …}
Aggregated and indexed
App baz has outbound network dependencies on apps foo, bar, etc.
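The enrich-then-aggregate step shown in slides 34–36 can be sketched as a lookup join against an IP-to-application table, followed by a keyed sum. This is a minimal illustration, not the actual Dredge implementation; the metadata table, the `bytes` field, and all values are made up for the example:

```python
from collections import defaultdict

# Hypothetical IP -> application metadata table (illustrative values).
metadata = {
    "172.31.16.139": {"app": "foo"},
    "10.13.67.49": {"app": "bar"},
}

def enrich(event):
    """Attach src/dst application metadata to a VPC flow log event."""
    enriched = dict(event)
    enriched["srcMetadata"] = metadata.get(event["srcIP"], {})
    enriched["dstMetadata"] = metadata.get(event["dstIP"], {})
    return enriched

def aggregate(events):
    """Sum bytes per (source app, destination app) pair."""
    totals = defaultdict(int)
    for e in events:
        key = (e["srcMetadata"].get("app"), e["dstMetadata"].get("app"))
        totals[key] += e["bytes"]
    return dict(totals)

events = [enrich({"srcIP": "172.31.16.139", "dstIP": "10.13.67.49", "bytes": 426718})]
print(aggregate(events))  # {('foo', 'bar'): 426718}
```

Once flows are keyed by application pairs rather than IPs, queries like "app foo sent N bytes to app bar today" fall out of a simple group-by.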
48. Unbundle the database
● Separation of concerns for reading and writing
● Changelog stream is a first-class citizen
● Consume and join streams instead of querying the DB
● Maintain materialized views
● Pre-computed cache
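One way to read "unbundle the database": instead of querying a database on every lookup, a consumer replays a changelog stream into an in-memory materialized view that serves as a pre-computed cache. A minimal sketch under assumed event shapes (the key/value record format is not from the talk):

```python
class MaterializedView:
    """Pre-computed cache built by consuming a changelog stream."""

    def __init__(self):
        self.state = {}

    def apply(self, change):
        # Each changelog record upserts a key; a None value deletes it.
        if change["value"] is None:
            self.state.pop(change["key"], None)
        else:
            self.state[change["key"]] = change["value"]

# Hypothetical changelog of IP -> app assignments.
changelog = [
    {"key": "10.0.0.1", "value": {"app": "foo"}},
    {"key": "10.0.0.2", "value": {"app": "bar"}},
    {"key": "10.0.0.1", "value": None},  # IP released
]

view = MaterializedView()
for change in changelog:
    view.apply(change)

# Readers hit the pre-computed view instead of querying the DB.
print(view.state)  # {'10.0.0.2': {'app': 'bar'}}
```

Separating the write path (the stream) from the read path (the view) is what lets the enrichment job join streams at high throughput without a database in the hot path.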
52. Kinesis over Kafka
● Integration with AWS services
● Kinesis Client Library (KCL)
● Auto scaling for elastic throughput
● Total Cost of Ownership (TCO)
54. Kinesis Client Library
● Worker per EC2 instance
○ Multiple record processors per worker
○ Record processor per shard
● Load balancing between workers
● Checkpointing (with DynamoDB)
● Stream- and shard-level metrics
55. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per hour for an example account and region over 1 week]
56. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per minute for an example account and region over 3 hours]
57. TCO
● Very little operational overhead
○ Monitor stream metrics and the DynamoDB checkpoint table
○ Run and manage the auto-scaling utility
● No consultation needed from the internal Kafka team for:
○ Capacity planning
○ Monitoring, failover, and replication
58. Limitations
● Per-shard limits
○ Increase shard count or fan out to other streams
● No log compaction
○ Up to 7-day max retention
○ Manual snapshots add complexity
○ Not ideal for changelog joins
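The "fan out to other streams" workaround for per-shard limits can be sketched as deterministic routing by a stable hash of the partition key, so per-key ordering is preserved while total throughput scales with the number of streams. The stream names below are made up for illustration:

```python
import hashlib

# Hypothetical stream names; in practice these would be real Kinesis streams.
STREAMS = ["flowlogs-0", "flowlogs-1", "flowlogs-2"]

def pick_stream(partition_key: str) -> str:
    """Deterministically route a record to one of several streams.

    A stable hash keeps every record for a given key on the same stream,
    so ordering per key survives the fan-out.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return STREAMS[int(digest, 16) % len(STREAMS)]

# Same key always maps to the same stream.
assert pick_stream("10.13.67.49") == pick_stream("10.13.67.49")
```

Note that Python's built-in `hash()` would not work here, since it is randomized per process; a content hash like MD5 gives the same routing on every producer.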
59. Ingest: Lessons
● Kinesis enables us to focus
● Cross-account log sharing simplifies the system
● The KCL does the boring stuff
● Auto scaling improves efficiency
● Lower TCO
66. Address Metadata Changelog
● Hash table of sorted lists
● Key is IP; value is metadata sorted by timestamp
● Look up recent updates (within the capture window) or the last known entry
● Join with the flow log events stream