This document discusses Big Data technologies and Akamai's use of them. It describes Akamai's Big Data infrastructure of 150,000 collector nodes and 5,000 map/reduce nodes processing billions of jobs daily. It then covers the evolution of Big Data technologies and industries, common Big Data primitives like collection, parsing, partitioning, and analysis. The rest of the document discusses how Akamai uses these technologies for billing and reporting, system monitoring, media analytics, security products, and log archiving. It concludes with thoughts on building Big Data systems and Akamai's future plans to scale further with new R&D.
8. From the beginning…
• Akamai needed a billing system and scalable monitoring
• The Open Source community wanted a search engine
• Yahoo needed better product analytics for page views
• Google needed more scalable computation for ad
management
• Facebook needed real-time updates to social graph
• LinkedIn needed a real-time activity data pipeline
• Twitter needed hashtag and topic streams
• Amazon needed durable shopping carts
• Netflix needed a recommendation engine
9. Big Data timeline (1998–2014)
Akamai track: wide-area, real-time, in-memory system monitoring; generalized map/reduce on one machine; decentralized job scheduling across multiple machines; file system DB; geographical redundancy; real-time reporting; columnar DB; distributed file system DB; wide-area MapReduce; exabyte query.
Industry track: Nutch; Google FS and Google MapReduce; Neo4j; Amazon Dynamo; Yahoo spins off Hadoop; NoSQL; HBase; LinkedIn Kafka; Facebook Cassandra; Twitter Storm; Facebook Presto.
11. Big Data modes
• Batch
– Computation over a large static data set
– Results are complete
• Online
– Computation on data as it’s generated
– Localized results, must be aggregated downstream
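The two modes can be contrasted with a toy word-count example; the event data and node split here are purely illustrative:

```python
from collections import Counter

# Hypothetical event stream: (timestamp, url) hits.
events = [(1, "/a"), (2, "/b"), (3, "/a"), (4, "/a")]

# Batch: one computation over the complete, static data set -> complete result.
batch_counts = Counter(url for _, url in events)

# Online: each node counts only the records it sees as they are generated;
# these localized results must be aggregated downstream.
node1 = Counter(url for _, url in events[:2])
node2 = Counter(url for _, url in events[2:])
downstream = node1 + node2
```

Counter addition is the downstream aggregation step: merging the localized results recovers the batch answer.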
13. Collection
• What
– Logs
– Metadata
– System stats
– Application events
– Application stats
– Network data
• How
– Email
– SPDY
– HTTP POST
– SCP
– Scribe
– Avro
– Custom
14. Parsing
• Read lines or blocks and split into fields
• Transform, e.g. protobuf
• Map keys to values
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423
itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/
r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
1359486900 1423 a440.phobos.apple.com 1 3158
1359486900 1423 200 1 30128
1359486900 1423 1 209158
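A minimal sketch of this step over the raw line above; the field names (`cp_code` and the rest) are assumptions for illustration, not the actual log schema:

```python
# A parsing sketch; field names are assumptions, not the real schema.
raw = ("S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 "
       "itunesus011.download.akamai.com 200 - iPeV image/jpeg")

fields = raw.split()                      # read a line, split into fields
record = {                                # map positions to named values
    "timestamp": float(fields[1]),
    "client_ip": fields[5],
    "method": fields[6],
    "cp_code": fields[8],                 # assumed: customer/product code
    "host": fields[9],
    "status": int(fields[10]),
}

# Map keys to values: bucket the timestamp into 5-minute (300 s) windows
# and emit a keyed count, like the "1359486900 1423 ..." rows above.
bucket = int(record["timestamp"]) // 300 * 300   # → 1359486900
key_value = ((bucket, record["cp_code"]), 1)
```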
15. Partitioning
• Bucketing
– Reduce to a single record per bucket
– e.g. 5 minutes, /24, etc.
• Hashing
– Bucket blocks or records of data by a hash function
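Both techniques can be sketched in a few lines; the function names, window size, and partition count are hypothetical:

```python
import hashlib

# Bucketing: reduce each record to a bucket key, e.g. a 5-minute window
# plus a /24 network prefix.
def bucket_key(timestamp, ip, window=300):
    time_bucket = int(timestamp) // window * window
    prefix24 = ".".join(ip.split(".")[:3]) + ".0/24"
    return (time_bucket, prefix24)

# Hashing: assign a record to one of N partitions by a stable hash of its key
# (stable across processes, unlike Python's built-in hash()).
def partition(key, num_partitions=8):
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

key = bucket_key(1359487051.701, "61.252.169.21")  # → (1359486900, "61.252.169.0/24")
```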
17. Throttling
• Limit on cardinality per partition
– Requires central management
– Drop records over max
• Remove or trim large fields
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423
itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/
r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 -
iPeV image/jpeg - - 44 3031 - - - - - W - ~
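A throttling sketch along these lines, assuming a per-partition cardinality cap and using the slide's "~" placeholder for trimmed fields:

```python
# Throttling sketch: cap distinct keys per partition, drop records over the
# limit, and trim oversized fields. Limits here are illustrative.
def throttle(records, max_keys=2, max_field=20):
    seen = set()
    for key, fields in records:
        if key not in seen:
            if len(seen) >= max_keys:
                continue              # drop records over the cardinality limit
            seen.add(key)
        # replace oversized fields (e.g. long URL paths) with a placeholder
        yield key, ["~" if len(f) > max_field else f for f in fields]

records = [
    ("k1", ["GET", "us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg"]),
    ("k2", ["GET", "/short"]),
    ("k3", ["GET", "/short"]),     # third distinct key: dropped
    ("k1", ["POST", "/short"]),    # already-seen key: kept
]
kept = list(throttle(records))
```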
18. Aggregation
• Merge
– Merge-sort blocks in a partition
• Reduce
– Combine values for like keys
• Sum, Min, Max, Mask, etc.
• Shuffle
– Move the data to where it's needed, or closer to like data
Example:
Input:      1359486900 1423 1 209158
            1359486900 1423 1 209158
            1359529800 1423 1 209158
Aggregate:  1359486900 1423 2 418316
            1359529800 1423 1 209158
Shuffle:    {1423, 1359486900} → 2 418316
            {1423, 1359529800} → 1 209158
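The merge/reduce/shuffle steps over the slide's three input rows can be sketched as follows; the tuple layout (time bucket, key, count, bytes) is an assumption:

```python
import heapq
from itertools import groupby

# The slide's input rows, assumed layout: (time_bucket, key, count, bytes).
block_a = [(1359486900, 1423, 1, 209158), (1359529800, 1423, 1, 209158)]
block_b = [(1359486900, 1423, 1, 209158)]

# Merge: merge-sort the sorted blocks within a partition.
merged = list(heapq.merge(block_a, block_b))

# Reduce: combine values for like keys (sum the count and byte columns).
reduced = [
    (k[0], k[1], sum(r[2] for r in rows), sum(r[3] for r in rows))
    for k, rows in ((k, list(g)) for k, g in groupby(merged, key=lambda r: r[:2]))
]
# reduced == [(1359486900, 1423, 2, 418316), (1359529800, 1423, 1, 209158)]

# Shuffle: route each reduced record to the node that owns its key.
nodes = {}
for rec in reduced:
    nodes.setdefault(hash((rec[0], rec[1])) % 2, []).append(rec)
```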
19. Tracking
• Tracking
– Embed GUID in each data unit sent
– Publish GUIDs independent from data flow
– Completeness is measured as expected (published GUIDs) vs. actual (embedded GUIDs)
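The scheme can be sketched in a few lines; the unit structure and loss scenario are illustrative:

```python
import uuid

# Each data unit carries an embedded GUID; the producer publishes the GUID
# list on a separate channel, independent of the data flow.
units = [{"guid": str(uuid.uuid4()), "payload": p} for p in ("a", "b", "c")]
published = {u["guid"] for u in units}       # published out of band

# Downstream, suppose the last unit was lost in transit.
received = units[:2]
actual = {u["guid"] for u in received}       # GUIDs actually seen in the data

missing = published - actual                 # expected vs. actual
completeness = len(actual) / len(published)  # 2/3 here
```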
20. Data integrity
• Watermark
– Producer watermarks every n lines with a crypto key
– Receiver checks watermarks
• Checksum
– Block checksums
– Line CRC
– Etc.
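Both mechanisms can be sketched together; the shared key, line format, and block size n are all illustrative choices, not the production scheme:

```python
import hashlib
import hmac
import zlib

KEY = b"shared-secret"   # assumed: key shared by producer and receiver

# Producer: append a CRC to each line and emit a keyed-HMAC watermark line
# after every n data lines.
def watermark(lines, n=2):
    out, block = [], []
    for line in lines:
        out.append(f"{line} {zlib.crc32(line.encode()):08x}")   # line CRC
        block.append(line)
        if len(block) == n:
            mac = hmac.new(KEY, "\n".join(block).encode(), hashlib.sha256)
            out.append("# WM " + mac.hexdigest())
            block = []
    return out

# Receiver: recompute each block's HMAC and compare against the watermark.
def verify(out, n=2):
    block = []
    for line in out:
        if line.startswith("# WM "):
            mac = hmac.new(KEY, "\n".join(block).encode(), hashlib.sha256)
            if not hmac.compare_digest(line[5:], mac.hexdigest()):
                return False
            block = []
        else:
            block.append(line.rsplit(" ", 1)[0])   # strip the CRC
    return True
```

Any tampering with a watermarked line breaks the HMAC comparison at the next watermark.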
22. Big Data at Akamai
• Billing and Reporting
• System monitoring
• Media Analytics
• Security
• Log archive
23. Billing and reporting
Flow: Akamai Edge networks and products emit logs into a queue (Q); parsing fans each log out into pipelines; aggregate, shuffle, and split stages feed the billing DB and reporting for customers and internal reporting apps.
Parsing:
• splits lines into fields
• maps keys to values per pipeline
• each log generates many pipelines
• each pipeline represents a streaming table
Volume: 3 PB/day, doubling every year.
Evolution:
• Logs were emailed (up to 1 PB/day)
• Now delivered via SPDY (3 PB/day)
24. System monitoring
Flow: Akamai networks and products feed a parser; producers rewrite data locally; aggregators (Agg) roll up to a top-level aggregator (TLA), which pulls data from aggregators, which in turn pull from producers, at the time of the request; SQL clients query the TLA, driving alerting and trending. 50M jobs/day.
Evolution:
• Today: single-machine memory for table joins
• Future: distributed memory for table joins
25. Media analytics
Flow: Akamai products emit events into pipelines; a front end loads a column store; indexes feed reporting, exposed to customers via API/UI.
• Indexes are recreated for each update
• Supports insert and update
• Reads are flexible and fast
Evolution:
• Indexes now use fingerprints to lower cost
• HyperLogLog for uniqueness counting
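HyperLogLog estimates unique counts with a fixed, small amount of memory instead of storing every key. A minimal sketch (m = 256 registers, no small- or large-range bias corrections, so only suitable for illustration):

```python
import hashlib

P = 8                      # register-index bits; m = 2^8 = 256 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard HLL constant for large m

registers = [0] * M

def add(item):
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
    idx = h >> (64 - P)                       # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 56 bits
    rank = (64 - P) - rest.bit_length() + 1   # position of leftmost 1-bit
    registers[idx] = max(registers[idx], rank)

def estimate():
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(10000):
    add(f"user-{i}")
# estimate() should land within a few percent of 10000
```

With 256 registers the standard error is about 1.04/√256 ≈ 6.5%, for roughly 256 bytes of state per counted dimension.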
26. Security products
Flow: Akamai Edge networks and the Akamai web firewall emit events (20 TB/day) through a front end into pipelines; data lands in HDFS (Cloudera) and is processed with Map/Reduce, HBase, and Hive; Graphite feeds the Operations Center. Combined with external data feeds, this drives reputation scoring, threat analysis and intelligence reports, risk-based authentication, and payment fraud detection.
Evolution:
• Replacing HBase with a custom aggregator
• Replacing Hive with a custom SQL processor
27. Log archive
Flow: logs pass through a queue (Q) and parsing into the archive: 90 data centers distributed over the wide area, holding 180 PB and 450 trillion records, doubling every year (projected 1.2 EB in 3 years). A log cache holds 10% of the data; an archive index (10 TB) and a client-IP sketch live in HDFS and are queried with Spark and Spark SQL. Client requests hit the archive front end, which gets the index and/or client-IP sketch, checks the cache first, then the archive.
Evolution: the index was a flat file; now HDFS/Spark.
29. Building a system
If you need fast access to massive amounts of data where queries are constrained to an index (read optimized):
• Start with HDFS or Cassandra
• Add the HBase column store
• Add Hive for SQL-like access
• Add Pig for scripting
Stack: Hadoop/YARN over HDFS; HBase (Get, Put); Hive (Select *); Pig ({ … })
30. Building a system
If you need to search logs:
• Start with HDFS
• Add Flume for log data integration
• Add Avro for data serialization
• Add Solr for search (e.g. ip = 1.1.1.1)
Stack: Hadoop/YARN over HDFS; Flume agents with Avro sinks feeding Flume collectors with Avro sources; Solr for search.
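The Flume leg of this stack is wired up in a properties file. A hedged sketch, with the agent/collector names, host, and paths invented for illustration:

```properties
# Agent "a1": tail a local log and forward over Avro (names/paths assumed).
a1.sources = r1
a1.channels = ch1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = ch1
a1.channels.ch1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = ch1

# Collector "c1": receive Avro and land data in HDFS for Solr indexing.
c1.sources = r1
c1.channels = ch1
c1.sinks = k1
c1.sources.r1.type = avro
c1.sources.r1.bind = 0.0.0.0
c1.sources.r1.port = 4141
c1.sources.r1.channels = ch1
c1.channels.ch1.type = memory
c1.sinks.k1.type = hdfs
c1.sinks.k1.hdfs.path = hdfs://namenode/flume/logs
c1.sinks.k1.channel = ch1
```

The agent's Avro sink and the collector's Avro source form the serialized hop between the two tiers shown in the diagram.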
31. Building a system
If you need flexible and shared access to unlimited amounts of data:
• Start with HDFS or Cassandra
• Add Hadoop for Map/Reduce, or
• Add Hive for SQL-like access, or
• Add Pig for scripting
• Add Mesos for resource sharing
• Add Ambari for cluster management and provisioning
• Add map/reduce programs for business logic
Stack: Hadoop/YARN over HDFS; Pig ({ … }); Hive (Select *); Ambari; Mesos; Map/Reduce programs in Java ({ … })
32. Building a system
If you need fast, flexible access to in-memory data:
• Start with HDFS
• Add Spark
• Add Spark SQL for SQL-like access, or
• Create Spark programs for other business logic
Stack: Hadoop/YARN over HDFS; Spark; Spark SQL (Select * from …); Spark programs in Java ({ … })
33. Building a system
If you need real-time stream event processing:
• Start with HDFS
• Add Kafka for messaging and pub/sub
• Add Storm for event processing
• Develop Java bolts for processing logic
Stack: Hadoop/YARN over HDFS; Kafka; Storm bolts ({ … })
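Storm processing logic lives in Java bolts. This Python-only sketch mimics the bolt contract (receive one tuple, emit zero or more tuples downstream) for a hypothetical word-count topology, with no Kafka or Storm required:

```python
# Illustrative only: real bolts are Java classes wired into a Storm topology
# behind a Kafka spout. Names and the tuple format here are invented.

def split_bolt(tup):
    """Split a raw line from the spout into one tuple per word."""
    for word in tup["line"].split():
        yield {"word": word}

def count_bolt(tup, counts={}):
    """Keep a running count per word (the {} default stands in for bolt state)."""
    counts[tup["word"]] = counts.get(tup["word"], 0) + 1
    return counts[tup["word"]]

stream = [{"line": "a b a"}]                       # tuples from the spout
emitted = [t for tup in stream for t in split_bolt(tup)]
totals = [count_bolt(t) for t in emitted]          # → [1, 1, 2]
```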
34. Future at Akamai
• 100x
– Everything bigger and faster
– Requires new R&D across many Big Data components
• Scaling the Big Data ecosystem across the wide area
• Internet Security
• Positive reputation scoring
• Automatic DDoS mitigation
• Low latency data collection
– 2^53 unique keys, <1 minute latency
• Support DevOps
– Near real-time monitoring and control