Real… Big… Data… 
and its constant evolution 
Scott MacGregor
Who is this guy?
Akamai Big Data Infrastructure 
150,000 collector nodes 
5000 map/reduce nodes 
Billions of jobs per day
What is Big Data?
The V’s
Data that is Big 
From Hortonworks
What’s it really about?
From the beginning… 
• Akamai needed a billing system and scalable monitoring 
• The Open Source community wanted a search engine 
• Yahoo needed better product analytics for page views 
• Google needed more scalable computation for ad 
management 
• Facebook needed real-time updates to social graph 
• LinkedIn needed a real-time activity data pipeline 
• Twitter needed hashtag and topic streams 
• Amazon needed durable shopping carts 
• Netflix needed a recommendation engine
Big Data timeline 
Akamai 
Wide area, real-time, in-memory system monitoring 
1998 2001 2003 2005 2006 2007 2008 2010 2011 2012 2013 2014 
Industry 
Generalized map/reduce on 1 machine 
Decentralized job scheduling 
Multiple machines File System DB 
Nutch 
Google FS Google MapReduce 
Neo4J 
Amazon Dynamo 
Yahoo spins off Hadoop 
NoSQL 
Geographical redundancy 
Real-time reporting 
Columnar DB 
Distributed File System DB 
Wide-area MapReduce 
ExaByte Query 
HBASE 
LinkedIn Kafka 
Facebook Cassandra 
Twitter Storm 
Facebook Presto
How it works…
Big Data modes 
• Batch 
– Computation over a large static data set 
– Results are complete 
• Online 
– Computation on data as it’s generated 
– Localized results, must be aggregated 
downstream
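The difference between the two modes can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: `batch_count` and `online_count` are hypothetical names, and each inner list stands in for the events seen by one node.

```python
from collections import Counter


def batch_count(events):
    """Batch mode: one pass over a complete, static data set."""
    return Counter(events)


def online_count(streams):
    """Online mode: each node counts only what it sees (localized results),
    and the partial counts are merged downstream."""
    partials = [Counter(stream) for stream in streams]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts
    return partials, merged


# Hypothetical events observed on two nodes.
node_a = ["GET", "GET", "POST"]
node_b = ["GET", "HEAD"]
partials, merged = online_count([node_a, node_b])
complete = batch_count(node_a + node_b)
```

No single partial result is complete on its own; only the merged counter matches what the batch computation produces.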
Big Data primitives 
• Collection 
• Parsing 
• Partitioning 
• Filtering 
• Throttling 
• Aggregation 
• Tracking 
• Validation 
• Analysis
Collection 
• What 
– Logs 
– Metadata 
– System stats 
– Application events 
– Application stats 
– Network data 
• How 
– Email 
– SPDY 
– HTTP POST 
– SCP 
– Scribe 
– Avro 
– Custom
Parsing 
• Read lines or blocks and split into fields 
• Transform, e.g. protobuf 
• Map keys to values 
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 
itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/ 
r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com 
1359486900 1423 a440.phobos.apple.com 1 3158 
1359486900 1423 200 1 30128 
1359486900 1423 1 209158
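The parsing step above can be sketched as follows. The field positions and names in `FIELD_NAMES` are assumptions made for illustration; the slide does not document the log format, and the sample is truncated to the first eleven fields of the line shown.

```python
# Assumed mapping from field position to a name; not taken from the talk.
FIELD_NAMES = {1: "timestamp", 5: "client_ip", 6: "method", 9: "host", 10: "status"}


def parse_line(line):
    """Split a whitespace-delimited log line and map selected fields to keys."""
    fields = line.split()
    record = {
        name: fields[index]
        for index, name in FIELD_NAMES.items()
        if index < len(fields)
    }
    # Convert the fields that are numeric under the assumed layout.
    if "timestamp" in record:
        record["timestamp"] = float(record["timestamp"])
    if "status" in record:
        record["status"] = int(record["status"])
    return record


sample = (
    "S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 "
    "itunesus011.download.akamai.com 200"
)
record = parse_line(sample)
```

Each derived output line on the slide would then be built by selecting a subset of these keys for one pipeline.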
Partitioning 
• Bucketing 
– Reduce to a single record per bucket 
– e.g. 5 minutes, /24, etc. 
• Hashing 
– Bucket blocks or records of data by a hash 
function
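Both partitioning styles can be sketched briefly. The function names are hypothetical, and CRC32 is only a stand-in for whatever hash function a real pipeline uses.

```python
import ipaddress
import zlib


def time_bucket(timestamp, width_seconds=300):
    """Round a Unix timestamp down to the start of its bucket (5 minutes by default)."""
    return int(timestamp) // width_seconds * width_seconds


def subnet_bucket(ip, prefix=24):
    """Map an IPv4 address to its enclosing network, e.g. a /24."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def hash_partition(key, num_partitions):
    """Assign a key to a partition with a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


bucket = time_bucket(1359487051.701)    # 1359486900
network = subnet_bucket("61.252.169.21")  # "61.252.169.0/24"
partition = hash_partition("a440.phobos.apple.com", 16)
```

The 5-minute bucket for the sample log line's timestamp is 1359486900, which is the value that appears in the derived records on the Parsing slide.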
Filtering 
• Statistical Methods 
– Top-k (HierarchicalCountSketch) 
– Set membership (Bloom filters) 
– Cardinality counting (HyperLogLog) 
– Frequency estimates (CountSketch) 
– Change detection (Deltoid) 
• Sampling 
– Random 
– Reservoir
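Two of the techniques listed above are small enough to sketch: reservoir sampling and a Bloom filter for set membership. These are textbook versions written for illustration, not the implementations used at Akamai; the sizes (`num_bits`, `num_hashes`) and the use of SHA-256 to derive bit positions are assumptions.

```python
import hashlib
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


sample = reservoir_sample(range(1000), 10, random.Random(0))
seen = BloomFilter()
seen.add("61.252.169.21")
```

Both structures use memory that is fixed up front rather than growing with the stream, which is what makes them usable as filters in a pipeline.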
Throttling 
• Limit on cardinality per partition 
– Requires central management 
– Drop records over max 
• Remove or trim large fields 
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 
itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/ 
r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com 
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 - 
iPeV image/jpeg - - 44 3031 - - - - - W - ~
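A minimal sketch of both throttling ideas follows. The slide notes that a cardinality limit requires central management; this version keeps its state in a single process purely for illustration, and the class name, function name and `max_len` threshold are assumptions.

```python
from collections import defaultdict


class CardinalityThrottle:
    """Admit at most max_keys distinct keys per partition; drop the rest."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.seen = defaultdict(set)

    def admit(self, partition, key):
        keys = self.seen[partition]
        if key in keys:
            return True   # already-tracked key, always admitted
        if len(keys) >= self.max_keys:
            return False  # over the limit: drop the record
        keys.add(key)
        return True


def trim_fields(fields, max_len=16, placeholder="~"):
    """Replace fields longer than max_len with a placeholder, as in the example above."""
    return [field if len(field) <= max_len else placeholder for field in fields]


throttle = CardinalityThrottle(max_keys=2)
decisions = [throttle.admit("p0", key) for key in ["a", "b", "c", "a"]]
trimmed = trim_fields(["GET", "itunesus011.download.akamai.com", "200"])
```

Trimming keeps the record's shape intact so that downstream parsers still see the same number of fields.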
Aggregation 
• Merge 
– Merge-sort blocks in a partition 
• Reduce 
– Combine values for like keys 
• Sum, Min, Max, Mask, etc. 
• Shuffle 
– Move the data to where it’s needed 
or closer to like data 
1359486900 1423 1 209158 
1359486900 1423 1 209158 
1359529800 1423 1 209158 
Aggregate 
1359486900 1423 2 418316 
1359529800 1423 1 209158 
{1423, 1359486900} 
2 418316 
{1423, 1359529800} 
1 209158 
Shuffle
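The reduce step in the example above can be sketched as a sum over like keys. Reading the last two columns as a record count and a byte total is an assumption, as is the function name.

```python
from collections import defaultdict


def aggregate(records):
    """Sum the value columns of records that share a (id, time bucket) key."""
    totals = defaultdict(lambda: [0, 0])
    for bucket, ident, count, size in records:
        acc = totals[(ident, bucket)]
        acc[0] += count
        acc[1] += size
    return {key: tuple(values) for key, values in totals.items()}


# The three input records shown on the slide.
records = [
    (1359486900, 1423, 1, 209158),
    (1359486900, 1423, 1, 209158),
    (1359529800, 1423, 1, 209158),
]
result = aggregate(records)
```

Keying the result by `(id, bucket)` mirrors the shuffle output on the slide, where each key can then be routed to the node that holds like data.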
Tracking 
• Tracking 
– Embed GUID in each data unit sent 
– Publish GUIDs independent from data flow 
– Completeness is expected (published GUIDs) 
vs. actual (embedded GUID)
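The completeness check described above amounts to a set comparison. This is a sketch under the assumption that GUIDs are plain strings and that both lists fit in memory; `completeness` is a hypothetical helper name.

```python
import uuid


def completeness(published, received):
    """Compare expected GUIDs (published out of band) with those seen in the data."""
    expected = set(published)
    actual = set(received)
    missing = expected - actual
    unexpected = actual - expected
    ratio = 1.0 if not expected else len(expected & actual) / len(expected)
    return ratio, missing, unexpected


# Four data units are announced, but only three arrive.
guids = [str(uuid.uuid4()) for _ in range(4)]
ratio, missing, unexpected = completeness(guids, guids[:3])
```

Publishing the GUIDs independently of the data flow is what lets the receiver notice a unit that never arrived at all.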
Data integrity 
• Watermark 
– Producer watermarks every n-lines with a 
crypto key 
– Receiver checks watermarks 
• Checksum 
– Block checksums 
– Line CRC 
– Etc.
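One way to sketch the watermark and line-CRC ideas is with an HMAC emitted every n lines and a CRC32 per line. The `#WM ` marker format, the choice of HMAC-SHA256 and the function names are all assumptions; the talk does not say how its watermarks are constructed.

```python
import hashlib
import hmac
import zlib

MARK = "#WM "  # hypothetical watermark prefix; data lines must not start with it


def add_watermarks(lines, key, n):
    """Producer side: after every n lines, emit an HMAC over those n lines."""
    out = []
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for i, line in enumerate(lines, start=1):
        out.append(line)
        mac.update(line.encode("utf-8") + b"\n")
        if i % n == 0:
            out.append(MARK + mac.hexdigest())
            mac = hmac.new(key, digestmod=hashlib.sha256)
    return out


def check_watermarks(lines, key):
    """Receiver side: recompute each HMAC and compare it with the embedded one."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for line in lines:
        if line.startswith(MARK):
            if not hmac.compare_digest(line[len(MARK):], mac.hexdigest()):
                return False
            mac = hmac.new(key, digestmod=hashlib.sha256)
        else:
            mac.update(line.encode("utf-8") + b"\n")
    return True


def line_crc(line):
    """Per-line CRC32, a cheap check against accidental corruption."""
    return zlib.crc32(line.encode("utf-8")) & 0xFFFFFFFF


marked = add_watermarks(["a", "b", "c", "d"], b"secret", 2)
tampered = ["x"] + marked[1:]
```

This sketch does not cover lines after the last watermark or a stream whose watermarks were stripped entirely; a real receiver would also need to know how many watermarks to expect.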
Analysis 
• Online 
– Precomputed reports 
• Batch 
– Spark Programs 
– Map/Reduce 
– Hive: HQL 
– SQL
Big Data at Akamai 
• Billing and Reporting 
• System monitoring 
• Media Analytics 
• Security 
• Log archive
Billing and reporting 
Logs 
Akamai Edge 
Networks and 
Products 
Q Parse 
Pipelines 
Aggregate 
Shuffle Split 
Billing DB 
Reporting 
Reporting 
Parsing Reporting 
• splits lines into fields 
• maps keys to values per pipeline 
• each log generates many pipelines 
• each pipeline represents a streaming table 
Evolution 
• Logs were emailed (up to 1PB/day) 
• Now delivered via SPDY (3PB/day) 
Customers 
3 PB/day 
Doubles every year 
Reporting 
Reporting 
Internal Apps
System monitoring 
Akamai 
Networks and 
Products 
Client SQL 
Parser TLA Agg 
Agg 
Agg 
Alert 
Trend 
50M jobs/day 
TLA: top level aggregator 
pulls data from aggregators 
which pull data from producers 
at the time of the request 
Producers rewrite data locally 
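The pull model described above can be sketched as three small classes. All names are hypothetical, rows are plain dictionaries, and a Python predicate stands in for the client's SQL query.

```python
class Producer:
    """Holds its current rows locally and hands them out only when asked."""

    def __init__(self, rows):
        self._rows = rows

    def pull(self):
        return list(self._rows)


class Aggregator:
    """Pulls from its producers at the time of the request."""

    def __init__(self, producers):
        self.producers = producers

    def pull(self):
        rows = []
        for producer in self.producers:
            rows.extend(producer.pull())
        return rows


class TopLevelAggregator:
    """Pulls from the aggregators and applies the client's filter."""

    def __init__(self, aggregators):
        self.aggregators = aggregators

    def query(self, predicate):
        return [
            row
            for aggregator in self.aggregators
            for row in aggregator.pull()
            if predicate(row)
        ]


# Two regions, each with its own aggregator and producers.
east = Aggregator([Producer([{"host": "e1", "cpu": 95}]), Producer([{"host": "e2", "cpu": 40}])])
west = Aggregator([Producer([{"host": "w1", "cpu": 99}])])
tla = TopLevelAggregator([east, west])
hot = tla.query(lambda row: row["cpu"] > 90)
```

Because nothing is pushed ahead of time, the answer reflects the producers' state at the moment of the query.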
Evolution 
Single machine memory for table joins 
Future: distributed memory for table joins
Media analytics 
Pipelines 
Akamai 
Products 
Front 
end 
Column Store 
Index Reporting 
Reporting 
Reporting 
API / UI 
Customers 
Events 
Indexes are recreated for each update 
Supports insert and update 
Reads are flexible and fast 
Evolution: 
Index now uses fingerprints to lower cost 
HyperLogLog for uniqueness counting
Security products 
Akamai Edge 
Networks Front 
Pipelines 
end 
HDFS 
20 TB/day 
Events 
Akamai 
Web Firewall 
Map/Reduce 
HBASE 
Hive 
Cloudera Graphite 
Operations Center 
Reputation 
Scoring 
Threat Analysis 
Intelligence 
Reports 
Risk Based 
Authentication 
Payment Fraud 
External Data 
External Data 
External Data 
Evolution: 
Replacing HBASE with custom aggregator 
Replacing Hive with custom SQL processor
Log archive 
Logs 
Q Archive 
Parse 
180 PB, 450 Trillion records 
Doubles every year 
Log cache 10% 
Client IP Sketch 
Archive Index (10TB) Pipelines 
HDFS 
Spark 
Spark SQL 
Client 
Request 
Archive 
Front End 
Cache first 
Then archive 
Get Index and/or CIP 
Archive is 90 data centers distributed over wide area; projected 1.2 EB in 3 years 
Evolution: Was flat file for index, now HDFS/Spark
Hadoop / Yarn 
HDFS 
The Ecosystem 
Script 
Pig 
SQL 
Hive 
NoSQL 
HBASE 
Stream 
Kafka 
Storm 
Search 
Solr 
In-Mem 
Spark 
Integration 
Flume 
Avro 
Operations 
Ambari 
Zookeeper 
Oozie 
Monitoring 
Graphite 
Sharing 
Mesos
Hadoop / Yarn 
HDFS 
Building a system 
If you need fast access to massive amounts of data where queries 
are constrained to an index (read optimized): 
• Start with HDFS or Cassandra 
• Add HBASE column store 
• Add Hive for SQL-like access 
• Add Pig for scripting 
HBASE 
Get, Put 
Hive 
Select * 
Pig 
{ … }
Building a system 
If you need to search logs: 
• Start with HDFS 
• Add Flume for log data integration 
• Add Avro for data serialization 
• Add Solr for search 
Hadoop / Yarn 
HDFS 
Solr 
Search, e.g. 
Ip = 1.1.1.1 
Flume 
Agent Avro Sink 
Flume 
Collector Avro Source
Hadoop / Yarn 
HDFS 
Building a system 
If you need flexible and shared access to unlimited amounts of 
data: 
• Start with HDFS or Cassandra 
• Add Hadoop for Map/Reduce or 
• Add Hive for SQL-like access or 
• Add Pig for scripting 
• Add Mesos for resource sharing 
• Add Ambari for cluster management and provisioning 
• Add map/reduce programs for business logic 
Pig 
{…} 
Hive 
Flume Select * Ambari 
Mesos 
Map/Reduce 
Java { … }
Building a system 
If you need fast, flexible access to in-memory data: 
• Start with HDFS 
• Add Spark 
• Add Spark SQL for SQL-like access or 
• Create Spark programs for other business logic 
SparkSQL 
Select * from 
Spark 
Hadoop / Yarn 
HDFS 
Spark Progs 
Java { … }
Building a system 
If you need real-time stream event processing: 
• Start with HDFS 
• Add Kafka for messaging and pub/sub 
• Add Storm for event processing 
• Develop Java Bolts for processing logic 
Kafka 
Storm 
Bolts 
{ … } 
Hadoop / Yarn 
HDFS
Future at Akamai 
• 100x 
– Everything bigger and faster 
– Requires new R&D across many Big Data 
components 
• Scaling the Big Data ecosystem across the wide area 
• Internet Security 
• Positive reputation scoring 
• Automatic DDoS mitigation 
• Low latency data collection 
– 2^53 unique keys, <1 minute latency 
• Support DevOps 
– Near real-time monitoring and control
Thank You

JDD2014: Real Big Data - Scott MacGregor

  • 1. Real… Big… Data… and its constant evolution Scott MacGregor
  • 2. Who is this guy?
  • 3. Akamai Big Data Infrastructure 150,000 collector nodes 5000 map/reduce nodes Billions of jobs per day
  • 4. What is Big Data?
  • 6. Data that is Big From Hortonworks
  • 8. From the beginning… • Akamai needed a billing system and scalable monitoring • The Open Source community wanted a search engine • Yahoo needed better product analytics for page views • Google needed more scalable computation for ad management • Facebook needed real-time updates to social graph • LinkedIn needed a real-time activity data pipeline • Twitter needed hashtag and topic streams • Amazon needed durable shopping carts • Netflix needed a recommendation engine
  • 9. Big Data timeline (figure: Akamai and Industry tracks on a time axis marked 1998, 2001, 2003, 2005, 2006, 2007, 2008, 2010, 2011, 2012, 2013, 2014). Milestones shown: Wide area, real-time, in-memory system monitoring; Generalized map/reduce on 1 machine; Decentralized job scheduling; Multiple machines; File System DB; Nutch; Google FS; Google MapReduce; Neo4J; Amazon Dynamo; Yahoo spins off Hadoop; NoSql; Geographical redundancy; Real-time reporting; Columnar DB; Distributed File System DB; Wide-area MapReduce; ExaByte Query; HBASE; LinkedIn Kafka; Facebook Cassandra; Twitter Storm; Facebook Presto
  • 11. Big Data modes • Batch – Computation over a large static data set – Results are complete • Online – Computation on data as it’s generated – Localized results, must be aggregated downstream
  • 12. Big Data primitives • Collection • Parsing • Partitioning • Filtering • Throttling • Aggregation • Tracking • Validation • Analysis
  • 13. Collection • What – Logs – Metadata – System stats – Application events – Application stats – Network data • How – Email – SPDY – HTTP POST – SCP – Scribe – Avro – Custom
  • 14. Parsing • Read lines or blocks and split into fields • Transform, e.g. protobuf • Map keys to values S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/ r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com 1359486900 1423 a440.phobos.apple.com 1 3158 1359486900 1423 200 1 30128 1359486900 1423 1 209158
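The "split into fields" and "map keys to values" steps on the Parsing slide can be sketched as below. This is a minimal illustration, not Akamai's parser: the slide does not name the log fields, so the schema (`timestamp`, `client_ip`, `method`) and the helper `parse_line` are hypothetical, chosen only because those positions are recognizable in the sample line.

```python
def parse_line(line, schema):
    """Split a whitespace-delimited log line and map chosen positions to names.

    `schema` maps a field name to its 0-based position in the line.
    Positions beyond the end of the line are skipped.
    """
    fields = line.split()
    return {name: fields[idx] for name, idx in schema.items() if idx < len(fields)}


# Hypothetical schema: the field names are assumptions, not from the slides.
SCHEMA = {"timestamp": 1, "client_ip": 5, "method": 6}

sample = "S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423"
record = parse_line(sample, SCHEMA)
```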
  • 15. Partitioning • Bucketing – Reduce to a single record per bucket – e.g. 5 minutes, /24, etc. • Hashing – Bucket blocks or records of data by a hash function
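As a rough sketch of the two partitioning styles on this slide, the functions below bucket a timestamp into 5-minute windows, bucket an IPv4 address into its /24 network, and assign a key to a partition by hash. The function names and the choice of SHA-256 are assumptions made for illustration; the slides do not say which hash function is used.

```python
import hashlib
import ipaddress

BUCKET_SECONDS = 300  # 5-minute buckets, as in the slide's example


def time_bucket(epoch_seconds):
    """Round a Unix timestamp down to the start of its 5-minute bucket."""
    return int(epoch_seconds) // BUCKET_SECONDS * BUCKET_SECONDS


def ip_bucket(ip):
    """Map an IPv4 address to its enclosing /24 network."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def hash_partition(key, num_partitions):
    """Pick a partition from a stable hash of the key (SHA-256 is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Applied to the sample log line from the Parsing slide, `time_bucket(1359487051.701)` gives `1359486900`, which matches the bucket timestamp that appears in the aggregated records on the later slides.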
  • 16. Filtering • Statistical Methods – Top-k (HierarchicalCountSketch) – Set membership (Bloom filters) – Cardinality counting (HyperLogLog) – Frequency estimates (CountSketch) – Change detection (Deltoid) • Sampling – Random – Reservoir
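Of the techniques listed on the Filtering slide, reservoir sampling is the simplest to show in a few lines. The sketch below uses the classic single-pass scheme (often called Algorithm R), which keeps a uniform random sample of `k` items from a stream of unknown length. The function name and signature are illustrative, and the slides do not say which variant Akamai uses.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Uses one pass and O(k) memory: the first k items fill the reservoir,
    then item i (0-based) replaces a random slot with probability k / (i + 1).
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```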
  • 17. Throttling • Limit on cardinality per partition – Requires central management – Drop records over max • Remove or trim large fields S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/ r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 - iPeV image/jpeg - - 44 3031 - - - - - W - ~
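The two throttling ideas on this slide, capping the number of distinct keys per partition and replacing oversized fields with a `~` placeholder, might look like the sketch below. It runs locally on one partition and so ignores the "central management" the slide says is required; the names `throttle` and `trim_fields` and the length threshold are assumptions.

```python
def throttle(records, key_of, max_keys):
    """Yield records until max_keys distinct keys have been seen.

    Records for keys already admitted keep flowing; records that would
    introduce a new key beyond the limit are dropped.
    """
    seen = set()
    for rec in records:
        key = key_of(rec)
        if key in seen or len(seen) < max_keys:
            seen.add(key)
            yield rec


def trim_fields(fields, max_len, placeholder="~"):
    """Replace any field longer than max_len with a short placeholder."""
    return [f if len(f) <= max_len else placeholder for f in fields]
```

For example, `trim_fields(["GET", "itunesus011.download.akamai.com", "200"], 16)` returns `["GET", "~", "200"]`, similar in spirit to the trimmed log line on the slide.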
  • 18. Aggregation • Merge – Merge-sort blocks in a partition • Reduce – Combine values for like keys (Sum, Min, Max, Mask, etc.) • Shuffle – Move the data to where it's needed or closer to like data. Example: input records 1359486900 1423 1 209158; 1359486900 1423 1 209158; 1359529800 1423 1 209158. After Aggregate: 1359486900 1423 2 418316; 1359529800 1423 1 209158. After Shuffle: {1423, 1359486900} 2 418316; {1423, 1359529800} 1 209158
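The reduce step in the slide's example (combine values for like keys by summing) can be reproduced with a small sketch. The slide does not label the four columns, so treating them as (time bucket, id, count, bytes) is an assumption, as is the function name `aggregate`.

```python
from collections import defaultdict


def aggregate(records):
    """Sum the two value columns of records that share a (bucket, ident) key.

    Each record is (bucket, ident, count, nbytes); the column meanings
    are assumed, since the slide shows only raw numbers.
    """
    totals = defaultdict(lambda: [0, 0])
    for bucket, ident, count, nbytes in records:
        totals[(bucket, ident)][0] += count
        totals[(bucket, ident)][1] += nbytes
    return {key: tuple(values) for key, values in totals.items()}


# The three input records shown on the Aggregation slide.
records = [
    (1359486900, 1423, 1, 209158),
    (1359486900, 1423, 1, 209158),
    (1359529800, 1423, 1, 209158),
]
result = aggregate(records)
```

The two records sharing the key (1359486900, 1423) collapse into one with values (2, 418316), matching the aggregated output on the slide.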
  • 19. Tracking • Tracking – Embed GUID in each data unit sent – Publish GUIDs independent from data flow – Completeness is expected (published GUIDs) vs. actual (embedded GUID)
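The completeness check described here, expected (published GUIDs) versus actual (GUIDs embedded in received data), reduces to a set comparison. A minimal sketch, with a hypothetical function name and return shape:

```python
def completeness(published_guids, received_guids):
    """Compare published (expected) GUIDs with GUIDs seen in received data.

    Returns (ratio, missing): the fraction of expected GUIDs that arrived,
    and the set of expected GUIDs that did not. Received GUIDs that were
    never published are ignored.
    """
    expected = set(published_guids)
    actual = set(received_guids) & expected
    missing = expected - actual
    ratio = len(actual) / len(expected) if expected else 1.0
    return ratio, missing
```

For instance, if four GUIDs were published and only two of them arrived, the function returns a ratio of 0.5 together with the two missing GUIDs.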
  • 20. Data integrity • Watermark – Producer watermarks every n-lines with a crypto key – Receiver checks watermarks • Checksum – Block checksums – Line CRC – Etc.
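Both integrity mechanisms on this slide can be sketched with the Python standard library. The line CRC uses `zlib.crc32`; the watermark is modeled as an HMAC-SHA256 over a block of n lines, which is an assumption, since the slide says only that a crypto key is used. The function names and the tab-separated CRC suffix are likewise illustrative.

```python
import hashlib
import hmac
import zlib


def add_line_crc(line):
    """Append a CRC-32 of the line as a tab-separated hex suffix."""
    return f"{line}\t{zlib.crc32(line.encode('utf-8')):08x}"


def check_line_crc(stamped):
    """Recompute the CRC-32 of a stamped line and compare it with the suffix."""
    line, _, crc_hex = stamped.rpartition("\t")
    return f"{zlib.crc32(line.encode('utf-8')):08x}" == crc_hex


def watermark(lines, key):
    """Producer side: keyed digest (HMAC-SHA256, assumed) over a block of lines."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for line in lines:
        mac.update(line.encode("utf-8") + b"\n")
    return mac.hexdigest()


def check_watermark(lines, key, expected):
    """Receiver side: recompute the watermark and compare in constant time."""
    return hmac.compare_digest(watermark(lines, key), expected)
```

A CRC catches accidental corruption of a line, while the keyed watermark also detects deliberate changes by anyone who does not hold the key.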
  • 21. Analysis • Online – Precomputed reports • Batch – Spark Programs – Map/Reduce – Hive: HQL – SQL
  • 22. Big Data at Akamai • Billing and Reporting • System monitoring • Media Analytics • Security • Log archive
  • 23. Billing and reporting Logs Akamai Edge Networks and Products Q Parse Pipelines Aggregate Shuffle Split Billing DB Reporting Reporting Parsing Reporting • splits lines into fields • maps keys to values per pipeline • each log generates many pipelines • each pipeline represents a streaming table Evolution • Logs were emailed (up to 1PB/day) • Now delivered via SPDY (3PB/day) Customers 3 PB/day Doubles every year Reporting Internal Apps
  • 24. System monitoring Akamai Networks and Products Client SQL Parser TLA Agg Agg Agg Alert Trend 50M jobs/day TLA: top level aggregator pulls data from aggregators which pull data from producers at the time of the request Produces rewrite data locally Evolution Single machine memory for table joins Future: distributed memory for table joins
  • 25. Media analytics Pipelines Akamai Products Front end Column Store Index Reporting Reporting Reporting API / UI Customers Events Indexes are recreated for each update Supports insert and update Reads are flexible and fast Evolution: Index now fingerprint to lower cost HyperLogLog for uniqueness counting
  • 26. Security products Akamai Edge Networks Front Pipelines end HDFS 20 TB/day Events Akamai Web Firewall Map/Reduce HBASE Hive Cloudera Graphite Operations Center Reputation Scoring Threat Analysis Intelligence Reports Risk Based Authentication Payment Fraud External Data External Data External Data Evolution: Replacing HBASE with custom aggregator Replacing Hive with custom SQL processor
  • 27. Log archive Logs Q Archive Parse 180 PB, 450 Trillion records Doubles every year Log cache 10% Client IP Sketch Archive Index (10TB) Pipelines HDFS Spark Spark SQL Client Request Archive Front End Cache first Then archive Get Index and/or CIP Archive is 90 data centers distributed over wide area; projected 1.2 EB in 3 years Evolution: Was flat file for index, now HDFS/Spark
  • 28. Hadoop / Yarn HDFS The Ecosystem Script Pig SQL Hive NoSQL HBASE Stream Kafka Storm Search Solr In-Mem Spark Integration Flume Avro Operations Ambari Zookeeper Oozie Monitoring Graphite Sharing Mesos
  • 29. Hadoop / Yarn HDFS Building a system If you need fast access to massive amounts of data where queries are constrained to an index (read optimized): • Start with HDFS or Cassandra • Add HBASE column store • Add Hive for SQL-like access • Add Pig for scripting HBASE Get, Put Hive Select * Pig { … }
  • 30. Building a system If you need to search logs: • Start with HDFS • Add Flume for log data integration • Add Avro for data serialization • Add Solr for search Hadoop / Yarn HDFS Solr Search, e.g. Ip = 1.1.1.1 Flume Agent Avro Sink Flume Collector Avro Source
  • 31. Hadoop / Yarn HDFS Building a system If you need flexible and shared access to unlimited amounts of data: • Start with HDFS or Cassandra • Add Hadoop for Map/Reduce or • Add Hive for SQL-like access or • Add Pig for scripting • Add Mesos for resource sharing • Add Ambari for cluster management and provisioning • Add map/reduce programs for business logic Pig {…} Hive Flume Select * Ambari Mesos Map/Reduce Java { … }
  • 32. Building a system If you need fast, flexible access to in-memory data: • Start with HDFS • Add Spark • Add Spark SQL for SQL-like access or • Create Spark programs for other business logic SparkSQL Select * from Spark Hadoop / Yarn HDFS Spark Progs Java { … }
  • 33. Building a system If you need real-time stream event processing: • Start with HDFS • Add Kafka for messaging and pub/sub • Add Storm for event processing • Develop Java Bolts for processing logic Kafka Storm Bolts { … } Hadoop / Yarn HDFS
  • 34. Future at Akamai • 100x – Everything bigger and faster – Requires new R&D across many Big Data components • Scaling Big Data Eco across wide-area • Internet Security • Positive reputation scoring • Automatic DDoS mitigation • Low latency data collection – 2^53 unique keys, <1 minute latency • Support DevOps – Near real-time monitoring and control