Real Time Analytics for Big Data: Lessons from Facebook
The Real Time Boom.. Google Real Time Web Analytics; Google Real Time Search; Facebook Real Time Social Analytics; Twitter paid tweet analytics; SaaS Real Time User Tracking; New Real Time Analytics Startups..
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
Analytics @ Twitter
Note the Time dimension
The data resolution & processing models
Traditional analytics applications
CEP – Complex Event Processing
In Memory Data Grid
NoSQL
Hadoop MapReduce
Hadoop Map/Reduce – Reality check..
So what’s the bottom line?
Facebook Real-time Analytics System
Goals
The actual analytics..
Technology Evaluation
The solution.. [Diagram labels: Facebook logs, Scribe, PTail, Puma, HBase, HDFS; Real Time, Long Term, Batch; 1.5 sec; 10,000 writes/sec per server]
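The "1.5 sec" label on the diagram is read here as a batching interval in front of the store; that reading is an assumption, since the slide text itself does not say so. Under that assumption, the idea can be sketched in plain Java: increments are aggregated in memory and handed to a sink as one batch, rather than issuing one write per event. `BatchingCounter` and its `sink` are hypothetical names for illustration, not Facebook's or Puma's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: aggregate counter increments in memory and hand them
// to a sink in one batch, instead of issuing one store write per event.
class BatchingCounter {
    private final Map<String, Long> pending = new HashMap<>();
    // Stands in for whatever writes the batch to the store (assumption).
    private final Consumer<Map<String, Long>> sink;

    BatchingCounter(Consumer<Map<String, Long>> sink) {
        this.sink = sink;
    }

    // Record one event for the given key (for example a URL or a domain).
    synchronized void increment(String key) {
        pending.merge(key, 1L, Long::sum);
    }

    // Flush everything collected so far as a single batch and start over.
    // A scheduler would call this on a fixed interval. Returns the number
    // of distinct keys written.
    synchronized int flush() {
        if (pending.isEmpty()) {
            return 0;
        }
        Map<String, Long> batch = new HashMap<>(pending);
        pending.clear();
        sink.accept(batch);
        return batch.size();
    }
}
```

With this shape, many increments to a hot key inside one interval collapse into a single write, which is one plausible way a per-server write rate is kept manageable.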
Checking the assumptions..
Facebook Analytics.Next..
Step 1: Use memory.. [Diagram labels: Facebook event sources, Events, Memory Grid, Data Grid partitions]
Step 1: Use memory.. [Diagram labels: Facebook event sources, Events, Any API, Data Grid]
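Step 2 – Collocate [Diagram labels: Facebook event sources, Events, Processing Grid, Data Grid partitions]

    import org.openspaces.events.EventDriven;
    import org.openspaces.events.EventTemplate;
    import org.openspaces.events.adapter.SpaceDataEvent;
    import org.openspaces.events.polling.Polling;

    // Polling listener that runs next to the data it processes.
    @EventDriven
    @Polling
    public class SimpleListener {

        // Template: match only Data entries that are not yet processed.
        @EventTemplate
        Data unprocessedData() {
            Data template = new Data();
            template.setProcessed(false);
            return template;
        }

        @SpaceDataEvent
        public Data eventListener(Data event) {
            // process Data here, then mark it so the template no longer matches
            event.setProcessed(true);
            return event;
        }
    }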
Step 3 – Write behind to SQL/NoSQL [Diagram labels: Facebook event sources, Events, Processing Grid, Data Grid partitions, Write Behind, Open Long Term persistency]
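As a rough illustration of write-behind, the sketch below acknowledges writes from memory and persists them later in a separate drain step. It uses only the JDK; `WriteBehindStore` and the `backingStore` callback are hypothetical names and do not represent the GigaSpaces API or its persistence configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

// Hypothetical sketch of write-behind: reads and writes hit memory,
// persistence to the long-term store happens asynchronously.
class WriteBehindStore {
    private final Map<String, String> memory = new ConcurrentHashMap<>();
    private final BlockingQueue<String> dirtyKeys = new LinkedBlockingQueue<>();
    // Stands in for the SQL/NoSQL writer (assumption).
    private final BiConsumer<String, String> backingStore;

    WriteBehindStore(BiConsumer<String, String> backingStore) {
        this.backingStore = backingStore;
    }

    // Fast path: update memory and remember that the key needs persisting.
    void put(String key, String value) {
        memory.put(key, value);
        dirtyKeys.add(key);
    }

    String get(String key) {
        return memory.get(key);
    }

    // Slow path, run by a background thread or timer: persist the latest
    // value of every key changed since the last drain. Repeated updates to
    // one key collapse into a single store write.
    int drain() {
        List<String> keys = new ArrayList<>();
        dirtyKeys.drainTo(keys);
        int written = 0;
        for (String key : new LinkedHashSet<>(keys)) {
            backingStore.accept(key, memory.get(key));
            written++;
        }
        return written;
    }
}
```

The trade-off in this sketch is the usual one for write-behind: lower write latency and fewer store writes, at the cost of a window in which the long-term store lags the in-memory state.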
Economic Data Scaling [Diagram labels: Memory, Disk]
Economic Scaling
Putting it all together [Diagram labels: Event Sources, Analytic Application, Write behind, Generate Patterns, Real Time Map/Reduce, R]

    // Ship a Groovy script to the grid through a native query and read back its result.
    Script script = new StaticScript("groovy", "println 'hi'; return 0");
    Query q = em.createNativeQuery("execute ?");
    q.setParameter(1, script);
    Integer result = (Integer) q.getSingleResult();
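The "Real Time Map/Reduce" label can be pictured as a scatter-gather: the same function runs against every partition in parallel and the partial results are reduced into one answer. The sketch below shows that pattern with plain JDK executors; `ScatterGather.mapReduce` is a hypothetical helper and the partitions are ordinary in-process lists, not a data grid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical sketch of scatter-gather map/reduce over partitions.
class ScatterGather {
    static <P, R> R mapReduce(List<P> partitions,
                              Function<P, R> map,
                              BinaryOperator<R> reduce,
                              R identity) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            // Scatter: one task per partition, all running concurrently.
            List<Future<R>> futures = new ArrayList<>();
            for (P partition : partitions) {
                Callable<R> task = () -> map.apply(partition);
                futures.add(pool.submit(task));
            }
            // Gather: fold the partial results into a single value.
            R result = identity;
            for (Future<R> future : futures) {
                result = reduce.apply(result, future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while gathering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("partition task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, counting events across partitions would pass a function that returns each partition's size and `Integer::sum` as the reducer.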
5x better performance per server! [Benchmark diagram labels: Event injector (up to 128 threads); GigaSpaces / other message server; App Services (up to 128 threads); Other vs. Giga; 50,000 writes/sec per server]
Live demo Inter Day Activity (Real Time) Monthly Trend Analysis
5 Big Data Predictions
Summary
Big Data Development Made Simple: Focus on your business logic; use a Big Data platform for dealing with scalability, performance, continuous availability, ...
It's Open: Use Any Stack, Avoid Lock-in. Any database (RDBMS or NoSQL); any cloud; use common APIs and frameworks.
All While Minimizing Cost: Use memory and disk for optimum cost/performance. Built-in automation and management reduces operational costs. Elasticity reduces over-provisioning cost.
Further reading..
Thank YOU! @natishalom http://blog.gigaspaces.com

Contenu connexe

Tendances

DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDATAVERSITY
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data AnalyticsEMC
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Edureka!
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - DatalakeLam Le
 
[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료
[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료
[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료Sejeong Kim 김세정
 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome ThemQubole
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Introduction to Business Intelligence
Introduction to Business IntelligenceIntroduction to Business Intelligence
Introduction to Business IntelligenceAlmog Ramrajkar
 
Using neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirementsUsing neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirementsNeo4j
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentationAASTHA PANDEY
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020Harald Erb
 
Big Data Analytics and it's use by Apple.pptx
Big Data Analytics and it's use by Apple.pptxBig Data Analytics and it's use by Apple.pptx
Big Data Analytics and it's use by Apple.pptxRakshit Shrestha
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceHarald Erb
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습동현 강
 

Tendances (20)

DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data Warehouse
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Big data frameworks
Big data frameworksBig data frameworks
Big data frameworks
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료
[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료
[한국IBM] 엔터프라이즈 AI 검색엔진 Watson Discovery 소개자료
 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Introduction to Business Intelligence
Introduction to Business IntelligenceIntroduction to Business Intelligence
Introduction to Business Intelligence
 
Using neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirementsUsing neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirements
 
Big data and analytics
Big data and analyticsBig data and analytics
Big data and analytics
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentation
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020
 
Big Data Analytics and it's use by Apple.pptx
Big Data Analytics and it's use by Apple.pptxBig Data Analytics and it's use by Apple.pptx
Big Data Analytics and it's use by Apple.pptx
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data Science
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
 

En vedette

Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17Eugene O'Loughlin
 
Principles of Data Visualization
Principles of Data VisualizationPrinciples of Data Visualization
Principles of Data VisualizationEamonn Maguire
 
Apache Zeppelin으로 데이터 분석하기
Apache Zeppelin으로 데이터 분석하기Apache Zeppelin으로 데이터 분석하기
Apache Zeppelin으로 데이터 분석하기SangWoo Kim
 
Brief introduction to data visualization
Brief introduction to data visualizationBrief introduction to data visualization
Brief introduction to data visualizationZach Gemignani
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 

En vedette (6)

Data Visualization Tools
Data Visualization ToolsData Visualization Tools
Data Visualization Tools
 
Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17
 
Principles of Data Visualization
Principles of Data VisualizationPrinciples of Data Visualization
Principles of Data Visualization
 
Apache Zeppelin으로 데이터 분석하기
Apache Zeppelin으로 데이터 분석하기Apache Zeppelin으로 데이터 분석하기
Apache Zeppelin으로 데이터 분석하기
 
Brief introduction to data visualization
Brief introduction to data visualizationBrief introduction to data visualization
Brief introduction to data visualization
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 

Similaire à Big Data Real Time Analytics - A Facebook Case Study

Real Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyReal Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyNati Shalom
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDATAVERSITY
 
New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...
New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...
New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...Paul Hofmann
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Exploring the Wider World of Big Data
Exploring the Wider World of Big DataExploring the Wider World of Big Data
Exploring the Wider World of Big DataNetApp
 
Tendencias Storage
Tendencias StorageTendencias Storage
Tendencias StorageFran Navarro
 
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018Amazon Web Services
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overviewjimliddle
 
Intelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockIntelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockJeffrey T. Pollock
 
Exploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis KapsalisExploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis KapsalisNetAppUK
 
Big data tim
Big data timBig data tim
Big data timT Weir
 
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...exponential-inc
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
Big and Fast Data - Building Infinitely Scalable Systems
Big and Fast Data - Building Infinitely Scalable SystemsBig and Fast Data - Building Infinitely Scalable Systems
Big and Fast Data - Building Infinitely Scalable SystemsFred Melo
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 

Similaire à Big Data Real Time Analytics - A Facebook Case Study (20)

Real Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyReal Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case Study
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
 
New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...
New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...
New Business Applications Powered by In-Memory Technology @MIT Forum for Supp...
 
Big Data
Big DataBig Data
Big Data
 
Exploring the Wider World of Big Data
Exploring the Wider World of Big DataExploring the Wider World of Big Data
Exploring the Wider World of Big Data
 
Tendencias Storage
Tendencias StorageTendencias Storage
Tendencias Storage
 
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overview
 
Final deck
Final deckFinal deck
Final deck
 
Intelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockIntelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff Pollock
 
Exploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis KapsalisExploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis Kapsalis
 
Big data tim
Big data timBig data tim
Big data tim
 
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Big and Fast Data - Building Infinitely Scalable Systems
Big and Fast Data - Building Infinitely Scalable SystemsBig and Fast Data - Building Infinitely Scalable Systems
Big and Fast Data - Building Infinitely Scalable Systems
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 

Plus de Nati Shalom

Cloudify and terraform integration
Cloudify and terraform integrationCloudify and terraform integration
Cloudify and terraform integrationNati Shalom
 
Why NFV and Digital Transformation Projects Fail!
Why NFV and Digital Transformation Projects Fail! Why NFV and Digital Transformation Projects Fail!
Why NFV and Digital Transformation Projects Fail! Nati Shalom
 
Cloudify and terraform integration
Cloudify and terraform integrationCloudify and terraform integration
Cloudify and terraform integrationNati Shalom
 
1 cloud, 2 clouds, 3 clouds, tons...
1 cloud, 2 clouds, 3 clouds, tons...1 cloud, 2 clouds, 3 clouds, tons...
1 cloud, 2 clouds, 3 clouds, tons...Nati Shalom
 
Open Stack Days israel Keynote 2017
Open Stack Days israel Keynote 2017Open Stack Days israel Keynote 2017
Open Stack Days israel Keynote 2017Nati Shalom
 
What A No Compromises Hybrid Cloud Looks Like
What A No Compromises Hybrid Cloud Looks Like What A No Compromises Hybrid Cloud Looks Like
What A No Compromises Hybrid Cloud Looks Like Nati Shalom
 
Running OpenStack in Production
Running OpenStack in Production Running OpenStack in Production
Running OpenStack in Production Nati Shalom
 
Orchestration tool roundup kubernetes vs. docker vs. heat vs. terra form vs...
Orchestration tool roundup   kubernetes vs. docker vs. heat vs. terra form vs...Orchestration tool roundup   kubernetes vs. docker vs. heat vs. terra form vs...
Orchestration tool roundup kubernetes vs. docker vs. heat vs. terra form vs...Nati Shalom
 
Real World Example of Orchestrating Docker, Node JS, NFV on OpenStack
Real World Example of Orchestrating Docker, Node JS, NFV on OpenStackReal World Example of Orchestrating Docker, Node JS, NFV on OpenStack
Real World Example of Orchestrating Docker, Node JS, NFV on OpenStackNati Shalom
 
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...Nati Shalom
 
OpenStack Juno The Complete Lowdown and Tales from the Summit
OpenStack Juno The Complete Lowdown and Tales from the SummitOpenStack Juno The Complete Lowdown and Tales from the Summit
OpenStack Juno The Complete Lowdown and Tales from the SummitNati Shalom
 
Application and Network Orchestration using Heat & Tosca
Application and Network Orchestration using Heat & ToscaApplication and Network Orchestration using Heat & Tosca
Application and Network Orchestration using Heat & ToscaNati Shalom
 
Introduction to Cloudify for OpenStack users
Introduction to Cloudify for OpenStack users Introduction to Cloudify for OpenStack users
Introduction to Cloudify for OpenStack users Nati Shalom
 
Software Defined Operator
Software Defined OperatorSoftware Defined Operator
Software Defined OperatorNati Shalom
 
Complex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeComplex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeNati Shalom
 
Is Orchestration the Next Big Thing in DevOps
Is Orchestration the Next Big Thing in DevOpsIs Orchestration the Next Big Thing in DevOps
Is Orchestration the Next Big Thing in DevOpsNati Shalom
 
When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)Nati Shalom
 
Application Centric Approach to Devops
Application Centric Approach to DevopsApplication Centric Approach to Devops
Application Centric Approach to DevopsNati Shalom
 
Case Studies for moving apps to the cloud - DLD 2013
Case Studies for moving apps to the cloud - DLD 2013Case Studies for moving apps to the cloud - DLD 2013
Case Studies for moving apps to the cloud - DLD 2013Nati Shalom
 
Application Centric DevOps
Application Centric DevOpsApplication Centric DevOps
Application Centric DevOpsNati Shalom
 

Plus de Nati Shalom (20)

Cloudify and terraform integration
Cloudify and terraform integrationCloudify and terraform integration
Cloudify and terraform integration
 
Why NFV and Digital Transformation Projects Fail!
Why NFV and Digital Transformation Projects Fail! Why NFV and Digital Transformation Projects Fail!
Why NFV and Digital Transformation Projects Fail!
 
Cloudify and terraform integration
Cloudify and terraform integrationCloudify and terraform integration
Cloudify and terraform integration
 
1 cloud, 2 clouds, 3 clouds, tons...
1 cloud, 2 clouds, 3 clouds, tons...1 cloud, 2 clouds, 3 clouds, tons...
1 cloud, 2 clouds, 3 clouds, tons...
 
Open Stack Days israel Keynote 2017
Open Stack Days israel Keynote 2017Open Stack Days israel Keynote 2017
Open Stack Days israel Keynote 2017
 
What A No Compromises Hybrid Cloud Looks Like
What A No Compromises Hybrid Cloud Looks Like What A No Compromises Hybrid Cloud Looks Like
What A No Compromises Hybrid Cloud Looks Like
 
Running OpenStack in Production
Running OpenStack in Production Running OpenStack in Production
Running OpenStack in Production
 
Orchestration tool roundup kubernetes vs. docker vs. heat vs. terra form vs...
Orchestration tool roundup   kubernetes vs. docker vs. heat vs. terra form vs...Orchestration tool roundup   kubernetes vs. docker vs. heat vs. terra form vs...
Orchestration tool roundup kubernetes vs. docker vs. heat vs. terra form vs...
 
Real World Example of Orchestrating Docker, Node JS, NFV on OpenStack
Real World Example of Orchestrating Docker, Node JS, NFV on OpenStackReal World Example of Orchestrating Docker, Node JS, NFV on OpenStack
Real World Example of Orchestrating Docker, Node JS, NFV on OpenStack
 
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
 
OpenStack Juno The Complete Lowdown and Tales from the Summit
OpenStack Juno The Complete Lowdown and Tales from the SummitOpenStack Juno The Complete Lowdown and Tales from the Summit
OpenStack Juno The Complete Lowdown and Tales from the Summit
 
Application and Network Orchestration using Heat & Tosca
Application and Network Orchestration using Heat & ToscaApplication and Network Orchestration using Heat & Tosca
Application and Network Orchestration using Heat & Tosca
 
Introduction to Cloudify for OpenStack users
Introduction to Cloudify for OpenStack users Introduction to Cloudify for OpenStack users
Introduction to Cloudify for OpenStack users
 
Software Defined Operator
Software Defined OperatorSoftware Defined Operator
Software Defined Operator
 
Complex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeComplex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real Time
 
Is Orchestration the Next Big Thing in DevOps
Is Orchestration the Next Big Thing in DevOpsIs Orchestration the Next Big Thing in DevOps
Is Orchestration the Next Big Thing in DevOps
 
When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)
 
Application Centric Approach to Devops
Application Centric Approach to DevopsApplication Centric Approach to Devops
Application Centric Approach to Devops
 
Case Studies for moving apps to the cloud - DLD 2013
Case Studies for moving apps to the cloud - DLD 2013Case Studies for moving apps to the cloud - DLD 2013
Case Studies for moving apps to the cloud - DLD 2013
 
Application Centric DevOps
Application Centric DevOpsApplication Centric DevOps
Application Centric DevOps
 

Big Data Real Time Analytics - A Facebook Case Study

  • 1. Real Time Analytics for Big Data: Lessons from Facebook
  • 2. The Real Time Boom: Google Real Time Web Analytics; Google Real Time Search; Facebook Real Time Social Analytics; Twitter paid tweet analytics; SaaS Real Time User Tracking; new Real Time Analytics startups. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
  • 4. Note the Time dimension
  • 5. The data resolution & processing models
  • 11. Hadoop Map/Reduce – Reality check
  • 12. So what’s the bottom line?
  • 13. Facebook Real-time Analytics System
  • 17. The solution (diagram labels): PTail; Scribe; Puma; HBase; HDFS; Real Time; Long Term; Batch 1.5 sec; 10,000 writes/sec per server; Facebook logs
  • 18. Checking the assumptions
  • 24. Step 3 – Write behind to SQL/NoSQL (diagram labels): Events; Processing Grid; Data Grid; Write Behind; open long-term persistency; Facebook
  • 30. Live demo: Inter Day Activity (Real Time); Monthly Trend Analysis
  • 31. 5 Big Data Predictions
  • 32. Summary. Big Data development made simple: focus on your business logic, and use a Big Data platform for dealing with scalability, performance, and continuous availability. It's open: use any stack and avoid lock-in; any database (RDBMS or NoSQL); any cloud; use common APIs and frameworks. All while minimizing cost: use memory and disk for optimum cost/performance; built-in automation and management reduces operational costs; elasticity reduces over-provisioning cost.
  • 33. Further reading
  • 34. Thank YOU! @natishalom http://blog.gigaspaces.com

Editor's notes

  1. http://developers.facebook.com/blog/post/476/
  2. http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
     MySQL DB Counters: Have a row with a key and a counter. This results in lots of database activity. Stats are kept at day-bucket granularity, and every day at midnight the stats roll over. The rollover produced a lot of writes to the database, which caused a lot of lock contention. They tried to spread the work by taking time zones into account, and tried to shard things differently. The high write rate led to lock contention, it was easy to overload the databases, they had to constantly monitor the databases, and they had to rethink their sharding strategy. The solution was not well tailored to the problem.
     In-Memory Counters: If you are worried about bottlenecks in IO, then throw it all in memory. No scale issues. Counters are stored in memory, so writes are fast and the counters are easy to shard. They felt in-memory counters, for reasons not explained, weren't as accurate as other approaches. Even a 1% failure rate would be unacceptable. Analytics drive money, so the counters have to be highly accurate. They didn't implement this system. It was a thought experiment, and the accuracy issue caused them to move on.
     MapReduce: Used Hadoop/Hive for the previous solution. Flexible. Easy to get running. Can handle IO, both massive writes and reads. You don't have to know ahead of time how you will query; the data can be stored and then queried. Not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable enough to hit realtime goals.
     Cassandra: HBase seemed a better solution based on availability and the write rate. Write rate was the huge bottleneck being solved.
  3. http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
     The Winner: HBase + Scribe + Ptail + Puma. At a high level: HBase stores data across distributed machines. The system uses a tailing architecture: new events are stored in log files, and the logs are tailed. A system rolls the events up and writes them into storage. A UI pulls the data out and displays it to users.
     Data Flow: A user clicks Like on a web page, which fires an AJAX request to Facebook. The request is written to a log file using Scribe. Scribe handles issues like file rollover. Scribe is built on the same HDFS file store Hadoop is built on. Write extremely lean log lines: the more compact the log lines, the more can be stored in memory.
     Ptail: Data is read from the log files using Ptail. Ptail is an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out. Ptail data is separated into three streams so they can eventually be sent to their own clusters in different datacenters: plugin impressions, news feed impressions, and actions (plugin + news feed).
     Puma: Batch data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second, they still want to batch data. A hot article will generate a lot of impressions and news feed impressions, which will cause huge data skews, which will cause IO issues. The more batching the better. Batch for 1.5 seconds on average. They would like to batch longer, but they have so many URLs that they run out of memory when creating a hashtable. Wait for the last flush to complete before starting a new batch, to avoid lock contention issues.
     UI (renders data): Frontends are all written in PHP. The backend is written in Java, and Thrift is used as the messaging format so PHP programs can query Java services. Caching solutions are used to make the web pages display more quickly. Performance varies by the statistic. A counter can come back quickly; finding the top URL in a domain can take longer. Response times range from 0.5 to a few seconds. The more and longer data is cached, the less realtime it is. Set different caching TTLs in memcache.
     MapReduce: The data is then sent to MapReduce servers so it can be queried via Hive. This also serves as a backup plan, as the data can be recovered from Hive. Raw logs are removed after a period of time.
     HBase: HBase is a distributed column store, a database interface to Hadoop. Facebook has people working internally on HBase. Unlike a relational database, you don't create mappings between tables and you don't create indexes. The only index you have is the primary row key. From the row key you can have millions of sparse columns of storage. It's very flexible. You don't have to specify the schema. You define column families, to which you can add keys at any time. The key feature for scalability and reliability is the WAL (write-ahead log), which is a log of the operations that are supposed to occur. Based on the key, data is sharded to a region server. It is written to the WAL first, then put into memory. At some point in time, or if enough data has accumulated, the data is flushed to disk. If the machine goes down, you can recreate the data from the WAL, so there's no permanent data loss. Using a combination of the log and in-memory storage, they can handle an extremely high rate of IO reliably. HBase handles failure detection and automatically routes across failures. Currently HBase resharding is done manually. Automatic hot spot detection and resharding is on the roadmap for HBase, but it's not there yet. Every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
     Schema: Store a bunch of counters on a per-URL basis. A row key, which is the only lookup key, is the MD5 hash of the reverse domain. Selecting the proper key structure helps with scanning and sharding. A problem they have is sharding data properly onto different machines. Using an MD5 hash makes it easier to say this range goes here and that range goes there. For URLs they do something similar, plus they add an ID on top of that. Every URL in Facebook is represented by a unique ID, which is used to help with sharding. A reverse domain, com.facebook/ for example, is used so that the data is clustered together. HBase is really good at scanning clustered data, so if they store the data so it's clustered together, they can efficiently calculate stats across domains. Think of every row as a URL and every cell as a counter; you are able to set different TTLs (time to live) for each cell. So if keeping an hourly count, there's no reason to keep that around for every URL forever, so they set a TTL of two weeks. TTLs are typically set on a per-column-family basis. Per server they can handle 10,000 writes per second. Checkpointing is used to prevent data loss when reading data from log files. Tailers save log stream checkpoints in HBase. These are replayed on startup, so data won't be lost. Useful for detecting click fraud, but it doesn't have fraud detection built in.
     Tailer Hot Spots: In a distributed system there's a chance one part of the system can be hotter than another. One example is region servers that can be hot because more keys are being directed that way. One tailer can lag behind another too. If one tailer is an hour behind and the others are up to date, what numbers do you display in the UI? For example, impressions have a way higher volume than actions, so when the impressions tailer fell behind, CTRs appeared way higher in the last hour. The solution is to figure out the least up-to-date tailer and use that when querying metrics.
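The row key and Puma batching ideas in note 3 can be sketched roughly as follows. This is an illustrative Python sketch, not Facebook's code: `reverse_domain`, `row_key`, `BatchingCounter`, and the `flush` callback are hypothetical names, and the callback merely stands in for a write to HBase. The only details taken from the notes are the MD5-of-reverse-domain key and the roughly 1.5 second batch window.

```python
import hashlib
import time
from collections import Counter


def reverse_domain(domain: str) -> str:
    """Turn 'www.facebook.com' into 'com.facebook.www'."""
    return ".".join(reversed(domain.lower().split(".")))


def row_key(domain: str) -> str:
    """MD5 hex digest of the reversed domain, as the notes describe."""
    return hashlib.md5(reverse_domain(domain).encode("utf-8")).hexdigest()


class BatchingCounter:
    """Accumulates increments in memory and emits one write per key per batch.

    `flush` is a hypothetical callback standing in for the storage write;
    `clock` is injectable so the window can be tested without sleeping.
    """

    def __init__(self, flush, window_seconds=1.5, clock=time.monotonic):
        self._flush = flush
        self._window = window_seconds
        self._clock = clock
        self._pending = Counter()
        self._window_start = clock()

    def increment(self, domain: str, metric: str, amount: int = 1) -> None:
        # A hot key costs one dictionary update here, not one storage write.
        self._pending[(row_key(domain), metric)] += amount
        if self._clock() - self._window_start >= self._window:
            self.flush_now()

    def flush_now(self) -> None:
        # Hand the whole batch to storage, then start a new window.
        if self._pending:
            batch = dict(self._pending)
            self._pending.clear()
            self._flush(batch)
        self._window_start = self._clock()
```

With this shape, a thousand impressions on one hot URL inside a window collapse into a single increment of 1000, which is the skew-reduction effect the notes attribute to batching. The memory limit the notes mention corresponds to the size of `_pending`.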
  4. A Potential for Improvement
     There are lots of areas in which you can see potential improvements if the assumptions are changed. As a contrast to Facebook's working system:
     We can simplify the design. If memory can be seen as transactional (and it can), we can use events without transforming them as they proceed along our analytics workflow. This makes our design much simpler to implement and test, and performance improves as well.
     We can strengthen the design. With a polling semantic, such systems are brittle, relying on systems that pull data in order to generate realtime analytics data. We should be able to reduce the fragility of the system, even while making it faster.
     We can strengthen the implementation. With batching subsystems, there are limits that shouldn’t exist. For example, one concern in Facebook's implementation is the use of an in-memory hash table that stores intermediate data; the in-memory aspect isn’t a concern until you realize that the batch sizes are chosen partially to make sure that this hash table doesn’t overflow available space.
     We can allow deployments to change databases based on their requirements. There's nothing wrong with HBase, but it has specific characteristics that aren't appropriate for all enterprises. We can design a system that you’d be able to deploy on various flexible platforms, and we can migrate the underlying long-term data store to a different database if needed.
     We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.
     What we want to do, then, is create a general model for an application that can accomplish the same goals as Facebook’s realtime analytics system, while leveraging the capabilities that in-memory data grids offer where available, potentially offering improvement in the areas of scalability, manageability, latency, platform neutrality, and simplicity, all while increasing ease of data access. That sounds like quite a tall order, but it’s doable.
     The key is to remember that at heart, realtime analytics represent an events system. Facebook’s entire architecture is designed to funnel events through various channels, such that they can safely and sequentially manage event updates. Therefore, they receive a massive set of events that “look like” marbles, which they line up in single file; they then sort the marbles by color, you might say, and for each color they create a bundle of sticks; the sticks are lit on fire, and when the heat goes up past a certain temperature, steam is generated, which turns a turbine. It’s a real-life Rube Goldberg machine, which is admirable in that it works, but much of it is unnecessary if the assumptions about memory ("unreliable") and database ("HBase is the only target that counts") are changed. Looking at the analogy, there’s no need to change a marble into anything. The marble is enough.
  5. Value: write/read scaling through partitioning; performance through memory speed; reliability through replication and redundancy.
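The partitioning and replication properties listed in note 5 can be illustrated with a toy in-process model. This is a hedged sketch only: `PartitionedStore` and its methods are hypothetical names, it runs in a single process, and it does not represent the GigaSpaces API or any real data grid.

```python
import hashlib


class PartitionedStore:
    """Hash-partitioned in-memory store; every partition keeps one backup copy."""

    def __init__(self, partitions: int):
        self.primaries = [dict() for _ in range(partitions)]
        self.backups = [dict() for _ in range(partitions)]

    def partition_for(self, key: str) -> int:
        # A stable hash, so the same key always routes to the same partition.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.primaries)

    def write(self, key: str, value) -> None:
        p = self.partition_for(key)
        self.primaries[p][key] = value
        # Synchronous replication: the write is also applied to the backup.
        self.backups[p][key] = value

    def read(self, key: str):
        return self.primaries[self.partition_for(key)].get(key)

    def fail_over(self, partition: int) -> None:
        # Simulate recovery from a lost primary by promoting the backup copy.
        self.primaries[partition] = dict(self.backups[partition])
```

Writes and reads for different keys touch different partitions, which is where the scaling comes from. The backup copy is what lets a partition survive the loss of its primary without going to disk.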
  6. Value: A data grid like GigaSpaces comes with a rich set of APIs that provide not only the means to store data fast and reliably, but also to access and query the data just as you would with a database. GigaSpaces specifically supports both JPA and a Document API, and a way to mix and match between those APIs. Unlike Scribe and the log system, we can now look at the data as it comes in, not only once it is stored in the database. The latter makes it possible to partition data based on time: the first day in memory and the rest through the database, and so on.
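The time-based split described in note 6 (recent data in memory, older data in the database) might look like the following sketch. Everything here is assumed for illustration: `TieredCounterStore`, the hourly bucket size, and the plain dictionary standing in for the long-term database are not from the slides.

```python
DAY_SECONDS = 24 * 60 * 60
HOUR_SECONDS = 60 * 60


class TieredCounterStore:
    """Keeps recent hourly counters in memory and moves older buckets to an archive."""

    def __init__(self, archive: dict, hot_seconds: int = DAY_SECONDS):
        self.hot = {}            # (key, bucket_start) -> count, held in memory
        self.archive = archive   # stand-in for the long-term database
        self.hot_seconds = hot_seconds

    def add(self, key: str, timestamp: int, amount: int = 1) -> None:
        bucket = timestamp - timestamp % HOUR_SECONDS
        self.hot[(key, bucket)] = self.hot.get((key, bucket), 0) + amount

    def evict(self, now: int) -> None:
        # Move every bucket older than the hot window into the archive.
        cutoff = now - self.hot_seconds
        for slot in [s for s in self.hot if s[1] < cutoff]:
            self.archive[slot] = self.archive.get(slot, 0) + self.hot.pop(slot)

    def total(self, key: str, since: int, until: int) -> int:
        # Sum buckets starting in [since, until) across both tiers.
        return sum(
            count
            for tier in (self.hot, self.archive)
            for (k, bucket), count in tier.items()
            if k == key and since <= bucket < until
        )
```

Queries see one logical data set, while only the most recent window occupies memory.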
  7. Collocating the processing with the data can provide the biggest gain in terms of scalability and performance, as we reduce the number of network hops as well as serialization overhead. We also reduce the number of moving parts, which in itself simplifies our runtime architecture and our ability to scale. The other benefit is that we decentralize the Puma services from the Facebook example and thus make the entire architecture significantly more scalable.
  8. The snippet of code shows the part of the code that generates the statistical information as the events come in. The template defines the filter for the events. In the example, we filter any event that is of type Data and has a false value in its “processed” attribute. For every event that matches this filter, the method eventListener is called with the appropriate data object.
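The slide's own snippet is not reproduced in this transcript. As a rough stand-in, the template-plus-listener pattern that note 8 describes can be sketched in Python. `EventContainer`, `register`, and the `url` field are hypothetical names chosen for illustration; only the `Data` type, the `processed` attribute, and the listener name `eventListener` (rendered here as `event_listener`) come from the note. This is not the GigaSpaces API.

```python
from collections import Counter
from dataclasses import dataclass

_MISSING = object()


@dataclass
class Data:
    url: str
    processed: bool = False


class EventContainer:
    """Calls a listener for every written event that matches its template."""

    def __init__(self):
        self._listeners = []

    def register(self, event_type, template: dict, listener) -> None:
        self._listeners.append((event_type, template, listener))

    def write(self, event) -> None:
        for event_type, template, listener in self._listeners:
            # An event matches when its type and every template attribute match.
            if isinstance(event, event_type) and all(
                getattr(event, name, _MISSING) == value
                for name, value in template.items()
            ):
                listener(event)


counts = Counter()


def event_listener(event: Data) -> None:
    # Update the statistic and mark the event so it is not counted twice.
    counts[event.url] += 1
    event.processed = True


container = EventContainer()
container.register(Data, {"processed": False}, event_listener)
```

Events written with `processed` already set to true never reach the listener, which is the filtering behavior the note describes.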
  9. Value gained: avoid lock-in to a specific NoSQL API; performance (reduced network hops and serialization overhead); simplicity (fewer moving parts); scalability without compromising on consistency (strict consistency at the front, eventual consistency for the long-term data); JPA/standard API.
  10. content based routing, workflow