SRE
Bruno Connelly
#LinkedInWIT
The Past, Present and Future of Big
Data @ LinkedIn
People You May Know
Suja Viswesan
SR ENGINEERING MANAGER, BIG DATA PLATFORM
MEMBERS COMPANIES JOBS SKILLS SCHOOLS KNOWLEDGE
Scale of Processing @
2.3 Trillion
Messages per Day
0.6 PB in, 2.3 PB out
per Day (compressed)
16 Million
Messages per Second at peaks!
4.6K users
125 TB ingested per day
120 PB of HDFS
224K jobs per day across
13 clusters (9 K nodes)
220+ Applications
Most Applications require
Stateful Processing ~
several TBs (overall)
800+ nodes across 9
clusters
samza
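As a rough back-of-envelope check, the headline figures above can be turned into per-second and per-node rates. The sketch below assumes decimal units (1 PB = 10^15 bytes), an 86,400-second day, and an even spread of data and jobs across nodes and clusters; none of these assumptions comes from the slides, and the variable names are illustrative only.

```python
# Back-of-envelope rates derived from the figures quoted on the slide.
# Assumptions (not from the slides): decimal units, even distribution.
SECONDS_PER_DAY = 24 * 60 * 60
PB = 10**15
TB = 10**12
GB = 10**9

messages_per_day = 2.3e12      # "2.3 Trillion Messages per Day"
bytes_in_per_day = 0.6 * PB    # "0.6 PB in"
bytes_out_per_day = 2.3 * PB   # "2.3 PB out"
hdfs_bytes = 120 * PB          # "120 PB of HDFS"
hadoop_nodes = 9_000           # "9 K nodes"
jobs_per_day = 224_000         # "224K jobs per day"
hadoop_clusters = 13           # "13 clusters"

avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
avg_ingress_gb_per_sec = bytes_in_per_day / SECONDS_PER_DAY / GB
fan_out = bytes_out_per_day / bytes_in_per_day
hdfs_tb_per_node = hdfs_bytes / hadoop_nodes / TB
jobs_per_cluster = jobs_per_day / hadoop_clusters

print(f"average messages/sec:   {avg_msgs_per_sec:,.0f}")
print(f"average ingress (GB/s): {avg_ingress_gb_per_sec:.2f}")
print(f"bytes out per byte in:  {fan_out:.2f}")
print(f"HDFS TB per node:       {hdfs_tb_per_node:.1f}")
print(f"jobs per cluster/day:   {jobs_per_cluster:,.0f}")
```

The computed daily average (roughly 26.6 million messages per second) is higher than the 16 million per second peak quoted above, so the two figures presumably count different things; the slide does not say how each is measured.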
Big Data!
Collect
- Collect User Events from
Across the Globe
- E.g. Page Views, Feed
Impressions, Connections
- Multiple Sources of Data
- Transport Data with Low
Latency
- Scale: 2.3 trillion msgs/day
(~2.5 PB) (PYMK scale ~10K
msg/sec)
Process
- Highly Reliable and
Fault-tolerant Processing of
Events
- Offline Batch Processing
- Near-realtime Stream
Processing
- Seamlessly Transport Results
from Offline Processing to
Online Services
Access
- Persist Data Durably
- High availability for Serving
Online Services
- Data should be Searchable
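The Process requirements above (near-realtime, stateful, fault-tolerant event processing) can be illustrated with a minimal sketch in plain Python. This is not the Samza API: the event fields, the `PageViewCounter` class, and the dictionary-based checkpoint are hypothetical, chosen only to show per-key state and an input offset being saved and restored together.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageViewCounter:
    """Hypothetical task: counts page views per member, tracks last offset."""
    counts: dict = field(default_factory=lambda: defaultdict(int))
    last_offset: int = -1

    def process(self, offset, event):
        # Skip events already reflected in state (replay after a restart).
        if offset <= self.last_offset:
            return
        if event["type"] == "page_view":
            self.counts[event["member_id"]] += 1
        self.last_offset = offset

    def checkpoint(self):
        # State and offset are saved together so a restore is consistent.
        return {"counts": dict(self.counts), "last_offset": self.last_offset}

    @classmethod
    def restore(cls, snapshot):
        task = cls()
        task.counts.update(snapshot["counts"])
        task.last_offset = snapshot["last_offset"]
        return task


events = [
    {"type": "page_view", "member_id": "m1"},
    {"type": "feed_impression", "member_id": "m1"},
    {"type": "page_view", "member_id": "m2"},
    {"type": "page_view", "member_id": "m1"},
]

# Process the first two events, then simulate a crash.
task = PageViewCounter()
for offset, event in enumerate(events[:2]):
    task.process(offset, event)
snapshot = task.checkpoint()

# Recover from the snapshot and replay the whole log from the start.
recovered = PageViewCounter.restore(snapshot)
for offset, event in enumerate(events):
    recovered.process(offset, event)

print(dict(recovered.counts))
```

Because the offset is stored with the counts, replaying the log after recovery does not double-count the events that were processed before the crash.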
Analytics Infrastructure
Gobblin
Espresso
Data
Sources
3rd Party
Services
Data
Ingestion
Oracle DB
HDFS
Voldemort
Data
Storage
Dataset
Management
Dali
Datasets
Analytics Infrastructure
A/B
testing
Cluster
Management
Compute
Engines
Workflow
Orchestration
Usecases
Relevance
Analytics
Reporting
YARN Azkaban
Analytics Infrastructure Challenges
Computation
Cluster Management
System
Scaling up computation
● Limited shared computation resources
● Efficient computation to cut down cost of jobs
Scaling up cluster management
● Thousands of daily active cluster users
● Hundreds of thousands of daily jobs
● A mix of SLA requirements
Scaling up system
● Tens of thousands of nodes
● Tens of PB of data
THE SCALING PYRAMID
Our Solutions
Scaling up system
● Federated HDFS
● Dali - Logical Data Access Layer for Hadoop
Scaling up cluster management
● Hadoop OrgQueue
● Elasticity Tuner
Scaling up computation
● Dr. Elephant
● Better computation strategy for handling large datasets
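Dr. Elephant is listed above as the tool for scaling up computation. To make the idea of rule-based job tuning concrete, here is a minimal sketch of one kind of check such a tool could run: flagging a job whose tasks receive very uneven amounts of input. The function name, thresholds, and severity labels are hypothetical and are not taken from Dr. Elephant's code.

```python
from statistics import median


def input_skew_severity(task_input_bytes, moderate=2.0, severe=8.0):
    """Rate input skew by comparing the largest task input with the median.

    Thresholds are illustrative assumptions, not values from any real tool.
    """
    if not task_input_bytes:
        return "none"
    typical = median(task_input_bytes)
    if typical == 0:
        # All-zero inputs are not skewed; any non-zero outlier is.
        return "none" if max(task_input_bytes) == 0 else "severe"
    ratio = max(task_input_bytes) / typical
    if ratio >= severe:
        return "severe"
    if ratio >= moderate:
        return "moderate"
    return "none"


balanced = [128, 130, 127, 129]   # input sizes in MB per task
skewed = [128, 130, 127, 2048]    # one task gets far more data

print(input_skew_severity(balanced))
print(input_skew_severity(skewed))
```

A check like this only points at a likely problem; deciding how to repartition the data or retune the job is still left to the job owner.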
LinkedIn Open Source Projects
Pinot
Dr Elephant
Cubert
Streaming
Near Realtime
Stream Processing
Data Management Performance Tuning OLAP Storage
Computation Engine Workflow Manager
samza
Photon - ML
Bruno Connelly
See you at Grace Hopper Celebration!
