The Big Data Ecosystem at LinkedIn


OSCON Byrum




1. The Big Data Ecosystem at LinkedIn. Jay Kreps

2. Me: Background in data, not infrastructure. LinkedIn’s SNA team. Original co-author of some LinkedIn open source projects (Voldemort, Azkaban, Kafka).

3. This Talk: We are in a renaissance of data infrastructure. How do all these pieces fit together?

4. Why the current obsession with “Big Data”?

5. The goal of modern data infrastructure is to make many small computers act like one big one.
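One way to picture this goal is hash partitioning: a thin routing layer sends each key to one of several small nodes, so callers see what looks like a single store. The sketch below is illustrative only. `PartitionedStore`, the node count, and the choice of SHA-256 as a stable hash are assumptions made for the example, not details from the talk.

```python
import hashlib


class PartitionedStore:
    """Toy key/value store spread over several in-memory 'nodes'."""

    def __init__(self, num_nodes: int) -> None:
        # Each dict stands in for one small machine (hypothetical setup).
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key: str) -> int:
        # A stable hash keeps the key-to-node mapping the same across runs.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key: str, value: object) -> None:
        self.nodes[self._node_for(key)][key] = value

    def get(self, key: str) -> object:
        return self.nodes[self._node_for(key)].get(key)


store = PartitionedStore(num_nodes=4)
for member_id in range(100):
    store.put(f"member:{member_id}", {"id": member_id})
```

Callers only use `put` and `get`; which node holds a key stays hidden, which is the sense in which many small machines behave like one large one.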

6. The Old Picture

7. The New Picture

8. Polyglot persistence?

9. Infrastructure Icebergs: 90k lines of tooling and monitoring, 30k lines of logic. Dedicated engineers, operations. Training. First three nines come from operations.
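To make "three nines" concrete: 99.9% availability leaves a downtime budget of roughly 8.76 hours per year. The short calculation below shows the arithmetic. The function name and the 365-day year are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10 ** (-nines)  # three nines -> 0.001
    return HOURS_PER_YEAR * unavailability


for n in (2, 3, 4):
    print(f"{n} nines: {allowed_downtime_hours(n):.2f} hours/year")
```

Each additional nine cuts the budget by a factor of ten, so the cost of an operational mistake grows quickly as the target rises.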

10. This is (still) a very immature space. Which systems should we have?

11. Infrastructure is sculpted by applications and constraints. Projects are defined by trade-offs.

12. Constraints. Hardware: Jeff Dean, "Numbers everyone should know"; David Patterson, "Latency lags bandwidth". $$$. Other: path dependence, complexity, resources.

13. Applications

14. Common categories of non-CRUD: Recommendations & Matching; Graphs; Search; Data Normalization; News feed; Analysis & Monitoring.

15. Social Graph

16. Search

17. Recommendations: People

18. Recommendations: Jobs

19. Recommendations: Newsfeed

20. Data Normalization

21. Analytics

22. Infrastructure. Search: Lucene; Bobo (facets), Zoie (real-time indexing), Sensei (distribution). Social Graph. Storage: Oracle, Voldemort, Espresso. Streams: Databus, Kafka. Offline: Hadoop & friends (Pig, Hive, Azkaban, etc.).

23. Three Major Paradigms. Request/Response: Search, Social Graph, Storage. Streams: Kafka. Batch: Hadoop.

24. Most features are multi-paradigm

25. Request/Response: Search; Social Graph; Storage (Voldemort, Espresso).

26. Request/Response Patterns Broker, scatter-gather Storage systems: only Partitioning strategy Latency oriented
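The broker and scatter-gather pattern named on this slide can be sketched as follows: a broker fans a query out to every partition in parallel, then merges the partial results. This is a minimal illustration under stated assumptions. The partition contents and the names `search_partition` and `broker_search` are hypothetical and do not reflect how any LinkedIn system is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each holds a slice of the documents.
PARTITIONS = [
    ["kafka streams", "hadoop batch"],
    ["voldemort storage", "kafka log"],
    ["lucene search", "espresso storage"],
]


def search_partition(docs: list[str], term: str) -> list[str]:
    """Work done by one partition: scan only its local documents."""
    return [doc for doc in docs if term in doc]


def broker_search(term: str) -> list[str]:
    """Scatter the query to every partition, then gather and merge."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partial_results = pool.map(
            lambda docs: search_partition(docs, term), PARTITIONS
        )
        merged = [doc for part in partial_results for doc in part]
    return sorted(merged)


print(broker_search("kafka"))
```

Because the broker waits for every partition before answering, overall latency is set by the slowest partition, which is one reason such systems are tuned for latency.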

27. Batch: Hadoop. Uses: ad hoc, production batch. Ecosystem: Hive, Pig; Azkaban (workflow); Avro data. Data in: Kafka. Data out: Voldemort, Kafka.

28. Why do batch if you have real-time? Batch advantages: Safety; Easy; Throughput; Simplicity; Economics. Tricky bit: engineering the data cycle.

29. Why do streaming? You have to glue all these systems together. Throughput as good as batch. Latency much better. Metaphor more natural for low latency than Hadoop.
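The "glue" role of a stream can be pictured as an append-only log that several downstream systems read at their own pace. The sketch below is a toy in-memory model, not Kafka's API. The `Log` class, its methods, and the consumer names are all hypothetical.

```python
class Log:
    """Toy append-only log; each consumer tracks its own read offset."""

    def __init__(self) -> None:
        self._messages: list[str] = []
        self._offsets: dict[str, int] = {}

    def append(self, message: str) -> int:
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> list[str]:
        """Return unread messages for `consumer` and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = Log()
for event in ("page_view", "search", "profile_edit"):
    log.append(event)

print(log.poll("hadoop_loader"))      # reads all three events
print(log.poll("search_indexer", 2))  # independent offset: first two events
print(log.poll("search_indexer", 2))  # continues with the third event
```

Since each consumer keeps its own offset, a slow batch loader and a low-latency indexer can share the same feed without coordinating with each other.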

30. What makes successful infrastructure systems? Operability and Operations; Monitoring; Simplicity; Documentation; Broad adoption; Lazy users; Open source.

31. Open Source: Data > Infrastructure. Open source creates better code, even with few outside contributors. Commercial infrastructure not interesting.

32. Open Source Projects. We made: Voldemort (key/value storage); Sensei, Bobo, Zoie (elastic, faceted, real-time search with Lucene); Kafka (persistent, distributed data streams); Norbert (cluster-aware RPC, load balancing, and group membership); and others… We stole: Hadoop, Pig, Hive; Lucene; Netty, Jetty; Zookeeper; Avro; Apache Traffic Server.

33. The End. jay.kreps@gmail.com http://www.linkedin.com/in/jaykreps http://twitter.com/jaykreps http://sna-projects.com

Editor's notes

Good news for users, bad news for distributed systems nerds.

Filesystems take a decade to mature. Don’t expect this will be easier.