SlideShare une entreprise Scribd logo
1  sur  34
Proprietary & Confidential 6/7/2020 1
Large Scale Computing
@ Linkedin
Bhupesh Bansal
Software Engineer
Linkedin
The Plan
• What Is Linkedin
• Large Scale problems @ Linkedin
• Scalable Architectures
• Hadoop : The mighty elephant
• Questions
What is Linkedin?
 The largest Professional social network:
 Some high level statistics:
– 50M active users, in 147 Industries and 244 Countries (580+ Regions)
– ~10M unique daily visitors
– ~25M weekly searches (150 QPS at peak)
– ~50M weekly profile views
– ~2M connections per day
Why should you care?
 Control your brand
 Market yourself
 Find inside connections
 Learn from Wisdom of the crowd
Build Your Brand
What is your Identity?
How do you want the world to see and know about you?
My Network My Strength
1. Explore opportunities before its too late
2. Make your network go far. (well really really far !!)
3. Seek expert advice
4. Stand on the shoulder of giants.
Wisdom of your network
1. Slideshare : See slides uploaded by your network
2. Answers : See expert answers & who they are
3. Groups
4. …
Large Scale Problems @ Linkedin
9
Linkedin Network Cache
• Third Degree network
• Friends of Friends of Friends
• Scales exponentially
• first degree: 100
• second-degree : first-degree 2
• third-degree : second-degree 2
• Ranges from 10,000  20 M.
• Is Unique for every user
• How do we solve it.
• Custom graph engine app.
• Cache entire third degree for logged
in members
• Graph Algorithms
• Minimize memory footprint of cache
• Bit vector optimizations
• compressions : p4Delta ..
Linkedin Search
• Linkedin Search
• 50 M documents
• Business need to filter/rank
with third degree cache
• Response time : 30-50 ms.
• How do we solve it.
• Distributed search engine.
• Documents partitioned
• 0 -5M, 5-10M , ..
• Each partition fetch network in
range
• Filter/Rank and return results.
Search & Filter on Third Degree
Data Analytics
• Data is Linkedin primary asset
• Linkedin Data has immense value
• Career trends
• Bubble bursts
• Hot keywords
• find experts/ Influencers
• Big Data VS Better Algorithm
• Big fight in machine learning
community
• personal opinion : big data rocks
if you know your data
Image1 source : http://otec.uoregon.edu/images/data-to-wisdom.gif
Image2 source : http://www.nature.com/ki/journal/v62/n5/images/4493262f1b.gif
Data Analytics Examples
People You May Know
• Shows People you should be
connected to ?
• For 50 M member
• Potential 50 2 = 2500,000 Billion
• Can be Narrowed down
• Triangle closing
• School/Company overlap
• .. Other factors
• secret sauce 
Scalable Architectures
• What is a scalable architecture ?
• Ability to do more if desired
• With minimum effort
• w/o software rearchitecture
• How to scale ?
• Scale Vertically (easy but
costly)
• Scale Horizontally (Harder
but very scalable if done right)
• Divide & Conquer is the only
way to scale horizontally
• Sharding/Partitioning
Scalable Architectures
Respect the elephant
Hadoop Origin?
 “For the last several years, every company involved in building large
web-scale systems has faced some of the same fundamental
challenges. While nearly everyone agrees that the "divide-and-
conquer using lots of cheap hardware" approach to breaking down
large problems is the only way to scale, doing so is not easy “
 Hadoop was born to fulfill this very need
Proprietary & Confidential 6/7/2020 18
What is Hadoop?
• An Open-source Java based implementation of distributed map-
reduce paradigm
 Apache project with very active community
– heavy support from Yahoo
 Distributed Storage + Distributed Processing + Job Management
 Runs on Commodity Hardware (Heterogeneous cluster)
 Scalable, Reliable , Fault tolerant and very easy to learn.
 Yahoo currently running clusters with 10,000 nodes that processes
10TB of compressed data for their production search relevance
processing
 Written by Lucene author Doug Cutting (and others)
 Plethora of valuable sub-projects
• Pig, Hive, HBase, Mahout, Katta
What is Hadoop used for?
• Search Relevance: Yahoo, Amazon, Powerset,
Zevents
• Log Processing: Facebook, Yahoo, Joost, Last.fm
• Recommendation System: Facebook
• Data Warehouse: Facebook, AOL
• Video and Image Analysis: New york times, eyealike
• Prediction models: Quantcast, Goldman Sachs
• Genome Sequencing: University of Maryland.
• Academia : University of Washington, Cornell ,
Stanford, Carnegie Mellon,
Proprietary & Confidential 6/7/2020 20
Why Use Hadoop
• Distributed/parallel programs are very very hard to
write
• Tip of iceberg is the actual program/business logic
• Map/Reduce code
• The hidden iceberg is
• Parallelization (Divide & Conquer)
• Data Management (TB to PB)
• Fault Tolerance
• Scalability
• Data local optimizations
• Monitoring
• Job isolation
Hadoop core components
• HDFS distributed file system
• User space distributed data management
• Not a real file system (no POSIX, not in the
kernel)
• Replication
• Rebalancing
• Very high aggregate throughput
• Files are immutable
• MapReduce layer
• Very simple but powerful programming model
• Move computation to the data
• Cope with hardware failures
• Non-goal: low latency
Map Reduce
• Google paper started it all:
•“MapReduce: Simplified Data Processing on Large Clusters”, OSDI
2004
• Map = per “record” computation (extract/transform phase)
• Each map task gets a block or piece of input data
• Shuffle = copy/collect data together (hidden from users)
• Collect all values for same key together.
• Can provide custom sort implementations.
• Reduce = final computation (aggregation/summarize phase)
• All key, value pairs for one key goes to one reducer.
• Reduce is optional
Map Reduce data flow
Proprietary & Confidential 6/7/2020 24
Map Reduce Example : WordCount
Proprietary & Confidential 6/7/2020 25
Image source : http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
Map Reduce Implementation
• Jobs are divided into small pieces (Tasks)
• Improves load balancing
• Faster recovery from failed nodes
• Master
• What to run?
• Where to run?
• Slaves
• One tasktracker per machine
• Can configure how many mappers / reducers per tasktracker
• Manage task failures by re-starting on different node.
• Speculative execution if a task is slow or have failed earlier
• Maps and reduces are single threaded Individual JVMs
• Kill -9
• Resource Isolation
• Complete cleanup
Map Reduce
• Job tracker
• splits input and assigns to various map tasks
• Schedules and monitors map tasks (heartbeat)
• On completion, schedules reduce tasks
• Task tracker
• Execute map tasks – call mapper for every input record
• Execute reduce tasks – call reducer for every intermediate key, list of
values pair
• Handle partitioning of map outputs
• Handle sorting and grouping of reducer input
Jobtracker
tasktrackertasktrackertasktracker
Input Job (mapper, reducer, input)
Data
transfer
Assign tasks
HDFS II
 Namenode (master)
– Maps file Name to set of blocks
– Maps blocks to list of datanodes where it resides.
– Replication engine for blocks
– Checksum based data coherency checker.
– Issues
 Metadata in memory (single point of failure)
 Secondary name node checkpoints metadata.
 Datanodes (slaves)
– Stores blocks on local file system.
– Stores checksum of blocks
– Read/write data to clients directly.
– Periodically send a report of all existing blocks to Namenode.
Proprietary & Confidential 6/7/2020 28
HDFS I
Hadoop @ Linkedin
• People You May Know
• 40 individual jobs:
• Graph analysis
• School & Company overlap
• Collaborative filtering
• Misc. other factors
• Regression model combines all values
• Content scoring
• Collaborative Filtering
• Sessionization and cause analysis
• Phrase analysis
• Other derived data: company data, matching fields, seniority, etc.
• Search relevance
• Peoplerank
• Link keywords to results based on user behavior
• User bucketing
Data Visualization
• Can be very very useful.
• Saves ton of time
• Very insightful
• Pictures = 1000+ words
• Challenges
• Amount of data is huge
• Need to write visualization for each problem.
• Time spent per visualization is huge
• Hard to make it distributed/scalable
Data Visualization
Questions
33
References
1. Hadoop. http://hadoop.apache.org/
2. Jeffrey Dean and Sanjay Ghemawat. MapReduce:
Simplified Data Processing on Large Clusters.
http://labs.google.com/papers/mapreduce.html
3. http://code.google.com/edu/parallel/index.html
4. http://www.youtube.com/watch?v=yjPBkvYh-ss
5. http://www.youtube.com/watch?v=-vD6PUdf3Js
6. S. Ghemawat, H. Gobioff, and S. Leung. The Google
File System. http://labs.google.com/papers/gfs.html
7. Linkedin.com
8. DJ Patil Talk at Scale Unlimited 2009

Contenu connexe

Tendances

NoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-lessNoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-lessInfiniteGraph
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big dataSteven Francia
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsStéphane Fréchette
 
Considerations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectConsiderations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectAkmal Chaudhri
 
Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)barcelonajug
 
Considerations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectConsiderations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectAkmal Chaudhri
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Preferred Networks
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4jM. David Allen
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIAndrew Brust
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jDebanjan Mahata
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - IntroductionTomy Rhymond
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]Shirshanka Das
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j FundamentalsMax De Marzi
 
Considerations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectConsiderations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectAkmal Chaudhri
 
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014MapR Technologies
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 

Tendances (20)

NoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-lessNoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-less
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big data
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
Considerations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectConsiderations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT project
 
Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)
 
Considerations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectConsiderations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT project
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4j
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BI
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4j
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j Fundamentals
 
Considerations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT projectConsiderations for using NoSQL technology on your next IT project
Considerations for using NoSQL technology on your next IT project
 
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 

Similaire à Large scale computing

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataRoi Blanco
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02BIWUG
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointJoris Poelmans
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundNidhiAhuja30
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...SEAD
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data miningAhmad Ammari
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusersBob Hardaway
 

Similaire à Large scale computing (20)

Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Data Privacy at Scale
Data Privacy at ScaleData Privacy at Scale
Data Privacy at Scale
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data mining
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 

Dernier

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Dernier (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Large scale computing

  • 1. Proprietary & Confidential 6/7/2020 1 Large Scale Computing @ Linkedin Bhupesh Bansal Software Engineer Linkedin
  • 2. The Plan • What Is Linkedin • Large Scale problems @ Linkedin • Scalable Architectures • Hadoop : The mighty elephant • Questions
  • 3. What is Linkedin?  The largest Professional social network:  Some high level statistics: – 50M active users, in 147 Industries and 244 Countries (580+ Regions) – ~10M unique daily visitors – ~25M weekly searches (150 QPS at peak) – ~50M weekly profile views – ~2M connections per day
  • 4. Why should you care?  Control your brand  Market yourself  Find inside connections  Learn from Wisdom of the crowd
  • 6. What is your Identity? How do you want the world to see and know about you?
  • 7. My Network My Strength 1. Explore opportunities before its too late 2. Make your network go far. (well really really far !!) 3. Seek expert advice 4. Stand on the shoulder of giants.
  • 8. Wisdom of your network 1. Slideshare : See slides uploaded by your network 2. Answers : See expert answers & who they are 3. Groups 4. …
  • 9. Large Scale Problems @ Linkedin 9
  • 10. Linkedin Network Cache • Third Degree network • Friends of Friends of Friends • Scales exponentially • first degree: 100 • second-degree : first-degree 2 • third-degree : second-degree 2 • Ranges from 10,000  20 M. • Is Unique for every user • How do we solve it. • Custom graph engine app. • Cache entire third degree for logged in members • Graph Algorithms • Minimize memory footprint of cache • Bit vector optimizations • compressions : p4Delta ..
  • 11. Linkedin Search • Linkedin Search • 50 M documents • Business need to filter/rank with third degree cache • Response time : 30-50 ms. • How do we solve it. • Distributed search engine. • Documents partitioned • 0 -5M, 5-10M , .. • Each partition fetch network in range • Filter/Rank and return results. Search & Filter on Third Degree
  • 12. Data Analytics • Data is Linkedin primary asset • Linkedin Data has immense value • Career trends • Bubble bursts • Hot keywords • find experts/ Influencers • Big Data VS Better Algorithm • Big fight in machine learning community • personal opinion : big data rocks if you know your data Image1 source : http://otec.uoregon.edu/images/data-to-wisdom.gif Image2 source : http://www.nature.com/ki/journal/v62/n5/images/4493262f1b.gif
  • 14. People You May Know • Shows People you should be connected to ? • For 50 M member • Potential 50 2 = 2500,000 Billion • Can be Narrowed down • Triangle closing • School/Company overlap • .. Other factors • secret sauce 
  • 15. Scalable Architectures • What is a scalable architecture ? • Ability to do more if desired • With minimum effort • w/o software rearchitecture • How to scale ? • Scale Vertically (easy but costly) • Scale Horizontally (Harder but very scalable if done right) • Divide & Conquer is the only way to scale horizontally • Sharding/Partitioning
  • 18. Hadoop Origin?  “For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and- conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy “  Hadoop was born to fulfill this very need Proprietary & Confidential 6/7/2020 18
  • 19. What is Hadoop? • An Open-source Java based implementation of distributed map- reduce paradigm  Apache project with very active community – heavy support from Yahoo  Distributed Storage + Distributed Processing + Job Management  Runs on Commodity Hardware (Heterogeneous cluster)  Scalable, Reliable , Fault tolerant and very easy to learn.  Yahoo currently running clusters with 10,000 nodes that processes 10TB of compressed data for their production search relevance processing  Written by Lucene author Doug Cutting (and others)  Plethora of valuable sub-projects • Pig, Hive, HBase, Mahout, Katta
  • 20. What is Hadoop used for? • Search Relevance: Yahoo, Amazon, Powerset, Zevents • Log Processing: Facebook, Yahoo, Joost, Last.fm • Recommendation System: Facebook • Data Warehouse: Facebook, AOL • Video and Image Analysis: New york times, eyealike • Prediction models: Quantcast, Goldman Sachs • Genome Sequencing: University of Maryland. • Academia : University of Washington, Cornell , Stanford, Carnegie Mellon, Proprietary & Confidential 6/7/2020 20
  • 21. Why Use Hadoop • Distributed/parallel programs are very very hard to write • Tip of iceberg is the actual program/business logic • Map/Reduce code • The hidden iceberg is • Parallelization (Divide & Conquer) • Data Management (TB to PB) • Fault Tolerance • Scalability • Data local optimizations • Monitoring • Job isolation
  • 22. Hadoop core components • HDFS distributed file system • User space distributed data management • Not a real file system (no POSIX, not in the kernel) • Replication • Rebalancing • Very high aggregate throughput • Files are immutable • MapReduce layer • Very simple but powerful programming model • Move computation to the data • Cope with hardware failures • Non-goal: low latency
  • 23. Map Reduce • Google paper started it all: •“MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004 • Map = per “record” computation (extract/transform phase) • Each map task gets a block or piece of input data • Shuffle = copy/collect data together (hidden from users) • Collect all values for same key together. • Can provide custom sort implementations. • Reduce = final computation (aggregation/summarize phase) • All key, value pairs for one key goes to one reducer. • Reduce is optional
  • 24. Map Reduce data flow Proprietary & Confidential 6/7/2020 24
  • 25. Map Reduce Example : WordCount Proprietary & Confidential 6/7/2020 25 Image source : http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  • 26. Map Reduce Implementation • Jobs are divided into small pieces (Tasks) • Improves load balancing • Faster recovery from failed nodes • Master • What to run? • Where to run? • Slaves • One tasktracker per machine • Can configure how many mappers / reducers per tasktracker • Manage task failures by re-starting on different node. • Speculative execution if a task is slow or have failed earlier • Maps and reduces are single threaded Individual JVMs • Kill -9 • Resource Isolation • Complete cleanup
  • 27. Map Reduce • Job tracker • splits input and assigns to various map tasks • Schedules and monitors map tasks (heartbeat) • On completion, schedules reduce tasks • Task tracker • Execute map tasks – call mapper for every input record • Execute reduce tasks – call reducer for every intermediate key, list of values pair • Handle partitioning of map outputs • Handle sorting and grouping of reducer input Jobtracker tasktrackertasktrackertasktracker Input Job (mapper, reducer, input) Data transfer Assign tasks
  • 28. HDFS II  Namenode (master) – Maps file Name to set of blocks – Maps blocks to list of datanodes where it resides. – Replication engine for blocks – Checksum based data coherency checker. – Issues  Metadata in memory (single point of failure)  Secondary name node checkpoints metadata.  Datanodes (slaves) – Stores blocks on local file system. – Stores checksum of blocks – Read/write data to clients directly. – Periodically send a report of all existing blocks to Namenode. Proprietary & Confidential 6/7/2020 28
  • 30. Hadoop @ Linkedin • People You May Know • 40 individual jobs: • Graph analysis • School & Company overlap • Collaborative filtering • Misc. other factors • Regression model combines all values • Content scoring • Collaborative Filtering • Sessionization and cause analysis • Phrase analysis • Other derived data: company data, matching fields, seniority, etc. • Search relevance • Peoplerank • Link keywords to results based on user behavior • User bucketing
  • 31. Data Visualization • Can be very very useful. • Saves ton of time • Very insightful • Pictures = 1000+ words • Challenges • Amount of data is huge • Need to write visualization for each problem. • Time spent per visualization is huge • Hard to make it distributed/scalable
  • 34. References 1. Hadoop. http://hadoop.apache.org/ 2. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html 3. http://code.google.com/edu/parallel/index.html 4. http://www.youtube.com/watch?v=yjPBkvYh-ss 5. http://www.youtube.com/watch?v=-vD6PUdf3Js 6. S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. http://labs.google.com/papers/gfs.html 7. Linkedin.com 8. DJ Patil Talk at Scale Unlimited 2009

Notes de l'éditeur

  1. What is batch computing? Cloud example
  2. Data driven features - Examples from google, facebook, etc
  3. Data driven features - Examples from google, facebook, etc
  4. Data driven features - Examples from google, facebook, etc
  5. Data driven features - Examples from google, facebook, etc
  6. Data driven features - Examples from google, facebook, etc
  7. Data driven features - Examples from google, facebook, etc
  8. Data driven features - Examples from google, facebook, etc
  9. Data driven features - Examples from google, facebook, etc
  10. Transition:
  11. http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  12. Show blocks
  13. Talk through other people’s example
  14. Talk through other people’s example
  15. Talk through other people’s example
  16. Talk through other people’s example