©2013 LinkedIn Corporation. All Rights Reserved.
Hive at LinkedIn
Agenda
• LinkedIn Data and its Ecosystem
• Performance Improvements – Avro
• User experiences

LinkedIn Data Sources
• Event Data
– Page Views
– Clicks
– Search queries
• Database Data
– Profile (Users & Companies)
– Connections
• External Data
– Salesforce, DoubleClick

Where's the Data at LinkedIn?
[Diagram: Data Ecosystem at LinkedIn. Labels: Member-facing systems (front-end serving systems); Member Data (Profiles); Espresso and RDBMS; Member Activity (page views, button clicks); Kafka Topics; External Partner Data. Note on slide: "Lots of cool stuff not in this picture!"]
© 2013 LinkedIn, 24 June 2013

Data in Hadoop
• Almost all LinkedIn data is stored in Hadoop
• Tools used
– Hive/HCatalog
– Pig
– Java MapReduce
– Azkaban

Hive Usage
• Use cases
– Ad-hoc queries
– Reporting
– Building platforms: Segmentation Engine, Experimentation Engine
• Users
– Data scientists
– Business analytics
– Security team
– Product team

Hive Challenges
• Performance
– Faster query execution
• Efficient MR* execution plan
– Effective resource usage
– Ensure cluster stability

LinkedIn Hive Initiatives
• Make HCatalog work and deploy it [Ongoing]
• Hive performance improvement (Avro data reading) [Ongoing]
• Stabilize HiveServer2 at LinkedIn [About to start]
• Expand the scope of HCatalog metadata [Planning]

HCatalog Initiatives
• Expand the scope of metadata
– Who creates this data?
– What are the inputs? (helpful for creating data lineage)
– Who is the maintainer of the data?

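
The three questions above (creator, inputs, maintainer) map naturally onto a small metadata record, and the inputs field is what makes lineage computable. The following is only a sketch under stated assumptions: `DatasetMetadata`, `lineage`, and the dataset names in `EXAMPLE_CATALOG` are hypothetical and are not part of HCatalog or of anything shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are assumptions, not HCatalog's."""
    name: str
    creator: str       # who creates this data
    maintainer: str    # who maintains this data
    inputs: list = field(default_factory=list)  # names of upstream datasets


def lineage(catalog, name):
    """Return every upstream dataset reachable from `name`, nearest first."""
    seen, order = set(), []
    queue = list(catalog[name].inputs)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        if current in catalog:  # datasets without an entry are treated as sources
            queue.extend(catalog[current].inputs)
    return order


# Hypothetical catalog, made up purely for illustration.
EXAMPLE_CATALOG = {
    "page_views": DatasetMetadata("page_views", "tracking-job", "data-team"),
    "profiles": DatasetMetadata("profiles", "db-export-job", "data-team"),
    "daily_engagement": DatasetMetadata(
        "daily_engagement", "engagement-job", "analytics-team",
        inputs=["page_views", "profiles"],
    ),
    "segment_report": DatasetMetadata(
        "segment_report", "report-job", "product-team",
        inputs=["daily_engagement"],
    ),
}
```

Calling `lineage(EXAMPLE_CATALOG, "segment_report")` walks the recorded inputs breadth-first, so the direct input comes before the raw sources it was built from.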
What is the Problem?
• Reading an Avro record takes a long time
– 52 μs/record
• Found the hotspot using VisualVM

Improvement #1
• Reduce the number of Schema.equals() calls
• Schema equality checks are required primarily for evolved schemas
• The solution caches results to avoid unnecessary expensive calls
• Results
– Trunk read overhead: 52 μs/record
– Read overhead after this patch: 32 μs/record

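
The caching idea in Improvement #1 can be illustrated with a small model. This is a sketch, not the actual patch: the slides do not show the code, and the real change is in Java. `Schema`, `SchemaEqualityCache`, and `read_records` are hypothetical stand-ins, and the assumption is that the same pair of schema objects is compared once per record.

```python
class Schema:
    """Stand-in for a record schema whose equality check walks every field."""

    equals_calls = 0  # counts deep comparisons so the saving is visible

    def __init__(self, name, fields):
        self.name = name
        self.fields = tuple(fields)

    def deep_equals(self, other):
        Schema.equals_calls += 1
        return self.name == other.name and self.fields == other.fields


class SchemaEqualityCache:
    """Remembers the outcome of comparing a given pair of schema objects."""

    def __init__(self):
        self._results = {}

    def same(self, writer, reader):
        key = (id(writer), id(reader))
        if key not in self._results:
            # Keep references to both schemas so their ids cannot be reused.
            self._results[key] = (writer, reader, writer.deep_equals(reader))
        return self._results[key][2]


def read_records(records, writer, reader, cache):
    """Pass records through when schemas match; otherwise project to reader fields."""
    out = []
    for record in records:
        if cache.same(writer, reader):
            out.append(record)
        else:
            out.append({name: record.get(name) for name in reader.fields})
    return out
```

In this model, reading any number of records with one writer/reader pair performs a single deep comparison; every later record hits the cache. An evolved reader schema is a different object, so it triggers one more comparison and then takes the projection path.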
Improvement #2
• Reduce extra data transformations
• The solution is to provide custom object inspectors
• Results
– Current read overhead: 52 μs/record
– Read overhead after this patch: 30 μs/record

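
In Hive, an object inspector is the interface through which operators read the fields of a row object. The slide does not say which transformations were removed, so the sketch below assumes the extra work is copying every record into an intermediate row before any field is read. It is a simplified model, not Hive's ObjectInspector API: `Record`, `RecordInspector`, `FIELDS`, and the helper functions are all hypothetical.

```python
FIELDS = ("member_id", "url", "referrer")  # hypothetical schema


class Record:
    """Stand-in for a deserialized record that stores values by position."""

    def __init__(self, *values):
        self.values = values


def convert_eagerly(record):
    """Baseline: copy every field into a new row object, whether used or not."""
    return dict(zip(FIELDS, record.values))


class RecordInspector:
    """Reads a single field straight out of the record, only when asked."""

    def __init__(self, fields):
        self._position = {name: i for i, name in enumerate(fields)}

    def get_field(self, record, name):
        return record.values[self._position[name]]


def count_matching(records, inspector, name, wanted):
    """Count records whose field `name` equals `wanted` without copying rows."""
    return sum(
        1 for record in records
        if inspector.get_field(record, name) == wanted
    )
```

A query that touches one column then pays for one lookup per record instead of building a full intermediate row, while both paths return the same field values.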
Final Results
Read overhead per record, as labeled on the bar chart (vertical axis 0 to 60):
– Trunk: 55
– Improvement #1: 32
– Improvement #2: 30
– Combined: 11

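
To put the chart values in relative terms, the short calculation below derives the speedup and percentage reduction of each configuration against trunk. It assumes the chart values are in μs/record, as on the earlier slides; the function names are made up for this example.

```python
# Values as labeled on the chart; the unit (μs/record) is an assumption.
OVERHEAD = {
    "Trunk": 55,
    "Improvement #1": 32,
    "Improvement #2": 30,
    "Combined": 11,
}


def speedup(baseline, improved):
    """How many times faster the improved read path is than the baseline."""
    return baseline / improved


def reduction_pct(baseline, improved):
    """Percentage of the baseline overhead that was removed."""
    return 100.0 * (baseline - improved) / baseline


def summarize(values, baseline_key="Trunk"):
    """Map each non-baseline entry to (speedup, reduction in percent)."""
    base = values[baseline_key]
    return {
        name: (speedup(base, value), reduction_pct(base, value))
        for name, value in values.items()
        if name != baseline_key
    }
```

With these values, `summarize(OVERHEAD)["Combined"]` gives a 5x speedup, which is an 80% reduction in per-record read overhead relative to trunk.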
Hive User Base at LinkedIn
Out of all our Hadoop users:
– 56% never used Hive
– 44% use Hive
– 27% primarily use Hive
32% of Hive jobs were from ad-hoc queries

Who uses Hive and who doesn’t
[Chart: Hive adoption among Hadoop users by job title, on a 0% to 100% axis. Job titles: Data Scientists, Engineers, Product Managers, Customer Support Specialists, Analysts.]

Top concerns about Hive
• Not friendly for long/complex workflows
• Performance, especially for ad-hoc queries
• Steep learning curve for tuning
• Data/UDFs unavailability

Editor's Notes

  1. Hive: ad-hoc queries and reporting, business analytics. Pig: ETL pipelines, production workflows. MR: highly specialized applications. Azkaban (Az): LinkedIn (LI) workflows.
  2. Which process; data operations can detect root cause; email, HTTP address.
  3. Context of the problem