Building Data Products using Hadoop at LinkedIn
Who am I?
People You May Know
Profile Stats: WVMP
Viewers of this profile also ...
Skills
InMaps
Data Products: Key Ideas
Data Products: Challenges
Outline
Systems and Tools
People You May Know: How do people know each other? (Diagram: Alice, Bob, Carol.) Triangle closing: Prob(Bob knows Carol) ~ the number of common connections.
Triangle Closing in Pig
    -- connections in (source_id, dest_id) format in both directions
    connections = LOAD 'connections' USING PigStorage();
    group_conn = GROUP connections BY source_id;
    -- generatePair is a user-defined function (not a Pig built-in);
    -- it presumably pairs up the connections of each member
    pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) AS (id1, id2);
    common_conn = GROUP pairs BY (id1, id2);
    common_conn = FOREACH common_conn GENERATE
        FLATTEN(group) AS (source_id, dest_id),
        COUNT(pairs) AS common_connections;
    STORE common_conn INTO 'common_conn' USING PigStorage();
Pig Overview
Triangle Closing Example (Alice, Bob, Carol)
Step 1: connections = LOAD 'connections' USING PigStorage();
Step 2: group_conn = GROUP connections BY source_id;
Step 3: pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) AS (id1, id2);
Step 4: common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE FLATTEN(group) AS (source_id, dest_id), COUNT(pairs) AS common_connections;
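The triangle-closing computation from the Pig script can be sketched in plain Python on a toy graph. This is only an illustration, not the code used at LinkedIn. The function name `common_connection_counts` and the fourth member, Dave, are hypothetical. The sketch assumes `generatePair` pairs up the connections of each member, and it emits each unordered pair once, whereas the Pig job may emit both orderings.

```python
from collections import Counter, defaultdict
from itertools import combinations


def common_connection_counts(edges):
    """Count common connections for every pair of members.

    `edges` holds (source_id, dest_id) tuples in both directions,
    mirroring the input format described for the Pig script.
    """
    # Equivalent of: GROUP connections BY source_id
    neighbors = defaultdict(set)
    for source_id, dest_id in edges:
        neighbors[source_id].add(dest_id)

    # Pair up each member's connections (the assumed role of generatePair),
    # then count how often each pair occurs (GROUP pairs BY (id1, id2) + COUNT).
    counts = Counter()
    for dests in neighbors.values():
        for id1, id2 in combinations(sorted(dests), 2):
            counts[(id1, id2)] += 1
    return counts


# Hypothetical toy graph: Alice and Dave are each connected to Bob and Carol.
edges = [
    ("Alice", "Bob"), ("Bob", "Alice"),
    ("Alice", "Carol"), ("Carol", "Alice"),
    ("Dave", "Bob"), ("Bob", "Dave"),
    ("Dave", "Carol"), ("Carol", "Dave"),
]
counts = common_connection_counts(edges)
print(counts[("Bob", "Carol")])  # 2 common connections: Alice and Dave
```

The sketch does not filter out pairs that are already connected; in this toy graph no such pair arises.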
Our Workflow: triangle-closing, top-n, push-to-prod
Our Workflow: triangle-closing, top-n, push-to-prod, remove connections, push-to-qa
PYMK Workflow
Workflow Requirements: Azkaban
Sample Azkaban Job Spec
Azkaban Workflow
Production Storage
Voldemort Storage
Data Cycle
Voldemort RO Store
Data Quality
Performance
Things Covered
SNA Team
Questions?

Building Data Products using Hadoop at LinkedIn - Mitul Tiwari

  • 1.
  • 3.
  • 6. Viewers of this profile also ...
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. People You May Know Alice Bob Carol How do people know each other?
  • 19. People You May Know Alice Bob Carol How do people know each other?
  • 20. People You May Know Alice Bob Carol Triangle closing How do people know each other?
  • 21. People You May Know Alice Bob Carol Triangle closing Prob(Bob knows Carol) ~ the # of common connections How do people know each other?
  • 22. Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage();
  • 23.
  • 24. Triangle Closing in Pig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage();
  • 35. Our Workflow triangle-closing top-n push-to-prod
  • 38. Our Workflow triangle-closing top-n push-to-prod remove connections
  • 39. Our Workflow triangle-closing top-n push-to-prod remove connections push-to-qa

Editor's notes

  1. Hi, I am Mitul Tiwari. Today I am going to talk about building data-driven products using Hadoop at LinkedIn.
  2. I am part of the Search, Network, and Analytics team at LinkedIn, and I work on data-driven products such as People You May Know.
  3. Let me illustrate with a few examples of data products at LinkedIn.
  4. LinkedIn is the second largest social network for professionals with more than 100 million members. People You May Know (PYMK) is a large-scale recommendation system that helps you connect with others. Basically, PYMK is a link prediction problem, where we analyze billions of edges to recommend possible connections to you. A big big-data problem!
  5. Another example of a data product at LinkedIn is “Profile Stats” or “Who Viewed My Profile”. Profile Stats provides analytics about your profile on LinkedIn: who viewed your profile, the top search queries leading to your profile, the number of profile views per day or week, the location of each visitor, and so on. With billions of pageviews per month, Profile Stats is another big data problem.
  6. Another example of a data product at LinkedIn is “Viewers of this profile also viewed these profiles”, a collaborative filtering way of suggesting profiles.
  7. Topic pages for skills.
  8. Visualize your connections. InMaps clusters your connections based on how densely they are connected to each other.
  9. The key ideas behind these data products are Recommendations, Analytics, Insight, and Visualization.
  10. There are some challenges in building data-driven products at LinkedIn. A naive implementation of PYMK may generate 120 million × 120 million pairs, which is 14,400 trillion pairs. So you have to be smart about it. So which data product would you like me to build during this talk?
  11. Here is a Pig script to do triangle closing, that is, to find the number of common connections between any pair of members.
  12. So how many of you are familiar with Pig? Let me review some Pig constructs for those who are not very familiar with it.
  13. First you load the connections data, which is in bidirectional pair format: each direction of an edge is represented by a pair of member ids.
  14. Then we group connection pairs by source_id to aggregate all connections for each member.
  15. From the aggregated connections we generate pairs of members (id1, id2) who are friends-of-friends through a source_id.
  16. Now we group by (id1, id2) to aggregate all common connections, and count to find the number of common connections.
  17. Finally, we store common connections data in HDFS.
  18. Let me illustrate triangle closing through our running example. First, we load each direction of an edge, represented by a pair.
  19. Then we group connection pairs by source_id to aggregate all connections for each member.
  20. From the aggregated connections we generate pairs of members (id1, id2) who are friends-of-friends through a source_id.
  21. Finally we group by (id1, id2) to aggregate all common connections, and count to find the number of common connections.
  22. After we are done with triangle closing we can list each member's friends-of-friends ordered by the number of common connections.
  23. Since there might be too many people who are your friends-of-friends, you might want to select the top n from that list. For example, there are more than one hundred fifty thousand people who are my friends-of-friends.
  24. Next you need to push this data out to production. So that’s a simple workflow.
  25. I just described how you can build People You May Know by doing triangle-closing and finding out friends-of-friends, and the number of common connections between them. As you just saw there might be multiple jobs dependent on each other that you have to run in that order. So how do we manage our workflow?
  26. We need to ensure that we are showing good quality data to our members. First, we verify that data transfer between HDFS and the production system is done properly. Second, we push data to a QA store with a viewer to check for any blatant mistakes. Third, we can explain any PYMK recommendation: how and why that recommendation is appearing. Fourth, we have ways to roll back in case something goes wrong. And finally, we have unit tests in place to check that things are processed as we desire.
  27. First, we can improve performance by 50% by exploiting symmetry in our triangle closing. If Bob is a friend-of-friend of Carol then Carol is a friend-of-friend of Bob. Second, there are supernodes in our social graph. For example, Barack Obama has more than 10,000 connections on LinkedIn. If we generate friend-of-friend pairs through his connections, there will be more than 100 million pairs. Third, we can sample a certain number of connections to decrease the number of friend-of-friend pairs, and randomize the sample so that we generate different pairs every day.
  28. OSCON Data award for open source contributions.
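
The triangle-closing steps walked through in notes 13 to 21 can be sketched in plain Python for readers who do not have a Pig cluster at hand. This is only an illustration under assumptions: the function name `triangle_closing` is hypothetical, and the pair-generation loop is a stand-in for the `generatePair` UDF, whose exact behavior is not shown in the slides (here it is assumed to emit every ordered pair of a member's connections).

```python
from collections import Counter, defaultdict
from itertools import permutations


def triangle_closing(connections):
    """Count common connections for every friend-of-friend pair.

    `connections` holds (source_id, dest_id) edges in both directions.
    """
    # Equivalent of: GROUP connections BY source_id
    neighbors = defaultdict(set)
    for source_id, dest_id in connections:
        neighbors[source_id].add(dest_id)

    # Assumed stand-in for generatePair: every ordered pair (id1, id2)
    # of members who share this source_id as a connection.
    common = Counter()
    for dests in neighbors.values():
        for id1, id2 in permutations(sorted(dests), 2):
            # Equivalent of: GROUP pairs BY (id1, id2) followed by COUNT
            common[(id1, id2)] += 1
    return common


# Running example: Alice is connected to both Bob and Carol.
edges = [
    ("Alice", "Bob"), ("Bob", "Alice"),
    ("Alice", "Carol"), ("Carol", "Alice"),
]
common_conn = triangle_closing(edges)
```

On this input, Bob and Carol end up with one common connection (Alice) in each direction, which matches the triangle-closing intuition from the slides.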
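
Notes 22 and 23 describe ranking each member's friends-of-friends by common connections and keeping the top n, and the workflow slides add a "remove connections" step. A minimal sketch of both follows. It assumes that "remove connections" means dropping pairs who are already connected, and the function name `top_n_recommendations` is hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_recommendations(common_conn, connections, n):
    """Return each member's n best friend-of-friend candidates.

    `common_conn` maps (source_id, dest_id) to a common-connection count;
    `connections` lists the existing (source_id, dest_id) edges.
    """
    existing = set(connections)
    candidates = defaultdict(list)
    for (source_id, dest_id), count in common_conn.items():
        # Assumed meaning of "remove connections": skip existing edges.
        if (source_id, dest_id) not in existing:
            candidates[source_id].append((count, dest_id))
    # Keep only the n candidates with the most common connections.
    return {
        member: [dest for _, dest in heapq.nlargest(n, scored)]
        for member, scored in candidates.items()
    }


counts = {
    ("Bob", "Carol"): 3,
    ("Bob", "Dave"): 5,
    ("Bob", "Erin"): 1,
    ("Bob", "Alice"): 2,
}
existing_edges = [("Bob", "Alice"), ("Alice", "Bob")]
recommendations = top_n_recommendations(counts, existing_edges, n=2)
```

Using `heapq.nlargest` avoids sorting the full candidate list, which matters when a member has a very large number of friends-of-friends.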
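
Note 25 raises the question of running dependent jobs in the right order. The sketch below shows the general idea with Python's standard `graphlib` module (Python 3.9 or later). The job names come from the workflow slides, but the dependency edges between them are an assumption made for illustration, and this is not the workflow tool used in the talk.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on.
# These dependencies are assumed; the slides only list the job names.
workflow = {
    "triangle-closing": set(),
    "remove connections": {"triangle-closing"},
    "top-n": {"remove connections"},
    "push-to-qa": {"top-n"},
    "push-to-prod": {"push-to-qa"},
}

# A topological order guarantees every job runs after its dependencies.
run_order = list(TopologicalSorter(workflow).static_order())
```

A real scheduler would also handle retries, failures, and parallel branches. The ordering shown here is only the core dependency logic.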
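
The performance ideas in note 27 (exploiting symmetry and sampling the connections of supernodes) can be sketched as follows. The function name `sampled_triangle_closing`, the `max_fanout` cap, and the `seed` parameter are hypothetical. Passing a different seed each day is one assumed way to get different sampled pairs every day.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def sampled_triangle_closing(connections, max_fanout, seed=None):
    """Count common connections once per unordered pair, capping supernodes."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for source_id, dest_id in connections:
        neighbors[source_id].add(dest_id)

    common = Counter()
    for dests in neighbors.values():
        dests = sorted(dests)
        # Supernode handling: sample at most max_fanout connections.
        if len(dests) > max_fanout:
            dests = sorted(rng.sample(dests, max_fanout))
        # Symmetry: emit each pair once with id1 < id2 instead of twice.
        for id1, id2 in combinations(dests, 2):
            common[(id1, id2)] += 1
    return common


# A supernode (member 0) with 100 connections, edges in both directions.
star = [(0, i) for i in range(1, 101)] + [(i, 0) for i in range(1, 101)]
sampled = sampled_triangle_closing(star, max_fanout=10, seed=42)
```

With the cap at 10, the supernode contributes 45 pairs instead of the 4,950 unordered pairs its full connection list would generate.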