TO INFINITY AND BEYOND
Pranav Prakash
in.linkedin.com/in/prakashpranav
Search @LinkedIn
Hari Prasanna
in.linkedin.com/in/mostlycached
BigData @LinkedIn
The story of how solving one problem the OpenSource way
opened doors to so much more
OpenSource Chain Reaction
How “it” begins
How “it” grows
How “it” contributes
LUCENE
Information Retrieval Library
Started in 1999 as a SourceForge.net project
Joined the Apache Jakarta family in 2001
Became an Apache Top-Level Project in 2005
Users include LinkedIn, Twitter, Comcast
LUCENE
IR requirements
What would you do next?
Be better at searching
Crawl the web
Web Wrapper around Lucene
Full Text Search, NRT Indexing
Faceted Search, Clustering
NUTCH
Web Crawler
Billions of pages on the internet
Alternative to commercial search engines
From a single tool to an ecosystem
• Breaking away from the initial problem statement
• The Google factor - GFS (2003), BigTable (2006), Pregel (2009) leading to HDFS, HBase and Giraph
• The thrill and chaos of working with alpha software - from dealing with compatibility issues to being a part of active development
• Interoperability between various systems
• Ever-widening scope of the project and leveraging other tools in the ecosystem
Ecosystem
Hadoop
• Features:
  • Distributed storage - HDFS
  • Distributed processing - MapReduce
  • Fault tolerance
  • Horizontal scalability
• Comparisons:
  • RDBMS
  • Grid computing
• Use cases:
  • Analytics (trends, predictions, summaries, etc.)
  • Searching and indexing
HBase
• Features:
  • Column-oriented storage
  • Horizontal scalability
  • Low-latency reads
  • MapReduce support
  • SQL support with Phoenix
  • Coprocessors and secondary indexes
• RDBMS vs HBase
• Use cases:
  • Facebook Messages
  • Monitoring with OpenTSDB
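The column-oriented storage and low-latency reads listed for HBase can be pictured as a sorted map of row keys to columns. The following is only a toy in-memory sketch under that assumption: `ColumnStoreSketch`, its methods, and the `user#message` row-key layout are hypothetical names invented for illustration, not the HBase client API, and it leaves out timestamps, versions, regions and persistence.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: row key -> "family:qualifier" -> value, with rows kept sorted by key.
public class ColumnStoreSketch {
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Rows are sparse: a row stores only the columns that were actually written.
    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Point read by row key and column; returns null when the cell is absent.
    public String get(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Range scan over sorted row keys: startRow inclusive, stopRow exclusive.
    public NavigableMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return Collections.unmodifiableNavigableMap(rows.subMap(startRow, true, stopRow, false));
    }

    public static void main(String[] args) {
        ColumnStoreSketch store = new ColumnStoreSketch();
        // Hypothetical row-key layout: user id, then message id.
        store.put("user1#msg001", "msg:body", "hello");
        store.put("user1#msg002", "msg:body", "world");
        store.put("user2#msg001", "msg:body", "hi");
        System.out.println(store.get("user1#msg001", "msg:body"));
        System.out.println(store.scan("user1", "user2").keySet());
    }
}
```

Because rows are sorted by key, all messages of one user sit next to each other, so a range scan reads them without touching other users' rows. In this sketch that is the reason the row key starts with the user id.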
Data Processing
Vanilla MapReduce
Higher Abstractions
• Pig - data flow language
• Hive - SQL to MapReduce adapter
• Cascading - Pipeline primitives and other powerful abstractions
• Even higher abstractions with Cascalog (Cascading with Datalog-style queries in Clojure), PigPen (Clojure for Pig) and Pig libraries like DataFu
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Example 2-3 shows the implementation of our map method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
Figure 2-1. MapReduce logical data flow
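The three pieces named in the Java MapReduce excerpt (a map function, a reduce function, and code to run the job) can be sketched in plain Java without a cluster. This is a minimal single-process illustration of the logical data flow, not the Hadoop `Mapper`/`Reducer` API and not the book's Example 2-3. The class name `MaxTemperatureSketch` and the `year,temperature` record layout are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of map -> shuffle -> reduce for a maximum temperature per year.
public class MaxTemperatureSketch {

    // Map step: turn one input record into a (year, temperature) pair.
    // Assumed record layout for this sketch: "year,temperature".
    public static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        return Map.entry(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    // Shuffle step: group every emitted value under its key, sorted by key.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse all values for one key into their maximum.
    public static int reduce(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Driver: the "code to run the job", wiring the three steps together.
    public static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = List.of("1950,0", "1950,22", "1950,-11", "1949,111", "1949,78");
        System.out.println(run(records)); // prints {1949=111, 1950=22}
    }
}
```

In a real Hadoop job the framework performs the shuffle and spreads the map and reduce calls across machines. Here everything runs in one loop so the data flow is easy to follow.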
• Data collection, aggregation and forwarding with Kafka, Flume, Scribe.
• Real-time stream processing with Storm to enable online machine learning and real-time analytics at Twitter and Groupon.
• Graph processing of a trillion edges at Facebook with Apache Giraph
Leveraging “Big Data”
• Quickstarting with the Cloudera distribution
• Getting one step through the door - SlideShare’s journey
• Can your app survive without it? - Raising your bar
• Programmer, Administrator, DBA, Data Scientist - what hat are you wearing today?
• The road ahead
• Keeping track of the developments and giving back
In the Wild
• Scientific Research - SciHadoop, decoding DNA
• Finance - Fraud Detection, Algorithmic Trading, Risk Management
• Web - Network Analysis, Recommendation Engines, Personalization
• Government - Election campaigns, intelligence systems
• Supply chain optimization, weather forecasting
How an open source project led to new opportunities and an entire ecosystem
