Architecting the Future of Big Data and Search
Eric Baldeschwieler, Hortonworks
e14@hortonworks.com, 19 October 2011
What I Will Cover
§  Architecting the Future of Big Data and
    Search
   •  Lucene, a technology for managing big data
   •  Hadoop, a technology built for search
   •  Could they work together?
§  Topics:
   •    What is Apache Hadoop?
   •    History and Use Cases
   •    Current State
   •    Where Hadoop is Going
   •    Investigating Apache Hadoop and Lucene

What is Apache Hadoop?
Apache Hadoop is…

A set of open source projects owned
by the Apache Software Foundation that
transform commodity computers
and networks into a distributed service
•  HDFS – Stores petabytes of data
   reliably
•  MapReduce – Allows huge
   distributed computations

Key Attributes
•  Reliable and redundant – Doesn’t slow down or lose data even as
   hardware fails
•  Simple and flexible APIs – Our rocket scientists use it directly!
•  Very powerful – Harnesses huge clusters, supports best of breed analytics
•  Batch processing-centric – Hence its great simplicity and speed, not a fit
   for all use cases

More Apache Hadoop Projects
•  Programming Languages: Pig (Data Flow), Hive (SQL)
•  Computation: MapReduce (Distributed Programming Framework)
•  Table Storage: HCatalog (Meta Data), HBase (Columnar Storage)
•  Object Storage: HDFS (Hadoop Distributed File System)
•  Alongside the stack: Ambari (Management), Zookeeper (Coordination)
•  Legend: Core Apache Hadoop, Related Apache Projects
Example Hardware & Network
§  Frameworks share commodity hardware
   •  Storage - HDFS
   •  Processing - MapReduce
§  Network (diagram)
   •  Network Core linked to each Rack Switch by 2 * 10GigE
   •  Each Rack Switch serves a rack of 1-2U servers
§  Typical rack and node
   •  20-40 nodes / rack
   •  16 Cores
   •  48G RAM
   •  6-12 * 2TB disk
   •  1-2 GigE to node
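The rack figures above imply a storage range per rack, which the following minimal sketch works out. The node, disk and disk-size numbers come from the slide. The replication factor of 3 is an assumption (it is the HDFS default, but the slide does not state what this cluster uses), and the function name `rack_capacity_tb` is made up for illustration. File system overhead and space reserved for temporary data are ignored.

```python
# Figures from the slide: 20-40 nodes per rack, 6-12 disks per node, 2 TB disks.
NODES_PER_RACK = (20, 40)
DISKS_PER_NODE = (6, 12)
DISK_TB = 2

# Assumption, not from the slide: each block is stored 3 times
# (the HDFS default replication factor).
REPLICATION = 3


def rack_capacity_tb(nodes, disks_per_node, disk_tb=DISK_TB, replication=REPLICATION):
    """Return (raw TB, TB of unique data after replication) for one rack."""
    raw = nodes * disks_per_node * disk_tb
    return raw, raw / replication


low = rack_capacity_tb(NODES_PER_RACK[0], DISKS_PER_NODE[0])
high = rack_capacity_tb(NODES_PER_RACK[1], DISKS_PER_NODE[1])
print("smallest rack: raw %d TB, unique data %.0f TB" % low)
print("largest rack:  raw %d TB, unique data %.0f TB" % high)
```

Under these assumptions a rack holds between 240 TB and 960 TB raw, or roughly 80 TB to 320 TB of unique data.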
MapReduce
§  MapReduce is a distributed computing programming model
§  It works like a Unix pipeline:
   •  cat input | grep | sort | uniq -c > output
   •  Input | Map | Shuffle & Sort | Reduce | Output
§  Strengths:
   •  Easy to use! Developer just writes a couple of functions
   •  Moves compute to data
      §  Schedules work on HDFS node with data if possible
   •  Scans through data, reducing seeks
   •  Automatic reliability and re-execution on failure
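The "couple of functions" a developer writes can be sketched as follows. This is a single-process illustration of the Input | Map | Shuffle & Sort | Reduce | Output flow in plain Python, not the Hadoop API. The names `mapper`, `reducer` and `run_mapreduce` are made up for this example, and nothing here is distributed or fault tolerant.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce: sum all counts that were grouped under one key."""
    return word, sum(counts)


def run_mapreduce(lines):
    # Map: apply the mapper to every input record.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle & Sort: bring pairs with the same key next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct key.
    return [
        reducer(key, (count for _, count in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


word_counts = run_mapreduce(["big data and search", "Big Data and Hadoop"])
print(word_counts)
```

The sort-then-group step plays the role of `sort | uniq -c` in the Unix analogy. In a real cluster the framework performs that step across machines, so the developer still only supplies the two functions.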




HDFS: Scalable, Reliable, Manageable
§  Scale IO, Storage, CPU
   •  Add commodity servers & JBODs
   •  4K nodes in cluster, 80
§  Fault Tolerant & Easy management
   •  Built in redundancy
   •  Tolerate disk and node failures
   •  Automatically manage addition/removal of nodes
   •  One operator per 8K nodes!!
§  Storage server used for computation
   •  Move computation to data
§  Not a SAN
   •  But high-bandwidth network access to data via Ethernet
§  Immutable file system
   •  Read, Write, sync/flush
      §  No random writes
HBase
§  Hadoop ecosystem “NoSQL store”
   •  Very large tables interoperable with Hadoop
   •  Inspired by Google’s BigTable


§  Features
   •  Multidimensional sorted Map
      §  Table => Row => Column => Version => Value
   •  Distributed column-oriented store
   •  Scale – Sharding etc. done automatically
      §  No SQL, CRUD etc.
      §  billions of rows X millions of columns
   •  Uses HDFS for its storage layer
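The "Table => Row => Column => Version => Value" model can be pictured as nested maps. The sketch below is a toy in-memory illustration of that data model only. It is not the HBase client API: the class `ToyTable`, its methods, and the sample row keys are all hypothetical, and it has no sharding or persistence.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of one table: Row => Column => Version => Value."""

    def __init__(self):
        # row key -> column name -> {version (timestamp): value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, version, value):
        """Store one cell value under an explicit version number."""
        self._rows[row][column][version] = value

    def get(self, row, column, version=None):
        """Return the value at `version`, or the newest one if omitted."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)

    def scan(self, start_row, stop_row):
        """Yield (row, columns) in sorted row-key order, stop_row exclusive."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {c: dict(v) for c, v in self._rows[row].items()}


table = ToyTable()
table.put("com.example/a", "content:html", 1, "<p>old</p>")
table.put("com.example/a", "content:html", 2, "<p>new</p>")
table.put("com.example/b", "content:html", 1, "<p>b</p>")

latest = table.get("com.example/a", "content:html")
older = table.get("com.example/a", "content:html", version=1)
rows = [row for row, _ in table.scan("com.example/a", "com.example/b")]
print(latest, older, rows)
```

Keeping rows in sorted key order is what makes range scans cheap. The toy version sorts on every scan, whereas a real store keeps the data sorted on disk.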
History and use cases


A Brief History
Apache Hadoop
§  Early adopters (2006 – present)
   •  Scale and productize Hadoop
§  Other Internet Companies (2008 – present)
   •  Add tools / frameworks, enhance Hadoop
§  Service Providers (2010 – present)
   •  Provide training, support, hosting
   •  Cloudera, MapR, Microsoft, IBM, EMC, Oracle, …
§  Wide Enterprise Adoption (Nascent / 2011)
   •  Funds further development, enhancements
Early Adopters & Uses
•  analyzing web logs
•  data analytics
•  advertising optimization
•  machine learning
•  text mining
•  web search
•  mail anti-spam
•  content optimization
•  customer trend analysis
•  ad selection
•  video & audio processing
•  data mining
•  user interest prediction
•  social media
CASE STUDY
YAHOO! WEBMAP
§  What is a WebMap?
   •  Gigantic table of information about every web site, page and link Yahoo! knows about
   •  Directed graph of the web
   •  Various aggregated views (sites, domains, etc.)
   •  Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.
§  Why was it ported to Hadoop?
   •  Custom C++ solution was not scaling
   •  Leverage scalability, load balancing and resilience of Hadoop infrastructure
   •  Focus on application vs. infrastructure
© Yahoo 2011
CASE STUDY
WEBMAP PROJECT RESULTS
§  33% time savings over previous system on the same cluster (and Hadoop keeps getting better)
§  Was largest Hadoop application, drove scale
   •  Over 10,000 cores in system
   •  100,000+ maps, ~10,000 reduces
   •  ~70 hours runtime
   •  ~300 TB shuffling
   •  ~200 TB compressed output
§  Moving data to Hadoop increased number of groups using the data
CASE STUDY
YAHOO SEARCH ASSIST™
   •  Database for Search Assist™ is built using Apache Hadoop
   •  Several years of log data
   •  20 steps of MapReduce

                        Before Hadoop     After Hadoop
   Time                 26 days           20 minutes
   Language             C++               Python
   Development Time     2-3 weeks         2-3 days
HADOOP @ YAHOO! TODAY
•  40K+ Servers
•  170 PB Storage
•  5M+ Monthly Jobs
•  1000+ Active users
CASE STUDY
YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
•  Recommended links: +79% clicks vs. randomly selected
•  News Interests: +160% clicks vs. one size fits all
•  Top Searches: +43% clicks vs. editor selected
CASE STUDY
YAHOO! HOMEPAGE
•  Serving Maps
   •  Users - Interests
•  Five Minute Production
•  Weekly Categorization models
§  SCIENCE HADOOP CLUSTER
   »  Machine learning to build ever better categorization models
§  PRODUCTION HADOOP CLUSTER
   »  Identify user interests using Categorization models
§  Diagram flow: USER BEHAVIOR → SCIENCE HADOOP CLUSTER → CATEGORIZATION MODELS (weekly) → PRODUCTION HADOOP CLUSTER (with USER BEHAVIOR) → SERVING MAPS (every 5 minutes) → SERVING SYSTEMS → ENGAGED USERS
Build customized home pages with latest data (thousands / second)
CASE STUDY
YAHOO! MAIL
Enabling quick response in the spam arms race
•  450M mail boxes
•  5B+ deliveries/day
•  Antispam models retrained every few hours on Hadoop
“40% less spam than Hotmail and 55% less spam than Gmail”
(Diagram labels: SCIENCE, PRODUCTION)
Where Hadoop is Going


Adoption Drivers
§  Business drivers
    •  ROI and business advantage from mastering big data
    •  High-value projects that require use of more data
    •  Opportunity to interact with customers at point of procurement
§  Financial drivers
    •  Growing cost of data systems as percentage of IT spend
    •  Cost advantage of commodity hardware + open source
§  Technical drivers
    •  Existing solutions not well suited for volume, variety and velocity of big data
    •  Proliferation of unstructured data
§  Callouts
    •  Gartner predicts 800% data growth over next 5 years
    •  80-90% of data produced today is unstructured
Key Success Factors
§  Opportunity
   •  Apache Hadoop has the potential to become a center of the
      next generation enterprise data platform
   •  My prediction is that 50% of the world’s data will be stored in
      Hadoop within 5 years

§  In order to achieve this opportunity, there is work to do:
   •  Make Hadoop easier to install, use and manage
   •  Make Hadoop more robust (performance, reliability,
      availability, etc.)
   •  Make Hadoop easier to integrate and extend to enable a
      vibrant ecosystem
   •  Overcome current knowledge gaps

§  Hortonworks' mission is to enable Apache Hadoop to
    become the de facto platform and unified distribution for big data


Our Roadmap

Phase 1 – Making Apache Hadoop Accessible (2011)
•  Release the most stable version of Hadoop ever
     •  Hadoop 0.20.205
•  Release directly usable code from Apache
     •  RPMs & .debs…
•  Improve project integration
     •  HBase support


Phase 2 – Next-Generation Apache Hadoop (2012, alphas in Q4 2011)
•  Address key product gaps (HA, Management…)
     •  Ambari
•  Enable ecosystem innovation via open APIs
     •  HCatalog, WebHDFS, HBase
•  Enable community innovation via modular architecture
     •  Next Generation MapReduce, HDFS Federation
Investigating Apache Hadoop and Lucene
Developer Questions
§  We know we want to integrate Lucene into Hadoop
   •  How is this best done?


§  Log & merge problems (search indexes & HBase)
   •  Are there opportunities for Solr and HBase to share?
   •  Knowledge? Lessons learned? Code?


§  Hadoop is moving closer to online
   •  Lower latency and fast batch
      §  Outsource more indexing work to Hadoop?
   •  HBase maturing
      §  Better crawlers, document processing and serving?



Business Questions
§  Users of Hadoop are natural users of Lucene
   •  How can we help them search all that data?


§  Are users of Solr natural users of Hadoop?
   •  How can we improve search with Hadoop?
   •  How many of you use both?


§  What are the opportunities?
   •  Integration points? New projects? Training?
   •  Win-Win if communities help each other


Thank You
§  www.hortonworks.com

§  Twitter: @jeric14





Contenu connexe

Tendances

Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationCloudera, Inc.
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
ROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in RubyROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in RubyRakuten Group, Inc.
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 

Tendances (20)

Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote Presentation
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 
HUG slides on NFS and ODBC
HUG slides on NFS and ODBCHUG slides on NFS and ODBC
HUG slides on NFS and ODBC
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Introduction to h base
Introduction to h baseIntroduction to h base
Introduction to h base
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
ROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in RubyROMA User-Customizable NoSQL Database in Ruby
ROMA User-Customizable NoSQL Database in Ruby
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 

Similaire à Architecting the Future of Big Data & Search - Eric Baldeschwieler

Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsNetajiGandi1
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoopDataWorks Summit
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)Claudiu Barbura
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 

Similaire à Architecting the Future of Big Data & Search - Eric Baldeschwieler (20)

Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 

Plus de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Architecting the Future of Big Data & Search - Eric Baldeschwieler

  • 1. Architecting the Future of Big Data and Search. Eric Baldeschwieler, Hortonworks, e14@hortonworks.com, 19 October 2011
  • 2. What I Will Cover
      • Architecting the Future of Big Data and Search
          • Lucene, a technology for managing big data
          • Hadoop, a technology built for search
          • Could they work together?
      • Topics:
          • What is Apache Hadoop?
          • History and Use Cases
          • Current State
          • Where Hadoop is Going
          • Investigating Apache Hadoop and Lucene
  • 3. What is Apache Hadoop?
  • 4. Apache Hadoop is…
      A set of open source projects owned by the Apache Software Foundation that transforms commodity computers and networks into a distributed service
      • HDFS – Stores petabytes of data reliably
      • MapReduce – Allows huge distributed computations
      Key Attributes
      • Reliable and redundant – Doesn’t slow down or lose data even as hardware fails
      • Simple and flexible APIs – Our rocket scientists use it directly!
      • Very powerful – Harnesses huge clusters, supports best of breed analytics
      • Batch processing-centric – Hence its great simplicity and speed, not a fit for all use cases
  • 5. More Apache Hadoop Projects (layer diagram, top to bottom)
      • Programming Languages: Pig (Data Flow), Hive (SQL)
      • Computation: MapReduce (Distributed Programming Framework)
      • Table Storage: HCatalog (Meta Data), HBase (Columnar Storage)
      • Object Storage: HDFS (Hadoop Distributed File System)
      • Alongside all layers: Ambari (Management), Zookeeper (Coordination)
      • Legend: Core Apache Hadoop, Related Apache Projects
  • 6. Example Hardware & Network
      • Frameworks share commodity hardware
          • Storage - HDFS
          • Processing - MapReduce
      • Diagram: network core, 2 * 10GigE to each rack switch, 1-2U servers under each rack switch
      • 20-40 nodes / rack
      • 16 cores
      • 48G RAM
      • 6-12 * 2TB disk
      • 1-2 GigE to node
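As a rough worked example of the per-rack figures on slide 6, the raw disk capacity of a rack follows directly from the node and disk counts. The replication factor of 3 used below is HDFS's default rather than a figure from the slide, the helper names are hypothetical, and the sketch ignores space reserved for the operating system and intermediate data.

```python
def raw_rack_storage_tb(nodes, disks_per_node, disk_tb=2):
    """Raw disk capacity of one rack in TB (2 TB disks, as on the slide)."""
    return nodes * disks_per_node * disk_tb


def usable_rack_storage_tb(raw_tb, replication=3):
    """Capacity left after replication; 3 is an assumption, not a slide figure."""
    return raw_tb / replication


if __name__ == "__main__":
    # Low end: 20 nodes with 6 disks each; high end: 40 nodes with 12 disks each.
    for nodes, disks in [(20, 6), (40, 12)]:
        raw = raw_rack_storage_tb(nodes, disks)
        print(nodes, disks, raw, usable_rack_storage_tb(raw))
```

Under these assumptions a rack holds between 240 TB and 960 TB of raw disk, or roughly 80 TB to 320 TB of replicated data.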
  • 7. MapReduce
      • MapReduce is a distributed computing programming model
      • It works like a Unix pipeline:
          • cat input | grep | sort | uniq -c > output
          • Input | Map | Shuffle & Sort | Reduce | Output
      • Strengths:
          • Easy to use! Developer just writes a couple of functions
          • Moves compute to data
              • Schedules work on HDFS node with data if possible
          • Scans through data, reducing seeks
          • Automatic reliability and re-execution on failure
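The Input | Map | Shuffle & Sort | Reduce | Output pipeline on slide 7 can be sketched in a single process. This is a minimal illustration of the programming model, not Hadoop's API: the function names (map_phase, shuffle_and_sort, reduce_phase, run_job) and the word-count task are assumptions chosen for the example, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle & Sort: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Chain the three phases, like stages of a Unix pipeline."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    for word, count in run_job(["hadoop lucene", "lucene solr", "hadoop hadoop"]):
        print(word, count)
```

The developer-written parts correspond to map_phase and reduce_phase; in a real cluster the framework would own the grouping step, schedule map tasks near the data, and re-run failed tasks.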
  • 8. HDFS: Scalable, Reliable, Manageable
      • Scale IO, Storage, CPU
          • Add commodity servers & JBODs
          • 4K nodes in cluster, 80
      • Fault Tolerant & Easy management
          • Built in redundancy
          • Tolerate disk and node failures
          • Automatically manage addition/removal of nodes
          • One operator per 8K nodes!!
      • Storage server used for computation
          • Move computation to data
      • Not a SAN
          • But high-bandwidth network access to data via Ethernet
      • Immutable file system
          • Read, Write, sync/flush
          • No random writes
  • 9. HBase
      • Hadoop ecosystem “NoSQL store”
          • Very large tables interoperable with Hadoop
          • Inspired by Google’s BigTable
      • Features
          • Multidimensional sorted Map
              • Table => Row => Column => Version => Value
          • Distributed column-oriented store
          • Scale – Sharding etc. done automatically
              • No SQL, CRUD etc.
              • Billions of rows × millions of columns
          • Uses HDFS for its storage layer
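The "multidimensional sorted map" on slide 9 (Table => Row => Column => Version => Value) can be pictured as nested dictionaries. The ToyTable class below is a hypothetical in-memory stand-in written for this illustration; it is not the HBase client API, and it leaves out sharding, persistence on HDFS, and everything else that makes HBase distributed.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for the row => column => version => value layout."""

    def __init__(self):
        # row key -> column name -> {version (e.g. a timestamp): value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, version):
        """Store one value under an explicit version number."""
        self._rows[row][column][version] = value

    def get(self, row, column, version=None):
        """Return the value at `version`, or the newest one if it is omitted."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)

    def scan(self, start_row, stop_row):
        """Yield (row, column, newest value) in sorted row order.

        start_row is inclusive and stop_row is exclusive.
        """
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self.get(row, column)


if __name__ == "__main__":
    table = ToyTable()
    table.put("com.example/a", "anchor:text", "old", version=1)
    table.put("com.example/a", "anchor:text", "new", version=2)
    table.put("org.example/b", "anchor:text", "other", version=1)
    print(table.get("com.example/a", "anchor:text"))
    print(list(table.scan("com", "org")))
```

Keeping rows in sorted key order is what makes range scans cheap in this model; here the sort happens on every scan, which is only acceptable for a toy.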
  • 10. History and Use Cases
  • 11. A Brief History (of Apache Hadoop)
      • Early adopters, 2006 – present: Scale and productize Hadoop
      • Other Internet Companies, 2008 – present: Add tools / frameworks, enhance Hadoop
      • Service Providers (Cloudera, MapR, Microsoft, IBM, EMC, Oracle, …), 2010 – present: Provide training, support, hosting
      • Wide Enterprise Adoption, Nascent / 2011: Funds further development, enhancements
  • 12. Early Adopters & Uses: data analytics, analyzing web logs, advertising optimization, machine learning, text mining, web search, mail anti-spam, content optimization, customer trend analysis, ad selection, video & audio processing, data mining, user interest prediction, social media
  • 13. CASE STUDY YAHOO! WEBMAP (© Yahoo 2011)
      • What is a WebMap?
          • Gigantic table of information about every web site, page and link Yahoo! knows about
          • Directed graph of the web
          • Various aggregated views (sites, domains, etc.)
          • Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.
      • Why was it ported to Hadoop?
          • Custom C++ solution was not scaling
          • Leverage scalability, load balancing and resilience of Hadoop infrastructure
          • Focus on application vs. infrastructure
  • 14. CASE STUDY WEBMAP PROJECT RESULTS
      • 33% time savings over previous system on the same cluster (and Hadoop keeps getting better)
      • Was largest Hadoop application, drove scale
          • Over 10,000 cores in system
          • 100,000+ maps, ~10,000 reduces
          • ~70 hours runtime
          • ~300 TB shuffling
          • ~200 TB compressed output
      • Moving data to Hadoop increased number of groups using the data
  • 15. CASE STUDY YAHOO SEARCH ASSIST™
      • Database for Search Assist™ is built using Apache Hadoop
      • Several years of log data
      • 20 steps of MapReduce
      • Before Hadoop vs. after Hadoop:
          • Time: 26 days vs. 20 minutes
          • Language: C++ vs. Python
          • Development time: 2-3 weeks vs. 2-3 days
  • 16. HADOOP @ YAHOO! TODAY
      • 40K+ servers
      • 170 PB storage
      • 5M+ monthly jobs
      • 1000+ active users
  • 17. CASE STUDY YAHOO! HOMEPAGE
      • Personalized for each visitor
      • Result: twice the engagement
          • Recommended links: +79% clicks vs. randomly selected
          • News Interests: +160% clicks vs. one size fits all
          • Top Searches: +43% clicks vs. editor selected
  • 18. CASE STUDY YAHOO! HOMEPAGE (data flow diagram)
      • Science Hadoop cluster: machine learning on user behavior to build ever better categorization models (weekly)
      • Production Hadoop cluster: identify user interests from user behavior using the categorization models, producing serving maps (every 5 minutes)
      • Serving systems: build customized home pages with latest data (thousands / second) for engaged users
      • Serving maps: users - interests
      • Cadence: five minute production, weekly categorization models
  • 19. CASE STUDY YAHOO! MAIL
      • Enabling quick response in the spam arms race
      • 450M mail boxes
      • 5B+ deliveries/day
      • Antispam models retrained every few hours on Hadoop
      • “40% less spam than Hotmail and 55% less spam than Gmail”
      • Diagram labels: Science, Production
  • 20. Where Hadoop is Going
  • 21. Adoption Drivers
      • Business drivers
          • ROI and business advantage from mastering big data
          • High-value projects that require use of more data
          • Opportunity to interact with customers at point of procurement
      • Financial drivers
          • Growing cost of data systems as percentage of IT spend
          • Cost advantage of commodity hardware + open source
      • Technical drivers
          • Existing solutions not well suited for volume, variety and velocity of big data
          • Proliferation of unstructured data
      • Callouts
          • Gartner predicts 800% data growth over next 5 years
          • 80-90% of data produced today is unstructured
  • 22. Key Success Factors
      • Opportunity
          • Apache Hadoop has the potential to become a center of the next generation enterprise data platform
          • My prediction is that 50% of the world’s data will be stored in Hadoop within 5 years
      • In order to achieve this opportunity, there is work to do:
          • Make Hadoop easier to install, use and manage
          • Make Hadoop more robust (performance, reliability, availability, etc.)
          • Make Hadoop easier to integrate and extend to enable a vibrant ecosystem
          • Overcome current knowledge gaps
      • Hortonworks’ mission is to enable Apache Hadoop to become the de facto platform and unified distribution for big data
  • 23. Our Roadmap
      • Phase 1 – Making Apache Hadoop Accessible (2011)
          • Release the most stable version of Hadoop ever: Hadoop 0.20.205
          • Release directly usable code from Apache: RPMs & .debs…
          • Improve project integration: HBase support
      • Phase 2 – Next-Generation Apache Hadoop (2012; Alphas in Q4 2011)
          • Address key product gaps (HA, Management…): Ambari
          • Enable ecosystem innovation via open APIs: HCatalog, WebHDFS, HBase
          • Enable community innovation via modular architecture: Next Generation MapReduce, HDFS Federation
  • 24. Investigating Apache Hadoop and Lucene
  • 25. Developer Questions
      • We know we want to integrate Lucene into Hadoop
          • How is this best done?
      • Log & merge problems (search indexes & HBase)
          • Are there opportunities for Solr and HBase to share?
          • Knowledge? Lessons learned? Code?
      • Hadoop is moving closer to online
          • Lower latency and fast batch
      • Outsource more indexing work to Hadoop?
          • HBase maturing
      • Better crawlers, document processing and serving?
  • 26. Business Questions
      • Users of Hadoop are natural users of Lucene
          • How can we help them search all that data?
      • Are users of Solr natural users of Hadoop?
          • How can we improve search with Hadoop?
          • How many of you use both?
      • What are the opportunities?
          • Integration points? New projects? Training?
          • Win-Win if communities help each other