The Search Is Over:
Integrating Solr and Hadoop to
Simplify Big Data Analytics
©MapR Technologies - Confidential   1
Evolution of Search
     Documents
       –   Models
       –   Feature Selection
     Content Relationships
       –   Page Rank, etc.
       –   Organization
     User Interaction
       –   Clicks
       –   Ratings/Reviews
       –   Learning to Rank
       –   Social Graph
     Queries
       –   Phrases
       –   NLP
Search Discovery and Analytics
[Diagram: Search, Analytics, Discovery]
Data is Growing Quickly
     Business Analytics Requires a New Approach
     Data Volume Growing 44x (chart label: IDC Digital Universe Study 2011)
       –   2010: 1.2 Zettabytes
       –   2020: 35.2 Zettabytes
     Data is Growing Faster than Moore’s Law
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
MapReduce: A Paradigm Shift
     Distributed computing platform
       –    Large clusters
       –    Commodity hardware
     Pioneered at Google
       –    Bigtable and Google File System
     Commercially available as Hadoop




Hadoop Explosion




How does Map/Reduce work?
1.      Map
        –           Spread data across servers based on key/value pairs
        –           Each node independently scans local data
2.      Servers produce Map results
3.      Reduce - combine/merge Map results
4.      Process complete or Map a new function

     Like shuffling multiple decks of playing cards
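
The steps above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases using a word count, not Hadoop code; the function names and the sample chunks are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: scan one chunk of local data and emit (key, value) pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_outputs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, already split into one chunk per "server".
chunks = [
    ["solr indexes text", "hadoop stores data"],
    ["hadoop runs map reduce", "solr searches text"],
]

# Each chunk is scanned independently, as each node would scan its local data.
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle(mapped))
```

In a real cluster the shuffle moves data between machines; here it is only a dictionary grouping, which is enough to show why the card-shuffling analogy fits.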

The Cost of Enterprise Storage

                          SAN Storage       NAS Filers        Local Storage
     Cost per gigabyte    $2 - $10          $1 - $5           $0.05
     $1M gets:
       Capacity           0.5 petabytes     1 petabyte        20 petabytes
       IOPS               200,000           400,000           10,000,000
       Throughput         1 Gbyte/sec       2 Gbytes/sec      800 Gbytes/sec
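
The capacity figures can be checked with simple arithmetic. The sketch below assumes decimal units (1 petabyte = 1,000,000 gigabytes) and uses the lowest price quoted for each tier, which is what the "$1M gets" capacities appear to be based on; the function name is made up for the example.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB

def capacity_pb(budget_usd, usd_per_gb):
    """Petabytes of raw capacity a budget buys at a given price per gigabyte."""
    return budget_usd / usd_per_gb / GB_PER_PB

budget = 1_000_000
# Lowest per-gigabyte price listed for each storage tier.
prices = {"SAN": 2.00, "NAS": 1.00, "Local": 0.05}
capacities = {tier: capacity_pb(budget, price) for tier, price in prices.items()}
```

At the upper end of the SAN and NAS price ranges the same budget buys one fifth of the listed capacity.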

Deep Object Store
    Billions and Billions of Files
    For some use cases the limit is not the storage
     capacity but the number of objects
     – Messages
     – Attachments
     – Images
     – Recordings
    Provides a deep storage pool that is analytic ready
     – Store it until you need it
     – Derive secondary value from analytic processing
    Makes more sense to perform analytics where the data is stored
     and send only the results over the network

Problems with Integrating Solr with Hadoop

     Simple to integrate with Hadoop as a data source
     Difficult to integrate distributed search and scale
     SolrCloud simplifies Sharding and Replication coordination
     Integration limitations based on capabilities of large scale storage
       –   High availability
       –   Data protection
       –   Ease of Access




Sharded text Indexing
[Diagram: Input documents -> Map -> Reducer -> Clustered index storage -> Local disk -> Search Engine]
     Map: assign documents to shards
     Reducer: index text to local disk and then copy index to distributed file store
     Search Engine: copy to local disk typically required before index can be loaded
Problems with Solr and Hadoop
[Diagram: Input documents -> Map -> Reducer -> Clustered index storage -> Local disk -> Search Engine]
     Failure of a reducer causes garbage to accumulate in the local disk
     Failure of search engine requires another download of the index from clustered storage
Limitations of HDFS
     HDFS is Append Only
     Data Access is through the HDFS API
     High Availability is a challenge
     Single points of failure
     Limited to 50-200 million files
     Performance bottleneck
[Diagram: NAS appliance, NameNode (A, B), DataNodes]
Logs: Flume Aggregates Incoming Events for Solr, which Requires a Multi-Step Batch Process
[Diagram: three Application Servers and a Hadoop Cluster]
What’s Required for SDA?
     Ease of Data Access through Open Standards
     Large Scale, Reliable Storage
     Ease of Integration
       –   Management (REST)
       –   Security (LDAP, NIS, Linux PAM…)
       –   Analytics (NFS, ODBC, HDFS)
[Diagram: Search, Analytics, Discovery]
Ease of Data Access
     HDFS API
     Enterprise NFS Access
Multiple Architectures Possible

     Export to the world
      – NFS gateway runs on selected gateway hosts
     Local server
      – NFS gateway runs on local host
      – Enables local compression and checksumming
     Export to self
      – NFS gateway runs on all data nodes, mounted from localhost




Data Access through Standard Protocols
[Diagram: one NFS Client and several NFS Servers]
NFS Access through a Local server
[Diagram: Application, Client, NFS Server, Cluster Nodes]
Universal export to self
[Diagram: a Cluster Node running a Task and an NFS Server, next to the other Cluster Nodes]
Nodes are identical
[Diagram: three Cluster Nodes, each running a Task and an NFS Server]
Simplifies Solr Hadoop Integration
[Diagram: Input documents -> Map -> Reducer -> Clustered index storage -> Search Engine]
     Failure of a reducer is cleaned up by map-reduce framework
     Search engine reads mirrored index directly
How Does this Integration Happen?

     Elegantly simple
     Direct integration is a result of leveraging the two architectures
     Data in the Hadoop cluster is written to a Volume
     Solr Crawler discovers content being entered into Hadoop
     Accesses the data in the cluster through NFS
     Builds Search Index
     Users access Solr to find data directly in Hadoop
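
Because the volume is reachable through NFS, a crawler can use ordinary file I/O. The following is a minimal sketch of that flow, not Solr's crawler or API: a temporary directory stands in for the NFS mount point, and the function names and the in-memory index are assumptions made for illustration.

```python
import os
import tempfile

def discover_new_files(mount_point, seen):
    """Walk the mounted volume and return paths that are not indexed yet."""
    new_paths = []
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if path not in seen:
                new_paths.append(path)
    return new_paths

def index_files(paths, index, seen):
    """Read each file with plain file I/O and add its words to the index."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for word in handle.read().lower().split():
                index.setdefault(word, set()).add(path)
        seen.add(path)

# A temporary directory stands in for the NFS mount of a cluster volume.
with tempfile.TemporaryDirectory() as mount_point:
    with open(os.path.join(mount_point, "a.txt"), "w", encoding="utf-8") as f:
        f.write("flamenco guitar")
    index, seen = {}, set()
    first_pass = discover_new_files(mount_point, seen)
    index_files(first_pass, index, seen)

    # New content arrives in the volume; the next crawl picks up only that.
    with open(os.path.join(mount_point, "b.txt"), "w", encoding="utf-8") as f:
        f.write("classical guitar")
    second_pass = discover_new_files(mount_point, seen)
    index_files(second_pass, index, seen)

    guitar_hits = sorted(os.path.basename(p) for p in index["guitar"])
```

The point of the sketch is that no cluster-specific client library appears anywhere: discovery and reading are both standard filesystem calls.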


Distributed Shard Indexing
     Input: doc1, doc2, doc3, …
     Map: emit (shard, doc) pairs
       –   shard#1,doc1
       –   shard#2,doc2
       –   shard#1,doc3
       –   shard#3,doc4
       –   shard#3,doc5
       –   …
     Combine, shuffle and sort: group documents by shard
       –   shard#1,[doc3,doc1]
       –   shard#2,[doc2]
       –   shard#3,[doc5]
       –   …
     Reduce: build one index per shard
     Output: index/s1, index/s2, index/s3, …
How Does this Work at Scale with
Distributed Indices?
 MapReduce jobs analyze distributed, disparate data in a cluster
 In distributed indexing, the input is split arbitrarily into chunks
  and each chunk is handled separately. There can be many more
  chunks than there are shards to be created.
 Mapper assigns document to shard
       –   Shard is usually hash of document id
     Reducer indexes all documents for a shard
       –   Indexes created on local disk
       –   On success, copy index to DFS
 Zookeeper is used to manage Solr instances
 A large Solr Search is distributed across multiple shards
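
The mapper, reducer, and distributed query described above can be sketched as follows. This is plain single-process Python, not Hadoop or Solr code; the shard count, the use of MD5 as the hash, the toy inverted index, and all names are assumptions for the example.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3  # assumed shard count for the example

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Mapper side: a stable hash of the document id picks the shard."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def map_docs(chunk):
    """Emit (shard, (doc_id, text)) for every document in one input chunk."""
    return [(shard_for(doc_id), (doc_id, text)) for doc_id, text in chunk]

def reduce_shard(docs):
    """Reducer side: build one inverted index from all documents of a shard."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def build_indexes(chunks):
    grouped = defaultdict(list)  # shuffle: group documents by shard
    for chunk in chunks:
        for shard, doc in map_docs(chunk):
            grouped[shard].append(doc)
    return {shard: reduce_shard(docs) for shard, docs in grouped.items()}

def search(indexes, term):
    """Send the query to every shard and merge the partial results."""
    hits = set()
    for index in indexes.values():
        hits |= index.get(term.lower(), set())
    return sorted(hits)

# More chunks than shards: chunk boundaries are arbitrary.
chunks = [
    [("doc1", "flamenco guitar"), ("doc2", "classical guitar")],
    [("doc3", "spanish guitar")],
    [("doc4", "hadoop cluster")],
    [("doc5", "solr search")],
]
indexes = build_indexes(chunks)
guitar_docs = search(indexes, "guitar")
```

Hashing the document id keeps the assignment independent of which chunk a document arrived in, so re-running the job with different input splits sends each document to the same shard.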


What about HA and Data Protection?

     Cluster Capabilities can Extend to Integrated Search and Discovery
     Reliable Compute
       –   Automated re-replication
       –   Self-healing from HW and SW failures
       –   Load balancing
       –   Rolling upgrades
       –   No lost jobs or data
       –   99.999% (five 9’s) of uptime
     Dependable Storage
       –   Business continuity with snapshots and mirrors
       –   Recover to a point in time
       –   End-to-end checksumming
       –   Strong consistency
       –   Mirror across sites to meet Recovery Time Objectives
MapReduce failure to write the Index

 Highly Available JobTracker and TaskTracker ensure
  that failed jobs are recovered with their state and
  run to completion
 MapReduce will clean up partially written indexes
 No administrator intervention required




Solr Node Fails


     Other Solr nodes start serving shards that were being served by the failed node




Node Containing the Index Fails

     Data is already replicated across the cluster
     Zookeeper assigns Solr instance on the replicated node to the
      replicated shard
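
The reassignment step can be illustrated with a small function. This is a hypothetical sketch of the decision only: it does not use the ZooKeeper API, and the node names, data structures, and the least-loaded tie-break are assumptions made for the example.

```python
def reassign_shards(assignments, replicas, failed_node):
    """Move shards served by a failed node to live nodes that hold a replica.

    assignments: shard -> node currently serving it
    replicas:    shard -> nodes that store a copy of the shard's index
    """
    updated = dict(assignments)

    def load(node):
        # Number of shards the node serves in the updated assignment.
        return sum(1 for owner in updated.values() if owner == node)

    for shard, node in assignments.items():
        if node != failed_node:
            continue
        candidates = [n for n in replicas[shard] if n != failed_node]
        if not candidates:
            raise RuntimeError(f"no surviving replica for {shard}")
        # Prefer the replica holder that currently serves the fewest shards.
        updated[shard] = min(candidates, key=load)
    return updated

assignments = {"shard1": "node-a", "shard2": "node-b", "shard3": "node-c"}
replicas = {
    "shard1": ["node-a", "node-b", "node-c"],
    "shard2": ["node-b", "node-c", "node-a"],
    "shard3": ["node-c", "node-a", "node-b"],
}
after_failure = reassign_shards(assignments, replicas, failed_node="node-b")
```

Because the index data is already replicated, the reassignment only changes which Solr instance serves the shard; no index copy is needed in this model.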




Additional High Availability and Replication

     Snapshots are available
     Administrator sets frequency at the Volume
 Snapshots with automatic
  de-duplication
 Saves space by sharing blocks
 Redirect on write, fast with no performance or
  storage penalty
 Zero performance loss on writing to original
 Scheduled, or on-demand
 Easy recovery with drag and drop
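
Block sharing and redirect-on-write can be illustrated with a toy model. The class below is an assumption-laden sketch, not the cluster's actual snapshot implementation: a volume is just a dictionary of blocks, and a snapshot is a copy of the block map that keeps pointing at the same block contents.

```python
class Volume:
    """Toy volume: a mapping from block number to immutable block contents."""

    def __init__(self):
        self.blocks = {}     # live view: block number -> data
        self.snapshots = {}  # snapshot name -> frozen block map

    def write(self, block_no, data):
        # Redirect on write: the live map points at the new data; blocks
        # already referenced by a snapshot are never modified in place.
        self.blocks[block_no] = data

    def snapshot(self, name):
        # Copying the block map copies references only, not the block data.
        self.snapshots[name] = dict(self.blocks)

    def restore(self, name):
        self.blocks = dict(self.snapshots[name])

volume = Volume()
volume.write(0, b"index segment v1")
volume.write(1, b"index metadata")
volume.snapshot("nightly")
volume.write(0, b"index segment v2")  # the live view moves on

# The unchanged block is the same object in the snapshot and the live view.
shared = volume.snapshots["nightly"][1] is volume.blocks[1]
```

Taking the snapshot costs one map copy, and later writes pay nothing extra, which is the property the slide describes as no performance or storage penalty.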



Mirroring Support in Hadoop Cluster
     Business Continuity and Efficiency
     Efficient design
       –   Differential deltas are updated
       –   Compressed and check-summed
     Easy to manage
       –   Scheduled or on-demand
       –   WAN, Remote Seeding
       –   Consistent point-in-time
[Diagram: Production (Datacenter 1) connected over WAN to Research (Datacenter 2); Production connected over WAN to EC2]
Simplified NFS data flows for Distributed Search
[Diagram: Input documents -> Map -> Reducer -> Mirrors -> Search Engines]
     Mirroring allows exact placement of index data
     Arbitrary levels of replication also possible
Improving Search Relevancy

     Requires a continuous Feedback Loop
       –   The quality of the search is influenced by the end-user selections
       –   Fully automated process that improves with use
       –   Does not require manual tags or classification
[Diagram: Search, Analytics, Discovery]
Recommendations

     Often referred to as collaborative filtering
     Actors interact with items
       –   observe successful interaction
     We want to suggest additional successful interactions
     Observations inherently very sparse




Examples

     Customers buying books (Linden et al)
     Web visitors rating music (Shardanand and Maes) or movies (Riedl,
      et al), (Netflix)
     Internet radio listeners not skipping songs (Musicmatch)
     Internet video watchers watching >30 s




Examples

     Query for Friends results in links to Seinfeld
     Search for kittens, get results for baby otters




Dyadic Structure

     Functional
       –   Interaction: actor -> item*
     Relational
       –   Interaction ⊆ Actors x Items
     Matrix
       –   Rows indexed by actor, columns by item
       –   Value is count of interactions
     Predict missing observations
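
The three views of the same interaction data can be written out directly. The actors, items, and observations below are made up for illustration.

```python
from collections import Counter

# Hypothetical observed interactions as (actor, item) pairs.
observations = [
    ("ann", "book1"), ("ann", "book2"),
    ("bob", "book2"), ("bob", "book2"),
    ("cy", "book3"),
]

# Relational view: a subset of Actors x Items.
relation = set(observations)

# Functional view: each actor maps to a list of items.
by_actor = {}
for actor, item in observations:
    by_actor.setdefault(actor, []).append(item)

# Matrix view: rows indexed by actor, columns by item, values are counts.
actors = sorted({a for a, _ in observations})
items = sorted({i for _, i in observations})
counts = Counter(observations)
matrix = [[counts[(a, i)] for i in items] for a in actors]
```

Most cells of the matrix are zero even in this tiny example, which is the sparsity the recommendation problem has to work with.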




Fundamental Algorithmics

     Co-occurrence: K = A’A
     A is actors x items, K is items x items
     Product has the general shape of a matrix
     K tells us “users who interacted with x also interacted with y”
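
A small worked example of the co-occurrence product, using plain Python lists rather than a matrix library. The matrix values are invented for illustration.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]

# A: 3 actors x 3 items, 1 where the actor interacted with the item.
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]

# K: items x items. K[x][y] counts actors who interacted with both x and y.
K = matmul(transpose(A), A)
```

Here K[0][1] is 2 because two actors interacted with both the first and the second item, and the diagonal holds each item's total interaction count.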




Why not Expand it?

     Users enter queries (A)
       –   (actor = user, item=query)
     Users view videos (B)
       –   (actor = user, item=video)
     A’A gives query recommendation
       –   “did you mean to ask for”
     B’B gives video recommendation
       –   “you might like these videos”




The punch-line

     B’A recommends videos in response to a query
       –   (isn’t that a search engine?)
       –   (not quite, it doesn’t look at content or meta-data)
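
The cross-occurrence product can be sketched the same way. The users, queries, videos, and matrix values are hypothetical, and ranking by raw count is an assumption made to keep the example short.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]

queries = ["paco de lucia", "kittens"]
videos = ["flamenco guitar", "baby otters", "classical guitar"]

# Rows are users. A marks queries entered, B marks videos viewed.
A = [
    [1, 0],
    [1, 0],
    [0, 1],
]
B = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]

# B'A: videos x queries. Entry [v][q] counts users who entered q and viewed v.
BtA = matmul(transpose(B), A)

def recommend(query, top_n=2):
    """Rank videos for a query by how many users did both."""
    q = queries.index(query)
    scored = [(BtA[v][q], videos[v]) for v in range(len(videos)) if BtA[v][q] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]

paco_videos = recommend("paco de lucia")
```

Nothing in `recommend` inspects a video's title or metadata; the ranking comes entirely from user behavior.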




Real-life example

     Query: “Paco de Lucia”
     Conventional meta-data search results:
       –   “hombres del paco” times 400
       –   not much else
     Recommendation based search:
       –   Flamenco guitar and dancers
       –   Spanish and classical guitar
       –   Van Halen doing a classical/flamenco riff




Real-life example




The Search for Relevancy
     Updating Search to Reflect Relevancy
       –   Big MapReduce jobs can use behavioral traces in logs to improve results
           and identify importance
     The power of this virtuous loop depends on frictionless
      data access, high availability, and performance
[Diagram: Search, Analytics, Discovery]
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Contenu connexe

Tendances

Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Uwe Printz
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldUwe Printz
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Cloudera, Inc.
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataWANdisco Plc
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Krishnan Parasuraman
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera, Inc.
 

Tendances (20)

Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!
 
Presentation
PresentationPresentation
Presentation
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
HUG slides on NFS and ODBC
HUG slides on NFS and ODBCHUG slides on NFS and ODBC
HUG slides on NFS and ODBC
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
 

Similaire à The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Gavin Heavyside
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Gavin Heavyside
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01eimhee
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud
 
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...Cloudera, Inc.
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarPlatfora
 
Paris live eddiesatterly_022013
Paris live eddiesatterly_022013Paris live eddiesatterly_022013
Paris live eddiesatterly_022013jenny_splunk
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinarTed Dunning
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR Technologies
 
Mobile Development Meets Semantic Technology
Mobile Development Meets Semantic TechnologyMobile Development Meets Semantic Technology
Mobile Development Meets Semantic TechnologyBlue Slate Solutions
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoopDataWorks Summit
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopDataWorks Summit
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 

The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

  • 1. The Search Is Over: Integrating SOLR and Hadoop to Simplify Big Data Analytics ©MapR Technologies - Confidential 1
  • 2. Evolution of Search
      • Documents: models, feature selection
      • Content relationships: PageRank etc., organization
      • User interaction: clicks, ratings/reviews, learning to rank, social graph
      • Queries: phrases, NLP
  • 3. Search Discovery and Analytics (diagram labels: Search, Analytics, Discovery)
  • 4. Data is Growing Quickly: Business Analytics Requires a New Approach
      • Data volume growing 44x
      • 2010: 1.2 zettabytes; 2020: 35.2 zettabytes (IDC Digital Universe Study 2011)
      • Data is growing faster than Moore’s Law
      • Source: IDC Digital Universe Study, sponsored by EMC, May 2010
  • 5. MapReduce: A Paradigm Shift
      • Distributed computing platform
          – Large clusters
          – Commodity hardware
      • Pioneered at Google
          – Bigtable and Google File System
      • Commercially available as Hadoop
  • 7. How does Map/Reduce work?
      1. Map
          – Spread data across servers based on key/value pairs
          – Each node independently scans local data
      2. Servers produce Map results
      3. Reduce: combine/merge Map results
      4. Process complete, or Map a new function
      Like shuffling multiple decks of playing cards
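The four steps above can be sketched in a single process. This is a minimal illustration, not Hadoop's API: the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count job are assumptions chosen for the example, and a real cluster would run the map and reduce functions on many nodes in parallel.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Combine the values collected for each key into one result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical job: count words across input lines.
def count_words(line):
    return [(word.lower(), 1) for word in line.split()]

def sum_counts(_key, values):
    return sum(values)

lines = ["search is over", "Search and analytics", "analytics is search"]
word_counts = reduce_phase(shuffle(map_phase(lines, count_words)), sum_counts)
print(word_counts)
```

The output of one job can be fed to `map_phase` again with a different function, which corresponds to step 4 ("Map a new function").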
  • 8. The Cost of Enterprise Storage
      • SAN storage: $2 - $10/gigabyte; $1M gets 0.5 petabytes, 200,000 IOPS, 1 Gbyte/sec
      • NAS filers: $1 - $5/gigabyte; $1M gets 1 petabyte, 400,000 IOPS, 2 Gbytes/sec
      • Local storage: $0.05/gigabyte; $1M gets 20 petabytes, 10,000,000 IOPS, 800 Gbytes/sec
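The capacity figures on this slide follow from the per-gigabyte prices. The sketch below reproduces them under two assumptions that the slide does not state: decimal units (1 petabyte = 1,000,000 gigabytes) and the low end of each price range.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units

def petabytes_for_budget(budget_dollars, dollars_per_gb):
    """Capacity in petabytes that a budget buys at a given price per gigabyte."""
    return budget_dollars / dollars_per_gb / GB_PER_PB

budget = 1_000_000  # the $1M used on the slide

# Low end of each price range quoted on the slide.
capacity_pb = {
    "SAN storage": petabytes_for_budget(budget, 2.00),
    "NAS filers": petabytes_for_budget(budget, 1.00),
    "Local storage": petabytes_for_budget(budget, 0.05),
}

for tier, petabytes in capacity_pb.items():
    print(f"{tier}: {petabytes:.1f} PB")
```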
  • 9. Deep Object Store
      • Billions and billions of files
      • For some use cases the constraint is not storage capacity but the number of objects
          – Messages
          – Attachments
          – Images
          – Recordings
      • Provides a deep storage pool that is analytics-ready
          – Store it until you need it
          – Derive secondary value from analytic processing
      • It makes more sense to run analytics where the data is stored and send only the results over the network
  • 10. Problems with Integrating Solr with Hadoop
      • Simple to integrate with Hadoop as a data source
      • Difficult to integrate distributed search and scale
      • SolrCloud simplifies sharding and replication coordination
      • Integration is limited by the capabilities of large-scale storage
          – High availability
          – Data protection
          – Ease of access
  • 11. Sharded Text Indexing
      • Map: assign documents to shards
      • Reducer: index text to local disk, then copy the index to the distributed file store
      • Search engine: a copy to local disk is typically required before the index can be loaded
      (diagram: Input documents → Map → Reducer with local disk → Clustered index storage → Local disk → Search Engine)
  • 12. Problems with Solr and Hadoop
      • Failure of a reducer causes garbage to accumulate on the local disk
      • Failure of the search engine requires another download of the index from clustered storage
      (diagram: Input documents → Map → Reducer with local disk → Clustered index storage → Local disk → Search Engine)
  • 13. Limitations of HDFS
      • HDFS is append only
      • Data access is through the HDFS API
      • High availability is a challenge
      • Single points of failure
      • Limited to 50-200 million files
      • Performance bottleneck
      (diagram labels: NAS appliance, A, B, NameNode, DataNodes)
  • 14. Logs, Flume, aggregates incoming events to Solr
      – Requires a multi-step, batch process
      (diagram labels: Application Servers, Hadoop Cluster)
  • 15. What’s Required for SDA?
      • Ease of data access through open standards
      • Large-scale, reliable storage
      • Ease of integration
          – Management (REST)
          – Security (LDAP, NIS, Linux PAM…)
          – Analytics (NFS, ODBC, HDFS)
      (diagram labels: Search, Analytics, Discovery)
  • 16. Ease of Data Access (diagram labels: HDFS API, Enterprise NFS Access)
  • 17. Multiple Architectures Possible
      • Export to the world
          – NFS gateway runs on selected gateway hosts
      • Local server
          – NFS gateway runs on local host
          – Enables local compression and checksumming
      • Export to self
          – NFS gateway runs on all data nodes, mounted from localhost
  • 18. Data Access through Standard Protocols (diagram labels: Client, NFS, four NFS Servers)
  • 19. NFS Access through a Local Server (diagram labels: Application, NFS Server, Client, Cluster Nodes)
  • 20. Universal Export to Self (diagram labels: Cluster Nodes; Task and NFS Server on a Cluster Node)
  • 21. Nodes are Identical (diagram: three Cluster Nodes, each with a Task and an NFS Server)
  • 22. Simplifies Solr Hadoop Integration
      • Failure of a reducer is cleaned up by the map-reduce framework
      • Search engine reads the mirrored index directly
      (diagram: Input documents → Map → Reducer → Clustered index storage → Search Engine)
  • 23. How Does this Integration Happen?
      • Elegantly simple
      • Direct integration is a result of leveraging architectures
      • Data in the Hadoop cluster is written to a volume
      • Solr crawler discovers content being entered into Hadoop
      • Accesses the data in the cluster through NFS
      • Builds the search index
      • Users query Solr to find data directly in Hadoop
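Because the cluster is exposed over NFS, a crawler can read cluster data with ordinary file I/O. The sketch below illustrates that idea only: the mount path, the field names `id` and `text`, and both helper functions are hypothetical, and it stops at building a JSON payload rather than calling Solr.

```python
import json
import os

def build_solr_docs(root):
    """Walk a directory tree (for example an NFS mount) and build one document per file."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            # Field names are assumptions; a real schema defines its own.
            docs.append({"id": path, "text": text})
    return docs

def to_update_payload(docs):
    """Serialize documents as a JSON list, ready to send to an indexing endpoint."""
    return json.dumps(docs)

MOUNT_POINT = "/mapr/my.cluster.com/data"  # hypothetical NFS mount path
if os.path.isdir(MOUNT_POINT):
    print(to_update_payload(build_solr_docs(MOUNT_POINT)))
```

Nothing in the crawler is Hadoop-specific, which is the point of the NFS route: the same code runs against any mounted directory.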
  • 24. Distributed Shard Indexing
      • Input: doc1, doc2, doc3, …
      • Map emits: (shard#1, doc1), (shard#2, doc2), (shard#1, doc3), (shard#3, doc4), (shard#3, doc5), …
      • Combine, shuffle and sort: (shard#1, [doc3, doc1]), (shard#2, [doc2]), (shard#3, [doc5]), …
      • Reduce writes output: index/s1, index/s2, index/s3
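The flow on this slide can be sketched as follows, assuming the shard is chosen by hashing the document id. The function names, the CRC32 hash, and the toy inverted index are assumptions for illustration; a real job would build Lucene indexes inside the reducer.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 3

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Map a document id to a shard; CRC32 is stable across runs, unlike hash() on str."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def map_docs(docs):
    """Mapper: emit (shard, document) for every input document."""
    for doc_id, text in docs:
        yield shard_for(doc_id), (doc_id, text)

def shuffle(pairs):
    """Group documents by shard, as the framework's shuffle and sort would."""
    groups = defaultdict(list)
    for shard, doc in pairs:
        groups[shard].append(doc)
    return groups

def reduce_shard(docs):
    """Reducer: build a toy inverted index (term -> sorted doc ids) for one shard."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = [("doc1", "solr search"), ("doc2", "hadoop storage"), ("doc3", "solr on hadoop")]
shard_indexes = {shard: reduce_shard(group)
                 for shard, group in shuffle(map_docs(docs)).items()}
print(shard_indexes)
```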
  • 25. How Does this Work at Scale with Distributed Indices?
      • MapReduce jobs analyze distributed, disparate data in a cluster
      • In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created.
      • Mapper assigns document to shard
          – Shard is usually hash of document id
      • Reducer indexes all documents for a shard
          – Indexes created on local disk
          – On success, copy index to DFS
      • Zookeeper is used to manage Solr instances
      • A large Solr search is distributed across multiple shards
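The last bullet, a search distributed across shards, amounts to a scatter-gather: query every shard, then merge the partial results. The sketch below shows that pattern with a toy scoring rule (term frequency); the data layout and function names are assumptions and do not reflect Solr's internal scoring or request handling.

```python
import heapq

def search_shard(shard, term):
    """Return (score, doc_id) hits from one shard; the score is a toy term count."""
    hits = []
    for doc_id, text in shard.items():
        count = text.lower().split().count(term)
        if count:
            hits.append((count, doc_id))
    return hits

def distributed_search(shards, term, k=3):
    """Scatter the query to every shard, gather the hits, keep the k best."""
    all_hits = []
    for shard in shards:
        all_hits.extend(search_shard(shard, term))
    return heapq.nlargest(k, all_hits)

shards = [
    {"doc1": "solr solr search", "doc3": "solr on hadoop"},
    {"doc2": "hadoop storage"},
]
top_hits = distributed_search(shards, "solr", k=2)
print(top_hits)
```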
  • 26. What about HA and Data Protection?
      • Cluster capabilities can extend to integrated search and discovery
      • Reliable compute
          – Automated re-replication
          – Self-healing from HW and SW failures
          – Load balancing
          – Rolling upgrades
          – No lost jobs or data
          – Five nines (99.999%) of uptime
      • Dependable storage
          – Business continuity with snapshots and mirrors
          – Recover to a point in time
          – End-to-end checksumming
          – Strong consistency
          – Mirror across sites to meet Recovery Time Objectives
  • 27. MapReduce Failure to Write the Index
      • Highly available JobTracker and TaskTracker ensure that failed work is recovered with its state and run to completion
      • MapReduce will clean up partially written indexes
      • No administrator intervention required
  • 28. Solr Node Fails
      • Other Solr nodes start serving shards that were being served by the failed node
  • 29. Node Containing the Index Fails
      • Data is already replicated across the cluster
      • Zookeeper assigns the Solr instance on the replicated node to the replicated shard
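The failover described in the last two slides can be pictured as recomputing which live node serves each shard. The sketch below is a simplified model of that decision only; it does not use ZooKeeper or Solr APIs, and the node names, shard names, and "first live replica wins" rule are assumptions.

```python
def assign_shards(replica_nodes, live_nodes):
    """Pick one live node per shard from the nodes that hold a replica of it."""
    assignment = {}
    for shard, nodes in replica_nodes.items():
        candidates = [node for node in nodes if node in live_nodes]
        if not candidates:
            raise RuntimeError(f"no live replica for {shard}")
        assignment[shard] = candidates[0]
    return assignment

# Hypothetical layout: every shard is replicated on two nodes.
replica_nodes = {
    "shard1": ["nodeA", "nodeB"],
    "shard2": ["nodeB", "nodeC"],
    "shard3": ["nodeC", "nodeA"],
}

before = assign_shards(replica_nodes, {"nodeA", "nodeB", "nodeC"})
after = assign_shards(replica_nodes, {"nodeB", "nodeC"})  # nodeA has failed
print(before)
print(after)
```

Only the shard that was served by the failed node changes hands; because the data is already replicated, no index has to be copied before the new node starts serving.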
  • 30. Additional High Availability and Replication
      • Snapshots are available
      • Administrator sets frequency at the volume
      • Snapshots with automatic de-duplication
      • Saves space by sharing blocks
      • Redirect on write, fast with no performance or storage penalty
      • Zero performance loss on writing to original
      • Scheduled, or on-demand
      • Easy recovery with drag and drop
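Redirect-on-write and block sharing can be illustrated with a toy volume. This is a conceptual sketch, not MapR's implementation: the `Volume` class, its block map, and the snapshot names are all hypothetical.

```python
class Volume:
    """Toy volume that takes snapshots by redirecting writes to new blocks."""

    def __init__(self):
        self.blocks = {}      # block id -> data, shared by the live view and snapshots
        self.live = {}        # logical offset -> block id for the live view
        self.snapshots = {}   # snapshot name -> frozen copy of the offset map
        self._next_id = 0

    def write(self, offset, data):
        # Redirect on write: new data always lands in a fresh block,
        # so blocks referenced by a snapshot are never overwritten.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[offset] = block_id

    def snapshot(self, name):
        # Copies only the small offset map; data blocks are shared, not copied.
        self.snapshots[name] = dict(self.live)

    def read(self, offset, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[offset]]

vol = Volume()
vol.write(0, "index segment v1")
vol.write(1, "index segment other")
vol.snapshot("nightly")
vol.write(0, "index segment v2")  # the snapshot still sees v1

print(vol.read(0), "|", vol.read(0, snapshot="nightly"))
```

The unchanged block at offset 1 is stored once and referenced by both the live view and the snapshot, which is the space saving the slide refers to.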
  • 31. Mirroring Support in Hadoop Cluster: Business Continuity and Efficiency
      • Efficient design
          – Differential deltas are updated
          – Compressed and checksummed
      • Easy to manage
          – Scheduled or on-demand
          – WAN, remote seeding
          – Consistent point-in-time
      (diagram labels: Production, Research, Datacenter 1, WAN, Datacenter 2, WAN, Production, EC2)
  • 32. Simplified NFS Data Flows for Distributed Search
      • Mirroring allows exact placement of index data
      • Arbitrary levels of replication also possible
      (diagram: Input documents → Map → Reducer → Mirrors → Search Engines)
  • 33. Improving Search Relevancy
      • Requires a continuous feedback loop
          – The quality of the search is influenced by the end-user selections
          – Fully automated process that improves with use
          – Does not require manual tags or classification
      (diagram labels: Search, Analytics, Discovery)
  • 34. Recommendations
      • Often referred to as collaborative filtering
      • Actors interact with items
          – Observe successful interaction
      • We want to suggest additional successful interactions
      • Observations inherently very sparse
  • 35. Examples
      • Customers buying books (Linden et al.)
      • Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
      • Internet radio listeners not skipping songs (Musicmatch)
      • Internet video watchers watching >30 s
  • 36. Examples
      • Query for Friends results in links to Seinfeld
      • Search for kittens, get results for baby otters
  • 37. Dyadic Structure
      • Functional
          – Interaction: actor -> item*
      • Relational
          – Interaction ⊆ Actors x Items
      • Matrix
          – Rows indexed by actor, columns by item
          – Value is count of interactions
      • Predict missing observations
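The matrix view can be built directly from a log of (actor, item) events. The sketch below is one minimal way to do it with plain lists; the event data and the function name are made up for illustration.

```python
def interaction_matrix(events):
    """Build an actors x items count matrix from (actor, item) interaction events."""
    actors = sorted({actor for actor, _ in events})
    items = sorted({item for _, item in events})
    row = {actor: r for r, actor in enumerate(actors)}
    col = {item: c for c, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in actors]
    for actor, item in events:
        matrix[row[actor]][col[item]] += 1  # value is the count of interactions
    return actors, items, matrix

# Hypothetical interaction log.
events = [
    ("ann", "book1"), ("ann", "book2"),
    ("bob", "book2"), ("bob", "book2"),
    ("cat", "book3"),
]
actors, items, A = interaction_matrix(events)
for actor, counts in zip(actors, A):
    print(actor, counts)
```

Most cells stay at zero, which is the sparsity noted earlier; predicting which of those zeros would turn into successful interactions is the recommendation problem.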
  • 38. Fundamental Algorithmics
      • Co-occurrence
      • A is actors x items, K is items x items
      • Product has general shape of matrix
      • K tells us “users who interacted with x also interacted with y”
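A small worked example makes the shapes concrete. It assumes the co-occurrence matrix is computed as K = A'A (the transpose of A times A), which matches the A'A notation used on the next slide; the matrix values and helper names are made up.

```python
def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    y_columns = transpose(y)
    return [[sum(a * b for a, b in zip(row, column)) for column in y_columns]
            for row in x]

# A is actors x items: 3 actors, 3 items, 1 = the actor interacted with the item.
A = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]

# K is items x items: K[x][y] counts actors who interacted with both x and y.
K = matmul(transpose(A), A)
for row in K:
    print(row)
```

Here `K[0][1]` is 2 because two actors interacted with both item 0 and item 1, and the diagonal holds each item's total number of interacting actors.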
  • 39. Why not Expand it?
      • Users enter queries (A)
          – (actor = user, item = query)
      • Users view videos (B)
          – (actor = user, item = video)
      • A'A gives query recommendation
          – “did you mean to ask for”
      • B'B gives video recommendation
          – “you might like these videos”
  • 40. The punch-line
      • B'A recommends videos in response to a query
          – (isn’t that a search engine?)
          – (not quite, it doesn’t look at content or meta-data)
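With binary interactions, each entry of B'A is simply the number of users who both entered a given query and viewed a given video, so it can be counted straight from behavior logs. The sketch below does that with dictionaries; the log contents and function names are hypothetical.

```python
from collections import defaultdict

def cross_occurrence(user_queries, user_videos):
    """Count, per (query, video) pair, the users who entered the query and viewed the video."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, queries in user_queries.items():
        for query in queries:
            for video in user_videos.get(user, ()):
                counts[query][video] += 1
    return counts

def recommend(counts, query, top_n=2):
    """Rank videos for a query by count, breaking ties by name."""
    ranked = sorted(counts.get(query, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [video for video, _ in ranked[:top_n]]

# Hypothetical behavior logs: which queries each user entered, which videos they viewed.
user_queries = {"u1": {"flamenco"}, "u2": {"flamenco", "guitar"}, "u3": {"guitar"}}
user_videos = {
    "u1": {"paco_live"},
    "u2": {"paco_live", "van_halen_riff"},
    "u3": {"van_halen_riff", "cat_video"},
}

counts = cross_occurrence(user_queries, user_videos)
print(recommend(counts, "flamenco"))
```

No video title or description is consulted anywhere, which is why the slide says this is not quite a search engine.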
  • 41. Real-life example
      • Query: “Paco de Lucia”
      • Conventional meta-data search results:
          – “hombres del paco” times 400
          – not much else
      • Recommendation-based search:
          – Flamenco guitar and dancers
          – Spanish and classical guitar
          – Van Halen doing a classical/flamenco riff
  • 43. The Search for Relevancy
      • Updating search to reflect relevancy
          – Big MapReduce jobs can use behavioral traces in logs to improve results and identify importance
      • The power of this virtuous loop depends on frictionless data access, high availability, and performance
      (diagram labels: Search, Analytics, Discovery)