Leveraging Solr and Mahout for Next Gen Data Access and Insight

Grant Ingersoll
Chief Scientist

Confidential © Copyright 2012
Search is Dead, Long Live Search

• Modern Data Challenges are multi-structured

• Search is a system building block
    - Text is only a part of the story

• If the algorithms fit, use them!

• Embrace fuzziness!

• Scoring features are everywhere

[Diagram labels: Content, Relationships, Users, Access]

Confidential and Proprietary
© 2012 LucidWorks
Topics

    • Intros

    • Search (R)Evolution

    • Apache Solr
    • Apache Mahout

    • Search and Machine Learning

    • Scaling


Grant’s Background

• Co-founder:
    - LucidWorks – Chief Scientist
    - Apache Mahout
• Long time Lucene/Solr committer
• Author: Taming Text
    - www.manning.com/ingersoll
• Background in IR and NLP
    - Built CLIR, QA and a variety of other search-based apps




Search (R)evolution

• Search use leads to search abuse
    - Denormalization frees your mind
    - Scoring is just a sparse matrix multiply

• Lucene/Solr evolution
    -   Non-free text usages abound
    -   Many DB-like features
    -   NoSQL before NoSQL was cool
    -   Flexible indexing
    -   Finite State Transducers FTW!

• Scale

• “This ain’t your father’s relevance anymore”
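
The claim that scoring is just a sparse matrix multiply can be illustrated with a small sketch. It assumes plain term-frequency weights and an invented three-document corpus; real Lucene/Solr scoring adds normalization and other factors that are omitted here. The inverted index plays the role of the sparse term-document matrix, and the query is a sparse vector.

```python
from collections import defaultdict

# Toy corpus; document ids and texts are invented for illustration.
docs = {
    "d1": "solr search server",
    "d2": "mahout machine learning",
    "d3": "search and machine learning",
}

# Sparse term-document matrix stored as an inverted index:
# term -> {doc_id: term frequency}. Only non-zero cells are kept.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1


def score(query):
    """Multiply the sparse query vector by the sparse term-document matrix."""
    query_vec = defaultdict(int)
    for term in query.split():
        query_vec[term] += 1
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        # Only documents containing the term contribute to the product.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(score("machine learning search"))
```

Because only non-zero entries are visited, the cost depends on the posting lists of the query terms rather than on the size of the whole corpus.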

Apache Solr?

• “Solr is an open source enterprise search server based
  on the Lucene Java search library, with XML/HTTP and
  JSON APIs, hit highlighting, faceted search, caching,
  replication, a web administration interface and many
  more features. It runs in a Java servlet container
  such as Tomcat.”
    - http://lucene.apache.org/solr


• Did I mention free?
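
As a rough illustration of the HTTP and JSON APIs mentioned in the quote, the sketch below builds a query URL for Solr's select handler. The `q`, `rows`, `wt`, `facet` and `facet.field` parameters are standard Solr request parameters; the host, the `title` and `category` field names, and the helper `build_select_url` are assumptions made for the example. No request is sent.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, facet_field=None, rows=10):
    """Build a Solr select URL that asks for a JSON response.

    base_url and the field names used by callers are hypothetical.
    """
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to return facet counts for one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr", "title:mahout", facet_field="category"
)
print(url)
```

Fetching the resulting URL with any HTTP client would return the matching documents and facet counts as JSON, assuming a Solr instance with those fields is running at that address.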




Apache Mahout

• Goal: create library of scalable machine learning
  algorithms

• Mahout’s 3 “C”s provide tools for helping across many
  aspects of discovery
    - Collaborative Filtering
    - Classification
    - Clustering
• Also:
    - Collocations (Statistically Interesting Phrases)
    - SVD
    - Java math, primitives libraries and more
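
To make the first of the three “C”s concrete, here is a minimal sketch of item-based collaborative filtering using raw co-occurrence counts. It is written in plain Python for brevity and is not Mahout's API (Mahout is a Java library); the users, items and the `recommend` helper are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Invented preference data: user -> set of liked items.
likes = {
    "alice": {"solr", "lucene", "mahout"},
    "bob": {"solr", "lucene"},
    "carol": {"mahout", "hadoop"},
}

# Count how often each pair of items is liked by the same user.
cooccur = defaultdict(lambda: defaultdict(int))
for items in likes.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1


def recommend(user, top_n=3):
    """Score unseen items by their co-occurrence with the user's items."""
    seen = likes[user]
    scores = defaultdict(int)
    for item in seen:
        for other, count in cooccur[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


print(recommend("bob"))
```

A production recommender would normally weight co-occurrences with a similarity measure rather than raw counts; the counting step is kept bare here so the structure stays visible.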

Search + Machine Learning

• Search-driven applications present multiple
  opportunities for leveraging machine learning
    - Clustering – Enhance Discovery, outlier detection
    - Classification – Queries, Documents, Users
    - Content Recommendation – Collab. Filtering and
      personalization
    - NLP – phrases, named entities, co-reference, much more


• Many of these can also power faceted navigation

• Aside: Search can also often be used effectively to
  implement many machine learning algorithms
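
One way to read the aside is that a ranked search over labeled examples already gives a k-nearest-neighbor classifier. The sketch below assumes a tiny invented training set and uses term overlap as a stand-in for a search engine's relevance score; `search` and `classify` are hypothetical helpers, not part of Solr or Mahout.

```python
from collections import Counter

# Invented labeled examples standing in for an indexed training set.
labeled = [
    ("cheap flights to paris", "travel"),
    ("hotel deals in rome", "travel"),
    ("python list comprehension", "programming"),
    ("java servlet container", "programming"),
]


def search(query, k):
    """Return the top-k (score, label) hits, scored by shared terms."""
    q_terms = set(query.split())
    hits = []
    for text, label in labeled:
        overlap = len(q_terms & set(text.split()))
        if overlap:
            hits.append((overlap, label))
    hits.sort(key=lambda h: -h[0])
    return hits[:k]


def classify(query, k=3):
    """k-NN classification: the top hits vote, weighted by their score."""
    votes = Counter()
    for hit_score, label in search(query, k):
        votes[label] += hit_score
    return votes.most_common(1)[0][0] if votes else None


print(classify("cheap hotel in paris"))
```

With a real engine, `search` would be a query against an index of labeled documents, and the engine's own ranking would replace the overlap count.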

How and When

[Architecture diagram; recoverable labels only]

• Access APIs
    - Search View (shards 1, 2, 3, ... N)
    - Analytic Services: view into numeric/historic data
    - Personalization & Machine Learning Services: classification, recommendation
• Discovery & Enrichment
    - Clustering, classification, NLP, topic identification, search log analysis, user behavior
• Document Store
    - Documents, users, logs
• Classification Models
    - In memory, replicated, multi-tenant
• Content Acquisition
    - ETL, batch or near real-time
• Data
    - LucidWorks Search connectors
    - Push
         • Push


Scaling

• Search
    - SolrCloud = large-scale, distributed search and faceting
          » http://wiki.apache.org/solr/SolrCloud


• Machine Learning
    - Mahout is built on Hadoop for most things
    - SGD is sequential and really fast


• Sometimes all you can do is make an educated guess
    - Storm, Kafka, etc. can help by allowing you to make estimates in
      near real time
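
The remark that SGD is sequential and fast can be seen in a minimal sketch of logistic regression trained by stochastic gradient descent. This is plain Python on an invented one-feature data set, not Mahout's implementation; the learning rate and epoch count are arbitrary choices for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_train(examples, epochs=20, learning_rate=0.5):
    """Fit a one-feature logistic regression with plain SGD.

    Each example is visited in order and the model is updated
    immediately, so training is one sequential pass per epoch.
    """
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = sigmoid(weight * x + bias) - y
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


def predict(model, x):
    weight, bias = model
    return 1 if sigmoid(weight * x + bias) >= 0.5 else 0


# Invented training data: label is 1 when the feature is positive.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
model = sgd_train(data)
print(predict(model, -1.5), predict(model, 1.5))
```

Each update touches a single example and a constant amount of state, which is why a single machine can stream through a large data set without a distributed job.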



Wrap

• Search, Discovery and Analytics, when combined into
  a single, coherent system, provide powerful insight into
  both your content and your users

• LucidWorks has combined many of these things into
  LucidWorks Big Data
    - http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based
  applications



Resources

• LucidWorks
    - http://www.lucidworks.com
    - http://www.lucidworks.com/products/lucidworks-big-data
    - @LucidImagineer

• Me
    - grant@lucidworks.com
    - @gsingers


• Taming Text
    - http://www.manning.com/ingersoll
    - http://www.tamingtext.com
    - @tamingtext


Big Data Warehousing: Leveraging Solr & Mahout


Editor's notes

  1. This is a money slide where people should say “Wow man”. They shouldn’t understand the implications of this, but they should be very, very aware that something big just slid into the room. Tech Building Block: not just text, not just users + queries. Embrace Fuzziness: esp. in Big Data, it is the only way you are going to survive. TED: I think that this should make the case for advanced that is still search at its heart. The idea that search can be radically changed should be on the next slide.
  2. Search Abuse: can discuss how I started just doing free text, but then a curious thing happened: I started to see people using the engine for things like key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more. Search has added more DB features over the years. TED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  3. Big Picture: too often devs are stuck in the weeds.