Search | Discover | Analyze

Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop

Grant Ingersoll
Chief Scientist
Lucid Imagination
  	
  
We All Know the Pain

•  ________ data growth in the next ___ days/months/years
   –  Many estimate 80-90% of data is “unstructured” (multi-structured?)
•  The Age of “Data Paranoia”
   –  What if I don’t collect it all?
   –  What if I miss something or lose something?
   –  What if I can’t store it long enough?
   –  How do I secure it?
   –  Can I afford to do any of this? Can I afford not to?
   –  What if I can’t make sense of it?
  	
  
Big Data Premise and Promise

        Premise                                Promise

        Large Scale Data Collection/Storage    ✔
        Prevents Data Loss                     ✔
        Long Term Storage                      ✔
        Affordable                             ✔
        New Science Delivering New Insights    ?
  	
  
Why Search, Discovery and Analytics (SDA)?

[Diagram: Search, Discovery, Analytics]

•  User Needs:
   –  Real-time, ad hoc access to content
   –  Aggressive Prioritization based on Importance
   –  Serendipity
•  Batch processing isn’t enough
•  Search is built for multi-structured
•  Deeper analysis yields:
   –  Business insight into users
   –  Better Search and Discovery for users
  	
  
What do you need for SDA?

•  Fast, efficient, scalable search
   –  Bulk and Near Real Time Indexing
•  Large scale, cost effective storage
•  Large scale processing power
   –  Large scale and distributed for whole data consumption and analysis
   –  Sampling tools
   –  Distributed In Memory where appropriate
•  NLP and machine learning tools that scale to enhance discovery and analysis
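
The "Sampling tools" item can be illustrated with reservoir sampling, one common way to draw a uniform sample from a stream too large to hold in memory. This is a minimal single-machine sketch; the slides do not say which sampling method the platform uses, and the function name and `seed` parameter are assumptions made for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and only k items are kept in memory
    (classic "Algorithm R"). The seed only makes runs repeatable.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=10)
print(len(sample))
```

Because each item is seen exactly once, the same idea carries over to log files or HDFS streams where a second pass would be expensive.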
  	
  
Example Use Cases

•  Dark Data – Petabytes (and beyond) of content in storage with little insight into what’s in it
   –  Forensics, Intelligence Gathering, Risk analysis, etc.
•  Financial – Enable total customer view to better understand risks and opportunities
•  Medical – Extend research capabilities through deeper analysis of scientific data, publications and field usage
•  Social Media Monitoring – Understand and analyze social networks and their trends all the time, no matter the scale
•  Commerce – Drive more sales through metric driven search and discovery without the guesswork
  	
  
Announcing LucidWorks Big Data Beta

An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
  	
  
Architecture

[Figure: architecture diagram]
  	
  
Key Features of Beta

•  Combines the real time, ad hoc data accessibility of LucidWorks with compute and storage capabilities of Hadoop
•  Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
•  RESTful API supporting JSON input/output formats for easy integration
•  Full Stack - Minimizes the impact of provisioning Hadoop, LucidWorks and other components
•  Hosted in cloud and supported by Lucid Imagination
  	
  
APIs

•  Search and Indexing
   –  Full power of LucidWorks (Solr)
   –  Bulk and Near Real Time Indexing
   –  Sharded via SolrCloud
•  Workflows
   –  Predefined workflows ease common data tasks such as bulk indexing
•  Administration
   –  Access to key system information
   –  User management
•  Analytics
   –  Common search analytics for better understanding of relevancy based on log analysis
   –  Historical views
•  Machine Learning
   –  Clustering
   –  Statistically Interesting Phrases
   –  Future enhancements planned
•  Proxy APIs
   –  LucidWorks
   –  WebHDFS
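
To make "Statistically Interesting Phrases" concrete, one widely used way to score candidate phrases is Dunning's log-likelihood ratio over a 2x2 contingency table of bigram counts. The slides do not state which scoring the beta uses, so treat this as a sketch of the general technique; the function names and the toy token list are made up for illustration.

```python
import math
from collections import Counter


def x_log_x(x: float) -> float:
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: float) -> float:
    """Unnormalized entropy term used by the log-likelihood ratio."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11: int, k12: int, k21: int, k22: int) -> float:
    """k11: A and B together, k12: A without B, k21: B without A, k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def score_bigrams(tokens):
    """Score every adjacent word pair; higher means more strongly associated."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(a for a, _ in bigrams)
    second_counts = Counter(b for _, b in bigrams)
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = first_counts[a] - k11
        k21 = second_counts[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = log_likelihood_ratio(k11, k12, k21, k22)
    return scores


tokens = "solr cloud scales solr cloud shards".split()
scores = score_bigrams(tokens)
print(max(scores, key=scores.get))
```

On real corpora the counts would come from a distributed job rather than an in-memory `Counter`, but the scoring step is the same.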
  	
  
Under the Hood

LucidWorks 2.1

•  Lucene/Solr 4.0-dev
•  Sharded with SolrCloud
   –  1 second (default) soft commits for NRT updates
   –  1 minute (default) hard commits (no searcher reopen)
   –  Transaction logs for recovery
   –  Solr takes care of leader election, etc. so no more master/worker
•  See Mark Miller’s talk on SolrCloud

SDA Engine

•  RESTful services built on Restlet 2.1
•  Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator
•  Authentication and authorization over SSL (optional)
•  Proxies for LucidWorks and WebHDFS API
•  Workflow engine coordinates data flow
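
With soft and hard commits scheduled on the server, an indexing client can simply post documents and leave visibility to Solr. The sketch below builds (but does not send) such a request against Solr's JSON update handler. It assumes a Solr version whose `/update` handler accepts `application/json`; the host, collection name and document fields are hypothetical, and whether `commitWithin` results in a soft or hard commit depends on server configuration.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_update_request(solr_base: str, collection: str, docs: list,
                         commit_within_ms: Optional[int] = None) -> urllib.request.Request:
    """Build a POST request that adds documents to a Solr collection.

    No explicit commit is issued: the server's automatic soft/hard commits
    (or the optional commitWithin hint) decide when documents become visible.
    """
    url = f"{solr_base.rstrip('/')}/{collection}/update"
    if commit_within_ms is not None:
        url += "?" + urllib.parse.urlencode({"commitWithin": commit_within_ms})
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and collection, shown only to illustrate the request shape.
req = build_update_request(
    "http://localhost:8983/solr",
    "collection1",
    [{"id": "doc-1", "title": "Hello SolrCloud"}],
    commit_within_ms=1000,
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Avoiding a client-side commit per request is usually what keeps near-real-time indexing cheap, since each commit that reopens a searcher has a cost.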
  	
  
Under the Hood

•  Apache Hadoop
   –  Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system
   –  Leverage Pig and custom MR jobs for log processing and metric calculation
   –  WebHDFS
•  Apache Mahout
   –  K-Means Clustering
   –  Statistically Interesting Phrases
   –  More to come
•  Apache HBase
   –  Key-value and time series of all calculated metrics
•  Apache Pig
   –  ETL
   –  Log analysis -> HBase
•  Apache ZooKeeper
   –  Netflix Curator for service discovery and higher level ZK client
•  Apache Kafka
   –  Pub-sub for collecting logs from LucidWorks into HDFS
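
For readers unfamiliar with K-Means, the core loop is short enough to sketch on a single machine. Mahout runs the same assign-then-recompute steps as distributed jobs; this sketch is not Mahout's code, and the naive "first k points" initialization is an assumption chosen only to keep the example deterministic.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point], k: int,
            max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # naive deterministic initialization (assumption)
    assignments: List[int] = []
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return centroids, assignments


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)
```

In a MapReduce setting the assignment step maps naturally to mappers and the centroid recomputation to reducers, with one job per iteration.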
  	
  
The Road Ahead

•  Our approach is from search and discovery outwards to analytics
   –  Analytics in beta are focused around analysis of search logs
•  Analytics Themes
   –  Relevance
   –  Data quality
   –  Discovery
   –  Integration with other packages (R?)
•  Machine Learning
   –  Classification
   –  NLP
•  More analytics on the index itself?
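
As an illustration of what "analysis of search logs" can yield for the relevance theme, the sketch below derives a few simple metrics from query records. The record shape (`query`, `hits`, `clicked`), the sample data and the chosen metrics are all hypothetical; the slides do not list which metrics the beta computes.

```python
from collections import Counter
from typing import Dict, List


def summarize_search_log(records: List[Dict]) -> Dict:
    """Compute basic relevance signals from search log records.

    Each record is assumed to look like:
    {"query": str, "hits": int, "clicked": bool}
    """
    total = len(records)
    # Normalize queries so "Solr " and "solr" count as the same query.
    query_counts = Counter(r["query"].strip().lower() for r in records)
    zero_results = sum(1 for r in records if r["hits"] == 0)
    clicked = sum(1 for r in records if r["clicked"])
    return {
        "total_searches": total,
        "top_queries": query_counts.most_common(3),
        "zero_result_rate": zero_results / total if total else 0.0,
        "click_through_rate": clicked / total if total else 0.0,
    }


# Made-up sample records for illustration only.
log = [
    {"query": "solr", "hits": 10, "clicked": True},
    {"query": "Solr ", "hits": 5, "clicked": False},
    {"query": "hadoop", "hits": 0, "clicked": False},
    {"query": "mahout k-means", "hits": 3, "clicked": True},
]
print(summarize_search_log(log))
```

Queries with a high zero-result rate or a low click-through rate are typical starting points for relevance tuning.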
  	
  
Contacts

•  http://bit.ly/lucidworks-big-data
•  http://www.lucidimagination.com
•  grant@lucidimagination.com
•  @gsingers
  	
  

KEYNOTE: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop

  • 1. Search              Discover              Analyze   Enabling  Scalable  Search,  Discovery   and  Analy6cs  with  Solr,  Mahout  and   Hadoop   Grant  Ingersoll   Chief  Scien:st   Lucid  Imagina:on              |    1    
  • 2. We  All  Know  the  Pain   l  ________  data  growth  in  the  next  ___  days/months/years   –  Many  es:mate  80-­‐90%  of  data  is  “unstructured”  (mul:-­‐structured?)   l  The  Age  of  “Data  Paranoia”   –  What  if  I  don’t  collect  it  all?   –  What  if  I  miss  something  or  lose  something?   –  What  if  I  can’t  store  it  long  enough?   –  How  do  I  secure  it?   –  Can  I  afford  to  do  any  of  this?    Can  I  afford  not  to?   –  What  if  I  can’t  make  sense  of  it?              |    2    
  • 3. Big  Data  Premise  and  Promise   Premise Promise Large Scale Data Collection/Storage ✔ Prevents Data Loss ✔ Long Term Storage ✔ Affordable ✔ New Science Delivering New Insights ?            |    3    
  • 4. Why  Search,  Discovery  and  Analy;cs  (SDA)?   l  User  Needs:   –  Real-­‐:me,  ad  hoc  access  to  content   Search –  Aggressive  Priori:za:on  based  on  Importance   –  Serendipity   l  Batch  processing  isn’t  enough   l  Search  is  built  for  mul:-­‐structured   Analytics Discovery l  Deeper  analysis  yields:   –  Business  insight  into  users   –  Beaer  Search  and  Discovery  for  users              |    4    
  • 5. What  do  you  need  for  SDA?   l  Fast, efficient, scalable search –  Bulk and Near Real Time Indexing l  Large scale, cost effective storage l  Large scale processing power –  Large scale and distributed for whole data consumption and analysis –  Sampling tools –  Distributed In Memory where appropriate l  NLP and machine learning tools that scale to enhance discovery and analysis            |    5    
  • 6. Example  Use  Cases   l  Dark  Data  –  Petabytes  (and  beyond)  of  content  in  storage  with  liale  insight   into  what’s  in  it   –  Forensics,  Intelligence  Gathering,  Risk  analysis,  etc.   l  Financial  –  Enable  total  customer  view  to  beaer  understand  risks  and   opportuni:es   l  Medical  –  Extend  research  capabili:es  through  deeper  analysis  of  both   scien:fic  data,  publica:ons  and  field  usage   l  Social  Media  Monitoring  –  Understand  and  analyze  social  networks  and   their  trends  all  the  :me,  no  maaer  the  scale   l  Commerce  –  Drive  more  sales  through  metric  driven  search  and  discovery   without  the  guesswork              |    6    
  • 7. Announcing LucidWorks Big Data Beta
      An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
  • 8. Architecture
  • 9. Key Features of Beta
      • Combines the real time, ad hoc data accessibility of LucidWorks with compute and storage capabilities of Hadoop
      • Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
      • RESTful API supporting JSON input/output formats for easy integration
      • Full Stack - Minimizes the impact of provisioning Hadoop, LucidWorks and other components
      • Hosted in cloud and supported by Lucid Imagination
  • 10. APIs
      • Search and Indexing
          – Full power of LucidWorks (Solr)
          – Bulk and Near Real Time Indexing
          – Sharded via SolrCloud
      • Workflows
          – Predefined workflows ease common data tasks such as bulk indexing
      • Administration
          – Access to key system information
          – User management
      • Analytics
          – Common search analytics for better understanding of relevancy based on log analysis
          – Historical views
      • Machine Learning
          – Clustering
          – Statistically Interesting Phrases
          – Future enhancements planned
      • Proxy APIs
          – LucidWorks
          – WebHDFS
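The slides do not say how Statistically Interesting Phrases are scored. A common approach for ranking word pairs is Dunning's log-likelihood ratio (LLR) over a 2x2 contingency table of bigram counts, so the sketch below assumes that technique purely for illustration. The function names `llr` and `score_bigrams` are hypothetical and are not part of any LucidWorks or Mahout API.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (a, b); k12: (a, not b); k21: (not a, b); k22: neither.
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for observed, row, col in cells:
        # Empty cells contribute nothing (0 * log 0 is taken as 0).
        if observed > 0:
            total += observed * math.log(observed * n / (row * col))
    return 2.0 * total


def score_bigrams(tokens):
    """Rank adjacent word pairs by LLR, highest first."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    first = Counter()   # how often each word starts a bigram
    second = Counter()  # how often each word ends a bigram
    for (a, b), count in bigrams.items():
        first[a] += count
        second[b] += count
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `score_bigrams(text.split())` on tokenized content returns `((word1, word2), score)` pairs. A high score means the two words co-occur far more (or far less) often than independence would predict, so a production system would typically also filter by minimum count.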
  • 11. Under the Hood
      LucidWorks 2.1
      • Lucene/Solr 4.0-dev
      • Sharded with SolrCloud
          – 1 second (default) soft commits for NRT updates
          – 1 minute (default) hard commits (no searcher reopen)
          – Transaction logs for recovery
          – Solr takes care of leader election, etc. so no more master/worker
      • See Mark Miller’s talk on SolrCloud
      SDA Engine
      • RESTful services built on Restlet 2.1
      • Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator
      • Authentication and authorization over SSL (optional)
      • Proxies for LucidWorks and WebHDFS API
      • Workflow engine coordinates data flow
  • 12. Under the Hood
      • Apache Hadoop
          – Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system
          – Leverage Pig and custom MR jobs for log processing and metric calculation
          – WebHDFS
      • Apache Mahout
          – K-Means Clustering
          – Statistically Interesting Phrases
          – More to come
      • Apache HBase
          – Key-value and time series of all calculated metrics
      • Apache Pig
          – ETL
          – Log analysis -> HBase
      • Apache ZooKeeper
          – Netflix Curator for service discovery and higher level ZK client
      • Apache Kafka
          – Pub-sub for collecting logs from LucidWorks into HDFS
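For readers unfamiliar with the K-Means clustering listed under Mahout, here is a minimal single-machine sketch of the underlying idea (Lloyd's algorithm). It is not Mahout's distributed MapReduce implementation. The function names, the random seeding strategy and the fixed iteration cap are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Seed the centroids with k distinct input points (an assumed strategy).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return centroids, assignments
```

In a Hadoop setting the assignment step maps naturally onto mappers and the centroid update onto reducers, which is roughly why the algorithm scales out; the sketch above keeps everything in memory for clarity.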
  • 13. The Road Ahead
      • Our approach is from search and discovery outwards to analytics
          – Analytics in beta are focused around analysis of search logs
      • Analytics Themes
          – Relevance
          – Data quality
          – Discovery
          – Integration with other packages (R?)
      • Machine Learning
          – Classification
          – NLP
      • More analytics on the index itself?
  • 14. Contacts
      • http://bit.ly/lucidworks-big-data
      • http://www.lucidimagination.com
      • grant@lucidimagination.com
      • @gsingers