Migration from FAST ESP to
        Lucene Solr
       Presented by Michael McIntosh
   michaelm@tnrglobal.com, Oct 19th, 2011
What will we cover?
Core Aspects of ESP to Solr Migration
           Migration Overview
           Crawling Content
           Processing Content
           Searching Content
           Scaling for Growth
           Questions?
                           © 2011 TNR Global, LLC.
Who am I?

• 7+ Years FAST ESP
• 10+ Years in Search
• 15+ Years in Software
• Early Lycos Developer
• I also develop brain-computer interfaces :)
Who are we?

• 7+ Years in Search
• 15+ Years in Web Dev
• 30+ Years in Software
• Focus on ESP, Solr, Lucene, and the Cloud
• Scalable Web & Search Solution Experts
Migration Overview


Migration Challenges

• Our clients depend on ESP 5.3
• No future support for Linux ESP
• We need a viable exit strategy
• We want a fairly painless approach
• How do we provide an alternative?
Migration Use Case

   Federated Product Search
   ...millions of parts and services...

• XML documents (highly-structured)
• PDF documents (semi-structured)
• HTML documents (unstructured)

Our Approach
    Solr Search Platform (SolrSP)
• Custom Scalable Crawler using Heritrix
• Events & Queues managed with RabbitMQ
• Caching & Persistence supported via Riak
• Python pipeline replacement using Pypes
• Advanced Linguistics via NLTK or Rosette
Crawling Content


Crawling for ESP

• For XML content, our scripts query a
  service, download resources and feed
• For PDF content, our scripts query a
  database, download PDF URLs and feed
• For HTML, our scripts query a database,
  download seed URLs and launch ESP’s
  Enterprise Crawler

Crawling for Solr

• For XML & PDF content, the approach
  remains the same with a different writer
• We tried the Nutch crawler, but found it
  challenging to make it do what we needed
• We tried the Lucid Works bundled crawler, but
  found the exposed functionality did not
  offer the level of flexibility we needed
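
The "different writer" for XML and PDF content could be as small as a function that serializes an already-extracted document into Solr's XML update format. A minimal sketch, assuming hypothetical field names (`id`, `title`, `content_type`); the actual schema and feeding scripts are not shown in the talk.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize dicts into a Solr <add> update message.

    Field names come straight from the dict keys; list values become
    repeated <field> elements (multi-valued fields).
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical part record, for illustration only.
example = to_solr_add_xml([
    {"id": "part-001", "title": "Hex bolt M8", "content_type": ["xml", "product"]},
])
```

The resulting string can be POSTed to Solr's update handler, followed by a commit.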

Crawling with Heritrix

• Heritrix, created by the Internet Archive,
  supports much of the same functionality
  that the ESP Enterprise Crawler provides
• We wrapped Heritrix to provide a higher
  level interface for service management
• Made it scalable and added document
  caching via Riak to support refresh crawling
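
One way document caching can support refresh crawling is to store a digest of each fetched page and skip reprocessing when a later crawl returns identical content. A minimal sketch, with a plain dict standing in for a Riak bucket; the function name and key layout are hypothetical and are not taken from the Heritrix wrapper described here.

```python
import hashlib


def needs_reprocessing(cache, url, body):
    """Return True if `body` differs from the cached copy of `url`.

    `cache` is any dict-like store mapping URL -> content digest
    (a stand-in for a Riak bucket in this sketch).
    """
    digest = hashlib.sha256(body).hexdigest()
    if cache.get(url) == digest:
        return False          # unchanged since the last crawl
    cache[url] = digest       # new or changed: remember the new version
    return True


cache = {}
first = needs_reprocessing(cache, "http://example.com/part/1", b"<html>v1</html>")
repeat = needs_reprocessing(cache, "http://example.com/part/1", b"<html>v1</html>")
changed = needs_reprocessing(cache, "http://example.com/part/1", b"<html>v2</html>")
```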

Crawler Architecture
     Crawl Job        Crawler
      Request         Manager



                   Queue Cluster
                    (RabbitMQ)



      Heritrix        Heritrix          Heritrix
     Messenger       Messenger         Messenger



      Heritrix        Heritrix          Heritrix
      Crawler         Crawler           Crawler



                 Persistence Cluster
                        (Riak)
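
One plausible reading of this diagram: a manager enqueues crawl jobs, messenger workers pull them off the queue and hand them to a crawler, and results land in a shared store. The sketch below only illustrates that flow. `queue.Queue` stands in for RabbitMQ, a dict for Riak, and `fake_crawl` for Heritrix; none of these names come from the actual system.

```python
import queue
import threading

jobs = queue.Queue()      # stand-in for the RabbitMQ queue cluster
store = {}                # stand-in for the Riak persistence cluster
store_lock = threading.Lock()


def fake_crawl(seed_url):
    """Placeholder for a Heritrix crawl; returns a fake result."""
    return {"seed": seed_url, "status": "done"}


def messenger():
    """Consume crawl jobs until a None sentinel arrives."""
    while True:
        seed_url = jobs.get()
        if seed_url is None:
            break
        result = fake_crawl(seed_url)
        with store_lock:
            store[seed_url] = result


workers = [threading.Thread(target=messenger) for _ in range(3)]
for w in workers:
    w.start()

# The crawler manager turns a crawl job request into queued work.
for seed in ["http://example.com/a", "http://example.com/b"]:
    jobs.put(seed)
for _ in workers:
    jobs.put(None)        # one shutdown sentinel per worker
for w in workers:
    w.join()
```

Because workers only share the queue and the store, adding crawl capacity in this model means starting more workers.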



Processing Content


Processing for ESP
  ESP Processing is document-centric
• For XML, we transform, tag metadata,
  classify content before indexing
• For PDF, we split pages, generate
  thumbnails, tag metadata and classify before
  indexing
• For HTML, we normalize, clean content,
  tag metadata and classify before indexing

Processing for Solr
     Solr Processing is field-centric
• Solr analyzers work on a field by field basis
  and lack the flexible workflow ESP provides
• Using some Solr analyzers for now, but
  evaluating alternatives (Rosette, NLTK)
• Hadoop + Cascading looks promising
• We use Stackless Python with Pypes to
  make ESP stage migration less painful
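
The document-centric model that ESP stages follow, and that a Python pipeline can preserve ahead of Solr, amounts to passing the whole document through a chain of functions before indexing. A minimal sketch in plain Python; it does not use the Pypes API, and the stage names are hypothetical.

```python
import re


def clean_html(doc):
    """Strip tags from the raw body (deliberately naive)."""
    doc["body"] = re.sub(r"<[^>]+>", " ", doc["raw"]).strip()
    return doc


def tag_metadata(doc):
    """Derive one field from another: easy per document, awkward per field."""
    doc["word_count"] = len(doc["body"].split())
    return doc


def classify(doc):
    """Toy classifier based on a keyword in the cleaned body."""
    doc["category"] = "fastener" if "bolt" in doc["body"].lower() else "other"
    return doc


def run_pipeline(doc, stages):
    """Apply each stage in order; every stage sees the whole document."""
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline({"raw": "<p>Hex <b>bolt</b> M8</p>"},
                   [clean_html, tag_metadata, classify])
```

A Solr analyzer, by contrast, only sees the value of the single field it is attached to, so a stage like `tag_metadata` has no direct analyzer equivalent.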
Processing with Pypes
              •   Written in Python

              •   Easy stage migration

              •   Very flexible & robust

              •   Branching & Merging

              •   Single Input, Many
                  Outputs

              •   Trivial to embed and
                  extend
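
Branching, merging, and "single input, many outputs" can be illustrated without any framework. The sketch below is plain Python, not Pypes code. It routes documents by type, lets the PDF branch emit one document per page, and merges everything back into one stream. The step lists echo the ESP processing described earlier; the function names and the `pages` field are hypothetical.

```python
def process_xml(doc):
    doc["steps"] = ["transform", "tag", "classify"]
    return [doc]


def process_pdf(doc):
    # Single input, many outputs: one PDF becomes one document per page.
    return [{"id": "%s-p%d" % (doc["id"], n), "type": "pdf",
             "steps": ["split", "thumbnail", "tag", "classify"]}
            for n in range(1, doc["pages"] + 1)]


def process_html(doc):
    doc["steps"] = ["normalize", "clean", "tag", "classify"]
    return [doc]


BRANCHES = {"xml": process_xml, "pdf": process_pdf, "html": process_html}


def route(docs):
    """Branch on document type, then merge all outputs into one stream."""
    merged = []
    for doc in docs:
        merged.extend(BRANCHES[doc["type"]](doc))
    return merged


out = route([
    {"id": "a", "type": "xml"},
    {"id": "b", "type": "pdf", "pages": 2},
    {"id": "c", "type": "html"},
])
```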

Processor Migration

                ...From ESP




Processor Migration

                ...to Pypes




Searching Content


Feature Differences
•   ESP has robust faceting support but facets must be
    defined at index time, unlike Solr faceting

•   Solr does most of the heavy lifting at query time,
    which allows for more flexible approaches

•   Solr now directly supports taxonomy (hierarchical)
    faceting functionality (for drill down categories)

•   Solr now supports field collapsing which we use
    heavily in our ESP installation to collapse result sets

•   ESP to Solr schema mapping is fairly straightforward
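
Because Solr resolves facets and field collapsing at query time, both are requested through parameters on the search request rather than through index-time configuration. A minimal sketch that builds such a request; the host, the query, and the field names (`manufacturer`, `category`, `product_family`) are hypothetical.

```python
from urllib.parse import urlencode

params = [
    ("q", "hex bolt"),
    ("facet", "true"),
    ("facet.field", "manufacturer"),    # hypothetical field names
    ("facet.field", "category"),
    ("group", "true"),                  # field collapsing (result grouping)
    ("group.field", "product_family"),
    ("wt", "json"),
]
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string
```

Changing which fields are faceted only requires changing the request, not reindexing.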

Search Interface
•   Solr has no direct equivalent to FAST Query
    Language (FQL) but function queries look like a
    possible option for complex queries

•   If you don’t have overly complex queries, the
    edismax query parser looks like a good option

•   Solr doesn’t have an easily extendable search-front
    component like ESP, but we like TwigKit for that

•   Default Solr stemmer isn’t as good as the ESP
    lemmatizer, so if you need good lemmatization
    consider Rosette Linguistics Platform or NLTK
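
For queries that are not overly complex, an edismax request mostly comes down to choosing which fields to search and how to weight them. A minimal sketch that builds the request parameters; the field names, boosts, and minimum-match value are illustrative assumptions, not settings from this migration.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field_boosts, min_match="2<75%"):
    """Build request parameters for Solr's edismax query parser."""
    qf = " ".join("%s^%s" % (f, b) for f, b in field_boosts)
    return urlencode({
        "q": user_query,
        "defType": "edismax",
        "qf": qf,            # fields to search, with per-field boosts
        "mm": min_match,     # minimum-should-match rule
    })


qs = edismax_params("stainless hex bolt", [("title", 3), ("description", 1)])
```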

Scaling for Growth


About the hardware...
• Solr allows you to use the familiar rows /
  columns layout ESP uses
• Add shards to scale content, add search
  slaves to scale queries
• We’re currently using a master/slave indexer/search
  setup, but options are numerous
• We’re developing a solution to support
  scaling at will, a pain point for ESP as well
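
Scaling content across shards involves two pieces: deciding which shard indexes a document, and telling Solr which shards to query. A minimal sketch under stated assumptions: the host names are hypothetical, and the modulo routing is a generic technique rather than the setup used here.

```python
import zlib

# Hypothetical shard hosts, one per index "column".
SHARDS = ["solr1:8983/solr", "solr2:8983/solr", "solr3:8983/solr"]


def shard_for(doc_id, shards=SHARDS):
    """Pick an index shard from a stable hash of the document ID."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


def shards_param(shards=SHARDS):
    """Value for Solr's `shards` request parameter (distributed search)."""
    return ",".join(shards)


target = shard_for("part-001")
param = shards_param()
```

A modulo scheme like this remaps most documents when a shard is added, which is one reason scaling at will takes more than a routing function.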

It's not just hardware...
• Use Fabric to automate cluster installs, data
  builds and deployment tasks
• Use Jenkins to automate, manage and track
  Fabric tasks
• Use Supervisor to manage multiple services
  running on each node
• Use Lucid Works for better out-of-the-box
  stemming, alerts, services and support

Migration In a Nutshell

•   We now consider Solr robust enough to be a
    viable replacement for a FAST ESP solution

•   You supply the glue, or work with someone like us
    to tie the different components together

•   If you have many custom pipeline stages, consider
    using Pypes to ease your initial ESP migration

•   Fully supported versions of Solr are available via
    Lucid Works, with the latest cutting-edge features

Resources
 Lucid Works   http://www.lucidimagination.com/
   Rosette     http://www.basistech.com/lucene/
   Heritrix    http://crawler.archive.org/
   TwigKit     http://twigkit.com/
     Pypes     https://bitbucket.org/diji/pypes/
      Riak     http://basho.com/
     NLTK      http://www.nltk.org/
  RabbitMQ     http://www.rabbitmq.com/
  Cascading    http://www.cascading.org/
     Fabric    http://fabfile.org/
    Jenkins    http://jenkins-ci.org/
  Supervisor   http://supervisord.org/

Questions?
• Contact Us!
 • Website: http://www.tnrglobal.com
 • E-Mail: fast2solr@tnrglobal.com
 • Phone: 001-413-425-1499

 Thank you for your time!

Migration from FAST ESP to Lucene Solr

  • 1. Migration from FAST ESP to Lucene Solr Presented by Michael McIntosh michaelm@tnrglobal.com, Oct 19th, 2011
  • 2. What will we cover? Core Aspects of ESP to Solr Migration • Migration Overview • Crawling Content • Processing Content • Searching Content • Scaling for Growth • Questions? © 2011 TNR Global, LLC.
  • 3. Who am I? • 7+ Years FAST ESP • 10+ Years in Search • 15+ Years in Software • Early Lycos Developer • I also develop brain-computer interfaces :)
  • 4. Who are we? • 7+ Years in Search • 15+ Years in Web Dev • 30+ Years in Software • Focus on ESP, Solr, Lucene, and the Cloud • Scalable Web & Search Solution Experts
  • 5. Migration Overview
  • 6. Migration Challenges • Our clients depend on ESP 5.3 • No future support for Linux ESP • We need a viable exit strategy • We want a fairly painless approach • How do we provide an alternative?
  • 7. Migration Use Case Federated Product Search ...millions of parts and services... • XML documents (highly-structured) • PDF documents (semi-structured) • HTML documents (unstructured)
  • 8. Our Approach Solr Search Platform (SolrSP) • Custom Scalable Crawler using Heritrix • Events & Queues managed with RabbitMQ • Caching & Persistence supported via Riak • Python pipeline replacement using Pypes • Advanced Linguistics via NLTK or Rosette
  • 9. Crawling Content
  • 10. Crawling for ESP • For XML content, our scripts query a service, download resources and feed • For PDF content, our scripts query a database, download PDF URLs and feed • For HTML, our scripts query a database, download seed URLs and launch ESP’s Enterprise Crawler
  • 11. Crawling for Solr • For XML & PDF content, the approach remains the same with a different writer • We tried the Nutch crawler, but found it challenging to make it do what we needed • We tried the crawler bundled with Lucid Works, but found the exposed functionality did not offer the level of flexibility we needed
  • 12. Crawling with Heritrix • Heritrix, created by the Internet Archive, supports much of the same functionality that the ESP Enterprise Crawler provides • We wrapped Heritrix to provide a higher level interface for service management • We made it scalable and added document caching via Riak to support refresh crawling
  • 13. Crawler Architecture [Diagram: Crawl Job Request, Crawler Manager, Queue Cluster (RabbitMQ), three Heritrix Messengers, three Heritrix Crawlers, Persistence Cluster (Riak)]
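
The crawler design pairs a work queue with a persistent document cache so that a refresh crawl can skip pages that have not changed. The sketch below illustrates only that idea, in a single process. It assumes `queue.Queue` as a stand-in for the RabbitMQ cluster, a plain dict as a stand-in for Riak, and a hypothetical `fetch` callable in place of Heritrix; none of these names come from the SolrSP code.

```python
import hashlib
import queue


def refresh_crawl(urls, fetch, cache):
    """Return the URLs whose content changed since the last crawl.

    urls  -- iterable of URLs to (re)crawl
    fetch -- hypothetical callable mapping a URL to its body text
             (stands in for a Heritrix crawler)
    cache -- dict mapping URL to content digest (stands in for Riak)
    """
    jobs = queue.Queue()  # stands in for the RabbitMQ queue cluster
    for url in urls:
        jobs.put(url)

    changed = []
    while not jobs.empty():
        url = jobs.get()
        digest = hashlib.sha256(fetch(url).encode("utf-8")).hexdigest()
        if cache.get(url) != digest:
            # New or modified page: remember it and pass it downstream.
            cache[url] = digest
            changed.append(url)
        jobs.task_done()
    return changed
```

On a first pass every URL is reported as changed; a second pass over unchanged content reports nothing, which is what keeps a refresh crawl cheap.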
  • 14. Processing Content
  • 15. Processing for ESP ESP Processing is document-centric • For XML, we transform, tag metadata and classify content before indexing • For PDF, we split pages, generate thumbnails, tag metadata and classify before indexing • For HTML, we normalize, clean content, tag metadata and classify before indexing
  • 16. Processing for Solr Solr Processing is field-centric • Solr analyzers work on a field by field basis and lack the flexible workflow ESP provides • We use some Solr analyzers for now, but are evaluating alternatives (Rosette, NLTK) • Hadoop + Cascading looks promising • We use Stackless Python with Pypes to make ESP stage migration less painful
  • 17. Processing with Pypes • Written in Python • Easy stage migration • Very flexible & robust • Branching & Merging • Single Input, Many Outputs • Trivial to embed and extend
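
To make the document-centric, "single input, many outputs" style concrete, here is a minimal sketch in plain Python. It is not the Pypes API: the stage functions, the `run_pipeline` helper and the document fields (`id`, `body`, `pages`) are all hypothetical names chosen for illustration.

```python
def normalize(doc):
    """Collapse runs of whitespace in the body; one input, one output."""
    out = dict(doc)
    out["body"] = " ".join(doc.get("body", "").split())
    yield out


def split_pages(doc):
    """Emit one document per page: a single input with many outputs."""
    for number, text in enumerate(doc.get("pages", []), start=1):
        yield {"id": "%s#%d" % (doc["id"], number), "body": text}


def run_pipeline(docs, stages):
    """Feed every output of one stage into the next stage."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs
```

Because each stage is a generator over whole documents, a stage can drop, rewrite or multiply documents, which is the kind of workflow that per-field Solr analyzers do not cover.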
  • 18. Processor Migration ...From ESP
  • 19. Processor Migration ...to Pypes
  • 20. Searching Content
  • 21. Feature Differences • ESP has robust faceting support but facets must be defined at index time, unlike Solr faceting • Solr does most of the heavy lifting at query time, which allows for more flexible approaches • Solr now directly supports taxonomy (hierarchical) faceting functionality (for drill down categories) • Solr now supports field collapsing, which we use heavily in our ESP installation to collapse result sets • ESP to Solr schema mapping is fairly straightforward
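
Since Solr facets and collapses at query time, both features are requested through query parameters rather than index configuration. The sketch below builds such a query string with Solr's `facet`, `facet.field`, `group` and `group.field` parameters. The field names (`manufacturer`, `category`, `part_family`) and the `product_query` helper are assumptions for illustration, not taken from the deck.

```python
from urllib.parse import urlencode


def product_query(text, facet_fields, collapse_field):
    """Build a Solr query string with query-time facets and field collapsing."""
    params = [
        ("q", text),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # One facet.field parameter per field; no index-time facet definition needed.
    params += [("facet.field", field) for field in facet_fields]
    # Result grouping is Solr's name for field collapsing.
    params += [("group", "true"), ("group.field", collapse_field)]
    return urlencode(params)
```

The resulting string would be appended to a `/select` request on whichever Solr core holds the product index.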
  • 22. Search Interface • Solr has no direct equivalent to FAST Query Language (FQL) but function queries look like a possible option for complex queries • If you don’t have overly complex queries, the edismax query parser looks like a good option • Solr doesn’t have an easily extendable search-front component like ESP, but we like TwigKit for that • Default Solr stemmer isn’t as good as the ESP lemmatizer, so if you need good lemmatization consider Rosette Linguistics Platform or NLTK
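
For simple user queries, the edismax parser is selected with `defType=edismax` and tuned with parameters such as `qf` (fields and boosts) and `mm` (minimum should match). A minimal sketch of building that request follows; the `edismax_query` helper, the field names and the default `mm` value are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def edismax_query(user_text, field_boosts, min_match="2<75%"):
    """Build a Solr query string that routes free text through edismax.

    field_boosts -- dict mapping field name to boost, e.g. {"title": 2}
    min_match    -- value for Solr's mm parameter (illustrative default)
    """
    qf = " ".join("%s^%s" % (field, boost) for field, boost in field_boosts.items())
    return urlencode({
        "q": user_text,
        "defType": "edismax",
        "qf": qf,
        "mm": min_match,
    })
```

Queries that relied on FQL operators would still need to be translated by hand; this only covers the "not overly complex" case.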
  • 23. Scaling for Growth
  • 24. About the hardware... • Solr allows you to use the familiar rows / columns layout ESP uses • Add shards to scale content, add search slaves to scale queries • We’re currently using a master/slave indexer/search setup, but options are numerous • We’re developing a solution to support scaling at will, a pain point for ESP as well
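
Scaling content by adding shards has two sides: deciding which shard receives each document, and asking Solr to search all shards at once via the `shards` request parameter. The sketch below assumes the indexing client makes the routing decision with a stable hash of the document id; the helper names and the host names in the usage are hypothetical.

```python
import hashlib
from urllib.parse import urlencode


def shard_for(doc_id, shards):
    """Pick a shard for a document with a stable hash of its id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distributed_query(text, shards):
    """Build parameters that ask one node to fan the query out to all shards."""
    return urlencode({"q": text, "shards": ",".join(shards)})
```

Note that a simple modulo scheme reassigns most documents when the shard count changes, so growing the cluster this way generally means reindexing.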
  • 25. It’s not just hardware... • Use Fabric to automate cluster installs, data builds and deployment tasks • Use Jenkins to automate, manage and track Fabric tasks • Use Supervisor to manage multiple services running on each node • Use Lucid Works for better out-of-the-box stemming, alerts, services and support
  • 26. Migration In a Nutshell • We now consider Solr robust enough to be a viable replacement for a FAST ESP solution • You supply the glue, or work with someone like us to tie the different components together • If you have many custom pipeline stages, consider using Pypes to ease your initial ESP migration • Fully supported versions of Solr with the latest cutting-edge features are available via Lucid Works
  • 27. Resources • Lucid Works: http://www.lucidimagination.com/ • Rosette: http://www.basistech.com/lucene/ • Heritrix: http://crawler.archive.org/ • TwigKit: http://twigkit.com/ • Pypes: https://bitbucket.org/diji/pypes/ • Riak: http://basho.com/ • NLTK: http://www.nltk.org/ • RabbitMQ: http://www.rabbitmq.com/ • Cascading: http://www.cascading.org/ • Fabric: http://fabfile.org/ • Jenkins: http://jenkins-ci.org/ • Supervisor: http://supervisord.org/
  • 28. Questions? • Contact Us! • Website: http://www.tnrglobal.com • E-Mail: fast2solr@tnrglobal.com • Phone: 001-413-425-1499 Thank you for your time!