SlideShare une entreprise Scribd logo
1  sur  17
Populating your Search Index
NEST Meetup, 2016-01
5 Presentations
 Indexing Considerations, Pipelines, and Apache NiFi
 A Proposal for a Document Pipeline
 How we do it at TIAA-CREF with Solr
 How we do it at DRG with Solr
 Logstash and Beats with ElasticSearch
Indexing Considerations
Indexing considerations to think about when building out a
search platform
What do I mean?
 How do you plan to get data into the index
(Solr/ES/…)?
 Backups?
 Schedule & Monitor?
 Realtime search requirements?
 What software? (pipelines, crawlers, …)
Crawling?
 Common in the “enterprise search” space
 What crawler will you use?
 Nutch is well-known but too complex for smaller scale
jobs
 Many more exist.
 Security access control metadata to federate?
 Try ManifoldCF which excels at this.
Bulk indexing
 Plan for a “bulk reindex” use-case
 When changing schemas / ingestion extraction rules
 Or recovering when there’s no backup
 Not having a backup is typical; esp. if re-indexing is fast
 Optimize settings for this to be fast
 May need to toggle after ingestion into “normal” settings
 Use multiple machines during indexing (e.g. via hadoop)?
 “Optimize” (merge) Lucene segments at the end?
Incremental indexing
 (adding new/updated content)
 Detect deletes how?
 A: Flag for removal upstream before eventually removing
 B: Track all IDs somewhere; find the ones that went
missing
 Maybe don’t need to synchronize deletes until off-hours?
 Realtime Indexing, separate?
Backups (DR: Disaster Recovery)
 Scenario:
 Admin accidentally deleted 30k random docs; oh %#?!
 Not solved by replication/redundancy
 Useful in other scenarios, like testing
 Might not need it; especially if bulk re-indexing is fast
 Take Snapshots (e.g. AWS, or via the search
system, or…)
 Recovery: Deploy snapshot then sync it back up to date.
 Solr: see BloomReach’s “HAFT” project
Document Transformations
 Mapping source data (e.g. HTML doc or database
record) to a search document
 Examples:
 Text from PDF extraction
 Enrichment (e.g. Named Entity Recognition)
 Text pre-processing before search platform gets it
 Merging multiple data sources; joining
 Home-grown or use an existing ETL / “pipeline”?
 Do some of this directly on the search platform?
Schedule, Monitor
 How will a bulk index be triggered? Incremental
index?
 Unix Cron? Basic but crude.
 A Web UI to control this is great.
 A CI server (e.g. Jenkins) can work! (web, logs, alerting)
 Monitor/alert for problems?
 Perhaps via general log monitoring (e.g. ELK)
Open-Source ETL Software
A summary of an investigation I did on open-source
options in 2013.
ETL Software
 Extract Transform Load – a general idea
 Software that calls itself ETL tends to be very similar.
 Clover ETL
 Pentaho Data Integration, AKA Kettle
 Talend Open Studio, Data Integration
Common features
 Two are GPL/LGPL, Talend is Apache
 Fremium model – pay for “enterprise” features
 The Good: (in a word, mature)
 GUI wire diagram builder
 Books / resources
 The Bad:
 Text-editing the pipeline not recommended: thus need
GUI
 Poor community
 Data model is table-like; no native multi-valued fields
Talend screenshot
Apache NiFi
“is an easy to use, powerful, and reliable system to
process and distribute data.”
Apache Nifi overview
 Web-based UI
 Runtime modification of flow control
 Data provenance features
 Extensible (of course)
 Security, role based access control

Contenu connexe

Tendances

Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceTao Feng
 
Data munging and analysis
Data munging and analysisData munging and analysis
Data munging and analysisRaminder Singh
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...
Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...
Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...Ryft
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Databricks
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchLucidworks (Archived)
 
Intro elasticsearch taswarbhatti
Intro elasticsearch taswarbhattiIntro elasticsearch taswarbhatti
Intro elasticsearch taswarbhattiTaswar Bhatti
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemasterAthemaster Co., Ltd.
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...Nicolas Kourtellis
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010Ian Foster
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterKamran Munshi
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleBharvi Dixit
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)Nicolas Kourtellis
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingSpark Summit
 

Tendances (20)

Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
 
Data munging and analysis
Data munging and analysisData munging and analysis
Data munging and analysis
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...
Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...
Expanding Elastic: Learn how anyone can leverage heterogeneous compute to ext...
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
Intro elasticsearch taswarbhatti
Intro elasticsearch taswarbhattiIntro elasticsearch taswarbhatti
Intro elasticsearch taswarbhatti
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemaster
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
 
poster draft 5
poster draft 5poster draft 5
poster draft 5
 

En vedette

Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutchsebastian_nagel
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm CrawlerJulien Nioche
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Digipak vs Jewel pack
Digipak vs Jewel packDigipak vs Jewel pack
Digipak vs Jewel packHydeK
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDavid Gil Sánchez
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaPaolo Mottadelli
 

En vedette (20)

Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
Search engine
Search engineSearch engine
Search engine
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Digipak vs Jewel pack
Digipak vs Jewel packDigipak vs Jewel pack
Digipak vs Jewel pack
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 

Similaire à Populate your Search index, NEST 2016-01

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTurkish Testing Board
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.pptRutujaPatil247341
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataTrieu Nguyen
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Apache Solr - An Experience Report
Apache Solr - An Experience ReportApache Solr - An Experience Report
Apache Solr - An Experience ReportNetcetera
 
Filebeat Elastic Search Presentation.pptx
Filebeat Elastic Search Presentation.pptxFilebeat Elastic Search Presentation.pptx
Filebeat Elastic Search Presentation.pptxKnoldus Inc.
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduceGreg Hanchin
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Better integrations through open interfaces
Better integrations through open interfacesBetter integrations through open interfaces
Better integrations through open interfacesSteve Speicher
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 

Similaire à Populate your Search index, NEST 2016-01 (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Apache Solr - An Experience Report
Apache Solr - An Experience ReportApache Solr - An Experience Report
Apache Solr - An Experience Report
 
Filebeat Elastic Search Presentation.pptx
Filebeat Elastic Search Presentation.pptxFilebeat Elastic Search Presentation.pptx
Filebeat Elastic Search Presentation.pptx
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduce
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Better integrations through open interfaces
Better integrations through open interfacesBetter integrations through open interfaces
Better integrations through open interfaces
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 

Plus de David Smiley

2020-02 Solr's New Plugin System
2020-02 Solr's New Plugin System2020-02 Solr's New Plugin System
2020-02 Solr's New Plugin SystemDavid Smiley
 
H-Hypermap Heatmap Analytics at Scale
H-Hypermap Heatmap Analytics at ScaleH-Hypermap Heatmap Analytics at Scale
H-Hypermap Heatmap Analytics at ScaleDavid Smiley
 
2016-01 Lucene Solr spatial in 2015, NYC Meetup
2016-01 Lucene Solr spatial in 2015, NYC Meetup2016-01 Lucene Solr spatial in 2015, NYC Meetup
2016-01 Lucene Solr spatial in 2015, NYC MeetupDavid Smiley
 
Lucene/Solr spatial in 2015
Lucene/Solr spatial in 2015Lucene/Solr spatial in 2015
Lucene/Solr spatial in 2015David Smiley
 
2014 11 lucene spatial temporal update
2014 11 lucene spatial temporal update2014 11 lucene spatial temporal update
2014 11 lucene spatial temporal updateDavid Smiley
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big featuresDavid Smiley
 

Plus de David Smiley (7)

2020-02 Solr's New Plugin System
2020-02 Solr's New Plugin System2020-02 Solr's New Plugin System
2020-02 Solr's New Plugin System
 
H-Hypermap Heatmap Analytics at Scale
H-Hypermap Heatmap Analytics at ScaleH-Hypermap Heatmap Analytics at Scale
H-Hypermap Heatmap Analytics at Scale
 
2016-01 Lucene Solr spatial in 2015, NYC Meetup
2016-01 Lucene Solr spatial in 2015, NYC Meetup2016-01 Lucene Solr spatial in 2015, NYC Meetup
2016-01 Lucene Solr spatial in 2015, NYC Meetup
 
Lucene/Solr spatial in 2015
Lucene/Solr spatial in 2015Lucene/Solr spatial in 2015
Lucene/Solr spatial in 2015
 
2014 11 lucene spatial temporal update
2014 11 lucene spatial temporal update2014 11 lucene spatial temporal update
2014 11 lucene spatial temporal update
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Lucene 4 spatial
Lucene 4 spatialLucene 4 spatial
Lucene 4 spatial
 

Dernier

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 

Dernier (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

Populate your Search index, NEST 2016-01

  • 1. Populating your Search Index NEST Meetup, 2016-01
  • 2. 5 Presentations  Indexing Considerations, Pipelines, and Apache NiFi  A Proposal for a Document Pipeline  How we do it at TIAA-CREF with Solr  How we do it at DRG with Solr  Logstash and Beats with ElasticSearch
  • 3. Indexing Considerations Indexing considerations to think about when building out a search platform
  • 4. What do I mean?  How do you plan to get data into the index (Solr/ES/…)?  Backups?  Schedule & Monitor?  Realtime search requirements?  What software? (pipelines, crawlers, …)
  • 5. Crawling?  Common in the “enterprise search” space  What crawler will you use?  Nutch is well-known but too complex for smaller scale jobs  Many more exist.  Security access control metadata to federate?  Try ManifoldCF which excels at this.
  • 6. Bulk indexing  Plan for a “bulk reindex” use-case  When changing schemas / ingestion extraction rules  Or recovering when there’s no backup  Not having a backup is typical; esp. if re-indexing is fast  Optimize settings for this to be fast  May need to toggle after ingestion into “normal” settings  Use multiple machines during indexing (e.g. via hadoop)?  “Optimize” (merge) Lucene segments at the end?
  • 7. Incremental indexing  (adding new/updated content)  Detect deletes how?  A: Flag for removal upstream before eventually removing  B: Track all IDs somewhere; find the ones that went missing  Maybe don’t need to synchronize deletes until off-hours?  Realtime Indexing, separate?
  • 8. Backups (DR: Disaster Recovery)  Scenario:  Admin accidentally deleted 30k random docs; oh %#?!  Not solved by replication/redundancy  Useful in other scenarios, like testing  Might not need it; especially if bulk re-indexing is fast  Take Snapshots (e.g. AWS, or via the search system, or…)  Recovery: Deploy snapshot then sync it back up to date.  Solr: see BloomReach’s “HAFT” project
  • 9. Document Transformations  Mapping source data (e.g. HTML doc or database record) to a search document  Examples:  Text from PDF extraction  Enrichment (e.g. Named Entity Recognition)  Text pre-processing before search platform gets it  Merging multiple data sources; joining  Home-grown or use an existing ETL / “pipeline”?  Do some of this directly on the search platform?
  • 10. Schedule, Monitor  How will a bulk index be triggered? Incremental index?  Unix Cron? Basic but crude.  A Web UI to control this is great.  A CI server (e.g. Jenkins) can work! (web, logs, alerting)  Monitor/alert for problems?  Perhaps via general log monitoring (e.g. ELK)
  • 11. Open-Source ETL Software A summary of an investigation I did on open-source options in 2013.
  • 12. ETL Software  Extract Transform Load – a general idea  Software that calls itself ETL tends to be very similar.  Clover ETL  Pentaho Data Integration, AKA Kettle  Talend Open Studio, Data Integration
  • 13. Common features  Two are GPL/LGPL, Talend is Apache  Fremium model – pay for “enterprise” features  The Good: (in a word, mature)  GUI wire diagram builder  Books / resources  The Bad:  Text-editing the pipeline not recommended: thus need GUI  Poor community  Data model is table-like; no native multi-valued fields
  • 15. Apache NiFi “is an easy to use, powerful, and reliable system to process and distribute data.”
  • 16.
  • 17. Apache Nifi overview  Web-based UI  Runtime modification of flow control  Data provenance features  Extensible (of course)  Security, role based access control

Notes de l'éditeur

  1. New England Search Technologies, Meetup Group http://www.meetup.com/New-England-Search-Technologies-NEST-Group/events/227860780/
  2. Recommend Talend, then Pentaho, then Clover in that order. But probably none of them for most search projects.