2. 5 Presentations
Indexing Considerations, Pipelines, and Apache NiFi
A Proposal for a Document Pipeline
How we do it at TIAA-CREF with Solr
How we do it at DRG with Solr
Logstash and Beats with ElasticSearch
4. What do I mean?
How do you plan to get data into the index (Solr/ES/…)?
Backups?
Schedule & Monitor?
Realtime search requirements?
What software? (pipelines, crawlers, …)
5. Crawling?
Common in the “enterprise search” space
What crawler will you use?
Nutch is well-known but too complex for smaller-scale jobs
Many more exist.
Need to federate security access-control metadata?
Try Apache ManifoldCF, which excels at this.
6. Bulk indexing
Plan for a “bulk reindex” use-case
When changing schemas / ingestion extraction rules
Or recovering when there’s no backup
Not having a backup is typical, especially if re-indexing is fast
Optimize settings so this is fast
May need to toggle back to “normal” settings after ingestion
Use multiple machines during indexing (e.g. via Hadoop)?
“Optimize” (merge) Lucene segments at the end?
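As a sketch of the “toggle settings” idea above, using Elasticsearch-style setting names (the values shown are illustrative defaults, not tuned recommendations; Solr has different knobs for the same goal):

```python
# Settings to apply around a bulk load, then revert afterwards.
# Elasticsearch-style names; values are illustrative, not recommendations.
BULK_SETTINGS = {
    "index.refresh_interval": "-1",   # disable refresh during the load
    "index.number_of_replicas": 0,    # replicate only after loading
}

NORMAL_SETTINGS = {
    "index.refresh_interval": "1s",
    "index.number_of_replicas": 1,
}

def settings_for(phase: str) -> dict:
    """Return the index settings to apply for 'bulk' or 'normal' operation."""
    if phase == "bulk":
        return BULK_SETTINGS
    if phase == "normal":
        return NORMAL_SETTINGS
    raise ValueError(f"unknown phase: {phase}")
```

The point is that the toggle is explicit and scripted, so the “revert to normal” step cannot be forgotten at the end of an ingestion run.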
7. Incremental indexing
(adding new/updated content)
Detect deletes how?
A: Flag for removal upstream before eventually removing
B: Track all IDs somewhere; find the ones that went missing
Maybe don’t need to synchronize deletes until off-hours?
Handle realtime indexing separately?
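Option B above amounts to a set difference between the IDs recorded on the previous run and the IDs currently visible in the source. A minimal sketch (the `index.delete` call in the usage note is hypothetical):

```python
def find_deleted_ids(previous_ids, current_source_ids):
    """Option B: anything seen on the last run that is no longer
    present in the source should be deleted from the index."""
    return set(previous_ids) - set(current_source_ids)

# Hypothetical usage:
#   to_delete = find_deleted_ids(ids_from_last_run, ids_in_source_now)
#   for doc_id in to_delete:
#       index.delete(doc_id)   # e.g. Solr delete-by-id, or an ES DELETE
```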
8. Backups (DR: Disaster Recovery)
Scenario:
Admin accidentally deleted 30k random docs; oh %#?!
Not solved by replication/redundancy
Useful in other scenarios, like testing
Might not need it; especially if bulk re-indexing is fast
Take snapshots (e.g. AWS, or via the search system, or…)
Recovery: Deploy snapshot then sync it back up to date.
Solr: see BloomReach’s “HAFT” project
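A minimal sketch of the “deploy snapshot, then sync it back up to date” recovery step, assuming each source update carries a modification timestamp (the `modified` field name is an assumption):

```python
from datetime import datetime

def updates_to_replay(all_updates, snapshot_time):
    """After restoring a snapshot, re-apply only the source updates
    newer than the snapshot to bring the index back up to date."""
    return [u for u in all_updates if u["modified"] > snapshot_time]
```

This is the same machinery an incremental index already needs, which is why a fast incremental path makes snapshot-based recovery cheap.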
9. Document Transformations
Mapping source data (e.g. an HTML doc or database record) to a search document
Examples:
Text from PDF extraction
Enrichment (e.g. Named Entity Recognition)
Text pre-processing before search platform gets it
Merging multiple data sources; joining
Home-grown, or an existing ETL / “pipeline” tool?
Do some of this directly on the search platform?
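To make the mapping idea concrete, here is a toy transformation from a database-style record to a flat search document; the field names (`id`, `title`, `body_html`, `tags`) and the crude tag-stripping regex are illustrative assumptions, not a recommended extraction strategy:

```python
import html
import re

def to_search_doc(record):
    """Sketch: map a source record to a flat search document.
    Field names are hypothetical; real extraction would use a proper
    HTML parser or a tool like Apache Tika."""
    text = re.sub(r"<[^>]+>", " ", record["body_html"])  # crude tag strip
    text = html.unescape(text)                           # decode entities
    return {
        "id": record["id"],
        "title": record["title"].strip(),
        "text": " ".join(text.split()),                  # normalize whitespace
        "tags": sorted(set(record.get("tags", []))),     # dedupe multi-valued field
    }
```

Note the multi-valued `tags` field: this is exactly the shape that table-oriented ETL tools (discussed below) struggle to model natively.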
10. Schedule, Monitor
How will a bulk index be triggered? An incremental index?
Unix Cron? Basic but crude.
A Web UI to control this is great.
A CI server (e.g. Jenkins) can work! (web, logs, alerting)
Monitor/alert for problems?
Perhaps via general log monitoring (e.g. ELK)
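The kind of rule a general log monitor would run can be sketched in a few lines; the `"ERROR"` marker and the threshold are assumptions about what the pipeline logs look like:

```python
def indexing_alerts(log_lines, threshold=1):
    """Sketch of a log-monitoring rule: collect error lines from the
    ingestion pipeline's logs and alert once a threshold is reached."""
    errors = [line for line in log_lines if "ERROR" in line]
    return errors if len(errors) >= threshold else []
```

In practice this logic lives in the monitoring stack (e.g. an ELK alert or a CI job that fails the build), not in the pipeline itself.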
12. ETL Software
Extract Transform Load – a general idea
Software that calls itself ETL tends to be very similar.
Clover ETL
Pentaho Data Integration, AKA Kettle
Talend Open Studio, Data Integration
13. Common features
Two are GPL/LGPL-licensed; Talend is Apache-licensed
Freemium model: pay for “enterprise” features
The Good: (in a word, mature)
GUI wire diagram builder
Books / resources
The Bad:
Text-editing the pipeline is not recommended, so the GUI is required
Poor community
Data model is table-like; no native multi-valued fields
15. Apache NiFi
“is an easy to use, powerful, and reliable system to process and distribute data.”
17. Apache NiFi overview
Web-based UI
Runtime modification of flow control
Data provenance features
Extensible (of course)
Security, role based access control
Editor's notes
New England Search Technologies, Meetup Group
http://www.meetup.com/New-England-Search-Technologies-NEST-Group/events/227860780/
Recommend Talend, then Pentaho, then Clover in that order. But probably none of them for most search projects.