2. 5 Presentations
Indexing Considerations, Pipelines, and Apache NiFi
A Proposal for a Document Pipeline
How we do it at TIAA-CREF with Solr
How we do it at DRG with Solr
Logstash and Beats with ElasticSearch
4. What do I mean?
How do you plan to get data into the index (Solr/ES/…)?
Backups?
Schedule & Monitor?
Realtime search requirements?
What software? (pipelines, crawlers, …)
5. Crawling?
Common in the “enterprise search” space
What crawler will you use?
Nutch is well-known but too complex for smaller-scale jobs
Many more exist.
Need to federate security access-control metadata?
Try Apache ManifoldCF, which excels at this.
6. Bulk indexing
Plan for a “bulk reindex” use-case
When changing schemas / ingestion extraction rules
Or recovering when there’s no backup
Not having a backup is typical, especially if re-indexing is fast
Optimize settings so this is fast
May need to toggle back to “normal” settings after ingestion
Use multiple machines during indexing (e.g. via Hadoop)?
“Optimize” (merge) Lucene segments at the end?
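As a sketch of the “toggle settings” idea above, using Elasticsearch-style setting names (the values shown are illustrative defaults, not tuned recommendations; Solr has different knobs for the same goal):

```python
# Settings to apply around a bulk load, then revert afterwards.
# Elasticsearch-style names; values are illustrative, not recommendations.
BULK_SETTINGS = {
    "index.refresh_interval": "-1",   # disable refresh during the load
    "index.number_of_replicas": 0,    # replicate only after loading
}

NORMAL_SETTINGS = {
    "index.refresh_interval": "1s",
    "index.number_of_replicas": 1,
}

def settings_for(phase: str) -> dict:
    """Return the index settings to apply for 'bulk' or 'normal' operation."""
    if phase == "bulk":
        return BULK_SETTINGS
    if phase == "normal":
        return NORMAL_SETTINGS
    raise ValueError(f"unknown phase: {phase}")
```

The point is that the toggle is explicit and scripted, so the “revert to normal” step cannot be forgotten at the end of an ingestion run.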
7. Incremental indexing
(adding new/updated content)
Detect deletes how?
A: Flag for removal upstream before eventually removing
B: Track all IDs somewhere; find the ones that went missing
Maybe don’t need to synchronize deletes until off-hours?
Handle realtime indexing separately?
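Option B above amounts to a set difference between the IDs recorded on the previous run and the IDs currently visible in the source. A minimal sketch (the `index.delete` call in the usage note is hypothetical):

```python
def find_deleted_ids(previous_ids, current_source_ids):
    """Option B: anything seen on the last run that is no longer
    present in the source should be deleted from the index."""
    return set(previous_ids) - set(current_source_ids)

# Hypothetical usage:
#   to_delete = find_deleted_ids(ids_from_last_run, ids_in_source_now)
#   for doc_id in to_delete:
#       index.delete(doc_id)   # e.g. Solr delete-by-id, or an ES DELETE
```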
8. Backups (DR: Disaster Recovery)
Scenario:
Admin accidentally deleted 30k random docs; oh %#?!
Not solved by replication/redundancy
Useful in other scenarios, like testing
Might not need it; especially if bulk re-indexing is fast
Take snapshots (e.g. AWS, or via the search system, or…)
Recovery: Deploy snapshot then sync it back up to date.
Solr: see BloomReach’s “HAFT” project
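A minimal sketch of the “deploy snapshot, then sync it back up to date” recovery step, assuming each source update carries a modification timestamp (the `modified` field name is an assumption):

```python
from datetime import datetime

def updates_to_replay(all_updates, snapshot_time):
    """After restoring a snapshot, re-apply only the source updates
    newer than the snapshot to bring the index back up to date."""
    return [u for u in all_updates if u["modified"] > snapshot_time]
```

This is the same machinery an incremental index already needs, which is why a fast incremental path makes snapshot-based recovery cheap.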
9. Document Transformations
Mapping source data (e.g. an HTML doc or database record) to a search document
Examples:
Text from PDF extraction
Enrichment (e.g. Named Entity Recognition)
Text pre-processing before search platform gets it
Merging multiple data sources; joining
Home-grown, or an existing ETL / “pipeline” tool?
Do some of this directly on the search platform?
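To make the mapping idea concrete, here is a toy transformation from a database-style record to a flat search document; the field names (`id`, `title`, `body_html`, `tags`) and the crude tag-stripping regex are illustrative assumptions, not a recommended extraction strategy:

```python
import html
import re

def to_search_doc(record):
    """Sketch: map a source record to a flat search document.
    Field names are hypothetical; real extraction would use a proper
    HTML parser or a tool like Apache Tika."""
    text = re.sub(r"<[^>]+>", " ", record["body_html"])  # crude tag strip
    text = html.unescape(text)                           # decode entities
    return {
        "id": record["id"],
        "title": record["title"].strip(),
        "text": " ".join(text.split()),                  # normalize whitespace
        "tags": sorted(set(record.get("tags", []))),     # dedupe multi-valued field
    }
```

Note the multi-valued `tags` field: this is exactly the shape that table-oriented ETL tools (discussed below) struggle to model natively.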
10. Schedule, Monitor
How will a bulk index be triggered? An incremental index?
Unix Cron? Basic but crude.
A Web UI to control this is great.
A CI server (e.g. Jenkins) can work! (web, logs, alerting)
Monitor/alert for problems?
Perhaps via general log monitoring (e.g. ELK)
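The kind of rule a general log monitor would run can be sketched in a few lines; the `"ERROR"` marker and the threshold are assumptions about what the pipeline logs look like:

```python
def indexing_alerts(log_lines, threshold=1):
    """Sketch of a log-monitoring rule: collect error lines from the
    ingestion pipeline's logs and alert once a threshold is reached."""
    errors = [line for line in log_lines if "ERROR" in line]
    return errors if len(errors) >= threshold else []
```

In practice this logic lives in the monitoring stack (e.g. an ELK alert or a CI job that fails the build), not in the pipeline itself.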
12. ETL Software
Extract Transform Load – a general idea
Software that calls itself ETL tends to be very similar.
Clover ETL
Pentaho Data Integration, AKA Kettle
Talend Open Studio, Data Integration
13. Common features
Two are GPL/LGPL-licensed; Talend is Apache-licensed
Freemium model: pay for “enterprise” features
The Good: (in a word, mature)
GUI wire diagram builder
Books / resources
The Bad:
Text-editing the pipeline is not recommended, so the GUI is required
Poor community
Data model is table-like; no native multi-valued fields
15. Apache NiFi
“is an easy to use, powerful, and reliable system to process and distribute data.”
17. Apache NiFi overview
Web-based UI
Runtime modification of flow control
Data provenance features
Extensible (of course)
Security, role based access control
Editor's notes
New England Search Technologies, Meetup Group
http://www.meetup.com/New-England-Search-Technologies-NEST-Group/events/227860780/
Recommend Talend, then Pentaho, then Clover in that order. But probably none of them for most search projects.