Lucene InputFormat and more!
Lookups on HDFS
SequenceFile is great for fast sequential access, but how do you
do lookups?
MapFile, BloomMapFile, HBase, Cassandra, et al. each
provide a single primary-key index over the data. But what if
you want to index all of your fields (or at least many of them)?
What if you want search?
Lucene to the rescue!
Lucene is, among many things, a file format.
The stored fields file (fdt) has fast sequential access, so it
acts as our “sequence file” of key/values. In addition to this,
you get the power of the inverted index and the search
capabilities of Lucene.
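To make the "sequence file plus inverted index" idea concrete, here is a minimal sketch in plain Java. It is not Lucene's file format or API; the class TinyIndex and its methods are hypothetical names chosen for illustration. Stored documents sit in an append-only list that can be scanned in order, and a map from "field:term" to doc ids plays the role of the inverted index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; Lucene's real on-disk structures are far more involved.
class TinyIndex {
    // "Stored fields": documents kept in insertion order, addressed by doc id.
    private final List<Map<String, String>> stored = new ArrayList<>();
    // "Inverted index": "field:term" -> ascending list of doc ids containing that term.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Adds a document and returns its doc id.
    int add(Map<String, String> doc) {
        int docId = stored.size();
        stored.add(new LinkedHashMap<>(doc));
        for (Map.Entry<String, String> field : doc.entrySet()) {
            // Naive analysis: lowercase, then split on anything that is not a-z or 0-9.
            for (String term : field.getValue().toLowerCase().split("[^a-z0-9]+")) {
                if (term.isEmpty()) continue;
                String key = field.getKey() + ":" + term;
                List<Integer> ids = postings.computeIfAbsent(key, k -> new ArrayList<>());
                // Record each doc id once, even if the term repeats within the document.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
            }
        }
        return docId;
    }

    // Sequential access: every stored document, in doc id order.
    List<Map<String, String>> scan() {
        return Collections.unmodifiableList(stored);
    }

    // Lookup: doc ids matching a single "field:term" query.
    List<Integer> search(String fieldAndTerm) {
        return postings.getOrDefault(fieldAndTerm, Collections.emptyList());
    }
}
```

With this shape, scan() corresponds to reading the stored fields front to back, while search("body:anarchy") answers a lookup without touching documents that do not match.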
Solr HDFSDirectory
● Start with SOLR-4916 (HDFS support)
● Pull out the Solr-specific bits so it can be used with
vanilla Lucene
● Backport to Hadoop 1.x
Lucene InputFormat
● Glob HDFS for Lucene instance directories
● Read SegmentInfos and create a split per
segment
● Use a MatchAllDocsQuery to quickly iterate
through the doc set
● RecordReader returns docs from
DocIdSetIterator
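The split-planning step above can be sketched in plain Java. This is an assumption-laden illustration, not the talk's implementation: Segment, Split, and plan are hypothetical stand-ins rather than Hadoop's InputSplit or Lucene's SegmentInfos, and the segment list is passed in instead of being read from HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; not Hadoop or Lucene classes.
class SegmentSplitPlanner {
    // Minimal description of one segment found in an index directory.
    static final class Segment {
        final String name;
        final int docCount;

        Segment(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    // One unit of map-side work: a single segment of a single index directory.
    static final class Split {
        final String indexDir;
        final String segmentName;
        final int docCount;

        Split(String indexDir, String segmentName, int docCount) {
            this.indexDir = indexDir;
            this.segmentName = segmentName;
            this.docCount = docCount;
        }
    }

    // Creates exactly one split per segment, preserving segment order.
    static List<Split> plan(String indexDir, List<Segment> segments) {
        List<Split> splits = new ArrayList<>();
        for (Segment segment : segments) {
            splits.add(new Split(indexDir, segment.name, segment.docCount));
        }
        return splits;
    }
}
```

A record reader for such a split would then open only its own segment and walk the matching doc ids in order.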
Lucene InputFormat cont.
● Gives back a Document with the stored fields
● The time spent searching is negligible
compared to iterating through docs
● Think of it as a key/value storage format plus
an efficient inverted index
Adding a query
Add a simple TermQuery like “key:value” and
specify which fields to return
LIF.setLuceneQuery(job, "body:anarchy");
LIF.setLuceneFields(job, "title", "body");
More complex queries?
Use JavaScript to dynamically set more
complicated queries
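// Lucene's TermQuery takes a Term (field, text); classes assumed to be in scope
var clause1 = new TermQuery(new Term("body", "anarchy"));
var clause2 = new TermQuery(new Term("title", "revolution"));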
var query = new BooleanQuery();
query.add(clause1, BooleanClause.Occur.MUST);
query.add(clause2, BooleanClause.Occur.MUST);
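Conceptually, two MUST clauses match only the documents that appear in both terms' posting lists. A minimal sketch of that intersection in plain Java follows; the class and method names are hypothetical, and Lucene's own scorers are considerably more elaborate than this two-pointer merge.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of MUST + MUST semantics over sorted doc id lists.
class MustIntersection {
    // Intersects two ascending doc id arrays by advancing whichever cursor is behind.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]); // present in both lists, so both clauses are satisfied
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

Because both inputs are sorted, the merge visits each posting at most once, which is part of why the search step stays cheap relative to reading stored documents.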
Adding Pig LoadFunc
X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'
USING DefaultLuceneLoadFunc('body:anarchy')
AS (title:chararray, date:long, body:chararray);
Y = FOREACH X GENERATE title, date;
(Anarchism,1355654644000)
(Abraham Lincoln,1357087785000)
(Art,1357159249000)
(Anarcho-capitalism,1356671677000)
Demo!
Adding some schema
● The schema is hard-coded in the previous examples
● The InputFormat gives back a Lucene Document
● Use Avro to reflect a schema onto the
Lucene docs when reading/writing
● Similarly, use Avro to reflect a Pig schema
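The idea of reflecting a schema onto stored fields can be sketched without Avro. The example below deliberately swaps in plain java.lang.reflect in place of Avro's reflection support, so it only illustrates the mapping step; ReflectiveMapper, Article, and fromStoredFields are hypothetical names, and the field list mirrors the title/date/body columns used in the Pig example.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;

// Hypothetical illustration using plain Java reflection, not Avro.
class ReflectiveMapper {
    // Stand-in for a user-supplied record class that defines the schema.
    static class Article {
        String title;
        long date;
        String body;
    }

    // Copies stored string values onto a new instance, converting by declared field type.
    static <T> T fromStoredFields(Map<String, String> stored, Class<T> type) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T target = ctor.newInstance();
            for (Field f : type.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                String raw = stored.get(f.getName());
                if (raw == null) continue; // document lacks this field; keep the Java default
                f.setAccessible(true);
                if (f.getType() == long.class) {
                    f.setLong(target, Long.parseLong(raw));
                } else if (f.getType() == int.class) {
                    f.setInt(target, Integer.parseInt(raw));
                } else if (f.getType() == String.class) {
                    f.set(target, raw);
                } else {
                    throw new IllegalArgumentException("unsupported field type: " + f.getType());
                }
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot map stored fields onto " + type, e);
        }
    }
}
```

The class itself acts as the schema here: field names select which stored fields to read, and declared types decide how each stored string is converted.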
Avro-ified InputFormat and LoadFunc
X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'
USING AvroLuceneLoadFunc(
'com.lucid.MyAvroClass',
'body:anarchy'
);
Y = FOREACH X GENERATE title, date;
That’s it!
David Arthur
http://mumrah.github.io/
Bonus Slide - Kafka 0.8
Kafka 0.8.0 was released last week!
Now with 100% more logo: [Apache Kafka logo]

Lucene InputFormat (lightning talk) - TriHUG December 10, 2013
