Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited
How does Lucene
store your data?
Adrien Grand
@jpountz
Apache Lucene/Solr committer
Software engineer @ Elasticsearch
Outline
●Segments
●What does a segment store?
●Improvements since Lucene 4.0
Segments
Segments
●Every segment is a fully
functional index
●High numbers of
segments trigger merges
●Merge: Copy all live data
from several segments
into a new one
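A toy sketch of what a merge does. It assumes each segment is modeled as a list of documents plus a parallel list of live flags; these structures and the name `merge_segments` are hypothetical and are not Lucene's API.

```python
def merge_segments(segments):
    """Copy all live documents from several segments into a new segment.

    Each segment is a hypothetical (docs, live) pair: `docs` is a list of
    documents and `live[i]` is False when docs[i] has been deleted.
    """
    merged_docs = []
    for docs, live in segments:
        for doc, is_live in zip(docs, live):
            if is_live:  # deleted documents are dropped during the merge
                merged_docs.append(doc)
    # The new segment has no deletions; doc IDs are reassigned 0..n-1.
    return merged_docs, [True] * len(merged_docs)


seg_a = (["a0", "a1", "a2"], [True, False, True])
seg_b = (["b0", "b1"], [True, True])
merged = merge_segments([seg_a, seg_b])
```

The source segments are never modified here, which mirrors the idea that a merge writes a new segment rather than updating existing ones in place.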
Segments
●Immutable (up to deletes)
● SSD-friendly (no write amplification)
● great for caches (including the FS cache)
● easy incremental backups
●Merged together when there are too many of them
● Expunges deleted documents
●An IndexReader is a point-in-time view over a fixed
number of segments
● Need to reopen to see changes
What does a
segment store?
What is in a segment?
                       Stores                                        Useful for
Segment & Field infos  Metadata                                      Getting doc count / index options
Live docs              Non-deleted docs                              Excluding deleted docs from results
Inverted index         The mapping from terms to docs and positions  Finding matching docs
Norms                  Index-time boosts                             Scoring
Doc values             Any number or (small) bytes                   Sorting, faceting, custom scoring
Stored fields          The original doc                              Result summaries
Term vectors           Single doc inverted index                     Highlighting, MoreLikeThis
What is in a segment?
                API
Field infos     AtomicReader.getFieldInfos()
Live docs       AtomicReader.getLiveDocs()
Inverted index  AtomicReader.fields()
Norms           AtomicReader.getNormValues(String field)
Doc values      AtomicReader.get*Values(String field)
Stored fields   AtomicReader.document(int docID, StoredFieldVisitor visitor)
Term vectors    AtomicReader.getTermVectors(int docID)
Doc IDs
●Lucene gives sequential doc IDs to all documents in a
segment, from 0 (inclusive) to AtomicReader.maxDoc()
(exclusive)
●Uniquely identifies documents inside a segment
● i.e. if the inverted index API says that document 42
matches the term "bbuzz", I can query the stored
fields API with the same ID
●Allows for efficient storage
● doc IDs can be used as ordinals
● Small & dense ints are easy to compress
Detour: bit packing
●Efficient technique to store blocks of small ints
● Supports random access
● Special case: bits per value = 1 is a bit set
●Say you want to store
● 5 30 1 1 10 12
● Raw data: 6 * 32 = 192 bits
● Packed: 6 * 5 = 30 bits (84% size reduction!)
00000000000000000000000000000101 = 5
00000000000000000000000000011110 = 30
00000000000000000000000000000001 = 1
00000000000000000000000000000001 = 1
00000000000000000000000000001010 = 10
00000000000000000000000000001100 = 12
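A minimal sketch of bit packing with random access, using the six values from this slide. The helper names `pack` and `get` are made up for illustration, and a single Python integer stands in for the packed block; this is not how Lucene's packed ints are implemented.

```python
def pack(values, bits_per_value):
    """Pack non-negative ints into one integer, bits_per_value bits each."""
    packed = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits_per_value):
            raise ValueError(f"{v} does not fit in {bits_per_value} bits")
        packed |= v << (i * bits_per_value)
    return packed


def get(packed, bits_per_value, index):
    """Random access: shift to the slot for `index`, then mask it out."""
    mask = (1 << bits_per_value) - 1
    return (packed >> (index * bits_per_value)) & mask


values = [5, 30, 1, 1, 10, 12]
bits = max(values).bit_length()  # bits needed for the largest value (30)
packed = pack(values, bits)
unpacked = [get(packed, bits, i) for i in range(len(values))]
```

With `bits_per_value = 1` the same two functions behave like a bit set.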
Fixed-length data
●Dense doc IDs are great for single-valued fixed-length
data
● Store data sequentially
● Data for doc N is at offset N * dataLength
● Allows for fast and memory-efficient lookups
●Live docs (1 bit per value)
●Norms (1 byte per value)
●Numeric doc values
● Blocks with independent numbers of bits per value
● [4096 values] [4096 values] [4096 values]
● Block idx: docID / 4096
● Idx in block: docID % 4096
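The block addressing above can be sketched as follows. This is an illustration only: the names `build_blocks` and `lookup_value` are hypothetical, values are assumed to be non-negative, and each block keeps its values as a plain list next to its own bits-per-value figure instead of actually packing them.

```python
BLOCK_SIZE = 4096


def build_blocks(values, block_size=BLOCK_SIZE):
    """Split values into blocks, each with its own number of bits per value."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        bits = max(max(chunk).bit_length(), 1)  # sized by the block's largest value
        blocks.append((bits, chunk))
    return blocks


def lookup_value(blocks, doc_id, block_size=BLOCK_SIZE):
    """Find the value for doc_id with one division and one remainder."""
    block_idx, idx_in_block = divmod(doc_id, block_size)
    _bits, chunk = blocks[block_idx]
    return chunk[idx_in_block]


values = list(range(10000))  # one value per doc ID
blocks = build_blocks(values)
bits_per_block = [bits for bits, _ in blocks]
```

Because every block chooses its own width, a few large values only inflate the block they live in.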
Variable-length data
[Figure: an array of end addresses pointing into an array of bytes]
●Binary doc values
●Stored fields
●Term vectors
●Need one level of indirection: store end addresses
● Easy to compress since end addresses are
increasing
● Only store endAddress - (docID+1) * avgLength
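A small sketch of the end-address trick, assuming an integer average length and made-up helper names (`encode_end_addresses`, `value_range`). It only shows the arithmetic; the stored deltas would then be bit-packed, which is left out here.

```python
def encode_end_addresses(lengths):
    """Return (avg_length, deltas) where deltas[i] = end[i] - (i + 1) * avg_length."""
    end_addresses = []
    total = 0
    for n in lengths:
        total += n
        end_addresses.append(total)
    avg_length = total // len(lengths)  # assumption: integer average
    deltas = [end - (doc_id + 1) * avg_length
              for doc_id, end in enumerate(end_addresses)]
    return avg_length, deltas


def value_range(avg_length, deltas, doc_id):
    """Recover the [start, end) byte range of the value for doc_id."""
    end = deltas[doc_id] + (doc_id + 1) * avg_length
    # The start of doc N is the end of doc N - 1 (0 for the first doc).
    start = 0 if doc_id == 0 else deltas[doc_id - 1] + doc_id * avg_length
    return start, end


lengths = [10, 12, 9, 11]  # byte lengths of four values
avg_length, deltas = encode_end_addresses(lengths)
```

The deltas stay close to zero when lengths are similar, so they need far fewer bits than the raw, ever-growing end addresses. They can be negative, so a real encoding would need to handle signs.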
String data
●Terms index
●Sorted (Set) doc values
●MemoryPostingsFormat
●Suggesters
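●FST: automaton with weighted arcs
○ compact thanks to shared prefixes/suffixes
●Stack = 1
●Star = 2
●Stop = 3
●Top = 4
[Figure: FST over these four terms, arcs labeled char/weight: s/1, t, a, c, k for "stack"; r/1 after "sta" for "star"; o/2, p after "st" for "stop"; t/4, o, p for "top"]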
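A hand-built sketch of the automaton for the four example terms, to show how a term's value is the sum of the weights on its path. The state numbering, the `ARCS` table and `fst_lookup` are assumptions made for illustration; a real FST is built and minimized automatically, and this is not Lucene's FST API.

```python
FINAL = "F"

# state -> {char: (weight, next_state)}; arcs without a weight in the figure carry 0.
ARCS = {
    0: {"s": (1, 1), "t": (4, 5)},
    1: {"t": (0, 2)},
    2: {"a": (0, 3), "o": (2, 6)},
    3: {"c": (0, 4), "r": (1, FINAL)},
    4: {"k": (0, FINAL)},
    5: {"o": (0, 6)},
    6: {"p": (0, FINAL)},  # shared by "stop" and "top"
}


def fst_lookup(term):
    """Follow the arcs for term, summing weights; None if the term is absent."""
    state, total = 0, 0
    for ch in term:
        arcs = ARCS.get(state, {})
        if ch not in arcs:
            return None
        weight, state = arcs[ch]
        total += weight
    return total if state == FINAL else None


results = {term: fst_lookup(term) for term in ["stack", "star", "stop", "top"]}
```

The prefix "st" is stored once for three terms and the final "p" arc once for two, which is where the compactness comes from.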
Inverted index
●Terms index: map a term prefix to a block in the dict
○ FST
●Terms dictionary: statistics + pointer in postings lists
●Postings lists: encodes matching docs in sorted order
○ + positions + offsets
Original data                             1 2 4 11 42 43           (6 * 4 = 24 bytes)
Split into blocks of 3 (128 in practice)  1 2 4 | 11 42 43
Delta-encode                              1 1 2 | 11 31 1
Pack values (bits per value, then values) 2 [1 1 2] | 5 [11 31 1]  (1+1+1+2 = 5 bytes)
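The steps above can be sketched as follows, under a few assumptions: each block restarts its deltas from its first doc ID (as in the example), each block costs one header byte for its bits per value, and the packed payload is rounded up to whole bytes. The function names are hypothetical and this is not Lucene's postings format.

```python
def encode_postings(doc_ids, block_size=3):
    """Split sorted doc IDs into blocks and delta-encode each block."""
    blocks = []
    for start in range(0, len(doc_ids), block_size):
        block = doc_ids[start:start + block_size]
        # First entry is kept as is; the rest are gaps to the previous doc ID.
        deltas = [block[0]] + [b - a for a, b in zip(block, block[1:])]
        bits = max(max(deltas).bit_length(), 1)
        blocks.append((bits, deltas))
    return blocks


def encoded_size_bytes(blocks):
    """One header byte per block plus the packed payload, rounded up to bytes."""
    return sum(1 + (bits * len(deltas) + 7) // 8 for bits, deltas in blocks)


def decode_postings(blocks):
    """Rebuild the doc IDs by summing the deltas inside each block."""
    doc_ids = []
    for _bits, deltas in blocks:
        current = 0
        for d in deltas:
            current += d
            doc_ids.append(current)
    return doc_ids


postings = [1, 2, 4, 11, 42, 43]
blocks = encode_postings(postings)
size = encoded_size_bytes(blocks)
```

Delta-encoding helps because gaps between sorted doc IDs are much smaller than the doc IDs themselves, so each block needs fewer bits per value.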
Improvements since
Lucene 4.0
Improvements since Lucene 4.0
●LUCENE-4399 (4.1): no seek on write
●LUCENE-4498 (4.1): terms "pulsed" when freq=1
●Compression:
● LUCENE-3892 (4.1): postings encoding moved from
vInt to packed ints: smaller & faster!
● LUCENE-4226 (4.1): compressed stored fields
● LUCENE-4599 (4.2): compressed term vectors
● LUCENE-4547 (4.2): better doc values:
○ blocks of packed ints for numbers
○ compression of addresses for binary
○ FST for Sorted (Set)
● LUCENE-4936 (4.4): compression for date DV
Performance
●http://people.apache.org/~mikemccand/lucenebench/Term.html
Detour: LZ4
●Super simple, blazing fast compression codec
●http://code.google.com/p/lz4/
●https://github.com/jpountz/lz4-java
●Example
● L: literals
● R: reference = (offset decrement, length)
● 1 2 3 6 7 6 7 6 7 6 7 8 9 1 2 3 6 7 10
● L 1 2 3 6 7 R(2,6) L 8 9 R(13,5) L 10
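A sketch of how the example sequence above is decoded. It models literals and references as Python tuples (an assumption for illustration) and ignores the real LZ4 byte-level format; `decode` is a made-up helper, not part of lz4-java.

```python
def decode(tokens):
    """Expand literal runs and back-references into the original sequence."""
    out = []
    for token in tokens:
        if token[0] == "L":
            out.extend(token[1])  # literals are copied as is
        else:
            _, offset, length = token
            start = len(out) - offset  # go back `offset` items
            # Copy one item at a time so a reference may overlap its own output.
            for i in range(length):
                out.append(out[start + i])
    return out


tokens = [
    ("L", [1, 2, 3, 6, 7]),
    ("R", 2, 6),
    ("L", [8, 9]),
    ("R", 13, 5),
    ("L", [10]),
]
decoded = decode(tokens)
```

Note that `R(2,6)` copies six items from only two items back: the copy reads values it has just written, which is how a short repeating pattern is encoded compactly.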
Detour: LZ4
●https://github.com/ning/jvm-compressor-benchmark
Twitter benchmark
●Quick benchmark on a Twitter corpus
● 160908 tweets
● WhitespaceAnalyzer
Field       Type    Indexed  Stored  Doc values  Term vectors
id          long    yes      yes     -           -
created_at  long    -        yes     numeric     -
user.name   string  yes      yes     sorted      -
text        text    yes      yes     -           yes
Twitter benchmark
                Lucene 4.0  Lucene 4.4 (not released yet)  Difference
Inverted index  23.3M       20.5M                          -12%
Norms           157K        157K                           +0%
Doc values      3.4M        3.1M                           -9%
Stored fields   21.2M       15.7M                          -26%
Term vectors    23.5M       15.5M                          -34%
Overall         ~71.5M      ~55.0M                         -23%
Questions?

Berlin Buzzwords 2013 - How does Lucene store your data?

  • 1. How does Lucene store your data?
      Adrien Grand (@jpountz), Apache Lucene/Solr committer, software engineer @ Elasticsearch
      Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited
  • 2. Outline
      ● Segments
      ● What does a segment store?
      ● Improvements since Lucene 4.0
  • 3. Segments
  • 4. Segments
      ● Every segment is a fully functional index
      ● High numbers of segments trigger merges
      ● Merge: copy all live data from several segments into a new one
  • 5. Segments
      ● Immutable (up to deletes)
          ○ SSD-friendly (no write amplification)
          ○ Great for caches (including the FS cache)
          ○ Easy incremental backups
      ● Merged together when there are too many of them
          ○ Merging expunges deleted documents
      ● An IndexReader is a point-in-time view over a fixed number of segments
          ○ Need to reopen to see changes
  • 6. What does a segment store?
  • 7. What is in a segment?
                            | Stores                                        | Useful for
      Segment & Field infos | Metadata                                      | Getting doc count / index options
      Live docs             | Non-deleted docs                              | Excluding deleted docs from results
      Inverted index        | The mapping from terms to docs and positions  | Finding matching docs
      Norms                 | Index-time boosts                             | Scoring
      Doc values            | Any number or (small) bytes                   | Sorting, faceting, custom scoring
      Stored fields         | The original doc                              | Result summaries
      Term vectors          | Single doc inverted index                     | Highlighting, MoreLikeThis
  • 8. What is in a segment?
                     | API
      Field infos    | AtomicReader.getFieldInfos()
      Live docs      | AtomicReader.getLiveDocs()
      Inverted index | AtomicReader.fields()
      Norms          | AtomicReader.getNormValues(String field)
      Doc values     | AtomicReader.get*Values(String field)
      Stored fields  | AtomicReader.document(int docID, StoredFieldVisitor visitor)
      Term vectors   | AtomicReader.getTermVectors(int docID)
  • 9. Doc IDs
      ● Lucene gives sequential doc IDs to all documents in a segment, from 0 (inclusive) to AtomicReader.maxDoc() (exclusive)
      ● A doc ID uniquely identifies a document inside a segment
          ○ i.e. if the inverted index API says that document 42 matches the term "bbuzz", I can query the stored fields API with the same ID
      ● Allows for efficient storage
          ○ Doc IDs can be used as ordinals
          ○ Small & dense ints are easy to compress
  • 10. Detour: bit packing
      ● Efficient technique to store blocks of small ints
          ○ Supports random access
          ○ Special case: bits per value = 1 is a bit set
      ● Say you want to store 5 30 1 1 10 12
          ○ Raw data: 6 * 32 = 192 bits
          ○ Packed: 6 * 5 = 30 bits (84% size reduction!)
      00000000000000000000000000000101 = 5
      00000000000000000000000000011110 = 30
      00000000000000000000000000000001 = 1
      00000000000000000000000000000001 = 1
      00000000000000000000000000001010 = 10
      00000000000000000000000000001100 = 12
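The bit-packing idea above can be sketched in a few lines of Java. This is a minimal illustration, not Lucene's PackedInts implementation: the class name `BitPackingSketch` and its methods are hypothetical, and the layout (values stored back to back in 64-bit words, lowest bits first) is an assumption chosen for simplicity.

```java
/** Minimal bit-packing sketch (hypothetical class, not Lucene's PackedInts). */
final class BitPackingSketch {

    /** Smallest number of bits able to represent every value (at least 1). */
    static int bitsRequired(int[] values) {
        int max = 0;
        for (int v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("negative value: " + v);
            }
            max = Math.max(max, v);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    /** Packs the values back to back into 64-bit words, lowest bits first. */
    static long[] pack(int[] values, int bitsPerValue) {
        long[] words = new long[(values.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            words[word] |= ((long) values[i]) << shift;
            if (shift + bitsPerValue > 64) {
                // The value straddles two words: its high bits go to the next word.
                words[word + 1] |= ((long) values[i]) >>> (64 - shift);
            }
        }
        return words;
    }

    /** Random access: reads one value without decoding its neighbours. */
    static int get(long[] words, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long bits = words[word] >>> shift;
        if (shift + bitsPerValue > 64) {
            bits |= words[word + 1] << (64 - shift);
        }
        return (int) (bits & ((1L << bitsPerValue) - 1));
    }

    public static void main(String[] args) {
        int[] values = {5, 30, 1, 1, 10, 12};
        int bits = bitsRequired(values);
        long[] packed = pack(values, bits);
        System.out.println(bits + " bits per value, " + values.length * bits + " bits in total");
        for (int i = 0; i < values.length; i++) {
            System.out.print(get(packed, bits, i) + " ");
        }
        System.out.println();
    }
}
```

For the six values on the slide, the largest value (30) needs 5 bits, so the whole block fits in 30 bits of a single 64-bit word.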
  • 11. Fixed-length data
      ● Dense doc IDs are great for single-valued fixed-length data
          ○ Store data sequentially
          ○ Data for doc N is at offset N * dataLength
          ○ Allows for fast and memory-efficient lookups
      ● Live docs (1 bit per value)
      ● Norms (1 byte per value)
      ● Numeric doc values
          ○ Blocks with independent numbers of bits per value
          ○ Block idx = docID / 4096
          ○ Idx in block = docID % 4096
      [Diagram: consecutive blocks of 4096 values each]
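The address arithmetic for fixed-length data is simple enough to write out directly. The sketch below is illustrative only: `FixedLengthLookupSketch` and its method names are hypothetical, and live docs are assumed to be held in a plain `long[]` bit set and norms in a plain `byte[]`.

```java
/** Address arithmetic for fixed-length per-document data (hypothetical helper names). */
final class FixedLengthLookupSketch {

    /** Block size used for numeric doc values on the slide. */
    static final int BLOCK_SIZE = 4096;

    /** Data for doc N starts at offset N * dataLength. */
    static long offset(int docID, int dataLength) {
        return (long) docID * dataLength;
    }

    /** Live docs use 1 bit per doc: doc N is bit N % 64 of word N / 64. */
    static boolean isLive(long[] liveDocs, int docID) {
        return (liveDocs[docID >>> 6] & (1L << (docID & 63))) != 0;
    }

    /** Norms use 1 byte per doc, so the doc ID is directly the array index. */
    static byte norm(byte[] norms, int docID) {
        return norms[docID];
    }

    /** Which block of 4096 values holds this doc. */
    static int blockIndex(int docID) {
        return docID / BLOCK_SIZE;
    }

    /** Position of this doc inside its block. */
    static int indexInBlock(int docID) {
        return docID % BLOCK_SIZE;
    }
}
```

Because every lookup is a multiplication, a division, or a shift, no per-document pointer needs to be stored.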
  • 12. Variable-length data
      ● Binary doc values
      ● Stored fields
      ● Term vectors
      ● Need one level of indirection: store end addresses
          ○ Easy to compress since end addresses are increasing
          ○ Only store endAddress - (docID+1) * avgLength
      [Diagram: an "end addresses" array pointing into a "bytes" array]
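A small sketch of the end-address trick, under stated assumptions: the class `EndAddressSketch` is hypothetical, the average length is taken as the integer quotient of the last end address by the number of documents, and the resulting deviations are kept in a plain `long[]` rather than being bit-packed. Deviations can be negative in general, so a real encoder would need a signed or zig-zag representation before packing.

```java
/** Stores end addresses as deviations from (docID + 1) * avgLength (hypothetical class). */
final class EndAddressSketch {

    /** Integer average document length, derived from the last end address. */
    static long averageLength(long[] endAddresses) {
        int n = endAddresses.length;
        return n == 0 ? 0 : endAddresses[n - 1] / n;
    }

    /** Replaces each end address by its distance from the expected address. */
    static long[] encode(long[] endAddresses, long avgLength) {
        long[] deltas = new long[endAddresses.length];
        for (int docID = 0; docID < endAddresses.length; docID++) {
            deltas[docID] = endAddresses[docID] - (docID + 1L) * avgLength;
        }
        return deltas;
    }

    /** Rebuilds the end address of a document from its stored deviation. */
    static long endAddress(long[] deltas, long avgLength, int docID) {
        return (docID + 1L) * avgLength + deltas[docID];
    }

    /** A document starts where the previous one ends; doc 0 starts at 0. */
    static long startAddress(long[] deltas, long avgLength, int docID) {
        return docID == 0 ? 0 : endAddress(deltas, avgLength, docID - 1);
    }
}
```

For documents of lengths 10, 12, 9 and 11 bytes, the end addresses are 10, 22, 31 and 42, the integer average is 10, and the stored deviations are 0, 2, 1 and 2, which need far fewer bits than the addresses themselves.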
  • 13. String data
      ● Terms index
      ● Sorted (Set) doc values
      ● MemoryPostingsFormat
      ● Suggesters
      ● FST: automaton with weighted arcs
          ○ Compact thanks to shared prefixes/suffixes
      ● Stack = 1
      ● Star = 2
      ● Stop = 3
      ● Top = 4
      [Diagram: FST for these four terms, with arcs labelled s/1, t, a, c, k, r/1, o/2, p, t/4, o]
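To show how weighted arcs yield a value per term, here is a deliberately simplified sketch. It is not Lucene's FST: the class `WeightedArcsSketch` is hypothetical, the automaton is hand-built from the arc labels of the example, and it only shares prefixes (a real FST also shares suffixes and is built by an algorithm rather than by hand). The value of a term is the sum of the weights along its path.

```java
import java.util.HashMap;
import java.util.Map;

/** Hand-built automaton with weighted arcs; shares prefixes only (hypothetical class). */
final class WeightedArcsSketch {

    static final class Node {
        final Map<Character, Node> targets = new HashMap<>();
        final Map<Character, Integer> weights = new HashMap<>();
        boolean accepting;

        /** Adds the arc for this label if missing, and returns its target node. */
        Node arc(char label, int weight) {
            weights.putIfAbsent(label, weight);
            return targets.computeIfAbsent(label, c -> new Node());
        }
    }

    /** Builds the arcs of the example: s/1, r/1, o/2 and t/4 carry weights, the rest weigh 0. */
    static Node buildExample() {
        Node root = new Node();
        Node st = root.arc('s', 1).arc('t', 0);
        st.arc('a', 0).arc('c', 0).arc('k', 0).accepting = true;   // stack
        st.arc('a', 0).arc('r', 1).accepting = true;               // star
        st.arc('o', 2).arc('p', 0).accepting = true;               // stop
        root.arc('t', 4).arc('o', 0).arc('p', 0).accepting = true; // top
        return root;
    }

    /** Sums the arc weights along the term's path; returns -1 if the term is not accepted. */
    static int lookup(Node root, String term) {
        Node node = root;
        int sum = 0;
        for (char c : term.toCharArray()) {
            Node next = node.targets.get(c);
            if (next == null) {
                return -1;
            }
            sum += node.weights.get(c);
            node = next;
        }
        return node.accepting ? sum : -1;
    }
}
```

Looking up "stop" follows s/1, t, o/2, p and returns 1 + 2 = 3, matching the slide.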
  • 14. Inverted index
      ● Terms index: maps a term prefix to a block in the dict
          ○ FST
      ● Terms dictionary: statistics + pointer into postings lists
      ● Postings lists: encode matching docs in sorted order
          ○ + positions + offsets
      Original data:                            1 2 4 11 42 43            (6 * 4 = 24 bytes)
      Split into blocks of 3 (128 in practice): 1 2 4 | 11 42 43
      Delta-encode:                             1 1 2 | 11 31 1
      Pack values (bits per value, then data):  2 [1 1 2] | 5 [11 31 1]   (1+1+1+2 = 5 bytes)
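The postings example can be reproduced with a short sketch. It follows the example's convention that the first value of each block is stored as-is, and it assumes one header byte per block holding the number of bits per value. The class `PostingsBlockSketch` is hypothetical and only computes sizes; it is not Lucene's postings format.

```java
import java.util.ArrayList;
import java.util.List;

/** Block-wise delta encoding of sorted doc IDs (hypothetical class, sizes only). */
final class PostingsBlockSketch {

    /** Splits sorted doc IDs into blocks and delta-encodes each block independently. */
    static List<int[]> deltaEncodeBlocks(int[] sortedDocIDs, int blockSize) {
        List<int[]> blocks = new ArrayList<>();
        for (int from = 0; from < sortedDocIDs.length; from += blockSize) {
            int to = Math.min(from + blockSize, sortedDocIDs.length);
            int[] deltas = new int[to - from];
            // As in the example, the first value of a block is stored as-is.
            deltas[0] = sortedDocIDs[from];
            for (int i = from + 1; i < to; i++) {
                deltas[i - from] = sortedDocIDs[i] - sortedDocIDs[i - 1];
            }
            blocks.add(deltas);
        }
        return blocks;
    }

    /** Bits needed for the largest value of the block (at least 1). */
    static int bitsRequired(int[] values) {
        int max = 0;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    /** Assumed layout: 1 header byte, then the packed values rounded up to whole bytes. */
    static int packedBytes(int[] deltas) {
        int payloadBits = deltas.length * bitsRequired(deltas);
        return 1 + (payloadBits + 7) / 8;
    }

    /** Reverses the delta encoding with a running sum inside each block. */
    static int[] decode(List<int[]> blocks) {
        int total = 0;
        for (int[] block : blocks) {
            total += block.length;
        }
        int[] docIDs = new int[total];
        int pos = 0;
        for (int[] block : blocks) {
            int current = 0;
            for (int delta : block) {
                current += delta;
                docIDs[pos++] = current;
            }
        }
        return docIDs;
    }
}
```

With the doc IDs 1 2 4 11 42 43 and blocks of 3, the first block needs 2 bits per value (1 + 1 bytes) and the second needs 5 bits per value (1 + 2 bytes), 5 bytes in total against 24 bytes for six raw 32-bit ints.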
  • 15. Improvements since Lucene 4.0
  • 16. Improvements since Lucene 4.0
      ● LUCENE-4399 (4.1): no seek on write
      ● LUCENE-4498 (4.1): terms "pulsed" when freq=1
      ● Compression:
          ○ LUCENE-3892 (4.1): postings encoding moved from vInt to packed ints: smaller & faster!
          ○ LUCENE-4226 (4.1): compressed stored fields
          ○ LUCENE-4599 (4.2): compressed term vectors
          ○ LUCENE-4547 (4.2): better doc values:
              ○ blocks of packed ints for numbers
              ○ compression of addresses for binary
              ○ FST for Sorted (Set)
          ○ LUCENE-4936 (4.4): compression for date DV
  • 17. Performance
      ● http://people.apache.org/~mikemccand/lucenebench/Term.html
  • 18. Detour: LZ4
      ● Super simple, blazing fast compression codec
      ● http://code.google.com/p/lz4/
      ● https://github.com/jpountz/lz4-java
      ● Example
          ○ L: literals
          ○ R: reference = (offset decrement, length)
          ○ Input:   1 2 3 6 7 6 7 6 7 6 7 8 9 1 2 3 6 7 10
          ○ Encoded: L 1 2 3 6 7 R(2,6) L 8 9 R(13,5) L 10
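Decoding the literal/reference sequence of the example takes only a few lines. The sketch below works on a list of ints to mirror the slide; it does not implement the real LZ4 byte format, and the class `Lz4StyleDecodeSketch` and its methods are hypothetical. The one subtle point is that a reference may overlap the data it is producing (as R(2,6) does), so values are copied one at a time.

```java
import java.util.ArrayList;
import java.util.List;

/** Decodes literal runs and back-references as in the slide (not the real LZ4 format). */
final class Lz4StyleDecodeSketch {

    /** L: appends the given values verbatim. */
    static void literals(List<Integer> out, int... values) {
        for (int v : values) {
            out.add(v);
        }
    }

    /** R: copies `length` values starting `offset` positions before the end of the output. */
    static void reference(List<Integer> out, int offset, int length) {
        int start = out.size() - offset;
        if (offset <= 0 || start < 0) {
            throw new IllegalArgumentException("invalid offset: " + offset);
        }
        // Copy one value at a time: the source range may overlap what is being written.
        for (int i = 0; i < length; i++) {
            out.add(out.get(start + i));
        }
    }

    /** Replays the encoded sequence from the example. */
    static List<Integer> decodeExample() {
        List<Integer> out = new ArrayList<>();
        literals(out, 1, 2, 3, 6, 7);
        reference(out, 2, 6);
        literals(out, 8, 9);
        reference(out, 13, 5);
        literals(out, 10);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(decodeExample());
    }
}
```

Replaying the five tokens rebuilds the original 19 values: R(2,6) repeats the trailing "6 7" three times, and R(13,5) copies "1 2 3 6 7" from the start of the output.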
  • 19. Detour: LZ4
      ● https://github.com/ning/jvm-compressor-benchmark
  • 20. Twitter benchmark
      ● Quick benchmark on a Twitter corpus
          ○ 160908 tweets
          ○ WhitespaceAnalyzer
                 | Type   | Indexed | Stored | Doc values | Term vectors
      id         | long   | yes     | yes    | -          | -
      created_at | long   | -       | yes    | numeric    | -
      user.name  | string | yes     | yes    | sorted     | -
      text       | text   | yes     | yes    | -          | yes
  • 21. Twitter benchmark
                     | Lucene 4.0 | Lucene 4.4 (not released yet) | Difference
      Inverted index | 23.3M      | 20.5M                         | -12%
      Norms          | 157K       | 157K                          | +0%
      Doc values     | 3.4M       | 3.1M                          | -9%
      Stored fields  | 21.2M      | 15.7M                         | -26%
      Term vectors   | 23.5M      | 15.5M                         | -34%
      Overall        | ~71.5M     | ~55.0M                        | -23%
  • 22. Questions?