- Understanding Time Series
- What's the Fundamental Problem
- Prometheus Solution (v1.x)
- New Design of Prometheus (v2.x)
- Data Compression Algorithm
2. MegaEase
Self Introduction
- 20+ years of working experience in large-scale distributed system architecture and development. Familiar with Cloud Native computing and high-concurrency / high-availability architecture solutions.
- Working experience
  - MegaEase – Cloud Native software products, as founder
  - Alibaba – AliCloud, Tmall, as principal software engineer
  - Amazon – Amazon.com, as senior software manager
  - Thomson Reuters – real-time system software development manager
  - IBM Platform – distributed computing systems, as software engineer
Weibo: @左耳朵耗子
Twitter: @haoel
Blog: http://coolshell.cn/
Understanding Time Series Data
- Data scheme
  - identifier -> (t0, v0), (t1, v1), (t2, v2), (t3, v3), ...
- Prometheus data model
  - <metric name>{<label name>=<label value>, ...}
- Typical set of series identifiers
  - {__name__="requests_total", path="/status", method="GET", instance="10.0.0.1:80"} @1434317560938 94355
  - {__name__="requests_total", path="/status", method="POST", instance="10.0.0.3:80"} @1434317561287 94934
  - {__name__="requests_total", path="/", method="GET", instance="10.0.0.2:80"} @1434317562344 96483
- Query
  - __name__="requests_total" selects all series belonging to the requests_total metric
  - method=~"PUT|POST" selects all series whose method is PUT or POST (regex match)
- Anatomy of an entry: the metric name plus labels form the series (the key); the timestamp plus sample value form the sample (the value).
The Fundamental Problem
- Storage problems
  - HDD: physical spinning, slow random seeks
  - SSD: write amplification
- Queries are much more complicated than writes
  - a time series query can trigger many random reads
- Ideal writes
  - sequential writes
  - batched writes
- Ideal reads
  - samples of the same time series should be laid out sequentially
Prometheus Solution (v1.x "V2")
- One file per time series
- Batch up 1 KiB chunks in memory
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series A
└──────────┴─────────┴─────────┴─────────┴─────────┘
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series B
└──────────┴─────────┴─────────┴─────────┴─────────┘
. . .
┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ series XYZ
└──────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
chunk 1 chunk 2 chunk 3 ...
- Dark sides
  - Chunks are held in memory; they can be lost if the application or the node crashes.
  - With several million files, the filesystem runs out of inodes.
  - With several thousand chunks to persist at once, disk I/O becomes a bottleneck.
  - Keeping so many files open for I/O causes very high latency.
  - Cleaning up old data triggers SSD write amplification.
  - Overall, very high CPU/memory/disk consumption.
Series Churn
- Definition
  - some time series become INACTIVE
  - some time series become ACTIVE
- Reasons
  - rolling updates of a number of microservices
  - Kubernetes scaling services up and down
series
^
│ . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . .
│ . . . . .
│ . . . . .
v
<-------------------- time --------------------->
Fundamental Design – V3
- Storage layout
  - 01XXXXXXX... is a data block
    - ULID: like a UUID, but lexicographically sortable and encoding its creation time
  - chunks directory
    - contains the raw chunks of data points for various series (like "V2")
    - no longer a single file per series
  - index: index of the data
    - lots of black magic to find the data by labels
  - meta.json: human-readable metadata
    - the state of the storage and the data it contains
  - tombstones
    - deleted data is recorded in this file instead of being removed from the chunk files
  - wal: Write-Ahead Log
    - WAL segments are truncated into a "checkpoint.X" directory
  - chunks_head: in-memory data
- Notes
  - data is persisted to disk every 2 hours
  - the WAL is used for data recovery
  - 2-hour blocks make range queries efficient
$ tree ./data
./data
├── 01BKGV7JBM69T2G1BGBGM6KB12
│ ├── chunks
│ │ ├── 000001
│ │ ├── 000002
│ │ └── 000003
│ ├── index
│ └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98
│ ├── chunks
│ │ └── 000001
│ ├── index
│ └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
│ ├── chunks
│ │ └── 000001
│ ├── index
│ ├── tombstones
│ └── meta.json
├── chunks_head
│ └── 000001
└── wal
├── 000000003
└── checkpoint.00000002
├── 00000000
└── 00000001
https://github.com/prometheus/prometheus/blob/release-2.25/tsdb/docs/format/README.md
File Format
Blocks – Little Database
- Partition the data into non-overlapping blocks
  - each block acts as a fully independent database
    - containing all time series data for its time window
    - with its own index and set of chunk files
- Every completed block of data is immutable
- Only the current block can have data appended
  - all new data is written to an in-memory database
  - to prevent data loss, a write-ahead log is also written
t0 t1 t2 t3 now
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌────────────┐
│ │ │ │ │ │ │ │ ┌────────────┐
│ block │ │ block │ │ block │ │ chunk_head │ <─── write ────┤ Prometheus │
│ │ │ │ │ │ │ │ └────────────┘
└───────────┘ └───────────┘ └───────────┘ └────────────┘ ^
└──────────────┴───────┬──────┴──────────────┘ │
│ query
│ │
merge ─────────────────────────────────────────────────┘
New Design’s Benefits
- Good for querying a time range
  - we can easily ignore all data blocks outside of the range
  - it trivially addresses the problem of series churn by reducing the set of inspected data to begin with
- Good for disk writes
  - when completing a block, we can persist the data from the in-memory database by sequentially writing just a handful of larger files
- Keeps V2's good property that recent chunks, which are queried most, are always hot in memory
- Flexible chunk size
  - we can pick any size that makes the most sense for the individual data points and the chosen compression format
- Deleting old data becomes extremely cheap and instantaneous
  - we merely have to delete a single directory; remember, in the old storage we had to analyze and re-write up to hundreds of millions of files, which could take hours to converge
Chunk-head
- A chunk is cut when
  - it fills up to 120 samples, or
  - it spans 2 hours (by default)
- Since Prometheus v2.19
  - not all chunks are stored in memory
  - when a chunk is cut, it is flushed to disk and memory-mapped (mmap)
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
Chunk Head → Block
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
- After some time, the chunks reach a threshold
  - when the chunk range in the head spans 3 hours
  - the first 2 hours of chunks (1, 2, 3, 4) are compacted into a block
- Meanwhile
  - the WAL is truncated at this point
  - and a "checkpoint" is created
Large file with “mmap”
- mmap stands for memory-mapped files. It is a way to read and write files without invoking a system call for every access.
- It is great when multiple processes access data read-only from the same file
  - it allows all those processes to share the same physical memory pages, saving a lot of memory
  - it also allows the operating system to optimize paging operations
[Diagram: a user-space process reaches the disk via three paths: read/write system calls through the page cache, mmap mapping the page cache directly into user space, and Direct I/O bypassing the page cache.]
Why mmap is faster than system calls
https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37
Write-Ahead Log (WAL)
- Widely used in relational databases to provide durability (the D in ACID)
  - persist every state change as a command in an append-only log
https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html
- Store each state change as a command
- A single log is appended sequentially
- Each log entry is given a unique identifier
- Roll the logs as a Segmented Log
- Clean the log with a Low-Water Mark
  - snapshot-based (ZooKeeper & etcd)
  - time-based (Kafka)
- Supports a Singular Update Queue
  - a work queue
  - a single thread
Prometheus WAL & Checkpoint
- WAL records include the Series and their corresponding Samples
  - a Series record is written only once, when the series is seen for the first time
  - a Samples record is written for every write request that contains samples
- WAL truncation: checkpoints
  - drop all series records for series that are no longer in the Head
  - drop all samples that are before time T
  - drop all tombstone records for time ranges before T
  - retain the remaining series, samples, and tombstone records in the same order as they appear in the WAL
- WAL replay
  - replay "checkpoint.X"
  - then replay WAL segments X+1, X+2, ..., X+N
- WAL compression
  - WAL records can be compressed with Snappy
  - Snappy was developed by Google, based on LZ77
  - it aims for very high speed and reasonable compression, not maximum compression or compatibility
  - it is widely used in many databases: Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, InfluxDB, ...
Source Code : https://github.com/prometheus/prometheus/tree/master/tsdb/wal
Before checkpointing:
data
└── wal
├── 000000
├── 000001
├── 000002
├── 000003
├── 000004
└── 000005
After checkpointing (segments 000000-000003 folded into checkpoint.000003):
data
└── wal
├── checkpoint.000003
| ├── 000000
| └── 000001
├── 000004
└── 000005
Block Compaction
- Problem
  - when querying multiple blocks, we have to merge their results into one overall result
  - a week-long query has to merge 80+ partial blocks
- Compaction
t0 t1 t2 t3 t4 now
┌────────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 mutable │ before
└────────────┘ └──────────┘ └───────────┘ └───────────┘ └───────────┘
┌─────────────────────────────────────────┐ ┌───────────┐ ┌───────────┐
│ 1 compacted │ │ 4 │ │ 5 mutable │ after (option A)
└─────────────────────────────────────────┘ └───────────┘ └───────────┘
┌──────────────────────────┐ ┌──────────────────────────┐ ┌───────────┐
│ 1 compacted │ │ 3 compacted │ │ 5 mutable │ after (option B)
└──────────────────────────┘ └──────────────────────────┘ └───────────┘
Retention
- Example
  - block 1 can be deleted safely; block 2 has to be kept until it lies fully behind the retention boundary
- Block compaction impact
  - block compaction could make a block too large to delete in time
  - so we need to limit the block size, e.g.
    maximum block size = 10% * retention window
|
┌────────────┐ ┌────┼─────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 | │ │ 3 │ │ 4 │ │ 5 │ . . .
└────────────┘ └────┼─────┘ └───────────┘ └───────────┘ └───────────┘
|
|
retention boundary
Index
- Use an inverted index for the label index
- Allocate a unique ID for every series
  - looking up a series by this ID is O(1)
  - this ID is the forward index
- Construct the labels' index
  - if series {2, 5, 10, 29} contain app="nginx"
  - then the list {2, 5, 10, 29} is the inverted index (posting list) for the label pair app="nginx"
- In short
  - the number of labels is significantly smaller than the number of series
  - walking through all of the labels is not a problem
{
  __name__="requests_total",
  pod="nginx-34534242-abc723",
  job="nginx",
  path="/api/v1/status",
  status="200",
  method="GET",
}
ID: 5

status="200": 1 2 5 ...
method="GET": 2 3 4 5 6 9 ...
Sets Operation
- Consider the following query:
  - app="foo" AND __name__="requests_total"
- How do we intersect two inverted-index lists?
- A classic algorithm interview question
  - given two integer arrays, return their intersection
    - A[] = {4, 1, 6, 7, 3, 2, 9}
    - B[] = {11, 30, 2, 70, 9}
    - returns {2, 9} as their intersection
  - given two integer arrays, return their union
    - A[] = {4, 1, 6, 7, 3, 2, 9}
    - B[] = {11, 30, 2, 70, 9}
    - returns {4, 1, 6, 7, 3, 2, 9, 11, 30, 70} as their union
- Time: O(m*n) with no extra space (naive nested loops)
Sort The Array
- If we sort the arrays
__name__="requests_total" -> [999, 1000, 1001, 2000000, 2000001, 2000002, 2000003]
app="foo" -> [1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002]
intersection => [1000, 1001]
- We can use an efficient algorithm
  - O(m+n): one pointer per array
func intersect(a, b []int) []int {
    var c []int
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        if a[i] > b[j] {
            j++
        } else if a[i] < b[j] {
            i++
        } else {
            c = append(c, a[i]) // equal: emit once and advance both pointers
            i++
            j++
        }
    }
    return c
}
- Series IDs must be easy to sort; using MD5 or UUID is not a good idea (V2 used a hash ID)
- Deleting data could force an index rebuild
Benchmark – CPU
- CPU usage in cores/second
- Prometheus 2.0 needs 3-10 times fewer CPU resources.
Benchmark – Disk Writes
- Disk writes in MB/second
- Prometheus 2.0 saves 97-99% of disk writes.
- Prometheus 1.5 is prone to wearing out SSDs.
Benchmark – Query Latency
- Query P99 latency in seconds
- With Prometheus 1.5, query latency increases over time as more series are stored.
Gorilla Requirements
- 2 billion unique time series, identified by a string key.
- 700 million data points (timestamp and value) added per minute.
- Store data for 26 hours.
- More than 40,000 queries per second at peak.
- Reads succeed in under one millisecond.
- Support time series with 15-second granularity (4 points per minute per time series).
- Two in-memory, not co-located replicas (for disaster-recovery capacity).
- Always serve reads, even when a single server crashes.
- Ability to quickly scan over all in-memory data.
- Support at least 2x growth per year.
- 85% of queries are for the latest 26 hours of data.
Key Technology
- Simple data model: (string key, int64 timestamp, double value)
- In memory: low latency
- High data compression ratio: saves ~90% of space
- Cache first, then disk: accepts possible data loss
- Stateless: easy to scale
  - Hash(key) → shard → node
Fundamental
- Delta Encoding (aka Delta Compression)
  - https://en.wikipedia.org/wiki/Delta_encoding
- Examples
  - HTTP: RFC 3229, "Delta encoding in HTTP"
  - rsync: delta file copying
  - online backup
  - version control
Compression Algorithm
Compress Timestamps (delta-of-delta)
  D = (t_n - t_{n-1}) - (t_{n-1} - t_{n-2})
- D = 0: store a single '0' bit
- D in [-63, 64]: store '10' followed by the value (7 bits)
- D in [-255, 256]: store '110' followed by the value (9 bits)
- D in [-2047, 2048]: store '1110' followed by the value (12 bits)
- otherwise: store '1111' followed by D (32 bits)
Compress Values (double floats)
  X = V_i XOR V_{i-1}
- X = 0: store a single '0' bit
- X != 0:
  - first compute the number of leading zeros and trailing zeros of the XOR. Store '1' as the first bit; the second bit depends on the previous XOR value:
    - if the leading-zero and trailing-zero counts match those of the previous XOR value, store '0' as the second bit, followed by the meaningful XOR bits (the XOR with its leading and trailing zeros stripped)
    - if they differ from the previous XOR value, store '1' as the second bit, then 5 bits for the number of leading zeros, then 6 bits for the length of the meaningful XOR bits, and finally the meaningful bits themselves (this case adds at least 13 bits of overhead)
Open Source Implementation
- Golang
  - https://github.com/dgryski/go-tsz
- Java
  - https://github.com/burmanm/gorilla-tsc
  - https://github.com/milpol/gorilla4j
- Rust
  - https://github.com/jeromefroe/tsz-rs
  - https://github.com/mheffner/rust-gorilla-tsdb
Reference
- Writing a Time Series Database from Scratch, by Fabian Reinartz
  https://fabxc.org/tsdb/
- Gorilla: A Fast, Scalable, In-Memory Time Series Database
  http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
- TSDB format
  https://github.com/prometheus-junkyard/tsdb/blob/master/docs/format/README.md
- PromCon 2017: Storing 16 Bytes at Scale, by Fabian Reinartz
  - video: https://www.youtube.com/watch?v=b_pEevMAC3I
  - slides: https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf
- Ganesh Vernekar's blog series on the Prometheus TSDB
  - Part 1: The Head Block: https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block
  - Part 2: WAL and Checkpoint: https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint
  - Part 3: Memory Mapping of Head Chunks from Disk: https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk
  - Part 4: Persistent Block and its Index: https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index
  - Part 5: Queries: https://ganeshvernekar.com/blog/prometheus-tsdb-queries
- Time-series compression algorithms, explained
  https://blog.timescale.com/blog/time-series-compression-algorithms-explained/