Bookie Storage
Apache BookKeeper Meetup, 2015-06-28
Matteo Merli
BookKeeper
▪ Provides distributed logs (ledgers)
▪ BookKeeper client + Bookies
▪ Client API can be summarized as:
› createLedger() → ledgerId
› ledger.addEntry(data) → entryId
› ledger.readEntry(ledgerId, entryId)
› deleteLedger(ledgerId)
▪ BK Client library implements all the “logic”
› Consistency, metadata in ZK, fencing, recovery, replication
▪ Bookie servers are responsible for storing the data
Bookie external interface
▪ Simple primitives
› addEntry(ledgerId, entryId, payload) → OK
› readEntry(ledgerId, entryId) → payload
› getLastEntry(ledgerId) → entryId
▪ Is that all??
› Fence flag on readEntry() → no more writes allowed to a ledger
› Deletion → background garbage collection
› Auto replication → it's a different logical component that uses the BK client API
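The three primitives and the fence flag can be sketched as a toy in-memory bookie. This is only an illustration, not the real server: the class name, the snake_case method names, the `FencedError` exception and the `-1` returned for an empty ledger are all assumptions made for the example.

```python
class FencedError(Exception):
    """Raised when a write targets a ledger that has been fenced."""


class ToyBookie:
    """In-memory stand-in for a bookie (hypothetical, for illustration)."""

    def __init__(self):
        self.entries = {}    # (ledger_id, entry_id) -> payload
        self.fenced = set()  # ledger ids that no longer accept writes

    def add_entry(self, ledger_id, entry_id, payload):
        if ledger_id in self.fenced:
            raise FencedError(ledger_id)
        self.entries[(ledger_id, entry_id)] = payload
        return "OK"

    def read_entry(self, ledger_id, entry_id, fence=False):
        # The fence flag rides on the read: once set, later writes are rejected.
        if fence:
            self.fenced.add(ledger_id)
        return self.entries[(ledger_id, entry_id)]

    def get_last_entry(self, ledger_id):
        ids = [e for (l, e) in self.entries if l == ledger_id]
        return max(ids) if ids else -1  # -1 for "no entries" is an assumption
```

A reader that wants to recover a ledger would call `read_entry(..., fence=True)`; any writer still appending to that ledger then gets `FencedError`.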
Interleaved storage
▪ Default bookie storage
▪ Uses a journal on a separate device
› Every entry is synced to the journal
▪ Entries are also written to "entryLog" files as they come in
› Writes to the entryLog are periodically flushed in the background
› Entries are appended to the current entry log file
› When entryLog reaches 2GB, a new one is created
› Entries for multiple ledgers are interleaved in the same entry log
▪ Need to maintain an index (ledgerId, entryId) → (entryLogId, offset)
› Default implementation uses a file for each ledger to store the data locations
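A minimal sketch of the interleaved layout, assuming an invented record format (4-byte length, 8-byte ledger id, 8-byte entry id, then the payload) and a tiny rollover limit in place of 2GB. The index here is a plain dict rather than per-ledger index files, and `ToyEntryLogger` is a hypothetical name.

```python
import io
import struct

MAX_LOG_SIZE = 64   # tiny stand-in for the 2GB rollover limit
HEADER_SIZE = 20    # 4-byte length + 8-byte ledger id + 8-byte entry id


class ToyEntryLogger:
    def __init__(self):
        self.logs = [io.BytesIO()]  # entry log "files", position = entryLogId
        self.index = {}             # (ledger_id, entry_id) -> (log_id, offset)

    def add_entry(self, ledger_id, entry_id, payload):
        log_id = len(self.logs) - 1
        log = self.logs[log_id]
        record_size = HEADER_SIZE + len(payload)
        # Roll over to a new entry log when the current one would overflow.
        if log.tell() > 0 and log.tell() + record_size > MAX_LOG_SIZE:
            self.logs.append(io.BytesIO())
            log_id += 1
            log = self.logs[log_id]
        offset = log.tell()
        # Entries from all ledgers are appended to the same log, in arrival order.
        log.write(struct.pack(">IQQ", len(payload), ledger_id, entry_id))
        log.write(payload)
        self.index[(ledger_id, entry_id)] = (log_id, offset)

    def read_entry(self, ledger_id, entry_id):
        log_id, offset = self.index[(ledger_id, entry_id)]
        buf = self.logs[log_id].getvalue()
        (length,) = struct.unpack_from(">I", buf, offset)
        start = offset + HEADER_SIZE
        return buf[start:start + length]
```

Reads always go through the index first, which is why the index has to be kept for every (ledgerId, entryId) pair.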
Bookie Garbage Collection
▪ Runs periodically in background
▪ Get the list of ledgers stored locally
▪ Get list of ledgers from ZK
▪ Any ledger that is not in ZK is marked for deletion
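The scan and compare step amounts to a set difference. A minimal sketch, where the function name is hypothetical and the two ledger lists are assumed to be already fetched:

```python
def ledgers_to_delete(local_ledgers, zk_ledgers):
    """Return the locally stored ledger ids that no longer exist in ZK."""
    return sorted(set(local_ledgers) - set(zk_ledgers))
```

Ledgers that exist in ZK but not locally are simply ignored: this bookie holds nothing for them.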
When are entry logs deleted?
▪ Need to keep track of usage of each entry log
▪ EntryLog metadata: an in-memory map for each entryLog
› (ledgerId → size)
▪ Whenever a ledger is deleted, each entry log updates its usage
▪ Metadata is appended to each entryLog, to avoid having to scan the log when the bookie restarts (since 4.4.0)
▪ If the entryLog usage is 0% → delete it
▪ If usage falls below x % → compaction
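The bookkeeping above can be sketched as follows. The class and function names are made up for the example, and usage is taken to be the remaining live bytes divided by the log's original size, which is an assumption about how the percentage is computed.

```python
class EntryLogMeta:
    """Per-entry-log usage map (hypothetical sketch)."""

    def __init__(self, ledger_sizes):
        self.ledger_sizes = dict(ledger_sizes)  # ledger_id -> bytes in this log
        self.total_size = sum(self.ledger_sizes.values())

    def remove_ledger(self, ledger_id):
        # Called when a ledger is deleted; its bytes stop counting as live.
        self.ledger_sizes.pop(ledger_id, None)

    def usage(self):
        if self.total_size == 0:
            return 0.0
        return sum(self.ledger_sizes.values()) / self.total_size


def decide(meta, threshold):
    """Return 'delete', 'compact' or 'keep' for one entry log."""
    usage = meta.usage()
    if usage == 0.0:
        return "delete"
    if usage < threshold:
        return "compact"
    return "keep"
```

With sizes `{1: 600, 2: 300, 3: 100}` and a 50% threshold, deleting ledger 1 drops usage to 40% and makes the log a compaction candidate; deleting the other two makes it deletable.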
Entry log compaction
▪ There are two kinds of compaction, which differ in interval and threshold:
› Minor (every 1 hour, usage < 50%)
› Major (every 1 day, usage < 80%)
▪ Scan the entryLog file and append all valid entries to the current (newer) entryLog file
▪ Update the indexes to point to the new locations
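A rough sketch of one compaction pass, with entry logs modelled as Python lists instead of files. The data shapes and the `compact` function are assumptions for illustration, not the bookie's actual code.

```python
def compact(entry_logs, index, old_log_id, current_log_id, live_ledgers):
    """Copy entries of live ledgers out of old_log_id, then drop that log.

    entry_logs: log_id -> list of (ledger_id, entry_id, payload)
    index: (ledger_id, entry_id) -> (log_id, position in that log's list)
    """
    current = entry_logs[current_log_id]
    for ledger_id, entry_id, payload in entry_logs[old_log_id]:
        if ledger_id not in live_ledgers:
            # Entry belongs to a deleted ledger: drop it and its index entry.
            index.pop((ledger_id, entry_id), None)
            continue
        # Still valid: re-append to the current log and repoint the index.
        current.append((ledger_id, entry_id, payload))
        index[(ledger_id, entry_id)] = (current_log_id, len(current) - 1)
    del entry_logs[old_log_id]
```

Note that surviving entries end up mixed with whatever is currently being written, which matters for read locality.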
Changes already done
▪ Interleaved writes in the entryLog make for poor read performance
› Typically you want to read many entries sequentially
› In SortedLedgerStorage (since BK-4.3) and in DbLedgerStorage (scheduled for BK-4.5), there's the concept of a write cache:
• Defer writing to the entryLog and sort by ledgerId/entryId so that entries are stored sequentially
› On the same note, using a read-ahead cache amortizes IO ops
▪ Use RocksDB to maintain indexes
› In DbLedgerStorage we load all the offsets into RocksDB. This helps when storing many ledgers (tested with a few million) in a single bookie.
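The write-cache idea can be sketched in a few lines: buffer entries in arrival order, then sort by (ledgerId, entryId) at flush time so each ledger's entries land next to each other. The function name and the list-of-tuples cache are assumptions for illustration.

```python
def flush_write_cache(write_cache):
    """Drain the cache and return its entries ordered by (ledger_id, entry_id).

    write_cache: list of (ledger_id, entry_id, payload) in arrival order.
    """
    ordered = sorted(write_cache, key=lambda e: (e[0], e[1]))
    write_cache.clear()
    return ordered
```

The longer entries sit in the cache before a flush, the longer the sequential runs per ledger, at the cost of holding more objects in memory.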
Improvements areas / 1
▪ JVM GC still has an impact on latencies
› Several improvements have already been made
› GC cannot be avoided; going to 0 allocations per entry written is not practical
› The only option is to make pauses as infrequent as possible
› Single-bookie throughput is limited by GC rather than hardware:
› Above a certain rate, the latency spikes from pauses would make it miss its SLA
› Batching more logical entries into a single BK entry helps a lot, but it's not always practical
Improvements areas / 2
▪ Getting long runs of sequential entries, and taking advantage of the read-ahead cache, depends on flushInterval:
› Frequent flushes make for fewer contiguous entries
› A longer interval means more long-lived Java objects and longer pauses
▪ Similarly, if writes are spread across many ledgers with a very low per-ledger rate, there will be few sequential entries
▪ Bookie compaction
› During compaction, older entries are re-appended and mixed with new entries
› Long-lived entries get compacted over and over again
› Need to keep EntryLogMetadata in memory (when storing 20TB that can be quite significant)
Considerations on Bookie storage
▪ Original BK implementation dates back to 2009
▪ Bookie storage really resembles an LSM DB
› Journal → Write Ahead Log
› Entry Log → SSTs
› Compaction
› Write cache → MemTable
› Read cache → LRU Block cache
▪ Why not directly store all the data in RocksDB?
▪ Can we get the same performance as current Bookie?
▪ That would replace a large portion of the Bookie code
▪ At that point, why not have the Bookie server in C++?
Bookie-CPP
▪ What is it?
› Proof of concept to validate performance assumptions
› Compatible with regular BK Java client
› Async C++ server that writes into RocksDB
› So far, only addEntry() implemented
▪ What it is not
› No plan to write a BK client in C++
▪ Why?
› Fully utilize IO capacity (vertical scalability)
› Better compaction, no GC pauses, block-level compression, etc…
RocksDB tuning
▪ We can make RocksDB look like Bookie
▪ Goals: high throughput and low latency for writes
› Use a background thread to implement group commit on top of RocksDB
› To ensure writes are not stalled by compaction, use a large MemTable (write cache) size: 4x 1GB
› Use big SST size: 1GB
› Big block-size: 256K (helps for HDDs)
› Compaction read-ahead buffer: 8MB
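Group commit with a background thread can be sketched as below. This is a generic Python illustration, not the Bookie-CPP code: `GroupCommitter`, `sync_fn` and `max_batch` are hypothetical names, and `sync_fn` stands in for whatever writes and syncs one batch to the store.

```python
import queue
import threading


class GroupCommitter:
    """Collects concurrent writes and commits them in batches."""

    def __init__(self, sync_fn, max_batch=64):
        self.sync_fn = sync_fn      # writes and syncs one batch (assumed)
        self.max_batch = max_batch
        self.pending = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def add(self, record):
        """Queue a record; the returned event is set once it is synced."""
        done = threading.Event()
        self.pending.put((record, done))
        return done

    def close(self):
        self.pending.put(None)      # sentinel: stop after draining
        self.thread.join()

    def _run(self):
        while True:
            item = self.pending.get()
            if item is None:
                return
            batch, stop = [item], False
            # Grab whatever else is already queued, up to max_batch.
            while len(batch) < self.max_batch:
                try:
                    nxt = self.pending.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            self.sync_fn([rec for rec, _ in batch])  # one sync for the batch
            for _, done in batch:
                done.set()
            if stop:
                return
```

Each caller pays for at most one sync it did not trigger itself, while the device sees one sync per batch instead of one per entry.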
How to implement deletion
▪ Bookie GC will still do the same scan & compare
▪ Typically, in LSM DBs a delete operation consists of writing a tombstone marker
› Data is deleted when the tombstones are pushed to the last level and the SST is compacted
▪ RocksDB provides additional options to delete data:
› DeleteFilesInRange() → immediately delete SSTs that only contain keys in that range
• e.g.: DeleteFilesInRange( [ledgerId, ledgerId+1) )
› Compaction filter → hook into RocksDB compaction to decide which data needs to be kept when compacting. Can use the map of active ledgers to do it. Compaction can also be forced by calling CompactRange()
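Both deletion options rely on keys that sort by ledger first. A minimal sketch, assuming a key layout of big-endian 8-byte ledgerId followed by big-endian 8-byte entryId (this layout and all function names are assumptions, not taken from the slides):

```python
import struct


def entry_key(ledger_id, entry_id):
    """Big-endian (ledger_id, entry_id), so byte order matches numeric order."""
    return struct.pack(">QQ", ledger_id, entry_id)


def ledger_key_range(ledger_id):
    """Half-open key range [begin, end) covering every entry of one ledger."""
    return entry_key(ledger_id, 0), entry_key(ledger_id + 1, 0)


def keep_during_compaction(key, active_ledgers):
    """Compaction-filter style check: keep only keys of active ledgers."""
    (ledger_id,) = struct.unpack_from(">Q", key, 0)
    return ledger_id in active_ledgers
```

With this encoding, the range returned by `ledger_key_range(ledgerId)` is the byte-level form of [ledgerId, ledgerId+1), and the filter function is the kind of decision a compaction filter would make per key.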
Preliminary tests / 1
▪ 1 client - 1 bookie
▪ Bookie journal: SSD + RAID BBU
▪ Bookie ledgers: HDDs
▪ Writing 60K 1KB entries/s over multiple ledgers
▪ C++ perf tool that simulates a BK client and measures latency
› Only sends addEntry requests / no actual ledger metadata in ZK
› Using a C++ client removes JVM GC measurement noise on the client side
▪ Measure 99th-percentile write latency over different time intervals
› 1min, 10sec, 1sec
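Measuring the 99th percentile per interval can be sketched as follows. The slides do not say how the perf tool computes it; this version assumes the nearest-rank definition and fixed, non-overlapping windows, and the function name is hypothetical.

```python
from collections import defaultdict


def p99_per_window(samples, window_sec):
    """samples: iterable of (timestamp_sec, latency).

    Returns {window_start_sec: 99th-percentile latency}, nearest-rank method.
    """
    windows = defaultdict(list)
    for ts, latency in samples:
        windows[int(ts // window_sec) * window_sec].append(latency)
    result = {}
    for start, values in sorted(windows.items()):
        values.sort()
        rank = (99 * len(values) + 99) // 100  # ceil(0.99 * n), 1-based
        result[start] = values[rank - 1]
    return result
```

Shorter windows (1 sec) expose brief stalls that a 1 min window averages away, which is why the same run is reported at several interval sizes.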
Preliminary tests / 2
Preliminary tests / 3
Preliminary tests / 4
Conclusions
▪ Preliminary results look promising
▪ Work in Progress, code at github.com/merlimat/bookie-cpp
▪ Feedback welcome
▪ Hopefully there’s interest in this area
▪ It would be great to include in main BK repository at some point
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 

Bookie storage - Apache BookKeeper Meetup - 2015-06-28

  • 1. Bookie Storage, by Matteo Merli
  • 2. BookKeeper
    ▪ Provides distributed logs (ledgers)
    ▪ BookKeeper client + bookies
    ▪ The client API can be summarized as:
      › createLedger() → ledgerId
      › ledger.addEntry(data) → entryId
      › ledger.readEntry(ledgerId, entryId)
      › deleteLedger(ledgerId)
    ▪ The BK client library implements all the "logic"
      › Consistency, metadata in ZK, fencing, recovery, replication
    ▪ Bookie servers are in charge of storing the data
  • 3. Bookie external interface
    ▪ Simple primitives
      › addEntry(ledgerId, entryId, payload) → OK
      › readEntry(ledgerId, entryId) → payload
      › getLastEntry(ledgerId) → entryId
    ▪ Is that all?
      › Fence flag on readEntry() → no more writes allowed to the ledger
      › Deletion → background garbage collection
      › Auto replication → a separate logical component that uses the BK client API
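The three primitives and the fence flag can be illustrated with a toy in-memory model. This is only a sketch: the class name `InMemoryBookie`, the `FencedError` exception, and the `fence` keyword argument are hypothetical names chosen here, not the real bookie protocol, and nothing is persisted.

```python
class FencedError(Exception):
    """Raised when a write targets a ledger that has been fenced."""


class InMemoryBookie:
    """Toy bookie (hypothetical): entries live in a dict, with no durability."""

    def __init__(self):
        self._entries = {}    # (ledger_id, entry_id) -> payload
        self._last = {}       # ledger_id -> highest entry_id stored so far
        self._fenced = set()  # ledgers that reject further writes

    def add_entry(self, ledger_id, entry_id, payload):
        if ledger_id in self._fenced:
            raise FencedError(f"ledger {ledger_id} is fenced")
        self._entries[(ledger_id, entry_id)] = payload
        self._last[ledger_id] = max(entry_id, self._last.get(ledger_id, -1))
        return "OK"

    def read_entry(self, ledger_id, entry_id, fence=False):
        # A read carrying the fence flag marks the ledger as closed for writes.
        if fence:
            self._fenced.add(ledger_id)
        return self._entries[(ledger_id, entry_id)]

    def get_last_entry(self, ledger_id):
        return self._last[ledger_id]
```

After a fenced read, any later `add_entry` on that ledger raises `FencedError`, which is the behaviour the client library relies on during recovery.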
  • 4. Interleaved storage
    ▪ Default bookie storage
    ▪ Use a journal on a separate device
      › Every entry is synced to the journal
    ▪ Entries are also written to "entryLog" files as they come in
      › Writes to the entryLog are periodically flushed in the background
      › Entries are appended to the current entry log file
      › When the entryLog reaches 2GB, a new one is created
      › Entries for multiple ledgers are interleaved in the same entry log
    ▪ Need to maintain an index (ledgerId, entryId) → (entryLogId, offset)
      › The default implementation uses a file for each ledger to store the data locations
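The append path and the location index can be sketched as follows. This is a simplified model under stated assumptions: entry logs are in-memory byte arrays rather than files, the journal is omitted, and the index also stores the payload length (an addition made here so the sketch can read an entry back). The class and attribute names are hypothetical.

```python
class InterleavedStorage:
    """Toy model: entries of all ledgers are appended to one current entry log."""

    def __init__(self, max_log_size=2 * 1024**3):
        self.max_log_size = max_log_size
        self.logs = [bytearray()]  # entry_log_id is the position in this list
        self.index = {}            # (ledger_id, entry_id) -> (entry_log_id, offset, length)

    def add_entry(self, ledger_id, entry_id, payload):
        current = self.logs[-1]
        # Roll over to a new entry log once the current one would exceed the limit.
        if current and len(current) + len(payload) > self.max_log_size:
            self.logs.append(bytearray())
        log_id = len(self.logs) - 1
        offset = len(self.logs[log_id])
        self.logs[log_id] += payload
        self.index[(ledger_id, entry_id)] = (log_id, offset, len(payload))

    def read_entry(self, ledger_id, entry_id):
        log_id, offset, length = self.index[(ledger_id, entry_id)]
        return bytes(self.logs[log_id][offset:offset + length])
```

With a small `max_log_size` it is easy to see entries from different ledgers sharing one log, and a new log being started once the limit is reached.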
  • 5. Bookie garbage collection
    ▪ Runs periodically in the background
    ▪ Get the list of ledgers stored locally
    ▪ Get the list of ledgers from ZK
    ▪ Any ledger that is not in ZK is marked for deletion
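The scan-and-compare step amounts to a set difference. A minimal sketch, assuming both ledger lists are already available as plain collections of ledger ids (the function name is hypothetical):

```python
def ledgers_to_delete(local_ledgers, zk_ledgers):
    """Return the ledger ids stored locally that no longer exist in ZK."""
    return sorted(set(local_ledgers) - set(zk_ledgers))
```

Ledgers that exist in ZK but not locally are simply ignored, since this bookie holds no data for them.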
  • 6. When are entry logs deleted?
    ▪ Need to keep track of the usage of each entry log
    ▪ EntryLog metadata: an in-memory map for each entryLog
      › (ledgerId → size)
    ▪ Whenever a ledger is deleted, each entry log updates its usage
    ▪ Metadata is appended to each entryLog, to avoid having to scan the log when the bookie restarts (since 4.4.0)
    ▪ If the entryLog usage is 0% → delete it
    ▪ If usage falls below x% → compaction
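The usage bookkeeping can be sketched as a per-entry-log map of ledgerId → size plus a decision function. The names below (`EntryLogUsage`, `decide`) and the returned strings are hypothetical and chosen for illustration; the real EntryLogMetadata class may differ.

```python
class EntryLogUsage:
    """Tracks how many bytes of one entry log still belong to live ledgers."""

    def __init__(self, entry_log_id):
        self.entry_log_id = entry_log_id
        self.total_size = 0     # bytes ever written to this entry log
        self.ledger_sizes = {}  # ledger_id -> bytes still considered live

    def add(self, ledger_id, size):
        self.total_size += size
        self.ledger_sizes[ledger_id] = self.ledger_sizes.get(ledger_id, 0) + size

    def remove_ledger(self, ledger_id):
        # Called when garbage collection finds that the ledger was deleted.
        self.ledger_sizes.pop(ledger_id, None)

    def usage(self):
        if self.total_size == 0:
            return 0.0
        return sum(self.ledger_sizes.values()) / self.total_size


def decide(meta, threshold):
    """Return 'delete', 'compact' or 'keep' for one entry log."""
    usage = meta.usage()
    if usage == 0.0:
        return "delete"
    if usage < threshold:
        return "compact"
    return "keep"
```

Here `threshold` plays the role of the "x%" in the slide, expressed as a fraction between 0 and 1.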
  • 7. Entry log compaction
    ▪ There are two kinds of compaction, which differ in threshold:
      › Minor (every 1 hour, usage < 50%)
      › Major (every 1 day, usage < 80%)
    ▪ Scan the entryLog file and append all valid entries to the current (newer) entryLog file
    ▪ Update the indexes to point to the new locations
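The scan, re-append and re-index steps can be sketched like this. It is a simplified model under assumptions made here: each entry log is a Python list of `(ledger_id, entry_id, payload)` tuples, the index maps to `(entry_log_id, position)`, and "valid" means the ledger is still in the set of active ledgers. All names are hypothetical.

```python
def compact_entry_log(logs, index, old_log_id, current_log_id, active_ledgers):
    """Move live entries from an old entry log into the current one.

    logs:  dict entry_log_id -> list of (ledger_id, entry_id, payload)
    index: dict (ledger_id, entry_id) -> (entry_log_id, position)
    """
    current = logs[current_log_id]
    for ledger_id, entry_id, payload in logs[old_log_id]:
        key = (ledger_id, entry_id)
        if ledger_id in active_ledgers:
            # Re-append the entry and point the index at its new location.
            index[key] = (current_log_id, len(current))
            current.append((ledger_id, entry_id, payload))
        else:
            # The ledger was deleted: drop the entry and its index record.
            index.pop(key, None)
    # Every live entry has been moved, so the old log can be removed.
    del logs[old_log_id]
```

Note that the surviving old entries end up interleaved after whatever was already in the current log.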
  • 8. Changes already done
    ▪ Interleaved writes in the entryLog make for poor read performance
      › Typically you want to read many entries sequentially
      › SortedLedgerStorage (since BK-4.3) and DbLedgerStorage (scheduled for BK-4.5) introduce the concept of a write cache:
        • Defer the writing to the entryLog and sort by ledgerId/entryId so that entries are stored sequentially
      › On the same note, using a read-ahead cache amortizes IO ops
    ▪ Use RocksDB to maintain indexes
      › In DbLedgerStorage we load all the offsets into RocksDB. This helps when storing many ledgers (tested with a few million) in a single bookie.
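The write-cache idea (buffer first, then flush sorted by ledgerId/entryId) can be sketched as below. The `WriteCache` class here is a hypothetical illustration and not the SortedLedgerStorage or DbLedgerStorage implementation.

```python
class WriteCache:
    """Buffers entries in arrival order and flushes them sorted by key."""

    def __init__(self):
        self._entries = {}  # (ledger_id, entry_id) -> payload

    def put(self, ledger_id, entry_id, payload):
        self._entries[(ledger_id, entry_id)] = payload

    def flush(self):
        """Return buffered entries ordered by (ledger_id, entry_id) and reset."""
        # Keys are unique, so sorting never needs to compare payloads.
        ordered = sorted(self._entries.items())
        self._entries = {}
        return [(ledger_id, entry_id, payload)
                for (ledger_id, entry_id), payload in ordered]
```

Writing the flushed list to the entry log in one pass places each ledger's entries next to each other, which is what makes sequential reads and read-ahead effective.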
  • 9. Improvement areas / 1
    ▪ JVM GC still has an impact on latencies
      › Several improvements have already been made
      › GC cannot be avoided; going to 0 allocations per entry written is not practical
      › The only option is to make pauses as infrequent as possible
      › Single-bookie throughput is limited by GC rather than by hardware:
        • Above a certain rate, the latency spikes from pauses would make it miss the SLA
      › Batching more logical entries into a single BK entry helps a lot, but it's not always practical
  • 10. Improvement areas / 2
    ▪ Having long runs of sequential entries, and so benefiting from the read-ahead cache, depends on flushInterval:
      › Frequent flushes make for fewer contiguous entries
      › A longer interval means more long-lived Java objects and longer pauses
    ▪ Similarly, if writes are spread across many ledgers with a very low per-ledger rate, there will be few sequential entries
    ▪ Bookie compaction
      › During compaction, older entries are re-appended and mixed with new entries
      › Long-lived entries get compacted over and over again
      › Need to keep EntryLogMetadata in memory (when storing 20TB that can be quite significant)
  • 11. Considerations on bookie storage
    ▪ The original BK implementation dates back to 2009
    ▪ Bookie storage really resembles an LSM DB
      › Journal → write-ahead log
      › Entry log → SSTs
      › Compaction
      › Write cache → MemTable
      › Read cache → LRU block cache
    ▪ Why not store all the data directly in RocksDB?
    ▪ Can we get the same performance as the current bookie?
    ▪ That would replace a large portion of the bookie code
    ▪ At that point, why not have the bookie server in C++?
  • 12. Bookie-CPP
    ▪ What is it?
      › Proof of concept to validate the performance assumptions
      › Compatible with the regular BK Java client
      › Async C++ server that writes into RocksDB
      › So far, only addEntry() is implemented
    ▪ What is it not?
      › There is no plan to write a BK client in C++
    ▪ Why?
      › Fully utilize IO capacity (vertical scalability)
      › Better compaction, no GC pauses, block-level compression, etc.
  • 13. RocksDB tuning
    ▪ We can make RocksDB look like a bookie
    ▪ Goals: high throughput and low latency for writes
      › Use a background thread to implement group commit on top of RocksDB
      › To ensure writes are not stalled by compaction, use a large MemTable (write cache) size: 4 x 1GB
      › Use a big SST size: 1GB
      › Big block size: 256K (helps for HDDs)
      › Compaction read-ahead buffer: 8MB
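Group commit with a background thread can be sketched as follows. This is an illustration of the general technique, not the bookie-cpp code: `write_batch` is a hypothetical callable standing in for one synced batched write to the store, and `GroupCommitter` and `max_batch` are names invented here.

```python
import queue
import threading


class GroupCommitter:
    """Collects pending writes and commits them together from one thread."""

    def __init__(self, write_batch, max_batch=1000):
        self._write_batch = write_batch  # callable taking a list of (key, value)
        self._max_batch = max_batch
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, key, value):
        """Enqueue a write; the returned event is set once it is committed."""
        done = threading.Event()
        self._queue.put((key, value, done))
        return done

    def close(self):
        self._queue.put(None)  # sentinel: stop after draining earlier writes
        self._thread.join()

    def _run(self):
        while True:
            item = self._queue.get()  # block until there is work
            if item is None:
                return
            batch, stop = [item], False
            # Drain whatever else is already queued, up to max_batch.
            while len(batch) < self._max_batch:
                try:
                    nxt = self._queue.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            self._write_batch([(k, v) for k, v, _ in batch])  # one commit
            for _, _, done in batch:
                done.set()
            if stop:
                return
```

The cost of one sync is shared by every write in the batch, so throughput grows with load while each caller still waits for its own entry to be committed.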
  • 14. How to implement deletion
    ▪ Bookie GC will still do the same scan & compare
    ▪ Typically, in LSM DBs a delete operation consists of writing a tombstone marker
      › Data is deleted when the tombstones are pushed to the last level and the SST is compacted
    ▪ RocksDB provides additional options to delete data:
      › DeleteFilesInRange() → immediately deletes SSTs that only contain keys in that range
        • e.g.: DeleteFilesInRange( [ledgerId, ledgerId+1) )
      › Compaction filter → hook into RocksDB compaction to decide which data needs to be kept when compacting. Can use the map of active ledgers to do it. Compaction can also be forced by calling CompactRange()
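Both options depend on all keys of a ledger forming one contiguous range. A sketch of one possible key layout, assuming (this is not stated in the slides) that keys are the big-endian encoding of `(ledgerId, entryId)` so that byte order matches numeric order. The sorted Python list stands in for the key space; unlike DeleteFilesInRange(), which only drops whole SST files lying entirely inside the range, `delete_range` here removes exact keys.

```python
import bisect
import struct


def entry_key(ledger_id, entry_id):
    """Encode (ledger_id, entry_id) so that byte order equals numeric order."""
    return struct.pack(">QQ", ledger_id, entry_id)


def ledger_range(ledger_id):
    """Half-open key range [ledgerId, ledgerId + 1) covering one ledger."""
    return entry_key(ledger_id, 0), entry_key(ledger_id + 1, 0)


def delete_range(sorted_keys, begin, end):
    """Remove every key k with begin <= k < end from a sorted list."""
    lo = bisect.bisect_left(sorted_keys, begin)
    hi = bisect.bisect_left(sorted_keys, end)
    del sorted_keys[lo:hi]


def keep_during_compaction(key, active_ledgers):
    """Compaction-filter style predicate: keep only entries of active ledgers."""
    ledger_id, _entry_id = struct.unpack(">QQ", key)
    return ledger_id in active_ledgers
```

With this layout, deleting ledger 2 is a single range operation, and the same predicate can be applied entry by entry while a compaction rewrites the data.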
  • 15. Preliminary tests / 1
    ▪ 1 client, 1 bookie
    ▪ Bookie journal: SSD + RAID BBU
    ▪ Bookie ledgers: HDDs
    ▪ Writing 60K 1KB entries/s over multiple ledgers
    ▪ C++ perf tool that simulates a BK client and measures latency
      › Only sends addEntry requests / no actual ledger metadata in ZK
      › Using a C++ client removes JVM GC measurement noise on the client side
    ▪ Measure 99th-percentile write latency over different time intervals
      › 1 min, 10 sec, 1 sec
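Measuring the 99th percentile over windows of different lengths can be sketched as follows. This is not the perf tool from the talk: the function names, the millisecond timestamps and the nearest-rank percentile definition are choices made here for illustration.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def p99_per_window(samples, window_ms):
    """samples: iterable of (timestamp_ms, latency). Returns window -> p99."""
    buckets = defaultdict(list)
    for timestamp_ms, latency in samples:
        buckets[timestamp_ms // window_ms].append(latency)
    return {window: percentile(latencies, 99)
            for window, latencies in sorted(buckets.items())}
```

Short windows expose brief latency spikes that a longer window averages away, which is why the same run is reported at several interval lengths.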
  • 16. Preliminary tests / 2
  • 17. Preliminary tests / 3
  • 18. Preliminary tests / 4
  • 19. Conclusions
    ▪ Preliminary results look promising
    ▪ Work in progress, code at github.com/merlimat/bookie-cpp
    ▪ Feedback welcome
    ▪ Hopefully there's interest in this area
    ▪ It would be great to include it in the main BK repository at some point