12. Bigtable
• “Bigtable: A Distributed Storage System for Structured Data”
- Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication
21. Database Model
• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key
• Row (string)
• Column Family
• Column Qualifier (string)
• Timestamp
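A minimal sketch of this model in Python, assuming plain dicts and made-up put/get helpers (the row and column family names echo the Bigtable paper's webtable example):

table = {}

def put(row, family, qualifier, value, ts):
    table.setdefault((row, family, qualifier), {})[ts] = value

def get(row, family, qualifier, ts=None):
    # Return the newest version at or before ts (latest if ts is None).
    versions = table.get((row, family, qualifier), {})
    eligible = [t for t in versions if ts is None or t <= ts]
    return versions[max(eligible)] if eligible else None

put("com.cnn.www", "contents", "", "<html>... v1", ts=1)
put("com.cnn.www", "contents", "", "<html>... v2", ts=2)
print(get("com.cnn.www", "contents", ""))   # prints "<html>... v2"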
24. Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian, ones' complement
• Simple byte-wise comparison
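A minimal sketch of such an encoding in Python; the zero-byte delimiters are assumptions, but it shows why storing the timestamp big-endian and ones'-complemented lets a plain byte-wise sort return newer versions first:

import struct

def encode_key(row: bytes, family_id: int, qualifier: bytes, ts: int) -> bytes:
    inverted_ts = ts ^ 0xFFFFFFFFFFFFFFFF       # ones' complement of int64 timestamp
    return (row + b"\x00" +                     # assumed row terminator
            bytes([family_id]) +                # 1-byte column family
            qualifier + b"\x00" +               # assumed qualifier terminator
            struct.pack(">Q", inverted_ts))     # big-endian

k_old = encode_key(b"row1", 1, b"q", ts=100)
k_new = encode_key(b"row1", 1, b"q", ts=200)
assert k_new < k_old    # plain byte-wise comparison sorts newer versions first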
31. PNUTS
• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
• Hashed tables implemented via proprietary disk-based hash
• Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!
33. Record-level Mastering
• Provides per-record timeline consistency
• Master is adaptively changed to suit workload
• Master region is identified by a two-byte region name stored with each record
37. Spanserver
• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of mappings:
(key:string, timestamp:int64) -> string
• Single Paxos state machine implemented on top of each tablet
• Tablet may contain multiple directories
• Set of contiguous keys that share a common prefix
• Unit of data placement
• Can be moved between tablets for performance reasons
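A minimal sketch of that mapping in Python, with made-up names; directory() just filters on a shared key prefix to mimic the unit-of-placement idea:

class Tablet:
    # A tablet's bag of mappings: (key: string, timestamp: int64) -> string.
    def __init__(self):
        self.data = {}

    def write(self, key, ts, value):
        self.data[(key, ts)] = value

    def read(self, key, ts):
        # Newest value for `key` at or before `ts`.
        versions = [t for (k, t) in self.data if k == key and t <= ts]
        return self.data[(key, max(versions))] if versions else None

    def directory(self, prefix):
        # Directory: the set of contiguous keys sharing a common prefix
        # (the unit of data placement).
        return sorted({k for (k, _) in self.data if k.startswith(prefix)})

t = Tablet()
t.write("users/alice", 10, "v1")
t.write("users/alice", 20, "v2")
print(t.read("users/alice", 15))    # "v1"
print(t.directory("users/"))        # ["users/alice"]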
38. TrueTime
• Universal Clock
• Set of time master servers per-datacenter
• GPS clocks via GPS receivers with dedicated antennas
• Atomic clock
• Time daemon runs on every machine
• TrueTime API:
• TT.now() – returns a TTinterval [earliest, latest] guaranteed to contain absolute time
• TT.after(t) – true if t has definitely passed
• TT.before(t) – true if t has definitely not arrived
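A minimal sketch of those three calls in Python, assuming a fixed 7 ms error bound (EPSILON_US is a made-up stand-in for the daemon's real, varying uncertainty):

import time

EPSILON_US = 7000   # assumption: fixed 7 ms bound in place of the real epsilon

def TT_now():
    # Interval [earliest, latest] guaranteed to contain absolute time.
    now_us = time.time_ns() // 1000
    return (now_us - EPSILON_US, now_us + EPSILON_US)

def TT_after(t_us):
    # True if t has definitely passed.
    return t_us < TT_now()[0]

def TT_before(t_us):
    # True if t has definitely not arrived.
    return t_us > TT_now()[1]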
41. Dremel
• Scalable, interactive ad-hoc query system
• Designed to operate on read-only data
• Handles nested data (Protocol Buffers)
• Can run aggregation queries over trillion-row tables in seconds
42. Columnar Storage Format
• Novel format for storing lists of nested records (Protocol Buffers)
• Highly space-efficient
• Algorithm for dissecting list of nested records into columns
• Algorithm for reassembling columns into list of records
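A heavily simplified sketch of the dissection step in Python, restricted to a single repeated leaf field (the real algorithm also tracks definition levels and handles arbitrary nesting):

def stripe_repeated(records, field):
    # Emit (value, repetition_level) pairs for one repeated leaf field:
    # r = 0 starts a new record, r = 1 continues the field within a record;
    # an absent field still emits a NULL so records can be reassembled.
    column = []
    for rec in records:
        values = rec.get(field, [])
        if not values:
            column.append((None, 0))
            continue
        for i, v in enumerate(values):
            column.append((v, 0 if i == 0 else 1))
    return column

recs = [{"items": [3, 5]}, {}, {"items": [7]}]
print(stripe_repeated(recs, "items"))
# [(3, 0), (5, 1), (None, 0), (7, 0)]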
43. Multi-level Execution Trees
• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
• Query gets re-written as it passes down the execution tree.
• On the way up, intermediate servers perform a parallel aggregation of partial results.
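A minimal sketch of this in Python: leaves emit partial aggregates, intermediate servers merge them in parallel, and the root finishes the computation (here AVG rewritten as SUM/COUNT, matching the first example query below):

def leaf_partial(rows):
    # Leaf server: scan local data and emit a partial aggregate.
    return {"sum": sum(rows), "count": len(rows)}

def merge(partials):
    # Intermediate/root server: parallel aggregation of partial results.
    return {"sum": sum(p["sum"] for p in partials),
            "count": sum(p["count"] for p in partials)}

leaves = [[1, 2, 3], [4, 5], [6], [7, 8, 9]]                # 4 leaf servers
level1 = [merge([leaf_partial(l) for l in leaves[:2]]),     # 2 intermediates
          merge([leaf_partial(l) for l in leaves[2:]])]
root = merge(level1)
print(root["sum"] / root["count"])   # AVG computed as SUM/COUNT: 5.0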
45. Example Queries
• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1
• SELECT country, SUM(item.amount) FROM T2
GROUP BY country
• SELECT domain, SUM(item.amount) FROM T2
WHERE domain CONTAINS '.net'
GROUP BY domain
• SELECT COUNT(DISTINCT a) FROM T5
55. Ethernet
• 10GbE
• Starting to replace 1GbE for server NICs
• De facto network port for new servers in 2014
• 40GbE
• Data center core & aggregation
• Top-of-rack server aggregation
• 100GbE
• Service Provider core and aggregation
• Metro and large Campus core
• Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE are solved using either 4 or 10 parallel 10GbE "lanes"
Strengths: Multiple servers can satisfy reads. Weaknesses: 1. Failover, 2. Mapping service on a single machine, 3. Irregular growth patterns can cause imbalance.
Designed for their “Shopping Cart” service
An important part of any DHT is the mechanism by which keys get mapped … Supports incremental scalability. Gossip protocol is used to propagate membership changes.
Dynamo was designed for high write availability (Shopping Cart service). This is also how they handle inter-datacenter replication. Read repair (downside is reading becomes expensive).
Dynamo uses Vector Clocks to assist with the reconciliation of divergent copies of objects in the system. Vector Clocks track the revision history of objects for the purposes of reconciliation in the event of a divergence. Any storage node in Dynamo is eligible to receive client get and put operations for any key. One vector clock is associated with every version of every object. If every counter on the first object's clock is less than or equal to the corresponding counter on the second, then the first is an ancestor of the second and can be forgotten.
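A minimal sketch of that ancestor test in Python, assuming a clock is just a dict of node -> counter:

def descends(a, b):
    # True if the version carrying clock `b` descends from the one carrying
    # clock `a`: every counter in `a` is <= the matching counter in `b`.
    return all(b.get(node, 0) >= count for node, count in a.items())

v1 = {"sx": 1}
v2 = {"sx": 2, "sy": 1}          # written after v1, via nodes sx then sy
v3 = {"sx": 1, "sz": 1}          # written after v1, via node sz
print(descends(v1, v2))          # True: v1 is an ancestor, can be forgotten
print(descends(v2, v3) or descends(v3, v2))
# False: v2 and v3 are divergent versions that must be reconciled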
Strengths: Low-latency writes, handles inter-datacenter replication. Weaknesses: Not ordered; read repair can impact read latency.
Strengths: Ordered, consistent. Weaknesses: Does not handle inter-datacenter replication.
This diagram shows two regions … Data replication happens via the Yahoo! Message Broker (YMB), a pub/sub system that offers reliable message delivery. Storage units manage tablets. The tablet controller contains the authoritative mapping of tablets to storage units and also orchestrates tablet movement.
PNUTS offers relaxed consistency guarantees across the dataset, but per-record timeline consistency through the use of Record-Level Mastering. They've found this to be sufficient for most of their web use cases.
Read-any – reads any (possibly stale) recent version. Low latency. Read-critical(required_version) – reads a version of the record that is the same as or newer than required_version.
Placement driver handles automated movement of data across zones.
Time daemons synchronize with time masters every 30 seconds and apply a drift rate of 200 microseconds/s between synchronizations. The reason for the two clock types is that they have different failure modes.
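As a worked bound from those figures: 200 µs/s of assumed drift over a 30-second sync interval adds 200 µs/s × 30 s = 6 ms of uncertainty, which is why the Spanner paper describes the error bound ε as a sawtooth varying between roughly 1 and 7 ms.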
Read-write transactions use pessimistic concurrency control (a lock table). Reads are lock-free and can happen at any replica that is sufficiently up-to-date.
Another key aspect of Dremel is how it handles certain common aggregation queries
SLC – Single Level Cell
Kryder’s law – 40% CAGR for areal density, 15% CAGR for sustained write bandwidth