12. Bigtable
• “Bigtable: A Distributed Storage System for Structured Data”
- Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication
21. Database Model
• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key
• Row (string)
• Column Family
• Column Qualifier (string)
• Timestamp
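A minimal sketch of this model in Python, assuming plain dicts and made-up put/get helpers (the row and column family names echo the Bigtable paper's webtable example):

table = {}

def put(row, family, qualifier, value, ts):
    table.setdefault((row, family, qualifier), {})[ts] = value

def get(row, family, qualifier, ts=None):
    # Return the newest version at or before ts (latest if ts is None).
    versions = table.get((row, family, qualifier), {})
    eligible = [t for t in versions if ts is None or t <= ts]
    return versions[max(eligible)] if eligible else None

put("com.cnn.www", "contents", "", "<html>... v1", ts=1)
put("com.cnn.www", "contents", "", "<html>... v2", ts=2)
print(get("com.cnn.www", "contents", ""))   # prints "<html>... v2"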
24. Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian, ones' complement
• Simple byte-wise comparison
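A minimal sketch of such an encoding in Python; the zero-byte delimiters are assumptions, but it shows why storing the timestamp big-endian and ones'-complemented lets a plain byte-wise sort return newer versions first:

import struct

def encode_key(row: bytes, family_id: int, qualifier: bytes, ts: int) -> bytes:
    inverted_ts = ts ^ 0xFFFFFFFFFFFFFFFF       # ones' complement of int64 timestamp
    return (row + b"\x00" +                     # assumed row terminator
            bytes([family_id]) +                # 1-byte column family
            qualifier + b"\x00" +               # assumed qualifier terminator
            struct.pack(">Q", inverted_ts))     # big-endian

k_old = encode_key(b"row1", 1, b"q", ts=100)
k_new = encode_key(b"row1", 1, b"q", ts=200)
assert k_new < k_old    # plain byte-wise comparison sorts newer versions first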
31. PNUTS
• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
• Hashed tables implemented via proprietary disk-based hash
• Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!
33. Record-level Mastering
• Provides per-record timeline consistency
• Master is adaptively changed to suit workload
• Master region is identified by a two-byte region name stored with each record
37. Spanserver
• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of mappings:
(key:string, timestamp:int64) -> string
• Single Paxos state machine implemented on top of each tablet
• Tablet may contain multiple directories
• Set of contiguous keys that share a common prefix
• Unit of data placement
• Can be moved between tablets for performance reasons
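A minimal sketch of that mapping in Python, with made-up names; directory() just filters on a shared key prefix to mimic the unit-of-placement idea:

class Tablet:
    # A tablet's bag of mappings: (key: string, timestamp: int64) -> string.
    def __init__(self):
        self.data = {}

    def write(self, key, ts, value):
        self.data[(key, ts)] = value

    def read(self, key, ts):
        # Newest value for `key` at or before `ts`.
        versions = [t for (k, t) in self.data if k == key and t <= ts]
        return self.data[(key, max(versions))] if versions else None

    def directory(self, prefix):
        # Directory: the set of contiguous keys sharing a common prefix
        # (the unit of data placement).
        return sorted({k for (k, _) in self.data if k.startswith(prefix)})

t = Tablet()
t.write("users/alice", 10, "v1")
t.write("users/alice", 20, "v2")
print(t.read("users/alice", 15))    # "v1"
print(t.directory("users/"))        # ["users/alice"]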
38. TrueTime
• Universal Clock
• Set of time master servers per-datacenter
• GPS clocks via GPS receivers with dedicated antennas
• Atomic clock
• Time daemon runs on every machine
• TrueTime API:
• TT.now() – returns a TTinterval [earliest, latest] guaranteed to contain absolute time
• TT.after(t) – true if t has definitely passed
• TT.before(t) – true if t has definitely not arrived
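A minimal sketch of those three calls in Python, assuming a fixed 7 ms error bound (EPSILON_US is a made-up stand-in for the daemon's real, varying uncertainty):

import time

EPSILON_US = 7000   # assumption: fixed 7 ms bound in place of the real epsilon

def TT_now():
    # Interval [earliest, latest] guaranteed to contain absolute time.
    now_us = time.time_ns() // 1000
    return (now_us - EPSILON_US, now_us + EPSILON_US)

def TT_after(t_us):
    # True if t has definitely passed.
    return t_us < TT_now()[0]

def TT_before(t_us):
    # True if t has definitely not arrived.
    return t_us > TT_now()[1]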
41. Dremel
• Scalable, interactive ad-hoc query system
• Designed to operate on read-only data
• Handles nested data (Protocol Buffers)
• Can run aggregation queries over trillion-row tables in seconds
42. Columnar Storage Format
• Novel format for storing lists of nested records (Protocol Buffers)
• Highly space-efficient
• Algorithm for dissecting list of nested records into columns
• Algorithm for reassembling columns into list of records
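A heavily simplified sketch of the dissection step in Python, restricted to a single repeated leaf field (the real algorithm also tracks definition levels and handles arbitrary nesting):

def stripe_repeated(records, field):
    # Emit (value, repetition_level) pairs for one repeated leaf field:
    # r = 0 starts a new record, r = 1 continues the field within a record;
    # an absent field still emits a NULL so records can be reassembled.
    column = []
    for rec in records:
        values = rec.get(field, [])
        if not values:
            column.append((None, 0))
            continue
        for i, v in enumerate(values):
            column.append((v, 0 if i == 0 else 1))
    return column

recs = [{"items": [3, 5]}, {}, {"items": [7]}]
print(stripe_repeated(recs, "items"))
# [(3, 0), (5, 1), (None, 0), (7, 0)]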
43. Multi-level Execution Trees
• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
• Query gets re-written as it passes down the execution tree.
• On the way up, intermediate servers perform a parallel aggregation of partial results.
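A minimal sketch of this in Python: leaves emit partial aggregates, intermediate servers merge them in parallel, and the root finishes the computation (here AVG rewritten as SUM/COUNT, matching the first example query below):

def leaf_partial(rows):
    # Leaf server: scan local data and emit a partial aggregate.
    return {"sum": sum(rows), "count": len(rows)}

def merge(partials):
    # Intermediate/root server: parallel aggregation of partial results.
    return {"sum": sum(p["sum"] for p in partials),
            "count": sum(p["count"] for p in partials)}

leaves = [[1, 2, 3], [4, 5], [6], [7, 8, 9]]                # 4 leaf servers
level1 = [merge([leaf_partial(l) for l in leaves[:2]]),     # 2 intermediates
          merge([leaf_partial(l) for l in leaves[2:]])]
root = merge(level1)
print(root["sum"] / root["count"])   # AVG computed as SUM/COUNT: 5.0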
45. Example Queries
• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1
• SELECT country, SUM(item.amount) FROM T2
GROUP BY country
• SELECT domain, SUM(item.amount) FROM T2
WHERE domain CONTAINS '.net'
GROUP BY domain
• SELECT COUNT(DISTINCT a) FROM T5
55. Ethernet
• 10GbE
• Starting to replace 1GbE for server NICs
• De facto network port for new servers in 2014
• 40GbE
• Data center core & aggregation
• Top-of-rack server aggregation
• 100GbE
• Service Provider core and aggregation
• Metro and large Campus core
• Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE are solved using either 4 or 10 parallel 10GbE "lanes"
Strengths: Multiple servers can satisfy reads. Weaknesses: 1. Failover, 2. Mapping service on a single machine, 3. Irregular growth patterns can cause imbalance.
Designed for their “Shopping Cart” service
An important part of any DHT is the mechanism by which keys get mapped … Supports incremental scalability. Gossip protocol is used to propagate membership changes.
Dynamo was designed for high write availability (Shopping Cart service). This is also how they handle inter-datacenter replication. Read repair (downside is reading becomes expensive).
Dynamo uses Vector Clocks to assist with the reconciliation of divergent copies of objects in the system. Vector Clocks track the revision history of objects for the purposes of reconciliation in the event of a divergence. Any storage node in Dynamo is eligible to receive client get and put operations for any key. One vector clock is associated with every version of every object. If every counter on the first object's clock is less than or equal to the corresponding counter on the second, then the first is an ancestor of the second and can be forgotten.
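A minimal sketch of that ancestor test in Python, assuming a clock is just a dict of node -> counter:

def descends(a, b):
    # True if the version carrying clock `b` descends from the one carrying
    # clock `a`: every counter in `a` is <= the matching counter in `b`.
    return all(b.get(node, 0) >= count for node, count in a.items())

v1 = {"sx": 1}
v2 = {"sx": 2, "sy": 1}          # written after v1, via nodes sx then sy
v3 = {"sx": 1, "sz": 1}          # written after v1, via node sz
print(descends(v1, v2))          # True: v1 is an ancestor, can be forgotten
print(descends(v2, v3) or descends(v3, v2))
# False: v2 and v3 are divergent versions that must be reconciled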
Strengths: Low-latency writes, handles inter-datacenter replication. Weaknesses: Not ordered; read repair can impact read latency.
Strengths: Ordered, consistent. Weaknesses: Does not handle inter-datacenter replication.
This diagram shows two regions … Data replication happens via the Yahoo! Message Broker (YMB), a pub/sub system that offers reliable message delivery. Storage units manage tablets. The tablet controller contains the authoritative mapping of tablets to storage units and also orchestrates tablet movement.
PNUTS offers relaxed consistency guarantees across the dataset, but per-record timeline consistency through the use of Record-Level Mastering. They've found this to be sufficient for most of their web use cases.
Read-any – reads any (possibly stale) recent version. Low latency. Read-critical(required_version) – reads a version of the record that is the same as or newer than required_version.
Placement driver handles automated movement of data across zones.
Time daemons synchronize with time masters every 30 seconds and apply a drift rate of 200 microseconds/s between synchronizations. The reason for the two clock types is that they have different failure modes.
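As a worked bound from those figures: 200 µs/s of assumed drift over a 30-second sync interval adds 200 µs/s × 30 s = 6 ms of uncertainty, which is why the Spanner paper describes the error bound ε as a sawtooth varying between roughly 1 and 7 ms.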
Read-write transactions use pessimistic concurrency control (a lock table). Reads are lock-free and can happen at any replica that is sufficiently up-to-date.
Another key aspect of Dremel is how it handles certain common aggregation queries
SLC – Single Level Cell
Kryder’s law – 40% CAGR for areal density, 15% CAGR for sustained write bandwidth