EVCache: Lowering Costs for a Low Latency Cache with RocksDB

EVCache: Lowering Costs
for a Low Latency Cache
with RocksDB
Scott Mansfield
Vu Nguyen
EVCache

What do caches touch?
Signing up*
Logging in
Choosing a profile
Picking liked videos
Personalization*
Loading home page*
Scrolling home page*
A/B tests
Video image selection
Searching*
Viewing title details
Playing a title*
Subtitle / language prefs
Rating a title
My List
Video history*
UI strings
Video production*
* multiple caches involved

Key-Value store optimized for
AWS and tuned for Netflix use
cases
Ephemeral Volatile Cache

What is EVCache?
Distributed, sharded, replicated key-value store
Tunable in-region and global replication
Based on Memcached
Resilient to failure
Topology aware
Linearly scalable
Seamless deployments

Why Optimize for AWS
Instances disappear
Zones fail
Regions become unstable
Network is lossy
Customer requests bounce between regions
Failures happen and we test all the time

EVCache Use @ Netflix
Hundreds of terabytes of data
Trillions of ops / day
Tens of billions of items stored
Tens of millions of ops / sec
Millions of replications / sec
Thousands of servers
Hundreds of instances per cluster
Hundreds of microservice clients
Tens of distinct clusters
3 regions
4 engineers

Architecture
Server
Memcached
EVCar
Application
Client Library
Client
Eureka
(Service Discovery)

Architecture
us-west-2a us-west-2cus-west-2b
ClientClient Client

Reading (get)
Client
Primary Secondary

Writing (set, delete, add, etc.)
ClientClient Client

Use Case: Lookaside Cache
Application
Client Library
Client Ribbon Client
S S S S
C C C C
Data Flow

Use Case: Transient Data Store
Application
Client Library
Client
Application
Client Library
Client
Application
Client Library
Client
Time

Use Case: Primary Store
Offline / Nearline
Precomputes for
Recommendations
Online Services
Offline Services
Online Application
Client Library
Client
Data Flow

Online Services
Offline Services
Use Case: Versioned Primary Store
Offline Compute
Online Application
Client Library
Client
Data Flow
Archaius
(Dynamic Properties)
Control System
(Valhalla)

Use Case: High Volume && High Availability
Compute & Publish
on schedule
Data Flow
Application
Client Library
In-memory Remote Ribbon Client
Optional
S S S S
C C C C

Pipeline of Personalization
Compute A
Compute B Compute C
Compute D
Online Services
Offline Services
Compute E
Data Flow
Online 1 Online 2

Additional Features
Kafka
● Global data replication
● Consistency metrics
Key Iteration
● Cache warming
● Lost instance recovery
● Backup (and restore)

Additional Features (Kafka)
Global data replication
Consistency metrics

Region BRegion A
APP APP
Repl Proxy
Repl Relay
1 mutate
2 send
metadata
3 poll msg
5
https send
m
sg
6 mutate4
get data
for set
Kafka Repl Relay Kafka
Repl Proxy
Cross-Region Replication
7 read

Additional Features (Key Iteration)
Cache warming
Lost instance recovery
Backup (and restore)

Cache Warming
Cache Warmer
(Spark)
Application
Client Library
Client
Control
S3
Data Flow
Metadata Flow
Control Flow

Moneta
Next-generation
EVCache server

Moneta
Moneta: The Goddess of Memory
Juno Moneta: The Protectress of Funds for Juno
● Evolution of the EVCache server
● Cost optimization
● EVCache on SSD
● Ongoing lower EVCache cost per stream
● Takes advantage of global request patterns

Old Server
● Stock Memcached and EVCar (sidecar)
● All data stored in RAM in Memcached
● Expensive with global expansion / N+1 architecture
Memcached
EVCar
external

Optimization
● Global data means many copies
● Access patterns are heavily region-oriented
● In one region:
○ Hot data is used often
○ Cold data is almost never touched
● Keep hot data in RAM, cold data on SSD
● Size RAM for working set, SSD for overall dataset

New Server
● Adds Rend and Mnemonic
● Still looks like Memcached
● Unlocks cost-efficient storage & server-side intelligence

go get github.com/netflix/rend
Rend

Rend
● High-performance Memcached proxy & server
● Written in Go
○ Powerful concurrency primitives
○ Productive and fast
● Manages the L1/L2 relationship
● Tens of thousands of connections

Rend
● Modular set of libraries and an example main()
● Manages connections, request orchestration, and
communication
● Low-overhead metrics library
● Multiple orchestrators
● Parallel locking for data integrity
● Efficient connection pool
Server Loop
Request Orchestration
Backend Handlers
M
E
T
R
I
C
S
Connection Management
Protocol

Mnemonic
● Manages data storage on SSD
● Uses Rend server libraries
○ Handles Memcached protocol
● Maps Memcached ops to RocksDB ops
Rend Server Core Lib (Go)
Mnemonic Op Handler (Go)
Mnemonic Core (C++)
RocksDB (C++)

Why RocksDB?
● Fast at medium to high write load
○ Disk--write load higher than read load (because of Memcached)
● Predictable RAM Usage
memtable
Record A Record B
SST
memtable memtable
SST SST . . .
SST: Static Sorted Table

How we use RocksDB
● No Level Compaction
○ Generated too much traffic to SSD
○ High and unpredictable read latencies
● No Block Cache
○ Rely on Local Memcached
● No Compression

How we use RocksDB
● FIFO Compaction
○ SST’s ordered by time
○ Oldest SST deleted when full
○ Reads access every SST until record found

How we use RocksDB
● Full File Bloom Filters
○ Full Filter reduces unnecessary SSD reads
● Bloom Filters and Indices pinned in memory
○ Minimize SSD access per request

How we use RocksDB
● Records sharded across multiple RocksDB per node
○ Reduces number of files checked to decrease latency
R
...
Mnemonic Core
Key: ABC
Key: XYZ
R R R

Region-Locality Optimizations
● Replication and Batch updates only RocksDB*
○ Keeps Region-Local and “hot” data in memory
○ Separate Network Port for “off-line” requests
○ Memcached data “replaced”

FIFO Limitations
● FIFO compaction not suitable for all use cases
○ Very frequently updated records may push out valid records
● Expired Records still exist
● Requires Larger Bloom Filters
SST
Record A2
Record B1
Record B2
Record A3
Record A1
Record A2
Record B1
Record B2
Record A3
Record A1
Record B3Record B3
Record C
Record D
Record E
Record F
Record G
Record H
time
SSTSST

AWS Instance Type
● i2.xlarge
○ 4 vCPU
○ 30 GB RAM
○ 800 GB SSD
■ 32K IOPS (4KB Pages)
■ ~130MB/sec

Moneta Perf Benchmark (High-Vol Online Requests)

Moneta Performance in Production (Batch Systems)
● Request ‘get’ 95%, 99% = 729 μs, 938 μs
● L1 ‘get’ 95%, 99% = 153 μs, 191 μs
● L2 ‘get’ 95%, 99% = 1005 μs, 1713 μs
● ~20 KB Records
● ~99% Overall Hit Rate
● ~90% L1 Hit Rate

Moneta Performance in Prod (High Vol-Online Req)
● Request ‘get’ 95%, 99% = 174 μs, 588 μs
● L1 ‘get’ 95%, 99% = 145 μs, 190 μs
● L2 ‘get’ 95%, 99% = 770 μs, 1330 μs
● ~1 KB Records
● ~98% Overall Hit Rate
● ~97% L1 Hit Rate

Get Percentiles:
● 50th
: 102 μs (101 μs)
● 75th
: 120 μs (115 μs)
● 90th
: 146 μs (137 μs)
● 95th
: 174 μs (166 μs)
● 99th
: 588 μs (427 μs)
● 99.5th
: 733 μs (568 μs)
● 99.9th
: 1.39 ms (979 μs)
Moneta Performance in Prod (High Vol-Online Req)
Set Percentiles:
● 50th
: 97.2 μs (87.2 μs)
● 75th
: 107 μs (101 μs)
● 90th
: 125 μs (115 μs)
● 95th
: 138 μs (126 μs)
● 99th
: 177 μs (152 μs)
● 99.5th
: 208 μs (169 μs)
● 99.9th
: 1.19 ms (318 μs)
Latencies: peak (trough)

Challenges/Concerns
● Less Visibility
○ Unclear of Overall Data Size because of duplicates and expired records
○ Restrict Unique Data Set to ½ of Max for Precompute Batch Data
● Lower Max Throughput than Memcached-based Server
○ Higher CPU usage
○ Planning must be better so we can handle unusually high request spikes

Current/Future Work
● Investigate Blob Storage feature
○ Less Data read/write from SSD during Level Compaction
○ Lower Latency, Higher Throughput
○ Better View of Total Data Size
● Purging Expired SST’s earlier
○ Useful in “short” TTL use cases
○ May purge 60%+ SST earlier than FIFO Compaction
○ Reduce Worst Case Latency
○ Better Visibility of Overall Data Size
● Inexpensive Deduping for Batch Data

Open Source
https://github.com/netflix/EVCache
https://github.com/netflix/rend

Thank You
smansfield@netflix.com (@sgmansfield)
nguyenvu@netflix.com
techblog.netflix.com

Failure Resilience in Client
● Operation Fast Failure
● Tunable Retries
● Operation Queues
● Tunable Latch for Mutations
● Async Replication through Kafka

Region A
APP
Consistency
Checker
1 mutate
2 send
metadata
3 poll msg
Kafka
Consistency Metrics
4 pull data
Atlas
(Metrics Backend)
5 report
Client
Dashboards

Lost Instance Recovery
Cache Warmer
(Spark)
Application
Client Library
Client
S3
Partial Data Flow
Metadata Flow
Control Flow
Control
Zone A Zone B
Data Flow

Backup (and Restore)
Cache Warmer
(Spark)
Application
Client Library
Client
Control
S3Data Flow
Control Flow

Moneta in Production
● Serving all of our personalization data
● Rend runs with two ports:
○ One for standard users (read heavy or active data management)
○ Another for async and batch users: Replication and Precompute
● Maintains working set in RAM
● Optimized for precomputes
○ Smartly replaces data in L1
external internal
EVCar
Memcached (RAM)
Mnemonic (SSD)
Std
Batch

EVCache: Lowering Costs for a Low Latency Cache with RocksDB

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to EVCache: Lowering Costs for a Low Latency Cache with RocksDB

Similar to EVCache: Lowering Costs for a Low Latency Cache with RocksDB (20)

Recently uploaded

Recently uploaded (20)

EVCache: Lowering Costs for a Low Latency Cache with RocksDB