By Rajiv Kurian, software engineer at SignalFx.
At SignalFx, we deal with high-volume, high-resolution data from our users. This requires a high-performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond the performance plateaus encountered with more generic approaches. Some key examples include:
* Write very simple single-threaded code, instead of complex algorithms
* Parallelize by running multiple copies of simple single-threaded code, instead of using concurrent algorithms
* Separate the data plane from the control plane, instead of slowing the data path for the sake of control operations
* Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation
3. Agenda
1. Why we need to scale ingest
2. Basic properties and limitations of modern hardware
3. Optimization techniques inspired by HPC
4. Results!
5. Q&A (hopefully!)
5. SignalFx is an advanced monitoring platform for modern applications
• High resolution: up to 1-second data
• Streaming analytics: charts and analytics update at 1-second resolution, in real time
• Multidimensional metrics: dimensions represent customer, server, etc., and can be filtered and aggregated, e.g. 99th-percentile latency by service and customer
7. SignalFx ingest library
[Diagram] Raw data in → the library maintains a rollup per time series (TimeSeries 0 … TimeSeries 8 shown) → rollup data out.
8. Issues identified (before applying HPC techniques)
• Expensive - too many servers
• Exhibits parallel slowdown
• More threads = worse performance
• What did the profile say?
• Death by a thousand cuts
• The core library = 35% of profile
11. Cache Lines
• Data is transferred between memory and cache in fixed-size blocks called cache lines, usually 64 bytes
• When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. In the case of:
• a cache hit, the processor immediately reads or writes the data in the cache line
• a cache miss, the cache allocates a new entry and copies in data from main memory; the request (read or write) is then fulfilled from the contents of the cache
• The memory subsystem makes two kinds of bets to help us (illustrated in the sketch below):
• Temporal locality
• Spatial locality
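To make the spatial-locality bet concrete, here is a small, self-contained Java sketch (mine, not from the talk; the array size and stride are arbitrary). Summing a large array sequentially lets one 64-byte line serve eight consecutive longs, while a large-stride walk over the same data touches a new line on almost every access and will typically run noticeably slower.

```java
// Illustrative only: sequential traversal exploits spatial locality, so one
// 64-byte cache line serves eight consecutive longs; a large-stride traversal
// touches a new line on almost every access.
public class LocalityDemo {
    static final int N = 1 << 24;          // 16M longs, ~128 MB
    static final long[] data = new long[N];

    static long sumSequential() {
        long sum = 0;
        for (int i = 0; i < N; i++) {      // walks memory line by line
            sum += data[i];
        }
        return sum;
    }

    static long sumStrided(int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < N; i += stride) {  // jumps across cache lines
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long a = sumSequential();
        long t1 = System.nanoTime();
        long b = sumStrided(16);           // stride of 16 longs = 128 bytes, skips lines
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, strided: %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}
```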
12. Reference latency numbers for comparison
By Jeff Dean: http://research.google.com/people/jeff/
L1 Cache 0.5ns
Branch mispredict 5 ns
L2 Cache 7 ns 14x L1 Cache
Mutex lock/unlock 25 ns
Main memory 100 ns 20x L2 Cache, 200x L1 Cache
Compress 1K bytes (Zippy) 3,000 ns
Send 1K bytes over 1Gbps 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms
Read 1MB sequentially from memory 250,000 ns 0.25 ms
Round trip within same DC 500,000 ns 0.5 ms
Read 1MB sequentially from SSD 1,000,000 ns 1 ms 4x memory
Disk seek 10,000,000 ns 10 ms 20x DC roundtrip
Read 1MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20x SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
19. SignalFx library benchmark
[Diagram] Raw data in, in random order, one datapoint per time series. The library maintains a map from key (ID 0 … ID 1M) to TimeSeries rollup (rollup 0 … rollup 1M), producing rollup data out (labelled "50x").
20. SignalFx library benchmark
[Same diagram as slide 19] This lookup-and-rollup path accounted for 35% of the profile of the entire application.
21. SignalFx: Techniques inspired by HPC that have improved our pipeline
Single-threaded, event-based architectures: parallelize by running multiple copies of single-threaded code
22. Single-threaded event-based architectures
• Threads work on their own private data (as much as possible)
• Communicate with other threads using events/messages
23. SignalFx
[Diagram] A network-in thread receives data and passes events to the processor thread(s); the processor threads update thread-local data (a key/value table: key 1 … key 4) and pass events on to a network-out thread, which writes batched data.
24. SignalFx
[Diagram] The same pipeline, with ring buffers carrying the events between the network-in thread, the processor thread(s) with their local key/value data, and the network-out thread.
25. Single-threaded event-based architectures: advantages
• Enables many other optimizations, such as:
• Compact, array-based data structures
• Buffer/object re-use
• Loosely coupled, which makes it easy to test
• Run multiple copies for parallelism
26. SignalFx
[Diagram] The same pipeline with the ring buffer slots numbered 1-4: the network-in thread receives data, the worker thread(s) process it against their local key/value data, and the network-out thread writes batched data.
27. SignalFx
[Diagram] Three independent worker threads; each receives data using async IO, processes it synchronously, and writes data using async IO. Each worker owns its own local key/value data (keys 1-4, keys 5-8, and keys 9-12 respectively), so no state is shared between workers.
28. SignalFx
[Diagram] A variant with a dedicated async-IO thread: the network thread receives data, the processor thread(s) work on their local key/value data, and batched IO calls are handed to the async-IO thread, with ring buffers (steps numbered 1-7) connecting the stages.
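A minimal sketch of the pattern in slides 22-28 (mine, not the SignalFx code; a plain ArrayBlockingQueue stands in for the ring buffers, the event type is invented, and it assumes Java 16+ for records): each worker owns its data outright and only communicates through events.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of the single-threaded, event-based pattern. A bounded queue
// stands in for the ring buffers shown above; the event type is hypothetical.
public class SingleThreadedWorker implements Runnable {
    // Hypothetical event: a (timeSeriesId, value) pair handed over by the network-in thread.
    record DataPointEvent(long timeSeriesId, double value) {}

    private final BlockingQueue<DataPointEvent> inbound;   // written by the network-in thread
    private final BlockingQueue<DataPointEvent> outbound;  // drained by the network-out thread
    // Thread-private state: only this worker ever touches it, so no locks are needed.
    private final Map<Long, Double> rollups = new HashMap<>();

    SingleThreadedWorker(BlockingQueue<DataPointEvent> inbound,
                         BlockingQueue<DataPointEvent> outbound) {
        this.inbound = inbound;
        this.outbound = outbound;
    }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                DataPointEvent e = inbound.take();                          // receive event
                double rolled = rollups.merge(e.timeSeriesId(), e.value(), Double::sum);
                outbound.put(new DataPointEvent(e.timeSeriesId(), rolled)); // emit event
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();                             // shut down cleanly
        }
    }

    public static void main(String[] args) {
        BlockingQueue<DataPointEvent> in = new ArrayBlockingQueue<>(1024);
        BlockingQueue<DataPointEvent> out = new ArrayBlockingQueue<>(1024);
        // Parallelism comes from running several independent copies, each with its
        // own queues and its own rollup map, not from sharing state between threads.
        new Thread(new SingleThreadedWorker(in, out), "worker-0").start();
    }
}
```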
29. Advice for threaded applications
• Threads should ideally reflect the actual parallelism of the system (see the sizing sketch below)
• Avoid gratuitous oversubscription
• Possible exception: IO threads
• DO NOT communicate between threads unless you have to
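A rough illustration of that sizing advice (my assumption, not numbers from the talk): pin the number of compute workers to the core count, and keep a small, separate pool for anything that blocks on IO.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sizing only: compute workers match the hardware parallelism;
// a small separate pool absorbs blocking IO (the "exception" above).
public class ThreadSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService computeWorkers = Executors.newFixedThreadPool(cores);
        ExecutorService blockingIo = Executors.newFixedThreadPool(Math.max(2, cores / 4));
        System.out.println("compute threads = " + cores);
        computeWorkers.shutdown();
        blockingIo.shutdown();
    }
}
```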
30. SignalFx: Techniques inspired by HPC that have improved our pipeline
Use compact, cache-conscious, array-based data structures with minimal indirection
32. Basic principles
• Strive for smaller data structures; extra computation is OK (e.g. compressing network data)
• Design data structures that facilitate processing multiple entries: big arrays!
• Layout should reflect access patterns
33. Hash maps
• Hash map lookups are NOT free!
• A lookup in a well-implemented hash map is, by definition, a cache miss
• Popular implementations like java.util.HashMap can cause multiple cache misses per lookup
41. Array of co-located key/value
[Diagram] A single array of slots, each holding a key and its value side by side (Key 0/Value 0 … Key 7/Value 7).
42. Cache misses with no collision
[Diagram] A lookup lands directly on its slot: one cache miss, and the co-located value arrives on the same line.
43. Cache misses with collisions
[Diagram] On a collision the probe moves to a neighbouring slot (misses marked 1 and 2); because the slots are contiguous, the extra probe costs at most one additional cache line.
44. Hash map of key to index into an array of structs
[Diagram] A small key → index table (Key 0 → 1, Key 1 → 6, Key 2 → 4, Key 3 → 8) alongside a separate flat array of values (Value 0 … Value 8).
45. Cache misses with collision
[Diagram] The first cache miss (marked 1) happens in the key → index table while probing for the key.
46. Cache misses with collision
[Diagram] A second cache miss (marked 2) happens when the index is followed into the values array.
47. New library memory layout
[Diagram] Raw data in → an ID → index map (ID 0 → 1, ID 1 → 6, ID 2 → 4, ID 3 → 8) points into a flat array of TimeSeries rollups (rollup 0 … rollup 8) → rollup out.
48. Changing hash map implementations
• java.util.HashMap (uses separate chaining and boxes primitives) for a long -> int lookup
• Allocations galore
• net.openhft.koloboke primitive open-addressed hash map
• 45% improvement
For the JVM use libraries like https://github.com/OpenHFT/Koloboke. For C++ try https://github.com/preshing/CompareIntegerMaps or similar.
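To show the shape of the new layout (slides 44-48), here is a minimal sketch of my own; it is not the SignalFx library and not Koloboke, the hash mix and capacities are arbitrary, and it omits resizing. An open-addressed long -> int map built from flat primitive arrays avoids boxing and chained nodes, and its value is simply an index into a flat array of rollup state.

```java
// Minimal sketch (not the SignalFx library, not Koloboke): an open-addressed
// long -> int map stored in flat parallel arrays (no boxing, no node chains),
// whose value is an index into a flat rollup array, as in slides 44-47.
public class IdToIndexMap {
    private static final long EMPTY = Long.MIN_VALUE;  // assumes this ID never occurs
    private final long[] keys;
    private final int[] values;
    private final int mask;

    IdToIndexMap(int capacityPow2) {
        keys = new long[capacityPow2];
        values = new int[capacityPow2];
        mask = capacityPow2 - 1;
        java.util.Arrays.fill(keys, EMPTY);
    }

    private static int hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L;             // simple mix; not the talk's hash
        return (int) (h ^ (h >>> 32));
    }

    void put(long key, int index) {                     // sketch only: no resizing
        int slot = hash(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;                   // linear probing: the next slot is
        }                                               // adjacent in memory
        keys[slot] = key;
        values[slot] = index;
    }

    int get(long key) {
        int slot = hash(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) return values[slot];
            slot = (slot + 1) & mask;
        }
        return -1;                                      // not found
    }

    public static void main(String[] args) {
        IdToIndexMap map = new IdToIndexMap(1 << 21);   // ~2M slots for ~1M time series
        double[] rollups = new double[1 << 20];         // flat array of rollup state
        map.put(42L, 7);                                // time series 42 lives at index 7
        rollups[map.get(42L)] += 3.5;                   // one lookup, then plain array math
        System.out.println(rollups[7]);
    }
}
```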
49. Access patterns
[Diagram] Objects 0-3 each contain Fields 0-4, with hot data and cold data interleaved inside every object, so the hot loop drags cold fields through the cache.
50. Group fields accessed together
[Diagram] The hot fields (Fields 0-2) of all objects are stored together and the cold fields (Fields 3-4) are stored together separately, so the hot loop only touches the hot group (see the sketch below).
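A rough sketch of that split (field names are invented by me; the real library's fields differ): hot, per-datapoint state is packed into dense primitive arrays, while rarely used metadata lives in separate objects that the hot loop never touches.

```java
// Rough sketch (field names invented): hot, per-datapoint state is packed into
// primitive arrays indexed by time series slot, so the hot loop streams through
// a few dense arrays; rarely used metadata lives in a separate object.
public class HotColdSplit {
    static final int MAX_SERIES = 1 << 20;

    // Hot fields: read/updated on every datapoint, laid out as parallel arrays.
    static final double[] sum = new double[MAX_SERIES];
    static final double[] max = new double[MAX_SERIES];
    static final int[] count = new int[MAX_SERIES];

    // Cold fields: touched only occasionally (e.g. when emitting metadata).
    static final class SeriesMetadata {
        String name;
        long createdAtMillis;
        String[] dimensions;
    }
    static final SeriesMetadata[] metadata = new SeriesMetadata[MAX_SERIES];

    static void record(int slot, double value) {
        // The hot path touches only the dense arrays above.
        sum[slot] += value;
        if (value > max[slot]) max[slot] = value;
        count[slot]++;
    }

    public static void main(String[] args) {
        record(7, 3.5);
        System.out.println(sum[7] + " " + max[7] + " " + count[7]);
    }
}
```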
51. Results of separating hot and cold data
A hot loop run about once every 500 ms
• Old - Hot and cold data kept together
• 5 cache lines per time series
• Took anywhere between 62-70 ms
• New - Hot and cold data kept separate
• 3 cache lines of hot data per time series
• Took anywhere between 40-45 ms
• 35% improvement
53. Old vs New
• Concurrent -> single threaded
• Locks gone
• Array based data structures
• Zero allocations
• Extensive batching and hardware prefetching
• Multiple hash maps -> a single hash map lookup
62. 35% of the profile but 3.4x improvement?
• Amdahl’s law
• At most a 1.54x improvement even if that 35% dropped to 0% (see the arithmetic below)
• Why 3.4x ?
• When you use less cache, you leave more for
others - thus speeding up other code too
• Lesson
• A profiler is a necessary tool, but not a substitute for
informed design
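For reference, the 1.54x bound is just Amdahl's law with the profiled fraction p = 0.35: overall speedup = 1 / ((1 - p) + p / s), and with the optimized part made free (s → ∞) the maximum is 1 / (1 - 0.35) ≈ 1.54.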
77. Potential layout after GC
[Diagram] An object B holds pointers B1* and B2* to child objects B1 and B2, each of which holds two ints. After GC the three objects can end up scattered across the heap with other data in between, so reading the four ints costs three object headers, two pointer dereferences, and potentially three separate cache lines.
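One way to avoid that indirection (a sketch of the general idea, not the talk's exact code) is to flatten the children's primitive fields into the parent, so the data stays together no matter where the GC moves the object.

```java
// Sketch of flattening: instead of B holding pointers to B1 and B2 (three
// headers, two indirections, objects the GC may scatter), the four ints are
// stored inline, so reading them touches a single object.
public class FlattenedB {
    // Was: Object B -> B1 { int a, b } and B2 { int c, d }
    int b1a, b1b;   // fields formerly in B1
    int b2c, b2d;   // fields formerly in B2

    int sum() {
        return b1a + b1b + b2c + b2d;   // no pointer chasing
    }

    public static void main(String[] args) {
        FlattenedB b = new FlattenedB();
        b.b1a = 1; b.b1b = 2; b.b2c = 3; b.b2d = 4;
        System.out.println(b.sum());
    }
}
```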
80. What the control and data planes do
In networking terminology:
• Data plane: the part that decides what to do with packets arriving on an inbound interface. Frequent.
• Control plane: the part concerned with drawing the network map or routing table. Infrequent.
81. The goal of control and data plane separation
DO NOT slow the frequent path because
of the infrequent path
82. Runtime configuration variables
Worker threadConfiguration variables
(volatile/atomic)
Setter thread
while (1) {
process_data_using_configuration_variables();
}
Flag 0
Flag 1
Flag 2
Flag 3
83. Runtime configuration variables
[Diagram] The same setter thread and volatile/atomic flags, but the worker keeps its own cached copy of the configuration variables and runs:
while (1) {
  cache_configuration_variables();
  process_a_ton_of_stuff();
}
84. Volatile/atomic flag vs cached local flag
• Before: all runtime flags (consulted on every data point) were volatile/atomic loads
• After: all runtime flags are cached locally and refreshed once per run loop (see the sketch below)
• About 8% improvement in datapoints/second; others might see more or less
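A small sketch of the before/after in Java (flag names invented by me): the flags stay volatile so the setter thread's writes remain visible, but the worker copies them into plain locals once per pass over a batch, keeping volatile loads off the per-datapoint path.

```java
// Sketch (names invented): volatile flags are published by a setter thread and
// cached into plain locals once per run-loop pass, so the per-datapoint hot
// path does ordinary reads instead of volatile loads.
public class ConfigFlagCaching {
    // Control plane: updated rarely, from another thread.
    static volatile boolean emitDebugRollups = false;
    static volatile int rollupWindowSeconds = 1;

    static void runLoop(double[] batch) {
        while (!Thread.currentThread().isInterrupted()) {
            // Cache the configuration once per pass over a batch of datapoints.
            boolean debug = emitDebugRollups;
            int window = rollupWindowSeconds;
            for (double point : batch) {
                process(point, debug, window);     // hot path: no volatile loads
            }
        }
    }

    static void process(double point, boolean debug, int window) {
        // ... per-datapoint work using the cached flags ...
    }

    public static void main(String[] args) {
        // Control-plane side: a setter thread would simply assign the volatiles.
        emitDebugRollups = true;
        rollupWindowSeconds = 10;
        System.out.println(emitDebugRollups + " " + rollupWindowSeconds);
    }
}
```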