Slides for Brian Bulkowski's talk about Golang performance:
microbenchmarks, profilers, and a war story about optimizing the Aerospike Database Go client.
http://www.meetup.com/Go-lang-Developers-NYC/events/216650022/
Golang Performance : microbenchmarks, profilers, and a war story
1. The fastest NoSQL database
Talking about Go Performance
Try it while I blab!
github.com/aerospike/aerospike-server
github.com/aerospike/aerospike-client-go
2. Who am I ?
Brian Bulkowski
brian@bulkowski.org
brian@aerospike.com
@bbulkow
TRS-80, PC, Apple II, Vax 11/70, Wang
First product: lightpen university teaching kiosk
Palo Alto High School ( ‘85 )
Liberate / NetComputer through the boom
10B market cap in 1999, employee 32
2003-2007 “time off” ( startups )
Citrusleaf / Aerospike history
42 year old first-time CEO (me)
2008 Prototype
2010 First sales “get the band back together”
2011+ 3 rounds of funding (Draper, ALP, NEA, CNTP)
70 employees, 2 offices
3. Does Brian know performance?
Undergrad project: image converter
Single pass arbitrary scale and rotate w/ nyquist filters
Novell
Fastest Appletalk server + router available
Starlight Networks
150Mb/sec video server on P133
Liberate
HTML technology for embedded systems
Aggregate Knowledge
Realtime recommendations: 2x faster in first week
Aerospike
10x faster than existing NoSQL, 100x faster than RDBMS
4. Internet Technology Stack
MILLIONS OF CONSUMERS
BILLIONS OF DEVICES
APP SERVERS
DATA
WRITE CONTEXT
INSIGHTS WAREHOUSE
In-memory NoSQL
WRITE REAL-TIME CONTEXT
READ RECENT CONTENT
PROFILE STORE
Cookies, email, deviceID, IP address, location,
segments, clicks, likes, tweets, search terms...
REAL-TIME ANALYTICS
Best sellers, top scores, trending tweets
BATCH ANALYTICS
Discover patterns,
segment data: location
patterns, audience
affinity
9. If it is so good, why haven't I heard of it?
Established in 2009 (newer than most)
Used in Advertising – ad exchanges, data exchanges,
targeting, real-time bidding, real-time attribution.
Open Sourced in June 2014
10. When should I use Aerospike?
Redis, but with scale & flash
Cassandra, but fast
User data, session data, behavior, fraud…
API billing ~ retail actions ~ recommendations
Up and running in 10 minutes!
( vagrant, EC2 …)
11. Why does Aerospike care about Go?
It’s cool !
Promises performance with expressiveness
( as an old C guy, Go is aimed at me )
Our customers are diving in, deploying
What about (other versions of other languages)…
( sure, they’re cool too! )
Go!
12. Let’s talk about….
Some old microbenchmarks
Profilers, how to run them
War story: optimizing our Go client
( sure, we know Go isn’t JUST about performance )
14. Old Microbenchmark
Seconds (Nov 2009)
1.1 - python (CPython 2.6.2, the distro release with no tweaks)
4.6 - go (current hg release)
4.2 - ruby 1.8 (distro release)
1.1 - ruby 1.9 (distro release)
Pike said:
I suspect the great majority of the time in your benchmark is due to Go's current
rudimentary garbage collector. Tests like this generate a lot of garbage that is
collected slowly. From experiments I've done, a better implementation can make a
huge difference. Profiling this test shows at least 50% of the time is in the allocator
and collector, as opposed to about 5% printing the string and less than 15% in the
map code. A better allocator and collector would make a dramatic change.
The short answer: the Go runtime is new and completely untuned. The libraries
need work too.
15. Microbenchmark
“T1”
for i := 0; i < 1000000; i++ {
	x = (2 * x) + x + 1
}
1.96 s (big integer only) Python
1.04 ms (2.17s big.Int) Go
5 ms (2.15s BigNum) Java
Good news: Go is right in the hunt, but easier to code
Amazon m3.xlarge (4 core E3@2.5Ghz)
Python 2.6.9
Go 1.3.3
Java 1.7.0_71
Amazon Linux (3.16)
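A minimal sketch of how T1 could be reproduced in Go. The slide shows only the loop body, so the function names, the int64 type for the native run, and the reduced round count are assumptions made here for illustration. The native version wraps around once the value leaves the 64-bit range, while the big.Int version keeps growing, which is one plausible reading of the gap between the millisecond and the two-second figures.

```go
package main

import (
	"fmt"
	"math/big"
	"time"
)

// t1Native runs x = (2*x) + x + 1 for n rounds on a 64-bit integer.
// The value wraps around silently once it exceeds the int64 range.
func t1Native(n int) int64 {
	var x int64
	for i := 0; i < n; i++ {
		x = (2 * x) + x + 1
	}
	return x
}

// t1Big runs the same recurrence with arbitrary precision, so the
// number keeps growing and later rounds cost more than early ones.
func t1Big(n int) *big.Int {
	x := new(big.Int)
	two := big.NewInt(2)
	one := big.NewInt(1)
	tmp := new(big.Int)
	for i := 0; i < n; i++ {
		tmp.Mul(two, x) // tmp = 2 * x
		tmp.Add(tmp, x) // tmp = 2*x + x
		x.Add(tmp, one) // x = 2*x + x + 1
	}
	return x
}

func main() {
	const rounds = 100000 // assumption: smaller than the slide's 1,000,000

	start := time.Now()
	native := t1Native(rounds)
	fmt.Printf("int64:   %v (wrapped value %d)\n", time.Since(start), native)

	start = time.Now()
	bigResult := t1Big(rounds)
	fmt.Printf("big.Int: %v (%d bits)\n", time.Since(start), bigResult.BitLen())
}
```

Timings from this sketch will not match the slide; it only makes the two variants easy to compare on your own machine.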
16. Microbenchmarks
T5 – the 2009 benchmark
12.5 sec Python
12.56 sec Go
2.56 sec Java
Good news: not slower than python!
Bad news: Holy Crap compared to Java
17. Microbenchmarks – the old code
T5 – the 2009 benchmark (slower CPU)
for x := 0; x < 1000000; x++ {
	a := make(map[int]string)
	for a1 := 0; a1 < 50; a1++ {
		a[a1] = strconv.Itoa(a1)
	}
}
12.56 seconds
18. Microbenchmarks – tune the map
T5 – the 2009 benchmark
for x := 0; x < 1000000; x++ {
	a := make(map[int]string, 50)
	for a1 := 0; a1 < 50; a1++ {
		a[a1] = strconv.Itoa(a1)
	}
}
7.80 seconds
19. Microbenchmarks – remove the Itoa
T5 – the 2009 benchmark
for x := 0; x < 1000000; x++ {
	a := make(map[int]string, 50)
	for a1 := 0; a1 < 50; a1++ {
		a[a1] = "123456"
	}
}
5.45 seconds
20. Microbenchmarks – singleton Map
T5 – the 2009 benchmark
a := make(map[int]string, 50)
for x := 0; x < 1000000; x++ {
	// a := make(map[int]string, 50)
	for a1 := 0; a1 < 50; a1++ {
		a[a1] = "123456"
	}
}
2.03 seconds! Finally better than Java!
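The four T5 variants from the slides above can be put side by side in one program. This is a sketch, not the original benchmark harness: the function names, the timeIt helper, and the reduced round count are assumptions introduced here, and each function returns the final map length only so the work is observable.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

const keys = 50 // entries written per round, as on the slides

// freshMap allocates an unsized map every round (the original 2009 code).
func freshMap(rounds int) int {
	last := 0
	for x := 0; x < rounds; x++ {
		a := make(map[int]string)
		for k := 0; k < keys; k++ {
			a[k] = strconv.Itoa(k)
		}
		last = len(a)
	}
	return last
}

// presizedMap passes a size hint to make, so the map does not regrow.
func presizedMap(rounds int) int {
	last := 0
	for x := 0; x < rounds; x++ {
		a := make(map[int]string, keys)
		for k := 0; k < keys; k++ {
			a[k] = strconv.Itoa(k)
		}
		last = len(a)
	}
	return last
}

// constValue also drops strconv.Itoa and stores a constant string.
func constValue(rounds int) int {
	last := 0
	for x := 0; x < rounds; x++ {
		a := make(map[int]string, keys)
		for k := 0; k < keys; k++ {
			a[k] = "123456"
		}
		last = len(a)
	}
	return last
}

// sharedMap reuses one map for all rounds, so nothing is left to collect.
func sharedMap(rounds int) int {
	a := make(map[int]string, keys)
	for x := 0; x < rounds; x++ {
		for k := 0; k < keys; k++ {
			a[k] = "123456"
		}
	}
	return len(a)
}

// timeIt is a hypothetical helper that prints wall-clock time per variant.
func timeIt(name string, rounds int, f func(int) int) {
	start := time.Now()
	n := f(rounds)
	fmt.Printf("%-12s %d rounds, %d keys: %v\n", name, rounds, n, time.Since(start))
}

func main() {
	const rounds = 100000 // assumption: the slides use 1,000,000
	timeIt("freshMap", rounds, freshMap)
	timeIt("presizedMap", rounds, presizedMap)
	timeIt("constValue", rounds, constValue)
	timeIt("sharedMap", rounds, sharedMap)
}
```

Note that each step removes work as well as garbage, so the last variant no longer measures the same thing as the Java code it is compared against.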
21. Microbenchmarks – Java
T5 – the 2009 benchmark
for (int x = 0; x < 1000000; x++) {
    HashMap<Integer, String> a = new HashMap<Integer, String>();
    for (int a1 = 0; a1 < 50; a1++) {
        a.put(a1, Integer.toString(a1));
    }
}
2.56 seconds
23. Next microbenchmarks !
Float, String
Go Channels vs Java Futures
… couldn’t code the Java part in time!
Simple TCP echo, but with transactions
Log processing
Ruby 2.1, Go 1.4…
Your votes ?
24. Profilers
pprof is pretty great!
Import it in all your mains; it does not seem to hurt
import _ "net/http/pprof"
Add the HTTP listener ( only on flag )
// launch http pprof listener if in profile mode
if *profileMode {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}
28. Profilers
Good old ‘oprofile’, let’s not forget it
( especially if you can get kernel symbols, hard )
sudo yum -y install oprofile
Start capturing
sudo opcontrol --reset
sudo opcontrol --no-vmlinux
sudo opcontrol --start
Run your program
sudo opcontrol --dump
sudo opcontrol --shutdown
Dump your result
sudo opreport -l --demangle=smart --debug-info
Cheat Sheet http://www.bonsai.com/wiki/howtos/tuning/oprofile/
30. Tuning the Aerospike Client
What does the client do?
Maintain the DHT state
Keep a connection pool
Make requests to the right servers
Box / unbox to wire protocol…
SIMPLE
31. Tuning the Aerospike Client
Attempt 1: run pprof
The usual dance of making life
easy for the garbage collector
(just like java)
pprof worked!
the hot objects showed up
Cache easily with Sized Channels!
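One common way to cache hot objects with a sized (buffered) channel is a small free list. The sketch below illustrates that general technique; the BufferPool type and its methods are hypothetical names chosen here, not the Aerospike client's actual code.

```go
package main

import "fmt"

// BufferPool keeps up to cap(free) byte buffers for reuse.
// The buffered channel is the whole data structure: a receive takes a
// cached buffer, a send returns one, and select/default keeps both
// operations from ever blocking.
type BufferPool struct {
	free chan []byte
	size int
}

// NewBufferPool caches at most capacity buffers of size bytes each.
func NewBufferPool(capacity, size int) *BufferPool {
	return &BufferPool{free: make(chan []byte, capacity), size: size}
}

// Get returns a cached buffer if one is available, else allocates.
func (p *BufferPool) Get() []byte {
	select {
	case b := <-p.free:
		return b
	default:
		return make([]byte, p.size)
	}
}

// Put hands a buffer back. If the pool is full the buffer is dropped
// and left for the garbage collector.
func (p *BufferPool) Put(b []byte) {
	if cap(b) < p.size {
		return // too small to reuse
	}
	select {
	case p.free <- b[:p.size]:
	default:
	}
}

// Cached reports how many buffers are currently waiting in the pool.
func (p *BufferPool) Cached() int {
	return len(p.free)
}

func main() {
	pool := NewBufferPool(128, 4096)

	buf := pool.Get()
	n := copy(buf, "request bytes")
	fmt.Printf("used %d of %d bytes\n", n, len(buf))

	pool.Put(buf)
	fmt.Printf("buffers cached: %d\n", pool.Cached())
}
```

Buffers come back with their old contents, so callers should overwrite what they use rather than assume zeroed memory.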
32. Tuning the Aerospike Client
Attempt 2: oprofile
oprofile found rand() taking time
Optimization gave nothing
… not sure why not …
Currently happy with throughput
33. Tuning the Aerospike Client
Latency problem at customer site!
User validating a server install with a quick Go client
“17 ms average latency @ 20K TPS” --- terrible!
Server measured at 0.4 ms @ 40k TPS,
-- ping ok
-- it’s the client
Where’s the latency source? GC? Green Threads? Network?
-- Profile shows low GC load
-- Hard to measure thread latency
EC2 m3.xlarge ($0.05/hr)
4 core E5-2670 @ 2.5 Ghz
Bare metal vs Virtual
CentOS 6 vs Latest Kernel
Intel SSDs vs RAM
35. What happened?
• Not sure what happened at deployment
(yet, suspect old kernel)
• A week lost by developers using MacOS, Laptop
(MacOS is showing bad latency)
• C code is running slower – we think it’s random fill of buffer
• Lesson: just switch to Linux 3.12-ish kernels
• Lesson: fewer lines ~ 11k Go, 17k Java
• Lesson: for network / IO, these languages are THE SAME