SlideShare une entreprise Scribd logo
1  sur  92
Télécharger pour lire hors ligne
Java Mixed-Mode
Flame Graphs
Brendan Gregg Senior Performance Architect
Oct	
  2015	
  
Understanding Java CPU usage
quickly and completely
Quickly
•  Via SSH and open source tools (covered in this talk)
•  Or using Netflix Vector GUI (also open source):
1.  Observe high CPU usage
2.  Generate a flame graph
Java Mixed-Mode Flame Graph
via Linux perf_events:
Completely
Java JVM
Kernel
GC
Messy House Fallacy
•  Don't overlook system code: kernel, libraries, etc.
Fallacy:	
  my	
  code	
  is	
  a	
  mess,	
  I	
  bet	
  yours	
  is	
  
immaculate,	
  therefore	
  the	
  bug	
  must	
  be	
  mine	
  
	
  
Reality:	
  everyone's	
  code	
  is	
  terrible	
  and	
  buggy	
  
Context
•  Over 60 million subscribers
–  Just launched in Spain!
•  AWS EC2 Linux cloud
•  FreeBSD CDN
•  Awesome place to work
Cloud
•  Tens of thousands of AWS EC2 instances
•  Mostly running Java applications (Oracle JVM)
Linux	
  (usually	
  Ubuntu)	
  
Java	
  (JDK	
  8)	
  
Tomcat	
  GC	
  and	
  
thread	
  
dump	
  
logging	
  
hystrix,	
  metrics	
  (Servo),	
  
health	
  check	
  
OpMonal	
  Apache,	
  
memcached,	
  Node.js,	
  
…	
  
Atlas,	
  S3	
  log	
  rotaMon,	
  
sar,	
  Trace,	
  perf,	
  stap,	
  
perf-­‐tools	
  
Vector,	
  pcp	
  
ApplicaMon	
  war	
  files,	
  
plaYorm,	
  base	
  servlet	
  
Why we need CPU profiling
•  Improving performance
–  Identify tuning targets
–  Incident response
–  Non-regression testing
–  Software evaluations
–  CPU workload
characterization
•  Cost savings
–  ASGs often scale on load
average (CPU), so CPU
usage is proportional to cost
Instance	
  
Instance	
  
Instance	
  
Scaling	
  Policy	
  
loadavg,	
  latency,	
  …	
  
	
  
CloudWatch,	
  Servo	
  
Auto	
  Scaling	
  
Group	
  
The Problem with Profilers
Java Profilers
Java
GC
Kernel,
libraries,
JVM
Java Profilers
•  Visibility
–  Java method execution
–  Object usage
–  GC logs
–  Custom Java context
•  Typical problems:
–  Sampling often happens at safety/yield points (skew)
–  Method tracing has massive observer effect
–  Misidentifies RUNNING as on-CPU (e.g., epoll)
–  Doesn't include or profile GC or JVM CPU time
–  Tree views not quick (proportional) to comprehend
•  Inaccurate (skewed) and incomplete profiles
System Profilers
Java Kernel
TCP/IP
GC
Idle
thread
Time
Locks epoll
JVM
System Profilers
•  Visibility
–  JVM (C++)
–  GC (C++)
–  libraries (C)
–  kernel (C)
•  Typical problems (x86):
–  Stacks missing for Java
–  Symbols missing for Java methods
•  Other architectures (e.g., SPARC) have fared better
•  Profile everything except Java
Workaround
•  Capture both Java and system profiles, and examine
side by side
•  An improvement, but Java context is often crucial for
interpreting system profiles
Java System
Java Mixed-Mode Flame Graph
Solution
Java JVM
Kernel
GC
Solution
•  Fix system profiling
–  Only way to see it all
•  Visibility is everything:
–  Java methods
–  JVM (C++)
–  GC (C++)
–  libraries (C)
–  kernel (C)
•  Minor Problems:
–  0-3% CPU overhead to enable frame pointers (usually <1%).
–  Symbol dumps can consume a burst of CPU
•  Complete and accurate (asynchronous) profiling
Java
JVM
Kernel
GC
Simple Production Example
1.  Poor performance,
and one CPU at 100%
2.  perf_events flame
graph shows JVM
stuck compiling
Another System Example
Exception handling consuming CPU
DEMO
FlameGraph_tomcat01.svg
Exonerating The System
•  From last week:
-  Frequent thread creation/
destruction assumed to be
consuming CPU resources.
Recode application?
-  A flame graph quantified this
CPU time: near zero
-  Time mostly other Java methods
Profiling GC
GC internals, visualized:
CPU Profiling
CPU Profiling
A
B
block interrupt
on-CPU off-CPU
A
B
A A
B
A
syscall
time
•  Record stacks at a timed interval: simple and effective
–  Pros: Low (deterministic) overhead
–  Cons: Coarse accuracy, but usually sufficient
stack
samples: A
Stack Traces
•  A code path snapshot. e.g., from jstack(1):
$ jstack 1819
[…]
"main" prio=10 tid=0x00007ff304009000
nid=0x7361 runnable [0x00007ff30d4f9000]
java.lang.Thread.State: RUNNABLE
at Func_abc.func_c(Func_abc.java:6)
at Func_abc.func_b(Func_abc.java:16)
at Func_abc.func_a(Func_abc.java:23)
at Func_abc.main(Func_abc.java:27)
running
parent
g.parent
g.g.paren
running
codepath
start
System Profilers
•  Linux
–  perf_events (aka "perf")
•  Oracle Solaris
–  DTrace
•  OS X
–  Instruments
•  Windows
–  XPerf
•  And many others…
Linux perf_events
•  Standard Linux profiler
–  Provides the perf command (multi-tool)
–  Usually pkg added by linux-tools-common, etc.
•  Features:
–  Timer-based sampling
–  Hardware events
–  Tracepoints
–  Dynamic tracing
•  Can sample stacks of (almost) everything on CPU
–  Can miss hard interrupt ISRs, but these should be near-zero. They can
be measured if needed (I wrote my own tools)
perf record Profiling
•  Stack profiling on all CPUs at 99 Hertz, then dump:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf script
[…]
bash 13204 cpu-clock:
459c4c dequote_string (/root/bash-4.3/bash)
465c80 glob_expand_word_list (/root/bash-4.3/bash)
466569 expand_word_list_internal (/root/bash-4.3/bash)
465a13 expand_words (/root/bash-4.3/bash)
43bbf7 execute_simple_command (/root/bash-4.3/bash)
435f16 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
43a771 execute_while_or_until (/root/bash-4.3/bash)
43a636 execute_while_command (/root/bash-4.3/bash)
436129 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
420cd5 reader_loop (/root/bash-4.3/bash)
41ea58 main (/root/bash-4.3/bash)
7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[… ~47,000 lines truncated …]
one
stack
sample
perf report Summary
•  Generates a call tree and combines samples:
# perf report -n -stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
call tree
summary
Flame Graphs
perf report Verbosity
•  Despite summarizing, output is still verbose
# perf report -n -stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
Full perf report Output
… as a Flame Graph
Flame Graphs
•  Flame Graphs:
–  x-axis: alphabetical stack sort, to maximize merging
–  y-axis: stack depth
–  color: random (default), or a dimension
•  Currently made from Perl + SVG + JavaScript
–  Multiple d3 versions are being developed
•  Easy to get working
–  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
–  Above commands are Linux; see URL for other OSes
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a –g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
Linux perf_events Workflow
perf stat perf record
perf report perf script
count events capture stacks
text UI dump profile
stackcollapse-perf.pl
flamegraph.pl
perf.data	
  
flame graph
visualization
perf list
list events
Typical
Workflow
Flame Graph Interpretation
a()
b() h()
c()
d()
e() f()
g()
i()
Flame Graph Interpretation (1/3)
Top edge shows who is running on-CPU,
and how much (width)
a()
b() h()
c()
d()
e() f()
g()
i()
Flame Graph Interpretation (2/3)
Top-down shows ancestry
e.g., from g():
h()
d()
e()
i()
a()
b()
c()
f()
g()
Flame Graph Interpretation (3/3)
a()
b() h()
c()
d()
e() f()
g()
i()
Widths are proportional to presence in samples
e.g., comparing b() to h() (incl. children)
Flame Graph Colors
•  Randomized by default
•  Can be used as a dimension. e.g.:
–  Mixed-mode flame graphs
–  Differential flame graphs
–  Search
Mixed-Mode Flame Graphs
•  Hues:
–  green == Java
–  red == system
–  yellow == C++
•  Intensity randomized
to differentiate frames
–  Or hashed based on
function name
Java JVM
Kernel
Mixed-Mode
Differential Flame Graphs
•  Hues:
–  red == more samples
–  blue == less samples
•  Intensity shows the
degree of difference
•  Used for comparing
two profiles
•  Also used for showing
other metrics: e.g., CPI
Differential
more less
Flame Graph Search
•  Color: magenta to show matched frames
search
button
Flame Charts
•  Flame charts: x-axis is time
•  Flame graphs: x-axis is population (maximize merging)
•  Final note: these are useful, but are not flame graphs
Stack Tracing
System Profiling Java on x86
•  For example,
using Linux perf
•  The stacks are
1 or 2 levels
deep, and have
junk values
# perf record –F 99 –a –g – sleep 30
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
[…]
… as a Flame Graph
Broken Java stacks
(missing frame pointer)
Why Stacks are Broken
•  On x86 (x86_64), hotspot uses
the frame pointer register (RBP)
as general purpose
•  This "compiler optimization"
breaks (simple) stack walking
•  Once upon a time, x86 had fewer
registers, and this made much more sense
•  gcc provides -fno-omit-frame-pointer to avoid
doing this, but the JVM had no such option…
Fixing Stack Walking
Possibilities:
A.  Fix frame pointer-based stack walking (the default)
–  Pros: simple, supported by many tools
–  Cons: might cost a little extra CPU
B.  Use a custom walker (likely needing kernel support)
–  Pros: full stack walking (incl. inlining) & arguments
–  Cons: custom kernel code, can cost more CPU when in use
C.  Try libunwind and DWARF
–  Even feasible with JIT?
Our current preference is (A)
Hacking OpenJDK (1/2)
•  As a proof of concept, I hacked hotspot to support an
x86_64 frame pointer
--- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 …
+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 …
@@ -166,10 +166,9 @@
// 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
//
-// Class for all pointer registers (including RSP)
+// Class for all pointer registers (including RSP, excluding RBP)
reg_class any_reg(RAX, RAX_H,
RDX, RDX_H,
- RBP, RBP_H,
RDI, RDI_H,
RSI, RSI_H,
RCX, RCX_H,
[...]
Remove RBP from
register pools
Hacking OpenJDK (2/2)
•  We used this patched version successfully for some limited
(and urgent) performance analysis
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
--- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
[...]
Fix x86_64 function
prologues
-XX:+PreserveFramePointer
•  We shared our patch publicly
–  See "A hotspot patch for stack profiling (frame pointer)" on the
hotspot complier dev mailing list
–  It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8,
and the -XX:+PreserveFramePointer option
•  Zoltán Majó (Oracle) took this on, rewrote it, and it is now:
–  In JDK 9
–  In JDK 8 update 60 build 19
–  Thanks to Zoltán, Oracle, and the other hotspot engineers for
helping get this done!
•  It might cost 0 – 3% CPU, depending on workload
Broken Java Stacks (before)
•  Check with "perf
script" to see stack
samples
•  These are 1 or 2
levels deep (junk
values)
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f417908c194 [unknown] (/tmp/perf-4458.map)
[…]
Fixed Java Stacks
•  With -XX:
+PreserveFramePointer
stacks are full, and
go all the way to
start_thread()
•  This is what the
CPUs are really
running: inlined
frames are not
present
# perf script
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
7fd30184def8 [unknown] (/tmp/perf-8131.map)
7fd30174f544 [unknown] (/tmp/perf-8131.map)
7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
7fd30166d51c [unknown] (/tmp/perf-8131.map)
7fd301750f34 [unknown] (/tmp/perf-8131.map)
7fd3016c2280 [unknown] (/tmp/perf-8131.map)
7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
7fd3016f9888 [unknown] (/tmp/perf-8131.map)
7fd3016ece04 [unknown] (/tmp/perf-8131.map)
7fd30177783c [unknown] (/tmp/perf-8131.map)
7fd301600aa8 [unknown] (/tmp/perf-8131.map)
7fd301a4484c [unknown] (/tmp/perf-8131.map)
7fd3010072e0 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd3010004e7 [unknown] (/tmp/perf-8131.map)
7fd3171df76a JavaCalls::call_helper(JavaValue*,…
7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
7fd31721b6ce thread_entry(JavaThread*, Thread*)…
7fd3175389e0 JavaThread::thread_main_inner() (/…
7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
Fixed Stacks Flame Graph
Java stacks
(but no symbols)
Stacks & Inlining
•  Frames may be missing (inlined)
•  Disabling inlining:
–  -XX:-Inline
–  Many more Java frames
–  Can be 80% slower!
•  May not be necessary
–  Inlined flame graphs often make
enough sense
–  Or tune -XX:MaxInlineSize and
-XX:InlineSmallCode a little to reveal more frames
•  Can even improve performance!
•  perf-map-agent (next) has experimental un-inline support
No inlining
Symbols
Missing Symbols
12.06% 62 sed sed [.] re_search_internal
|
--- re_search_internal
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main
71.79% 334 sed sed [.] 0x000000000001afc1
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
broken
not broken
•  Missing symbols may show up as hex; e.g., Linux perf:
Fixing Symbols
•  For JIT'd code, Linux perf already looks for an
externally provided symbol file: /tmp/perf-PID.map, and
warns if it doesn't exist
•  This file can be created by a Java agent
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
Java Symbols for perf
•  perf-map-agent
–  https://github.com/jrudolph/perf-map-agent
–  Agent attaches and writes the /tmp file on demand (previous
versions attached on Java start, wrote continually)
–  Thanks Johannes Rudolph!
•  Use of a /tmp symbol file
–  Pros: simple, can be low overhead (snapshot on demand)
–  Cons: stale symbols
•  Using a symbol logger with perf instead
–  Patch by Stephane Eranian currently being discussed on
lkml; see "perf: add support for profiling jitted code"
Java Mixed-Mode Flame Graph
Stacks & Symbols
Java JVM
Kernel
GC
Stacks & Symbols (zoom)
Instructions
Instructions
1.  Check Java version
2.  Install perf-map-agent
3.  Set -XX:+PreserveFramePointer
4.  Profile Java
5.  Dump symbols
6.  Generate Mixed-Mode Flame Graph
Note these are unsupported: use at your own risk
Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
1. Check Java Version
•  Need JDK8u60 or better
–  for -XX:+PreserveFramePointer
•  Upgrade Java if necessary
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
2. Install perf-map-agent
•  Check https://github.com/jrudolph/perf-map-agent for the
latest instructions. e.g.:
$ sudo bash
# apt-get install -y cmake
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# cd /usr/lib/jvm
# git clone --depth=1 https://github.com/jrudolph/perf-map-agent
# cd perf-map-agent
# cmake .
# make
3. Set -XX:+PreserveFramePointer
•  Needs to be set on Java startup
•  Check it is enabled (on Linux):
$ ps wwp `pgrep –n java` | grep PreserveFramePointer
4. Profile Java
•  Using Linux perf_events to profile all processes, at 99
Hertz, for 30 seconds (as root):
•  Just profile one PID (broken on some older kernels):
•  These create a perf.data file
# perf record -F 99 -a -g -- sleep 30
# perf record -F 99 -p PID -g -- sleep 30
5. Dump Symbols
•  See perf-map-agent docs for updated usage
•  e.g., as the same user as java:
•  perf-map-agent contains helper scripts. I wrote my own:
–  https://github.com/brendangregg/Misc/blob/master/java/jmaps
•  Dump symbols quickly after perf record to minimize stale
symbols. How I do it:
$ cd /usr/lib/jvm/perf-map-agent/out
$ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar 
net.virtualvoid.perf.AttachOnce PID
# perf record -F 99 -a -g -- sleep 30; jmaps
6. Generate a Mixed-Mode Flame Graph
•  Using my FlameGraph software:
–  perf script reads perf.data with /tmp/*.map
–  out.stacks01 is an intermediate file; can be handy to keep
•  Finally open flame01.svg in a browser
•  Check for newer flame graph implementations (e.g., d3)
# perf script > out.stacks01
# git clone --depth=1 https://github.com/brendangregg/FlameGraph
# cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | 
./FlameGraph/flamegraph.pl --color=java --hash > flame01.svg
Automation
Netflix Vector
Netflix Vector
Near real-time,
per-second metrics
Flame Graphs
Select
Metrics
Select Instance
Netflix Vector
•  Open source, on-demand, instance analysis tool
–  https://github.com/netflix/vector
•  Shows various real-time metrics
•  Flame graph support currently in development
–  Automating previous steps
–  Using it internally already
–  Also developing a new d3 front end
DEMO
d3-flame-graph
Advanced Analysis
Linux perf_events Coverage
… all possible with Java stacks
Advanced Flame Graphs
•  Examples:
–  Page faults
–  Context switches
–  Disk I/O requests
–  TCP events
–  CPU cache misses
–  CPI
•  Any event issued in synchronous Java context
Synchronous Java Context
•  Java thread still on-CPU, and event is directly triggered
•  Examples:
–  Disk I/O requests issued directly by Java à yes
•  direct reads, sync writes, page faults
–  Disk I/O completion interrupts à no*
–  Disk I/O requests triggered async, e.g., readahead à no*
* can be made yes by tracing and associating context
Page Faults
•  Show what triggered main memory (resident) to grow:
•  "fault" as (physical) main memory is allocated on-
demand, when a virtual page is first populated
•  Low overhead tool to solve some types of memory leak
# perf record -e page-faults -p PID -g -- sleep 120
RES column in top(1) grows
because
Page Fault Flame Graph
GC
Java code
epoll
Context Switches
•  Show why Java blocked and stopped running on-CPU:
•  Identifies locks, I/O, sleeps
–  If code path shouldn't block and looks random, it's an involuntary context switch. I
could filter these, but you should have solved them beforehand (CPU load).
•  e.g., was used to understand framework differences:
# perf record -e context-switches -p PID -g -- sleep 5
vs
rxNetty Tomcat
Context Switch Flame Graph (1/2)
rxNetty
epoll futex
Context Switch Flame Graph (2/2)
Tomcat sys_poll
futex
Disk I/O Requests
•  Shows who issued disk I/O (sync reads & writes):
•  e.g.: page faults in GC? This JVM has swapped out!:
# perf record -e block:block_rq_insert -a -g -- sleep 60
GC
TCP Events
•  TCP transmit, using dynamic tracing:
•  Note: can be high overhead for high packet rates
–  For the current perf trace, dump, post-process cycle
•  Can also trace TCP connect & accept (lower overhead)
•  TCP receive is async
–  Could trace via socket read
# perf probe tcp_sendmsg
# perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks
# perf probe --del tcp_sendmsg
TCP Send Flame Graph
kernel
Java
JVM
Only one code-path
taken in this example
ab (client process)
CPU Cache Misses
•  In this example, sampling via Last Level Cache loads:
•  -c is the count (samples
once per count)
•  Use other CPU counters to
sample hits, misses, stalls
# perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks
One Last Example
•  Back to a
mixed-mode
CPU flame graph
•  What else can we
show with color?
CPI Flame Graph
•  Cycles Per
Instruction!
–  red == instruction
heavy
–  blue == cycle
heavy (likely mem
stall cycles)
zoomed:
Links & References
•  Flame Graphs
–  http://www.brendangregg.com/flamegraphs.html
–  http://techblog.netflix.com/2015/07/java-in-flames.html
–  http://techblog.netflix.com/2014/11/nodejs-in-flames.html
–  http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
•  Linux perf_events
–  https://perf.wiki.kernel.org/index.php/Main_Page
–  http://www.brendangregg.com/perf.html
–  http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
•  Netflix Vector
–  https://github.com/netflix/vector
–  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
•  JDK tickets
–  JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465
–  JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945
•  hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
Thanks
•  Questions?
•  http://techblog.netflix.com
•  http://slideshare.net/brendangregg
•  http://www.brendangregg.com
•  bgregg@netflix.com
•  @brendangregg
Oct	
  2015	
  

Contenu connexe

Tendances

Memory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelMemory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelAdrian Huang
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
Jagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchJagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchlinuxlab_conf
 
DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit) DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit) ymtech
 
How Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar LeibovichHow Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar LeibovichDevOpsDays Tel Aviv
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device driversHoucheng Lin
 
Physical Memory Models.pdf
Physical Memory Models.pdfPhysical Memory Models.pdf
Physical Memory Models.pdfAdrian Huang
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at NetflixBrendan Gregg
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)shimosawa
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Brendan Gregg
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewRajKumar Rampelli
 
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...Adrian Huang
 
Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Linaro
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugginglibfetion
 
Reverse Mapping (rmap) in Linux Kernel
Reverse Mapping (rmap) in Linux KernelReverse Mapping (rmap) in Linux Kernel
Reverse Mapping (rmap) in Linux KernelAdrian Huang
 

Tendances (20)

Memory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelMemory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux Kernel
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
Jagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchJagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratch
 
DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit) DPDK (Data Plane Development Kit)
DPDK (Data Plane Development Kit)
 
How Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar LeibovichHow Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar Leibovich
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device drivers
 
Physical Memory Models.pdf
Physical Memory Models.pdfPhysical Memory Models.pdf
Physical Memory Models.pdf
 
Embedded linux network device driver development
Embedded linux network device driver developmentEmbedded linux network device driver development
Embedded linux network device driver development
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
 
淺談探索 Linux 系統設計之道
淺談探索 Linux 系統設計之道 淺談探索 Linux 系統設計之道
淺談探索 Linux 系統設計之道
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver Overview
 
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
 
Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
Introduction to Perf
Introduction to PerfIntroduction to Perf
Introduction to Perf
 
Reverse Mapping (rmap) in Linux Kernel
Reverse Mapping (rmap) in Linux KernelReverse Mapping (rmap) in Linux Kernel
Reverse Mapping (rmap) in Linux Kernel
 

Similaire à Java Mixed-Mode Flame Graphs Reveal CPU Usage

USENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsUSENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsBrendan Gregg
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomValeriy Kravchuk
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prodYunong Xiao
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in RustInfluxData
 
Systemtap
SystemtapSystemtap
SystemtapFeng Yu
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayPhil Estes
 
Driver Debugging Basics
Driver Debugging BasicsDriver Debugging Basics
Driver Debugging BasicsBala Subra
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin敬倫 林
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformWangda Tan
 
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan GreggKernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan GreggAnne Nicolas
 
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...HostedbyConfluent
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernellcplcp1
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...DynamicInfraDays
 
Virtual platform
Virtual platformVirtual platform
Virtual platformsean chen
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems PerformanceBrendan Gregg
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems PerformanceBrendan Gregg
 

Similaire à Java Mixed-Mode Flame Graphs Reveal CPU Usage (20)

USENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsUSENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame Graphs
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prod
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in Rust
 
Systemtap
SystemtapSystemtap
Systemtap
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
 
Driver Debugging Basics
Driver Debugging BasicsDriver Debugging Basics
Driver Debugging Basics
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning Platform
 
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan GreggKernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
 
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernel
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
 
Virtual platform
Virtual platformVirtual platform
Virtual platform
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems Performance
 

Plus de Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixBrendan Gregg
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
 

Plus de Brendan Gregg (20)

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 

Dernier

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Dernier (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Java Mixed-Mode Flame Graphs Reveal CPU Usage

  • 1. Java Mixed-Mode Flame Graphs Brendan Gregg Senior Performance Architect Oct  2015  
  • 2. Understanding Java CPU usage quickly and completely
  • 3. Quickly •  Via SSH and open source tools (covered in this talk) •  Or using Netflix Vector GUI (also open source): 1.  Observe high CPU usage 2.  Generate a flame graph
  • 4. Java Mixed-Mode Flame Graph via Linux perf_events: Completely Java JVM Kernel GC
  • 5. Messy House Fallacy •  Don't overlook system code: kernel, libraries, etc. Fallacy:  my  code  is  a  mess,  I  bet  yours  is   immaculate,  therefore  the  bug  must  be  mine     Reality:  everyone's  code  is  terrible  and  buggy  
  • 7. •  Over 60 million subscribers –  Just launched in Spain! •  AWS EC2 Linux cloud •  FreeBSD CDN •  Awesome place to work
  • 8. Cloud •  Tens of thousands of AWS EC2 instances •  Mostly running Java applications (Oracle JVM) Linux  (usually  Ubuntu)   Java  (JDK  8)   Tomcat  GC  and   thread   dump   logging   hystrix,  metrics  (Servo),   health  check   OpMonal  Apache,   memcached,  Node.js,   …   Atlas,  S3  log  rotaMon,   sar,  Trace,  perf,  stap,   perf-­‐tools   Vector,  pcp   ApplicaMon  war  files,   plaYorm,  base  servlet  
  • 9. Why we need CPU profiling •  Improving performance –  Identify tuning targets –  Incident response –  Non-regression testing –  Software evaluations –  CPU workload characterization •  Cost savings –  ASGs often scale on load average (CPU), so CPU usage is proportional to cost Instance   Instance   Instance   Scaling  Policy   loadavg,  latency,  …     CloudWatch,  Servo   Auto  Scaling   Group  
  • 10. The Problem with Profilers
  • 12. Java Profilers •  Visibility –  Java method execution –  Object usage –  GC logs –  Custom Java context •  Typical problems: –  Sampling often happens at safety/yield points (skew) –  Method tracing has massive observer effect –  Misidentifies RUNNING as on-CPU (e.g., epoll) –  Doesn't include or profile GC or JVM CPU time –  Tree views not quick (proportional) to comprehend •  Inaccurate (skewed) and incomplete profiles
  • 14. System Profilers •  Visibility –  JVM (C++) –  GC (C++) –  libraries (C) –  kernel (C) •  Typical problems (x86): –  Stacks missing for Java –  Symbols missing for Java methods •  Other architectures (e.g., SPARC) have fared better •  Profile everything except Java
  • 15. Workaround •  Capture both Java and system profiles, and examine side by side •  An improvement, but Java context is often crucial for interpreting system profiles Java System
  • 16. Java Mixed-Mode Flame Graph Solution Java JVM Kernel GC
  • 17. Solution •  Fix system profiling –  Only way to see it all •  Visibility is everything: –  Java methods –  JVM (C++) –  GC (C++) –  libraries (C) –  kernel (C) •  Minor Problems: –  0-3% CPU overhead to enable frame pointers (usually <1%). –  Symbol dumps can consume a burst of CPU •  Complete and accurate (asynchronous) profiling Java JVM Kernel GC
  • 18. Simple Production Example 1.  Poor performance, and one CPU at 100% 2.  perf_events flame graph shows JVM stuck compiling
  • 19. Another System Example Exception handling consuming CPU
  • 21. Exonerating The System •  From last week: -  Frequent thread creation/ destruction assumed to be consuming CPU resources. Recode application? -  A flame graph quantified this CPU time: near zero -  Time mostly other Java methods
  • 24. CPU Profiling A B block interrupt on-CPU off-CPU A B A A B A syscall time •  Record stacks at a timed interval: simple and effective –  Pros: Low (deterministic) overhead –  Cons: Coarse accuracy, but usually sufficient stack samples: A
  • 25. Stack Traces •  A code path snapshot. e.g., from jstack(1): $ jstack 1819 […] "main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000] java.lang.Thread.State: RUNNABLE at Func_abc.func_c(Func_abc.java:6) at Func_abc.func_b(Func_abc.java:16) at Func_abc.func_a(Func_abc.java:23) at Func_abc.main(Func_abc.java:27) running parent g.parent g.g.paren running codepath start
  • 26. System Profilers •  Linux –  perf_events (aka "perf") •  Oracle Solaris –  DTrace •  OS X –  Instruments •  Windows –  XPerf •  And many others…
  • 27. Linux perf_events •  Standard Linux profiler –  Provides the perf command (multi-tool) –  Usually pkg added by linux-tools-common, etc. •  Features: –  Timer-based sampling –  Hardware events –  Tracepoints –  Dynamic tracing •  Can sample stacks of (almost) everything on CPU –  Can miss hard interrupt ISRs, but these should be near-zero. They can be measured if needed (I wrote my own tools)
  • 28. perf record Profiling •  Stack profiling on all CPUs at 99 Hertz, then dump: # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf script […] bash 13204 cpu-clock: 459c4c dequote_string (/root/bash-4.3/bash) 465c80 glob_expand_word_list (/root/bash-4.3/bash) 466569 expand_word_list_internal (/root/bash-4.3/bash) 465a13 expand_words (/root/bash-4.3/bash) 43bbf7 execute_simple_command (/root/bash-4.3/bash) 435f16 execute_command_internal (/root/bash-4.3/bash) 435580 execute_command (/root/bash-4.3/bash) 43a771 execute_while_or_until (/root/bash-4.3/bash) 43a636 execute_while_command (/root/bash-4.3/bash) 436129 execute_command_internal (/root/bash-4.3/bash) 435580 execute_command (/root/bash-4.3/bash) 420cd5 reader_loop (/root/bash-4.3/bash) 41ea58 main (/root/bash-4.3/bash) 7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so) [… ~47,000 lines truncated …] one stack sample
  • 29. perf report Summary •  Generates a call tree and combines samples: # perf report -n -stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. # 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command [… ~13,000 lines truncated …] call tree summary
  • 31. perf report Verbosity •  Despite summarizing, output is still verbose # perf report -n -stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. # 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command [… ~13,000 lines truncated …]
  • 33. … as a Flame Graph
  • 34. Flame Graphs •  Flame Graphs: –  x-axis: alphabetical stack sort, to maximize merging –  y-axis: stack depth –  color: random (default), or a dimension •  Currently made from Perl + SVG + JavaScript –  Multiple d3 versions are being developed •  Easy to get working –  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html –  Above commands are Linux; see URL for other OSes git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a –g -- sleep 30 perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
  • 35. Linux perf_events Workflow perf stat perf record perf report perf script count events capture stacks text UI dump profile stackcollapse-perf.pl flamegraph.pl perf.data   flame graph visualization perf list list events Typical Workflow
  • 36. Flame Graph Interpretation a() b() h() c() d() e() f() g() i()
  • 37. Flame Graph Interpretation (1/3) Top edge shows who is running on-CPU, and how much (width) a() b() h() c() d() e() f() g() i()
  • 38. Flame Graph Interpretation (2/3) Top-down shows ancestry e.g., from g(): h() d() e() i() a() b() c() f() g()
  • 39. Flame Graph Interpretation (3/3) a() b() h() c() d() e() f() g() i() Widths are proportional to presence in samples e.g., comparing b() to h() (incl. children)
  • 40. Flame Graph Colors •  Randomized by default •  Can be used as a dimension. e.g.: –  Mixed-mode flame graphs –  Differential flame graphs –  Search
  • 41. Mixed-Mode Flame Graphs •  Hues: –  green == Java –  red == system –  yellow == C++ •  Intensity randomized to differentiate frames –  Or hashed based on function name Java JVM Kernel Mixed-Mode
  • 42. Differential Flame Graphs •  Hues: –  red == more samples –  blue == less samples •  Intensity shows the degree of difference •  Used for comparing two profiles •  Also used for showing other metrics: e.g., CPI Differential more less
  • 43. Flame Graph Search •  Color: magenta to show matched frames search button
  • 44. Flame Charts •  Flame charts: x-axis is time •  Flame graphs: x-axis is population (maximize merging) •  Final note: these are useful, but are not flame graphs
  • 46. System Profiling Java on x86 •  For example, using Linux perf •  The stacks are 1 or 2 levels deep, and have junk values # perf record –F 99 –a –g – sleep 30 # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) […]
  • 47. … as a Flame Graph Broken Java stacks (missing frame pointer)
  • 48. Why Stacks are Broken •  On x86 (x86_64), hotspot uses the frame pointer register (RBP) as general purpose •  This "compiler optimization" breaks (simple) stack walking •  Once upon a time, x86 had fewer registers, and this made much more sense •  gcc provides -fno-omit-frame-pointer to avoid doing this, but the JVM had no such option…
  • 49. Fixing Stack Walking Possibilities: A.  Fix frame pointer-based stack walking (the default) –  Pros: simple, supported by many tools –  Cons: might cost a little extra CPU B.  Use a custom walker (likely needing kernel support) –  Pros: full stack walking (incl. inlining) & arguments –  Cons: custom kernel code, can cost more CPU when in use C.  Try libunwind and DWARF –  Even feasible with JIT? Our current preference is (A)
  • 50. Hacking OpenJDK (1/2) •  As a proof of concept, I hacked hotspot to support an x86_64 frame pointer --- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 … +++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 … @@ -166,10 +166,9 @@ // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ ) // -// Class for all pointer registers (including RSP) +// Class for all pointer registers (including RSP, excluding RBP) reg_class any_reg(RAX, RAX_H, RDX, RDX_H, - RBP, RBP_H, RDI, RDI_H, RSI, RSI_H, RCX, RCX_H, [...] Remove RBP from register pools
  • 51. Hacking OpenJDK (2/2) •  We used this patched version successfully for some limited (and urgent) performance analysis --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04… +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 … @@ -5236,6 +5236,7 @@ // We always push rbp, so that on return to interpreter rbp, will be // restored correctly and we can correct the stack. push(rbp); + mov(rbp, rsp); // Remove word for ebp framesize -= wordSize; --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … [...] Fix x86_64 function prologues
  • 52. -XX:+PreserveFramePointer •  We shared our patch publicly –  See "A hotspot patch for stack profiling (frame pointer)" on the hotspot complier dev mailing list –  It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8, and the -XX:+PreserveFramePointer option •  Zoltán Majó (Oracle) took this on, rewrote it, and it is now: –  In JDK 9 –  In JDK 8 update 60 build 19 –  Thanks to Zoltán, Oracle, and the other hotspot engineers for helping get this done! •  It might cost 0 – 3% CPU, depending on workload
  • 53. Broken Java Stacks (before) •  Check with "perf script" to see stack samples •  These are 1 or 2 levels deep (junk values) # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f417908c194 [unknown] (/tmp/perf-4458.map) […]
  • 54. Fixed Java Stacks •  With -XX: +PreserveFramePointer stacks are full, and go all the way to start_thread() •  This is what the CPUs are really running: inlined frames are not present # perf script […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,… 7fd3171dce44 JavaCalls::call_virtual(JavaValue*… 7fd3171dd43a JavaCalls::call_virtual(JavaValue*… 7fd31721b6ce thread_entry(JavaThread*, Thread*)… 7fd3175389e0 JavaThread::thread_main_inner() (/… 7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf… 7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/… 7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
  • 55. Fixed Stacks Flame Graph Java stacks (but no symbols)
  • 56. Stacks & Inlining •  Frames may be missing (inlined) •  Disabling inlining: –  -XX:-Inline –  Many more Java frames –  Can be 80% slower! •  May not be necessary –  Inlined flame graphs often make enough sense –  Or tune -XX:MaxInlineSize and -XX:InlineSmallCode a little to reveal more frames •  Can even improve performance! •  perf-map-agent (next) has experimental un-inline support No inlining
  • 58. Missing Symbols 12.06% 62 sed sed [.] re_search_internal | --- re_search_internal | |--96.78%-- re_search_stub | rpl_re_search | match_regex | do_subst | execute_program | process_files | main | __libc_start_main 71.79% 334 sed sed [.] 0x000000000001afc1 | |--11.65%-- 0x40a447 | 0x40659a | 0x408dd8 | 0x408ed1 | 0x402689 | 0x7fa1cd08aec5 broken not broken •  Missing symbols may show up as hex; e.g., Linux perf:
  • 59. Fixing Symbols •  For JIT'd code, Linux perf already looks for an externally provided symbol file: /tmp/perf-PID.map, and warns if it doesn't exist •  This file can be created by a Java agent # perf script Failed to open /tmp/perf-8131.map, continuing without symbols […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) […]
  • 60. Java Symbols for perf •  perf-map-agent –  https://github.com/jrudolph/perf-map-agent –  Agent attaches and writes the /tmp file on demand (previous versions attached on Java start, wrote continually) –  Thanks Johannes Rudolph! •  Use of a /tmp symbol file –  Pros: simple, can be low overhead (snapshot on demand) –  Cons: stale symbols •  Using a symbol logger with perf instead –  Patch by Stephane Eranian currently being discussed on lkml; see "perf: add support for profiling jitted code"
  • 61. Java Mixed-Mode Flame Graph Stacks & Symbols Java JVM Kernel GC
  • 64. Instructions 1.  Check Java version 2.  Install perf-map-agent 3.  Set -XX:+PreserveFramePointer 4.  Profile Java 5.  Dump symbols 6.  Generate Mixed-Mode Flame Graph Note these are unsupported: use at your own risk Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
  • 65. 1. Check Java Version •  Need JDK8u60 or better –  for -XX:+PreserveFramePointer •  Upgrade Java if necessary $ java -version java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
  • 66. 2. Install perf-map-agent •  Check https://github.com/jrudolph/perf-map-agent for the latest instructions. e.g.: $ sudo bash # apt-get install -y cmake # export JAVA_HOME=/usr/lib/jvm/java-8-oracle # cd /usr/lib/jvm # git clone --depth=1 https://github.com/jrudolph/perf-map-agent # cd perf-map-agent # cmake . # make
  • 67. 3. Set -XX:+PreserveFramePointer •  Needs to be set on Java startup •  Check it is enabled (on Linux): $ ps wwp `pgrep –n java` | grep PreserveFramePointer
  • 68. 4. Profile Java •  Using Linux perf_events to profile all processes, at 99 Hertz, for 30 seconds (as root): •  Just profile one PID (broken on some older kernels): •  These create a perf.data file # perf record -F 99 -a -g -- sleep 30 # perf record -F 99 -p PID -g -- sleep 30
  • 69. 5. Dump Symbols •  See perf-map-agent docs for updated usage •  e.g., as the same user as java: •  perf-map-agent contains helper scripts. I wrote my own: –  https://github.com/brendangregg/Misc/blob/master/java/jmaps •  Dump symbols quickly after perf record to minimize stale symbols. How I do it: $ cd /usr/lib/jvm/perf-map-agent/out $ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce PID # perf record -F 99 -a -g -- sleep 30; jmaps
  • 70. 6. Generate a Mixed-Mode Flame Graph •  Using my FlameGraph software: –  perf script reads perf.data with /tmp/*.map –  out.stacks01 is an intermediate file; can be handy to keep •  Finally open flame01.svg in a browser •  Check for newer flame graph implementations (e.g., d3) # perf script > out.stacks01 # git clone --depth=1 https://github.com/brendangregg/FlameGraph # cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl --color=java --hash > flame01.svg
  • 73. Netflix Vector Near real-time, per-second metrics Flame Graphs Select Metrics Select Instance
  • 74. Netflix Vector •  Open source, on-demand, instance analysis tool –  https://github.com/netflix/vector •  Shows various real-time metrics •  Flame graph support currently in development –  Automating previous steps –  Using it internally already –  Also developing a new d3 front end
  • 77. Linux perf_events Coverage … all possible with Java stacks
  • 78. Advanced Flame Graphs •  Examples: –  Page faults –  Context switches –  Disk I/O requests –  TCP events –  CPU cache misses –  CPI •  Any event issued in synchronous Java context
  • 79. Synchronous Java Context •  Java thread still on-CPU, and event is directly triggered •  Examples: –  Disk I/O requests issued directly by Java à yes •  direct reads, sync writes, page faults –  Disk I/O completion interrupts à no* –  Disk I/O requests triggered async, e.g., readahead à no* * can be made yes by tracing and associating context
  • 80. Page Faults •  Show what triggered main memory (resident) to grow: •  "fault" as (physical) main memory is allocated on- demand, when a virtual page is first populated •  Low overhead tool to solve some types of memory leak # perf record -e page-faults -p PID -g -- sleep 120 RES column in top(1) grows because
  • 81. Page Fault Flame Graph GC Java code epoll
  • 82. Context Switches •  Show why Java blocked and stopped running on-CPU: •  Identifies locks, I/O, sleeps –  If code path shouldn't block and looks random, it's an involuntary context switch. I could filter these, but you should have solved them beforehand (CPU load). •  e.g., was used to understand framework differences: # perf record -e context-switches -p PID -g -- sleep 5 vs rxNetty Tomcat
  • 83. Context Switch Flame Graph (1/2) rxNetty epoll futex
  • 84. Context Switch Flame Graph (2/2) Tomcat sys_poll futex
  • 85. Disk I/O Requests •  Shows who issued disk I/O (sync reads & writes): •  e.g.: page faults in GC? This JVM has swapped out!: # perf record -e block:block_rq_insert -a -g -- sleep 60 GC
  • 86. TCP Events •  TCP transmit, using dynamic tracing: •  Note: can be high overhead for high packet rates –  For the current perf trace, dump, post-process cycle •  Can also trace TCP connect & accept (lower overhead) •  TCP receive is async –  Could trace via socket read # perf probe tcp_sendmsg # perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks # perf probe --del tcp_sendmsg
  • 87. TCP Send Flame Graph kernel Java JVM Only one code-path taken in this example ab (client process)
  • 88. CPU Cache Misses •  In this example, sampling via Last Level Cache loads: •  -c is the count (samples once per count) •  Use other CPU counters to sample hits, misses, stalls # perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks
  • 89. One Last Example •  Back to a mixed-mode CPU flame graph •  What else can we show with color?
  • 90. CPI Flame Graph •  Cycles Per Instruction! –  red == instruction heavy –  blue == cycle heavy (likely mem stall cycles) zoomed:
  • 91. Links & References •  Flame Graphs –  http://www.brendangregg.com/flamegraphs.html –  http://techblog.netflix.com/2015/07/java-in-flames.html –  http://techblog.netflix.com/2014/11/nodejs-in-flames.html –  http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html •  Linux perf_events –  https://perf.wiki.kernel.org/index.php/Main_Page –  http://www.brendangregg.com/perf.html –  http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html •  Netflix Vector –  https://github.com/netflix/vector –  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html •  JDK tickets –  JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465 –  JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945 •  hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
  • 92. Thanks •  Questions? •  http://techblog.netflix.com •  http://slideshare.net/brendangregg •  http://www.brendangregg.com •  bgregg@netflix.com •  @brendangregg Oct  2015