Brendan Gregg, Senior Performance Architect
Designing Tracing Tools
Wielding Superpowers
I'm currently developing more tracing tools (bcc/BPF)
Tool Design
• For tool developers
• For everyone else: what you can ask for
– Tool templates
– GUI visualizations
• The following is applicable to all tracers
– sysdig, bcc/BPF, DTrace, SystemTap, LTTng, …
Methodologies
Methodology-driven Design
• Methodologies provide ideas for purposeful tools
• Find/draw a functional diagram, apply methods
See: http://www.brendangregg.com/methodology.html
Operating Systems
Methodology Examples
Eg, at the syscall layer (well known & documented):
• Workload Characterization
– exec() or open() per-event trace (execsnoop, opensnoop)
– connect()/accept() per-event trace (tcpconnect, tcpaccept)
– read()/write() size histogram (one-liners)
• Latency Analysis
– read()/write() latency histogram (biolatency, …)
• USE Method
– network utilization by thread (not done yet)
– syscall errors (fserrors, soerrors)
CLI Tracing Tools
CLI Templates
1. Per event output
– *snoop, *slower 0, …
2. Filtered event output
– *slower
3. Interval summary
– *stat, *top
4. Count summary
– *count
5. Histogram summary
– *dist, *latency
6. Heatmap summary
– spectrogram.lua, subsecoffset.lua, …
Template 1: Per Event Output
Examples: *snoop, *slower 0, …
# opensnoop
PID COMM FD ERR PATH
10085 sshd 3 0 /lib/x86_64-linux-gnu/libkeyutils.so.1
10085 sshd 3 0 /lib/x86_64-linux-gnu/libresolv.so.2
10085 sshd 3 0 /lib/x86_64-linux-gnu/libgpg-error.so.0
10085 sshd 3 0 /dev/urandom
10085 sshd -1 2 /lib/x86_64-linux-gnu/.libcrypto.so.1.0.0.hmac
10085 sshd -1 2 /proc/sys/crypto/fips_enabled
10085 sshd 3 0 /proc/filesystems
10085 sshd 3 0 /dev/null
10085 sshd 3 0 /proc/10085/fd
10085 sshd 3 0 /usr/lib/ssl/openssl.cnf
10085 sshd 3 0 /etc/gai.conf
10085 sshd 3 0 /etc/nsswitch.conf
10085 sshd 3 0 /etc/ld.so.cache
10085 sshd 3 0 /lib/x86_64-linux-gnu/libnss_compat.so.2
10085 sshd 3 0 /etc/ld.so.cache
10085 sshd 3 0 /lib/x86_64-linux-gnu/libnss_nis.so.2
[…]
Template 2: Filtered Event Output
Examples: *slower
Tools like this can also do all event output:
# sysdig -c fileslower 1
TIME PROCESS TYPE LAT(ms) FILE
2014-04-13 20:40:43.973 cksum read 2 /mnt/partial.0.0
2014-04-13 20:40:44.187 cksum read 1 /mnt/partial.0.0
2014-04-13 20:40:44.689 cksum read 2 /mnt/partial.0.0
2014-04-13 20:40:45.005 cksum read 2 /mnt/partial.0.0
2014-04-13 20:40:45.193 cksum read 1 /mnt/partial.0.0
[…]
# sysdig -c fileslower 0
TIME PROCESS TYPE LAT(ms) FILE
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/librt.so.1
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libacl.so.1
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libc.so.6
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libdl.so.2
2014-04-13 20:59:04.414 ls read 0 /lib/x86_64-linux-gnu/libattr.so.1
2014-04-13 20:59:04.415 ls read 0 /proc/filesystems
2014-04-13 20:59:04.415 ls read 0 /proc/filesystems
[...]
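The threshold behaviour shown above (a minimum latency of 0 prints every event) can be sketched in a few lines of Python. This is an illustration only, not the fileslower chisel: the `Event` record, the `slower()` helper, and the sample events are all hypothetical, and the `>=` comparison is an assumption.

```python
from typing import Iterable, Iterator, NamedTuple

class Event(NamedTuple):
    # Hypothetical event record, loosely modeled on the fileslower columns.
    process: str
    type: str
    lat_ms: float
    file: str

def slower(events: Iterable[Event], min_ms: float) -> Iterator[Event]:
    """Yield events whose latency is at least min_ms (0 passes everything)."""
    for ev in events:
        if ev.lat_ms >= min_ms:
            yield ev

# Made-up sample data for illustration only.
SAMPLE = [
    Event("cksum", "read", 2.0, "/mnt/partial.0.0"),
    Event("ls", "read", 0.0, "/proc/filesystems"),
    Event("cksum", "read", 1.0, "/mnt/partial.0.0"),
]

if __name__ == "__main__":
    for ev in slower(SAMPLE, 1):
        print(f"{ev.process:<8} {ev.type:<5} {ev.lat_ms:>7.0f} {ev.file}")
```

One filter function covers both templates: per-event output is the same tool with the threshold set to 0.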
Template 3: Interval Summary
Examples: *stat, *top
# dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:11:47: 2059 141 97 95.29
08:11:48: 79974 151 106 99.87
08:11:49: 192874 146 102 99.95
08:11:50: 2051 144 100 95.12
08:11:51: 73373 17239 17194 76.57
08:11:52: 54685 25431 25387 53.58
08:11:53: 18127 8182 8137 55.12
08:11:54: 22517 10345 10301 54.25
08:11:55: 7524 2881 2836 62.31
08:11:56: 2067 141 97 95.31
08:11:57: 2115 145 101 95.22
[…]
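An interval summary is commonly built by sampling cumulative counters once per interval and printing the deltas. A minimal sketch, assuming hypothetical cumulative (refs, slow, miss) counters; the hit ratio is taken as (refs - miss) / refs, which agrees with the HIT% column in the dcstat output above (2059 references and 97 misses give 95.29).

```python
def interval_row(prev, curr, interval_s=1.0):
    """Turn two cumulative (refs, slow, miss) samples into per-second rates.

    prev and curr are hypothetical counter snapshots taken interval_s apart.
    Returns (refs/s, slow/s, miss/s, hit%).
    """
    refs, slow, miss = (c - p for c, p in zip(curr, prev))
    hit_pct = 100.0 * (refs - miss) / refs if refs else 0.0
    return (refs / interval_s, slow / interval_s, miss / interval_s,
            round(hit_pct, 2))

if __name__ == "__main__":
    print("%10s %10s %10s %7s" % ("REFS/s", "SLOW/s", "MISS/s", "HIT%"))
    print("%10d %10d %10d %7.2f" % interval_row((0, 0, 0), (2059, 141, 97)))
```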
Template 4: Count Summary
Examples: *count
# funccount 'vfs_*'
Tracing... Ctrl-C to end.
^C
ADDR FUNC COUNT
ffffffff811efe81 vfs_create 1
ffffffff811f24a1 vfs_rename 1
ffffffff81215191 vfs_fsync_range 2
ffffffff81231df1 vfs_lock_file 30
ffffffff811e8dd1 vfs_fstatat 152
ffffffff811e8d71 vfs_fstat 154
ffffffff811e4381 vfs_write 166
ffffffff811e8c71 vfs_getattr_nosec 262
ffffffff811e8d41 vfs_getattr 262
ffffffff811e3221 vfs_open 264
ffffffff811e4251 vfs_read 470
Detaching...
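The count template reduces an event stream to one number per key, printed on exit. A user-space sketch, assuming a made-up stream of function names; real tracers usually keep such counts in kernel context and copy only the totals out, which this example does not attempt to show.

```python
from collections import Counter

def count_summary(events):
    """Count events by name; return (name, count) pairs, ascending by count."""
    counts = Counter(events)
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))

if __name__ == "__main__":
    # Made-up stream of function names standing in for traced calls.
    stream = ["vfs_read"] * 3 + ["vfs_write"] * 2 + ["vfs_open"]
    print("%-24s %8s" % ("FUNC", "COUNT"))
    for name, n in count_summary(stream):
        print("%-24s %8d" % (name, n))
```

Sorting in ascending order puts the most frequent entries last, where they stay visible at the bottom of a terminal.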
Template 5: Histogram Summary
Examples: *dist, *latency
# biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
usecs : count distribution
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 12 |******** |
256 -> 511 : 15 |********** |
512 -> 1023 : 43 |******************************* |
1024 -> 2047 : 52 |**************************************|
2048 -> 4095 : 47 |********************************** |
4096 -> 8191 : 52 |**************************************|
8192 -> 16383 : 36 |************************** |
16384 -> 32767 : 15 |********** |
32768 -> 65535 : 2 |* |
65536 -> 131071 : 2 |* |
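The power-of-two bucketing seen in this output can be sketched as follows. This is an assumption-laden illustration, not the biolatency implementation: `log2_hist()` and `format_hist()` are hypothetical helpers, and the first bucket is taken to cover 0 and 1.

```python
def log2_hist(values):
    """Bucket non-negative integers into power-of-two ranges.

    Returns a list of counts where index 0 covers 0..1 and index i >= 1
    covers 2**i .. 2**(i+1) - 1.
    """
    buckets = []
    for v in values:
        idx = max(int(v), 1).bit_length() - 1
        if idx >= len(buckets):
            buckets.extend([0] * (idx + 1 - len(buckets)))
        buckets[idx] += 1
    return buckets

def format_hist(buckets, unit="usecs", width=38):
    """Render buckets as text rows with a bar scaled to the largest count."""
    peak = max(buckets, default=0)
    rows = ["%23s : %-8s %s" % (unit, "count", "distribution")]
    for i, n in enumerate(buckets):
        lo = 0 if i == 0 else 1 << i
        hi = (1 << (i + 1)) - 1
        bar = "*" * (n * width // peak if peak else 0)
        rows.append("%10d -> %-10d : %-8d |%-*s|" % (lo, hi, n, width, bar))
    return rows

if __name__ == "__main__":
    # Made-up latencies in microseconds.
    print("\n".join(format_hist(log2_hist([5, 130, 300, 300, 900, 1500, 1500]))))
```

Power-of-two buckets keep the row count small even when latencies span several orders of magnitude.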
Template 6: Heatmap Summary
Example: subsecoffset.lua (aka "spectrogram")
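A heat map needs two-dimensional bins with a count per cell. The sketch below assumes the usual sub-second-offset layout (one column per second, rows for the offset within that second, color by count); `subsec_offset_bins()` is a hypothetical helper and is not taken from the sysdig chisel.

```python
from collections import Counter

def subsec_offset_bins(timestamps, rows=10):
    """Count events per (second, sub-second offset bucket).

    timestamps are event times in seconds (floats). Each second becomes a
    column and the fractional part is split into `rows` equal buckets.
    """
    bins = Counter()
    for t in timestamps:
        second = int(t // 1)
        offset = t - second
        bucket = min(int(offset * rows), rows - 1)
        bins[(second, bucket)] += 1
    return bins

if __name__ == "__main__":
    # Made-up event times; a renderer would map each count to a color.
    for cell, n in sorted(subsec_offset_bins([0.0, 0.25, 0.25, 0.75, 1.5], 4).items()):
        print(cell, n)
```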
Valuable
Know what already exists, and what doesn't
Low Overhead (or documented)
• Understand tracing internals
– For example, sysdig's design has ~20x lower overhead than strace
(it still has overhead: test and measure to see if it's acceptable)
– Tracing overhead is usually relative to event rate
• Design for low overhead, and document expectations
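Because overhead scales with event rate, a first-order estimate is simply rate multiplied by per-event cost. The sketch below is a back-of-the-envelope model only; the per-event cost is a placeholder that has to come from measuring the tracer in question, and the numbers in the example are not measurements.

```python
def estimated_overhead_pct(events_per_sec, cost_us_per_event, cpus=1):
    """Estimate tracing CPU overhead as a percentage of available CPU time.

    cost_us_per_event is a hypothetical, separately measured cost of
    tracing one event, in microseconds.
    """
    busy_us = events_per_sec * cost_us_per_event
    return 100.0 * busy_us / (1_000_000 * cpus)

if __name__ == "__main__":
    # Placeholder numbers: 100,000 events/s at 1 us each on one CPU.
    print("%.1f%%" % estimated_overhead_pct(100_000, 1.0))
```

An estimate like this helps decide which events are safe to trace by default; it does not replace testing under a real workload.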
[Figure: sysdig architecture: kernel syscalls, sysdig driver, ring buffer, sysdig lua program; steps: 1. enable, 2. async read, 3. output]
Documentation
• Good tools have 3 docs:
1. Code comments
2. Man page
3. Examples file
• Man page
– troff, docbook, …
– Common man macros (see groff_man(7)):
.TH Title heading
.SH Section heading
.IP Indented paragraph
.TP Indented paragraph with label
.B Bold
• Examples file:
Demonstrations of biosnoop, the Linux eBPF/bcc version.
biosnoop traces block device I/O (disk I/O), and prints a line of output
per I/O. Example:
# ./biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000004001 supervise 1950 xvda1 W 13092560 4096 0.74
[...]
Concise, intuitive, self-explanatory
• Useful startup message
– What I'm tracing, when there's output, when I'll end
• Vigorous tooling is concise
– No wasted text; leave less useful output for non-default options
• Unix philosophy: do one thing and do it well
# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 4381 |######################################|
1 -> 2 : 9 |# |
2 -> 4 : 5 |# |
4 -> 8 : 0 | |
8 -> 16 : 1 |# |
[…]
POSIX-style Arguments
# ./biolatency -h
usage: biolatency [-h] [-T] [-Q] [-m] [-D] [interval] [count]
Summarize block device I/O latency as a histogram
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-T, --timestamp include timestamp on output
-Q, --queued include OS queued time in I/O time
-m, --milliseconds millisecond histogram
-D, --disks print a histogram per disk device
examples:
./biolatency # summarize block I/O latency as a histogram
./biolatency 1 10 # print 1 second summaries, 10 times
./biolatency -mT 1 # 1s summaries, milliseconds, and timestamps
./biolatency -Q # include OS queued time in I/O time
./biolatency -D # show each disk device separately
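A usage message of this shape can be produced with Python's standard argparse module. The sketch below mirrors the options listed above; it is not the tool's actual source, and `build_parser()` is a name chosen for illustration.

```python
import argparse

EXAMPLES = """examples:
    ./biolatency            # summarize block I/O latency as a histogram
    ./biolatency 1 10       # print 1 second summaries, 10 times
    ./biolatency -mT 1      # 1s summaries, milliseconds, and timestamps
"""

def build_parser():
    """Build a parser with short and long flags plus [interval] [count]."""
    parser = argparse.ArgumentParser(
        prog="biolatency",
        description="Summarize block device I/O latency as a histogram",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=EXAMPLES)
    parser.add_argument("-T", "--timestamp", action="store_true",
                        help="include timestamp on output")
    parser.add_argument("-Q", "--queued", action="store_true",
                        help="include OS queued time in I/O time")
    parser.add_argument("-m", "--milliseconds", action="store_true",
                        help="millisecond histogram")
    parser.add_argument("-D", "--disks", action="store_true",
                        help="print a histogram per disk device")
    # Optional positionals, in the style of vmstat and iostat.
    parser.add_argument("interval", nargs="?", type=int, default=None,
                        help="output interval, in seconds")
    parser.add_argument("count", nargs="?", type=int, default=None,
                        help="number of outputs")
    return parser

if __name__ == "__main__":
    print(build_parser().parse_args(["-mT", "1", "10"]))
```

RawDescriptionHelpFormatter keeps the examples block formatted as written, so the help output can double as a short tutorial.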
Option              Alternate              Expectation
-a                  --all                  all events
-c CMD              --cmd …                run this command
-d SECONDS          --duration …           duration of tool execution
-h                  --help                 help
-i FILE             --input …              input file
-i SECONDS          --interval …           summary interval
-n name             --name …               this process name only
-o FILE             --output …             output file
-p PID              --pid …                this process ID only
-P                  --by-process           per-process ID breakdown
-P PORT             --port …               this TCP port only
-t or -T            --[no]timestamp        include or exclude timestamps
-v                  --verbose              verbose output
-x                  --extended, --errors   extended output, or only failures
[interval [count]]  -                      summary interval, and # of outputs
Testing, Testing, Testing
• If you can't write the workload, you can't write the tool
– Be it 10 lines of C, some shell, or dd
– dd if=/dev/urandom of=/dev/null bs=1k count=23
• Known workload testing: create 23 events
• Testing can be time consuming
– I spend more time testing a tool than writing it
– Automatic tool testing is a difficult problem
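A known workload can also be written in a few lines of Python. The sketch below issues an exact number of write() calls to the null device so that a tool's event count and byte total can be checked against it. It covers only the write side of the dd example, and `known_workload()` is a hypothetical helper.

```python
import os

def known_workload(count=23, block_size=1024):
    """Issue exactly `count` write() calls of `block_size` bytes to the null device.

    Returns (calls, bytes_written) so a test can compare these numbers
    against what a tracing tool reported for this process.
    """
    payload = b"\0" * block_size
    calls = total = 0
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        for _ in range(count):
            total += os.write(fd, payload)
            calls += 1
    finally:
        os.close(fd)
    return calls, total

if __name__ == "__main__":
    print("pid %d issued %d writes, %d bytes" % ((os.getpid(),) + known_workload()))
```

An unusual count such as 23 is easy to recognize in tool output and is unlikely to appear by coincidence.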
Example: gethostlatency
# gethostlatency
TIME PID COMM LATms HOST
06:10:24 28011 wget 90.00 www.iovisor.org
06:10:28 28127 wget 0.00 www.iovisor.org
06:10:41 28404 wget 9.00 www.netflix.com
06:10:48 28544 curl 35.00 www.netflix.com.au
06:11:10 29054 curl 31.00 www.plumgrid.com
06:11:16 29195 curl 3.00 www.facebook.com
06:11:25 29404 curl 72.00 foo
06:11:28 29475 curl 1.00 foo
Example: ext4slower
# ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:49:17 bash 3616 R 128 0 7.75 cksum
06:49:17 cksum 3616 R 39552 0 1.34 [
06:49:17 cksum 3616 R 96 0 5.36 2to3-2.7
06:49:17 cksum 3616 R 96 0 14.94 2to3-3.4
06:49:17 cksum 3616 R 10320 0 6.82 411toppm
06:49:17 cksum 3616 R 65536 0 4.01 a2p
06:49:17 cksum 3616 R 55400 0 8.77 ab
06:49:17 cksum 3616 R 36792 0 16.34 aclocal-1.14
06:49:17 cksum 3616 R 15008 0 19.31 acpi_listen
06:49:17 cksum 3616 R 6123 0 17.23 add-apt-repository
06:49:17 cksum 3616 R 6280 0 18.40 addpart
06:49:17 cksum 3616 R 27696 0 2.16 addr2line
06:49:17 cksum 3616 R 58080 0 10.11 ag
06:49:17 cksum 3616 R 906 0 6.30 ec2-meta-data
[…]
Example: biolatency
# biolatency -m 1 5
Tracing block device I/O... Hit Ctrl-C to end.
msecs : count distribution
0 -> 1 : 36 |**************************************|
2 -> 3 : 1 |* |
4 -> 7 : 3 |*** |
8 -> 15 : 17 |***************** |
16 -> 31 : 33 |********************************** |
32 -> 63 : 7 |******* |
64 -> 127 : 6 |****** |
[…]
GUI Tracing Tools
GUI Visualizations
1. Event logs
2. Tables
3. Line graphs
4. Histograms
5. Heatmaps (spectrographs)
6. Waterfall charts
7. Directed graphs
8. Flame graphs
9. Flame charts
Visualization 1: Event Logs
https://commons.wikimedia.org/wiki/File:Wireshark_screenshot.png
Visualization 2: Tables
Visualization 3: Line Graphs
http://www.paradyn.org/html/screen-shots.html
Visualization 4: Histograms
Or a density plot
Or as a frequency trail (can cascade)
Visualization 5: Heat Maps
eg, Oracle ZFS Storage Appliance Analytics (DTrace-based)
Visualization 5: Spectrograms
Visualization 6: Waterfall Charts
Visualization 7: Directed Graphs
Visualization 8: Flame Graphs
Commonly used with CPU profilers. Also useful for tracers: off-CPU time, ...
[Figure: flame graph with annotations: file read from disk; directory read from disk; pipe write; path read from disk; fstat from disk]
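Flame graph generators commonly take "folded" input: one line per unique stack, frames joined by semicolons from root to leaf, followed by a value. A minimal sketch of producing that format from tracer samples; `fold_stacks()` and the sample data are hypothetical, and for off-CPU analysis the value would be a time total, not a sample count.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse stack samples into folded lines: 'root;...;leaf value'.

    samples is an iterable of (frames, value) pairs with frames ordered
    root first. Values for identical stacks are summed.
    """
    totals = Counter()
    for frames, value in samples:
        totals[";".join(frames)] += value
    return ["%s %d" % (stack, n) for stack, n in sorted(totals.items())]

if __name__ == "__main__":
    # Made-up samples: (stack frames, value).
    demo = [(("main", "read"), 2), (("main", "read"), 3), (("main", "write"), 1)]
    print("\n".join(fold_stacks(demo)))
```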
Visualization 9: Flame Charts
Desirable Attributes
• Valuable
– Methodologies provide ideas for purposeful metrics
• Documented
– Tool tips, wikis
• Tested
• Real Time
• Dashboards
– To support methodologies
Thank You!
http://www.brendangregg.com
http://slideshare.net/brendangregg
bgregg@netflix.com
@brendangregg
References & Links:
– http://www.brendangregg.com/heatmaps.html
– http://www.brendangregg.com/flamegraphs.html
– http://www.slideshare.net/brendangregg/monitorama-2015-netflix-instance-analysis

Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 

Recently uploaded (20)

Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 

Designing Tracing Tools

  • 1. Brendan Gregg, Senior Performance Architect Designing Tracing Tools
  • 3. I'm currently developing more tracing tools (bcc/BPF)
  • 4. Tool Design • For tool developers • For everyone else: what you can ask for – Tool templates – GUI visualizations • The following is applicable to all tracers – sysdig, bcc/BPF, DTrace, SystemTap, LTTng, …
  • 6. Methodology-driven Design • Methodologies provide ideas for purposeful tools • Find/draw a functional diagram, apply methods See: http://www.brendangregg.com/methodology.html Operating Systems
  • 7. Methodology Examples Eg, at the syscall layer (well known & documented): • Workload Characterization – exec() or open() per-event trace (execsnoop, opensnoop) – connect()/accept() per-event trace (tcpconnect, tcpaccept) – read()/write() size histogram (one-liners) • Latency Analysis – read()/write() latency histogram (biolatency, …) • USE Method – network utilization by thread (not done yet) – syscall errors (fserrors, soerrors)
  • 9. CLI Templates 1. Per event output – *snoop, *slower 0, … 2. Filtered event output – *slower 3. Interval summary – *stat, *top 4. Count summary – *count 5. Histogram summary – *dist, *latency 6. Heatmap summary – spectrogram.lua, subsecoffset.lua, …
  • 10. Template 1: Per Event Output
    Examples: *snoop, *slower 0, …
    # opensnoop
    PID    COMM  FD ERR PATH
    10085  sshd   3   0 /lib/x86_64-linux-gnu/libkeyutils.so.1
    10085  sshd   3   0 /lib/x86_64-linux-gnu/libresolv.so.2
    10085  sshd   3   0 /lib/x86_64-linux-gnu/libgpg-error.so.0
    10085  sshd   3   0 /dev/urandom
    10085  sshd  -1   2 /lib/x86_64-linux-gnu/.libcrypto.so.1.0.0.hmac
    10085  sshd  -1   2 /proc/sys/crypto/fips_enabled
    10085  sshd   3   0 /proc/filesystems
    10085  sshd   3   0 /dev/null
    10085  sshd   3   0 /proc/10085/fd
    10085  sshd   3   0 /usr/lib/ssl/openssl.cnf
    10085  sshd   3   0 /etc/gai.conf
    10085  sshd   3   0 /etc/nsswitch.conf
    10085  sshd   3   0 /etc/ld.so.cache
    10085  sshd   3   0 /lib/x86_64-linux-gnu/libnss_compat.so.2
    10085  sshd   3   0 /etc/ld.so.cache
    10085  sshd   3   0 /lib/x86_64-linux-gnu/libnss_nis.so.2
    […]
  • 11. Template 2: Filtered Event Output
    Examples: *slower
    Tools like this can also do all event output:
    # sysdig -c fileslower 1
    TIME                     PROCESS  TYPE  LAT(ms)  FILE
    2014-04-13 20:40:43.973  cksum    read        2  /mnt/partial.0.0
    2014-04-13 20:40:44.187  cksum    read        1  /mnt/partial.0.0
    2014-04-13 20:40:44.689  cksum    read        2  /mnt/partial.0.0
    2014-04-13 20:40:45.005  cksum    read        2  /mnt/partial.0.0
    2014-04-13 20:40:45.193  cksum    read        1  /mnt/partial.0.0
    […]
    # sysdig -c fileslower 0
    TIME                     PROCESS  TYPE  LAT(ms)  FILE
    2014-04-13 20:59:04.414  ls       read        0  /lib/x86_64-linux-gnu/librt.so.1
    2014-04-13 20:59:04.414  ls       read        0  /lib/x86_64-linux-gnu/libacl.so.1
    2014-04-13 20:59:04.414  ls       read        0  /lib/x86_64-linux-gnu/libc.so.6
    2014-04-13 20:59:04.414  ls       read        0  /lib/x86_64-linux-gnu/libdl.so.2
    2014-04-13 20:59:04.414  ls       read        0  /lib/x86_64-linux-gnu/libattr.so.1
    2014-04-13 20:59:04.415  ls       read        0  /proc/filesystems
    2014-04-13 20:59:04.415  ls       read        0  /proc/filesystems
    [...]
  • 12. Template 3: Interval Summary
    Examples: *stat, *top
    # dcstat
    TIME         REFS/s   SLOW/s   MISS/s     HIT%
    08:11:47:      2059      141       97    95.29
    08:11:48:     79974      151      106    99.87
    08:11:49:    192874      146      102    99.95
    08:11:50:      2051      144      100    95.12
    08:11:51:     73373    17239    17194    76.57
    08:11:52:     54685    25431    25387    53.58
    08:11:53:     18127     8182     8137    55.12
    08:11:54:     22517    10345    10301    54.25
    08:11:55:      7524     2881     2836    62.31
    08:11:56:      2067      141       97    95.31
    08:11:57:      2115      145      101    95.22
    […]
  • 13. Template 4: Count Summary
    Examples: *count
    # funccount 'vfs_*'
    Tracing... Ctrl-C to end.
    ^C
    ADDR             FUNC                COUNT
    ffffffff811efe81 vfs_create              1
    ffffffff811f24a1 vfs_rename              1
    ffffffff81215191 vfs_fsync_range         2
    ffffffff81231df1 vfs_lock_file          30
    ffffffff811e8dd1 vfs_fstatat           152
    ffffffff811e8d71 vfs_fstat             154
    ffffffff811e4381 vfs_write             166
    ffffffff811e8c71 vfs_getattr_nosec     262
    ffffffff811e8d41 vfs_getattr           262
    ffffffff811e3221 vfs_open              264
    ffffffff811e4251 vfs_read              470
    Detaching...
  • 14. Template 5: Histogram Summary
    Examples: *dist, *latency
    # biolatency
    Tracing block device I/O... Hit Ctrl-C to end.
    ^C
         usecs           : count  distribution
         4 -> 7          : 0      |                                      |
         8 -> 15         : 0      |                                      |
        16 -> 31         : 0      |                                      |
        32 -> 63         : 0      |                                      |
        64 -> 127        : 1      |                                      |
       128 -> 255        : 12     |********                              |
       256 -> 511        : 15     |**********                            |
       512 -> 1023       : 43     |*******************************       |
      1024 -> 2047       : 52     |**************************************|
      2048 -> 4095       : 47     |**********************************    |
      4096 -> 8191       : 52     |**************************************|
      8192 -> 16383      : 36     |**************************            |
     16384 -> 32767      : 15     |**********                            |
     32768 -> 65535      : 2      |*                                     |
     65536 -> 131071     : 2      |*                                     |
  • 15. Template 6: Heatmap Summary Example: subsecoffset.lua (aka "spectrogram")
  • 16.
  • 17. Valuable Know what already exists, and what doesn't
  • 18. Low Overhead (or documented)
    • Understand tracing internals
      – For example, sysdig's design has ~20x lower overhead than strace (it still has overhead: test and measure to see if it's acceptable)
      – Tracing overhead is usually relative to event rate
    • Design for low overhead, and document expectations
    [Diagram labels: sysdig; Kernel; syscalls; sysdig driver; ring buffer; lua program; steps: 1. enable, 2. async read, 3. output]
  • 19. Documentation
    • Good tools have 3 docs:
      1. Code comments
      2. Man page
      3. Examples file
    • Man page
      – troff, docbook, …
      Common man macros (see groff_man(7)):
        .TH  Title heading
        .SH  Section heading
        .IP  Indented paragraph
        .TP  Indented paragraph with label
        .B   Bold
    • Examples file:
      Demonstrations of biosnoop, the Linux eBPF/bcc version.
      biosnoop traces block device I/O (disk I/O), and prints a line of output per I/O. Example:
      # ./biosnoop
      TIME(s)      COMM       PID   DISK   T  SECTOR    BYTES  LAT(ms)
      0.000004001  supervise  1950  xvda1  W  13092560  4096      0.74
      [...]
  • 20. Concise, intuitive, self-explanatory
    • Useful startup message
      – What I'm tracing, when there's output, when I'll end
    • Vigorous tooling is concise
      – No wasted text; leave less useful output for non-default options
    • Unix philosophy: do one thing and do it well
    # ./iolatency
    Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
      >=(ms) .. <(ms)   : I/O      |Distribution                          |
           0 -> 1       : 4381     |######################################|
           1 -> 2       : 9        |#                                     |
           2 -> 4       : 5        |#                                     |
           4 -> 8       : 0        |                                      |
           8 -> 16      : 1        |#                                     |
    […]
  • 21. POSIX-style Arguments
    # ./biolatency -h
    usage: biolatency [-h] [-T] [-Q] [-m] [-D] [interval] [count]
    Summarize block device I/O latency as a histogram
    positional arguments:
      interval            output interval, in seconds
      count               number of outputs
    optional arguments:
      -h, --help          show this help message and exit
      -T, --timestamp     include timestamp on output
      -Q, --queued        include OS queued time in I/O time
      -m, --milliseconds  millisecond histogram
      -D, --disks         print a histogram per disk device
    examples:
      ./biolatency         # summarize block I/O latency as a histogram
      ./biolatency 1 10    # print 1 second summaries, 10 times
      ./biolatency -mT 1   # 1s summaries, milliseconds, and timestamps
      ./biolatency -Q      # include OS queued time in I/O time
      ./biolatency -D      # show each disk device separately
  • 22. POSIX-style Arguments
    Option              Alternate               Expectation
    -a                  --all                   all events
    -c CMD              --cmd …                 run this command
    -d SECONDS          --duration …            duration of tool execution
    -h                  --help                  help
    -i FILE             --input …               input file
    -i SECONDS          --interval …            summary interval
    -n name             --name …                this process name only
    -o FILE             --output …              output file
    -p PID              --pid …                 this process ID only
    -P                  --by-process            per-process ID breakdown
    -P PORT             --port …                this TCP port only
    -t or -T            --[no]timestamp         include or exclude timestamps
    -v                  --verbose               verbose output
    -x                  --extended, --errors    extended output, or only failures
    [interval [count]]  -                       summary interval, and # of outputs
  • 23. Testing, Testing, Testing • If you can't write the workload, you can't write the tool – Be it 10 lines of C, some shell, or dd – dd if=/dev/urandom of=/dev/null bs=1k count=23 • Known workload testing: create 23 events • Testing can be time consuming – I spend more time testing a tool than writing it – Automatic tool testing is a difficult problem
  • 24. Example: gethostlatency
    # gethostlatency
    TIME      PID    COMM  LATms  HOST
    06:10:24  28011  wget  90.00  www.iovisor.org
    06:10:28  28127  wget   0.00  www.iovisor.org
    06:10:41  28404  wget   9.00  www.netflix.com
    06:10:48  28544  curl  35.00  www.netflix.com.au
    06:11:10  29054  curl  31.00  www.plumgrid.com
    06:11:16  29195  curl   3.00  www.facebook.com
    06:11:25  29404  curl  72.00  foo
    06:11:28  29475  curl   1.00  foo
  • 25. Example: ext4slower
    # ext4slower 1
    Tracing ext4 operations slower than 1 ms
    TIME      COMM   PID   T  BYTES  OFF_KB  LAT(ms)  FILENAME
    06:49:17  bash   3616  R    128       0     7.75  cksum
    06:49:17  cksum  3616  R  39552       0     1.34  [
    06:49:17  cksum  3616  R     96       0     5.36  2to3-2.7
    06:49:17  cksum  3616  R     96       0    14.94  2to3-3.4
    06:49:17  cksum  3616  R  10320       0     6.82  411toppm
    06:49:17  cksum  3616  R  65536       0     4.01  a2p
    06:49:17  cksum  3616  R  55400       0     8.77  ab
    06:49:17  cksum  3616  R  36792       0    16.34  aclocal-1.14
    06:49:17  cksum  3616  R  15008       0    19.31  acpi_listen
    06:49:17  cksum  3616  R   6123       0    17.23  add-apt-repository
    06:49:17  cksum  3616  R   6280       0    18.40  addpart
    06:49:17  cksum  3616  R  27696       0     2.16  addr2line
    06:49:17  cksum  3616  R  58080       0    10.11  ag
    06:49:17  cksum  3616  R    906       0     6.30  ec2-meta-data
    […]
  • 26. Example: biolatency
    # biolatency -m 1 5
    Tracing block device I/O... Hit Ctrl-C to end.
         msecs           : count  distribution
         0 -> 1          : 36     |**************************************|
         2 -> 3          : 1      |*                                     |
         4 -> 7          : 3      |***                                   |
         8 -> 15         : 17     |*****************                     |
        16 -> 31         : 33     |**********************************    |
        32 -> 63         : 7      |*******                               |
        64 -> 127        : 6      |******                                |
    […]
  • 28. GUI Visualizations 1. Event logs 2. Tables 3. Line graphs 4. Histograms 5. Heatmaps (spectrographs) 6. Waterfall charts 7. Directed graphs 8. Flame graphs 9. Flame charts
  • 29. Visualization 1: Event Logs https://commons.wikimedia.org/wiki/File:Wireshark_screenshot.png
  • 31. Visualization 3: Line Graphs http://www.paradyn.org/html/screen-shots.html
  • 32. Visualization 4: Histograms Or a density plot Or as a frequency trail (can cascade)
  • 33. Visualization 5: Heat Maps eg, Oracle ZFS Storage Appliance Analytics (DTrace-based)
  • 37. Visualization 8: Flame Graphs Commonly used with CPU profilers. Also useful for tracers: off-CPU time, ... [Figure annotations: file read from disk; directory read from disk; pipe write; path read from disk; fstat from disk]
  • 39. Desirable Attributes • Valuable – Methodologies provide ideas for purposeful metrics • Documented – Tool tips, wikis • Tested • Real Time • Dashboards – To support methodologies
  • 40. Thank You! http://www.brendangregg.com http://slideshare.net/brendangregg bgregg@netflix.com @brendangregg References & Links: – http://www.brendangregg.com/heatmaps.html – http://www.brendangregg.com/flamegraphs.html – http://www.slideshare.net/brendangregg/monitorama-2015-netflix-instance-analysis
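
The argument conventions on slide 21 and the power-of-2 histogram layout on slide 14 can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the bcc implementation: the function names parse_args, log2_buckets and format_hist are hypothetical, only two of the biolatency options are mirrored, and the latencies would come from a tracer rather than a list.

```python
import argparse


def parse_args(argv=None):
    """Parse POSIX-style options plus optional [interval [count]]."""
    parser = argparse.ArgumentParser(
        description="Summarize latency as a histogram (illustrative sketch)")
    parser.add_argument("-T", "--timestamp", action="store_true",
                        help="include timestamp on output")
    parser.add_argument("-m", "--milliseconds", action="store_true",
                        help="millisecond histogram")
    # Optional positionals; None is this sketch's (assumed) way of saying
    # "no interval given, run until Ctrl-C".
    parser.add_argument("interval", nargs="?", type=int, default=None,
                        help="output interval, in seconds")
    parser.add_argument("count", nargs="?", type=int, default=None,
                        help="number of outputs")
    return parser.parse_args(argv)


def log2_buckets(values):
    """Count non-negative integers into buckets 0-1, 2-3, 4-7, 8-15, ..."""
    counts = {}
    for v in values:
        slot = max(int(v).bit_length() - 1, 0)  # 0 and 1 share slot 0
        counts[slot] = counts.get(slot, 0) + 1
    return counts


def format_hist(counts, unit, width=38):
    """Render bucket counts as text lines with bars scaled to the largest."""
    if not counts:
        return []
    top = max(counts.values())
    lines = [f"{unit:>12}{'':9}: {'count':<8} distribution"]
    for slot in range(max(counts) + 1):
        n = counts.get(slot, 0)
        low = 0 if slot == 0 else 1 << slot
        high = (1 << (slot + 1)) - 1
        bar = "*" * (n * width // top)
        lines.append(f"{low:>8} -> {high:<8} : {n:<8} |{bar:<{width}}|")
    return lines
```

For example, parse_args(["-mT", "1", "10"]) sets milliseconds and timestamp to True, interval to 1 and count to 10, and print("\n".join(format_hist(log2_buckets(samples), "usecs"))) renders a list of integer latencies named samples. Keeping the bucketing and the rendering as separate functions also makes known-workload testing easier: feed in a fixed list and check the counts.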

Editor's Notes

  1. demo BPF iosnoop
  2. needs a better name. Linux FrogFilter.
  3. I don't want a floor wax and a dessert topping! netstat doesn't have a tcpdump option (it's bad enough as it is). Single tools: - simplify testing - free up argument options
  4. If you try to automate, then your tool output may be polluted with other system events (unless you can filter to a PID, but then, you aren't testing system-wide). Ensuring a tool doesn't over count from another workload is also difficult: what other workloads should you run during the test? undefined.
  5. now do demos
  6. <hi thanks for joining…> I’ll start with a really quick introduction and then I’d love to learn a little bit more about you and your environment, and why you’re interested in Sysdig Cloud. But to first set the stage… Sysdig Cloud. There are a million other monitoring tools out there – why does the world need another? Well with Sysdig Cloud, we set out to create the first and only comprehensive, container-native monitoring solution