Learning and using elements of statistics (distributions, standard deviation, linear correlation) in Python is very simple.
The slides show an example of managing some data series for network troubleshooting.
1. Statistics 101 for System Administrators
EuroPython 2014, 22nd July - Berlin
Roberto Polli - roberto.polli@babel.it
Babel Srl P.zza S. Benedetto da Norcia, 33
00040, Pomezia (RM) - www.babel.it
22 July 2014
2. Who? What? Why?
• Using (and learning) elements of statistics with python.
• Roberto Polli - Community Manager @ Babel.it. Loves writing in C, Java
and Python. Red Hat Certified Engineer and Virtualization Administrator.
• Babel – Proud sponsor of this talk ;) Delivers large mail infrastructures
based on Open Source software for Italian ISPs and PA. Contributes to
various FLOSS projects.
3. Agenda
• A latency issue: what happened?
• Correlation in 30”
• Combining data
• Plotting time
• modules: scipy, matplotlib
4. A Latency Issue
• Episodic network latency issues
• Log traces: message size, #peers, retransmissions
• Do we need to scale? Was it a peak problem?
Find a rapid answer with python!
5. Basic statistics
Python libraries provide basic statistics, for example:
from numpy import mean  # x̄
from numpy import std   # σ_X
T = {'ts': (1, 2, 3, ..),
     'late': (0.12, 6.31, 0.43, ..),
     'peers': (2313, 2313, 2312, ..), ...}
print([(k, max(X), min(X), mean(X), std(X))
       for k, X in T.items()])
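The same summary can be sketched with the standard library only. This is a minimal example, assuming Python 3.4 or later for the statistics module; the sample values and the summarize helper are invented for illustration and are not part of the talk.

```python
from statistics import mean, pstdev

# Hypothetical sample data, invented for illustration only.
T = {
    "ts": (1, 2, 3, 4),
    "late": (0.12, 6.31, 0.43, 0.20),
    "peers": (2313, 2313, 2312, 2310),
}

def summarize(table):
    """Return {name: (max, min, mean, population std dev)} for every series."""
    return {k: (max(X), min(X), mean(X), pstdev(X)) for k, X in table.items()}

summary = summarize(T)
for name, stats in summary.items():
    print(name, stats)
```

pstdev is the population standard deviation, which matches the default behaviour of numpy's std; use stdev instead for the sample estimate.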
6. Distributions
Data distribution - aka δX - shows event frequency.
# The fastest way to get a
# distribution is
from matplotlib import pyplot as plt
freq, bins, _ = plt.hist(T['late'])
# plt.hist returns the counts, the bin edges and the drawn patches
distribution = zip(bins, freq)  # (left edge of each bin, count)
Figure: a ping RTT distribution (histogram titled "Ping RTT distribution"; x axis: rtt in ms, 158.0 to 162.0; y axis: 0.0 to 4.0).
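To see what a histogram computes without drawing anything, here is a plot-free sketch. The histogram helper and the RTT values are hypothetical, written for this example; it mimics equal-width binning and is not matplotlib's implementation.

```python
def histogram(values, nbins=10):
    """Return (freq, edges): counts per equal-width bin and the nbins + 1 bin edges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # avoid a zero width when all values are equal
    edges = [lo + i * width for i in range(nbins + 1)]
    freq = [0] * nbins
    for v in values:
        # The maximum value falls in the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), nbins - 1)
        freq[idx] += 1
    return freq, edges

# Hypothetical round-trip times in ms, invented for illustration.
rtt = [158.0, 158.3, 159.2, 159.4, 159.6, 160.1, 160.7, 162.0]
freq, edges = histogram(rtt, nbins=4)
distribution = list(zip(edges, freq))  # (left edge of the bin, count)
print(distribution)
```

There is always one more edge than there are bins, so zip pairs each count with the left edge of its bin and drops the last edge.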
7. Correlation I
Are two data series X, Y related?
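Given $\Delta x_i = x_i - \bar{x}$, Mr. Pearson answered with this formula:
$$\rho(X, Y) = \frac{\sum_i \Delta x_i \, \Delta y_i}{\sqrt{\sum_i (\Delta x_i)^2 \, \sum_i (\Delta y_i)^2}} \in [-1, +1] \qquad (1)$$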
ρ identifies if the values of X and Y ‘move’ together on the same line.
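Formula (1) translates almost term by term into code. The sketch below is a plain-Python illustration; the pearson function name and the sample series are choices made here, not the talk's code.

```python
from math import sqrt

def pearson(X, Y):
    """Pearson's rho, computed term by term as in formula (1)."""
    if len(X) != len(Y):
        raise ValueError("series must have the same length")
    mean_x = sum(X) / len(X)
    mean_y = sum(Y) / len(Y)
    dx = [x - mean_x for x in X]  # the deltas of X
    dy = [y - mean_y for y in Y]  # the deltas of Y
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant series has no spread: the denominator is 0 and rho is undefined.
    return numerator / denominator

rho_up = pearson([1, 2, 3, 4], [10, 20, 30, 40])  # points on a rising line
rho_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # points on a falling line
print(rho_up, rho_down)
```

Points lying exactly on a rising line give +1, points on a falling line give -1.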
8. You must (scatter) plot
ρ doesn’t find non-linear correlation!
9. Probability Indicator
Python scipy provides a correlation function, returning two values:
• the ρ correlation coefficient ∈ [−1, +1]
• the probability that such datasets are produced by uncorrelated systems
from random import randint
from scipy.stats import pearsonr # our beloved ρ
a, b = range(0, 100), range(0, 400, 4)
c, d = [randint(0, 100) for x in a], [randint(0, 100) for x in a]
correlation, probability = pearsonr(a,b) # ρ = 1.000, p = 0.000
correlation, probability = pearsonr(c,d) # e.g. ρ = −0.041, p = 0.683 (random data: varies per run)
10. Combinations
itertools is a gold mine of useful tools.
from itertools import combinations
# returns all possible combinations of
# items, taken N at a time
items = "heart spades clubs diamonds".split()
combinations(items, 2)
# And now all possible combinations between
# dataset fields!
combinations(T, 2)
Combining 4 suits,
2 at a time.
♥♠
♥♣
♥♦
♠♣
♠♦
♣♦
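A runnable version of the snippet above. combinations() returns a lazy iterator, so the sketch wraps it in list() to inspect the result; the empty placeholder series in T are an assumption made only to show how dictionary keys get paired.

```python
from itertools import combinations

items = "heart spades clubs diamonds".split()
pairs = list(combinations(items, 2))  # combinations() is lazy: list() materializes it
print(len(pairs))   # n * (n - 1) / 2 pairs, with n = 4
print(pairs[0])

# Iterating a dict yields its keys, so this pairs up the field names.
T = {"ts": (), "late": (), "peers": ()}  # empty placeholder series
field_pairs = list(combinations(T, 2))
print(field_pairs)
```

Order inside a pair never matters here: ("heart", "spades") appears, ("spades", "heart") does not.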
11. Netfishing correlation I
# Now we have all the ingredients for
# net-fishing relations between our data!
for (k1,v1), (k2,v2) in combinations(T.items(), 2):
    # Look for correlations between every dataset!
    corr, prob = pearsonr(v1, v2)
    if corr > .6:
        print("Series", k1, k2, "can be correlated", corr)
    elif prob < 0.05:
        print("Series", k1, k2, "probability lower than 5%", prob)
12. Netfishing correlation II
Now plot all combinations: there's more than meets the eye!
# Plot everything, and insert data in plots!
for (k1,v1), (k2,v2) in combinations(T.items(), 2):
    corr, prob = pearsonr(v1, v2)
    plt.scatter(v1, v2)
    # 3 digit precision on title
    plt.title("R={:0.3f} P={:0.3f}".format(corr, prob))
    plt.xlabel(k1); plt.ylabel(k2)
    # save and close the plot
    plt.savefig("{}_{}.png".format(k1, k2)); plt.close()
14. Color is the 3rd dimension
from itertools import cycle
colors = cycle("rgb") # use more than 3 colors!
labels = cycle("morning afternoon night".split())
datalen = len(T['ts']) # assumption: every series has the same length
size = datalen // 3 # 3 colors, right?
for (k1,v1), (k2,v2) in combinations(T.items(), 2):
    for i in range(0, datalen, size):
        plt.scatter(v1[i:i+size], v2[i:i+size],
                    color=next(colors),
                    label=next(labels))
    # set title, save plot & co
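Slicing in steps of datalen // 3 produces a fourth, short chunk whenever the length is not a multiple of 3, and that chunk takes the first color again. A possible way around it is sketched below; split_in_parts is a hypothetical helper, not part of the talk, and the 10-sample series is invented.

```python
def split_in_parts(series, parts=3):
    """Split a sequence into `parts` consecutive slices whose lengths differ by at most one."""
    base, extra = divmod(len(series), parts)
    slices, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)  # first `extra` slices get one more item
        slices.append(series[start:stop])
        start = stop
    return slices

labels = "morning afternoon night".split()
colors = "rgb"
series = list(range(10))  # hypothetical 10-sample series
chunks = split_in_parts(series)
for label, color, chunk in zip(labels, colors, chunks):
    print(label, color, chunk)
```

Each chunk can then be passed to plt.scatter with its own color and label, so every plot gets exactly three groups.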
16. Latency Solution
• Latency wasn’t related to packet size or system throughput
• Errors were not related to packet size
• Discovered system throughput
17. Wrap Up
• Use statistics: it’s easy
• Don’t use ρ to exclude relations
• Plot, Plot, Plot
• Continue collecting results
18. That’s all folks!
Thank you for your attention!
Roberto Polli - roberto.polli@babel.it