Network Measurement and Monitoring - Assignment 1, Group 3, "Classification"
1. Network Measurement and Monitoring
1st Assignment: Oral Presentation
Classification
Patrick Herbeuval, University of Liège, 1st Master in Computer Science, p.herbeuval@student.ulg.ac.be
Valentin Thirion, University of Liège, 1st Master in Computer Science, valentin.thirion@student.ulg.ac.be
Teacher: B. DONNET
benoit.donnet@ulg.ac.be
2. Plan
I. Introduction
Four papers
II. Early Application Identification
III. Multilevel classifier: BLINC
IV. Statistical: The ADSL Case
V. Application specific: Skype
VI. Comparison
VII. Conclusion
3. I - Introduction
The Internet is used more and more every day
We want to keep the network comfortable for its users
The quality of service consumers demand grows as fast as applications consume bandwidth
ISPs, companies and universities want to ban P2P traffic
Port-based classifiers worked well years ago, but are quite inefficient now
4. Why classify?
Classification is a key issue for today's network
administrators and companies, for the following reasons:
• Improve the network infrastructure
• Ban undesired traffic
• Protect the network against potential attacks
• Global knowledge of trends
5. How classify?
Deep Packet Inspection (DPI): very precise technique, but with many drawbacks:
Huge computational power needed
Ineffective when packets are encrypted
Continuous need for signature database updates
Statistical analysis
Social
6. II - Early Application
Identification
Goal: determine the application from the first few packets of a flow
Advantage: knowing the kind of traffic from the start makes it possible to block or redirect it
DPI consumes too many resources, and flows must have ended before they can be analysed
Classic statistical methods use mean sizes, durations, …: values that are not available after only a few packets
7. Clustering the flows
Techniques used: K-Means, Gaussian Mixture Model, spectral clustering
Values used:
Size of the first few packets
Duration of the first few packets (negotiation phase)
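As a rough illustration of the clustering step, here is a minimal k-means over 4-dimensional first-packet-size vectors; all flow values below are made up, and the real paper also evaluates GMM and HMM-based clustering:

```python
# Minimal sketch (hypothetical data): k-means clustering of TCP flows using
# the sizes of their first 4 packets as a 4-dimensional vector.
import random

def kmeans(points, k, iters=50, seed=0):
    """Cluster packet-size vectors with plain k-means."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each flow goes to its nearest centroid (Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(p, centroids[i])))
            clusters[best].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(dim) / len(c) for dim in zip(*c))
    return centroids, clusters

# Hypothetical flows: sizes of packets 1..4 in bytes; a negative size could
# encode the direction (server -> client).
flows = [(120, -300, 80, -1460), (118, -310, 82, -1400),
         (60, -60, 512, -512), (58, -62, 500, -520)]
centroids, clusters = kmeans(flows, k=2)
```

The paper's best configuration uses 40 clusters over the first 4 packets; the toy example keeps k=2 only to stay readable.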
8. Data set
4 packet traces
3 from a University network
1 from an enterprise network
Keep only TCP packets and discard flows that began before the trace capture
Features analysed: need for an efficient metric
Size and direction of the first 4 packets
The range of these values is very similar across traces (see the graph on the next slide)
10. Classification, 2 phases
Training phase: offline at management sites.
Apply clustering techniques to samples of TCP connections
for all target applications
Creation of a spatial representation based on the sizes of the
first P packets (vector of P dimensions or HMM)
Then find applications that have the same behaviour
Best results: 40 clusters and the first 4 packets
Creation of two sets:
One with the description of each cluster
One with applications present in each cluster
11. Classification, 2 phases
Classification phase: online at management hosts
Extract the 5-tuple and analyse the sizes of the packets in both directions
With these sizes, use the assignment module (associates a connection with a cluster)
Given the cluster, the labelling module selects the application associated with the connection
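The two online modules described above can be sketched as follows; the cluster centroids, application counts and the plain Euclidean distance are illustrative assumptions, not the paper's trained model:

```python
# Hypothetical sketch of the online phase: the assignment module picks the
# nearest cluster (Euclidean distance on first-packet sizes); the labelling
# module returns the application most frequent in that cluster.

# Cluster descriptions produced by the offline training phase (made-up values).
cluster_centroids = {0: (120, -300, 80, -1460), 1: (60, -60, 512, -512)}
# Applications observed in each cluster during training (made-up counts).
cluster_apps = {0: {"http": 950, "ftp": 50}, 1: {"edonkey": 800, "bittorrent": 200}}

def assign(first_packet_sizes):
    """Assignment module: map a connection to its nearest cluster."""
    return min(cluster_centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(first_packet_sizes,
                                                 cluster_centroids[c])))

def label(cluster_id):
    """Labelling module: dominant application of the cluster."""
    apps = cluster_apps[cluster_id]
    return max(apps, key=apps.get)

conn = (119, -305, 81, -1450)   # sizes of the 4 first packets of a connection
app = label(assign(conn))       # -> "http"
```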
12. Evaluation & Conclusion
Evaluation
Assignment accuracy: above 95% for all heuristics
Labelling accuracy: between 85% and 98%
The sizes of the first few packets are a good metric
Clustering quality is richer with HMM, but comparable with Euclidean distance
GMM clustering combined with TCP ports classifies over 98% of known applications
Limitation: needs the first 4 packets, in the correct order
Heuristic (Wikipedia): where exhaustive search is impractical (NP-complete problems, for instance), heuristic methods are used to speed up the process of finding a satisfactory solution.
13. III – The BLINC Classifier
Stands for BLINd Classification
Avoids reading the packets' content
Privacy, performance, works even on encrypted packets
3 levels of classification
Social level
Functional level
Application level
14. The Social level
Finding host communities
Client-server, P2P, …
Analyse these communities
Perfect match: likely malicious
Partial overlap: P2P sources, websites, gaming, …
Partial overlap within the same subnet: farms
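A toy illustration of the community-overlap idea above; the hosts and destination sets are invented, and BLINC's real social-level heuristics are richer than a single similarity score:

```python
# Illustrative sketch (not BLINC's actual algorithm) of the social level:
# compare the sets of destinations two hosts talk to. A perfect overlap is
# suspicious (e.g. scanning or malware), a partial overlap suggests a shared
# community (P2P swarm, popular website, game server).

def overlap(a, b):
    """Jaccard similarity between two destination sets."""
    return len(a & b) / len(a | b)

dests = {
    "h1": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
    "h2": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},   # perfect match with h1
    "h3": {"10.0.0.2", "10.0.0.3", "10.0.0.9"},   # partial overlap with h1
}

print(overlap(dests["h1"], dests["h2"]))             # 1.0 -> likely malicious
print(round(overlap(dests["h1"], dests["h3"]), 2))   # 0.5 -> shared community
```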
16. The functional level
Find if a host offers a service, uses it or both
Mostly depending on the port range used by this host
Works better when a host is connected to many servers
Typical schemes:
HTTP server: 1-2 ports
P2P: many ports (up to 1 per host)
Mail server: depending on services available
17. The application level
Uses the connection's 4-tuple (+ possibly other characteristics)
Creates a model for every application type
Models are represented by small graphs called "graphlets"
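A loose sketch of the graphlet idea; the matching rules below are illustrative only, not BLINC's actual graphlet library:

```python
# Summarize one host's flows by how many distinct destination IPs and ports
# hang off each source port, then match the resulting shape against very
# simple per-application patterns (hypothetical rules).
from collections import defaultdict

def graphlet(flows):
    """flows: list of (src_port, dst_ip, dst_port) seen from one host."""
    g = defaultdict(lambda: {"dst_ips": set(), "dst_ports": set()})
    for sport, dip, dport in flows:
        g[sport]["dst_ips"].add(dip)
        g[sport]["dst_ports"].add(dport)
    return g

def classify(flows):
    g = graphlet(flows)
    if len(g) == 1:                      # a single source port serving...
        node = next(iter(g.values()))
        if len(node["dst_ips"]) > 2:     # ...many hosts: looks like a server
            return "server-like"
    if len(g) > 2:                       # many source ports, many peers
        return "p2p-like"
    return "unknown"

web = [(80, ip, p) for ip, p in [("a", 1025), ("b", 2120), ("c", 3333), ("d", 4040)]]
p2p = [(s, ip, s) for s, ip in [(6881, "a"), (6882, "b"), (6883, "c")]]
print(classify(web))   # server-like
print(classify(p2p))   # p2p-like
```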
18. BLINC : Results
Uses 2 metrics to evaluate the classifier
Completeness (% classified traffic)
Accuracy (% correctly classified traffic)
Some parameters can be used to tune the classifier
Changing a threshold can improve the results for one of the
metrics, but significantly degrade the other one
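The two evaluation metrics are straightforward to compute; the counts below are made up:

```python
# Sketch of BLINC's two metrics (hypothetical counts):
# completeness = fraction of traffic the classifier labels at all,
# accuracy = fraction of labelled traffic that is labelled correctly.

def completeness(classified, total):
    return 100.0 * classified / total

def accuracy(correct, classified):
    return 100.0 * correct / classified

total_flows = 10_000      # all flows in the trace (made-up numbers)
classified_flows = 8_900  # flows assigned to some application
correct_flows = 8_600     # classified flows whose label matches ground truth

print(completeness(classified_flows, total_flows))  # 89.0
print(accuracy(correct_flows, classified_flows))    # ~96.6
```

Raising a threshold typically trades one metric for the other: stricter rules label fewer flows (lower completeness) but label them more reliably (higher accuracy).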
19. Global results
GN: Genome campus (~1,000 users), UN: university network (~20,000 users)
21. Results (2)
Good detection rate without reading a single byte of payload
Flows without payload are classified as well
Payload encryption is not a problem
Low resource consumption
Good detection of unknown flows
Difficult to distinguish applications of the same type (e.g. all VoIP protocols grouped as one)
Doesn't work if the headers are encrypted
Hard to identify multiple sources behind NATs
Results come from the edge of the network; the classifier may behave differently in the backbone
22. BLINC : conclusion
BLINC has a good detection rate without costing a lot of
processing and without being intrusive
It can detect attacks and unknown protocols
It can be improved in some situations
23. IV – The ADSL Case
Tests a statistical classifier on different sites after it has been trained on others
Dataset:
4 packet traces collected at 3 different ADSL POPs of Orange
2 traces at the same time, at different locations
2 traces at the same location, 17 days apart
Reference used: ODP tool (provided by Orange)
24. Classification methodology
3 algorithms used to classify the traces
Naïve Bayes Kernel Estimation
Bayesian Network
C4.5 Decision Tree
Traces analysed with two feature sets:
SET_A: packet-level information
SET_B: flow-level statistics
3 filters:
S/S: flows with a complete 3-way handshake
S/S+4D: same as S/S + at least 4 data packets
S/S+F/R: same as S/S + FIN or RST flag at the end
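The S/S and S/S+F/R filters can be sketched like this; the flag sequences are simplified (real TCP packets carry combined flags):

```python
# Hypothetical sketch of the flow filters: keep only flows whose capture
# starts with a full TCP 3-way handshake (SYN, SYN/ACK, ACK), and optionally
# require a terminating FIN or RST.

def has_handshake(flags):
    """flags: list of TCP flag strings for the flow's packets, in order."""
    return flags[:3] == ["SYN", "SYN/ACK", "ACK"]

def filter_ss(flows, require_end=False):
    kept = []
    for flags in flows:
        if not has_handshake(flags):
            continue                                  # S/S: handshake seen
        if require_end and flags[-1] not in ("FIN", "RST"):
            continue                                  # S/S+F/R: clean end too
        kept.append(flags)
    return kept

flows = [
    ["SYN", "SYN/ACK", "ACK", "PSH", "FIN"],   # complete flow
    ["ACK", "PSH", "ACK"],                     # started before the capture
    ["SYN", "SYN/ACK", "ACK", "PSH"],          # no clean termination
]
print(len(filter_ss(flows)))                   # 2 flows pass S/S
print(len(filter_ss(flows, require_end=True)))  # 1 flow passes S/S+F/R
```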
25. Classification, 2 cases
Static case: classification on each site independently
Ideal number of packets: 4
Accuracy: about 90%
Great classification of WEB and EDONKEY flows
Cross-site case:
SET_A: EDONKEY results are immune; spatial similarity seems more important than temporal similarity
Classifier very sensitive to the context in which it is trained
MAIL is often mistaken for FTP due to similar packet sizes
Using the port number improves the quality of the results
26. Classification, 2 cases
(continued)
SET_B: some degradation
Focus on a single feature: port number
Results are the opposite of the static case
Predicting traffic that uses non-legacy ports is inefficient
Due to the heavy hitters (typically P2P)
Global results: the C4.5 algorithm has the best overall accuracy in almost all cases (static + cross-site)
Degradation: C4.5 is comparable with the other algorithms (≤ 17%)
Data overfitting problem
27. Unknown class + Conclusion
Looking at the flows marked as unknown
3-way handshake
Apply the classifiers and get a confidence level; this value is then compared to the one returned by C4.5
Useful to detect malicious traffic and P2P
Should be integrated into existing DPI tools
Conclusion:
Statistical tools are very useful to identify unknown traffic
Good performance when used on the same site as the training
Can detect applications among protocols
Really suffers from data overfitting (same behaviour from different apps)
Great thing about this analysis: it used commercial traffic, so it is very diverse
28. V – Skype case
We want to detect Skype traffic
Detecting VoIP traffic with other classifiers is already possible, but how do we single out Skype?
Skype is a closed and encrypted protocol, which has to be analysed before the classification can start
29. Skype model
Using a controlled environment, detection of Skype traffic
characteristics
2 kinds of connections: E2E and E2O
E2E: End-to-End, Skype to Skype
E2O: End-to-Out, Skype to the telephone network
Skype works over TCP and UDP
Skype can carry text, voice, video and files
Everything can be multiplexed in a single packet
Here, only voice traffic is considered
30. Skype SoM
TCP packets are entirely encrypted; they cannot be analysed
UDP packets have a small unencrypted header, called the Start of Message (SoM)
E2E: id and message type (signaling or data)
E2O: unique connection identifier
Skype also always uses the same UDP port number (12340)
31. Classifiers
Chi-Square Classifier (CSC)
Based on the randomness of the bits in the packets
Doesn't work on TCP, since fully encrypted packets all look completely random
Naive Bayes Classifier (NBC)
Real-time voice protocol classifier
Based on the message size (which depends on the audio codec) and on the average inter-packet gap
Used on a short window of samples to cope with the variability of packet sizes
Payload-based classifier (PBC)
Used in the controlled environment to check that CSC and NBC work well
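The CSC intuition can be sketched as follows; the byte-level test and the threshold handling are simplified relative to the paper's grouped-bit chi-square:

```python
# Sketch of the Chi-Square Classifier idea (simplified, hypothetical data):
# measure how far the byte values of a payload deviate from a uniform
# distribution. Encrypted Skype payload looks random (low score); plaintext
# or structured payload does not (high score).
from collections import Counter
import random

def chi_square(payload):
    counts = Counter(payload)
    expected = len(payload) / 256   # uniform expectation per byte value
    return sum((counts.get(v, 0) - expected) ** 2 / expected
               for v in range(256))

rnd = random.Random(42)
random_payload = bytes(rnd.randrange(256) for _ in range(4096))  # "encrypted"
text_payload = b"GET /index.html HTTP/1.1\r\n" * 160             # plaintext

# The "encrypted" payload scores far lower than the structured one; a
# trace-tuned threshold then separates the two classes.
print(chi_square(random_payload) < chi_square(text_payload))     # True
```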
32. Experiments
NBC detects all kinds of VoIP traffic
CSC detects all kinds of Skype traffic
Combining both should detect exactly Skype voice traffic
33. Results
Metrics per classifier and flow type: N (flows), OK (correctly classified), FP/FP% (false positives), FN/FN% (false negatives)

Table 3: Results for UDP flows, CAMPUS dataset.
Classifier  Flows  N      OK   FP    FP%   FN   FN%
PBC         E2E    1014   —    —     —     —    —
PBC         E2O    163    —    —     —     —    —
NBC         E2E    1236   726  510   0.68  288  28.40
NBC         E2O    441    153  288   0.38  10   6.13
CSC         E2E    2781   984  1797  2.40  30   2.96
CSC         E2O    161    157  4     0.01  6    3.68
NBC ∧ CSC   E2E    716    710  6     0.01  304  29.98
NBC ∧ CSC   E2O    147    147  0     0.00  16   9.82
(flows with ≥ 100 packets: 76025; total flows: 487729)

Table 4: Results for UDP flows, ISP dataset.
Classifier  Flows  N      OK   FP     FP%    FN  FN%
PBC         E2E    65     —    —      —      —   —
PBC         E2O    125    —    —      —      —   —
NBC         E2E    27437  50   27387  73.73  15  23.08
NBC         E2O    295    124  171    0.46   1   0.80
CSC         E2E    191    57   134    0.36   8   12.31
CSC         E2O    190    123  67     0.18   2   1.60
NBC ∧ CSC   E2E    51     49   2      0.01   16  24.62
NBC ∧ CSC   E2O    163    122  41     0.11   3   2.40
(flows with ≥ 100 packets: 37212; total flows: 258634)

Table 5: Results for TCP flows, both datasets (flow counts, CAMPUS / ISP):
PBC E2E: 20910 / 60
NBC E2O: 2034 / 646
CSC E2O: 403996 / 46876
NBC ∧ CSC E2E: 621 / 12
NBC ∧ CSC E2O: 313 / 0
(flows with ≥ 100 packets: 1646424 / 108831; total flows: 23856424 / 1614553)

Very low false positive rate
Bigger false negative rate
The PBC is used as an oracle: flows that pass the PBC form the reliable benchmark dataset (E2O: Skype voice only; E2E: voice, video, data and chat flows, which packet inspection cannot tell apart)
Thresholds were tuned on testbed traces containing more than 50 Skype voice calls: Bmin = −5 and χ²(Thr) = 150; with these settings, all testbed flows were correctly identified as E2E or E2O, with neither FP nor FN
On the ISP trace, the NBC alone (correctly) identifies 27437 voice flows, most of which are the ISP's own VoIP flows carried over RTP; only combining it with the CSC singles out the true Skype voice flows
34. Skype : Conclusion
Skype is hard to classify due to its encrypted protocol, which makes its analysis hard to do
But with this classifier, we get good results on UDP
The false positive rate is almost zero: good if the ISP wants to prioritize Skype traffic
The false negative rate is bigger, but not really a problem as long as the ISP doesn't want to block Skype
35. VI - Comparison
All these classifiers give good results, but each has its own strengths and weaknesses
The ADSL classifier needs site-specific training, but has the best detection rate
BLINC and Early are less precise but more flexible
They are also faster and good at detecting attacks
BLINC detects unknown protocols but cannot discern individual apps
Early needs the first 4 packets in order, ADSL the 3-way handshake
The Skype classifier is more specific and cannot be compared directly
Good false positive rate but higher false negative rate
36. VII – Conclusion
We now have solutions that can replace DPI
Each classifier is good in its own domain
Large networks: early application detection (detect attacks soon)
ADSL and commercial networks: statistical (user trends, adapting the infrastructure)
University or academic networks: BLINC (statistics, trends)
For a specific application we want to handle specially: a Skype-style classifier
Remarks:
Traces and classifiers are quite old (4 to 6 years)
What about mobile usage? Multimedia over 3G/4G networks?
37. References:
T. Karagiannis, K. Papagiannaki, M. Faloutsos. BLINC: Multilevel Traffic Classification in the Dark. In Proc. ACM SIGCOMM. August 2005.
L. Bernaille, R. Teixeira, K. Salamatian. Early Application Identification. In Proc. ACM CoNEXT. December 2006.
M. Pietrzyk, J.-L. Costeux, G. Urvoy-Keller, T. En-Najjary. Challenging Statistical Classification for Operational Usage: the ADSL Case. In Proc. ACM/USENIX Internet Measurement Conference (IMC). November 2009.
D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, P. Tofanelli. Revealing Skype Traffic: When Randomness Plays with You. In Proc. ACM SIGCOMM. August 2007.
Thanks for your attention
Any questions?