This document provides an overview of the visualization lifecycle process, including assessing data, parsing, cleaning, and visualizing data. It discusses exploring data, parsing and normalizing data, data cleansing techniques, feature selection, and choosing appropriate visualization tools and libraries. Key steps include parsing raw data into a structured format, filtering and aggregating data, loading data into databases, and iterating on visual transformations to create effective visualizations. A variety of open source tools and JavaScript libraries for data visualization are also presented.
2. “Transform a dataset into a captivating story.”
‣ Assess [slide annotations: "You're on your own", "Art"]
‣ Parse
‣ Clean
‣ Visualize
Visualization Tools and Libraries
pixlcloud | collect. visualize. understand. Copyright (c) 2011
6. Explore Data
‣ What is the data about?
‣ What are the data features/columns?
‣ Is there a common structure in the data?
‣ What are the data types?
Nov 7 09:14:46 fwbox kernel: DROPPED IN=eth0 OUT= MAC=00:0c:29:e3:45:bd:00:0c:
29:b5:5c:ee:08:00 SRC=10.1.222.31 DST=10.1.222.202 LEN=60 TOS=0x00 PREC=0x00
TTL=64 ID=63849 DF PROTO=TCP SPT=58485 DPT=9111 WINDOW=5840 RES=0x00 SYN URGP=0
May 25 20:24:20 ram-laptop kernel: BLOCK any in: IN=eth1 OUT=
MAC=00:13:02:ac:d8:ea:00:09:5b:3d:df:00:08:00 SRC=213.175.90.24 DST=192.168.0.15
LEN=576 TOS=0x00 PREC=0x00 TTL=115 ID=23513 PROTO=TCP SPT=9030 DPT=56772
WINDOW=65535 RES=0x00 ACK URGP=0
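The exploration questions above can be partly answered in code. The following is a minimal sketch, assuming iptables-style lines like the first sample, where most features appear as KEY=value tokens. The names `explore` and `infer_type` are hypothetical helpers, and the type rules are deliberately rough.

```python
import re

# First sample line from the slide, joined back into a single line.
SAMPLE = (
    "Nov 7 09:14:46 fwbox kernel: DROPPED IN=eth0 OUT= "
    "MAC=00:0c:29:e3:45:bd:00:0c:29:b5:5c:ee:08:00 SRC=10.1.222.31 "
    "DST=10.1.222.202 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=63849 DF "
    "PROTO=TCP SPT=58485 DPT=9111 WINDOW=5840 RES=0x00 SYN URGP=0"
)

# Upper-case key, '=', then a possibly empty run of non-space characters.
KEY_VALUE = re.compile(r"\b([A-Z]+)=(\S*)")


def explore(line):
    """Return the KEY=value features found in one log line as a dict."""
    return dict(KEY_VALUE.findall(line))


def infer_type(value):
    """Guess a coarse data type for a field value (illustrative rules only)."""
    if value == "":
        return "empty"
    if value.isdigit():
        return "int"
    if re.fullmatch(r"0x[0-9a-fA-F]+", value):
        return "hex"
    if re.fullmatch(r"\d+\.\d+\.\d+\.\d+", value):
        return "ipv4"
    return "str"


if __name__ == "__main__":
    for key, value in explore(SAMPLE).items():
        print(key, repr(value), infer_type(value))
```

Flags without a value, such as DF and SYN, are not captured by this pattern and would need separate handling.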
7. Parsing and Normalization
‣ Parsing
‣ extraction of entities / features
‣ imposing structure
‣ often use regexes
‣ Normalize
‣ field normalization
‣ term normalization: block, deny, dropped
‣ Generate a common output format for vis-tools (e.g., CSV)
Oct 13 20:00:43.874401 rule 193/0(match): block in on xl0:
212.251.89.126.3859 >: S 1818630320:1818630320(0) win 65535 <mss
1460,nop,nop,sackOK> (DF)
Oct 13 20:00:43 fwbox local4:warn|warning fw07 %PIX-4-106023: Deny tcp
src internet: 212.251.89.126/3859 dst 212.254.110.98/135 by
access-group "internet_access_in"
Oct 13 20:00:43 fwbox kernel: DROPPED IN=eth0 OUT=
MAC=ff:ff:ff:ff:ff:ff:00:0f:cc:81:40:94:08:00 SRC=212.251.89.126
DST=212.254.110.98 LEN=576 TOS=0x00 PREC=0x00 TTL=255 ID=8624
PROTO=TCP SPT=3859 DPT=135 LEN=556
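Term normalization can be illustrated with a small lookup. This is a sketch only: the mapping `CANONICAL_ACTION` and the function `normalize_action` are hypothetical names, and picking the first matching word in a line is a naive heuristic rather than a robust parser.

```python
import re

# Different firewalls use different words for the same action.
# The canonical labels "block" and "pass" are an assumption for illustration.
CANONICAL_ACTION = {
    "block": "block",
    "deny": "block",
    "dropped": "block",
    "pass": "pass",
}

WORD = re.compile(r"[A-Za-z]+")


def normalize_action(line):
    """Return the canonical action for the first known action word, else None."""
    for word in WORD.findall(line):
        action = CANONICAL_ACTION.get(word.lower())
        if action is not None:
            return action
    return None


if __name__ == "__main__":
    print(normalize_action("fw07 %PIX-4-106023: Deny tcp src internet"))
    print(normalize_action("fwbox kernel: DROPPED IN=eth0 OUT="))
    print(normalize_action("rule 193/0(match): block in on xl0:"))
```

With the action term unified, records from all three log formats can share one "action" column in the common CSV output.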
8. Parser
Raw:
Oct 13 20:00:38.018152 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 62.2.32.250.53: 34388 [1au][|domain] (DF)
Oct 13 20:00:38.115862 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 192.134.0.49.53: 49962 [1au][|domain] (DF)
Oct 13 20:00:38.157238 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 194.25.2.133.53: 14434 [1au][|domain] (DF)
Regex / Parser:
(.*) rule ([-\d]+/\d+)\((.*?)\): (pass|block) (in|out) on (\w+): (\d+\.\d+\.\d+\.\d+)\.?(\d*) [<>] (\d+\.\d+\.\d+\.\d+)\.?(\d*): (.*)
Normalized (CSV):
Oct 13 20:00:38.018152,57/0,match,pass,in,xl1,195.141.69.45,1030,62.2.32.250,53,34388 [1au][|domain] (DF)
Oct 13 20:00:38.115862,57/0,match,pass,in,xl1,195.141.69.45,1030,192.134.0.49,53,49962 [1au][|domain] (DF)
Oct 13 20:00:38.157238,57/0,match,pass,in,xl1,195.141.69.45,1030,194.25.2.133,53,14434 [1au][|domain] (DF)
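The raw-to-CSV step on this slide can be sketched in Python. The pattern below is the slide's regex with its backslash escapes written out, which is an assumption about what the original looked like. The names `PF_LINE`, `parse_line`, and `to_csv` are hypothetical.

```python
import csv
import io
import re

# One capture group per output column: timestamp, rule, match state, action,
# direction, interface, source IP/port, destination IP/port, remainder.
PF_LINE = re.compile(
    r"(.*) rule ([-\d]+/\d+)\((.*?)\): (pass|block) (in|out) on (\w+): "
    r"(\d+\.\d+\.\d+\.\d+)\.?(\d*) [<>] "
    r"(\d+\.\d+\.\d+\.\d+)\.?(\d*): (.*)"
)

RAW = (
    "Oct 13 20:00:38.018152 rule 57/0(match): pass in on xl1: "
    "195.141.69.45.1030 > 62.2.32.250.53: 34388 [1au][|domain] (DF)"
)


def parse_line(line):
    """Return the captured fields as a list, or None if the line does not match."""
    match = PF_LINE.match(line.strip())
    return list(match.groups()) if match else None


def to_csv(lines):
    """Parse every matching line and return the rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        row = parse_line(line)
        if row is not None:  # silently skip lines in other formats
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv([RAW]), end="")
```

Lines that do not match are skipped here. In practice it may be worth logging them, since unmatched lines often reveal a format variant the regex has not accounted for.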
11. Data Cleansing
‣ Filter
‣ Normalize (see earlier)
‣ Aggregation
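Filtering and aggregation can be sketched on already-parsed records. The record layout (dicts with `src`, `dst`, and `dport` keys) and the helper names `filter_rows` and `aggregate` are assumptions for illustration; the values come from the parser example's log lines.

```python
from collections import Counter

# Parsed records, using values from the pf log lines shown earlier.
ROWS = [
    {"src": "195.141.69.45", "dst": "62.2.32.250", "dport": 53},
    {"src": "195.141.69.45", "dst": "192.134.0.49", "dport": 53},
    {"src": "195.141.69.45", "dst": "194.25.2.133", "dport": 53},
    {"src": "212.251.89.126", "dst": "212.254.110.98", "dport": 135},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def aggregate(rows, key):
    """Count rows per value of key(row)."""
    return Counter(key(row) for row in rows)


if __name__ == "__main__":
    dns = filter_rows(ROWS, lambda r: r["dport"] == 53)
    print(aggregate(dns, lambda r: r["src"]))
```

Aggregating before visualizing keeps the number of plotted elements manageable, for example one node per source address instead of one per log line.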
12. Load CSV into Database
Sometimes you just load your data into a tool, and you can omit this step.
# mysql -u <user> -p
mysql> create database data;
mysql> use data;
mysql> create table set1 (id int, address varchar(20), ...);
mysql> LOAD DATA LOCAL INFILE 'input_file' INTO TABLE set1 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
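The same load step can be sketched without a MySQL server by swapping in Python's built-in sqlite3 module. This is a substitution for illustration, not the slide's method. The table uses only the two columns the slide names (`id`, `address`); the remaining columns are elided in the original and are left out here. `load_csv` is a hypothetical helper.

```python
import csv
import io
import sqlite3


def load_csv(conn, csv_text):
    """Insert id,address rows from CSV text into table set1; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS set1 (id INTEGER, address VARCHAR(20))"
    )
    rows = [
        (int(fields[0]), fields[1])
        for fields in csv.reader(io.StringIO(csv_text))
        if fields  # skip blank lines
    ]
    conn.executemany("INSERT INTO set1 (id, address) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # throwaway in-memory database
    load_csv(connection, "1,195.141.69.45\n2,62.2.32.250\n")
    print(connection.execute("SELECT * FROM set1").fetchall())
```

Once the data is in a table, filtering and aggregation become SQL queries, which is often quicker than rewriting scripts for each new question.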
13. Contextual Data
‣ Either dump into DB or use via API calls to augment
‣ IP -> Geo mapping
‣ Information about countries
‣ Port number -> service name
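Augmenting records with contextual data can be sketched as a lookup join. The small `SERVICE_NAMES` table below is a hand-picked subset of well-known port assignments used purely for illustration; a real setup would load a full table into the database or query an API, as the slide suggests. `augment` is a hypothetical helper.

```python
# Illustrative subset of well-known port -> service name assignments.
SERVICE_NAMES = {
    22: "ssh",
    53: "domain",
    80: "http",
    135: "epmap",
    443: "https",
}


def augment(record, services=SERVICE_NAMES):
    """Return a copy of the record with a 'service' field added."""
    enriched = dict(record)  # leave the input record untouched
    enriched["service"] = services.get(record["dport"], "unknown")
    return enriched


if __name__ == "__main__":
    print(augment({"dst": "62.2.32.250", "dport": 53}))
    print(augment({"dst": "10.1.222.202", "dport": 9111}))
```

IP-to-geo and country information follow the same pattern: a key from the log record is looked up in an external table and the result is added as new columns.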
14. Feature Selection
‣ What are the fields you are interested in?
‣ Compute new fields
‣ start time, end time -> duration
‣ IP subnets [ 10.2.4.2 -> 10.0.0.0/8 or 192.168.1.2 -> 192.168.1.0/24 ]
‣ Entropy: H(X) = E(I(X))
‣ Dimensionality reduction
‣ See Bryan’s talk!
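The computed fields listed above can be sketched with the Python standard library. The function names are hypothetical. The timestamp format is assumed to match the pf log lines shown earlier (no year), and entropy is computed as Shannon entropy in bits over the observed value frequencies, which is one common reading of H(X) = E(I(X)).

```python
import ipaddress
import math
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%b %d %H:%M:%S.%f"  # e.g. "Oct 13 20:00:38.018152"


def duration_seconds(start, end):
    """Seconds between two timestamps that fall in the same year."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


def subnet(address, prefix_length):
    """Map an IPv4 address to its enclosing network, e.g. 10.2.4.2 -> 10.0.0.0/8."""
    network = ipaddress.ip_network(f"{address}/{prefix_length}", strict=False)
    return str(network)


def entropy(values):
    """Shannon entropy in bits: H(X) = -sum(p(x) * log2(p(x)))."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(duration_seconds("Oct 13 20:00:38.018152", "Oct 13 20:00:38.157238"))
    print(subnet("10.2.4.2", 8), subnet("192.168.1.2", 24))
    print(entropy([53, 53, 135, 135]))
```

A field whose entropy is near zero carries little information and is a candidate to drop, while a very high-entropy field (such as a packet ID) may need aggregation before it is useful in a visualization.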