4. Definitions
Data collection is the systematic process of gathering detailed information about a desired objective
from a selected sample under controlled settings.
Nature, scope and objective of research: the selected data collection method should always maintain a balance among the nature, scope and objectives of the study.
Budget: the availability of funds for the research project determines to a large extent which method is suitable for the collection of data.
Time: the prefixed time frame for the research project also has to be taken into account when deciding on a particular method of data collection.
Sufficient knowledge: Proper procedure and required
5. Primary Data
Primary data means original data that has been collected specifically for the purpose in mind. It means
someone collected the data first hand from the original source. Data collected this way is called primary data.
Primary data has not been published yet and is more reliable, authentic and objective. Primary data has not
been changed or altered by others; therefore its validity is greater than that of secondary data.
6. Secondary Data
Secondary data is data that has already been collected by others and is readily available from other sources.
When we apply statistical methods to primary data collected for another purpose, we refer to it as secondary
data: one purpose's primary data is another purpose's secondary data. Secondary data is therefore data that is
being reused. Such data are more quickly obtainable than primary data.
These secondary data may be obtained from many sources, including literature,
industry surveys, compilations from computerized databases and information
systems, and computerized or mathematical models of environmental processes.
7. Qualitative Methods
Exploratory in nature, these methods are mainly concerned with gaining insights into and understanding of
underlying reasons and motivations, so they tend to dig deeper. Since the answers cannot be easily quantified,
measurability becomes an issue. This lack of measurability leads to a preference for methods or tools that are
largely unstructured or, in some cases, structured only to a very small, limited extent.
Generally, qualitative methods are time-consuming and expensive to conduct, so researchers try to lower the
costs incurred by decreasing the sample size or number of respondents.
8. Quantitative Methods
Data can be readily quantified and generated in numerical form, which is then converted and processed
into useful information mathematically. The result is often in the form of statistics that are meaningful
and, therefore, useful. Unlike qualitative methods, quantitative techniques usually make use of larger
sample sizes, because their measurable nature makes that possible and easier.
9. Face-to-Face Interviews
This is considered to be the most common data collection instrument for qualitative research, primarily
because of its personal approach. The interviewer collects data directly from the subject (the interviewee)
in a one-on-one, face-to-face interaction. This is ideal when the data to be obtained must be highly
personalized.
Generally, the face-to-face interview is a qualitative method.
10. Surveys/Questionnaires
Questionnaires often use a structure composed of short questions.
Qualitative questionnaires are usually open-ended, with the respondents asked to provide detailed answers
in their own words. It is almost like answering essay questions.
Quantitative paper surveys pose closed questions, with the answer options provided. The respondents only
have to choose their answer among the choices provided on the questionnaire.
11. Observation
Observation can be done with the researcher taking a participatory stance or not, immersing himself in the
setting where his respondents are, generally taking a look at everything while taking down notes.
A researcher taking notes and interacting is using a qualitative method.
Quantitative observation is the case in which the data is collected through systematic observation, measuring
specific aspects or using devices that record events (such as GPS devices or mobile phones).
12. Temporal dimension: Longitudinal data collection
This is a research or data collection method that is performed
repeatedly, on the same data sources, over an extended period of
time. It is an observational research method that could even cover a
span of years and, in some cases, even decades. The goal is to find
correlations through an empirical or observational study of subjects
with a common trait or characteristic.
13. Case Study
Data is gathered by taking a close look and an in-depth analysis of a “case study”
or “case studies” – the unit or units of research that may be an individual, a group
of individuals, or an entire organization. This methodology’s versatility is
demonstrated in how it can be used to analyze both simple and complex subjects.
There is a risk of bias due to undersampling.
14. Can we estimate Country well-being using new Big Data sources?
We studied human behavior through the lens of phone data records by means of
new statistical indicators that quantify and possibly “nowcast” the well-being and the
socio-economic development of a territory.
15. What defines the human division of territory?
Cities are placed in particular areas for a number of good reasons: communication routes, natural
resources, migration flows. But once cities are located in a given spot, who decides where one
city ends and another begins?
Network analysis can be useful in this context, because it can provide an objective way to divide the
territory according to a particular theory.
16. What is the effect of Topics/Posts Recommendation systems in Social
Networks?
Algorithmic bias amplifies the opinion polarization of users by showing them only a specific view of
reality (their own).
18. Big Data: How much data?
" Google processes 20 PB a day (2008)
" Wayback Machine has 3 PB + 100 TB/month
(3/2009)
" Facebook has 2.5 PB of user data + 15 TB/day
(4/2009)
" eBay has 6.5 PB of user data + 50 TB/day (5/2009)
" CERN’s Large Hydron Collider (LHC) generates 15
PB a year
640K ought to be enough
for anybody.
20. Velocity (Speed)
" Data is begin generated fast and need to be processed fast
" Online Data Analytics
" Late decisions ➔ missing opportunities
" Examples
○ E-Promotions: Based on your current location, your purchase
history, what you like ➔ send promotions right now for store next
to you
○ Healthcare monitoring: sensors monitoring your activities and
body ➔ any abnormal measurements require immediate reaction
21. Real-time/Fast Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
" The progress and innovation is no longer hindered by the ability to collect data
" But, by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
22. Variety (Complexity)
" Relational Data (Tables/Transaction/Legacy Data)
" Text Data (Web)
" Semi-structured Data (XML)
" Graph Data
○ Social Network, Semantic Web (RDF), …
" Streaming Data
○ You can only scan the data once
" A single application can be generating/collecting
many types of data
" Big Public Data (online, weather, finance, etc)
To extract knowledge ➔ all these types of data need to be linked together.
24. The Model Has Changed…
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
25. Big Data vs Small Data
Big Data is not always the right choice:
" Bigger data may lead to an overly simplistic, general understanding of the phenomena.
" It may contain biases or prejudices.
" It may encourage bad analyses.
27. Basic statistics: Mean
The arithmetic mean, more commonly known as “the average,” is the sum of a list of numbers divided by the
number of items on the list. The mean is useful in determining the overall trend of a data set or providing a rapid
snapshot of your data. Another advantage of the mean is that it’s very easy and quick to calculate.
Pitfall:
Taken alone, the mean is a dangerous tool. In some data sets, the mean is also closely related to the mode and
the median (two other measurements near the average). However, in a data set with a high number of outliers
or a skewed distribution, the mean simply doesn’t provide the accuracy you need for a nuanced decision.
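To make the pitfall concrete, the following minimal Python sketch (with made-up salary figures) shows how a single outlier drags the mean while barely moving the median:

    import statistics

    salaries = [30_000, 32_000, 35_000, 38_000, 40_000]     # a small, well-behaved sample
    salaries_with_outlier = salaries + [1_000_000]          # one extreme value added

    print(statistics.mean(salaries))                 # 35000
    print(statistics.mean(salaries_with_outlier))    # about 195833, dragged up by the outlier
    print(statistics.median(salaries_with_outlier))  # 36500, barely affected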
28. Basic statistics: Standard Deviation
The standard deviation, often represented with the Greek letter sigma, is a measure of the spread of data
around the mean. A high standard deviation signifies that the data is spread widely around the mean, whereas a
low standard deviation signals that more data points lie close to the mean. In a portfolio of data analysis methods, the
standard deviation is useful for quickly determining the dispersion of data points.
Pitfall:
Just like the mean, the standard deviation is deceptive if taken alone. For example, if the data have a very
strange pattern such as a non-normal curve or a large amount of outliers, then the standard deviation won’t give
you all the information you need.
29. Basic statistics: Quartile/Percentile
The median is central to many experimental data sets, and it is important to calculate the median in such
cases rather than falling into the trap of reporting only the arithmetic mean.
The quartile is a useful concept in statistics and is conceptually similar to the median. The first quartile is
the data point at the 25th percentile, and the third quartile is the data point at the 75th percentile. The
50th percentile is the median.
The median is a measure of the central tendency of the data but says nothing about how the data is
distributed in the two arms on either side of the median. Quartiles help us measure this.
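A quick sketch of how these quantities are typically computed in Python (illustrative numbers):

    import numpy as np

    data = np.array([2, 4, 4, 5, 7, 9, 11, 12, 15, 21])

    q1, median, q3 = np.percentile(data, [25, 50, 75])
    print(q1, median, q3)    # first quartile, median, third quartile
    print(q3 - q1)           # interquartile range: spread of the central 50% of the data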
30. Basic statistics: Regression
Regression models the relationships between dependent and explanatory variables, which are usually charted
on a scatterplot. The regression line also designates whether those relationships are strong or weak.
Regression is commonly taught in high school or college statistics courses with applications for science or
business in determining trends over time.
Pitfall:
Sometimes, the outliers on a scatterplot (and the reasons for them) matter significantly. For example, an outlying
data point may represent the input from your most critical supplier or your highest selling product. The nature of
a regression line, however, tempts you to ignore these outliers.
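A minimal sketch (with invented data) of how strongly one outlying point can tilt a least-squares regression line:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])     # roughly y = 2x

    slope, intercept = np.polyfit(x, y, 1)
    print(slope, intercept)                            # close to 2 and 0

    y_out = y.copy()
    y_out[-1] = 40.0                                   # one extreme point
    slope_o, intercept_o = np.polyfit(x, y_out, 1)
    print(slope_o, intercept_o)                        # line visibly tilted by the outlier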
31. Redundant Attributes
" An attribute is redundant when it can be derived from another
attribute or set of them.
" Redundancy is a problem that should be avoided
○ It increases the data size ➔ modeling time for DM algorithms increases
○ It also may induce overfitting
" Redundancies in attributes can be detected using correlation
analysis
32. " Correlation Test quantifies the correlation among two nominal
attributes contain c and r different values each:
" where oij is the frequency of (Ai,Bj) and:
Redundant Attributes
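The test statistic, in its standard textbook form, is:

    chi^2 = sum_{i=1..c} sum_{j=1..r} (oij − eij)^2 / eij,   with   eij = count(A = Ai) · count(B = Bj) / m

where eij is the expected frequency of the pair (Ai, Bj) under independence and m is the number of instances. A large chi^2 value (relative to the chi-square distribution with (c − 1)·(r − 1) degrees of freedom) suggests that A and B are correlated, so one of them may be redundant.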
33. " for numerical attributes Pearson’s product moment coefficient is widely
" where m is the number of instances, and A̅ ,B̅ are the mean values of
attributes A and B.
" Values of r close to +1 or -1 may indicate a high correlation among A
and B.
Redundant Attributes
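A standard formulation of the coefficient is:

    r_AB = sum_{i=1..m} (ai − A̅)(bi − B̅) / (m · σA · σB)

where ai and bi are the values of A and B in the i-th instance, and σA and σB are the standard deviations of A and B.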
35. What is Data Quality?
Data quality refers to the ability of a set of data to serve an intended purpose.
Low-quality data cannot be used effectively to do what you wish to do with it (really!?).
Remember that your data is rarely going to be perfect, and that you have to juggle
managing your data quality with actually using the data.
36. DQ Measures I
Completeness
Completeness is defined as how much of a data set is populated, as opposed to being left blank. For instance, a survey
would be 70% complete if it is completed by 70% of people. To ensure completeness, all data sets and data items must be
recorded.
Uniqueness
This metric assesses how unique a data entry is, and whether it is duplicated anywhere else within your database. Uniqueness is
ensured when the piece of data has only been recorded once. If there is no single view, you may have to dedupe it.
Timeliness
How recent is your data? This essential criterion assesses how useful or relevant your data may be based on its age. Naturally, if an
entry is dated by, for instance, 12 months, the scope for dramatic changes in the interim may render the data useless.
37. DQ Measures II
Validity
Simply put, does the data you've recorded reflect what type of data you set out to record? So if you ask for
somebody to enter their phone number into a form, and they type 'sjdhsjdshsj', that data isn't valid, because it
isn't a phone number - the data doesn't match the description of the type of data it should be.
Accuracy
Accuracy determines whether the information you hold is correct or not, and isn't to be confused with validity, a
measure of whether the data is actually the type you wanted.
Consistency
For anyone trying to analyse data, consistency is a fundamental consideration. Basically, you need to ensure
you can compare data across data sets and media (whether it's on paper, on a computer file, or in a database) -
is it all recorded in the same way, allowing you to compare the data and treat it as a whole?
39. Data Cleaning
" The sources of dirty data include
○ data entry errors,
○ data update errors,
○ data transmission errors and even bugs in the data processing system.
" Dirty data usually is presented in two forms: missing data (MVs) and wrong (noisy) data.
40. Data Cleaning
" The way of handling MVs and noisy data is quite different:
○ The instances containing MVs can be ignored, filled in manually or with a constant or filled in by using estimations over
the data
○ For noise, basic statistical and descriptive techniques can be used to identify outliers, or filters can be applied to
eliminate noisy instances
41. What Are Outliers?
" Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different
mechanism
" Outliers are different from the noise data
○ Noise is random error or variance in a measured variable
○ Noise should be removed before outlier detection
" Outliers are interesting: It violates the mechanism that generates the normal data
" Outlier detection vs. novelty detection: early stage, outlier; but later merged into the model
42. Types of Outliers
" Three kinds: global, contextual and collective outliers
" Global outlier (or point anomaly)
○ Object is Og if it significantly deviates from the rest of the data set
○ Ex. Intrusion detection in computer networks
○ Issue: Find an appropriate measurement of deviation
" Contextual outlier (or conditional outlier)
○ Object is Oc if it deviates significantly based on a selected context
○ Attributes of data objects should be divided into two groups
■ Contextual attributes: defines the context, e.g., time & location
■ Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
43. Outlier Detection: Statistical Methods
" Statistical methods (also known as model-based methods) assume
that the normal data follow some statistical model (a stochastic
model)
○ The data not following the model are outliers.
■ Effectiveness of statistical methods: it highly depends on whether the assumed
statistical model holds for the real data
■ There is a rich variety of statistical models to choose from
44. Outlier Detection: Proximity-Based Methods
" An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity
of the object is significantly deviates from the proximity of most of the other objects in the
same data set
" The effectiveness of proximity-based methods highly relies on the proximity measure.
" In some applications, proximity or distance measures cannot be obtained easily.
" Often have a difficulty in finding a group of outliers which stay close to each other
" Two major types of proximity-based outlier detection
○ Distance-based vs. density-based
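As an illustration of the distance-based idea, the following sketch scores each point by its average distance to its k nearest neighbours (a simplified, purely illustrative criterion; real detectors such as LOF refine it):

    import numpy as np

    def knn_outlier_scores(points, k=3):
        points = np.asarray(points, dtype=float)
        # pairwise Euclidean distances between all points
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        # average distance to the k nearest neighbours (column 0 is the point itself)
        return np.sort(dists, axis=1)[:, 1:k + 1].mean(axis=1)

    data = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [8.0, 8.0]]   # last point is isolated
    print(knn_outlier_scores(data, k=3))   # the isolated point gets by far the largest score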
45. Handle Missing Data
There are two most commonly recommended ways of dealing with missing data:
" Dropping observations that have missing values
" Imputing the missing values based on other observations
46. Drop Data
Dropping missing values is sub-optimal because when you drop observations,
you drop information.
The fact that the value was missing may be informative in itself.
Plus, in the real world, you often need to make predictions on new data even if
some of the features are missing!
You can drop vertically (an entire feature of the data) or horizontally (some entries in your
data).
47. Imputing missing values
Missing categorical data
The best way to handle missing data for categorical features is to simply label them as
’Missing’!
" You’re essentially adding a new class for the feature. This tells the algorithm that the
value was missing.
Missing numeric data
For missing numeric data, you can fill in the empty values by:
" Filling in with the mean
" Filling in with a special value
" Allowing an algorithm to estimate the values
48. Data Normalization
" Sometimes the attributes selected are raw attributes.
○ They have a meaning in the original domain from where they were
obtained
○ They are designed to work with the operational system in which they are
being currently used
" Usually these original attributes are not good enough to obtain accurate
predictive models
49. Data Normalization
" It is common to perform a series of manipulation steps to transform the
original attributes or to generate new attributes
○ They will show better properties that will help the predictive power of the
model
" The new attributes are usually named modeling variables or analytic
variables.
50. Data Normalization
Min-Max Normalization
" The min-max normalization aims to scale all the numerical values v of a
numerical attribute A to a specified range denoted by [new − minA, new −
maxA].
" The following expression transforms v to the new value v’:
51. Data Normalization
Z-score Normalization
" If minimum or maximum values of attribute A are not known, or the data is
noisy, or is skewed, the min-max normalization is good
" Alternative: normalize the data of attribute A to obtain a new distribution with
mean 0 and std. deviation equal to 1
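This is the z-score transformation, whose standard form is:

    v’ = (v − A̅) / σA

where A̅ and σA are the mean and the standard deviation of attribute A.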
53. Data Transformation
" It is the process to create new attributes
○ Often called transforming the attributes or the attribute set.
" Data transformation usually combines the original raw attributes using
different mathematical formulas originated in business models or pure
mathematical formulas.
54. Data Transformation
Linear Transformations
" Normalizations may not be enough to adapt the data to improve the
generated model.
" Aggregating the information contained in various attributes might be beneficial
" If B is an attribute subset of the complete set A, a new attribute Z can be
obtained by a linear combination:
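Written out, such a linear combination has the form:

    Z = r1·B1 + r2·B2 + … + rp·Bp

where B1, …, Bp are the attributes in B and each ri is a real-valued weight chosen (or learned) to combine them.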
55. Data Transformation
Quadratic Transformations
" In quadratic transformations a new attribute is built as follows
" where ri,j is a real number.
" These kinds of transformations have been thoroughly studied and can help to
transform data to make it separable.
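A common formulation of such a quadratic transformation over attributes A1, …, Ap is:

    Z = r1,1·A1·A1 + r1,2·A1·A2 + … + rp,p·Ap·Ap

i.e. a weighted sum of all pairwise products of the original attributes, where ri,j is a real number.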
56. Data Reduction
" When the data set is very large, performing complex analysis and DM can
take a long computing time
" Data reduction techniques are applied in these domains to reduce the size
of the data set while trying to maintain the integrity and the information of the
original data set as much as possible
" Mining on the reduced data set will be much more efficient and it will also
resemble the results that would have been obtained using the original data
set.
57. Data Reduction
" The use of binning and discretization techniques is also useful to reduce the
dimensionality and complexity of the data set.
" They convert numerical attributes into nominal ones, thus drastically reducing
the cardinality of the attributes involved
58. Data Reduction
" Dimensional reduction techniques:
○ Projection
○ Low Variance Filter
○ High Correlation Filter
○ Principal Component Analysis (PCA)
○ Backward Feature Elimination
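As a sketch of one of these techniques, PCA with scikit-learn projects the data onto a few principal components (the Iris dataset is used only as an example):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)            # 150 instances, 4 numeric attributes
    X_reduced = PCA(n_components=2).fit_transform(X)

    print(X.shape, "->", X_reduced.shape)        # (150, 4) -> (150, 2)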
61. Data Mining/Machine Learning
" Objective: Fit data to a model
" Potential Result: Higher-level meta information that may not be obvious when
looking at raw data. Patterns and Models.
62. Find patterns and models?
" Clusters: Clustering algorithms are often applied to automatically group similar instances or objects in clusters (groups). The goal
is to summarize the data to better understand the data or take decision.
" Classification models: Classification algorithms aims at extracting models that can be used to classify new instances or objects
into several categories (classes).
" Patterns and associations: Several techniques are developed to extract frequent patterns or associations between values in
database.
" Anomalies/outliers: The goal is to detect things that are abnormal in data (outliers or anomalies).
" Trends, regularities: Techniques can also be applied to find trends and regularities in data.
In general, the goal of data mining is to find interesting patterns. What is interesting?
(1) it is easy to understand;
(2) it is valid for new data (not just for previous data);
(3) it is useful;
(4) it is novel or unexpected (it is not something that we already know).
64. Classification problem
" What we have
○ A set of objects, each of them described by some features
■ people described by age, gender, height, etc.
■ bank transactions described by type, amount, time, etc.
" What we want to do
○ Associate the objects of a set to a class, taken from a predefined list
■ “good customer” vs. “churner”
■ “normal transaction” vs. “fraudulent”
■ “low risk patient” vs. “risky”
(Scatterplot: objects described by Feature 1, e.g. Age (50y, 60y), and Feature 2, e.g. Income (15k€, 35k€); objects with unknown class are marked with "?".)
65. Classification problem
" What we know
○ No domain knowledge or theory
○ Only examples: Training Set
■ Subset of labelled objects
" What we can do
○ Learn from examples
○ Make inferences about the other objects
66. The most stupid classifier
" Rote learner
○ To classify object X, check if there is a labelled example in the training set identical to X
○ Yes ➔ X has the same label
○ No ➔ I don’t know
67. Classify by similarity
" K-Nearest Neighbors
○ Decide label based on K most similar examples
K=3
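A minimal scikit-learn sketch of the idea (the features, labels and numbers are invented for illustration):

    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[50, 15_000], [60, 35_000], [25, 18_000], [30, 22_000]]   # [age, income]
    y_train = ["churner", "good customer", "churner", "good customer"]

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    print(knn.predict([[55, 30_000]]))   # label decided by the 3 most similar examples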
76. Clustering: K-means (family)
" Output 2: K representative objects (centroids)
" Centroid = average profile of the objects in the cluster
Example (K = 3): each centroid is an average profile, e.g. avg. age, avg. weight, avg. income, avg. n. of children, …
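A minimal scikit-learn sketch of K-means with K = 3 (toy age/income data; each centroid is the average profile of its cluster):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[25, 20_000], [27, 22_000], [45, 60_000],
                  [47, 65_000], [70, 30_000], [72, 28_000]])     # [age, income]

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)             # cluster assignment of each object
    print(kmeans.cluster_centers_)    # the 3 centroids: average age and income per cluster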
79. Community detection
" Equivalent to clustering in the world of networks
" Some of our objects are linked
" Linked objects are more
likely to belong to the
same group
○ E.g. users exchanging emails
" Links can be weighted
○ E.g.: n. of emails exchanged
80. Community detection
" Objective
○ Identify strongly connected
subgroups that are weakly
connected to the others
" General methodology
○ Find weak connections (small set
of links that are “bridges”)
○ Remove them
○ Each connected component
remaining is a community
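A small NetworkX sketch of the "remove the bridges" idea, using the Girvan-Newman algorithm (edge betweenness identifies the weak connections; the graph is a toy example):

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    G = nx.Graph()
    G.add_edges_from([
        ("a", "b"), ("b", "c"), ("a", "c"),    # first tightly connected group
        ("d", "e"), ("e", "f"), ("d", "f"),    # second tightly connected group
        ("c", "d"),                            # the bridge between them
    ])

    communities = next(girvan_newman(G))        # first split: the bridge is removed
    print([sorted(c) for c in communities])     # [['a', 'b', 'c'], ['d', 'e', 'f']]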
81. Frequent patterns
" Events or combinations of events that appear frequently in the data
" E.g. items bought by customers of a supermarket
83. Frequent patterns
Association rules
If items A1, A2, … appear in a basket, then also B1, B2, … will appear
there
Notation: A1, A2, … => B1, B2, … [ C%]
C = confidence, i.e. conditional probability
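A tiny sketch of how the confidence of a rule is computed from transactions (toy baskets; confidence is the conditional probability of the consequent given the antecedent):

    baskets = [
        {"bread", "milk"},
        {"bread", "milk", "butter"},
        {"bread", "butter"},
        {"milk", "coffee"},
        {"bread", "milk", "coffee"},
    ]

    antecedent, consequent = {"bread", "milk"}, {"butter"}
    with_a   = [b for b in baskets if antecedent <= b]
    with_a_b = [b for b in with_a if consequent <= b]

    print(len(with_a_b) / len(with_a))   # 1 of the 3 baskets with {bread, milk} also has butter -> 0.33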
86. Collaborative Filtering
" Goal: predict what movies/books/… a person may be interested in, on the basis of
○ Past preferences of the person
○ Other people with similar past preferences
○ The preferences of such people for a new movie/book/…
" One approach based on repeated clustering
○ Cluster people on the basis of preferences for movies
○ Then cluster movies on the basis of being liked by the same clusters of people
○ Again cluster people based on their preferences for (the newly created clusters of) movies
○ Repeat above till equilibrium
" Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find
information of interest
87. Deep learning
Raw representation: age, weight, income, children, likes sport, likes reading, education, … (e.g. 35, 65, 23 k€, 2, 0.3, 0.6, high, …)
Higher-level representation: young parent, fit sportsman, highly-educated reader, rich obese, … (e.g. 0.9, 0.1, 0.8, 0.0, …)
The objective is to learn a high-level representation of the data automatically from (almost) raw input. This is done automatically using examples and reinforcement.
92. Training and Test
The data is usually split into training data and test data. The training set contains a known output, and the
model learns on this data in order to generalize to other data later on. The test dataset (or subset) is used to
test the model’s predictions on data it has not seen before.
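A minimal scikit-learn sketch of this split (any model could be used; a decision tree and the Iris dataset are only examples):

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # learn on the training set
    y_pred = model.predict(X_test)                           # predict on unseen data
    print(accuracy_score(y_test, y_pred))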
93. Accuracy
The accuracy of a prediction is the number of correct predictions over the total number of predictions:
Accuracy = (True Positives + True Negatives) / Total
95. Accuracy and F1 score
The F1 score is the harmonic average of the precision and recall, where an
F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
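In their standard form, the underlying quantities are:

    Precision = True Positives / (True Positives + False Positives)
    Recall    = True Positives / (True Positives + False Negatives)
    F1        = 2 · (Precision · Recall) / (Precision + Recall)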
96. Comparisons
In order to assess the quality of the results of a prediction, it is possible to use “alternative”
models:
" Constant
" Random Models
" Simple Probabilistic Models
" …
" Your Model
" …
" Ideal
(models listed in order of increasing intelligence)
99. Objectives
Identify the category of each user (main activity to be chosen from fishing,
sailing, cruising and canoeing, as well as other categories such as boat type,
water type and type of area of preference). This will enable targeted marketing
operations, where the banners shown within the Navionics app will take into
consideration the category associated with the user.
100. Input DATA (1)
" Tracks: records of the trip performed by a user with Navionics app actived. Tracks are basically
sequences of GPS points that allow to reconstruct where and when the trip took place.
" Land: contains the geographical borders of land, used to remove points outside water, deemed not
interesting for this project.
" Sea Bottom: description of the type of bottom in each point in water. Local areas having the same sea
bottom type are represented in the data as a single geometric region.
101. Input DATA (2)
" Sonar: measures the water depth at each geographical point, worldwide. As for the sea bottom, local
areas having similar depth are represented in the data as a single geometric region. For each region a
minum and maxim depth are given. Usually the regions correspond to fixed intervals of depth, e.g.
mininum 100 feet and maximum 200 feet, which are fine-grained on shallow waters (intervals in the order
of the foot neat the coast) and coarse-grained on deep waters (intervals in the order of thousands of feet
in the middle of the ocean).
" Wrecks: stores the position of the wrecks localized by Navionics users – obviously it is a small fraction of
the wrecks really existing worldwide, although the coverage is better in the areas that are more popular
among Navionics users.
102. Pre-Processed DATA
" Water Types: In general Navionics uses a space tessellation covering the
world where each cell corresponds to a square of 0.125° x 0.125°, about
10Km2. In the original data water and land are represented as geometries
included in those cells. Using a clustering algorithm we identified bodies
of water, classifying them into lake (if they are closed), river or sea/ocean.
" Heat Map: a representative frequency map was extracted based on the
most recent segment of track data available, simply counting, for each
cell of the tessellation, the number of distinct users that visited it at least
once.
" Coastline: Joining several data sources from navionics we obtained the
costline in the entire world. A post-processing transformation is used in
order to simplify the geometries for computational issue.
103. Building the Water Activity Behavioral Model
The blue boxes represent the input data sources, including the users’ tracks, which are the keystone of
the process. A first set of processes (Coastline, Analyzing and Features) derives descriptive features
from the raw track data, with the aid of the context knowledge provided by the other data sources.
This set of features is then normalized in preparation for a clustering process that extracts a set of
representative behaviours, still without labels associated with them.
A set of tracks labeled by the domain experts is used to
assign label information to each cluster representative.
This information is later exploited as input for the
construction of a classification model to be used for
labeling new data.
104. From movement tracks to movement “components”
The raw track data has a few main issues that need to be treated
before any other step:
" Due to early switch ons and late switch offs of the app, some
tracks include points outside water, and therefore not useful
for our task. All these points are filtered out.
" A track very often contains a mix of different activities. In
particular, some parts of the track might be movement and
others are simple stops.
For these two reasons we proceeded to reconstruct the trajectories considering spatio-temporal
constraints instead of the track identifier coming from the app, and to decompose them into move
components and stop components.
105. COMPONENT Features (1)
" starthh, startdoy and startdow represent the hour of the day (0-23), the day
of the year (1-366) and the day of the week (1-7) of the beginning of the
component.
" lat represents the latitude of the beginning of the component.
" centercoast represents the distance between the central point (in terms of
time) of the component from the coastline, as computed for each point in
Section 5.
" freq represents the popularity of the cell (w.r.t. Navionics tessellation, see
Section 3) where the component spent most of the time.
" len and duration represent the duration of the component, respectively in terms
of points recorded and time spent.
" domwater and domsea report resp. the most frequent water type and most
frequent sea bottom type among the points in the component.
" domsea_perc is the percentage of points of the component that belong to the
dominant sea bottom type category.
Features vector
106. COMPONENT Features (2)
" depth, slope and speed are analyzed, represented by some standard indicators: 1Q, 2Q and
3Q (i.e. 25-th percentile, 50-th and 75-th) and the interquartile range (i.e. Q3 – Q1). This results
into 12 features, named qdepth_25, qdepth_50, …, qspeed_range.
" rangle represents the percentage (ratio) of points that have an angle larger than a fixed
threshold (by default 30°), thus measuring the frequency of turns of the boat.
" rwreck is the percentage of points that are close a wreck.
" type distinguishes stop components from move components (see Section 4.2).
" entropy is the mathematical entropy function computed over the set of heading values of the
component.
" Accelerations and decelerations. the speed at each point of a track is compared against
previous ones, in particular those that are more recent than a give temporal threshold (now fixed
to 2 minutes). If the present speed is higher than the minimum previous speed in such interval by
more than a fixed threshold (now fixed to a very small 0.55 m/s) and more than a fixed
percentage (now fixed to 20%), then the current point is considered an acceleration point
" Wandering. A rather frequent behaviour associate to fishing consists in wandering around the
same location, without ever really stopping, basically exploring an area and wait for the fish. In
terms of trajectories, that results in forming very entangled shapes.
107. CONTEXTS
All the features mentioned above were computed over the whole component. In order to get a more
detailed view of what happened during the component, we identify periods where something specific occurs,
named contexts, and then compute the same features mentioned above considering only the subgroups of
points just identified. In particular, we considered three contexts:
" Near-shore points
" Off-shore points
" Noodle points
In addition to the features described in the previous sections, we compute:
" the percentage of points of the component that belong to the context, e.g. the percentage of points
spent near-shore w.r.t. the total.
" maxl, the length of the longest contiguous sequence of points of the context, e.g. a boat might
perform several isolated noodles, therefore here we will measure only the longest one.
108. TRACK FEATURES
Finally, a few features are added to the component, that relate the component itself to the overall track it belongs
to:
" ncomponents is the number of components that compose the track.
" rcomponent is the percentage of points of the track that belong to our component.
" track_loop is the geographical distance between the first point of the first component and the last point of
the last component. Very small values identify loops, i.e. trips that “come back home” at the end, while high
values suggest that the track is part of a longer trip (e.g. a week-long cruise) or that the boat has no fixed
docking slot.
109. FEATURES SELECTION
Violating the non-redundancy assumption made by the clustering algorithm might lead to clusters that
are dominated by a few attributes and that therefore do not properly consider the information contained
in all the other features. For this reason, we started the clustering process with a selection of the
features that appeared to be well aligned with our current objectives, also avoiding excessive
correlations.
110. NORMALIZATION
We adopted a standard Z-score normalization, consisting in replacing each feature value with a new one as follows:
new_value = (original_value – average_value) / standard_deviation
111. RESCALING
Ad hoc rescaling factors can be applied to the features in order to force the algorithm to give more or
less importance to a given attribute. Through discussions with the domain experts and preliminary experiments,
we decided to rescale the following attributes:
" components that occurred over different types of water should be clearly separated. For this reason
domwater was given a high weight, by multiplying it by a factor of 10.
" similarly, stop components and move components represent very different things and should therefore be
kept separated. Therefore, the feature type was multiplied by 10.
" while the latitude feature is useful as a proxy of general climate conditions (tropical vs. polar, southern vs.
northern hemisphere), it risks making clusters too location-specific. For this reason its weight was
reduced through a multiplicative factor of 0.25.
112. K-MEANS
The components, represented by vectors of 32 features, are the input for a K-Means clustering. The K is selected in
order to obtain a trade-off between two objectives:
(i) have enough clusters to capture the different possible users’ behaviours;
(ii) keep the number of clusters small enough to make it feasible, for a domain expert, to observe and label a
reasonable number of sample components belonging to each cluster.
Clustering is an unsupervised algorithm, thus we discover a set of K unlabeled behaviors.
113. Design a Survey for the Experts
From the clustering result we created a survey
where the experts were asked to specify:
" For each component, its associated
activity
" The overall activity performed during the
track
" The most likely type of boat adopted
" The area (inshore/offshore/intra-coastal)
and type of water (salt/lake)
" Optional notes
The world has been divided into 6 macro-areas:
" United States East coast (USE)
" United States West coast (USW)
" Australia (Aus)
" Mediterranean see (Med)
" Scandinavia (Scand)
" United Kingdom (UK)
114. Expert Knowledge
Cruising turns out to be by far the most popular activity in the tracks of the training set, followed
by sailing and fishing. Very few canoeing tracks were identified. Also, fishing and cruising tracks
tend to be formed by several components of the same type (respectively ~2.9 and ~2.6 components per
track, compared to 1.7 for sailing).
Looking at the activities in the different geographical areas, it is clear that in the USE and USW
areas the distribution is well balanced, while in the Mediterranean fishing is slightly
underrepresented, in Australia both fishing and sailing are weak, and in the rest (UK and
Scandinavia) only sailing emerges significantly.
115. Building the Semantic Model
For each cluster we compute a probability distribution over the set of possible activities. This is done at two levels:
" component-level: the number of components labeled with that specific activity
" track-level: the number of components that belong to a track having that activity as overall labeling
The two counts, obtained for each activity, are summed up according to weights defined by the analyst: 0.85 for track-level labels and
0.15 for component-level ones.
Moreover, to exploit the uncertain information provided by the experts with the “?” sign, we also counted uncertain labels, with a weight set to
0.15.
116. Domain experts’ rules as meta-features for tracks
The domain experts provided a set of rules that tried to approximate their idea of fishing behaviour, cruising
behaviour, etc. We translated them into feature-based rules. Example:
IF at least one of the following apply:
" the component is in a “noodle” shape (r_noodles>=0.2)
" the component is slower than 10 knots (qspeed_75 <5.14) AND follows a slope greater than 55%
(qslope_50>=5)
" the component is slower than 10 knots (qspeed_75 <5.14) AND is shorter than 328 ft (len<=100)
" the component moves in several directions (entropy>2) AND is longer than 54 nm (len>100000)
THEN the component has a Fishing behaviour
117. BUILDING THE CLASSIFIER
A C4.5 algorithm is used to build a classification tree over vectors summarizing the results of all the previous processes. Each track is represented by a vector containing:
" the distribution from the cluster “Activity”
" the distribution from the cluster “Boat”
" the distribution from the cluster “Zone”
" the distribution from the Rules
In practice this new vector is a higher-level representation of the track, defined by the different distributions derived from its stop and move components.
118. TUNING THE CLASSIFIER
In order to find the decision tree that has the best accuracy and yet (where possible) does not lose any label, our
algorithms play with the two input parameters of C4.5:
" min-leaf: how many objects of the training set should end up in each leaf of the model. The larger this
value, the more “solid” the prediction provided by the leaf. Yet, larger values also imply that the tree
must have a smaller number of leaves, thus favoring simple models;
" conf-factor: the confidence factor of the leaves, i.e. how much the dominant label of a leaf should predominate over
the others. A very high value requires that leaves are basically pure, which implies that several splits are
performed, and therefore the model is more detailed.
119. Classification Results
As we can observe, the distributions are similar to those of the components in the training set.
The main differences include the fact that “cruising” is now more present in the USE, USW and
Mediterranean areas, whereas it dropped dramatically in the UK and Scandinavia.
Also, as already noticed in previous sections, “sailing” completely disappeared in Australia,
since the model did not capture that category there.
120. Distribution of Activities (USE) IN TIME
An interesting view on the data can be obtained by plotting the temporal distribution of the activities
over the whole period of data we had access to, i.e. from May 2014 to April 2016.
In addition to the usual seasonal behaviours – an overall increase of all activities in the summer
months – we can observe that fishing noticeably increased its presence in the data during the last
year. Possible causes might be an increased number of fishermen among Navionics users, an
increased propensity among fishermen to share their tracks, or a combination of the two.
121. User Classification
The labels assigned to each single track can be simply aggregated (counted) to
infer the distribution of activities for each user. The next step, then, consists in
selecting the activity – or activities – that represents the user best.
After some trials and evaluation of the results with the domain experts, the following
approach was decided:
• If the user has a percentage of fishing tracks larger than 30%, we label the user
as “fisherman”, since at present fishing is considered a strategic segment of
customers.
• Otherwise, the label with the largest percentage is selected, with no minimum
thresholds.
122. Adaptive highly Scalable Analytics Platform (ASAP)
Task: event detection analysis, i.e. detecting events in a specific geographic area and classifying
the different kinds of users involved.
123. The Implemented ETL Process
A continuous flow of data from the users is stored in the Wind servers. The first step towards
realizing a realistic service on the ASAP platform is to define and implement an ETL
(Extract, Transform, Load) process able to update the data periodically (e.g. monthly).
124. The Collected Data
" Structured data: Charging Data Records (CDR) related to Voice, SMS, Traffic Data;
Customer Relationship Management (CRM) data containing users’ information
" Covered geographical region: city of Rome
" Dataset size per snapshot: ≈ 1.2 GBytes per day
" Number of records: ≈ 5.6 million lines per day
A dataset of about 50 GBytes per month. The dataset is appropriately anonymized to comply with
Italian and European privacy regulations.
Seven months are now collected and stored.
125. The Configured Cluster
A cluster of 4 machines with 12 hyper-threading processors. Spark
installed as runtime context.
127. Adding a new Dimension: users’ classification
" The Sociometer is a methodology to classify the users considering their “call profile”:
• A person is a Resident in an area A when his/her home is inside A. Therefore the mobility tends to be from and towards his/her home.
• A person is a Commuter between an area B and an area A if his/her home is in B while the work/school place is in A.
Therefore the daily mobility of this person is mainly between B and A.
• A person is a Dynamic Resident between an area A and an area B if his/her home is in A while the work/school place is
in B. A Dynamic Resident represents a sort of “opposite” of the Commuter.
• A person is a Visitor in an area A if his/her home and work/school places are outside A, and the presence inside the area
is limited to a certain period of time that can allow him/her to perform some activities in A.
128. User Profiling
Example CDR records (user id, cell, date, time):
123643 Cell12 24/06/2015 14:05
123643 Cell12 24/06/2015 18:13
123643 Cell15 25/06/2015 11:05
123643 Cell15 25/06/2015 20:42
123643 Cell11 25/06/2015 21:05
123643 Cell12 26/06/2015 10:01
…
● Derive the presence distribution for each <user, area> over three time slots:
t1 = [00:00-08:00)
t2 = [08:00-19:00)
t3 = [19:00-24:00)
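A minimal Python sketch (on toy CDR-like records) of how such a presence profile per <user, area> can be built by counting calls in the three time slots:

    from collections import Counter, defaultdict
    from datetime import datetime

    records = [
        ("123643", "Cell12", "24/06/2015 14:05"),
        ("123643", "Cell12", "24/06/2015 18:13"),
        ("123643", "Cell15", "25/06/2015 11:05"),
        ("123643", "Cell15", "25/06/2015 20:42"),
    ]

    def time_slot(timestamp):
        hour = datetime.strptime(timestamp, "%d/%m/%Y %H:%M").hour
        if hour < 8:
            return "t1"                        # [00:00-08:00)
        return "t2" if hour < 19 else "t3"     # [08:00-19:00) / [19:00-24:00)

    profiles = defaultdict(Counter)
    for user, cell, ts in records:
        profiles[(user, cell)][time_slot(ts)] += 1

    print(dict(profiles))   # presence distribution for each <user, area>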
129. Sociometer
● Based on clustering
● K-means: start with K random representatives, and iteratively refine them
● Output: a set of reference (unlabeled) profiles
130. Archetypes
● Archetypes represent the expert knowledge: they describe the perfect “commuter”, “resident”, “visitor” and
“dynamic resident”. More than one archetype may exist for the same class.
● The centroid of each cluster is assigned to the most similar archetype. The class is then propagated to all the
users in the cluster.
(Archetype profiles: Commuter, “Static” resident, Visitor, “Dynamic” resident.)
132. Post-processing: Passing By
We distinguish between Visitors and the subclass of Passing By, i.e. people making a single call.
It is a heuristic which, in some cases, allows us to exclude traffic merely passing through on highways,
or to characterize a different kind of visit.
133. Rome Case Study
In this case study we show how the integration of the presented methods is able to extract
interesting knowledge from the Wind CDR data.
Covered geographical region: city of Rome
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day
9 months between 2015 and 2016
134. The proposed methodology
The approach focuses the analysis on specific areas, using the Sociometer to classify the users
and then highlighting different behaviors which can be studied in detail. The specific areas
analyzed are:
San Pietro Square
Olympic Stadium
Circo Massimo
San Giovanni Square
135. San Pietro Square
Residents are the majority and cover the other classes, which have a lower impact on the overall
distribution. Anyway, this doesn’t mean that the other classes have no effect on the city!
136. San Pietro Square (Scaled)
By extracting the typical behavior of each class of users, the distribution may be “rescaled”
(normalized) and the anomalies emerge. In other words, the real events are spotted. Moreover,
each event is represented by a peak in one or more classes of users.
138. San Pietro – Characterizing the Padre Pio event
Looking at the day of the event (6th February) and the day after, compared to the typical distribution
on a normal Saturday and Sunday, it is evident how the event changes the distribution.
In particular, this event involves both the passing-by and the commuter types (people working in the
area, and people visiting the event and then disappearing).
139. San Pietro – Flows to the Padre Pio event
(Maps: origin areas of the flows on the event day and on the day after.)
140. San Pietro – Characterizing the Jubilee B&G event
Another event (24th April), happening on the same days of the week, has a completely different impact,
involving dynamic residents; hence this event is more local than the previous one.
141. San Pietro – Flows to the Jubilee B&G event
(Maps: origin areas of the flows on the day before and on the event day.)