Modulo 5 (cod. LABCD5) 

Part I
Data Collection, Data Integration, Data Understanding and Data Cleaning & Preparation



3 hours
Roberto Trasarti
Modulo 5 (cod. LABCD5) 

Part I
Extracting Knowledge from data… a twisted story.

Three years to be shrunk into 3 hours.

Roberto Trasarti
Data Collection…
Definitions
Data collection is a systematic process of gathering detailed information about a desired objective
from a selected sample under controlled settings.
Nature, scope and objective of research: the selected data collection method should always maintain a balance among the nature, scope and objectives of the study.
Budget: the availability of funds for the research project determines to a large extent which method is suitable for the collection of data.
Time: the fixed time frame for the research project also has to be taken into account in deciding on a particular method of data collection.
Sufficient knowledge: the proper procedure and the skills required to apply the chosen method must be available.
Primary Data
Primary data means original data that has been collected specifically for the purpose in mind: someone collected the data first hand from the original source.
Primary data has not been published yet and is considered more reliable, authentic and objective. Primary data has not been changed or altered by others; therefore its validity is considered greater than that of secondary data.
Secondary Data
Secondary data is data that has already been collected by others and is readily available from other sources. When we apply statistical methods to primary data that was collected for another purpose, we refer to it as secondary data: one purpose's primary data is another purpose's secondary data. Secondary data is therefore data that is being reused, and it is usually more quickly obtainable than primary data.
These secondary data may be obtained from many sources, including literature,
industry surveys, compilations from computerized databases and information
systems, and computerized or mathematical models of environmental processes.
Qualitative Methods
Exploratory in nature, these methods are mainly concerned with gaining insights into, and an understanding of, underlying reasons and motivations, so they tend to dig deeper. Since their results cannot be quantified, measurability becomes an issue. This lack of measurability leads to a preference for methods or tools that are largely unstructured or, in some cases, structured only to a very small, limited extent.



Generally, qualitative methods are time-consuming and expensive to conduct, and so
researchers try to lower the costs incurred by decreasing the sample size or number of
respondents.
Quantitative Methods
Data can be readily quantified and generated into numerical form, which is then converted and processed into useful information mathematically. The result is often in the form of statistics that are meaningful and, therefore, useful. Unlike qualitative methods, these quantitative techniques usually make use of larger sample sizes, because their measurable nature makes that possible and easier.



Face-to-Face Interviews
This is considered to be the most common data collection instrument for qualitative research, primarily because of its personal approach. The interviewer collects data directly from the subject (the interviewee) in a one-on-one, face-to-face interaction. This is ideal when the data to be obtained must be highly personalized.
Generally, the face-to-face interview is a qualitative method.
Surveys/Questionnaires

Questionnaires often utilize a structure composed of short questions.
Qualitative questionnaires are usually open-ended, with the respondents asked to provide detailed answers in their own words. It's almost like answering essay questions.



Quantitative paper surveys pose closed questions, with the
answer options provided. The respondents will only have to
choose their answer among the choices provided on the
questionnaire.
Observation
Observation can be done with the researcher taking a participatory stance, immersing himself (or not) in the setting where his respondents are, generally looking at everything while taking notes.
A researcher taking notes and interacting is using a qualitative method.
Quantitative observation is the case where the data is collected through systematic observation, measuring specific aspects or using devices that record events (such as GPS devices or mobile phones).




Temporal dimension: Longitudinal data collection
This is a research or data collection method that is performed
repeatedly, on the same data sources, over an extended period of
time. It is an observational research method that could even cover a
span of years and, in some cases, even decades. The goal is to find
correlations through an empirical or observational study of subjects
with a common trait or characteristic.
Case Study
Data is gathered by taking a close look and an in-depth analysis of a “case study”
or “case studies” – the unit or units of research that may be an individual, a group
of individuals, or an entire organization. This methodology’s versatility is
demonstrated in how it can be used to analyze both simple and complex subjects.

There is a risk of bias due to undersampling.
Can we estimate Country well-being using new Big Data sources?
We studied human behavior through the lens of phone data records by means of
new statistical indicators that quantify and possibly “nowcast” the well-being and the
socio-economic development of a territory.
What defines the human division of territory? 

Cities are placed in particular areas for a number of good reasons: communication routes, natural resources, migration flows. But once cities are located in a given spot, who decides where one city ends and another begins?

Network analysis can be useful in this context, because it can provide an objective way to divide the
territory according to a particular theory.
What is the effect of Topics/Posts Recommendation systems in Social
Networks?
Algorithmic bias amplifies the opinion polarization of users by showing them only a specific view (their own) of reality.
Big Data…
Big Data: How much data?
" Google processes 20 PB a day (2008)
" Wayback Machine has 3 PB + 100 TB/month
(3/2009)
" Facebook has 2.5 PB of user data + 15 TB/day
(4/2009)
" eBay has 6.5 PB of user data + 50 TB/day (5/2009)
" CERN’s Large Hydron Collider (LHC) generates 15
PB a year
640K ought to be enough
for anybody.
Some Make it 4V’s
Velocity (Speed)

" Data is begin generated fast and need to be processed fast
" Online Data Analytics
" Late decisions ➔ missing opportunities
" Examples
○ E-Promotions: Based on your current location, your purchase
history, what you like ➔ send promotions right now for store next
to you
○ Healthcare monitoring: sensors monitoring your activities and
body ➔ any abnormal measurements require immediate reaction
Real-time/Fast Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
" The progress and innovation is no longer hindered by the ability to collect data
" But, by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Variety (Complexity)
" Relational Data (Tables/Transaction/Legacy Data)
" Text Data (Web)
" Semi-structured Data (XML)
" Graph Data
○ Social Network, Semantic Web (RDF), …
" Streaming Data
○ You can only scan the data once
" A single application can be generating/collecting
many types of data
" Big Public Data (online, weather, finance, etc)
To extract knowledge ➔ all these types of data need to be linked together
A Single View of the Customer: linking customer data across social media, gaming and entertainment, banking and finance, purchase records, and our known history.

The Model Has Changed…
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Big Data vs Small Data
Big Data is not always the right choice:

" Bigger data may lead to an overly simple, general understanding of the phenomenon.
" It may contain biases or prejudices.
" It may encourage bad analyses.
Data Understanding…
Basic statistics: Mean
The arithmetic mean, more commonly known as “the average,” is the sum of a list of numbers divided by the
number of items on the list. The mean is useful in determining the overall trend of a data set or providing a rapid
snapshot of your data. Another advantage of the mean is that it’s very easy and quick to calculate.
Pitfall:
Taken alone, the mean is a dangerous tool. In some data sets, the mean is also closely related to the mode and
the median (two other measurements near the average). However, in a data set with a high number of outliers
or a skewed distribution, the mean simply doesn’t provide the accuracy you need for a nuanced decision.


Basic statistics: Standard Deviation
The standard deviation, often represented with the Greek letter sigma, is a measure of the spread of data around the mean. A high standard deviation signifies that data is spread more widely from the mean, whereas a low standard deviation signals that more data align with the mean. In a portfolio of data analysis methods, the standard deviation is useful for quickly determining the dispersion of data points.
Pitfall:
Just like the mean, the standard deviation is deceptive if taken alone. For example, if the data have a very strange pattern such as a non-normal curve or a large number of outliers, then the standard deviation won't give you all the information you need.


Basic statistics: Quartile/Percentile
The median is central to many experimental data sets, and it is important to calculate the median in such cases rather than fall into the trap of reporting the arithmetic mean.
The quartile is a useful concept in statistics and is conceptually similar to the median. The first quartile is the data point at the 25th percentile, and the third quartile is the data point at the 75th percentile. The 50th percentile is the median.
The median is a measure of the central tendency of the data but says nothing about how the data is distributed in the two arms on either side of the median. Quartiles help us measure this.
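As a quick illustration of the measures above, here is a minimal NumPy sketch (the sample values are made up) showing how a single outlier inflates the mean and standard deviation while the median and quartiles stay stable:

```python
import numpy as np

# Made-up sample with one outlier, to illustrate the pitfalls discussed above
values = np.array([12, 15, 14, 16, 13, 15, 14, 120])   # 120 is an outlier

mean = values.mean()                       # pulled upwards by the outlier
std = values.std(ddof=1)                   # sample standard deviation
median = np.median(values)                 # robust central value
q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range

print(f"mean={mean:.1f}  std={std:.1f}  median={median}  Q1={q1}  Q3={q3}  IQR={iqr}")
```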
Basic statistics: Regression
Regression models the relationships between dependent and explanatory variables, which are usually charted
on a scatterplot. The regression line also designates whether those relationships are strong or weak.
Regression is commonly taught in high school or college statistics courses with applications for science or
business in determining trends over time.
Pitfall:
Sometimes, the outliers on a scatterplot (and the reasons for them) matter significantly. For example, an outlying
data point may represent the input from your most critical supplier or your highest selling product. The nature of
a regression line, however, tempts you to ignore these outliers.
Redundant Attributes
" An attribute is redundant when it can be derived from another
attribute or set of them.
" Redundancy is a problem that should be avoided
○ It increments the data size ! modeling time for DM algorithms increase
○ It also may induce overfitting
" Redundancies in attributes can be detected using correlation
analysis
" Correlation Test quantifies the correlation among two nominal
attributes contain c and r different values each:
" where oij is the frequency of (Ai,Bj) and:
Redundant Attributes
" for numerical attributes Pearson’s product moment coefficient is widely
" where m is the number of instances, and A̅ ,B̅ are the mean values of
attributes A and B.
" Values of r close to +1 or -1 may indicate a high correlation among A
and B.
Redundant Attributes
Correlation matrix and graph
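A correlation matrix over numerical attributes can be obtained directly with pandas; a minimal sketch with made-up columns, where income_k is (almost) derivable from salary:

```python
import pandas as pd

df = pd.DataFrame({
    "age":      [23, 35, 45, 52, 61],
    "salary":   [21_000, 35_000, 48_000, 55_000, 62_000],
    "income_k": [21.5, 35.0, 48.2, 55.1, 61.8],   # redundant: roughly salary / 1000
})

corr = df.corr(method="pearson")   # Pearson correlation matrix
print(corr)

# Flag candidate redundant pairs: |r| close to 1
threshold = 0.95
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > threshold]
print("Highly correlated pairs:", pairs)
```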
What is Data Quality?
Data quality refers to the ability of a set of data to serve an intended purpose.
Low-quality data cannot be used effectively to do what you wish to do with it (really!?).
Remember that your data is rarely going to be perfect, and that you have to juggle
managing your data quality with actually using the data.
DQ Measures I
Completeness
Completeness is defined as how much of a data set is populated, as opposed to being left blank. For instance, a survey
would be 70% complete if it is completed by 70% of people. To ensure completeness, all data sets and data items must be
recorded.
Uniqueness
This metric assesses how unique a data entry is, and whether it is duplicated anywhere else within your database. Uniqueness is
ensured when the piece of data has only been recorded once. If there is no single view, you may have to dedupe it.
Timeliness
How recent is your data? This essential criterion assesses how useful or relevant your data may be based on its age. Naturally, if an entry is dated, for instance, by 12 months, the scope for dramatic changes in the interim may render the data useless.


DQ Measures II
Validity
Simply put, does the data you've recorded reflect what type of data you set out to record? So if you ask for
somebody to enter their phone number into a form, and they type 'sjdhsjdshsj', that data isn't valid, because it
isn't a phone number - the data doesn't match the description of the type of data it should be.
Accuracy
Accuracy determines whether the information you hold is correct or not, and isn't to be confused with validity, a
measure of whether the data is actually the type you wanted.
Consistency
For anyone trying to analyse data, consistency is a fundamental consideration. Basically, you need to ensure
you can compare data across data sets and media (whether it's on paper, on a computer file, or in a database) -
is it all recorded in the same way, allowing you to compare the data and treat it as a whole?
Data Preparation and
Transformation…
Data Cleaning
" The sources of dirty data include
○ data entry errors,
○ data update errors,
○ data transmission errors and even bugs in the data processing system.
" Dirty data usually is presented in two forms: missing data (MVs) and wrong (noisy) data.
Data Cleaning
" The way of handling MVs and noisy data is quite different:
○ The instances containing MVs can be ignored, filled in manually or with a constant or filled in by using estimations over
the data
○ For noise, basic statistical and descriptive techniques can be used to identify outliers, or filters can be applied to
eliminate noisy instances
What Are Outliers?
" Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different
mechanism
" Outliers are different from the noise data
○ Noise is random error or variance in a measured variable
○ Noise should be removed before outlier detection
" Outliers are interesting: It violates the mechanism that generates the normal data
" Outlier detection vs. novelty detection: early stage, outlier; but later merged into the model
Types of Outliers
" Three kinds: global, contextual and collective outliers
" Global outlier (or point anomaly)
○ Object is Og if it significantly deviates from the rest of the data set
○ Ex. Intrusion detection in computer networks
○ Issue: Find an appropriate measurement of deviation
" Contextual outlier (or conditional outlier)
○ Object is Oc if it deviates significantly based on a selected context
○ Attributes of data objects should be divided into two groups
■ Contextual attributes: defines the context, e.g., time & location
■ Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
Global Outlier
Outlier Detection: Statistical Methods
" Statistical methods (also known as model-based methods) assume
that the normal data follow some statistical model (a stochastic
model)
○ The data not following the model are outliers.
■ Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
■ There are rich alternatives to use various statistical models
Outlier Detection: Proximity-Based Methods
" An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity
of the object is significantly deviates from the proximity of most of the other objects in the
same data set
" The effectiveness of proximity-based methods highly relies on the proximity measure.
" In some applications, proximity or distance measures cannot be obtained easily.
" Often have a difficulty in finding a group of outliers which stay close to each other
" Two major types of proximity-based outlier detection
○ Distance-based vs. density-based
Handle Missing Data



The two most commonly recommended ways of dealing with missing data are:
" Dropping observations that have missing values
" Imputing the missing values based on other observations
Drop Data
Dropping missing values is sub-optimal because when you drop observations,
you drop information.
The fact that the value was missing may be informative in itself.
Plus, in the real world, you often need to make predictions on new data even if
some of the features are missing!
You can drop vertically (a feature of the data) or horizontally (some entries in your data).
Imputing missing values
Missing categorical data
The best way to handle missing data for categorical features is to simply label them as
’Missing’!
" You’re essentially adding a new class for the feature. This tells the algorithm that the
value was missing.
Missing numeric data
For missing numeric data, you can fill the empty data:
" Filling it in with the mean.
" Filling with a special value
" Allowing an algorithm to estimate the values
Data Normalization
" Sometimes the attributes selected are raw attributes.
○ They have a meaning in the original domain from where they were
obtained
○ They are designed to work with the operational system in which they are
being currently used
" Usually these original attributes are not good enough to obtain accurate
predictive models
Data Normalization
" It is common to perform a series of manipulation steps to transform the
original attributes or to generate new attributes
○ They will show better properties that will help the predictive power of the
model
" The new attributes are usually named modeling variables or analytic
variables.
Data Normalization
Min-Max Normalization
" The min-max normalization aims to scale all the numerical values v of a
numerical attribute A to a specified range denoted by [new − minA, new −
maxA].
" The following expression transforms v to the new value v’:
Data Normalization
Z-score Normalization
" If minimum or maximum values of attribute A are not known, or the data is
noisy, or is skewed, the min-max normalization is good
" Alternative: normalize the data of attribute A to obtain a new distribution with
mean 0 and std. deviation equal to 1
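Both normalizations are one-liners over a pandas Series; a minimal sketch with a made-up attribute:

```python
import pandas as pd

values = pd.Series([12.0, 15.0, 14.0, 16.0, 13.0, 120.0])   # made-up attribute A

# Min-max normalization to a target range [new_min, new_max]
new_min, new_max = 0.0, 1.0
min_max = (values - values.min()) / (values.max() - values.min()) \
          * (new_max - new_min) + new_min

# Z-score normalization: mean 0, standard deviation 1
z_score = (values - values.mean()) / values.std()

print(pd.DataFrame({"raw": values, "min_max": min_max, "z_score": z_score}))
```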
Removing Redundant Features
Data Transformation
" It is the process to create new attributes
○ Often called transforming the attributes or the attribute set.
" Data transformation usually combines the original raw attributes using
different mathematical formulas originated in business models or pure
mathematical formulas.
Data Transformation
Linear Transformations
" Normalizations may not be enough to adapt the data to improve the
generated model.
" Aggregating the information contained in various attributes might be beneficial
" If B is an attribute subset of the complete set A, a new attribute Z can be
obtained by a linear combination:
Data Transformation
Quadratic Transformations
" In quadratic transformations a new attribute is built as follows
" where ri,j is a real number.
" These kinds of transformations have been thoroughly studied and can help to
transform data to make it separable.
Data Reduction
" When the data set is very large, performing complex analysis and DM can
take a long computing time
" Data reduction techniques are applied in these domains to reduce the size
of the data set while trying to maintain the integrity and the information of the
original data set as much as possible
" Mining on the reduced data set will be much more efficient and it will also
resemble the results that would have been obtained using the original data
set.
Data Reduction
" The use of binning and discretization techniques is also useful to reduce the
dimensionality and complexity of the data set.
" They convert numerical attributes into nominal ones, thus drastically reducing
the cardinality of the attributes involved
Data Reduction
" Dimensional reduction techniques:
○ Projection
○ Low Variance Filter
○ High Correlation Filter
○ Principal Component Analysis (PCA)
○ Backward Feature Elimination
Data Analysis…
From Data to Knowledge
Data Mining/Machine Learning
" Objective: Fit data to a model
" Potential Result: Higher-level meta information that may not be obvious when
looking at raw data. Patterns and Models.
Find patterns and models?
" Clusters: Clustering algorithms are often applied to automatically group similar instances or objects in clusters (groups).  The goal
is to summarize the data to better understand the data or take decision. 

" Classification models: Classification algorithms aims at extracting models that can be used to classify new instances or objects
into several categories (classes). 
" Patterns and associations: Several techniques are developed to extract frequent patterns or associations between values in
database.
"  Anomalies/outliers: The goal is to detect things that are abnormal in data (outliers or anomalies).
" Trends, regularities:  Techniques can also be applied to find trends and regularities in data.  
In general, the goal of data mining is to find interesting patterns. What is interesting?

(1) it is easy to understand;
(2) it is valid for new data (not just for previous data);
(3) it is useful;
(4) it is novel or unexpected (it is not something that we know already).
Supervised and Unsupervised
Classification problem
" What we have
○ A set of objects, each of them described by some features
■ people described by age, gender, height, etc.
■ bank transactions described by type, amount, time, etc.
" What we want to do
○ Associate the objects of a set to a class, taken from a predefined list
■ “good customer” vs. “churner”
■ “normal transaction” vs. “fraudulent”
■ “low risk patient” vs. “risky”
Figure: objects plotted over Feature 1 (e.g. Age) and Feature 2 (e.g. Income); question marks denote new, unlabeled objects to be assigned to a class.
Classification problem
" What we know
○ No domain knowledge or theory
○ Only examples: Training Set
■ Subset of labelled objects
" What we can do
○ Learn from examples
○ Make inferences about the other objects
The most stupid classifier
" Rote learner
○ To classify object X, check if there is a labelled example in the training set identical to X
○ Yes → X has the same label
○ No → I don't know
Classify by similarity
" K-Nearest Neighbors
○ Decide label based on K most similar examples
K=3
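A minimal scikit-learn sketch of the K-nearest-neighbours idea with K = 3; the features, labels and new object are made up:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training set: [age, income in k€] with labels 0 = "churner", 1 = "good customer"
X_train = np.array([[25, 18], [30, 22], [45, 40], [50, 55], [60, 35], [35, 30]])
y_train = np.array([0, 0, 1, 1, 1, 0])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, as in the example above
knn.fit(X_train, y_train)

# A new, unlabeled object gets the majority label of its 3 most similar examples
print(knn.predict([[48, 38]]))
```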
Build a model
" Example 1: linear separation line
Build a model
" Example 2: Support Vector Machine (linear)
Build a model
" Example 3: Non-linear separation line
Build a model
" Decision Trees
Figure: a decision tree over the attributes Income and Age; the root splits on "Income > 15k€?" and a subsequent node on "Age > 50y?", with yes/no branches leading to the predicted classes.
Clustering
What if no labels are known? We might lack examples

Labels might actually not exist at all…
Clustering
Objective: find structure in the data
Group objects into clusters of similar entities
Clustering: K-means (family)
" Find k subgroups that form compact and well-separated clusters
Figure (K = 3): good clusters maximize cluster compactness and cluster separation.
Clustering: K-means (family)
" Output 1: a partitioning of the initial set of objects
K=3
Clustering: K-means (family)
" Output 2: K representative objects (centroids)
" Centroid = average profile of the objects in the cluster
K=3
• Avg. age
• Avg. weight
• Avg. income
• Avg. n. children
• …
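A minimal scikit-learn sketch of K-Means over made-up customer profiles, showing both outputs described above (the partition and the centroids):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up profiles: [age, weight, income in k€, n. children]
X = np.array([
    [25, 60, 18, 0], [27, 65, 20, 0],   # young, low income
    [45, 80, 55, 2], [48, 85, 60, 3],   # middle-aged, family
    [65, 70, 30, 2], [70, 72, 28, 3],   # older, retired
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # output 1: partition of the objects into K clusters
print(kmeans.cluster_centers_)   # output 2: one centroid (average profile) per cluster
```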
Clustering: hierarchical approaches
" Sometimes we can have (or desire) multiple levels of aggregation
Clustering: hierarchical approaches
" Sometimes we can have (or desire) multiple levels of aggregation
Dendrogram
Community detection
" Equivalent to clustering in the world of networks
" Some of our objects are linked
" Linked objects are more 

likely to belong to the 

same group
○ E.g. users exchanging emails
" Links can be weighted
○ E.g.: n. of emails exchanged
Community detection
" Objective
○ Identify strongly connected subgroups that are weakly connected to the others
" General methodology
○ Find weak connections (a small set of links that act as "bridges")
○ Remove them
○ Each remaining connected component is a community
Frequent patterns
" Events or combinations of events that appear frequently in the data
" E.g. items bought by customers of a supermarket
Frequent patterns
" Frequent itemsets w.r.t. minimum threshold
" E.g. with Min_freq = 5
Frequent patterns
Association rules

If items A1, A2, … appear in a basket, then also B1, B2, … will appear
there

Notation: A1, A2, … => B1, B2, … [ C%]

C = confidence, i.e. conditional probability
Figure: example association rules between supermarket items, with confidences of 80%, 100%, 66% and 20%.
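A minimal, library-free sketch of frequent-itemset counting and rule confidence over made-up baskets (Min_freq = 2 here, not the 5 of the slide):

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "beer"},
    {"milk", "beer"},
]
min_freq = 2

# Support (frequency) of every itemset of size 1 and 2
support = Counter()
for basket in baskets:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            support[itemset] += 1

frequent = {itemset: count for itemset, count in support.items() if count >= min_freq}
print(frequent)

# Confidence of the rule {milk} => {bread}: P(bread | milk)
confidence = support[("bread", "milk")] / support[("milk",)]
print(f"milk => bread  [ {confidence:.0%} ]")
```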
Frequent patterns

Complex domains
" Frequent sequences (a.k.a. Sequential patterns)
" Input: sequences of events (or of groups)
Frequent patterns

Complex domains
" Objective: identify sequences that occur frequently
• Sequential pattern:
Collaborative Filtering
" Goal: predict what movies/books/… a person may be interested in, on the basis of
○ Past preferences of the person
○ Other people with similar past preferences
○ The preferences of such people for a new movie/book/…
" One approach based on repeated clustering
○ Cluster people on the basis of preferences for movies
○ Then cluster movies on the basis of being liked by the same clusters of people
○ Again cluster people based on their preferences for (the newly created clusters of) movies
○ Repeat above till equilibrium
" Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find
information of interest
Deep learning
Figure: a raw representation (age, weight, income, n. children, likes sport, likes reading, education, …; e.g. 35, 65, 23 k€, 2, 0.3, 0.6, high, …) is mapped to a higher-level representation (young parent, fit sportsman, high-educated reader, rich obese, …; e.g. 0.9, 0.1, 0.8, 0.0, …).
The objective is to learn a high-level representation of the data automatically from (almost) raw input. This is done automatically using examples and reinforcement.
How do we train?
h = σ(W1·x + b1)
y = σ(W2·h + b2)

With an input x of 3 values, a hidden layer h of 4 neurons and an output layer y of 2 neurons:
4 + 2 = 6 neurons (not counting inputs)
[3 × 4] + [4 × 2] = 20 weights
4 + 2 = 6 biases
26 learnable parameters
Weights
Activation functions
Simple Neural Network
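A minimal NumPy sketch of this two-layer forward pass, with random weights and a sigmoid activation, reproducing the parameter count above:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation function

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # 3 inputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer: 3x4 = 12 weights, 4 biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # output layer: 4x2 = 8 weights, 2 biases

h = sigma(W1 @ x + b1)   # h = σ(W1·x + b1)
y = sigma(W2 @ h + b2)   # y = σ(W2·h + b2)
print(y)                 # 20 weights + 6 biases = 26 learnable parameters
```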
Neural Network and Deep Learning
Multiple Levels Of Abstraction
Evaluation…
Training and Test
The data is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. The test dataset (or subset) is then used to test the model's predictions on data it has not seen before.
Accuracy


The accuracy of a prediction is the fraction of correct predictions over all predictions:
Accuracy = (True Positives + True Negatives) / Total
Precision and Recall
Precision = True Positives / (True Positives + False Positives): how many of the predicted positives are actually positive.
Recall = True Positives / (True Positives + False Negatives): how many of the actual positives are found.
Accuracy and F1 score


The F1 score is the harmonic mean of precision and recall:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
It reaches its best value at 1 (perfect precision and recall) and its worst at 0.
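A minimal scikit-learn sketch computing all four measures over made-up test labels and predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # made-up ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # made-up model predictions

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / Total
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```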
Comparisons
In order to assess the quality of the results of a prediction, it is possible to use “alternative”
models:
" Constant
" Random Models
" Simple Probabilistic Models
" …
" Your Model
" …
" Ideal
(models listed in order of increasing intelligence)
Examples
• Boat Activity recognition for
advertisement
• Detecting and Understanding Events
Navionics
Warp – Water Activity Recognition Process
Objectives
Identify the category of each user (main activity to be chosen among fishing, sailing, cruising and canoeing, as well as other categories such as boat type, water type and preferred type of area). This will enable targeted marketing operations, where the banners shown within the Navionics app will take into consideration the category associated with the user.
Input DATA (1)
" Tracks: records of the trip performed by a user with Navionics app actived. Tracks are basically
sequences of GPS points that allow to reconstruct where and when the trip took place.
" Land: contains the geographical borders of land, used to remove points outside water, deemed not
interesting for this project.
" Sea Bottom: description of the type of bottom in each point in water. Local areas having the same sea
bottom type are represented in the data as a single geometric region.
Input DATA (2)
" Sonar: measures the water depth at each geographical point, worldwide. As for the sea bottom, local
areas having similar depth are represented in the data as a single geometric region. For each region a
minum and maxim depth are given. Usually the regions correspond to fixed intervals of depth, e.g.
mininum 100 feet and maximum 200 feet, which are fine-grained on shallow waters (intervals in the order
of the foot neat the coast) and coarse-grained on deep waters (intervals in the order of thousands of feet
in the middle of the ocean).
" Wrecks: stores the position of the wrecks localized by Navionics users – obviously it is a small fraction of
the wrecks really existing worldwide, although the coverage is better in the areas that are more popular
among Navionics users.
Pre-Processed DATA
" Water Types: In general Navionics uses a space tessellation covering the
world where each cell corresponds to a square of 0.125° x 0.125°, about
10Km2. In the original data water and land are represented as geometries
included in those cells. Using a clustering algorithm we identified bodies
of water classifing them into lake (if they are closed), river or sea/ocean.
" Heat Map: a representative frequency map was extracted based on the
most recent segment of track data available, simply counting, for each
cell of the tesselation, the number of distinct users that visited it at least
once.
" Coastline: Joining several data sources from navionics we obtained the
costline in the entire world. A post-processing transformation is used in
order to simplify the geometries for computational issue.
Building the 

Water Activity Behavioral Model
The blue boxes represent the input data sources,
including the users’ tracks, which are the keystone of
the process. A first set of processes (Coastline,
Analyzing and Features) derive descriptive features out
of the raw track data, with the aid of the context
knowledge provided by the other data sources.
This set of features is then normalized in preparation for a clustering process that extracts a set of representative behaviours, still without a label associated with them.
A set of tracks labeled by the domain experts is used to
assign label information to each cluster representative.
This information is later exploited as input for the
construction of a classification model to be used for
labeling new data.
From movement tracks 

to movement “components”
The raw track data has a few main issues that need to be treated
before any other step:
" Due to early switch ons and late switch offs of the app, some
tracks include points outside water, and therefore not useful
for our task. All these points are filtered out.
" A track very often contains a mix of different activities. In
particular, some parts of the track might be movement and
others are simple stops.
For these two reasons we proceeded to reconstruct the trajectories considering spatio-temporal constraints instead of the track identifier coming from the app, and to decompose them into move components and stop components.
COMPONENT Features (1)
" starthh, startdoy and startdow represent the hour of the day (0-23), the day
of the year (1-366) and the day of the week (1-7) of the beginning of the
component.
" lat represents the latitude of the beginning of the component.
" centercoast represents the distance between the central point (in terms of
time) of the component from the coastline, as computed for each point in
Section 5.
" freq represents the popularity of the cell (w.r.t. Navionics tessellation, see
Section 3) where the component spent most of the time.
" len and duration represent the duration of the component, respectively in terms
of points recorded and time spent.
" domwater and domsea report resp. the most frequent water type and most
frequent sea bottom type among the points in the component.
" domsea_perc is the percentage of points of the component that belong to the
dominant sea bottom type category.
Features vector
COMPONENT Features (2)
" depth, slope and speed are analyzed, represented by some standard indicators: 1Q, 2Q and
3Q (i.e. 25-th percentile, 50-th and 75-th) and the interquartile range (i.e. Q3 – Q1). This results
into 12 features, named qdepth_25, qdepth_50, …, qspeed_range.
" rangle represents the percentage (ratio) of points that have an angle larger than a fixed
threshold (by default 30°), thus measuring the frequency of turns of the boat.
" rwreck is the percentage of points that are close a wreck.
" type distinguishes stop components from move components (see Section 4.2).
" entropy is the mathematical entropy function computed over the set of heading values of the
component.
" Accelerations and decelerations. the speed at each point of a track is compared against
previous ones, in particular those that are more recent than a give temporal threshold (now fixed
to 2 minutes). If the present speed is higher than the minimum previous speed in such interval by
more than a fixed threshold (now fixed to a very small 0.55 m/s) and more than a fixed
percentage (now fixed to 20%), then the current point is considered an acceleration point
" Wandering. A rather frequent behaviour associate to fishing consists in wandering around the
same location, without ever really stopping, basically exploring an area and wait for the fish. In
terms of trajectories, that results in forming very entangled shapes.
CONTEXTS
All the features mentioned above were computed over the whole component. In order to get a more
detailed view of what happened during the component, we identify periods where something specific occurs,
named contexts, and then compute the same features mentioned above considering only the subgroups of
points just identified. In particular, we considered three contexts:
" Near-shore points
" Off-shore points
" Noodle points
In addition to the features described in the previous sections, we compute:
" the percentage of points of the component that belong to the context, e.g. the percentage of points
spent near-shore w.r.t. the total.
" maxl, the length of the longest contiguous sequence of points of the context, e.g. a boat might
perform several isolated noodles, therefore here we will measure only the longest one.
TRACK FEATURES
Finally, a few features are added to the component, that relate the component itself to the overall track it belongs
to:
" ncomponents is the number of components that compose the track.
" rcomponent is the percentage of points of the track that belong to our component.
" track_loop is the geographical distance between the first point of the first component and the last point of
the last component. Very small values identify loops, i.e. trips that “come back home” at the end, while high
values suggest that the track is part of a longer trip (e.g. a week-long cruise) or that the boat has no fixed
docking slot.
FEATURES SELECTION
Violating the non-redundancy assumption made by the clustering algorithm might lead to clusters that are dominated by a few attributes and therefore do not properly consider the information contained in all the other features. For this reason, we started the clustering process with a selection of the features that appeared to be well aligned with our current objectives, also avoiding excessive correlations.
NORMALIZATION
we adopted a standard Z-score normalization, consisting in replacing each feature value with a new one as follows:

new_value = (original_value – average_value) / standard_deviation
RESCALING
Ad hoc rescaling factors can be applied to the features in order to force the algorithms to give more or less importance to a given attribute. Through discussions with the domain experts and preliminary experiments, we decided to rescale the following attributes:
" components that occurred over different types of water should be clearly separated. For this reason domwater was given a high weight by multiplying it by a factor of 10.
" similarly, stop components and move components represent very different things, and therefore should be kept separated. The feature type was thus multiplied by 10.
" while the latitude feature is useful as a proxy of general climate conditions (tropical vs. polar, southern vs. northern hemisphere), it might make clusters too location-specific. For this reason its weight was reduced through a multiplicative factor of 0.25.
K-MEANS
The components, represented by vectors of 32 features, are the input to a K-Means clustering. The value of K is selected in order to obtain a trade-off between two objectives:
(i) have enough clusters to capture the different possible users’ behaviours.
(ii) keep the number of clusters small enough to make it feasible, for a domain expert, to observe and label a
reasonable number of sample components that belong to each cluster.
The clustering is an unsupervised algorithm, thus we discover a set of K unlabeled behaviors.
Design a 

Survey for the Experts
From the clustering result we created a survey
where the experts were asked to specify:

" For each component, its associated
activity
" The overall activity performed during the
track
" The most likely type of boat adopted
" The area (inshore/offshore/intra-coastal)
and type of water (salt/lake)
" Optional notes
The world has been divided into 6 macro-areas:
" United States East coast (USE)
" United States West coast (USW)
" Australia (Aus)
" Mediterranean see (Med)
" Scandinavia (Scand)
" United Kingdom (UK)
Expert Knowledge
Cruising turns out to be by far the most popular
activity in the tracks of the training set, followed
by sailing and fishing. Very few canoeing tracks
were identified. Also, fishing and cruising tracks
tend to be formed by several components of the
same type (respectively ~2.9 and ~2.6
components per track, as compared to the 1.7 of
sailing)
Looking at the activities in the different geographical areas, it is clear that in the USE and USW areas the distribution is well balanced, while in the Mediterranean fishing is slightly underrepresented, in Australia both fishing and sailing are weak, and in the remaining areas (UK and Scandinavia) only sailing emerges significantly.
Building the 

Semantic Model
For each cluster we compute a probability distribution over the set of possible activities. This is done at two levels:
" component-level: the number of components labeled with that specific activity
" track-level: the number of components that belong to a track having that activity as overall labeling
The two counts, obtained for each activity, are summed up according to weights defined by the analyst: 0.85 for track-level labels, and
0.15 for component-level ones.
Moreover, to exploit the uncertain information provided by the experts with the "?" sign, we also counted uncertain labels, with a weight set to 0.15.
Domain expert's rules meta-
features for tracks

The domain experts provided a set of rules that tried to approximate their idea of fishing behaviour, cruising behaviour, etc. We translated them into feature-based rules. Example:
IF at least one of the following apply:
" the component is in a “noodle” shape (r_noodles>=0.2)
" the component is slower than 10 knots (qspeed_75 <5.14) AND follows a slope greater than 55%
(qslope_50>=5)
" the component is slower than 10 knots (qspeed_75 <5.14) AND is shorter than 328 ft (len<=100)
" the component moves in several directions (entropy>2) AND is longer than 54 nm (len>100000)
THEN the component has a Fishing behaviour
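A hypothetical Python encoding of this rule; the feature names and thresholds follow the text above, while the dictionary representation of a component is an assumption made for the sketch:

```python
def is_fishing(c: dict) -> bool:
    """Return True if the component c matches the experts' fishing rule above."""
    noodle_shape   = c["r_noodles"] >= 0.2
    slow           = c["qspeed_75"] < 5.14                    # slower than 10 knots
    slow_and_steep = slow and c["qslope_50"] >= 5             # ... AND steep slope
    slow_and_short = slow and c["len"] <= 100                 # ... AND short component
    wandering      = c["entropy"] > 2 and c["len"] > 100000   # many directions AND long
    return noodle_shape or slow_and_steep or slow_and_short or wandering

# Hypothetical component: slow and following a steep slope -> fishing
component = {"r_noodles": 0.05, "qspeed_75": 4.0, "qslope_50": 6, "len": 250, "entropy": 1.5}
print(is_fishing(component))   # True
```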
BUILDING THE CLASSIFIER
A C4.5 algorithm is used to build a classification tree over vectors summarizing the results of all the previous processes. Each track is represented by a vector containing:
" the "Activity" distribution derived from its clusters
" the "Boat" distribution derived from its clusters
" the "Zone" distribution derived from its clusters
" the distribution derived from the expert rules
In practice this new vector is a higher-level representation of the track, defined by the different distributions derived from its stop and move components.
TUNING THE CLASSIFIER
In order to find the decision tree that has the best accuracy and yet (where possible) does not lose any label, our algorithms play with the two input parameters of C4.5:
" min-leaf: how many objects of the training set should end up in each leaf of the model. The larger this value, the more "solid" the prediction provided by the leaf. Yet, larger values also imply that the tree must have a smaller number of leaves, thus favoring simple models;
" conf-factor: the confidence factor of the leaves, i.e. how much the dominant label of a leaf should predominate over the others. A very high value requires that leaves are basically pure, implying that several splits are performed, and therefore that the model is more detailed.
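The original pipeline tunes Weka's C4.5 (J48) through min-leaf and conf-factor; scikit-learn has no conf-factor, so as a rough, hypothetical stand-in the sketch below grid-searches the closest analogues (min_samples_leaf for min-leaf, ccp_alpha for pruning strength) over synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labeled track vectors
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "min_samples_leaf": [2, 5, 10, 20],   # analogue of C4.5's min-leaf
    "ccp_alpha": [0.0, 0.001, 0.01],      # pruning strength, loosely playing conf-factor's role
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```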
Classification Results
As we can observe, the distributions are similar
to those of the components in the training set.
The main differences include the fact that
“cruising” looks more present now in the USE,
USW and Mediterranean areas, whereas it
dropped dramatically in UK and Scandinavia.
Also, as already noticed in previous sections,
“sailing” completely disappeared in Australia,
since its model did not capture that category.
Distribution of Activities (USE) IN TIME
An interesting view on the data can be obtained
plotting the temporal distribution of the activities
along the whole duration of the data we had access
to, i.e. from May 2014 to April 2016.
In addition to the usual seasonal behaviours –
overall increase of all activities in the summer
months – we can observe that fishing noticeably increased its presence in the data during the last year. Possible causes might be an increased number of fishermen among Navionics users, an increased propensity among fishermen to share their tracks, or a combination of the two.
User Classification
The labels assigned to each single track can be simply aggregated (counted) to
infer the distribution of activities for each user. The next step, then, consists in
selecting the activity – or activities – that represents the user best.
After some trials and evaluation of the results with the domain experts, the following
approach was decided:
• If the user has a percentage of fishing tracks larger than 30%, we label the user as "fisherman", since at present fishing is considered a strategic segment of customers.
• Otherwise, the label with the largest percentage is selected, with no minimum
thresholds.
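A minimal sketch of this labeling rule; the function name and the list-of-labels input format are assumptions:

```python
from collections import Counter

def label_user(track_labels: list[str]) -> str:
    """Aggregate per-track activity labels into a single user label."""
    counts = Counter(track_labels)
    total = sum(counts.values())
    # Fishing is a strategic segment: it wins as soon as it exceeds 30% of the tracks
    if counts["fishing"] / total > 0.30:
        return "fisherman"
    # Otherwise pick the most frequent activity, with no minimum threshold
    return counts.most_common(1)[0][0]

print(label_user(["cruising", "fishing", "fishing", "sailing", "cruising"]))   # fisherman
print(label_user(["cruising", "cruising", "sailing"]))                         # cruising
```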




Adaptive highly Scalable Analytics
Platform



Task: Event Detection analysis: detecting events in a specific geographic area and classifying the different kinds of users involved.

The Implemented ETL Process
A continuous flow of data from the users is stored in the
Wind servers. The first step to realize a realistic service in
the ASAP platform is to define and implement an ETL
(Extract Transform Load) process able to update the data
periodically (i.e. monthly)
The Collected Data
" Structured data: Charging Data Records (CDR) related to Voice, SMS, Traffic Data;
Customer Relationship Management (CRM) data containing user information
" Covered geographical region: city of Rome
" Dataset size per snapshot: ≈ 1.2 GBytes per day
" Number of records: ≈ 5.6 million lines per day
A dataset of about 50 GBytes per month. The dataset is appropriately anonymized to comply with
Italian and European privacy regulations.



Seven months are now collected and stored.
City of Rome
Metropolitan area
The Configured Cluster
A cluster of 4 machines with 12 hyper-threading processors. Spark
installed as runtime context.
Spatio-temporal Statistics: Time Series
Simple statistics are not so informative…
Adding a new Dimension: users’ classification
" The Sociometer is a methodology to classify the users considering their “call profile”:

• A person is Resident in an area A when his/her home is inside A. Therefore the mobility tends to be from and towards his/
her home.

• A person is a Commuter between an area B and an area A if his/her home is in B while the work/school place is in A.
Therefore the daily mobility of this person is mainly between B and A.

• A person is a Dynamic Resident between an area A and an area B if his/her home is in A while the work/school place is
in B. A Dynamic Resident represents a sort of “opposite” of the Commuter.
• A person is a Visitor in an area A if his/her home and work/school places are outside A, and the presence inside the area
is limited to a certain period of time that can allow him/her to perform some activities in A.
User Profiling

Example call records (user id, cell, timestamp):
123643  Cell12  24/06/2015 14:05
123643  Cell12  24/06/2015 18:13
123643  Cell15  25/06/2015 11:05
123643  Cell15  25/06/2015 20:42
123643  Cell11  25/06/2015 21:05
123643  Cell12  26/06/2015 10:01
…
● Derive the presence distribution for each <user, area> over three time slots:
t1 = [00:00-08:00)
t2 = [8:00-19:00)
t3 = [19:00-24:00)
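A minimal sketch of how such a presence distribution could be derived from call records like the ones above; the record format and the counting choice (distinct day/slot presences) are assumptions:

```python
from collections import defaultdict
from datetime import datetime

# Anonymized call records: (user id, cell/area, timestamp), as in the example above
cdr = [
    ("123643", "Cell12", "24/06/2015 14:05"),
    ("123643", "Cell12", "24/06/2015 18:13"),
    ("123643", "Cell15", "25/06/2015 11:05"),
    ("123643", "Cell15", "25/06/2015 20:42"),
]

def time_slot(hour: int) -> str:
    if hour < 8:
        return "t1"   # [00:00-08:00)
    if hour < 19:
        return "t2"   # [08:00-19:00)
    return "t3"       # [19:00-24:00)

# Presence profile: for each <user, area>, the distinct (day, time slot) presences
profile = defaultdict(set)
for user, area, ts in cdr:
    dt = datetime.strptime(ts, "%d/%m/%Y %H:%M")
    profile[(user, area)].add((dt.date().isoformat(), time_slot(dt.hour)))

for key, presences in profile.items():
    print(key, sorted(presences))
```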
Sociometer

● Based on clustering
● K-means: start with K random representatives, and iteratively refine them
● Output: a set of reference (unlabeled) profiles
Archetypes 

● Archetypes represent the expert knowledge: the perfect "commuter", "resident", "visitor" and "dynamic resident". More than one archetype may exist for the same class.
● The centroid of each cluster is assigned to the most similar archetype. The class is then propagated to all the users in the cluster.
Commuter “Static” resident
Visitors
“Dynamic” resident
Multiple profiles
Result for each user: set of individual profiles.
Post-processing: Passing By
1 single call Multiple calls
We distinguish between Visitors and the subclass of Passing-by, i.e. people making a single call.
It's a heuristic which allows us to exclude highways in some cases, or to characterize a different kind of visit.

Rome Case Study
In this case study we show how the integration of the presented methods is able to extract interesting knowledge from the Wind CDR data.
City of Rome
Metropolitan area
Covered geographical region: city of Rome
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day
9 months between 2015 and 2016
January 2016 July 2016
The proposed methodology
The approach focuses the analysis on a specific area, using the Sociometer to classify the users and then highlighting different behaviors which can be studied in detail.
San Pietro Square
Olympic Stadium
Circo Massimo
San Giovanni Square
San Pietro Square
Residents are the majority and mask the other classes, which have a lower impact on the overall distribution. Anyway, this doesn't mean that the other classes have no effect on the city!
San Pietro Square (Scaled)
By extracting the typical behavior of each class of users, the distribution can be "rescaled" (normalized) so that the anomalies emerge. In other words, the real events are spotted. Moreover, each event is represented by a peak in one or more classes of users.
San Pietro Square (Interpretation)
San Pietro – Characterizing the Padre Pio event
Looking at the day of the event (6th February) and the day after, compared to the typical distribution on a normal Saturday and Sunday, it is evident how the event changes the distribution.
In particular this event involves both the passing-by and the commuter types (people working in the area, and people visiting the event and then disappearing).
Event Day after
San Pietro – Flows to Padre Pio event
Event Day after
(y-axis: area of origin of the flows)
San Pietro – Characterizing Jubilee B&G
Another event (24th April), happening on the same days of the week, has a completely different impact, involving dynamic residents; hence the event is more local than the previous one.
Event  Day before
San Pietro – Flows to Jubilee B&G
Day Before Event
(y-axis: area of origin of the flows)
Research infrastructures…
www.sobigdata.eu
Ethics…
References
Books:
" Introduction to Data Mining, by V. Kumar
" Mobility, Data Mining and Privacy, Geographic Knowledge Discovery, By F. Giannotti and D. Pedreschi
" Data Analytics Made Accessible, by A. Maheshwari
" Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by E. Siegel
" Too Big to Ignore: The Business Case for Big Data, by award-winning author P. Simon
" Lean Analytics: Use Data to Build a Better Startup Faster, by A. Croll and B. Yoskovitz
" Data Smart: Using Data Science to Transform Information into Insight, by J. W. Foreman
" Big Data: A Revolution That Will Transform How We Live, Work, and Think by V. Mayer-Schönberger and K. Cukier
" Business UnIntelligence: Insight and Innovation Beyond Analytics and Big Data, by B. Devlin
" Big Data at Work: Dispelling the Myths, Uncovering the Opportunities, by T. H. Davenport
" Analytics in a Big Data World: The Essential Guide to Data Science and its Applications, by B. Baesens
" Data Science For Business: What You Need to Know About Data Mining & Data-Analytic Thinking, by F. Provost & T. Fawcett
" Numsense! Data Science for the Layman: No Math Added by Annalyn Ng & Kenneth Soo
" Data-Driven HR: How to Use Analytics and Metrics to Drive Performance by Bernard Marr
" Creating Value With Social Media Analytics: Managing, Aligning, and Mining Social Media Text, Networks, Actions, Location,
Apps, Hyperlinks, Multimedia, & Search Engines Data by Gohar F. Khan
" Analytic Philosophy: A Very Short Introduction by Michael Beaney

Data Presentation & Analysis.pptx
 
Sample Methodology Essay
Sample Methodology EssaySample Methodology Essay
Sample Methodology Essay
 
Edited assignment in research
Edited assignment in researchEdited assignment in research
Edited assignment in research
 
Seminar on tools of data collection Research Methodology
Seminar on tools of data collection Research MethodologySeminar on tools of data collection Research Methodology
Seminar on tools of data collection Research Methodology
 
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
 
Data Mining System and Applications: A Review
Data Mining System and Applications: A ReviewData Mining System and Applications: A Review
Data Mining System and Applications: A Review
 
Quantitative search and_qualitative_research by mubarak
Quantitative search and_qualitative_research by mubarakQuantitative search and_qualitative_research by mubarak
Quantitative search and_qualitative_research by mubarak
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.doc
 
Module 3 - Improving Current Business with External Data- Online
Module 3 - Improving Current Business with External Data- Online Module 3 - Improving Current Business with External Data- Online
Module 3 - Improving Current Business with External Data- Online
 
Running Head Data Mining in The Cloud .docx
Running Head Data Mining in The Cloud                            .docxRunning Head Data Mining in The Cloud                            .docx
Running Head Data Mining in The Cloud .docx
 
Week-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptxWeek-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptx
 
what is ..how to process types and methods involved in data analysis
what is ..how to process types and methods involved in data analysiswhat is ..how to process types and methods involved in data analysis
what is ..how to process types and methods involved in data analysis
 
Researchpe-5.pptx
Researchpe-5.pptxResearchpe-5.pptx
Researchpe-5.pptx
 
Uncover Trends and Patterns with Data Science.pdf
Uncover Trends and Patterns with Data Science.pdfUncover Trends and Patterns with Data Science.pdf
Uncover Trends and Patterns with Data Science.pdf
 
An Overview Of Data Analysis And Interpretations In Research
An Overview Of Data Analysis And Interpretations In ResearchAn Overview Of Data Analysis And Interpretations In Research
An Overview Of Data Analysis And Interpretations In Research
 
Chapter-Four.pdf
Chapter-Four.pdfChapter-Four.pdf
Chapter-Four.pdf
 

Plus de Laboratorio di Cultura Digitale, labcd.humnet.unipi.it

Plus de Laboratorio di Cultura Digitale, labcd.humnet.unipi.it (20)

La riscoperta di un manoscritto 'francescano' di Lunigiana: il Beinecke MS 1153
La riscoperta di un manoscritto 'francescano' di Lunigiana: il Beinecke MS 1153 La riscoperta di un manoscritto 'francescano' di Lunigiana: il Beinecke MS 1153
La riscoperta di un manoscritto 'francescano' di Lunigiana: il Beinecke MS 1153
 
Le province di Lunigiana e le novità dell'edizione digitale
Le province di Lunigiana e le novità dell'edizione digitaleLe province di Lunigiana e le novità dell'edizione digitale
Le province di Lunigiana e le novità dell'edizione digitale
 
ChatGPT, parlami di Ceccardo di Luni. La nuova vitalità dei miti lunigianesi.
ChatGPT, parlami di Ceccardo di Luni. La nuova vitalità dei miti lunigianesi.ChatGPT, parlami di Ceccardo di Luni. La nuova vitalità dei miti lunigianesi.
ChatGPT, parlami di Ceccardo di Luni. La nuova vitalità dei miti lunigianesi.
 
uale medioevo spezzino? Sei punti di partenza per riscoprire il golfo nell'et...
uale medioevo spezzino? Sei punti di partenza per riscoprire il golfo nell'et...uale medioevo spezzino? Sei punti di partenza per riscoprire il golfo nell'et...
uale medioevo spezzino? Sei punti di partenza per riscoprire il golfo nell'et...
 
Confini di Lunigiana
Confini di LunigianaConfini di Lunigiana
Confini di Lunigiana
 
Le incursioni su Luni. Vero e immaginario, documenti e miti .
Le incursioni su Luni. Vero e immaginario, documenti e miti .Le incursioni su Luni. Vero e immaginario, documenti e miti .
Le incursioni su Luni. Vero e immaginario, documenti e miti .
 
Designing a project in Digital Humanities
Designing a project in Digital HumanitiesDesigning a project in Digital Humanities
Designing a project in Digital Humanities
 
Simbologia e oralità nel dominio signorile del vescovo di Luni
(XII-XIII seco...
Simbologia e oralità nel dominio signorile del vescovo di Luni
(XII-XIII seco...Simbologia e oralità nel dominio signorile del vescovo di Luni
(XII-XIII seco...
Simbologia e oralità nel dominio signorile del vescovo di Luni
(XII-XIII seco...
 
Orientarsi tra Digital Humanities, Digital History e Digital Public History. ...
Orientarsi tra Digital Humanities, Digital History e Digital Public History. ...Orientarsi tra Digital Humanities, Digital History e Digital Public History. ...
Orientarsi tra Digital Humanities, Digital History e Digital Public History. ...
 
Quale storia? 
Le sfide della Digital&Public History (DPHy)
Quale storia? 
Le sfide della Digital&Public History (DPHy)Quale storia? 
Le sfide della Digital&Public History (DPHy)
Quale storia? 
Le sfide della Digital&Public History (DPHy)
 
Responsabilità  condivisa e questioni di storia nei progetti di Digital (Publ...
Responsabilità  condivisa e questioni di storia nei progetti di Digital (Publ...Responsabilità  condivisa e questioni di storia nei progetti di Digital (Publ...
Responsabilità  condivisa e questioni di storia nei progetti di Digital (Publ...
 
Pubblicare un articolo scientifico (di DH) linee guida e consigli
Pubblicare un articolo scientifico (di DH) linee guida e consigliPubblicare un articolo scientifico (di DH) linee guida e consigli
Pubblicare un articolo scientifico (di DH) linee guida e consigli
 
Nuove scoperte su San TerenzIo
Nuove scoperte su San TerenzIoNuove scoperte su San TerenzIo
Nuove scoperte su San TerenzIo
 
1343: importanza e significato della costituzione della Podesteria della Spezia
1343: importanza e significato della costituzione della Podesteria della Spezia1343: importanza e significato della costituzione della Podesteria della Spezia
1343: importanza e significato della costituzione della Podesteria della Spezia
 
Storia (in) digitale. Strategie e strumenti multimediali per facilitare la co...
Storia (in) digitale. Strategie e strumenti multimediali per facilitare la co...Storia (in) digitale. Strategie e strumenti multimediali per facilitare la co...
Storia (in) digitale. Strategie e strumenti multimediali per facilitare la co...
 
1165: Pisani, Genovesi e un corsaro nel Golfo
1165: Pisani, Genovesi e un corsaro nel Golfo1165: Pisani, Genovesi e un corsaro nel Golfo
1165: Pisani, Genovesi e un corsaro nel Golfo
 
Bagni di Lucca e Lucchio tra la tarda antichità e la prima età moderna: doman...
Bagni di Lucca e Lucchio tra la tarda antichità e la prima età moderna: doman...Bagni di Lucca e Lucchio tra la tarda antichità e la prima età moderna: doman...
Bagni di Lucca e Lucchio tra la tarda antichità e la prima età moderna: doman...
 
“Chi è il forest(ier)o?” Levanto e Monterosso negli statuti medievali
“Chi è il forest(ier)o?”  Levanto e Monterosso negli statuti medievali“Chi è il forest(ier)o?”  Levanto e Monterosso negli statuti medievali
“Chi è il forest(ier)o?” Levanto e Monterosso negli statuti medievali
 
Storia, Storia Digitale e Digital Humanities: i problemi aperti
Storia, Storia Digitale e Digital Humanities: i problemi aperti Storia, Storia Digitale e Digital Humanities: i problemi aperti
Storia, Storia Digitale e Digital Humanities: i problemi aperti
 
Mapping the Middle Ages From primary, lacunous, unstandardized, different sou...
Mapping the Middle Ages From primary, lacunous, unstandardized, different sou...Mapping the Middle Ages From primary, lacunous, unstandardized, different sou...
Mapping the Middle Ages From primary, lacunous, unstandardized, different sou...
 

Dernier

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesShubhangi Sonawane
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 

Dernier (20)

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 

Data collection, Data Integration, Data Understanding e Data Cleaning & Preparation- Roberto Trasarti

  • 1. Modulo 5 (cod. LABCD5) 
 Part I Data collection, Data Integration, Data Understanding e Data Cleaning & Preparation
 
 3 hours Roberto Trasarti
  • 2. Modulo 5 (cod. LABCD5) 
 Part I Extracting Knowledge from data… a twisted story.
 3 years to be shrinked in 3 hours.
 Roberto Trasarti
  • 4. Definitions Data collection is a systematic process of collecting detail information about desire objective from selected sample under controlled settings. Nature, scope and objective of research: The selected data collection method should always maintain a balance among nature, scope and objectives of the study. Budget: Availability of funds for the research project determines to a large extent which the method would be suitable for the collection of data. Time: Prefixed time frame for the research project has also to be taken into account in deciding a particular method of data collection. Sufficient knowledge: Proper procedure and required
  • 5. Primary Data Primary data means original data that has been collected specially for the purpose in mind. It means someone collected the data from the original source first hand. Data collected this way is called primary data. Primary data has not been published yet and is more reliable, authentic and objective. Primary data has not been changed or altered by human beings; therefore its validity is greater than secondary data.
  • 6. Secondary Data Secondary data is the data that has been already collected by and readily available from other sources. When we use Statistical Method with Primary Data from another purpose for our purpose we refer to it as Secondary Data. It means that one purpose's Primary Data is another purpose's Secondary Data. So that secondary data is data that is being reused. Such data are more quickly obtainable than the primary data. These secondary data may be obtained from many sources, including literature, industry surveys, compilations from computerized databases and information systems, and computerized or mathematical models of environmental processes.
  • 7. Qualitative Methods Exploratory in nature, these methods are mainly concerned at gaining insights and understanding on underlying reasons and motivations, so they tend to dig deeper. Since they cannot be quantified, measurability becomes an issue. This lack of measurability leads to the preference for methods or tools that are largely unstructured or, in some cases, maybe structured but only to a very small, limited extent.
 
 Generally, qualitative methods are time-consuming and expensive to conduct, and so researchers try to lower the costs incurred by decreasing the sample size or number of respondents.
  • 8. Quantitative Methods Data can be readily quantified and generated into numerical form, which will then be converted and processed into useful information mathematically. The result is often in the form of statistics that is meaningful and, therefore, useful. Unlike qualitative methods, these quantitative techniques usually make use of larger sample sizes because its measurable nature makes that possible and easier.
 

  • 9. Face-to-Face Interviews This is considered to be the most common data collection instrument for qualitative research, primarily because of its personal approach. The interviewer will collect data directly from the subject (the interviewee), on a one-on-one and face-to-face interaction. This is ideal for when data to be obtained must be highly personalized. Generally the face-to-face is a qualitative method.
  • 10. Surveys/Questionnaires
Questionnaires typically use a structure composed of short questions. Qualitative questionnaires are usually open-ended, with respondents asked to provide detailed answers in their own words; it is almost like answering essay questions.
 
 Quantitative paper surveys pose closed questions, with the answer options provided. The respondents will only have to choose their answer among the choices provided on the questionnaire.
  • 11. Observation Observation can be done with the researcher taking a participatory stance, immersing themselves (or not) in the setting where the respondents are, and generally looking at everything while taking down notes. A researcher taking notes and interacting is following a qualitative method. Observation is quantitative when the data is collected through systematic observation, measuring specific aspects or using devices that record events (such as GPS devices or mobile phones).

  • 12. Temporal dimension: Longitudinal data collection This is a research or data collection method that is performed repeatedly, on the same data sources, over an extended period of time. It is an observational research method that could even cover a span of years and, in some cases, even decades. The goal is to find correlations through an empirical or observational study of subjects with a common trait or characteristic.
  • 13. Case Study Data is gathered by taking a close look at and an in-depth analysis of a “case study” or “case studies” – the unit or units of research that may be an individual, a group of individuals, or an entire organization. This methodology’s versatility is demonstrated in how it can be used to analyze both simple and complex subjects.
 There is a risk of bias due to undersampling.
  • 14. Can we estimate Country well-being using new Big Data sources? We studied human behavior through the lens of phone data records by means of new statistical indicators that quantify and possibly “nowcast” the well-being and the socio-economic development of a territory.
  • 15. What defines the human division of territory? 
 Cities are placed in particular areas for a number of good reasons: communication routes, natural resources, migration flows. But once cities are located in a given spot, who decides where one city ends and another begins? 
 Network analysis can be useful in this context, because it can provide an objective way to divide the territory according to a particular theory.
  • 16. What is the effect of Topics/Posts Recommendation systems in Social Networks? Algorithmic bias amplifies opinion polarization of the users showing them only a specific (their) view of the reality.
  • 18. Big Data: How much data? " Google processes 20 PB a day (2008) " Wayback Machine has 3 PB + 100 TB/month (3/2009) " Facebook has 2.5 PB of user data + 15 TB/day (4/2009) " eBay has 6.5 PB of user data + 50 TB/day (5/2009) " CERN’s Large Hadron Collider (LHC) generates 15 PB a year 640K ought to be enough for anybody.
  • 19. Some Make it 4V’s 19
  • 20. Velocity (Speed)
 " Data is begin generated fast and need to be processed fast " Online Data Analytics " Late decisions ➔ missing opportunities " Examples ○ E-Promotions: Based on your current location, your purchase history, what you like ➔ send promotions right now for store next to you ○ Healthcare monitoring: sensors monitoring your activities and body ➔ any abnormal measurements require immediate reaction 20
  • 21. Real-time/Fast Data Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data) " Progress and innovation are no longer hindered by the ability to collect data " But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
  • 22. Variety (Complexity) " Relational Data (Tables/Transaction/Legacy Data) " Text Data (Web) " Semi-structured Data (XML) " Graph Data ○ Social Network, Semantic Web (RDF), … " Streaming Data ○ You can only scan the data once " A single application can be generating/collecting many types of data " Big Public Data (online, weather, finance, etc.) To extract knowledge ➔ all these types of data need to be linked together
  • 24. The Model Has Changed… Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data 24
  • 25. Big Data vs Small Data Big Data is not always the right choice:
 " Bigger data may lead to an overly simplistic, general understanding of the phenomenon. " It may contain biases or prejudices " It may encourage bad analyses
  • 27. Basic statistics: Mean The arithmetic mean, more commonly known as “the average,” is the sum of a list of numbers divided by the number of items on the list. The mean is useful in determining the overall trend of a data set or providing a rapid snapshot of your data. Another advantage of the mean is that it’s very easy and quick to calculate. Pitfall: Taken alone, the mean is a dangerous tool. In some data sets, the mean is also closely related to the mode and the median (two other measurements near the average). However, in a data set with a high number of outliers or a skewed distribution, the mean simply doesn’t provide the accuracy you need for a nuanced decision. 

  • 28. Basic statistics: Standard Deviation The standard deviation, often represented with the Greek letter sigma, is a measure of the spread of data around the mean. A high standard deviation signifies that data is spread more widely from the mean, whereas a low standard deviation signals that more data align with the mean. In a portfolio of data analysis methods, the standard deviation is useful for quickly determining the dispersion of data points. Pitfall: Just like the mean, the standard deviation is deceptive if taken alone. For example, if the data have a very strange pattern such as a non-normal curve or a large amount of outliers, then the standard deviation won’t give you all the information you need. 

  • 29. Basic statistics: Quartile/Percentile The median is central to many experimental data sets, and it is important to calculate it rather than falling into the trap of reporting only the arithmetic mean. The quartile is a useful concept in statistics and is conceptually similar to the median. The first quartile is the data point at the 25th percentile, and the third quartile is the data point at the 75th percentile. The 50th percentile is the median. The median is a measure of the central tendency of the data but says nothing about how the data is distributed in the two arms on either side of the median. Quartiles help us measure this.
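As a minimal sketch (not part of the original slides), these descriptive statistics can be computed with NumPy; the values below are invented to show how a single outlier pulls the mean away from the median:

import numpy as np

# Invented sample with one extreme outlier
values = np.array([12, 15, 14, 16, 13, 15, 14, 120])

mean = values.mean()                                   # pulled upwards by the outlier
std = values.std()                                     # spread around the mean
q1, median, q3 = np.percentile(values, [25, 50, 75])   # quartiles and median

print(f"mean={mean:.1f}  std={std:.1f}")
print(f"Q1={q1}  median={median}  Q3={q3}  IQR={q3 - q1}")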
  • 30. Basic statistics: Regression Regression models the relationships between dependent and explanatory variables, which are usually charted on a scatterplot. The regression line also designates whether those relationships are strong or weak. Regression is commonly taught in high school or college statistics courses with applications for science or business in determining trends over time. Pitfall: Sometimes, the outliers on a scatterplot (and the reasons for them) matter significantly. For example, an outlying data point may represent the input from your most critical supplier or your highest selling product. The nature of a regression line, however, tempts you to ignore these outliers.
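A hedged sketch of a least-squares regression line with NumPy (the x/y values and the outlier are invented), illustrating the pitfall mentioned above:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 30.0])      # the last point is an outlier

slope, intercept = np.polyfit(x, y, deg=1)         # fit y ≈ slope * x + intercept
print(f"with outlier:    y = {slope:.2f} * x + {intercept:.2f}")

# Refitting without the outlier shows how strongly a single point can pull the line
slope2, intercept2 = np.polyfit(x[:-1], y[:-1], deg=1)
print(f"without outlier: y = {slope2:.2f} * x + {intercept2:.2f}")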
  • 31. Redundant Attributes " An attribute is redundant when it can be derived from another attribute or a set of them. " Redundancy is a problem that should be avoided ○ It increases the data size, so the modeling time of DM algorithms increases ○ It may also induce overfitting " Redundancies in attributes can be detected using correlation analysis
  • 32. " Correlation Test quantifies the correlation among two nominal attributes contain c and r different values each: " where oij is the frequency of (Ai,Bj) and: Redundant Attributes
  • 33. " for numerical attributes Pearson’s product moment coefficient is widely " where m is the number of instances, and A̅ ,B̅ are the mean values of attributes A and B. " Values of r close to +1 or -1 may indicate a high correlation among A and B. Redundant Attributes
  • 35. What is Data Quality? Data quality refers to the ability of a set of data to serve an intended purpose. Low-quality data cannot be used effectively to do what you wish to do with it (really!?). Remember that your data is rarely going to be perfect, and that you have to juggle managing your data quality with actually using the data.
  • 36. DQ Measures I Completeness Completeness is defined as how much of a data set is populated, as opposed to being left blank. For instance, a survey would be 70% complete if it is completed by 70% of people. To ensure completeness, all data sets and data items must be recorded. Uniqueness This metric assesses how unique a data entry is, and whether it is duplicated anywhere else within your database. Uniqueness is ensured when the piece of data has only been recorded once. If there is no single view, you may have to dedupe it. Timeliness How recent is your data? This essential criterion assesses how useful or relevant your data may be based on its age. Naturally, if an entry is dated, for instance, by 12 months, the scope for dramatic changes in the interim may render the data useless. 

  • 37. DQ Measures II Validity Simply put, does the data you've recorded reflect what type of data you set out to record? So if you ask for somebody to enter their phone number into a form, and they type 'sjdhsjdshsj', that data isn't valid, because it isn't a phone number - the data doesn't match the description of the type of data it should be. Accuracy Accuracy determines whether the information you hold is correct or not, and isn't to be confused with validity, a measure of whether the data is actually the type you wanted. Consistency For anyone trying to analyse data, consistency is a fundamental consideration. Basically, you need to ensure you can compare data across data sets and media (whether it's on paper, on a computer file, or in a database) - is it all recorded in the same way, allowing you to compare the data and treat it as a whole?
  • 39. Data Cleaning " The sources of dirty data include ○ data entry errors, ○ data update errors, ○ data transmission errors and even bugs in the data processing system. " Dirty data usually is presented in two forms: missing data (MVs) and wrong (noisy) data.
  • 40. Data Cleaning " The way of handling MVs and noisy data is quite different: ○ The instances containing MVs can be ignored, filled in manually or with a constant or filled in by using estimations over the data ○ For noise, basic statistical and descriptive techniques can be used to identify outliers, or filters can be applied to eliminate noisy instances
  • 41. What Are Outliers? " Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism " Outliers are different from the noise data ○ Noise is random error or variance in a measured variable ○ Noise should be removed before outlier detection " Outliers are interesting: It violates the mechanism that generates the normal data " Outlier detection vs. novelty detection: early stage, outlier; but later merged into the model
  • 42. Types of Outliers " Three kinds: global, contextual and collective outliers " Global outlier (or point anomaly) ○ Object is Og if it significantly deviates from the rest of the data set ○ Ex. Intrusion detection in computer networks ○ Issue: Find an appropriate measurement of deviation " Contextual outlier (or conditional outlier) ○ Object is Oc if it deviates significantly based on a selected context ○ Attributes of data objects should be divided into two groups ■ Contextual attributes: defines the context, e.g., time & location ■ Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature (Figure: example of a global outlier.)
  • 43. Outlier Detection: Statistical Methods " Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model) ○ The data not following the model are outliers. ■ Effectiveness of statistical methods: highly depends on whether the assumption of statistical model holds in the real data ■ There are rich alternatives to use various statistical models 43
  • 44. Outlier Detection: Proximity-Based Methods " An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object deviates significantly from the proximity of most of the other objects in the same data set " The effectiveness of proximity-based methods highly relies on the proximity measure. " In some applications, proximity or distance measures cannot be obtained easily. " They often have difficulty finding a group of outliers which stay close to each other " Two major types of proximity-based outlier detection ○ Distance-based vs. density-based
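A minimal distance-based sketch (assumptions: invented 2-D data, k = 5, and simply flagging the two highest scores): each point is scored by the distance to its k-th nearest neighbour, and the most isolated points are reported.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
normal_points = rng.normal(0.0, 1.0, size=(200, 2))        # dense "normal" cloud
outliers = np.array([[8.0, 8.0], [-7.0, 9.0]])             # two far-away points
X = np.vstack([normal_points, outliers])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)             # +1 because each point is its own nearest neighbour
distances, _ = nn.kneighbors(X)
score = distances[:, -1]                                    # distance to the k-th real neighbour

flagged = np.argsort(score)[-2:]                            # the two most isolated points
print("flagged indices:", flagged, "scores:", score[flagged].round(2))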
  • 45. Handle Missing Data
 
 There are two most commonly recommended ways of dealing with missing data: " Dropping observations that have missing values " Imputing the missing values based on other observations
  • 46. Drop Data Dropping missing values is sub-optimal because when you drop observations, you drop information. The fact that the value was missing may be informative in itself. Plus, in the real world, you often need to make predictions on new data even if some of the features are missing! You can drop vertically (an entire feature of the data) or horizontally (some entries in your data)
  • 47. Imputing missing values Missing categorical data The best way to handle missing data for categorical features is to simply label them as ’Missing’! " You’re essentially adding a new class for the feature. This tells the algorithm that the value was missing. Missing numeric data For missing numeric data, you can fill the empty data: " Filling it in with the mean. " Filling with a special value " Allowing an algorithm to estimate the values
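A short pandas sketch of both strategies (the tiny table and column names are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "boat_type": ["sail", None, "motor", "motor", None],
    "length_m":  [9.5, 12.0, np.nan, 7.5, 10.0],
})

# Categorical feature: make "Missing" an explicit class of its own
df["boat_type"] = df["boat_type"].fillna("Missing")

# Numeric feature: fill with the mean (a special sentinel value such as -1 works too)
df["length_m"] = df["length_m"].fillna(df["length_m"].mean())

print(df)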
  • 48. Data Normalization " Sometimes the attributes selected are raw attributes. ○ They have a meaning in the original domain from where they were obtained ○ They are designed to work with the operational system in which they are being currently used " Usually these original attributes are not good enough to obtain accurate predictive models
  • 49. Data Normalization " It is common to perform a series of manipulation steps to transform the original attributes or to generate new attributes ○ They will show better properties that will help the predictive power of the model " The new attributes are usually named modeling variables or analytic variables.
  • 50. Data Normalization Min-Max Normalization " The min-max normalization aims to scale all the numerical values v of a numerical attribute A to a specified range denoted by [new_min_A, new_max_A]. " The following expression transforms v into the new value v': v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
  • 51. Data Normalization Z-score Normalization " If the minimum or maximum values of attribute A are not known, or the data is noisy or skewed, the min-max normalization is not a good choice " Alternative: normalize the data of attribute A to obtain a new distribution with mean 0 and std. deviation equal to 1: v' = (v − A̅) / σ_A
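Both normalizations in a few lines of NumPy (the values are invented); scikit-learn's MinMaxScaler and StandardScaler implement the same formulas for whole tables:

import numpy as np

v = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

# Min-max normalization to the new range [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: mean 0, standard deviation 1
v_zscore = (v - v.mean()) / v.std()

print(v_minmax.round(2))
print(v_zscore.round(2))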
  • 53. Data Transformation " It is the process to create new attributes ○ Often called transforming the attributes or the attribute set. " Data transformation usually combines the original raw attributes using different mathematical formulas originated in business models or pure mathematical formulas.
  • 54. Data Transformation Linear Transformations " Normalizations may not be enough to adapt the data to improve the generated model. " Aggregating the information contained in various attributes might be beneficial " If B is an attribute subset of the complete set A, a new attribute Z can be obtained by a linear combination: Z = r_1·B_1 + r_2·B_2 + … + r_k·B_k, where the r_i are real numbers
  • 55. Data Transformation Quadratic Transformations " In quadratic transformations a new attribute is built as a combination of products of pairs of attributes: Z = Σ_i,j r_i,j·B_i·B_j, where r_i,j is a real number. " These kinds of transformations have been thoroughly studied and can help to transform data to make it separable.
  • 56. Data Reduction " When the data set is very large, performing complex analysis and DM can take a long computing time " Data reduction techniques are applied in these domains to reduce the size of the data set while trying to maintain the integrity and the information of the original data set as much as possible " Mining on the reduced data set will be much more efficient and it will also resemble the results that would have been obtained using the original data set.
  • 57. Data Reduction " The use of binning and discretization techniques is also useful to reduce the dimensionality and complexity of the data set. " They convert numerical attributes into nominal ones, thus drastically reducing the cardinality of the attributes involved
  • 58. Data Reduction " Dimensional reduction techniques: ○ Projection ○ Low Variance Filter ○ High Correlation Filter ○ Principal Component Analysis (PCA) ○ Backward Feature Elimination
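As an example of one of these techniques, a minimal PCA sketch with scikit-learn on a random table (purely illustrative data):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))                  # 100 instances described by 10 raw attributes

pca = PCA(n_components=3)                       # keep only 3 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                          # (100, 3)
print(pca.explained_variance_ratio_.round(2))   # variance retained by each component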
  • 61. Data Mining/Machine Learning " Objective: Fit data to a model " Potential Result: Higher-level meta information that may not be obvious when looking at raw data. Patterns and Models.
  • 62. Find patterns and models? " Clusters: Clustering algorithms are often applied to automatically group similar instances or objects into clusters (groups). The goal is to summarize the data to better understand it or to support decisions. 
 " Classification models: Classification algorithms aim at extracting models that can be used to classify new instances or objects into several categories (classes). " Patterns and associations: Several techniques are developed to extract frequent patterns or associations between values in a database. " Anomalies/outliers: The goal is to detect things that are abnormal in data (outliers or anomalies). " Trends, regularities: Techniques can also be applied to find trends and regularities in data. In general, the goal of data mining is to find interesting patterns. What is interesting?
 (1) it is easy to understand, (2) it is valid for new data (not just for previous data), (3) it is useful, (4) it is novel or unexpected (it is not something that we already know).
  • 64. Classification problem " What we have ○ A set of objects, each of them described by some features ■ people described by age, gender, height, etc. ■ bank transactions described by type, amount, time, etc. " What we want to do ○ Associate the objects of a set to a class, taken from a predefined list ■ “good customer” vs. “churner” ■ “normal transaction” vs. “fraudulent” ■ “low risk patient” vs. “risky” (Figure: unlabeled objects plotted over two features, e.g. Age and Income.)
  • 65. Classification problem " What we know ○ No domain knowledge or theory ○ Only examples: Training Set ■ Subset of labelled objects " What we can do ○ Learn from examples ○ Make inferences about the other objects
  • 66. The most stupid classifier " Rote learner ○ To classify object X, check if there is a labelled example in the training set identical to X ○ Yes → X has the same label ○ No → I don’t know
  • 67. Classify by similarity " K-Nearest Neighbors ○ Decide label based on K most similar examples K=3
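A k-nearest-neighbours sketch with scikit-learn; the toy training set loosely mirrors the Age/Income example, and both the values and the labels are invented:

from sklearn.neighbors import KNeighborsClassifier

# Invented training set: [age, income in k€] -> class label
X_train = [[25, 15], [30, 18], [45, 40], [50, 55], [60, 35], [22, 12]]
y_train = ["churner", "churner", "good customer", "good customer", "good customer", "churner"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The label is decided by the 3 most similar training examples
print(knn.predict([[40, 30], [23, 14]]))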
  • 68. Build a model " Example 1: linear separation line
  • 69. Build a model " Example 2: Support Vector Machine (linear)
  • 70. Build a model " Example 3: Non-linear separation line
  • 71. Build a model " Decision Trees (Figure: a tree that first splits on Income > 15k€ and then on Age > 50y, with yes/no branches.)
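The same kind of toy data can be fed to a decision tree learner, which produces Income/Age splits of the sort sketched above; this is a minimal illustration, not the slide's exact model:

from sklearn.tree import DecisionTreeClassifier, export_text

X_train = [[25, 15], [30, 18], [45, 40], [50, 55], [60, 35], [22, 12]]
y_train = ["churner", "churner", "good", "good", "good", "churner"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=["age", "income"]))   # textual view of the learned splits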
  • 72. Clustering What if no labels are known? We might lack examples
 Labels might actually not exist at all…
  • 73. Clustering Objective: find structure in the data Group objects into clusters of similar entities
  • 74. Clustering: K-means (family) " Find k subgroups that form compact and well-separated clusters K=3 Cluster compactness Cluster separation
  • 75. Clustering: K-means (family) " Output 1: a partitioning of the initial set of objects K=3
  • 76. Clustering: K-means (family) " Output 2: K representative objects (centroids) " Centroid = average profile of the objects in the cluster K=3 • Avg. age • Avg. weight • Avg. income • Avg. n. children • …
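A minimal K-means run with scikit-learn showing both outputs, the partitioning and the centroids; the three groups of people are generated artificially:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Three invented groups of people described by [age, income in k€]
X = np.vstack([
    rng.normal([25, 15], 2.0, size=(30, 2)),
    rng.normal([45, 40], 2.0, size=(30, 2)),
    rng.normal([65, 25], 2.0, size=(30, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])                # output 1: the cluster assigned to each object
print(km.cluster_centers_.round(1))   # output 2: centroids, i.e. average profiles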
  • 77. Clustering: hierarchical approaches " Sometimes we can have (or desire) multiple levels of aggregation
  • 78. Clustering: hierarchical approaches " Sometimes we can have (or desire) multiple levels of aggregation Dendrogram
  • 79. Community detection " Equivalent to clustering in the world of networks " Some of our objects are linked " Linked objects are more likely to belong to the same group ○ E.g. users exchanging emails " Links can be weighted ○ E.g.: n. of emails exchanged
  • 80. Community detection " Objective ○ Identify strongly connected subgroups that are weakly connected to the others " General methodology ○ Find weak connections (small set of links that are “bridges”) ○ Remove them ○ Each connected component remaining is a community
  • 81. Frequent patterns " Events or combinations of events that appear frequently in the data " E.g. items bought by customers of a supermarket
  • 82. Frequent patterns " Frequent itemsets w.r.t. minimum threshold " E.g. with Min_freq = 5
  • 83. Frequent patterns Association rules
 If items A1, A2, … appear in a basket, then also B1, B2, … will appear there
 Notation: A1, A2, … => B1, B2, … [ C% ]
 C = confidence, i.e. the conditional probability of the right-hand side given the left-hand side (the slide shows example rules with confidences 80%, 100%, 66% and 20%).
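A tiny sketch computing support and confidence by hand on invented baskets (dedicated libraries, e.g. mlxtend's apriori, automate the search for all frequent itemsets):

# Invented market baskets
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    # Fraction of baskets containing all the items of the itemset
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Confidence of the rule {bread} => {milk}: conditional probability of milk given bread
sup_bread_milk = support({"bread", "milk"})
confidence = sup_bread_milk / support({"bread"})
print(f"support(bread, milk) = {sup_bread_milk:.2f}")
print(f"confidence(bread => milk) = {confidence:.2f}")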
  • 84. Frequent patterns
 Complex domains " Frequent sequences (a.k.a. Sequential patterns) " Input: sequences of events (or of groups)
  • 85. Frequent patterns
 Complex domains " Objective: identify sequences that occur frequently • Sequential pattern:
  • 86. Collaborative Filtering " Goal: predict what movies/books/… a person may be interested in, on the basis of ○ Past preferences of the person ○ Other people with similar past preferences ○ The preferences of such people for a new movie/book/… " One approach based on repeated clustering ○ Cluster people on the basis of preferences for movies ○ Then cluster movies on the basis of being liked by the same clusters of people ○ Again cluster people based on their preferences for (the newly created clusters of) movies ○ Repeat above till equilibrium " Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
  • 87. Deep learning (Figure: a raw representation such as Age, Weight, Income, Children, Likes sport, Likes reading, Education, … is mapped into a higher-level representation such as Young parent, Fit sportsman, High-educated reader, Rich obese, …) The objective is to learn a high-level representation of the data automatically from (almost) raw input. This is done automatically using examples and reinforcement.
  • 88. How do we train? A simple neural network: h = σ(W1 x + b1), y = σ(W2 h + b2), where σ is the activation function and W1, W2, b1, b2 are the weights and biases. With 3 inputs, 4 hidden neurons and 2 outputs: 4 + 2 = 6 neurons (not counting inputs), [3 x 4] + [4 x 2] = 20 weights, 4 + 2 = 6 biases, for a total of 26 learnable parameters.
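A NumPy sketch of that tiny network, only to make the parameter count concrete; weights are random, not trained, and the sizes (3 inputs, 4 hidden neurons, 2 outputs) follow the slide:

import numpy as np

def sigma(z):                                    # activation function (sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # 3 inputs -> 4 hidden neurons: 12 weights + 4 biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)    # 4 hidden -> 2 outputs:        8 weights + 2 biases

x = np.array([0.5, -1.2, 3.0])                   # one input example
h = sigma(W1 @ x + b1)
y = sigma(W2 @ h + b2)

n_params = W1.size + b1.size + W2.size + b2.size
print(y.round(3), "learnable parameters:", n_params)   # 26, as on the slide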
  • 89. Neural Network and Deep Learning
  • 90. Multiple Levels Of Abstraction
  • 92. Training and Test The data is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. The test dataset (or subset) is used to test the model’s predictions on data it has not seen during training.
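A minimal split with scikit-learn (the toy data is invented): the model is fitted on the training part only and judged on the held-out test part.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[25, 15], [30, 18], [45, 40], [50, 55], [60, 35], [22, 12], [48, 42], [33, 20]]
y = ["churner", "churner", "good", "good", "good", "churner", "good", "churner"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on unseen test data:", model.score(X_test, y_test))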
  • 93. Accuracy 
 The accuracy of a prediction is the fraction of correct predictions over the total: Accuracy = (True Positives + True Negatives) / Total
  • 95. Accuracy and F1 score 
  The F1 score is the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall); it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
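Both measures are available in scikit-learn; the true and predicted labels below are invented:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))   # (TP + TN) / Total
print("F1 score:", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]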
  • 96. Comparisons In order to assess the quality of the results of a prediction, it is possible to use “alternative” models, ordered here from the least to the most intelligent: " Constant " Random Models " Simple Probabilistic Models " … " Your Model " … " Ideal
  • 97. Examples • Boat Activity recognition for advertisement • Detecting and Understanding Events
  • 98. Navionics Warp – Water Activity Recognition Process
  • 99. Objectives Identify the category of each user (main activity to be chosen from fishing, sailing, cruising and canoeing, as well as other categories such as boat type, water type and type of area of preference). This will enable targeted marketing operations, where the banners shown within Navionics app will take into consideration the category associated to the user.
  • 100. Input DATA (1) " Tracks: records of the trips performed by a user with the Navionics app activated. Tracks are basically sequences of GPS points that allow reconstructing where and when the trip took place. " Land: contains the geographical borders of land, used to remove points outside water, deemed not interesting for this project. " Sea Bottom: description of the type of bottom at each point in water. Local areas having the same sea bottom type are represented in the data as a single geometric region.
  • 101. Input DATA (2) " Sonar: measures the water depth at each geographical point, worldwide. As for the sea bottom, local areas having similar depth are represented in the data as a single geometric region. For each region a minimum and maximum depth are given. Usually the regions correspond to fixed intervals of depth, e.g. minimum 100 feet and maximum 200 feet, which are fine-grained in shallow waters (intervals in the order of a foot near the coast) and coarse-grained in deep waters (intervals in the order of thousands of feet in the middle of the ocean). " Wrecks: stores the position of the wrecks localized by Navionics users – obviously a small fraction of the wrecks really existing worldwide, although the coverage is better in the areas that are more popular among Navionics users.
  • 102. Pre-Processed DATA " Water Types: In general Navionics uses a space tessellation covering the world where each cell corresponds to a square of 0.125° x 0.125°, about 10 km². In the original data water and land are represented as geometries included in those cells. Using a clustering algorithm we identified bodies of water, classifying them into lake (if they are closed), river or sea/ocean. " Heat Map: a representative frequency map was extracted based on the most recent segment of track data available, simply counting, for each cell of the tessellation, the number of distinct users that visited it at least once. " Coastline: Joining several data sources from Navionics we obtained the coastline for the entire world. A post-processing transformation is used to simplify the geometries for computational reasons.
  • 103. Building the 
 Water Activity Behavioral Model The blue boxes represent the input data sources, including the users’ tracks, which are the keystone of the process. A first set of processes (Coastline, Analyzing and Features) derives descriptive features out of the raw track data, with the aid of the context knowledge provided by the other data sources. This set of features is then normalized in preparation for a clustering process that extracts a set of representative behaviours, still without a label associated to them. A set of tracks labeled by the domain experts is used to assign label information to each cluster representative. This information is later exploited as input for the construction of a classification model to be used for labeling new data.
  • 104. From movement tracks 
 to movement “components” The raw track data has a few main issues that need to be treated before any other step: " Due to early switch-ons and late switch-offs of the app, some tracks include points outside water, which are therefore not useful for our task. All these points are filtered out. " A track very often contains a mix of different activities. In particular, some parts of the track might be movement and others simple stops. For these two reasons we proceeded to reconstruct the trajectories considering spatio-temporal constraints instead of the track identifier coming from the app, and to decompose them into move components and stop components.
  • 105. COMPONENT Features (1) " starthh, startdoy and startdow represent the hour of the day (0-23), the day of the year (1-366) and the day of the week (1-7) of the beginning of the component. " lat represents the latitude of the beginning of the component. " centercoast represents the distance between the central point (in terms of time) of the component from the coastline, as computed for each point in Section 5. " freq represents the popularity of the cell (w.r.t. Navionics tessellation, see Section 3) where the component spent most of the time. " len and duration represent the duration of the component, respectively in terms of points recorded and time spent. " domwater and domsea report resp. the most frequent water type and most frequent sea bottom type among the points in the component. " domsea_perc is the percentage of points of the component that belong to the dominant sea bottom type category. Features vector
  • 106. COMPONENT Features (2) " depth, slope and speed are analyzed, represented by some standard indicators: 1Q, 2Q and 3Q (i.e. the 25th, 50th and 75th percentiles) and the interquartile range (i.e. Q3 – Q1). This results in 12 features, named qdepth_25, qdepth_50, …, qspeed_range. " rangle represents the percentage (ratio) of points that have an angle larger than a fixed threshold (by default 30°), thus measuring the frequency of turns of the boat. " rwreck is the percentage of points that are close to a wreck. " type distinguishes stop components from move components (see Section 4.2). " entropy is the mathematical entropy function computed over the set of heading values of the component. " Accelerations and decelerations. The speed at each point of a track is compared against previous ones, in particular those that are more recent than a given temporal threshold (now fixed to 2 minutes). If the present speed is higher than the minimum previous speed in such interval by more than a fixed threshold (now fixed to a very small 0.55 m/s) and by more than a fixed percentage (now fixed to 20%), then the current point is considered an acceleration point. " Wandering. A rather frequent behaviour associated with fishing consists of wandering around the same location, without ever really stopping, basically exploring an area while waiting for the fish. In terms of trajectories, that results in very entangled shapes.
  • 107. CONTEXTS All the features mentioned above were computed over the whole component. In order to get a more detailed view of what happened during the component, we identify periods where something specific occurs, named contexts, and then compute the same features mentioned above considering only the subgroups of points just identified. In particular, we considered three contexts: " Near-shore points " Off-shore points " Noodle points In addition to the features described in the previous sections, we compute: " the percentage of points of the component that belong to the context, e.g. the percentage of points spent near-shore w.r.t. the total. " maxl, the length of the longest contiguous sequence of points of the context, e.g. a boat might perform several isolated noodles, therefore here we will measure only the longest one.
  • 108. TRACK FEATURES Finally, a few features are added to the component, that relate the component itself to the overall track it belongs to: " ncomponents is the number of components that compose the track. " rcomponent is the percentage of points of the track that belong to our component. " track_loop is the geographical distance between the first point of the first component and the last point of the last component. Very small values identify loops, i.e. trips that “come back home” at the end, while high values suggest that the track is part of a longer trip (e.g. a week-long cruise) or that the boat has no fixed docking slot.
  • 109. FEATURES SELECTION Violating the non-redundancy assumption considered by the clustering algorithm might lead to clusters that are dominated by a few attributes and therefore do not properly consider the information contained in all the other features. For this reason, we started the clustering process with a selection of the features that appeared to be well aligned with our current objectives, also avoiding excessive correlations.
  • 110. NORMALIZATION We adopted a standard Z-score normalization, which consists in replacing each feature value with a new one as follows:
 new_value = (original_value – average_value) / standard_deviation
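As a minimal sketch, assuming the feature vectors are held in a pandas DataFrame, the normalization is applied column by column:

```python
import pandas as pd

def zscore(features: pd.DataFrame) -> pd.DataFrame:
    """Z-score normalization: subtract the column mean and divide by the
    column standard deviation, exactly as in the formula above."""
    return (features - features.mean()) / features.std()
```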
  • 111. RESCALING Ad hoc rescaling factors can be applied to the features in order to force the algorithm to give more or less importance to a given attribute. Through discussions with the domain experts and preliminary experiments, we decided to rescale the following attributes:
● components that occurred over different types of water should be clearly separated. For this reason domwater was given a high weight, multiplying it by a factor of 10;
● similarly, stop components and move components represent very different things and should be kept separated. Therefore, feature type was also multiplied by 10;
● while the latitude feature is useful as a proxy of general climate conditions (tropical vs. polar, southern vs. northern hemisphere), it risks making clusters too location-specific. For this reason its weight was reduced through a multiplicative factor of 0.25.
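A hedged sketch of the rescaling step, applied after normalization; the factor table mirrors the weights listed above, while the column names are assumed to match the feature list:

```python
# Factors reflecting the choices described above: domwater and type emphasised,
# latitude de-emphasised. Applied to the already Z-score-normalized DataFrame.
RESCALE_FACTORS = {"domwater": 10.0, "type": 10.0, "lat": 0.25}

def rescale(features, factors=RESCALE_FACTORS):
    rescaled = features.copy()
    for column, factor in factors.items():
        if column in rescaled.columns:
            rescaled[column] = rescaled[column] * factor
    return rescaled
```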
  • 112. K-MEANS The components, represented by vectors of 32 features, are the input of a K-Means clustering. The value of K is selected in order to obtain a trade-off between two objectives: (i) have enough clusters to capture the different possible users’ behaviours; (ii) keep the number of clusters small enough to make it feasible, for a domain expert, to observe and label a reasonable number of sample components from each cluster. Clustering is an unsupervised algorithm, thus we discover a set of K unlabeled behaviours.
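A minimal clustering sketch using scikit-learn (the slides do not state which implementation was actually used); k = 20 is a purely hypothetical value illustrating the trade-off described above:

```python
from sklearn.cluster import KMeans

def cluster_components(X, k=20, seed=0):
    """X is the (n_components x 32) matrix of normalized and rescaled feature
    vectors. Returns the cluster label of each component and the centroids."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(X)
    return labels, model.cluster_centers_
```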
  • 113. Design a Survey for the Experts From the clustering result we created a survey where the experts were asked to specify:
● For each component, its associated activity
● The overall activity performed during the track
● The most likely type of boat adopted
● The area (inshore/offshore/intra-coastal) and type of water (salt/lake)
● Optional notes
The world has been divided into 6 macro-areas:
● United States East coast (USE)
● United States West coast (USW)
● Australia (Aus)
● Mediterranean Sea (Med)
● Scandinavia (Scand)
● United Kingdom (UK)
  • 114. Expert Knowledge Cruising turns out to be by far the most popular activity in the tracks of the training set, followed by sailing and fishing. Very few canoeing tracks were identified. Also, fishing and cruising tracks tend to be formed by several components of the same type (respectively ~2.9 and ~2.6 components per track, compared to the 1.7 of sailing). Looking at the activities in the different geographical areas, it is clear that in the USE and USW areas the distribution is well balanced, while in the Mediterranean fishing is slightly underrepresented, in Australia both fishing and sailing are weak, and in the remaining areas (UK and Scandinavia) only sailing emerges significantly.
  • 115. Building the Semantic Model For each cluster we compute a probability distribution over the set of possible activities. This is done at two levels:
● component level: the number of components labeled with that specific activity;
● track level: the number of components that belong to a track having that activity as overall label.
The two counts, obtained for each activity, are summed up according to weights defined by the analyst: 0.85 for track-level labels and 0.15 for component-level ones. Moreover, to exploit the uncertain information provided by the experts with the “?” sign, we also counted uncertain labels, yet with a weight set to 0.15.
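A possible implementation of this weighted counting, for illustration only (the input field names are assumptions about how the survey answers are stored):

```python
from collections import defaultdict

# Weights reported in the slides: 0.85 for track-level labels, 0.15 for
# component-level labels and for uncertain ("?") answers.
W_TRACK, W_COMPONENT, W_UNCERTAIN = 0.85, 0.15, 0.15

def cluster_activity_distribution(labelled_components):
    """`labelled_components` is an iterable of dicts with the (hypothetical)
    keys: cluster, component_activity, track_activity, uncertain (bool).
    Returns {cluster: {activity: probability}}."""
    scores = defaultdict(lambda: defaultdict(float))
    for c in labelled_components:
        w = W_UNCERTAIN if c["uncertain"] else 1.0
        scores[c["cluster"]][c["component_activity"]] += w * W_COMPONENT
        scores[c["cluster"]][c["track_activity"]] += w * W_TRACK
    # normalise each cluster's scores into a probability distribution
    return {
        cluster: {act: s / sum(acts.values()) for act, s in acts.items()}
        for cluster, acts in scores.items()
    }
```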
  • 116. Domain experts' rules: meta-features for tracks
The domain experts provided a set of rules that tried to approximate their idea of fishing behaviour, cruising behaviour, etc. We translated them into feature-based rules. Example:
IF at least one of the following applies:
● the component has a “noodle” shape (r_noodles >= 0.2)
● the component is slower than 10 knots (qspeed_75 < 5.14) AND follows a slope greater than 55% (qslope_50 >= 5)
● the component is slower than 10 knots (qspeed_75 < 5.14) AND is shorter than 328 ft (len <= 100)
● the component moves in several directions (entropy > 2) AND is longer than 54 nm (len > 100000)
THEN the component has a Fishing behaviour
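The rule translates almost literally into a feature-based predicate. This is a sketch assuming the feature names and thresholds quoted above; it is not the project's actual code:

```python
def is_fishing_like(c):
    """`c` is a dict of component features (r_noodles, qspeed_75 in m/s,
    qslope_50, len, entropy), using the names from the feature list."""
    return (
        c["r_noodles"] >= 0.2
        or (c["qspeed_75"] < 5.14 and c["qslope_50"] >= 5)
        or (c["qspeed_75"] < 5.14 and c["len"] <= 100)
        or (c["entropy"] > 2 and c["len"] > 100000)
    )
```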
  • 117. BUILDING THE CLASSIFIER A C4.5 algorithm is used to build a classification tree over vectors summarizing the results of all the previous processes. Each track is represented by a vector containing the “Activity”, “Boat” and “Zone” distributions obtained from its clusters, plus the distribution obtained from the expert rules. In practice this new vector is a higher-level representation of the track, defined by the different distributions derived from its stop and move components.
  • 118. TUNING THE CLASSIFIER In order to find the decision tree that has the best accuracy and yet (where possible) does not lose any label, our algorithms play with the two input parameters of C4.5:
● min-leaf: how many objects of the training set should end up in each leaf of the model. The larger this value, the more “solid” the prediction provided by the leaf. Yet, larger values also imply that the tree must have a smaller number of leaves, thus favoring simple models;
● conf-factor: the confidence factor of leaves, i.e. how much the dominant label of a leaf should predominate over the others. A very high value requires that leaves are basically pure, which implies that several splits are performed and therefore the model is more detailed.
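A hedged sketch of this tuning loop. Scikit-learn implements CART rather than C4.5, but its min_samples_leaf and cost-complexity pruning parameter play roles comparable to min-leaf and the confidence factor, so a simple grid search over them illustrates the idea (parameter grids are hypothetical):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def tune_tree(X_train, y_train):
    """Search over leaf size and pruning strength, returning the best tree."""
    grid = {
        "min_samples_leaf": [2, 5, 10, 20, 50],  # analogue of C4.5 min-leaf
        "ccp_alpha": [0.0, 0.001, 0.01, 0.05],   # analogue of the confidence factor
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```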
  • 119. Classification Results As we can observe, the distributions are similar to those of the components in the training set. The main differences include the fact that “cruising” looks more present now in the USE, USW and Mediterranean areas, whereas it dropped dramatically in UK and Scandinavia. Also, as already noticed in previous sections, “sailing” completely disappeared in Australia, since its model did not capture that category.
  • 120. Distribution of Activities (USE) IN TIME An interesting view on the data can be obtained by plotting the temporal distribution of the activities along the whole duration of the data we had access to, i.e. from May 2014 to April 2016. In addition to the usual seasonal behaviours – an overall increase of all activities in the summer months – we can observe that fishing increased its presence in the data considerably during the last year. Possible causes might be an increased number of fishermen among Navionics users, an increased propensity among fishermen to share their tracks, or a combination of the two.
  • 121. User Classification The labels assigned to each single track can simply be aggregated (counted) to infer the distribution of activities for each user. The next step, then, consists in selecting the activity – or activities – that best represents the user. After some trials and evaluation of the results with the domain experts, the following approach was adopted:
• If the user has a percentage of fishing tracks larger than 30%, we label the user as “fisherman”, since at present fishing is considered a strategic segment of customers.
• Otherwise, the label with the largest percentage is selected, with no minimum threshold.
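The user-labelling rule can be sketched as follows (names are illustrative; the snippet assumes each user has at least one labelled track):

```python
from collections import Counter

FISHING_THRESHOLD = 0.30  # threshold reported in the slides

def label_user(track_labels):
    """`track_labels` is the list of activity labels assigned to the user's
    tracks. Fishing is prioritised as a strategic customer segment."""
    counts = Counter(track_labels)
    total = sum(counts.values())
    if counts.get("fishing", 0) / total > FISHING_THRESHOLD:
        return "fisherman"
    return counts.most_common(1)[0][0]
```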
  • 122. Adaptive highly Scalable Analytics Platform (ASAP)
Task: Event Detection analysis, i.e. detecting events in a specific geographic area and classifying the different kinds of users involved.
  • 123. The Implemented ETL Process A continuous flow of data from the users is stored in the Wind servers. The first step to realize a realistic service in the ASAP platform is to define and implement an ETL (Extract, Transform, Load) process able to update the data periodically (e.g. monthly).
  • 124. The Collected Data
● Structured data: Charging Data Records (CDR) related to Voice, SMS and Data traffic; Customer Relationship Management (CRM) data containing user information
● Covered geographical region: city of Rome (metropolitan area)
● Dataset size per snapshot: ≈ 1.2 GBytes per day
● Number of records: ≈ 5.6 million lines per day
A dataset of about 50 GBytes per month. The dataset is appropriately anonymized to comply with Italian and European privacy regulations. Seven months have been collected and stored so far.
  • 125. The Configured Cluster A cluster of 4 machines with 12 hyper-threading processors. Spark is installed as the runtime context.
  • 126. Spatio-temporal Statistics: Time Series Simple statistics are not so informative…
  • 127. Adding a new Dimension: users’ classification
● The Sociometer is a methodology to classify users based on their “call profile”:
 • A person is Resident in an area A when his/her home is inside A. Therefore the mobility tends to be from and towards his/ her home.
 • A person is a Commuter between an area B and an area A if his/her home is in B while the work/school place is in A. Therefore the daily mobility of this person is mainly between B and A.
• A person is a Dynamic Resident between an area A and an area B if his/her home is in A while the work/school place is in B. A Dynamic Resident represents a sort of “opposite” of the Commuter.
• A person is a Visitor in an area A if his/her home and work/school places are outside A, and the presence inside the area is limited to a certain period of time that can allow him/her to perform some activities in A.
  • 128. User Profiling
Example CDR records (user id, cell, timestamp):
123643  Cell12  24/06/2015 14:05
123643  Cell12  24/06/2015 18:13
123643  Cell15  25/06/2015 11:05
123643  Cell15  25/06/2015 20:42
123643  Cell11  25/06/2015 21:05
123643  Cell12  26/06/2015 10:01
…
● Derive the presence distribution for each <user, area> over three daily time slots: t1 = [00:00-08:00), t2 = [08:00-19:00), t3 = [19:00-24:00)
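A minimal sketch of the presence-profile derivation, assuming CDRs are available as (user, cell, timestamp) tuples and that a mapping from cells to the monitored areas exists (both assumptions about the data layout, not the actual pipeline):

```python
from collections import defaultdict
from datetime import datetime

SLOTS = [(0, 8, "t1"), (8, 19, "t2"), (19, 24, "t3")]  # hour ranges from the slide

def slot_of(ts: datetime) -> str:
    """Map a timestamp to its daily time slot."""
    for start, end, name in SLOTS:
        if start <= ts.hour < end:
            return name
    raise ValueError(ts)

def presence_profile(cdr_records, cell_to_area):
    """`cdr_records`: iterable of (user_id, cell_id, timestamp) tuples;
    `cell_to_area`: dict mapping cells to monitored areas. Returns, for each
    (user, area), the set of (day, slot) presences, which can then be turned
    into the presence distribution fed to the Sociometer."""
    presence = defaultdict(set)
    for user, cell, ts in cdr_records:
        area = cell_to_area.get(cell)
        if area is not None:
            presence[(user, area)].add((ts.date(), slot_of(ts)))
    return presence
```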
  • 129. Sociometer
● Based on clustering
● K-means: start with K random representatives, and iteratively refine them
● Output: a set of reference (unlabeled) profiles
  • 130. Archetypes 
● Archetypes represent the expert knowledge: they are the perfect “commuter”, “resident”, “visitor” and “dynamic resident” profiles. More than one archetype may exist for the same class.
● The centroid of each cluster is assigned to the most similar archetype. The class is then propagated to all the users in the cluster.
Archetype classes: Commuter, “Static” resident, Visitor, “Dynamic” resident.
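A sketch of the centroid-to-archetype assignment; Euclidean distance is an assumption, since the slides only say “most similar”:

```python
import numpy as np

def assign_archetypes(centroids, archetypes):
    """`centroids` is a (k x d) array of cluster centroids and `archetypes` a
    dict {class_name: profile_vector}, both in the same profile space.
    Each centroid (and hence every user in its cluster) inherits the class of
    the closest archetype."""
    names = list(archetypes)
    proto = np.array([archetypes[n] for n in names])
    assignment = []
    for c in centroids:
        distances = np.linalg.norm(proto - c, axis=1)
        assignment.append(names[int(np.argmin(distances))])
    return assignment
```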
  • 131. Multiple profiles Result for each user: a set of individual profiles.
  • 132. Post-processing: Passing By We distinguish between Visitors and the subclass of Passing By, i.e. people making a single call in the area. This heuristic allows, in some cases, excluding highways, or characterizing a different kind of visit.
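For illustration, the post-processing step amounts to a one-line relabelling rule (class names are illustrative):

```python
def refine_visitor(profile_class, n_calls_in_area):
    """Visitors that produced a single call in the area are re-labelled
    as "passing by"; all other classes are left unchanged."""
    if profile_class == "visitor" and n_calls_in_area == 1:
        return "passing_by"
    return profile_class
```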

  • 133. Rome Case Study In this case study we show how the integration of the presented methods is able to extract interesting knowledge from the Wind CDR data.
Covered geographical region: city of Rome (metropolitan area)
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day
Time span: 9 months between 2015 and 2016
  • 134. The proposed methodology The approach focuses the analysis on specific areas, using the Sociometer to classify the users and then highlighting different behaviours which can be studied in detail. Monitored areas: San Pietro Square, Olympic Stadium, Circo Massimo, San Giovanni Square.
  • 135. San Pietro Square Residents are the majority and mask the other classes, which have a lower impact on the overall distribution. However, this doesn’t mean that those classes have no effect on the city!
  • 136. San Pietro Square (Scaled) By extracting the typical behaviour of each class of users, the distribution can be “rescaled” (normalized) so that the anomalies emerge. In other words, the real events are spotted. Moreover, each event is represented by a peak in one or more classes of users.
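One plausible reading of this rescaling, sketched below: divide the observed presence counts of each user class by the typical profile of that class for the same weekday, so that values well above 1 flag a potential event (the exact normalization used in the project is not specified in the slides):

```python
import numpy as np

def rescaled_presence(observed, typical, eps=1e-9):
    """`observed` and `typical` are arrays of presence counts for one user
    class (e.g. commuters in San Pietro), where `typical` is the average
    profile for the same weekday. Peaks in the returned ratio mark anomalies."""
    return np.asarray(observed) / (np.asarray(typical) + eps)
```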
  • 137. San Pietro Square (Interpretation)
  • 138. San Pietro – Characterizing the Padre Pio event Looking at the day of the event (6th February) and the day after, compared to the typical distribution of a normal Saturday and Sunday, it is evident how the event changes the distribution. In particular, this event involves both the passing-by and the commuter types (people working in the area and people visiting the event and then disappearing).
  • 139. San Pietro – Flows to the Padre Pio event (maps of flows from each origin area, on the event day and the day after)
  • 140. San Pietro – Characterizing the Jubilee B&G Another event (24th April), happening on the same days of the week, has a completely different impact, involving dynamic residents; hence the event is more local than the previous one.
  • 141. San Pietro – Flows to the Jubilee B&G (maps of flows from each origin area, on the day before and the event day)