Data mining and data aggregation basics

data & content design
Frieda Brioschi - frieda.brioschi@gmail.com
Emma Tracanella - emma.tracanella@gmail.com
LESSON 6 - 2020

CLASSICAL DATA MINING

LESSON 5
CONTEXT
You don’t have to be a fancy statistician to do data mining, but you do
have to know something about what the data signifies and how the
business works.
Only when you understand the data and the problem that you need to
solve can data-mining processes help you to discover useful
information and put it to use.

NINE LAWS OF DATA MINING - 1
Pioneering data miner Thomas Khabaza developed his “Nine Laws of Data Mining”
to guide new data miners as they get down to work.
▸ 1. “Business Goals Law”
Business objectives are the origin of every data mining solution.
A data miner is someone who discovers useful information from data to support
specific business goals. Data mining isn’t defined by the tool you use.
▸ 2. “Business Knowledge Law”
Business Knowledge is central to every step of the data mining process.
You don’t have to be a fancy statistician to do data mining, but you do have to
know something about what the data signifies and how the business works.

▸ 3. “Data Preparation Law” 
Data preparation is more than half of every data mining process.
Pretty much every data miner will spend more time on data preparation than on
analysis.
▸ 4. “No Free Lunch for the Data Miner” 
The right model for a given application can only be discovered by experiment.
In data mining, models are selected through trial and error.
▸ 5. “Patterns”
There are always patterns in the data.
As a data miner, you explore data in search of useful patterns. Understanding patterns
in the data enables you to influence what happens in the future.

▸ 6. “Insight Law” 
Data mining amplifies perception in the business domain.
Data mining methods enable you to understand your business better than you
could have done without them.
▸ 7. “Prediction Law”
Prediction increases information locally by generalization.
Data mining helps us use what we know to make better predictions (or
estimates) of things we don’t know.

▸ 8. “Value Law” 
The value of data mining results is not determined by the accuracy or stability
of predictive models.
A model earns its value by helping the business meet its goals; accuracy and
stability matter only as far as they serve that purpose.
▸ 9. “Law of Change” 
All patterns are subject to change.
Any model that gives you great predictions today may be useless tomorrow.

PHASES OF THE DATA MINING PROCESS
The Cross-Industry Standard Process for
Data Mining (CRISP-DM) is the dominant
data-mining process framework. It’s an
open standard; anyone may use it.

BUSINESS UNDERSTANDING
Get a clear understanding of the problem you’re out to solve, how it impacts your
organization, and your goals for addressing it.
Tasks in this phase include:
▸ Identifying your business goals
▸ Assessing your situation
▸ Defining your data mining goals
▸ Producing your project plan

DATA UNDERSTANDING
Review the data that you have, document it, and identify data management and data quality
issues.
▸ Gathering data
▸ Describing
▸ Exploring
▸ Verifying quality

DATA PREPARATION
Get your data ready to use for modeling.
▸ Selecting data
▸ Cleaning data
▸ Constructing
▸ Integrating
▸ Formatting
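The five preparation tasks above can be sketched on a tiny dataset. This is a minimal illustration, not a prescribed method: the records, the field names and the `prepare` helper are all hypothetical, invented for this example.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw_orders = [
    {"customer": "A01", "amount": "19.90", "date": "2020-03-01", "notes": "gift"},
    {"customer": "A02", "amount": "", "date": "2020-03-02", "notes": ""},
    {"customer": "A01", "amount": "5.10", "date": "2020-03-05", "notes": ""},
]
# Hypothetical second data source to integrate.
customer_regions = {"A01": "North", "A02": "South"}


def prepare(orders, regions):
    prepared = []
    for row in orders:
        # Selecting: keep only the fields the analysis needs.
        selected = {key: row[key] for key in ("customer", "amount", "date")}
        # Cleaning: drop rows with a missing amount.
        if not selected["amount"]:
            continue
        # Formatting: convert text to the type the modeling step expects.
        selected["amount"] = float(selected["amount"])
        # Constructing: derive a new attribute from an existing one.
        selected["month"] = selected["date"][:7]
        # Integrating: add information from a second source.
        selected["region"] = regions.get(selected["customer"], "unknown")
        prepared.append(selected)
    return prepared


prepared_orders = prepare(raw_orders, customer_regions)
for order in prepared_orders:
    print(order)
```

Dropping incomplete rows is only one possible cleaning choice; filling in a default value would be another.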

MODELING
Use mathematical techniques to identify patterns within your data.
▸ Selecting techniques
▸ Designing tests
▸ Building models
▸ Assessing models
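As a rough sketch of the modeling tasks above, the example below holds out part of the data as a test, builds two candidate models, and assesses them on the held-out part. The data is synthetic, and the two candidates (a mean baseline and a least-squares line) are assumptions chosen only to keep the example short.

```python
import random
from statistics import mean

# Synthetic data, invented for illustration: y is roughly 3*x + 2 plus noise.
rng = random.Random(0)
points = [(x, 3 * x + 2 + rng.uniform(-1, 1)) for x in range(40)]

# Designing tests: hold out a quarter of the data for assessment.
rng.shuffle(points)
test_set, train_set = points[:10], points[10:]


def fit_mean(train):
    """Baseline model: always predict the average training value."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_line(train):
    """Least-squares straight line fitted to the training data."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sum(
        (x - mx) ** 2 for x, _ in train
    )
    return lambda x: my + slope * (x - mx)


def mean_abs_error(model, data):
    """Average absolute difference between predictions and actual values."""
    return mean(abs(model(x) - y) for x, y in data)


# Building and assessing models: fit each candidate, compare on held-out data.
errors = {
    "mean baseline": mean_abs_error(fit_mean(train_set), test_set),
    "straight line": mean_abs_error(fit_line(train_set), test_set),
}
best = min(errors, key=errors.get)
print(errors, "->", best)
```

Comparing candidates on data they were not fitted to is one simple way to practice the "No Free Lunch" law: the better model is found by experiment, not assumed in advance.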

EVALUATION
Review the patterns you have discovered and assess their potential for business
use.
▸ Evaluating results
▸ Reviewing the process
▸ Determining the next steps

DEPLOYMENT
Put your discoveries to work in everyday business.
▸ Planning deployment (your methods for integrating data mining discoveries
into use)
▸ Reporting final results
▸ Reviewing final results

CLASSICAL DATA AGGREGATION

DATA AGGREGATION
Data aggregation is the process of gathering raw data and expressing it in summary
form for statistical analysis.
For example, raw data can be aggregated over a given time period to provide statistics. After
the data is aggregated and written to a view or report, you can analyze the aggregated data
to gain insights about particular resources or resource groups.
There are two types of data aggregation:
▸ Time aggregation - All data points for a single resource over a specified time period.
▸ Spatial aggregation - All data points for a group of resources over a specified
geographical area.
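The two types of aggregation can be illustrated with a small grouping routine. This is a sketch under stated assumptions: the readings, the resource and region names, and the `aggregate` helper are hypothetical, and the mean is used as the summary only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical readings: (resource, region, day, value). All values are invented.
readings = [
    ("server-a", "north", "2020-03-01", 40),
    ("server-a", "north", "2020-03-02", 60),
    ("server-b", "north", "2020-03-01", 20),
    ("server-c", "south", "2020-03-01", 80),
    ("server-c", "south", "2020-03-02", 100),
]


def aggregate(rows, key):
    """Group values by key(row) and replace each group with its mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[3])
    return {group: mean(values) for group, values in groups.items()}


# Time aggregation: one resource, all of its data points over the period.
by_resource = aggregate(readings, key=lambda row: row[0])

# Spatial aggregation: a group of resources, pooled by geographical area.
by_region = aggregate(readings, key=lambda row: row[1])

print(by_resource)
print(by_region)
```

The same helper serves both cases; only the grouping key changes.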

SUMMARY STATISTICS
When data is aggregated, groups of observations are replaced with summary statistics based on those observations.
Summary statistics are used to communicate the largest amount of information as simply as possible.
▸ Mean
▸ Count
▸ Maximum
▸ Median
▸ Minimum
▸ Mode
▸ Range
▸ Sum
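Each of the statistics listed above can be computed with Python's built-in functions and its standard `statistics` module. The observations below are invented for illustration.

```python
from statistics import mean, median, mode

# Hypothetical observations, invented for illustration.
observations = [4, 8, 15, 16, 23, 42, 8]

summary = {
    "count": len(observations),
    "sum": sum(observations),
    "mean": mean(observations),
    "median": median(observations),
    "mode": mode(observations),
    "minimum": min(observations),
    "maximum": max(observations),
    "range": max(observations) - min(observations),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean (about 16.6) is pulled up by the single large value 42, while the median (15) is not, which is one reason to report more than one statistic.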

TABLES
Tables are the format in which most numerical data are initially stored and analysed and
are likely to be the means you use to organise data collected during experiments and
dissertation research.
Tables are an effective way of presenting data:
▸ when you wish to show how a single category of information varies when
measured at different points (in time or space).
▸ when the dataset contains relatively few numbers.
▸ when the precise value is crucial to your argument and a graph would not convey
the same level of precision.

BAR CHARTS
Bar charts are one of the most commonly
used types of graph and are used to display
and compare the number, frequency or other
measure for different discrete categories or
groups.
The bars can be drawn either vertically or
horizontally depending upon the number of
categories and length or complexity of the
category labels.
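The data behind a bar chart is a frequency count per discrete category. The sketch below counts hypothetical survey answers and prints them as horizontal text bars; the answers are invented, and a plotting library would normally draw the actual chart.

```python
from collections import Counter

# Hypothetical survey answers, invented for illustration.
answers = ["bus", "car", "bike", "car", "bus", "car", "walk"]

# Count how often each discrete category occurs.
frequencies = Counter(answers)

# Horizontal bars: one row per category, most frequent first.
for category, count in frequencies.most_common():
    print(f"{category:<5} {'#' * count} {count}")
```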

HISTOGRAMS
Histograms are a special form of bar chart
where the data represent continuous rather
than discrete categories. Since a
continuous category may have a large
number of possible values, the data are
often grouped to reduce the number of data
points.
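The grouping step described above can be sketched as binning: each continuous value is assigned to an interval, and the histogram shows the count per interval. The measurements, the bin width and the starting edge below are assumptions made for this example.

```python
# Hypothetical continuous measurements (heights in cm), invented for illustration.
heights = [151.2, 158.9, 160.0, 162.4, 167.5, 169.9, 171.3, 174.8, 180.1, 188.6]

BIN_WIDTH = 10  # assumed bin width; the right choice depends on the data
START = 150     # assumed lower edge of the first bin


def histogram(values, start, width):
    """Count how many values fall into each interval [edge, edge + width)."""
    counts = {}
    for value in values:
        edge = start + int((value - start) // width) * width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


bins = histogram(heights, START, BIN_WIDTH)
for edge, count in bins.items():
    print(f"{edge}-{edge + BIN_WIDTH}: {'#' * count}")
```

Ten distinct measurements collapse into four bars, which is the reduction in data points the slide refers to.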

PIE CHARTS
Pie charts are a visual way of displaying how
the total data are distributed between different
categories. Pie charts should only be used for
displaying nominal data. They are generally
best for showing information grouped into a
small number of categories and are a
graphical way of displaying data that might
otherwise be presented as a simple table.
Pie chart of populations of English native speakers

LINE GRAPHS
Line graphs are usually used to show time
series data – that is, how one or more
variables vary over a continuous period of
time. Line graphs are particularly useful for
identifying patterns and trends in the data
such as seasonal effects, large changes and
turning points. As well as time series data,
line graphs can also be appropriate for
displaying data that are measured over other
continuous variables such as distance.

WHAT IS DATA SCIENCE

DEFINITION
Data Science is a blend of various tools, algorithms, and machine learning
principles with the goal of discovering hidden patterns in raw data and solving
analytically complicated problems.

APPLICATION OF DATA SCIENCE

EXPLAINING VS PREDICTING
By 2020 more than 80% of the data
will be unstructured. This data is
generated from different sources like
financial logs, text files, multimedia
forms, sensors, and instruments.

https://databasetown.com/introduction-to-data-science-a-beginners-guide/#What_is_Data_Science

The Data Scientist can handle raw data using the latest
technologies and techniques, perform the necessary analysis, and
present the acquired knowledge to colleagues in an informative way.

The Data Analyst works with R, Python and SQL; the role combines technical
and analytical knowledge.

The Data Architect integrates, centralizes, protects and maintains data
sources.

The Statistician can be seen as the pioneer of the data science field. It is often
the statistician who extracts the information from the data and transforms it into actionable
insights.

The Database Administrator ensures that the database is accessible to every
stakeholder in the organization and takes the necessary security measures
to keep the stored data safe.

The Business Analyst is probably the least technical profile but has a deep
understanding of the various business processes that are in place. The analyst often
acts as the intermediary between the business people and the
technicians.

The Data and Analytics Manager steers the direction of the data science
team, combining strong, specialized skills in a range of
technologies (SQL, R, SAS, …) with the people skills required to manage
a team.

SOME EXAMPLES

THE NY TIMES
https://www.nytimes.com/interactive/2019/11/02/us/politics/trump-twitter-disinformation.html
