This document provides an overview of data mining and data aggregation basics. It discusses key concepts such as the phases of the data mining process according to the CRISP-DM framework, which includes business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It also discusses different types of data aggregation, including time and spatial aggregation, and summarization techniques such as calculating the mean, count, maximum, median, minimum, mode, range, and sum. Additionally, it presents different ways of visualizing data, including tables, bar charts, histograms, pie charts, and line graphs.
Data mining and data aggregation basics
1. data & content design
Frieda Brioschi - frieda.brioschi@gmail.com
Emma Tracanella - emma.tracanella@gmail.com
LESSON 6 - 2020
LESSON 5
CONTEXT
You don’t have to be a fancy statistician to do data mining, but you do
have to know something about what the data signifies and how the
business works.
Only when you understand the data and the problem that you need to
solve can data-mining processes help you to discover useful
information and put it to use.
NINE LAWS OF DATA MINING - 1
Pioneering data miner Thomas Khabaza developed his “Nine Laws of Data Mining”
to guide new data miners as they get down to work
▸ 1. “Business Goals Law”
Business objectives are the origin of every data mining solution.
A data miner is someone who discovers useful information from data to support
specific business goals. Data mining isn’t defined by the tool you use.
▸ 2. “Business Knowledge Law”
Business Knowledge is central to every step of the data mining process.
You don’t have to be a fancy statistician to do data mining, but you do have to
know something about what the data signifies and how the business works.
NINE LAWS OF DATA MINING - 2
▸ 3. “Data Preparation Law”
Data preparation is more than half of every data mining process.
Pretty much every data miner will spend more time on data preparation than on
analysis.
▸ 4. “No Free Lunch for the Data Miner”
The right model for a given application can only be discovered by experiment.
In data mining, models are selected through trial and error.
▸ 5. “Patterns”
There are always patterns in the data.
As a data miner, you explore data in search of useful patterns. Understanding patterns
in the data enables you to influence what happens in the future.
NINE LAWS OF DATA MINING - 3
▸ 6. “Insight Law”
Data mining amplifies perception in the business domain.
Data mining methods enable you to understand your business better than you
could have done without them.
▸ 7. “Prediction Law”
Prediction increases information locally by generalization.
Data mining helps us use what we know to make better predictions (or
estimates) of things we don’t know.
NINE LAWS OF DATA MINING - 4
▸ 8. “Value Law”
The value of data mining results is not determined by the accuracy or stability
of predictive models.
Your model must produce good predictions, consistently. That’s it.
▸ 9. “Law of Change”
All patterns are subject to change.
Any model that gives you great predictions today may be useless tomorrow.
PHASES OF THE DATA MINING PROCESS
The Cross-Industry Standard Process for
Data Mining (CRISP-DM) is the dominant
data-mining process framework. It’s an
open standard; anyone may use it.
BUSINESS UNDERSTANDING
Get a clear understanding of the problem you’re out to solve, how it impacts your
organization, and your goals for addressing it.
Tasks in this phase include:
▸ Identifying your business goals
▸ Assessing your situation
▸ Defining your data mining goals
▸ Producing your project plan
DATA UNDERSTANDING
Review the data that you have, document it, identify data management and data quality
issues.
Tasks in this phase include:
▸ Gathering data
▸ Describing
▸ Exploring
▸ Verifying quality
DATA PREPARATION
Get your data ready to use for modeling.
Tasks in this phase include:
▸ Selecting data
▸ Cleaning data
▸ Constructing
▸ Integrating
▸ Formatting
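The five tasks above can be illustrated with a small example. The sketch below is only one possible reading of each task: the `orders` and `customers` records, the field names, and the threshold for a "large" order are all invented for illustration and do not come from the lesson.

```python
# Hypothetical raw records; all names and values are invented for illustration.
orders = [
    {"customer_id": 1, "amount": "19.90", "channel": "web"},
    {"customer_id": 2, "amount": None, "channel": "store"},  # missing amount
    {"customer_id": 1, "amount": "5.10", "channel": "web"},
    {"customer_id": 3, "amount": "42.00", "channel": "phone"},
]
customers = {1: "Milan", 2: "Rome", 3: "Turin"}


def prepare(orders, customers):
    # Selecting: keep only the channels relevant to the (assumed) question.
    selected = [o for o in orders if o["channel"] in {"web", "store"}]
    # Cleaning: drop records with a missing amount.
    cleaned = [o for o in selected if o["amount"] is not None]
    prepared = []
    for o in cleaned:
        amount = float(o["amount"])  # Formatting: convert text to a number.
        prepared.append({
            "customer_id": o["customer_id"],
            "amount": amount,
            # Constructing: derive a new attribute from an existing one.
            "is_large": amount >= 10.0,
            # Integrating: bring in the city from a second data source.
            "city": customers[o["customer_id"]],
        })
    return prepared


rows = prepare(orders, customers)
for row in rows:
    print(row)
```

In a real project each step would be driven by the business and data mining goals defined earlier, and the cleaning rule (dropping incomplete records) is only one of several options.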
MODELING
Use mathematical techniques to identify patterns within your data.
Tasks in this phase include:
▸ Selecting techniques
▸ Designing tests
▸ Building models
▸ Assessing models
EVALUATION
Review the patterns you have discovered and assess their potential for business
use.
Tasks in this phase include:
▸ Evaluating results
▸ Reviewing the process
▸ Determining the next steps
DEPLOYMENT
Put your discoveries to work in everyday business.
Tasks in this phase include:
▸ Planning deployment (your methods for integrating data mining discoveries
into use)
▸ Reporting final results
▸ Reviewing final results
DATA AGGREGATION
Data aggregation is the process where raw data is gathered and expressed in a summary
form for statistical analysis.
For example, raw data can be aggregated over a given time period to provide statistics. After
the data is aggregated and written to a view or report, you can analyze the aggregated data
to gain insights about particular resources or resource groups.
There are two types of data aggregation:
▸ Time aggregation - All data points for a single resource over a specified time period.
▸ Spatial aggregation - All data points for a group of resources over a specified
geographical area.
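The two types of aggregation can be sketched in a few lines of Python. The example assumes a hypothetical list of data points, each tagged with a resource, an area, and a day, and it uses the mean as the summary; the resource names, areas, and values are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data points: (resource, area, day, value).
readings = [
    ("server-a", "north", 1, 10.0),
    ("server-a", "north", 2, 14.0),
    ("server-a", "north", 3, 12.0),
    ("server-b", "north", 1, 20.0),
    ("server-c", "south", 1, 30.0),
    ("server-c", "south", 2, 34.0),
]


def time_aggregate(readings, resource, first_day, last_day):
    """Mean of all data points for one resource over a time period."""
    values = [v for r, _, d, v in readings
              if r == resource and first_day <= d <= last_day]
    return mean(values)


def spatial_aggregate(readings):
    """Mean of all data points for the group of resources in each area."""
    by_area = defaultdict(list)
    for _, area, _, value in readings:
        by_area[area].append(value)
    return {area: mean(values) for area, values in by_area.items()}


print(time_aggregate(readings, "server-a", 1, 3))
print(spatial_aggregate(readings))
```

Any other summary statistic (sum, maximum, count) could be substituted for the mean without changing the grouping logic.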
SUMMARY STATISTICS
When data is aggregated, groups of observations are replaced with summary statistics based on those observations.
Summary statistics are used to communicate the largest amount of information as simply as possible.
▸ Mean
▸ Count
▸ Maximum
▸ Median
▸ Minimum
▸ Mode
▸ Range
▸ Sum
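As a minimal sketch, the statistics listed above can be computed with Python's standard library. The `observations` list is an invented group of values standing in for the raw data being summarized.

```python
from statistics import mean, median, mode

# Hypothetical group of observations to be replaced by summary statistics.
observations = [4, 8, 15, 16, 23, 42, 8]

summary = {
    "mean": mean(observations),
    "count": len(observations),
    "maximum": max(observations),
    "median": median(observations),
    "minimum": min(observations),
    "mode": mode(observations),
    "range": max(observations) - min(observations),
    "sum": sum(observations),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here seven observations are reduced to eight numbers, which only pays off for larger groups; the point is that each statistic answers a different question about the same data.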
TABLES
Tables are the format in which most numerical data are initially stored and analysed and
are likely to be the means you use to organise data collected during experiments and
dissertation research.
Tables are an effective way of presenting data:
• when you wish to show how a single category of information varies when
measured at different points (in time or space).
• when the dataset contains relatively few numbers.
• when the precise value is crucial to your argument and a graph would not convey
BAR CHARTS
Bar charts are one of the most commonly
used types of graph and are used to display
and compare the number, frequency or other
measure for different discrete categories or
groups.
The bars can be drawn either vertically or
horizontally depending upon the number of
categories and length or complexity of the
category labels.
HISTOGRAMS
Histograms are a special form of bar chart
where the data represent continuous rather
than discrete categories. Since a
continuous category may have a large
number of possible values the data are
often grouped to reduce the number of data
points.
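The grouping step can be sketched as follows. This is one simple approach, assuming equal-width intervals over a chosen range; the `bin_counts` helper and the height values are hypothetical, and the text bars only stand in for a drawn histogram.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            continue  # ignore values outside the chosen range
        # A value exactly on the top edge is placed in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical continuous measurements (for example, heights in cm).
heights = [151.2, 158.9, 160.0, 163.4, 167.8, 169.9, 171.5, 174.3, 179.9, 180.0]

counts = bin_counts(heights, low=150, high=180, n_bins=3)
for i, count in enumerate(counts):
    start = 150 + i * 10
    print(f"{start}-{start + 10} cm | {'#' * count}")
```

The number of bins is a design choice: too few hides the shape of the distribution, too many leaves most bins nearly empty.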
PIE CHARTS
Pie charts are a visual way of displaying how
the total data are distributed between different
categories. Pie charts should only be used for
displaying nominal data. They are generally
best for showing information grouped into a
small number of categories and are a
graphical way of displaying data that might
otherwise be presented as a simple table.
Pie chart of populations of English native speakers
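Behind a pie chart is a simple calculation: each category's total is converted to a share of the whole, and that share determines the angle of the slice. The sketch below shows this with invented category totals; the `pie_slices` helper is hypothetical and not tied to any charting library.

```python
def pie_slices(totals):
    """Return each category's share of the whole and its slice angle."""
    whole = sum(totals.values())
    return {
        category: {"share": value / whole, "degrees": 360 * value / whole}
        for category, value in totals.items()
    }


# Hypothetical totals for a small number of nominal categories.
totals = {"A": 50, "B": 30, "C": 20}

slices = pie_slices(totals)
for category, s in slices.items():
    print(f"{category}: {s['share']:.0%} of the total, {s['degrees']:.0f} degrees")
```

Because the shares always sum to 100%, the same numbers could be presented as a simple table when the exact values matter more than the visual comparison.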
LINE GRAPHS
Line graphs are usually used to show time
series data – that is how one or more
variables vary over a continuous period of
time. Line graphs are particularly useful for
identifying patterns and trends in the data
such as seasonal effects, large changes and
turning points. As well as time series data,
line graphs can also be appropriate for
displaying data that are measured over other
continuous variables such as distance.
DEFINITION
Data Science is a blend of various tools, algorithms, and machine learning
principles with the goal of discovering hidden patterns in raw data and solving
analytically complicated problems.
APPLICATION OF DATA SCIENCE
EXPLAINING VS PREDICTING
By 2020, more than 80% of the data
will be unstructured. This data is
generated from different sources like
financial logs, text files, multimedia
forms, sensors, and instruments.
https://databasetown.com/introduction-to-data-science-a-beginners-guide/#What_is_Data_Science
The Data Scientist can handle raw data using the latest technologies and
techniques, perform the necessary analysis, and present the acquired
knowledge to colleagues in an informative way.
The Data Analyst works with R, Python and SQL; the role combines technical
and analytical knowledge.
The Data Architect integrates, centralizes, protects and maintains data
sources.
The Statistician can be seen as the pioneer of the data science field. It is often
the statistician who extracts information from the data and transforms it into
actionable insights.
The Database Administrator ensures that the database is accessible to every
stakeholder in the organization and takes the necessary security measures
to keep the stored data safe.
The Business Analyst is probably the least technical profile but has a deep
understanding of the various business processes that are in place. The
business analyst often acts as the intermediary between the business side
and the technicians.
The Data and Analytics Manager steers the direction of the data science
team. The role combines strong, specialized skills in a range of
technologies (SQL, R, SAS, …) with the interpersonal skills required to
manage a team.