Enhancing educational data
quality in heterogeneous
learning contexts using
Pentaho Data Integration
Learning Analytics Summer Institute, 2015
Alex Rayón Jerez
@alrayon, alex.rayon@deusto.es
June, 22nd, 2015
2
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
3
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
4
Introduction
Source: http://hardcoremind.com/?p=823
5
Introduction (II)
6
Introduction (III)
7
Introduction (IV)
8
Introduction (V)
9
Introduction (VI)
10
Introduction (VII)
Source: http://www.economist.com/news/finance-and-economics/21578041-containers-have-been-more-important-globalisation-freer-trade-humble
11
Introduction (VIII)
What about
education?
And learning?
12
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
13
Why data quality?
Data sources
Today we have so much data
that come in an unstructured
or semi-structured form that
may nonetheless be of value in
understanding more about our
learners
14
Why data quality?
Data sources (II)
“Learning is a complex social activity”
[Siemens2012]
Lots of data
Lots of tools
Humans to make sense
15
Why data quality?
Data sources (III)
● The world of technology has changed
[Eaton2012]
o 80% of the world’s information is unstructured
o Unstructured data are growing at 15 times the rate
of structured information
o Raw computational power is growing at such an
enormous rate that we almost have a supercomputer
in our hands
o Access to information is available to all
16
Why data quality?
Data sources (IV)
Source: http://www.bigdata-startups.com/BigData-startup/understanding-sources-big-data-infographic/
17
Why data quality?
Data sources (V)
● RDBMS (SQL Server, DB2, Oracle, MySQL,
PostgreSQL, Sybase IQ, etc.)
● NoSQL Data: HBase, Cassandra, MongoDB
● OLAP (Mondrian, Palo, XML/A)
● Web (REST, SOAP, XML, JSON)
● Files (CSV, Fixed, Excel, etc.)
● ERP (SAP, Salesforce, OpenERP)
● Hadoop Data: HDFS, Hive
● Web Data: Twitter, Facebook, Log Files, Web Logs
● Others: LDAP/Active Directory, Google Analytics, etc.
18
Why data quality?
Limitations and costs
Source: http://www.learningfrontiers.eu/?q=story/will-analytics-transform-education
19
Why data quality?
Challenges
● Data is everywhere
● Data is inconsistent
o Records are different in each system
● Performance issues
o Running queries that summarize data over long periods ties up the operational system
o It brings the OS to maximum load
● Data is never all in the data warehouse
o Excel sheets, acquisitions, new applications
20
Why data quality?
Challenges (II)
● Data is incomplete
● Certain types of usage data are not logged
● Data are not aggregated from a didactical perspective
● Users are afraid that they could draw
unsound inferences from some of the data
[Mazza2012]
21
Why data quality?
Development of a common language for data exchange
The IEEE defines
interoperability to be:
“The ability of two or more
systems or components to
exchange information and
to use the information that
has been exchanged”
22
Why data quality?
Development of a common language for data exchange (II)
● The most difficult challenges in achieving interoperability are typically found in establishing common meanings for the data
● Sometimes this is a matter of technical
precision
o But culture – regional, sector-specific, and
institutional – and habitual practices also affect
meaning
23
Why data quality?
Development of a common language for data exchange (III)
● Potential benefits
o Efficiency and timeliness
 No need for a person to intervene to re-enter, re-format or transform data
o Independence
 Resilience
o Adaptability
 Faster, cheaper and less disruptive to change
o Innovation and market growth
 Interoperability combined with modularity makes
it easier to build IT systems that are better
matched to local culture without needing to create
and maintain numerous whole systems
24
Why data quality?
Development of a common language for data exchange (IV)
● Potential benefits
o Durability of data
 Structures and formats change over time
 The changes are rarely properly documented
o Aggregation
 Data joining might be supported by a common set
of definitions around course structure, combined
with a unified identification scheme
o Sharing
 Especially when there are multiple parties involved
25
Why data quality?
Development of a common language for data exchange (V)
[LACE2013]
26
Why data quality?
Development of a common language for data exchange (VI)
[LACE2013]
In our case?
27
Why data quality?
Development of a common language for data exchange (VII)
[LACE2013]
In our case?
28
Why data quality?
Importance
● Data quality emerged as an academic research theme in the early 1990s
● In large companies, awareness of the
importance of quality is much more recent
● It is at the core of any business process where data is the main asset
○ Why?
■ Poor decision-making
■ Time spent fixing errors
■ ...
29
Why data quality?
Meaning
● The primary meaning of
data quality is data suitable
for a particular purpose
○ Fitness for use
○ Conformance to requirements
○ A relative term depending on
the customers’ needs
● Therefore the same data can
be evaluated to varying
degrees of quality according
to users’ needs
Source:
http://mitiq.mit.edu/iciq/pdf/an%20evaluation%20framework%20for%20dat
a%20quality%20tools.pdf
30
Why data quality?
Meaning (II)
● How well the representation model lines up
with the reality of business processes in the
real world [Agosta2000]
● The different ways in which the project leader, the end-user or the database administrator evaluate data integrity produce a large number of quality dimensions
31
Why data quality?
Where are problems generated?
Data entry
External data integration
Loading errors
Data migrations
New applications
32
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
33
Data lifecycle
Quality definition
34
Data lifecycle
Knowledge Discovery in Databases
35
Data lifecycle
Knowledge Discovery in Databases (II)
[Diagram: data sources (SQL, XML, CSV, ...) → Data Management / Integration → Data cycle / process → Data model → Dashboard / Report / API]
36
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
37
Data quality framework
Measuring data quality
● A vast number of bibliographic references
address the definition of criteria for
measuring data quality
● Criteria are usually classified into quality
dimensions
○ [Berti1999]
○ [Huang1998]
○ [Olson2003]
○ [Redman2001]
○ [Wang2006]
38
Data quality framework
Quality concepts hierarchy
Dimension
Factor
Metric
Method
39
Data quality framework
Dimensions
● A dimension captures a facet (at a high level)
of the quality
○ Completeness
○ Accuracy
○ Consistency
○ Relevancy
○ Uniqueness
[Goasdoué2007]
40
Data quality framework
Dimensions (II)
QUALITY INDICATORS
Completeness: Do I have all the information?
Accuracy: Is my dataset valid?
Consistency: Are there conflicts within my data?
Relevancy: Is my data useful?
Uniqueness: Do I have repeated information?
41
Data quality framework
Quality factors
Freshness: validity, age, volatility, opportunity, obsolescence, etc.
Completeness: density, coverage, sufficiency, etc.
Data quantity: volume, data quantity, etc.
Interpretation: traceability, appearance, presentation, modifiability, etc.
Understanding: clarity, meaning, readability, comparability, etc.
Concise representation: uniqueness, minimality, etc.
Consistent representation: format, syntax, alias, semantics, version control, etc.
42
Data quality framework
Quality metrics
● A metric is the tool that permits us to measure
a quality factor
● We must define
○ The semantics (how it is measured)
■ e.g. number of null values, time elapsed since the last update
○ The measurement units
■ e.g. response time in ms, volume in GB, number of transactions per second
○ The measurement granularity
■ e.g. number of errors in the whole table or in one attribute
■ Usual granularities: cell, tuple, attribute, view, table, etc.
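As an illustration that is not part of the original deck, here is a minimal Python sketch of such a metric definition: semantics = share of null values, measurement unit = a ratio in [0, 1], granularity = one attribute. The records and column names are invented for the example.

```python
# A quality metric made concrete: semantics = "share of null values",
# unit = ratio in [0, 1], granularity = one attribute (column).
# Sample records and column names are illustrative only.
records = [
    {"student_id": 1, "email": "a@example.org", "postal_code": None},
    {"student_id": 2, "email": None,            "postal_code": "48007"},
    {"student_id": 3, "email": "c@example.org", "postal_code": None},
]

def null_ratio(rows, attribute):
    """Measure the metric at attribute granularity: nulls / total rows."""
    nulls = sum(1 for row in rows if row.get(attribute) is None)
    return nulls / len(rows) if rows else 0.0

for attr in ("email", "postal_code"):
    print(f"{attr}: {null_ratio(records, attr):.0%} null")  # email: 33%, postal_code: 67%
```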
43
Data quality framework
Quality methods
● A method is a process that implements a metric
● It is responsible for obtaining a set of measurements (in relation to a metric) for a given database
● The method implementation depends on the application and on the database structure
o e.g. to measure the time since the last update we can
 Use database timestamps
 Access the update logs
 Compare versions of the database
44
Data quality framework
Dimensions: 1) Completeness
● Is a concept missing?
● Are there missing values in a column, in a
table?
● Are there missing values?
● Examples
○ Empty postal codes in 50% of the records
45
Data quality framework
Dimensions: 1) Completeness (II)
● Extensity
o The number of entities/states of the reality represented for solving our problem
● Intensity
o The amount of data about each entity/state of the data model
46
Data quality framework
Dimensions: 1) Completeness (III)
SUMMARY
Dimension: COMPLETENESS
Factors: Density, Coverage
Metrics: Ratio (density), Ratio (coverage)
47
Data quality framework
Dimensions: 1) Completeness (IV)
● Density
o How much information about my entities do I have in
my information system?
o We need to measure the quantity of information and
the gap
o Some interpretations of missing values
 The value exists but I do not know it
 The value does not exist
 I do not know whether it exists
Factors: Completeness
48
Data quality framework
Dimensions: 1) Completeness (V)
● Coverage
o How many entities does my information system
contain?
 Closed world: a table contains all the states
 Open world: a table contains some of the states
o We need to measure how much real-world data my information system contains
o Examples
 Of all my students, how many do I know about?
 What percentage of learning activities are registered in my database?
Factors: Completeness
49
Data quality framework
Dimensions: 1) Completeness (VI)
● Density factor
○ Density ratio: % of non-null values
● Coverage factor
○ Coverage ratio: % of data within the data model
● Improvement opportunities
○ Crosschecking or external data acquisition
○ Imputation with statistical models
○ Statistical smoothing techniques
Metrics: Completeness
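A hedged Python sketch of these two metrics follows; it assumes a reference list of all real-world students is available for the coverage ratio, and all names and values are invented.

```python
# Completeness metrics: density ratio (% of non-null cells in the data we
# hold) and coverage ratio (% of real-world entities present at all).
enrolled_students = {"s1", "s2", "s3", "s4"}      # assumed-known reference population
table = [
    {"id": "s1", "email": "a@x.org", "grade": 7.5},
    {"id": "s2", "email": None,      "grade": 6.0},
    {"id": "s4", "email": "d@x.org", "grade": None},
]

cells = [value for row in table for value in row.values()]
density = sum(value is not None for value in cells) / len(cells)

represented = {row["id"] for row in table} & enrolled_students
coverage = len(represented) / len(enrolled_students)

print(f"density = {density:.0%}, coverage = {coverage:.0%}")  # density = 78%, coverage = 75%
```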
50
Data quality framework
Dimensions: 1) Completeness (VII)
● Completeness applies to values of items and
to columns of a table (no missing values in a
column) or even to an entire table (no missing
tuples in the table)
● Great attention is paid to completeness issues
where they are essential to the correct
execution of data processes
○ For example: the correct aggregation of learning activities requires the presence of all activity lines
51
Data quality framework
Dimensions: 2) Accuracy
● Closeness between a value v and a value v’
considered as the correct representation of
the reality that v aims to portray
● It indicates the absence of errors in the data
● It covers aspects that are intrinsic to the data and aspects of its representation (format, accuracy, etc.)
52
Data quality framework
Dimensions: 2) Accuracy (II)
SUMMARY
Dimension: ACCURACY
Factors: Semantic, Syntactic, Representation
Metrics: boolean, degrees of deviation, deviation scale, standard deviation, granularity
53
Data quality framework
Dimensions: 2) Accuracy (III)
● Semantic accuracy
o The closeness between a value v and a real value v’
o We need to measure how well real-world states are represented within the information system
o Some problems that may arise
 Data that do not correspond to any real-world state
● e.g. a student that does not exist
 Data that correspond to a wrong real-world state
● e.g. data that do not refer to the proper student
 Data with errors in some attributes
● e.g. data that refer to the correct student but with some wrong attribute
Factors: Accuracy
54
Data quality framework
Dimensions: 2) Accuracy (IV)
● Syntactic accuracy
o It refers to the closeness that exists between a value v and the elements of the domain D
o We need to know whether v corresponds to a correct value within D, leaving aside whether it corresponds to a real-world value
o Some problems that may arise
 Value errors: out-of-range values, orthographical errors, etc.
● e.g. “Smiht” instead of “Smith” for the last name of a student
● e.g. an age of 338 years
 Standardization errors
● e.g. for gender, “0” or “1” instead of “M” or “F”
● e.g. amounts in a foreign currency instead of €
Factors: Accuracy
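To make the factor concrete, here is a small Python sketch of syntactic checks against a domain D; the rules are illustrative assumptions, not rules from the deck. Note that a pure pattern check cannot catch a misspelling such as “Smiht”: that requires a dictionary of valid values, as a later slide explains.

```python
import re

# Syntactic accuracy: is each value a valid member of its domain D,
# regardless of whether it matches the real world? Illustrative rules.
def check_age(value):
    return isinstance(value, int) and 0 <= value <= 120   # range rule

def check_gender(value):
    return value in {"M", "F"}                             # standardization rule

def check_last_name(value):
    # Pattern rule only: "Smiht" still passes; catching misspellings
    # needs a dictionary of valid values.
    return bool(re.fullmatch(r"[A-Za-z' -]+", value or ""))

rows = [{"last_name": "Smiht", "age": 338, "gender": "0"},
        {"last_name": "Smith", "age": 21, "gender": "F"}]

for row in rows:
    checks = {"last_name": check_last_name(row["last_name"]),
              "age": check_age(row["age"]),
              "gender": check_gender(row["gender"])}
    errors = [field for field, ok in checks.items() if not ok]
    print(row, "->", errors or "syntactically valid")
```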
55
Data quality framework
Dimensions: 2) Accuracy (V)
Metrics: Accuracy
● Boolean
○ If data satisfies rules or not
● Standard deviation
○ If the accuracy error is within the standard deviation or not
56
Data quality framework
Dimensions: 2) Accuracy (VI)
Referentials vs. Dictionaries
Referentials: verify semantic accuracy; <key, value> pairs, where the key represents an element or a state of the real world and the value represents an attribute of that element
Dictionaries: verify syntactic accuracy; a list of valid values for a given domain
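The contrast can be expressed in a few lines of Python; the values below are invented and the structures are a minimal sketch of the two ideas, not a library API.

```python
# A dictionary: the list of valid values for a domain -> syntactic accuracy.
gender_dictionary = {"M", "F"}

# A referential: <key, value> pairs in which the key identifies a real-world
# entity -> semantic accuracy (is this value true of *that* entity?).
email_referential = {"s1": "a@x.org", "s2": "b@x.org"}

def syntactically_valid(value):
    return value in gender_dictionary

def semantically_accurate(student_id, email):
    return email_referential.get(student_id) == email

print(syntactically_valid("0"))                # False: not a valid domain value
print(semantically_accurate("s1", "b@x.org"))  # False: well-formed email, wrong student
```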
57
Data quality framework
Dimensions: 2) Accuracy (VII)
● It is often connected to precision, reliability
and veracity
○ In the case of a phone number, for instance, precision
and accuracy are equivalent
● In practice, despite the attention given to
completeness, accuracy is often a poorly
reported criterion since it is difficult to
measure and often leads to high repair costs
● This is due to the fact that accuracy control and
improvement requires external reference
data
58
Data quality framework
Dimensions: 2) Accuracy (VIII)
● In practice, this comes down to comparing
actual data to a true counterpart (for
example by using a survey)
● The high costs of such tasks leads to less
ambitious verifications such as consistency
controls (for example French personal phone
numbers must begin with: 01, 02, 03, 04, 05)
or based on likelihood (disproportional ratios
of men versus women)
59
Data quality framework
Dimensions: 3) Consistency
● Data are consistent if they respect a set of
constraints
● Data must satisfy some semantic rules
○ Integrity rules
■ All the database instances must satisfy properties
○ User rules
■ Not implemented in the database, but needed for
any given application
● Improvement opportunities
○ Definition of a control strategy
○ Comparison with another, apparently more reliable, source
60
Data quality framework
Dimensions: 3) Consistency (II)
● A consistency factor is based on a rule, for
example, a business rule such as “town address
must belong to the set of French towns” or
“invoicing must correspond to electric power
consumption”
○ Consistency can be viewed as a sub-dimension of
accuracy
● This dimension is essential in practice as
much as there are many opportunities to
control data consistency
61
Data quality framework
Dimensions: 3) Consistency (III)
● Consistency cannot be measured directly
○ It is defined by a set of constraints
● Instead, we often measure the percentage of data which satisfy the set of constraints (and therefore deduce the rate of suspect data)
● Consistency only gives indirect proof of
accuracy
● In the context of data quality tools, address
normalisation and data profiling processes use
consistency and likelihood controls
62
Data quality framework
Dimensions: 3) Consistency (IV)
SUMMARY
Dimension: CONSISTENCY
Factors: Domain integrity, Intra-relation integrity, Inter-relation integrity
Metrics: Rule (one per factor)
63
Data quality framework
Dimensions: 3) Consistency (V)
Factors: Consistency
● Domain integrity
o Rule satisfaction over the content of an attribute
 e.g. the age of the student must be between 0 and 120 years
● Intra-relation integrity
o Rule satisfaction among attributes of the same table
 Functional dependencies
 Value dependencies
 Conditional expressions
● Inter-relation integrity
o Rule satisfaction among attributes of different tables
 Inclusion dependencies (foreign keys, referential integrity, etc.)
64
Data quality framework
Dimensions: 3) Consistency (VI)
Metrics: Consistency
● Boolean
o If data satisfies rules or not
o Granularity could be the cell or a set of cells
● Aggregation
o Integrity ratio: % of data that satisfy the rules
o Since a variety of rules can exist for the same relation (or group of relations), we generally build a weighted sum of the results after measuring those rules
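A minimal sketch of both metric styles, assuming two invented rules with equal weights:

```python
# Consistency metrics: a boolean per rule per row, then a weighted sum of
# the per-rule integrity ratios. Rules and weights are illustrative.
rows = [
    {"age": 21,  "enrolled": "2019-09-01", "graduated": "2023-06-30"},
    {"age": 140, "enrolled": "2020-09-01", "graduated": "2019-06-30"},
]

rules = [
    ("domain: 0 <= age <= 120", 0.5, lambda r: 0 <= r["age"] <= 120),
    ("intra-relation: enrolled < graduated", 0.5,
     lambda r: r["enrolled"] < r["graduated"]),  # ISO dates compare as strings
]

weighted = 0.0
for name, weight, rule in rules:
    ratio = sum(rule(r) for r in rows) / len(rows)  # integrity ratio for this rule
    print(f"{name}: {ratio:.0%}")
    weighted += weight * ratio

print(f"weighted consistency = {weighted:.0%}")     # 50%
```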
65
Data quality framework
Dimensions: 4) Relevancy
● Is the data useful for the task at hand?
● Relevancy corresponds to the usefulness of
the data
○ Database users usually access huge volumes of data
● Among all this information, it is often difficult
to identify that which is useful
○ In addition, the available data is not always adapted to
user requirements
○ For this reason users can have the impression of poor
relevancy, leading to loss of interest in the data(base)
66
Data quality framework
Dimensions: 4) Relevancy (II)
● Relevancy is very important because it plays a
crucial part in the acceptance of a data
source
● This dimension, usually evaluated by rate of
data usage, is not directly measurable by the
quality tools
67
Data quality framework
Dimensions: 4) Relevancy (III)
● It indicates how up to date the data are
o Are they current enough for our needs?
o Are they updated or obsolete?
o Do we have the most recent data?
o Do we update the data?
● It has a temporal perspective
o When were those data created/updated?
o When did we check those data?
68
Data quality framework
Dimensions: 4) Relevancy (IV)
SUMMARY
Dimension: RELEVANCY
Factors: Present, Opportunity, Volatility
Metrics: temporal, boolean, on time, frequency
69
Data quality framework
Dimensions: 4) Relevancy (V)
Factors: Relevancy
● Present
o Are the data in my information system still in force?
 A data model is a view of the entities and states of a given reality at a given moment
 e.g.
● Student data (address, email addresses, etc.)
● Grades (exercises, courses, etc.)
 We need to measure the difference between existing data and valid data
70
Data quality framework
Dimensions: 4) Relevancy (VI)
Factors: Relevancy
● Opportunity
o Did the data arrive in time for the task at hand?
 How up to date are my data for the task we have?
 The data in our information system can be recently updated yet not relevant for the current task because they arrived late
 e.g.
● Activity improvement obtained after the course has finished
● Teaching method improvement after the course has finished
 We need to measure the moment of opportunity of our data
71
Data quality framework
Dimensions: 4) Relevancy (VII)
Factors: Relevancy
● Volatility
o How unstable are my data?
 It characterizes the frequency with which my data change over time
 It is an intrinsic characteristic of the nature of the data
 e.g.
● Date of birth has zero volatility
● Average grade has high volatility
 We need to measure the time interval during which data remain valid
72
Data quality framework
Dimensions: 4) Relevancy (VIII)
Metrics: Relevancy
● Present
o Temporal: query moment minus the moment of the first modification not yet updated in the database
o Boolean: whether the data are updated or not
● Opportunity
o On time: whether the data are updated and arrived in time for the task at hand
● Volatility
o Frequency: how often changes happen
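As a hedged illustration, the four metric forms can be computed as follows; the dates, the 90-day currency policy, and the update count are invented assumptions.

```python
from datetime import date

# Relevancy metrics: temporal age, boolean currency, on-time arrival,
# and change frequency (volatility). All inputs are illustrative.
last_update = date(2015, 3, 1)
query_moment = date(2015, 6, 22)
task_deadline = date(2015, 6, 1)        # when the task needed the data

age_days = (query_moment - last_update).days   # temporal metric
is_current = age_days <= 90                    # boolean, assuming a 90-day policy
on_time = last_update <= task_deadline         # opportunity: arrived before the task

updates_last_year = 4
volatility = updates_last_year / 365           # changes per day

print(age_days, is_current, on_time, round(volatility, 4))  # 113 False True 0.011
```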
73
Data quality framework
Dimensions: 5) Uniqueness
● It indicates the duplicity level of the data
o Duplicity happens when the same entity is represented two or more times in the information system
o The same entity can be identified in different ways
 e.g. a teacher is identified by his/her email address; a student is identified by the enrollment id. But some students could in the future become teachers.
o The same entity can be represented twice due to errors in the key
 e.g. an id badly digitized
o The same entity can be repeated with different keys
 e.g. a teacher is identified by email address, but can have more than one
74
Data quality framework
Dimensions: 5) Uniqueness (II)
SUMMARY
Dimension: UNIQUENESS
Factors: No-duplicity, No-contradiction
Metrics: boolean (each factor)
75
Data quality framework
Dimensions: 5) Uniqueness (III)
Factors: Uniqueness
● No-duplicity
o There is duplicity if the same entity appears repeated
 Key values and attributes match (or are null in some tuples)
● No-contradiction
o There is contradiction if the same entity appears
repeated with different values
 Key values could be the same or not
 There are some differences in the values of some attributes
(not null)
76
Data quality framework
Dimensions: 5) Uniqueness (IV)
Metrics: Uniqueness
● Boolean
o If the data is duplicated or not
o If the data has contradictions or not
o Granularity could range from a single cell to a given set of cells
● Aggregations
o No-duplication ratio: % of data that are not duplicated
o No-contradiction ratio: % of data that are not
duplicated with contradictions
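A short Python sketch of both ratios, with invented rows: group records by key, then classify each group as duplicated (same values) or contradictory (same key, different values).

```python
from collections import defaultdict

rows = [
    {"id": "s1", "email": "a@x.org"},
    {"id": "s1", "email": "a@x.org"},   # duplicate: key and attributes match
    {"id": "s2", "email": "b@x.org"},
    {"id": "s2", "email": "c@x.org"},   # contradiction: same key, different value
    {"id": "s3", "email": "d@x.org"},
]

groups = defaultdict(list)
for row in rows:
    groups[row["id"]].append(row["email"])

duplicated = sum(len(v) for v in groups.values() if len(v) > 1 and len(set(v)) == 1)
contradictory = sum(len(v) for v in groups.values() if len(set(v)) > 1)

print(f"no-duplication ratio: {1 - duplicated / len(rows):.0%}")      # 60%
print(f"no-contradiction ratio: {1 - contradictory / len(rows):.0%}") # 60%
```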
77
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
78
Data quality plan
Quality model
[Cycle: 1. Data profiling → 2. Data cleansing → 3. Data improving → 4. Data matching]
79
Data quality plan
1) Data profiling
● It permits us to locate, measure, monitor and report data quality problems
● It is a project in itself
● Two types
o Structure
 Position
 Format
o Content
80
Data quality plan
1) Data profiling (II)
● Structure profiling
o It consists of analyzing the data without considering its meaning
o Semi-automatic and massive
o Column profiling
81
Data quality plan
1) Data profiling (III)
● Structure profiling
o Dependency profiling
o Redundancy profiling
 Referential integrity
 Foreign keys
82
Data quality plan
1) Data profiling (IV)
● Structure profiling
o Example: for a given student (see the sketch below)
 Name
● How many students have both a first and a last name?
● % of syntactic errors? (badly written)
● Consistency between the name and the sex?
 Contact phone number
● Pattern recognition: 999 999 999, 999.999.999, etc.
● Length
● Strange characters: . , -
 etc.
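The sketch below shows what column profiling of the phone example could look like in Python; the phone values and the pattern-signature idea (every digit becomes “9”) are illustrative assumptions.

```python
import re
from collections import Counter

phones = ["944 139 000", "944.139.003", "944139003", "94-413-90x"]

def signature(value):
    return re.sub(r"\d", "9", value)     # derive a pattern per value

patterns = Counter(signature(p) for p in phones)
lengths = Counter(len(p) for p in phones)
strange = [p for p in phones if re.search(r"[^\d .\-]", p)]

print(patterns)   # pattern frequencies, e.g. '999 999 999': 1
print(lengths)    # length distribution
print(strange)    # ['94-413-90x']
```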
83
Data quality plan
1) Data profiling (V)
● Content profiling
o It analyses the data and its meaning in depth
o It is specific to each field
o It is carried out in combination with dictionaries, specific data-treatment components, etc.
84
Data quality plan
2) Data cleansing
● We implement a reliable methodology of
data quality
○ Normalization
○ Deduplication
○ Standardization
● It permits:
○ Determining and separating the elements of a field, relocating them to their proper fields
○ Format standardization
○ Fixing errors within the data
○ Data enriching
85
Data quality plan
2) Data cleansing (II)
● The data is normalized so that there is a
common unit of measure for items in a class
○ For example: feet, inches, meters, etc. are
all converted to one unit of measure
○ Adapting a data item to an expected format
○ Example: NIF
■ 123456789
■ 0123456789B
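A minimal sketch of both normalizations, assuming standard length conversion factors and a purely illustrative padding rule for the NIF example (the real NIF check letter is not computed here):

```python
# Normalization: one unit of measure per class, and one expected format.
TO_METERS = {"m": 1.0, "ft": 0.3048, "in": 0.0254}

def normalize_length(value, unit):
    return value * TO_METERS[unit]

def normalize_nif(raw):
    digits = "".join(ch for ch in raw if ch.isdigit())
    letter = raw[-1] if raw[-1].isalpha() else ""
    return digits.zfill(10) + letter     # pad to the expected width

print(normalize_length(6, "ft"))         # 6 feet expressed in metres (~1.83)
print(normalize_nif("123456789B"))       # -> 0123456789B
```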
86
Data quality plan
2) Data cleansing (III)
● The data may also contain duplicate records/items and may have missing or incomplete descriptions
● Cleansing fixes misspellings, abbreviations, and errors
● The values are also standardized so that the name of each attribute is consistent
○ For example: inch, in., and the symbol “ are all shown as inch
87
Data quality plan
3) Data enriching
● Enrichment of data with more attributes,
images, and specifications
● We add some data that did not exist before
88
Data quality plan
4) Data matching
● It is used to:
o Detect duplicates → uniqueness
o Establish a relationship between two data sources that did not have linking fields before
o Identify the same entity within different sources that provide different observations
● Two types
o Deterministic
 By identifying the same code (A = A) or by a relation of codes (A = B)
o Probabilistic
 A = B with a given probability, based on assessed distances and lengths
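Both types can be sketched in a few lines; the code map, names, and the 0.85 threshold are invented for the example, with difflib standing in for a proper distance function.

```python
from difflib import SequenceMatcher

code_map = {"B": "A"}                    # relation of codes: B stands for A

def deterministic_match(a, b):
    return a == b or code_map.get(a) == b or code_map.get(b) == a

def probabilistic_match(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(deterministic_match("A", "B"))                    # True (mapped codes)
print(probabilistic_match("Alex Rayon", "Alex Rayón"))  # True, ratio = 0.9
```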
89
Data quality plan
4) Data matching (II)
● Data consolidation
o It usually consists of the fusion of two or more records into a single one
o It has traditionally been used for duplicate detection
o It is based on business rules
 Record survival
 Best record
 Best attribute of a given record
o The result is called the Golden Record
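As a hedged sketch of one possible business rule (best attribute: per field, keep the most recent non-null value), with invented observations:

```python
observations = [
    {"ts": "2015-01-10", "email": "a@x.org",     "phone": None},
    {"ts": "2015-05-02", "email": None,          "phone": "944139000"},
    {"ts": "2015-03-15", "email": "a@deusto.es", "phone": None},
]

def golden_record(records, fields=("email", "phone")):
    """Consolidate: per field, keep the newest non-null value."""
    golden = {}
    for field in fields:
        candidates = [r for r in records if r[field] is not None]
        if candidates:
            golden[field] = max(candidates, key=lambda r: r["ts"])[field]
    return golden

print(golden_record(observations))
# {'email': 'a@deusto.es', 'phone': '944139000'}
```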
90
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
91
ETL approach
Definition and characteristics
● An ETL tool is a tool that
o Extracts data from various data sources (usually
legacy data)
o Transforms data
 from → being optimized for transactions
 to → being optimized for reporting and analysis
 synchronizes the data coming from different databases
 cleanses the data to remove errors
o Loads data into a data warehouse
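Kettle expresses these stages as graphical transformation steps; purely as a language-level illustration (not Kettle code), the sketch below walks the same extract / transform / load stages in plain Python, with made-up data and SQLite standing in for the warehouse.

```python
import csv, io, sqlite3

# Extract: read from a CSV-like source (an in-memory file here).
raw = io.StringIO("student;grade\nAlice;  7,5 \nBob;6,0\n")
rows = list(csv.DictReader(raw, delimiter=";"))

# Transform: cleanse (trim, decimal comma -> point) and retype for analysis.
cleaned = [(r["student"].strip(), float(r["grade"].strip().replace(",", ".")))
           for r in rows]

# Load: write into a warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE grades (student TEXT, grade REAL)")
db.executemany("INSERT INTO grades VALUES (?, ?)", cleaned)
print(db.execute("SELECT * FROM grades").fetchall())
# [('Alice', 7.5), ('Bob', 6.0)]
```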
92
ETL approach
Why do I need it?
● ETL tools save time and money when
developing a data warehouse by removing
the need for hand-coding
● It is very difficult for database administrators to connect different brands of databases without using an external tool
● In the event that databases are altered or new
databases need to be integrated, a lot of hand-
coded work needs to be completely redone
93
ETL approach
Kettle
Project Kettle
Powerful Extraction, Transformation and
Loading (ETL) capabilities using an
innovative, metadata-driven approach
94
ETL approach
Kettle (II)
● It uses an innovative metadata-driven approach
● It has a very easy-to-use GUI
● Strong community of 13,500 registered users
● It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files
95
ETL approach
Kettle (III)
96
ETL approach
Kettle (IV)
Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
97
ETL approach
Kettle (V)
Source: Pentaho Corporation
98
ETL approach
Kettle (VI)
● Datawarehouse and datamart loads
● Data integration
● Data cleansing
● Data migration
● Data export
● etc.
99
ETL approach
Transformations
● String and Date Manipulation
● Data Validation / Business Rules
● Lookup / Join
● Calculation, Statistics
● Cryptography
● Decisions, Flow control
● Scripting
● etc.
100
ETL approach
What is good for?
● Mirroring data from master to slave
● Syncing two data sources
● Processing data retrieved from multiple
sources and pushed to multiple
destinations
● Loading data to RDBMS
● Datamart / Datawarehouse
o Dimension lookup/update step
● Graphical manipulation of data
101
Table of contents
● Introduction
● Why data quality?
● Data lifecycle
● Data quality framework
● Data quality plan
● ETL approach
● Tools
102
Tools
Source: https://www.informatica.com/data-quality-magic-quadrant.html#fbid=ln22wvh0trz
103
Tools (II)
Interactive Data Transformation Tools (IDTs)
1. Pentaho Data Integration: Kettle PDI
2. Talend Open Studio
3. DataCleaner
4. Talend Data Quality
5. Google Refine
6. Data Wrangler
7. Potter's Wheel ABC
104
References
[CampbellOblinger2007] Campbell, John P., Peter B. DeBlois, and Diana G. Oblinger. "Academic analytics: A new tool for a new era." Educause Review 42.4 (2007): 40.
[Clow2012] Clow, Doug. "The learning analytics cycle: closing the loop effectively." Proceedings of the 2nd International Conference on Learning Analytics and Knowledge. ACM, 2012.
[Cooper2012] Cooper, Adam. "What is analytics? Definition and essential characteristics." CETIS Analytics Series 1.5 (2012): 1-10.
[DronAnderson2009] Dron, J., and Anderson, T. "On the design of collective applications." Proceedings of the 2009 International Conference on Computational Science and Engineering, 4 (2009): 368-374.
[Dyckhoff2010] Dyckhoff, Anna Lea, et al. "Design and Implementation of a Learning Analytics Toolkit for Teachers." Educational Technology & Society 15.3 (2012): 58-76.
[Eaton2012] Eaton, Chris, Dirk Deroos, Tom Deutsch, George Lapis, and Paul Zikopoulos. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, p. XV. McGraw-Hill, 2012.
[Eli11] Elias, Tanya. "Learning analytics: definitions, processes and potential." 2011.
[GayPryke2002] du Gay, Paul, and Michael Pryke (eds.). Cultural Economy: Cultural Analysis and Commercial Life (Culture, Representation and Identity series). 2002.
105
References (II)
[HR2012] NMC Horizon Report 2012. http://www.nmc.org/publications/horizon-report-2012-higher-ed-edition
[Jenkins2013] BBC Radio 4, Start the Week, "Big Data and Analytics." First broadcast 11 February 2013. http://www.bbc.co.uk/programmes/b01qhqfv
[Khan2012] http://www.emergingedtech.com/2012/04/exploring-the-khan-academys-use-of-learning-data-and-learning-analytics/
[LACE2013] Learning Analytics Community Exchange. http://www.laceproject.eu/
[LAK2011] 1st International Conference on Learning Analytics and Knowledge, 27 February - 1 March 2011, Banff, Alberta, Canada. https://tekri.athabascau.ca/analytics/
[Mazza2012] Mazza, Riccardo, Marco Bettoni, Marco Faré, and Luca Mazzola. "MOCLog: Monitoring Online Courses with log data." Proceedings of the 1st Moodle Research Conference. 2012.
[Reinmann2006] Reinmann, G. "Understanding e-learning: an opportunity for Europe?" European Journal of Vocational Training, 38 (2006): 27-42.
[SiemensBaker2012] Siemens, G., and Baker, R. "Learning Analytics and Educational Data Mining: Towards Communication and Collaboration." Learning Analytics and Knowledge 2012. Available at http://users.wpi.edu/~rsbaker/LAKs%20reformatting%20v2.pdf
Utilizando Google Drive y Google Docs en el aula para trabajar con mis estudi...
 
Procesamiento y visualización de datos para generar nuevo conocimiento
Procesamiento y visualización de datos para generar nuevo conocimientoProcesamiento y visualización de datos para generar nuevo conocimiento
Procesamiento y visualización de datos para generar nuevo conocimiento
 
El Big Data y Business Intelligence en mi empresa: ¿de qué me sirve?
El Big Data y Business Intelligence en mi empresa: ¿de qué me sirve?El Big Data y Business Intelligence en mi empresa: ¿de qué me sirve?
El Big Data y Business Intelligence en mi empresa: ¿de qué me sirve?
 

Enhancing educational data quality in heterogeneous learning contexts using Pentaho Data Integration

  • 22. 22 Why data quality? Development of common language for data exchange (II) ● The most difficult challenges with achieving interoperability are typically found in establishing common meanings to the data ● Sometimes this is a matter of technical precision o But culture – regional, sector-specific, and institutional – and habitual practices also affect meaning
  • 23. 23 Why data quality? Development of common language for data exchange (III) ● Potential benefits o Efficiency and timeliness  No need for a person to intervene to re-enter, re-format or transform data o Independence  Resilience o Adaptability  Faster, cheaper and less disruptive to change o Innovation and market growth  Interoperability combined with modularity makes it easier to build IT systems that are better matched to local culture without needing to create and maintain numerous whole systems
  • 24. 24 Why data quality? Development of common language for data exchange (IV) ● Potential benefits o Durability of data  Structures and formats change over time  The changes are rarely properly documented o Aggregation  Data joining might be supported by a common set of definitions around course structure, combined with a unified identification scheme o Sharing  Especially when there are multiple parties involved
  • 25. 25 Why data quality? Development of common language for data exchange (V) [LACE2013]
  • 26. 26 Why data quality? Development of common language for data exchange (VI) [LACE2013] In our case?
  • 27. 27 Why data quality? Development of common language for data exchange (VII) [LACE2013] In our case?
  • 28. 28 Why data quality? Importance ● Data quality emerged as an academic research theme in the early 90s ● In large companies, awareness of the importance of quality is much more recent ● It is at the core of any business process where data is the main asset ○ Why? ■ Poor decision-making ■ Time spent fixing the errors ■ ...
  • 29. 29 Why data quality? Meaning ● The primary meaning of data quality is data suitable for a particular purpose ○ Fitness for use ○ Conformance to requirements ○ A relative term depending on the customers’ needs ● Therefore the same data can be evaluated to varying degrees of quality according to users’ needs Source: http://mitiq.mit.edu/iciq/pdf/an%20evaluation%20framework%20for%20data%20quality%20tools.pdf
  • 30. 30 Why data quality? Meaning (II) ● How well the representation model lines up with the reality of business processes in the real world [Agosta2000] ● The different ways in which the project leader, the end-user or the database administrator evaluate data integrity produce a large number of quality dimensions
  • 31. 31 Why data quality? Where are problems generated? ● Data entry ● External data integration ● Loading errors ● Data migrations ● New applications
  • 32. 32 Table of contents ● Introduction ● Why data quality? ● Data lifecycle ● Data quality framework ● Data quality plan ● ETL approach ● Tools
  • 35. 35 Data lifecycle Knowledge Discovery in Databases (II) [Diagram: data sources (SQL, XML, CSV, ...) → data management / integration → data cycle / process → data model → dashboard / report / API]
  • 36. 36 Table of contents ● Introduction ● Why data quality? ● Data lifecycle ● Data quality framework ● Data quality plan ● ETL approach ● Tools
  • 37. 37 Data quality framework Measuring data quality ● A vast number of bibliographic references address the definition of criteria for measuring data quality ● Criteria are usually classified into quality dimensions ○ [Berti1999] ○ [Huang1998] ○ [Olson2003] ○ [Redman2001] ○ [Wang2006]
  • 38. 38 Data quality framework Quality concepts hierarchy: Dimension → Factor → Metric → Method
  • 39. 39 Data quality framework Dimensions ● A dimension captures a facet (at a high level) of the quality ○ Completeness ○ Accuracy ○ Consistency ○ Relevancy ○ Uniqueness [Goasdoué2007]
  • 40. 40 Data quality framework Dimensions (II) QUALITY INDICATORS ● Completeness: Do I have all the information? ● Accuracy: Is my dataset valid? ● Consistency: Are there conflicts within my data? ● Relevancy: Is my data useful? ● Uniqueness: Do I have repeated information?
  • 41. 41 Data quality framework Quality factors ● Freshness: validity, age, volatility, opportunity, obsolescence, etc. ● Completeness: density, coverage, sufficiency, etc. ● Data quantity: volume, data quantity, etc. ● Interpretation: traceability, appearance, presentation, modifiability, etc. ● Understanding: clarity, meaning, readability, comparability, etc. ● Concise representation: uniqueness, minimality, etc. ● Consistent representation: format, syntax, aliases, semantics, version control, etc.
  • 42. 42 Data quality framework Quality metrics ● A metric is the tool that permits us to measure a quality factor ● We must define ○ The semantics (how it is measured) ■ e.g. amount of null values, time elapsed since the last update ○ The measurement units ■ e.g. response time in ms, volume in GB, transactions/sec ○ The measurement granularity ■ e.g. error count in the whole table or in one attribute ■ Usual granularities: cell, tuple, attribute, view, table, etc.
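To make those three elements concrete, here is a minimal Python sketch of one metric whose semantics is "amount of null values", whose unit is a ratio, and whose granularity is a single attribute; the table and column names are assumptions of the example, not part of the deck:

```python
# Minimal sketch of a quality metric:
#   semantics   = amount of null values
#   unit        = ratio in [0, 1]
#   granularity = one attribute (column)
# The "students" rows and the postal_code column are illustrative only.
students = [
    {"student_id": 1, "postal_code": "48007"},
    {"student_id": 2, "postal_code": None},
    {"student_id": 3, "postal_code": None},
]

def null_ratio(rows, attribute):
    """Fraction of null values of one attribute, measured over all rows."""
    values = [row.get(attribute) for row in rows]
    return sum(v is None for v in values) / len(values)

print(null_ratio(students, "postal_code"))  # 0.666... (2 nulls out of 3)
```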
  • 43. 43 Data quality framework Quality methods ● A method is a process that implements a metric ● It is responsible for obtaining a set of measurements (in relation to a metric) for a given database ● The method implementation depends on the application and on the database structure o e.g. to measure the time since the last update we can  Use database timestamps  Access the update logs  Compare versions of the database
  • 44. 44 Data quality framework Dimensions: 1) Completeness ● Is a concept missing? ● Are there missing values in a column or in a table? ● Example ○ Empty postal codes in 50% of the records
  • 45. 45 Data quality framework Dimensions: 1) Completeness (II) ● Extensity o The number of real-world entities/states represented for solving our problem ● Intensity o The amount of data held for each entity/state of the data model
  • 46. 46 Data quality framework Dimensions: 1) Completeness (III) SUMMARY ● Dimension: Completeness ● Factors: Density, Coverage ● Metrics: ratio (one per factor)
  • 47. 47 Data quality framework Dimensions: 1) Completeness (IV) ● Density o How much information about my entities do I have in my information system? o We need to measure the quantity of information and the gap o Some interpretations of missing values  The value exists but I do not know it  The value does not exist  I do not know whether it exists Factors: Completeness
  • 48. 48 Data quality framework Dimensions: 1) Completeness (V) ● Coverage o How many entities does my information system contain?  Closed world: a table contains all the states  Open world: a table contains some of the states o We need to measure how much real-world data my information system contains o Examples  Of all my students, how many do I know about?  What percentage of learning activities are registered in my database? Factors: Completeness
  • 49. 49 Data quality framework Dimensions: 1) Completeness (VI) ● Density factor ○ Density ratio: % of non-null values ● Coverage factor ○ Coverage ratio: % of real-world data represented within the data model ● Improvement opportunities ○ Crosschecking or external data acquisition ○ Imputation with statistical models ○ Statistical smoothing techniques Metrics: Completeness
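As a hedged illustration of both ratios, the sketch below computes a density ratio (non-null cells) and a coverage ratio against a closed-world reference list of enrolled students; all names and data are assumptions of the example:

```python
# Density ratio: % of non-null cells in the table.
# Coverage ratio: % of real-world entities (here: enrolled students)
# actually present in the table. Data are illustrative only.
students = [
    {"id": "s1", "email": "a@example.org", "grade": 7.5},
    {"id": "s2", "email": None,            "grade": 6.0},
    {"id": "s3", "email": "c@example.org", "grade": None},
]
enrolled_ids = {"s1", "s2", "s3", "s4", "s5"}  # closed-world reference list

cells = [value for row in students for value in row.values()]
density_ratio = sum(v is not None for v in cells) / len(cells)

stored_ids = {row["id"] for row in students}
coverage_ratio = len(stored_ids & enrolled_ids) / len(enrolled_ids)

print(f"density = {density_ratio:.2f}, coverage = {coverage_ratio:.2f}")
# density = 0.78, coverage = 0.60
```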
  • 50. 50 Data quality framework Dimensions: 1) Completeness (VII) ● Completeness applies to values of items and to columns of a table (no missing values in a column) or even to an entire table (no missing tuples in the table) ● Great attention is paid to completeness issues where they are essential to the correct execution of data processes ○ For example: the correct aggregation of learning activities requires the presence of all activity lines
  • 51. 51 Data quality framework Dimensions: 2) Accuracy ● Closeness between a value v and a value v’ considered the correct representation of the reality that v aims to portray ● It indicates the absence of errors in the data ● It covers aspects that are intrinsic to the data and aspects of its representation (format, accuracy, etc.)
  • 52. 52 Data quality framework Dimensions: 2) Accuracy (II) SUMMARY ● Dimension: Accuracy ● Factors: Semantic, Syntactic, Representation ● Metrics: boolean, degrees, deviation, scale, standard deviation, granularity
  • 53. 53 Data quality framework Dimensions: 2) Accuracy (III) ● Semantic accuracy o The closeness between a value v and a real value v’ o We need to measure how well real-world states are represented in the information system o Some problems that may arise  Data that do not correspond to any real-world state ● e.g. a student that does not exist  Data that correspond to a wrong real-world state ● e.g. data that do not refer to the proper student  Data with errors in some attributes ● e.g. data that refer to the correct student but with some wrong attribute Factors: Accuracy
  • 54. 54 Data quality framework Dimensions: 2) Accuracy (IV) ● Syntactic accuracy o It refers to the closeness that exists between a value v and the elements of the domain D o We need to know whether v corresponds to a correct value within D, leaving aside whether it corresponds to a real-world value o Some problems that may arise  Value errors: out-of-range values, spelling errors, etc. ● e.g. “Smiht” instead of “Smith” for the last name of a student ● e.g. an age of 338 years  Standardization errors: ● e.g. “0” or “1” for gender, instead of “M” or “F” ● e.g. amounts in a foreign currency instead of € Factors: Accuracy
  • 55. 55 ● Boolean ○ Whether the data satisfies the rules or not ● Standard deviation ○ Whether the accuracy error is within the standard deviation or not Metrics: Accuracy Data quality framework Dimensions: 2) Accuracy (V)
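A small sketch of the boolean metric for syntactic accuracy follows; the validation rules mirror the deck's examples (age range, gender codes) but are assumptions of this example, and it also shows why "Smiht" needs a semantic check rather than a syntactic one:

```python
import re

# Boolean syntactic-accuracy metric per cell, aggregated into a ratio.
# The rules below are illustrative assumptions, not a fixed standard.
rules = {
    "last_name": lambda v: bool(re.fullmatch(r"[A-Za-z' -]+", str(v))),
    "age":       lambda v: isinstance(v, int) and 0 <= v <= 120,
    "gender":    lambda v: v in {"M", "F"},
}
records = [
    {"last_name": "Smith", "age": 21,  "gender": "M"},
    {"last_name": "Smiht", "age": 338, "gender": "0"},
]

checks = [rule(rec[field]) for rec in records for field, rule in rules.items()]
print(f"syntactic accuracy = {sum(checks) / len(checks):.2f}")  # 0.67
# Note: "Smiht" passes the syntactic rule; only a dictionary of valid
# last names (a semantic-accuracy check) would flag it.
```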
  • 56. 56 Data quality framework Dimensions: 2) Accuracy (VI) Referentials vs. dictionaries ● A referential is a <key, value> pair used to verify semantic accuracy: the key represents an element or a state of the real world, and the value represents an attribute of that element ● A dictionary is a list of valid values for a given domain, used to verify syntactic accuracy
  • 57. 57 Data quality framework Dimensions: 2) Accuracy (VII) ● It is often connected to precision, reliability and veracity ○ In the case of a phone number, for instance, precision and accuracy are equivalent ● In practice, despite the attention given to completeness, accuracy is often a poorly reported criterion since it is difficult to measure and often leads to high repair costs ● This is due to the fact that accuracy control and improvement requires external reference data
  • 58. 58 Data quality framework Dimensions: 2) Accuracy (VIII) ● In practice, this comes down to comparing actual data to a true counterpart (for example by using a survey) ● The high costs of such tasks lead to less ambitious verifications such as consistency controls (for example, French personal phone numbers must begin with 01, 02, 03, 04 or 05) or likelihood controls (disproportionate ratios of men versus women)
  • 59. 59 Data quality framework Dimensions: 3) Consistency ● Data are consistent if they respect a set of constraints ● Data must satisfy some semantic rules ○ Integrity rules ■ All the database instances must satisfy properties ○ User rules ■ Not implemented in the database, but needed for a given application ● Improvement opportunities ○ Definition of a control strategy ○ Comparison with another, apparently more reliable, source
  • 60. 60 Data quality framework Dimensions: 3) Consistency (II) ● A consistency factor is based on a rule, for example a business rule such as “town address must belong to the set of French towns” or “invoicing must correspond to electric power consumption” ○ Consistency can be viewed as a sub-dimension of accuracy ● This dimension is essential in practice, as there are many opportunities to control data consistency
  • 61. 61 Data quality framework Dimensions: 3) Consistency (III) ● Consistency cannot be measured directly ○ It is defined by a set of constraints ● Instead, we often measure the percentage of data which satisfy the set of constraints (and therefore deduce the rate of suspect data) ● Consistency only gives indirect proof of accuracy ● In the context of data quality tools, address normalisation and data profiling processes use consistency and likelihood controls
  • 62. 62 Data quality framework Dimensions: 3) Consistency (IV) SUMMARY ● Dimension: Consistency ● Factors: Domain integrity, Intra-relation integrity, Inter-relation integrity ● Metrics: rule (one per factor)
  • 63. 63 Data quality framework Dimensions: 3) Consistency (V) Factors: Consistency ● Domain integrity o Rule satisfaction over the content of an attribute  e.g. the age of the student must be between 0 and 120 years ● Intra-relation integrity o Rule satisfaction among attributes of the same table  Functional dependencies  Value dependencies  Conditional expressions ● Inter-relation integrity o Rule satisfaction among attributes of different tables  Inclusion dependencies (foreign keys, referential integrity, etc.)
  • 64. 64 Data quality framework Dimensions: 3) Consistency (VI) Metrics: Consistency ● Boolean o Whether the data satisfies the rules or not o Granularity could be the cell or a set of cells ● Aggregation o Integrity ratio: % of data that satisfy the rules o Since a variety of rules can exist for the same relation (or group of relations), we generally build a weighted sum of the results of measuring those rules
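The sketch below illustrates one rule per integrity factor and the weighted integrity ratio; the rules, weights and data are assumptions of the example:

```python
# Consistency sketch: domain, intra-relation and inter-relation rules,
# combined into a weighted integrity ratio. All values are illustrative.
students   = [{"id": "s1", "birth_year": 1996, "enrol_year": 2014},
              {"id": "s2", "birth_year": 2010, "enrol_year": 2005}]
enrolments = [{"student_id": "s1", "course": "LA101"},
              {"student_id": "s9", "course": "LA101"}]  # dangling foreign key

def ratio(rows, rule):
    """Share of rows satisfying a boolean rule."""
    return sum(rule(r) for r in rows) / len(rows)

domain = ratio(students, lambda s: 1900 <= s["birth_year"] <= 2015)   # domain integrity
intra  = ratio(students, lambda s: s["enrol_year"] > s["birth_year"]) # intra-relation
known  = {s["id"] for s in students}
inter  = ratio(enrolments, lambda e: e["student_id"] in known)        # inter-relation

weights = {"domain": 0.2, "intra": 0.3, "inter": 0.5}  # assumed weights
integrity = (weights["domain"] * domain + weights["intra"] * intra
             + weights["inter"] * inter)
print(f"weighted integrity ratio = {integrity:.2f}")  # 0.60
```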
  • 65. 65 Data quality framework Dimensions: 4) Relevancy ● Is the data useful for the task at hand? ● Relevancy corresponds to the usefulness of the data ○ Database users usually access huge volumes of data ● Among all this information, it is often difficult to identify that which is useful ○ In addition, the available data is not always adapted to user requirements ○ For this reason users can have the impression of poor relevancy, leading to loss of interest in the data(base)
  • 66. 66 Data quality framework Dimensions: 4) Relevancy (II) ● Relevancy is very important because it plays a crucial part in the acceptance of a data source ● This dimension, usually evaluated by rate of data usage, is not directly measurable by the quality tools
  • 67. 67 Data quality framework Dimensions: 4) Relevancy (III) ● It indicates how up to date the data is o Is it current enough for our needs? o Is it updated or obsolete? o Do we have the most recent data? o Do we update the data? ● It has a temporal perspective o When were those data created/updated? o When did we last check those data?
  • 68. 68 Data quality framework Dimensions: 4) Relevancy (IV) SUMMARY ● Dimension: Relevancy ● Factors: Present, Opportunity, Volatility ● Metrics: temporary and boolean (Present); on time (Opportunity); frequency (Volatility)
  • 69. 69 Data quality framework Dimensions: 4) Relevancy (V) Factors: Relevancy ● Present o Are the data in my information system still in force?  A data model is a view of the entities and states of a given reality at a given moment  e.g. ● Student data (address, email addresses, etc.) ● Grades (exercises, courses, etc.)  We need to measure the difference between existing data and valid data
  • 70. 70 Data quality framework Dimensions: 4) Relevancy (VI) Factors: Relevancy ● Opportunity o Did the data arrive in time for the task at hand?  The data in our information system may be recently updated and yet irrelevant for the task in force because it arrived too late  e.g. ● An activity improvement obtained only after the course has finished ● A teaching-method improvement identified after the course has finished  We need to measure the moment of opportunity of our data
  • 71. 71 Data quality framework Dimensions: 4) Relevancy (VII) Factors: Relevancy ● Volatility o How unstable are my data?  It characterizes the frequency with which my data change over time  It is an intrinsic characteristic of the nature of the data  e.g. ● Birth date has zero volatility ● Average grade has high volatility  We need to measure the time interval during which data remain valid
  • 72. 72 Data quality framework Dimensions: 4) Relevancy (VIII) Metrics: Relevancy ● Present o Temporal: the time elapsed between the query moment and the last modification recorded in the database o Boolean: the data is updated or not ● Opportunity o On time: whether the data is updated and arrived on time for the task in force ● Volatility o Frequency: how often changes happen
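As a sketch of the temporal and boolean metrics for the Present factor, the snippet below computes the age of a record at query time and flags it against a validity window; the timestamps and the 30-day window are assumptions of the example:

```python
from datetime import datetime, timedelta

# Freshness sketch: temporal metric = query moment minus last modification;
# boolean metric = is the record still within its validity window?
last_update = datetime(2015, 5, 20, 10, 0)   # illustrative timestamps
query_time  = datetime(2015, 6, 22, 9, 0)

age = query_time - last_update               # temporal metric
is_current = age <= timedelta(days=30)       # boolean metric

print(f"age = {age.days} days, current = {is_current}")  # age = 32 days, current = False
```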
  • 73. 73 Data quality framework Dimensions: 5) Uniqueness ● It indicates the duplication level of the data o Duplication happens when the same entity is represented two or more times in the information system o The same entity can be identified in different ways  e.g. a teacher is identified by his/her email address and a student by the enrollment id, but some students could become teachers in the future o The same entity can be represented twice due to errors in the key  e.g. a badly digitized id o The same entity can be repeated with different keys  e.g. a teacher is identified by email address but can have more than one
  • 74. 74 Data quality framework Dimensions: 5) Uniqueness (II) SUMMARY ● Dimension: Uniqueness ● Factors: No-duplicity, No-contradiction ● Metrics: boolean (one per factor)
  • 75. 75 Data quality framework Dimensions: 5) Uniqueness (III) Factors: Uniqueness ● No-duplicity o There is duplicity if the same entity appears repeated  Key values and attributes match (or are null in some tuples) ● No-contradiction o There is contradiction if the same entity appears repeated with different values  Key values may or may not be the same  There are differences in the values of some (non-null) attributes
  • 76. 76 Data quality framework Dimensions: 5) Uniqueness (IV) Metrics: Uniqueness ● Boolean o Whether the data is duplicated or not o Whether the data has contradictions or not o Granularity can range from a single cell to a given set of cells ● Aggregations o No-duplication ratio: % of data that are not duplicated o No-contradiction ratio: % of data that are not duplicated with contradictory values
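A sketch of both aggregations: records are grouped by key, exact repetitions count as duplicates, and same-key groups with different values count as contradictions; keys and data are assumptions of the example:

```python
from collections import defaultdict

# Uniqueness sketch: compute no-duplication and no-contradiction ratios.
records = [
    ("t1", "alex@example.org"),
    ("t1", "alex@example.org"),   # duplicate: same key, same attributes
    ("t2", "ana@example.org"),
    ("t2", "ana.b@example.org"),  # contradiction: same key, different value
    ("t3", "jon@example.org"),
]

groups = defaultdict(list)
for key, email in records:
    groups[key].append(email)

duplicates   = sum(len(v) - len(set(v)) for v in groups.values())
contradicted = sum(len(v) for v in groups.values() if len(set(v)) > 1)

print(f"no-duplication ratio   = {1 - duplicates / len(records):.2f}")    # 0.80
print(f"no-contradiction ratio = {1 - contradicted / len(records):.2f}")  # 0.60
```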
  • 77. 77 Table of contents ● Introduction ● Why data quality? ● Data lifecycle ● Data quality framework ● Data quality plan ● ETL approach ● Tools
  • 78. 78 Data quality plan Quality model (cycle): 1) Data profiling → 2) Data cleansing → 3) Data improving → 4) Data matching
  • 79. 79 Data quality plan 1) Data profiling ● It permits us to locate, measure, monitor and report data quality problems ● It is a project in itself ● Two types o Structure  Position  Format o Content
  • 80. 80 Data quality plan 1) Data profiling (II) ● Structure profiling o It consists of analysing the data without considering its meaning o Semi-automatic and massive o Column profiling
  • 81. 81 Data quality plan 1) Data profiling (III) ● Structure profiling o Dependency profiling o Redundancy profiling  Referential integrity  Foreign keys
  • 82. 82 Data quality plan 1) Data profiling (IV) ● Structure profiling o Example: for a given student  Name ● How many students have both a first and a last name? ● % of syntactic errors (badly written)? ● Consistency between the name and the gender?  Contact phone number ● Pattern recognition: 999 999 999, 999.999.999, etc. ● Length ● Strange characters: . , - etc.
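A minimal sketch of the pattern-recognition step for column profiling: each value is reduced to a pattern (digit → 9, letter → A) and pattern frequencies are counted so that outliers stand out; the sample phone numbers are assumptions of the example:

```python
import re
from collections import Counter

# Column-profiling sketch: reduce phone numbers to patterns and count them,
# as in the deck's "999 999 999 - 999.999.999" example.
phones = ["944 139 000", "944.139.003", "944 139 050", "+34 944139003x"]

def pattern(value):
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

profile = Counter(pattern(p) for p in phones)
for pat, count in profile.most_common():
    print(f"{pat!r}: {count}")
# '999 999 999': 2, '999.999.999': 1, '+99 999999999A': 1  -> candidates to inspect
```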
  • 83. 83 Data quality plan 1) Data profiling (V) ● Content profiling o It analyses the data and its meaning in depth o It is specific to each field o It is carried out in combination with dictionaries, specific data-treatment components, etc.
  • 84. 84 Data quality plan 2) Data cleansing ● We implement a reliable data quality methodology ○ Normalization ○ Deduplication ○ Standardization ● It permits us to: ○ Identify the elements of a field and relocate them to their proper fields ○ Standardize formats ○ Fix errors within the data ○ Enrich the data
  • 85. 85 Data quality plan 2) Data cleansing (II) ● The data is normalized so that there is a common unit of measure for items in a class ○ For example: feet, inches, meters, etc. are all converted to one unit of measure ○ Normalization also means adapting a value to an expected format ○ Example: NIF (Spanish tax ID) ■ 123456789 ■ 0123456789B
  • 86. 86 Data quality plan 2) Data cleansing (III) ● Data may also contain duplicate records/items and missing or incomplete descriptions ● Cleansing fixes misspellings, abbreviations, and errors ● The values are also standardized so that the name of each attribute is consistent ● For example: inch, in., and the symbol “ are all shown as inch
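A sketch of that standardization step: unit synonyms are mapped to one canonical spelling, as in the inch example above; the synonym table is an assumption of this example:

```python
# Standardization sketch: map unit synonyms to a canonical form.
CANONICAL_UNITS = {
    "inch": "inch", "in.": "inch", '"': "inch",
    "ft": "feet", "feet": "feet", "'": "feet",
}

def standardize_unit(raw):
    unit = raw.strip().lower()
    return CANONICAL_UNITS.get(unit, unit)  # leave unknown units untouched

print([standardize_unit(u) for u in ["in.", '"', "Inch", "ft"]])
# ['inch', 'inch', 'inch', 'feet']
```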
  • 87. 87 Data quality plan 3) Data enriching ● Enrichment of data with more attributes, images, and specifications ● We add some data that did not exist before
  • 88. 88 Data quality plan 4) Data matching ● It is used to: o Detect duplicates → uniqueness o Establish a relationship between two data sources that did not have linking fields before o Identify the same entity across different sources that provide different observations ● Two types o Deterministic  By identifying the same code (A = A) or by a relation of codes (A = B) o Probabilistic  A = B with a given probability, based on assessed distances and lengths
  • 89. 89 Data quality plan 4) Data matching (II) ● Data consolidation o It usually consists of merging two or more records into a single one o It has traditionally been used for duplicate detection o It is based on business rules  Record survival  Best record  Best attribute of a given record o The result is called the Golden Record
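As a sketch of probabilistic matching followed by consolidation, the snippet below declares a match when name similarity exceeds a threshold and then builds a golden record with a simple best-attribute survival rule; the threshold, survival rule and records are assumptions of the example:

```python
from difflib import SequenceMatcher

# Probabilistic matching sketch: A = B in a given % of string similarity,
# then consolidation into a "golden record" (most complete attribute wins).
a = {"name": "Alex Rayon Jerez", "email": "alex@example.org", "phone": None}
b = {"name": "Alex Rayón Jerez", "email": None, "phone": "+34 944 139 000"}

similarity = SequenceMatcher(None, a["name"], b["name"]).ratio()
if similarity > 0.85:  # assumed matching threshold
    golden = {k: a[k] if a[k] is not None else b[k] for k in a}
    print(f"match ({similarity:.2f}): {golden}")
```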
  • 90. 90 Table of contents ● Introduction ● Why data quality? ● Data lifecycle ● Data quality framework ● Data quality plan ● ETL approach ● Tools
  • 91. 91 ETL approach Definition and characteristics ● An ETL tool is a tool that o Extracts data from various data sources (usually legacy data) o Transforms data  from → being optimized for transactions  to → being optimized for reporting and analysis  synchronizes the data coming from different databases  cleanses the data to remove errors o Loads data into a data warehouse
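To make the three steps concrete, here is a minimal extract-transform-load pipeline in plain Python, standing in for what a tool like Kettle does graphically; the file name, schema and cleansing rule are assumptions of the example, not part of PDI:

```python
import csv
import sqlite3

# Minimal illustrative ETL pipeline: CSV -> cleansed rows -> SQLite table.
def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        grade = row.get("grade", "").strip().replace(",", ".")  # fix locale format
        if grade:                                               # drop empty grades
            yield row["student_id"], float(grade)

def load(rows, db="warehouse.db"):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS grades (student_id TEXT, grade REAL)")
    con.executemany("INSERT INTO grades VALUES (?, ?)", rows)
    con.commit()
    con.close()

# Usage (assuming a grades.csv with student_id and grade columns):
# load(transform(extract("grades.csv")))
```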
  • 92. 92 ETL approach Why do I need it? ● ETL tools save time and money when developing a data warehouse by removing the need for hand-coding ● It is very difficult for database administrators to connect different brands of databases without using an external tool ● In the event that databases are altered or new databases need to be integrated, a lot of hand-coded work needs to be completely redone
  • 93. 93 ETL approach Kettle Project Kettle: powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach
  • 94. 94 ETL approach Kettle (II) ● It uses an innovative metadata-driven approach ● It has a very easy-to-use GUI ● Strong community of 13,500 registered users ● It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files
  • 96. 96 ETL approach Kettle (IV) Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
  • 97. 97 ETL approach Kettle (V) Source: Pentaho Corporation
  • 98. 98 ETL approach Kettle (VI) ● Data warehouse and data mart loads ● Data integration ● Data cleansing ● Data migration ● Data export ● etc.
  • 99. 99 ETL approach Transformations ● String and Date Manipulation ● Data Validation / Business Rules ● Lookup / Join ● Calculation, Statistics ● Cryptography ● Decisions, Flow control ● Scripting ● etc.
  • 100. 100 ETL approach What is good for? ● Mirroring data from master to slave ● Syncing two data sources ● Processing data retrieved from multiple sources and pushed to multiple destinations ● Loading data to RDBMS ● Datamart / Datawarehouse o Dimension lookup/update step ● Graphical manipulation of data
  • 101. 101 Table of contents ● Introduction ● Why data quality? ● Data lifecycle ● Data quality framework ● Data quality plan ● ETL approach ● Tools
  • 103. 103 Tools (II) Interactive Data Transformation Tools (IDTs) 1. Pentaho Data Integration: Kettle PDI 2. Talend Open Studio 3. DataCleaner 4. Talend Data Quality 5. Google Refine 6. Data Wrangler 7. Potter's Wheel ABC
  • 104. 104 References [CampbellOblinger2007] Campbell, John P., Peter B. DeBlois, and Diana G. Oblinger. "Academic analytics: A new tool for a new era." Educause Review 42.4 (2007): 40. [Clow2012] Clow, Doug. "The learning analytics cycle: closing the loop effectively." Proceedings of the 2nd International Conference on Learning Analytics and Knowledge. ACM, 2012. [Cooper2012] Cooper, Adam. "What is analytics? Definition and essential characteristics." CETIS Analytics Series 1.5 (2012): 1-10. [DA09] J. Dron and T. Anderson. On the design of collective applications. Proceedings of the 2009 International Conference on Computational Science and Engineering, 04:368–374, 2009. [DronAnderson2009] Dron, J., & Anderson, T. (2009). On the design of collective applications. In Proceedings of the 2009 International Conference on Computational Science and Engineering, 4, 368–374. [Dyckhoff2010] Dyckhoff, Anna Lea, et al. "Design and Implementation of a Learning Analytics Toolkit for Teachers." Educational Technology & Society 15.3 (2012): 58-76. [Eaton2012] Chris Eaton, Dirk Deroos, Tom Deutsch, George Lapis & Paul Zikopoulos, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data”, p.XV. McGraw-Hill, 2012. [Eli11] Tanya Elias. Learning analytics: definitions, processes and potential. 2011. [GayPryke2002] Cultural Economy: Cultural Analysis and Commercial Life (Culture, Representation and Identity series) Paul du Gay (Editor), Michael Pryke. 2002.
  • 105. 105 References (II) [HR2012] NMC Horizon Report 2012 http://www.nmc.org/publications/horizon-report-2012-higher-ed-edition [Jenkins2013] BBC Radio 4, Start the Week, Big Data and Analytics, first broadcast 11 February 2013 http://www.bbc.co.uk/programmes/b01qhqfv [Khan2012] http://www.emergingedtech.com/2012/04/exploring-the-khan-academys-use-of-learning-data-and-learning-analytics/ [LACE2013] Learning Analytics Community Exchange http://www.laceproject.eu/ [LAK2011] 1st International Conference on Learning Analytics and Knowledge, 27 February - 1 March 2011, Banff, Alberta, Canada https://tekri.athabascau.ca/analytics/ [Mazza2006] Mazza, Riccardo, et al. "MOCLog–Monitoring Online Courses with log data." Proceedings of the 1st Moodle Research Conference. 2012. [Mazza2012] Mazza, Riccardo, Marco Bettoni, Marco Faré, and Luca Mazzola. "MOCLog–Monitoring Online Courses with log data." 2012. [Reinmann2006] Reinmann, G. (2006). Understanding e-learning: an opportunity for Europe? European Journal of Vocational Training, 38, 27-42. [SiemensBaker2012] Siemens & Baker (2012). Learning Analytics and Educational Data Mining: Towards Communication and Collaboration. Learning Analytics and Knowledge 2012. Available in .pdf format at http://users.wpi.edu/~rsbaker/LAKs%20reformatting%20v2.pdf
  • 106. Enhancing educational data quality in heterogeneous learning contexts using Pentaho Data Integration Learning Analytics Summer Institute, 2015 Alex Rayón Jerez @alrayon, alex.rayon@deusto.es June, 22nd, 2015