The tutorial was presented at CAISE 2010. It discusses the state of the art in research addressing the quality of data at the conceptual level (conceptual schemas) and of ontologies.
Information Quality in the Web Era
1. Tutorial at CAISE 2010
Information Quality in the Web Era
C. Batini & Matteo Palmonari
Department of Computer Science,
Communication and Systems
University of Milano Bicocca
[batini;palmonari]@disco.unimib.it
2. Outline
• Motivation [Palmonari]
– the Web era / information quality meeting
ontologies / the ontology landscape /
• Quality of data (conceptual level) [Batini]
– frameworks / metamodels / dimensions /
metrics / groups of schemas
• Quality of ontologies [Palmonari]
– frameworks / metamodels / dimensions /
metrics
• Conclusions [Palmonari+Batini]
3. Outline
• Motivation:
– the Web era / information quality meeting
ontologies / the ontology landscape /
• Quality of data (conceptual level)
– frameworks / metamodels / dimensions /
metrics / groups of schemas
• Quality of ontologies
– frameworks / metamodels / dimensions /
metrics
• Conclusions
4. the Web era is
characterized by…
The “Big Data”
phenomenon
7. How to make sense of all these data?
Data management needs
data quality
9. Data/information heterogeneity in Information Systems
Information is available in different formats and is represented
according to different models:
• Structured data
Place: Portofino; Country: Italy; Population: 7.000; Main economic activity: Tourism
• Image
[Figure: Portofino map]
• Text
Dear Laure, I try to describe the wonderful harbour of Portofino as I have seen
this morning a boat is going in, other boats are along the wharf. Small pretty
buildings and villas are looking on to the harbour.
Need to consider information quality for heterogeneous information sources
10. Tutorial Background - Data Quality (Structured Data)
23rd International Conference on Conceptual Modeling (ER 2004), Shanghai
A Survey of Data Quality Issues in Cooperative Information Systems
Carlo Batini, Università di Milano “Bicocca”, batini@disco.unimib.it
Monica Scannapieco, Università di Roma “La Sapienza”, monscan@dis.uniroma1.it
11. Tutorial Background – Towards Information Quality (Heterogeneous Data)
Tutorial at ER 08, Barcelona, Spain
Quality of Data, Textual Information and Images: a
comparative survey
Speaker: C. Batini
Other authors: F. Cabitza, G. Pasi, R. Schettini
Dipartimento di Informatica, Sistemistica e Comunicazione, Università di
Milano Bicocca, Milano, Italy
batini@disco.unimib.it
12. How to make sense of all these data?
Together with automatic techniques for
information extraction, processing &
integration, we also need automatic techniques for
assessing the quality of information
Information quality for information shared,
consumed and delivered on the Web
Increasing attention to information semantics
13. Of course, the “Semantic Web” perspective
• Make the semantics of information explicit with Web-compliant ontologies* by
– sharing conceptualizations/terminologies on the Web
– sharing data on the Web
• Models, languages & technologies
– E.g. RDF, RDFS, OWL, SKOS
[Figures labelled 1998 and 2006]
* For now, let’s consider a very broad definition:
An ontology is a specification of a conceptualization.
T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199-220, 1993.
14. Ontologies out of the Semantic Web
• Even for those who are skeptical about the Semantic Web,
ontologies (e.g. OWL ontologies, linked data, thesauri)
can be considered useful external resources to use in
– Conceptual modeling
– Data integration
– Document management
– Service Oriented Computing
– Information retrieval
– …
– Software Engineering
– Information System Design
17. Ontologies &
Semantic Resources
• KB - Axiomatic ontologies (e.g. SUMO)
– Terminological (intensional/schema) level:
concepts, relationships, axioms specifying logical
constraints
– Assertional (extensional/data) level: instances,
typing, relations between instances
• LD - Linked data on the Web (e.g. DBpedia)
– RDF data, usually light-weight KBs
• Th - Thesauri (e.g. WordNet)
– Lexical ontologies: terms, no schema vs. instance distinction
• In summary, the ontology landscape includes:
– Shared vocabulary (KB, LD, Th)
– Modeling principles (KB)
– Logical theories supporting reasoning (KB)
– Web-compliant representations of models and data (KB, LD, Th)
18. Need for ontology evaluation
• Ontology “Quality” → Ontology Evaluation
• Quality of ontologies matters!
– In particular, when ontologies:
• are built to support specific applications (their
quality impacts on the application effectiveness)
• are searched on the Web, reused, extended
– Many ontologies to choose from!
– E.g. suppose that you need an ontology
describing customers and the business domain
23. Ontologies and semantic resources should
be considered in comprehensive studies
about information quality in the Web era
Tough work!
Let’s start from the beginning:
ontologies and structured data
24. Structured data and ontologies
• Structured data: Instances / Logical schemas / Conceptual schemas
• Ontologies (KB): Instances / Schema
• Tight vs loose instance-schema coupling
[Figure: conceptual schemas and ontology schemas compared]
- Conceptual level representations
- Externalized models (semiotic objects)
- Constraints on domain (data)
- Diagrammatic models (ER, UML, ORM) vs logical models supporting reasoning
25. Ontologies and their grandparents
• Structured data: Instances / Logical schemas / Conceptual schemas
• Ontologies (KB): Instances / Schema / Terminologies
In this (mini) tutorial we will:
- focus on the modeling level: “Quality of Conceptual Schemas and Ontologies”
- provide a guided tour on the topic by discussing only part of the material (soon available online)
- focus on common aspects and, in particular, differences
[Figure in the background, as on slide 24: conceptual level representations; externalized models (semiotic objects); constraints on domain (data); diagrammatic models (ER, UML, ORM); logical models supporting reasoning]
27. Outline
• Motivation:
– the Web era / information quality meeting
ontologies / the ontology landscape /
• Quality of Conceptual Schemas
– frameworks / metamodels / dimensions /
metrics / groups of schemas
• Quality of ontologies
– frameworks / metamodels / dimensions /
metrics
• Conclusions
28. # of slides
• About 130 30
• I will mainly provide a guided
introduction to the slides
29. In a database, quality can be investigated…
• At model (language) level
• At schema (model) level
• At instance (value/data) level
33. Here we focus on
• Model level
• Schema level
• Data level
34. Quality of Conceptual Schemas - contents
• Frameworks and Metamodels proposed
• Quality of Schemas
– Classifications, Dimensions & Metrics: main
proposals
– Comparison of proposals
– Improving the quality of schemas
• Quality of groups of schemas
– Quality of Data Integration Architectures
– Quality of the documentation for large
related groups of schemas
36. Some figures on proposed approaches in the literature
(from Mehmood 2009, citing Moody 2005)
Frameworks and metamodels
                        Research   Practice   Mixed
# of proposals          29         8          2
% of total              74%        21%        5%
Empirically validated   6          0          1
%                       20%        0%         50%
Generalizable           5          0          0
%                       17%        0%         0%
Not generalizable       24         8          2
%                       83%        100%       100%
Generalizable means that the proposal can be applied to
conceptual models in general and is not specific to, e.g., ER
37. Metaschema of approaches
[Figure: metaschema of approaches]
• Formal framework: concepts and paradigms involved in a formally grounded approach to quality
• Metaschema: concepts and paradigms involved in the life cycle of quality, namely in the production, assessment and improvement activities
• Classification: one-, two- or three-level taxonomies (quality dimension, quality subdimension, metrics)
• Examples
• Experiments
38. Proposals
[Figure: the metaschema of approaches, annotated with the proposals covering each part]
• Formal framework: Krogstie & Solvberg (the Scandinavians)
• Metaschema: Shanks; Arab French; Vassiliadis
• Classification (quality dimension, quality subdimension): the origins – Batini et al.; Scandinavians; Arab French; Moody
• Metrics: Genero et al.; Herden; Poels
• Examples
• Experiments
41. Krogstie and Solvberg framework
[Figure: Krogstie and Solvberg framework. Nodes: goal of modeling, modeling domain, participant knowledge, social actor interpretation, technical actor interpretation, model externalization, language extension. Quality relationships among them: physical, empirical, syntactic, semantic, perceived semantic, social pragmatic, technical pragmatic, social and organizational quality.]
42. Krogstie and Solvberg framework
[Figure: the Krogstie and Solvberg framework diagram, as on slide 41.]
Annotation: correspondence between the conceptual model and the domain
43. Krogstie and Solvberg framework
[Figure: the Krogstie and Solvberg framework diagram, as on slide 41.]
Annotation: correspondence between participant knowledge and individual interpretation
44. Krogstie and Solvberg framework
[Figure: the Krogstie and Solvberg framework diagram, as on slide 41.]
Annotation: correspondence between the conceptual model and the language
45. Krogstie and Solvberg framework
[Figure: the Krogstie and Solvberg framework diagram, as on slide 41.]
Annotation: correspondence between the conceptual model and the audience’s interpretation of it
46. Krogstie and Solvberg framework
[Figure: the Krogstie and Solvberg framework diagram, as on slide 41.]
Annotation: correspondence between participant knowledge and the externalized conceptual model
° Externalization: the knowledge of social actors has been externalized in the model
° Internalizability: the model is persistent
47. Krogstie and Solvberg framework
[Figure: the Krogstie and Solvberg framework diagram, as on slide 41.]
Annotation: it is reflected by the error frequency when a model is read or written, so by readability and clarity
48. Krogstie and Solvberg framework
[Figure: the Krogstie and Solvberg framework diagram, as on slide 41.]
Annotation: agreement on participant knowledge and individual interpretation
49. More formally
• G, the goals of the modeling task.
• L, the language extension, i.e., the set of all statements that are
possible to make according to the graphemes, vocabulary, and
syntax of the modeling languages used.
• D, the domain, i.e., the set of all statements that can be stated
about the situation at hand.
• M, the model (schema) itself.
• Ks, the relevant explicit knowledge of those being involved in
modeling. A subset of these is actively involved in modeling, and
their explicit knowledge is indicated by KM.
• I, the social actor interpretation, i.e., the set of all statements that
the audience thinks that an externalized model consists of.
• T, the technical actor interpretation, i.e., the statements in the
model as 'interpreted' by modeling tools.
50. Main quality types
• Physical quality: The basic quality goal is that the model M is
available for the audience.
• Empirical quality deals with predictable error frequencies when a
model is read or written by different users, coding (e.g. shapes of
boxes) and HCI-ergonomics for documentation and modeling-tools.
For instance, graph layout to avoid crossing lines in a model is a
means to address the empirical quality of a model.
• Syntactic quality is the correspondence between the model M
and the language extension L.
• Semantic quality is the correspondence between the model M
and the domain D. This includes validity and completeness.
• Perceived semantic quality is the similar correspondence
between the audience interpretation I of a model M and his or her
current knowledge K of the domain D.
• Pragmatic quality is the correspondence between the model M
and the audience's interpretation and application of it (I).
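A minimal sketch of how the set-based definitions above can be read, assuming (our simplification, not part of the framework's text) that statements are plain strings and that "correspondence" is checked by set difference. The sample statements and the function names are hypothetical.

```python
# Hypothetical sample statements; D, L and M follow the symbols used above.
D = {"Student has Code", "Student has Name", "Student attends Course"}  # domain
L = D | {"Student has Age", "Course has Title"}                         # language extension
M = {"Student has Code", "Student has Age", "Student is-a Banana"}      # model


def syntactic_errors(model, language):
    """Syntactic quality: model statements the language cannot express (M minus L)."""
    return model - language


def invalid_statements(model, domain):
    """Validity: model statements that are not part of the domain (M minus D)."""
    return model - domain


def missing_statements(model, domain):
    """Completeness: domain statements not yet stated in the model (D minus M)."""
    return domain - model


print(sorted(syntactic_errors(M, L)))
print(sorted(invalid_statements(M, D)))
print(sorted(missing_statements(M, D)))
```

An empty result from all three functions would correspond to a model that is syntactically correct, valid and complete under this simplified reading.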
52. Framework for language (model) quality
[Figure: framework for language quality. Nodes: goal of modeling, modeling domain, participant knowledge, social actor interpretation, technical actor interpretation, model externalization, language extension. Relationships of the language extension to the other nodes: domain, participant, modeler, comprehensibility, tool and organizational appropriateness.]
53. Main quality types
Domain appropriateness. This relates the language and the domain.
Ideally, the conceptual basis must be powerful enough to express
anything in the domain, not having what is termed construct deficit. On
the other hand, you should not be able to express things that are
not in the domain, i.e. what is termed construct excess. Domain
appropriateness is primarily a means to achieve semantic quality.
Participant appropriateness relates the social actors’ explicit
knowledge to the language. Participant appropriateness is primarily a
means to achieve pragmatic quality for comprehension, learning
and action.
Modeler appropriateness: This area relates the language
extension to the participant knowledge. The goal is that there are
no statements in the explicit knowledge of the modeler that cannot
be expressed in the language. Modeler appropriateness is primarily a
means to achieve semantic quality.
54. Main quality types
Comprehensibility appropriateness relates the language to the
social actor interpretation. The goal is that the participants in the modeling
effort using the language understand all the possible statements of the
language. Comprehensibility appropriateness is primarily a means to achieve
empirical and pragmatic quality.
Tool appropriateness relates the language to the technical audience
interpretations. For tool interpretation, it is especially important that the
language lend itself to automatic reasoning. This requires formality (i.e.
both formal syntax and semantics being operational and/or logical), but
formality is not necessarily enough, since the reasoning must also be
efficient to be of practical use. This is covered by what we term
analyzability (to exploit any mathematical semantics) and executability (to
exploit any operational semantics). Different aspects of tool
appropriateness are means to achieve syntactic, semantic and pragmatic
quality (through formal syntax, mathematical semantics, and operational
semantics).
Organizational appropriateness relates the language to standards
and other organizational needs within the organizational context of
modeling. These are means to support organizational quality.
56. Shanks et al. composite model
[Figure: Shanks et al. composite model. Theory-based part: domain, language, model, audience, quality type, goal, means, property, activity. Practice-based part: quality factor, weighting, rating, evaluation method.]
57. Metamodels – Arab/French
Mehmood, Cherfi et al. 2009, based on goals, questions, metrics
[Figure: metamodel with the concepts quality goal, quality dimension, quality attribute, quality metric, model element, transformation step, transformation rule.]
58. Metamodel instantiation
• Quality goal: Ease of change
• Dimensions: Complexity; Maintainability
• Quality attributes: Simplicity; Structural complexity; Modularity; Understandability; Modifiability
• Quality metrics: # of associations; # of dependencies
• Transformations: Merge entities; Divide the model
59. Metamodels – Vassiliadis et al. For DWs
[Figure: metamodel with the concepts quality goal, quality dimension, factor, quality metric, improvement process, interaction, transformation, measurement method, measurement value, date, information system object (data object, process object, model object).]
60. Comparison
[Figure: the Vassiliadis et al. metamodel (slide 59) shown next to the Mehmood et al. metamodel (slide 57).]
62. The origins…
Batini, Ceri, Navathe 1991
[Figure: the metaschema of approaches from slide 37 (formal framework, metaschema, classification, quality dimension, quality subdimension, metrics, examples, experiments), used to position this proposal.]
63. Batini, Ceri, Navathe 1991
Q. Dimension: Definition
Completeness / Pertinence: Represents all (only) relevant features of requirements
Correctness - Syntactic: Concepts are properly defined in the schema
Correctness - Semantic: Concepts are used according to their definitions
Minimality: Every aspect of reqs. appears only once in the schema
Expressiveness: Can be easily understood
Readability: Diagram respects aesthetic criteria
Self-explanation: Other formalisms and languages not needed
Extensibility: Easily adapted to changing requirements
Normality: From theory of normalization
64. Completeness
Completeness measures the extent to which a conceptual
schema includes all the conceptual elements necessary to
meet some specified requirements.
It is possible that the designer has not included certain characteristics
present in the requirements in the schema, e.g., attributes related to
an entity Person; in this case, the schema is incomplete.
Example – Requirements: “Students have a code, a name, a place of birth.”
Schema: entity Student with attributes Code and Name.
65. Pertinence
Pertinence measures how many unnecessary conceptual
elements are included in the conceptual schema. In the case
of a schema that is not pertinent, the designer has
gone too far in modeling the requirements, and has included
too many concepts.
Example – Requirements: “Students have a code and a name.”
Schema: entity Student with attributes Code, Name and Place_of_Birth.
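The Student examples used for completeness and pertinence can be checked mechanically. The sketch below assumes that requirements and schema elements are reduced to plain attribute names, and it uses a simple ratio for completeness; both the ratio and the function names are our illustration, not a metric defined in the slides.

```python
def completeness_ratio(required, modeled):
    """Fraction of the required elements that appear in the schema (assumed metric)."""
    return len(required & modeled) / len(required)


def unnecessary_elements(required, modeled):
    """Pertinence check: schema elements that no requirement asks for."""
    return modeled - required


# Completeness example: the requirements ask for three attributes, the schema has two.
reqs_a = {"code", "name", "place_of_birth"}
schema_a = {"code", "name"}

# Pertinence example: the requirements ask for two attributes, the schema has three.
reqs_b = {"code", "name"}
schema_b = {"code", "name", "place_of_birth"}

print(completeness_ratio(reqs_a, schema_a))    # below 1: the schema is incomplete
print(unnecessary_elements(reqs_b, schema_b))  # non-empty: the schema is not pertinent
```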
66. Correctness - syntactic
Concerns the correct use of the
categories of the model in representing
requirements.
Example – In the Entity Relationship
model we may represent the
logical link between persons and their
first names using the two entities Person
and FirstName and a relationship between
them. The schema is not correct wrt the
model since an entity should be used only
when the concept has a unique existence
in the real world and has an identifier.
[Diagram: entity Student (1,n), relationship has, (1,1) entity First Name]
67. Correctness - semantic
Correctness with respect to requirements
concerns the correct representation of
the requirements in terms of the model
categories.
Example - In an organization each
department is headed by exactly one
manager and each manager may head
exactly one department.
If we represent Manager and Department
as entities, the relationship between them
should be one-to-one; in this case, the
schema is correct wrt requirements. If we
use a one-to-many relationship, the
schema is incorrect.
[Diagram: entity Manager (1,n), relationship heads, (1,1) entity Department]
68. Minimality/Redundancy
A schema is minimal if every
part of the requirements is
represented only once in the
schema. In other words, it is
not possible to eliminate some
element from the schema
without compromising the
information content.
[Diagram: entities Student, Course and Instructor; relationships Attends, Teaches and Assigned to; cardinalities 1,n, with one shown as 1,?]
69. Expressiveness/Readability
Intuitively, a schema is readable whenever it represents
the meaning of the reality represented by the schema in a
clear way for its intended use. This simple, qualitative
definition is not easy to translate in a more formal way,
since the evaluation expressed by the word clearly
conveys some elements of subjectivity. In models, such as
the Entity Relationship model, that provide a graphical
representation of the schema, called diagram, readability concerns
both the diagram and the schema itself.
70. Diagrammatic readability
With regard to the diagrammatic representation,
readability can be expressed objectively by a
number of aesthetic criteria that human beings adopt in
drawing diagrams:
1. crossings between lines should be minimized,
2. graphic symbols should be embedded in a grid,
3. lines should be made of horizontal or vertical segments,
4. the number of bends in lines should be minimized,
5. the total area of the diagram should be minimized,
6. parents in generalization hierarchies should be positioned at a
higher level in the diagram with respect to children, and, finally,
7. the children entities in the generalization hierarchy should be
symmetrical with respect to the parent entity.
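Criterion 1 can be measured automatically. The following is a minimal sketch, assuming that every connection is drawn as one straight segment between node positions; the coordinates and the pairing of relationships with entities (Works, Born, Located) are hypothetical and only illustrate the counting.

```python
from itertools import combinations


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, 0 or -1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def segments_cross(a, b, c, d):
    """True if segment ab properly crosses segment cd.

    Touching endpoints and collinear overlaps are not counted.
    """
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def count_crossings(positions, edges):
    """Number of pairs of edges that cross in a straight-line drawing."""
    count = 0
    for (u1, v1), (u2, v2) in combinations(edges, 2):
        if {u1, v1} & {u2, v2}:
            continue  # edges sharing a node only meet at that node
        if segments_cross(positions[u1], positions[v1],
                          positions[u2], positions[v2]):
            count += 1
    return count


# Hypothetical connections: Works, Born, Located.
edges = [("Employee", "Department"), ("Employee", "City"), ("Department", "Floor")]
cluttered = {"Employee": (0, 0), "City": (2, 2), "Department": (0, 2), "Floor": (2, 0)}
tidy = {"Employee": (0, 0), "City": (2, 0), "Department": (0, 2), "Floor": (2, 2)}

print(count_crossings(cluttered, edges), count_crossings(tidy, edges))
```

The other criteria (bends, area, grid embedding) would need a richer layout description than node coordinates alone.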
71. Unreadable schema
[Figure: ER diagram drawn with a cluttered layout. Labels: Employee, Vendor, Worker, Engineer, Department, Floor, City, Warehouse, Item, Order, Type, Warranty, Works, Manages, Head, Located, Born, In, Of, Produces, Acquires, Purchase.]
@C.Batini, F. Cabitza, G. Pasi, R. Schettini, 2008
72. A Readable schema
[Figure: ER diagram with the same labels as the unreadable schema, redrawn with a tidy layout.]
73. Is diagrammatic readability objective?
[Figure: the unreadable schema, annotated with candidate criteria and shown next to redrawn versions.]
• SYNT: Minimize crossings
• SYNT: Minimize bends
• SYNT: Use only horizontal
• SEM: Place close entities in generalizations
• SEM: Place most important concept in the middle
• Don’t change at all!
@C.Batini, 2009
74. But ……personal experience in China,
Beda University, about 1985
Question to Chinese professors:
Which one of the two diagrams do you like more?
[Figure: the cluttered diagram (left) and the tidy diagram (right).]
Answer: definitely the left one,
we like asymmetry and movement …
75. Expressiveness
The second issue addressed by readability is the
compactness of schema representation. Among the
different conceptual schemas that equivalently represent a
certain reality, we prefer the one or the ones that are
more compact, because compactness favors readability.
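76. Transformation that preserves information content
and enhances compactness/expressiveness
[Figure: two schemas with Employee and its subentities Vendor, Worker and Engineer. In one, a Born relationship with City is repeated for each subentity; in the other, a single Born relationship links Employee and City.]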
77. Normalization
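Unnormalized ER schema
• Entity Employee-Project with attributes Employee #, Salary, Project #, Budget, Role
Normalized ER schema
• Entity Employee with attributes Employee #, Salary
• Entity Project with attributes Project #, Budget
• Relationship Assigned to between Employee (1,n) and Project (1,n), with attribute Role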
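A small sketch of why the normalized schema loses nothing. It assumes the functional dependencies implied by the normalized diagram (Employee # determines Salary, Project # determines Budget, the pair determines Role); the sample rows and the helper `fd_holds` are hypothetical.

```python
# Hypothetical rows of the unnormalized Employee-Project entity:
# (employee_no, salary, project_no, budget, role)
rows = {
    (1, 3000, "P1", 50000, "analyst"),
    (1, 3000, "P2", 80000, "designer"),
    (2, 3500, "P1", 50000, "manager"),
}


def fd_holds(rows, lhs, rhs):
    """True if the columns at positions lhs functionally determine those at rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        value = tuple(row[i] for i in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Decomposition mirroring the normalized schema.
employee = {(e, s) for e, s, p, b, r in rows}
project = {(p, b) for e, s, p, b, r in rows}
assigned_to = {(e, p, r) for e, s, p, b, r in rows}

# Joining the three relations back reproduces the original rows.
salary_of, budget_of = dict(employee), dict(project)
rejoined = {(e, salary_of[e], p, budget_of[p], r) for e, p, r in assigned_to}

print(fd_holds(rows, [0], [1]), fd_holds(rows, [2], [3]), rejoined == rows)
```

In the unnormalized form the salary of employee 1 and the budget of project P1 are each stored twice; after the decomposition each fact is stored once.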
80. Main model (schema) quality dimensions
Physical quality
• Externalization: number of statements on the domain not yet stated in the model / total # of statements
• Internalizability
– Persistence: protection against loss or damage
– Availability: usual meaning
Empirical quality: deals with readability by the audience
• Expressed in terms of graph aesthetics and graph layout criteria
Syntactic quality: correspondence between the model (schema) and the language (model), where errors are due to
• Syntactic invalidity: words or graphemes not part of the language are used
• Syntactic incompleteness: the model lacks constructs needed to obey the language’s grammar (e.g. it uses only one
cardinality to express minimum and maximum cardinalities)
Semantic quality
• (Feasible) validity: the statements in the model are correct and relevant for the problem
• (Feasible) completeness: the model contains all the statements which would be correct and relevant
Perceived semantic quality: the correspondence between the actor’s interpretation of the model and her current
knowledge of the domain
• Validity
• Completeness
Pragmatic quality: the correspondence between the model and the audience’s interpretation of it
• (Feasible) comprehension: the actors understand the model, or else individual actors understand the part of the
model relevant to them
Social quality
• Agreement in knowledge
• Agreement in model interpretation
Knowledge quality: perfect when the audience knows everything about the domain at a given time
• Validity
• Completeness
81. Language quality dimensions - 1
May refer
a. to the language itself, or else
b. to the relationship between the language and other issues.
In the first case it may refer to:
– the constructs of the language
– the external visual representation
For both:
• Perceptibility: how easy language comprehension is for people
• Expressive power: what it is possible to express in the language
• Expressive economy: how effectively things can be expressed in the language
• Method/tool potential: how easily the language lends itself to proper method or tool support
• Reducibility: what features the language provides to deal with large and complex models
82. Language quality dimensions - 2
Referring to the relationship between the language and other issues:
• Domain appropriateness: there are no statements in the
domain that cannot be expressed in the language
• Participant knowledge appropriateness: statements in the language models
are part of the explicit knowledge of participants
• Knowledge externalizability appropriateness: there are no
statements in the explicit knowledge of the participants that
cannot be expressed in the language
• Comprehensibility appropriateness
• Technical actor interpretation appropriateness
83. More on pragmatic quality
• Social pragmatic quality (to what extent
people understand and are able to use the
models) and technical pragmatic quality (to
what extent tools can be made that interpret
the models).
84. Arab French (2002-
[Figure: the metaschema of approaches from slide 37 (formal framework, metaschema, classification, quality dimension, quality subdimension, metrics, examples, experiments), used to position this proposal.]
86. Definitions – 1
Q. Dimension: Definition
Clarity: An aesthetic criterion, based on the graphical arrangement
Minimality: Every aspect of the requirements appears only once
Min - Non redundancy: No concept can be canceled without decreasing the information content
Min - Factorization degree: Measures the effectiveness of inheritance hierarchies of the schema
Min - Aggregation degree: Measures the efficient use of aggregate attributes in the schema
Expressiveness: The schema can be easily understood without additional explanation
Exp - Concept and schema expressiveness: Compactness
Simplicity: The schema contains the minimum possible constructs
Correctness (syntactical): Concepts are properly defined in the schema
Understandability (model): The ease with which the data model can be interpreted by the user
Understandability (schema): How much modeling features are made explicit
Und - Documentation degree: Presence of additional documentation for concepts
Und - User vocabulary rate: Users are able to make easy correspondences between schema and requirements
Und - Concept independence degree: “Short paths” for semantic interconnections (e.g. A ISA B)
87. Definitions – 2
Q. Dimension: Definition
Completeness: The schema represents all relevant features in the requirements
Comp - Requirements coverage: Correspondence between concepts in the schema and relevant terms in the requirements
Comp - Cross modeling completeness: Presence in a schema S1 of all concepts in schemas in a set
Implementability: Amount of effort to implement the schema
Imp - Implementability: Overall semantic distance between concepts in the source model and concepts in the target model
Maintainability: Ease with which the schema can evolve
Man - Modifiability: # of modifications related to a concept modification deriving from dependencies
Man - Cohesion: Existence of clusters with a high # of internal links compared with external links
Man - Coupling: Existence of clusters with a low # of links between clusters
88. Cherfi et al. classification – metrics (examples)
Specification
Legibility
– Clarity: # of concepts – number of crossings in the diagram
– Minimality
• Non redundancy: (# weighted concepts - # redundant concepts) / total # weighted concepts
• Factorization degree
• Aggregation degree
Expressiveness
– Concept expressiveness
– Schema expressiveness
Simplicity
Correctness
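The non-redundancy formula can be computed as below. This is a sketch under our reading of the abbreviated formula: every concept carries a weight, and the weights of redundant concepts are subtracted from the total. The concept names, the unit weights and the choice of which concept is redundant are hypothetical.

```python
def non_redundancy(weights, redundant):
    """(weighted concepts - weighted redundant concepts) / total weighted concepts."""
    total = sum(weights.values())
    redundant_weight = sum(weights[c] for c in redundant)
    return (total - redundant_weight) / total


# Hypothetical schema: six concepts with unit weight, one of them assumed redundant.
weights = {
    "Student": 1, "Course": 1, "Instructor": 1,
    "Attends": 1, "Teaches": 1, "Assigned to": 1,
}
print(non_redundancy(weights, {"Assigned to"}))
```

A value of 1 means that no concept is redundant; the closer the value is to 0, the larger the share of the schema that could be removed without losing information.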
89. Metrics for structural complexity
• # of associations
• # of dependencies
• # of aggregations
• Depth of inheritance tree: longest path from
the root of a hierarchy to the leaves
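As an illustration of the last metric, the depth of an inheritance tree can be computed recursively. The sketch assumes a hierarchy given as a parent-to-children dictionary; Employee with Vendor, Worker and Engineer echoes the generalization used earlier, while "Senior Engineer" is a hypothetical extra level added to make the example less trivial.

```python
def depth_of_inheritance_tree(children, root):
    """Number of edges on the longest path from root down to a leaf."""
    kids = children.get(root, [])
    if not kids:
        return 0
    return 1 + max(depth_of_inheritance_tree(children, kid) for kid in kids)


hierarchy = {
    "Employee": ["Vendor", "Worker", "Engineer"],
    "Engineer": ["Senior Engineer"],  # hypothetical subentity
}
print(depth_of_inheritance_tree(hierarchy, "Employee"))
```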
90. Moody 1998 -
[Figure: the metaschema of approaches from slide 37 (method for, formal framework, metaschema, classification, quality dimension, quality subdimension, metrics, examples, experiments), used to position this proposal.]
91. Moody’s classification
• Completeness
• Integrity
• Flexibility
• Understandability
• Correctness
• Simplicity
• Implementability
• Integration: quality of related groups of
schemas (see later)
92. Moody’s classific. of Quality dim. and metrics - 1
Dimension: Definition
Completeness: The schema contains all the information required to meet the requirements
Completeness M1: # of items that do not correspond to requirements
Completeness M2: # of requirements not represented in the schema
Completeness M3: # of items that inaccurately represent requirements
Completeness M4: # of inconsistencies in the schema
Integrity: Extent to which the business rules on data are enforced by the schema
Integrity M1: # of business rules not enforced by the schema
Integrity M2: # of integrity constraints in the schema that do not accurately represent business rules
Flexibility: The ease with which the schema can cope with business change
Flexibility M1: # of elements in the schema which are subject to change in the future
Flexibility M2: Estimated cost of changes
Flexibility M3: Strategic importance of change
93. Moody’s classific. of Quality dim. and metrics - 2
Dimension: Definition
Understandability: Ease with which the schema can be understood
Understandability M1: User rating
Understandability M2: Ability of users to interpret the model correctly
Understandability M3: Application developer rating
Correctness: The schema conforms to the rules of the conceptual model
Correctness M1: # of violations of model conventions
Correctness M2: Intra-entity redundancy: number of normal form violations
Correctness M3.a: Inter-entity redundancy: # of redundant concepts in the schema
94. Moody’s classific. of Quality dim. and metrics - 3
Dimension: Definition
Simplicity: The schema contains the minimum possible constructs
Simplicity M1: # of entities
Simplicity M2: # of entities + relationships
Simplicity M3: # of entities + relationships + attributes
Implementability: Ease with which the schema can be implemented within time, budget, technology constraints
Implement M1: Technical risk rating
Implement M2: Schedule risk rating
Implement M3: Development cost estimate
95. Moody’s monumental contribution to empirical quality/
quality of diagrammatic notations (TSE 2009)
Semiotic clarity – there should be a 1:1 correspondence between
semantic constructs and graphical symbols
Symbol redundancy
Symbol overload
Symbol excess
Symbol deficit
Perceptual discriminability: different symbols should be clearly
distinguishable from each other
Visual distance
Discriminability threshold
Semantic transparency: use visual representations whose appearance
suggests their meaning; symbols can be
Semantically immediate
Semantically opaque
Semantically perverse
Semantically translucent
96. Moody’s monumental contribution to empirical quality/
quality of diagrammatic notations (TSE 2009)
Complexity management: include explicit mechanisms for
dealing with complexity
Modularization
Abstraction
Cognitive integration: include explicit mechanisms to
support integration of information for different
diagrams
Conceptual integration
Contextualization
Perceptual integration
Wayfinding
97. Moody’s monumental contribution to empirical quality/
quality of diagrammatic notations (TSE 2009)
Visual expressiveness: use the full range and capacities of
visual variables
Degree of visual freedom
Saturation
Dual coding: use text to complement graphics
Graphic economy: the number of different graphical
symbols should be cognitively manageable
Symbol deficit
Cognitive fit: use different visual dialects for different
tasks and audiences
Visual mono/plurilinguism
98. Moody’s monumental contribution to empirical quality/
quality of diagrammatic notations (TSE 2009)
Interactions among principles
Semiotic Clarity can affect Graphic Economy either positively or
negatively: Symbol excess and symbol redundancy increase graphic
complexity, while symbol overload and symbol deficit reduce it.
Perceptual Discriminability increases Visual Expressiveness as it
involves using more visual variables and a wider range of values (a
side effect of increasing visual distance); similarly, Visual
Expressiveness is one of the primary ways of improving Perceptual
Discriminability.
Increasing Visual Expressiveness reduces the effects of graphic
complexity, while Graphic Economy defines limits on Visual
Expressiveness (how much information can be effectively encoded
graphically).
Increasing the number of symbols (Graphic Economy) makes it more
difficult to discriminate between them (Perceptual Discriminability).
Perceptual Discriminability, Complexity Management, Semantic
Transparency, Graphic Economy, and Dual Coding improve
effectiveness for novices, though Semantic Transparency can
reduce effectiveness for experts (Cognitive Fit).
Semantic Transparency and Visual Expressiveness can make hand
drawing more difficult (Cognitive Fit)
100. Genero et al. 2005
[Figure: aspects considered for the approach: Formal framework, Metaschema, Classification, Quality dimension, Quality subdimension, Metrics, Examples, Experiments]
101. Genero et al classification
Maintainability is influenced by the following subcharacteristics:
• Understandability: the ease with which the conceptual data model can be
understood.
• Legibility: the ease with which the conceptual data model can be read,
with respect to certain aesthetic criteria [13].
• Simplicity: means that the conceptual data model contains the minimum
number of constructions possible.
• Analysability: the capability of the conceptual data model to be diagnosed
for deficiencies or for parts to be modified.
• Modifiability: the capability of the conceptual data model to enable a
specified modification to be implemented.
• Stability: the capability of the conceptual data model to avoid unexpected
effects from modifications.
• Testability: the capability of the conceptual data model to enable
modifications to be validated
102. Herden
[Figure: aspects considered for the approach: Formal framework, Metaschema, Classification, Metadata, Quality dimension, Quality subdimension, Metrics, Examples, Experiments]
103. Herden classification
• Correctness
• Consistency
• Scope
• Level of detail
• Completeness
• Minimality
• Ability of integration (see later)
• Readability
104. Herden
Dimension Definition
(Technical) Correctness Correctness of concepts w.r.t reqs.
(Technical) Consistency Absence of contradiction
Scope Comprehensive w.r.t. general user acceptance
Level of detail Adequacy in detail w.r.t. user acceptance
Completeness Completeness w.r.t. requirements
Minimality Compactness and absence of redundancies
Readability Completeness of documentation
105. Metadata in Herden’s classification
• Description
• Relevance
• Measuring
• Metric
• Degree of automation
• Objectivity
106. Poels et al
[Figure: aspects considered for the approach: Formal framework, Metaschema, Classification, Quality dimension, Quality subdimension, Metrics, Examples, Experiments]
107. Poels et al
Interested in
• Perceived semantic quality
• Perceived pragmatic quality
To understand their relationship with
1. Perceived ease of use (efficiency)
2. Perceived usefulness (effectiveness)
and
3. User information satisfaction
108. Poels et al. classification
Quality#  Quality dimension  Definition
PSQ1  The schema represents the business process correctly
PSQ2  The schema is a realistic representation of the business process
PSQ3  The schema contains contradicting elements
PSQ4  The schema contains redundant elements
PSQ5  Elements must be added to faithfully represent the business process
PSQ6  All the elements in the conceptual schema are relevant for the representation of the business process
PSQ7  The schema gives a complete representation of the business process
@C.Batini, F. Cabitza, G. Pasi, R. Schettini, 2008
109. Poels et al. classification
Quality#  Quality dimension  Definition
PSQ1  Correctness/Validity  The schema represents the business process correctly
PSQ2  Feasible correctness/validity  The schema is a realistic representation of the business process
PSQ3  Coherence  The schema contains contradicting elements
PSQ4  Non redundancy  The schema contains redundant elements
PSQ5  ???  Elements must be added to faithfully represent the business process
PSQ6  Relevance  All the elements in the conceptual schema are relevant for the representation of the business process
PSQ7  Completeness  The schema gives a complete representation of the business process
110. Poels et al general findings
[Figure: path model relating Perceived semantic quality, Perceived usefulness, Perceived ease of use and User information satisfaction; path coefficients shown: 0,1; 0,38; 0,58; 0,35; 0,29]
112. Physical and empirical quality
Author(s)/Types of qualities: Batini et al. 91, Scand. 94-, Moody 98-, Arab French 02-, Genero et al. 2005, Herden, Poels
Physical quality
Externalization x
Persistence x
Availability x
Empirical quality
Minimality x x x
Readability/legibility x x x x x
Expressiveness x x x
Simplicity/self-explanation x x x x
Graph aesthetics/readability/Clarity x x x
Understandability X-3 x x
113. Syntactic and semantic quality
Author(s)/Types of qualities: Batini et al. 91, Scand. 94-, Moody 98-, Arab French 02-, Genero et al. 2005, Herden, Poels
Syntactic quality
Invalidity x x x x
Incompleteness x
Semantic quality
Validity/Correctness x x X-1 x x
Feasible validity x x
Normality x
Integrity X-2 x x
Completeness x x
X-4 x x
Level of detail x
Scope x
Relevance/Pertinence x x x
Perceived semantic quality x
Analyzability x
Testability x
114. Pragmatic, knowledge and process quality
Author(s)/Types of qualities: Batini et al. 91-, Scand. 94-, Moody 98-, Arab French 02-, Genero et al. 05, Herden, Poels
Pragmatic quality
Comprehension x
Social quality x
Agreement in knowledge x
Agreement in model interpret.
Knowledge quality
Completeness x
Validity x
Process quality
Implementability x
Stability x
Maintainability/Flexibility/Extensibility x X-3 x
116. Sheldon classification for Inheritance hierarchies
Viewpoints.
• (1) The deeper a class is in the hierarchy, the higher the
degree of methods inheritance, making it more complex
to predict its behavior.
• (2) Deeper trees constitute greater design complexity,
since more methods and classes are involved.
• (3) The deeper a particular class is in the hierarchy, the
greater the potential reuse of inherited methods.
117. Sheldon classification for Inheritance hierarchies
• Maintainability
• Understandability
• Modifiability
119. When a schema is defined, quality can be achieved working both on
the schema and on the instance
(a)
Person
ID Name Surname Address
1 John Smith 113 Sunset Avenue 60601 Chicago
2 Mark Bauer 113 Sunset Avenue 60601 Chicago
3 Ann Swenson 4 Heroes Street Denver
(b)
Person
ID Name Surname
1 John Smith
2 Mark Bauer
3 Ann Swenson
Address
ID StreetPrefix StreetName Number City
A11 Avenue Sunset 113 Chicago
A12 Street 4 Heroes null Denver
ResidenceAddress
PersonID AddressID
1 A11
2 A11
3 A12
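The move from schema (a) to schema (b) can be sketched as a small transformation that factors repeated address strings out of Person. This is only an illustration under stated assumptions: the function name `split_addresses` and the ID numbering starting at A11 are chosen to mirror the slide, and the sketch keeps each address as one string rather than parsing it into StreetPrefix, StreetName, Number and City.

```python
def split_addresses(persons):
    """Split (ID, Name, Surname, Address) rows into three tables.

    Returns (person_table, address_table, residence_table), where each
    distinct address string is stored once and linked through its ID.
    """
    person_table = []
    address_ids = {}   # address string -> generated address ID
    residence = []     # (PersonID, AddressID) pairs
    for pid, name, surname, address in persons:
        person_table.append((pid, name, surname))
        if address not in address_ids:
            # Hypothetical ID scheme: A11, A12, ... in order of first appearance.
            address_ids[address] = f"A{11 + len(address_ids)}"
        residence.append((pid, address_ids[address]))
    address_table = [(aid, text) for text, aid in address_ids.items()]
    return person_table, address_table, residence


persons_a = [
    (1, "John", "Smith", "113 Sunset Avenue 60601 Chicago"),
    (2, "Mark", "Bauer", "113 Sunset Avenue 60601 Chicago"),
    (3, "Ann", "Swenson", "4 Heroes Street Denver"),
]
person_b, address_b, residence_b = split_addresses(persons_a)
print(address_b)
print(residence_b)
```

After the split, the shared address of persons 1 and 2 is stored once, so a correction to it is made in a single place: a schema-level change that also improves quality at the instance level.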
120. Experimentally investigated
by Arab French
Quality at schema level
Impact
Quality at data level Interdependencies
122. Methods
• Origins: achieving normal form
Decomposition techniques
• Scandinavian: derived from the framework
• Through schema transformations
123. Derived from framework
Syntactic quality
• Error prevention through syntax-directed editors
• Error detection through syntactic checks
124. Derived from framework
Semantic quality
• Consistency checking
– Based on a logical description
– Based on constructivity, namely through
properties of the generation process
(Langefors et al.)
– Use of driving questions to improve
completeness
125. Derived from framework
Pragmatic quality
• Audience training
• Inspection and walkthroughs
• Transformations (see also later)
– Rephrasing
– Filtering
• Translation
– Explanation generation
– Model execution
• Documentation
• Prototyping
126. Derived from framework
Social quality
• Integration
– Intra project
– Inter project
– Inter organizational
• Integration process
– Pre-integration
– Viewpoint comparison
– Viewpoint conforming
– Merging and restructuring
128. The Assenova-Johannesson approach
Dimensions considered
Dimension Definition
Explicitness Requirements are represented at the schema level, not at instance level
Size # of entities + relationships + attributes
Rule simplicity A high # of business rules are represented by simple types of constraints
Rule uniformity Cardinality constraints are uniform
Query simplicity Simple retrieval requirements correspond to simple queries on the schema
Stability Small changes in requirements result in small changes in the schema
129. Dimensions and transformations
Columns: Explicitness, Size, Rule simplicity, Rule uniformity, Query simplicity, Stability
Partial attributes - + +
Non surjective attributes - + +
Partial attr. which are total in Union + + - +
Non-surj. attributes surjective in Union + + - +
M-N attributes - + - +
Lexical attributes - + - +
Attributes with fixed ranges + - = +
Two non disjoint entities + - + +
Non unary “overloaded” attributes +/- +
130. Example transformation
Partial attribute
- The size of the schema increases (-)
- Introducing the entity EMPLOYEE results in
- increased rule uniformity (+) (all attributes are total)
- increased stability (+)
131. Example of increased stability
Introducing different categories of employees can be done
in the new schema without violating rule simplicity
The same cannot be done in the old schema
Old schema New schema
133. The approach of Akoka et al. (2007)
General statement: In DI Architectures quality of data
and quality of schemas have to be considered together
• Qualities at schema level
– Completeness
– Understandability
– Minimality
– Expressiveness
• Qualities at data level
– Completeness
• Coverage
• Density
– Uniqueness
– Consistency
– Freshness
• Currency
• Timeliness
– Accuracy
• Semantic
• Syntactic
• Precision
134. The approach of De Conseicao et al (2007)
Relevant qualities to be evaluated in DI arch.
Given a DI Architecture defined in terms of
• [Data, Local Schemas, Global Schema, Data sources]
DI Element IQ Criteria
Data Sources Reputation; Verifiability; Availability; Response Time
Schema Schema completeness; Minimality; Type Consistency
Data Data Completeness; Timeliness; Accuracy
135. The approach of De Conseicao et al (2007)
Relevant qualities to be evaluated in DI arch.
Given a DI Architecture defined in terms of
• [Data, Local Schemas, Global Schema, Data sources]
Quality dimension: Completeness
– Definition: Coverage of global schema concepts wrt the application domain
– Refers to: Global schema
– Metrics: 1 – (# of incomplete items / # total items)
Quality dimension: Minimality
– Definition: Extent to which the schema is compactly modeled and without redundancy
– Refers to: Global schema
– Metrics: 1 – (# redundant schema elements / # total items)
– Detailed in terms of: Attrib. in an entity; Attrib. in diff. Ent.; Ent. Redundancy degree; Redundant Relationship; Entity Redund. of a Schema; Relationsh. Red. of a Schema; Schema Minimality
Quality dimension: Type Consistency
– Definition: Data Type uniformity across the schemas
– Refers to: Global schema + Local schemas
– Metrics: 1 – (# of inconsistent schema elements / # total schema elements)
– Detailed in terms of: Data type consistency; Attribute type consistency; Schema data type consistency
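All three metrics on this slide share the form 1 – (# defective elements / # total elements). A minimal sketch of that calculation follows; the helper name `ratio_metric` and all the counts are made-up assumptions for illustration, not figures from the cited work.

```python
def ratio_metric(n_defective, n_total):
    """Score of the form 1 - (defective / total), in the range [0, 1]."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_defective <= n_total:
        raise ValueError("n_defective must lie between 0 and n_total")
    return 1 - n_defective / n_total


# Hypothetical counts for a global schema with 20 items, and for
# 50 schema elements across the global and local schemas.
completeness = ratio_metric(n_defective=2, n_total=20)      # incomplete items
minimality = ratio_metric(n_defective=3, n_total=20)        # redundant elements
type_consistency = ratio_metric(n_defective=5, n_total=50)  # inconsistent elements

print(completeness, minimality, type_consistency)
```

A score of 1 means no defective element was found; how an element is judged incomplete, redundant or inconsistent is the hard part and is not covered by this sketch.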
136. The H. Dai et al. approach (2006)
• Focus on Column Heterogeneity
Example columns:
(a) Only e-mail addresses
(b) E-mail addresses, phone numbers and socsec numbers
(c) Many e-mail addresses and few phone numbers
(d) Phone numbers and socsec numbers
b is more heterogeneous than a
b is more heterogeneous than c
b is more heterogeneous than d
137. The H. Dai et al. approach (2006)
Focus on Column Heterogeneity
Heterogeneity dimensions
– Number of semantic types resulting in
different clusters
– Cluster entropy
– Probabilistic soft clustering
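To make the entropy dimension concrete, the sketch below assigns each value of a column to one semantic type and computes the Shannon entropy of the resulting type distribution. The regular-expression classifier `semantic_type` is a deliberately crude, hypothetical stand-in for clustering; it performs hard assignment and does not attempt the probabilistic soft clustering listed on the slide.

```python
import math
import re
from collections import Counter


def semantic_type(value):
    """Hypothetical hard classifier: one semantic type per value."""
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    if re.fullmatch(r"\d{3}-\d{2}-\d{4}", value):
        return "socsec"
    if re.fullmatch(r"\d{3}-\d{3}-\d{4}", value):
        return "phone"
    return "other"


def type_entropy(column):
    """Shannon entropy (in bits) of the semantic types found in a column."""
    counts = Counter(semantic_type(v) for v in column)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Made-up columns: only e-mails, an even mix, and mostly e-mails.
col_only_email = ["ann@x.org", "bob@y.com", "eve@z.net", "joe@w.io"]
col_even_mix = ["ann@x.org", "bob@y.com", "555-123-4567", "555-987-6543"]
col_mostly_email = ["ann@x.org", "bob@y.com", "eve@z.net", "555-123-4567"]

for col in (col_only_email, col_even_mix, col_mostly_email):
    print(round(type_entropy(col), 3))
```

A column with a single type has entropy 0, and an even mix scores higher than a column dominated by one type. Hard assignment treats all types as equally different, so it cannot express that two types resemble each other.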
138. Moody’s approach
Classification of schemas related by integration
Quality categ. Definition
Integration Level of consistency of the schema with the rest of the org. data
Integr M1 # of data conflicts with the Corporate Schema
Integr M1.a # of entity conflicts
Integr M1.b # of data element conflicts, namely, defs. and domains
Integr M1.c # of naming conflicts (synonyms + homonyms)
Integr M2 # of data conflicts with existing systems (ES)
Integr M2.a # of data element conflicts, namely, defs. and domains with ES
Integr M2.b # of key conflicts, namely, defs. and domains with ES
Integr M2.c # of naming conflicts (synonyms + homonyms) with ES
Integr M3 # of data elements with duplicate data elem. in ES
Integr M4 Rating by representatives of other business areas
139. The Chai approach
Matchability of schemas
• Focus on the evolution of a Data
Integration system, and the cost of
maintaining the mediated schema S
• Quality observed: the matchability of S
against a matching tool M, defined as
• the average of accuracies of matching S
with future schemas F1, F2, …Fn (that we
assume known at least to some extent)
using M
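The definition can be sketched as a simple average. The slide does not say how the accuracy of a single matching is measured, so `match_accuracy` below assumes one possible definition: the share of attributes of a future schema whose predicted match in S equals the correct one, with None meaning "no match". All names and sample data are hypothetical.

```python
def match_accuracy(predicted, truth):
    """Share of attributes whose predicted match equals the true match.

    Both arguments map an attribute of a future schema F to an attribute
    of the mediated schema S, or to None when there is no match.
    """
    if not truth:
        raise ValueError("truth must contain at least one attribute")
    hits = sum(1 for attr, target in truth.items() if predicted.get(attr) == target)
    return hits / len(truth)


def matchability(accuracies):
    """Average accuracy of matching S with the future schemas F1..Fn."""
    if not accuracies:
        raise ValueError("at least one future schema is required")
    return sum(accuracies) / len(accuracies)


# Made-up output of a matching tool M for two future schemas.
truth_f1 = {"name": "full_name", "phone": "telephone", "fax": None}
pred_f1 = {"name": "full_name", "phone": None, "fax": None}  # one missed match
truth_f2 = {"addr": "address", "zip": "postcode"}
pred_f2 = {"addr": "address", "zip": "postcode"}             # all correct

score = matchability([
    match_accuracy(pred_f1, truth_f1),
    match_accuracy(pred_f2, truth_f2),
])
print(round(score, 3))
```

Since the future schemas are only assumed to be known to some extent, a score like this is an estimate over whichever sample of F1..Fn is available.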
140. Cases for matching mistakes
• Predict a spurious match
• Miss a match
• Predict a wrong match
• Strategy to improve matchability
– Change concepts in S using rules that
minimize error probability
142. The data architecture of a set of databases is the
allocation of concepts and tables
across the DB data schemas
Example of change of data architecture due to improving
access efficiency
[Figure: Distributed DB and Centralized DB, with the tables below]
Employee (Employee #, Salary)
Assigned-to (Employee #, Project #, Role)
Project (Project #, Budget)
144. Potential information content
Global schema: Tax payer has Boat; Tax payer declares Income
Query: Find CF, Name of Tax payer that declares <= 30.000 € and has >= 1 Boat
Sources: one schema with Tax payer declares Income; one schema with Tax payer has Boat
145. Potential information content
• Given a global schema I resulting from the
virtual integration of schemas S1, S2, ..., Sn,
the potential information content of I is the
set of queries that can be performed on I
and cannot be performed on S1, S2, ..., Sn.
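A minimal sketch of this definition, under simplifying assumptions: a schema is modeled as a set of (concept, relationship, concept) triples, a query as the set of triples it needs, and a query counts as performable on a schema when every triple it needs is present. The function names and this set-based model are illustrative only; "cannot be performed on S1, ..., Sn" is read here as "cannot be performed on any single source".

```python
def answerable(query, schema):
    """A query is performable if the schema holds every triple it needs."""
    return query <= schema


def potential_information_content(queries, global_schema, sources):
    """Queries performable on the global schema but on no single source."""
    return [
        q for q in queries
        if answerable(q, global_schema)
        and not any(answerable(q, s) for s in sources)
    ]


# The tax payer example: two sources and their virtual integration.
has_boat = ("TaxPayer", "has", "Boat")
declares_income = ("TaxPayer", "declares", "Income")
source_1 = frozenset({has_boat})
source_2 = frozenset({declares_income})
global_schema = source_1 | source_2

q_boats = frozenset({has_boat})                      # needs source 1 only
q_income = frozenset({declares_income})              # needs source 2 only
q_cross = frozenset({has_boat, declares_income})     # low income AND owns a boat

pic = potential_information_content(
    [q_boats, q_income, q_cross], global_schema, [source_1, source_2]
)
print(pic == [q_cross])
```

Only the cross-source query belongs to the potential information content, matching the intuition that integration adds value exactly where a question spans several sources.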
151. Abstraction
[Figure: schemas at different levels of abstraction. Concepts shown: Department, Employee, City, Seller, Engineer, Clerk, Item, Order, Purchaser, Floor, Warranty, Warehouse; relationships shown: Item in Order, Order of Purchaser]
153. First case: integration + abstraction
Company Production Sales Department structure
[Figure: schemas related by integration and abstraction. Concepts shown: Department, Employee, City, Seller, Engineer, Clerk, Item, Order, Purchaser, Floor, Warranty, Warehouse; relationships shown: Item in Order, Order of Purchaser]