Keynote/Invited Talk at the IFIP TC-11 First Working Conference on
Integrity and Internal Control in Information Systems
Zurich, Switzerland, December 4-5, 1997
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
1. Pragmatics Driven Issues in
Data and Process Integrity in Enterprises
Amit Sheth
Large Scale Distributed Information System Lab
University of Georgia
http://LSDIS.cs.uga.edu/
2. Three Real Challenges to Data Integrity
Three realities of the IS environment
• Dirty data
• Interdependent data
• Process coordination/workflow management
but traditional data integrity and database
transaction solutions come up short ...
3. Overview
[Diagram: three problem areas, each with its remedy, all aimed at achieving data integrity]
• Poor quality of data: data cleanup/purification
• Inconsistent related data: correct inconsistencies
• Process coordination: workflow specifications
Remaining diagram labels: data transaction management, interdependent data management, process, manage data integrity
4. Dirty Data
Dirty Data
Users cite their biggest data warehouse challenges:
Managing Data Quality 46%
Business Data Modeling 31%
End-user Expectations 29%
Legacy Data 25%
Transformation 22%
Business Rule Analysis 17%
Management Expectations 16%
Database Performance
Source: DCI/Meta Group, Inc.
5. Dirty Data
Stories I have heard/seen
• 30% fall-outs (“requests for manual assist”)
due to mismatch between address in
customer service request and loop inventory
database in a Telco
• The PUC insisted that a Regional Bell company
reduce the 400 people ($40 million+) it employed
to keep data consistent
6. Dirty Data
Dirty Data: Real World Stories
• Insurance company regional data: 80% of
claims had “broken leg” as diagnosis*
• At a 4% error rate, a $2 billion company forfeits
$80 million in revenue*
* Emily Kay, Dirty Data Challenges Warehouses, DW/Software Magazine, Oct. 97
7. Dirty Data
Data Quality Dimensions
• invalid or impaired data
• incomplete or missing data
• inconsistent data
How to continue business operations
• by discounting the effect of poor-quality data
• without worsening data quality
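The three dimensions above can be illustrated with a small rule-based audit. This is only a minimal sketch under stated assumptions: the record fields (`name`, `zip`, `state`), the reference table, and the rules themselves are hypothetical and do not come from the talk.

```python
# Hypothetical reference data: the state expected for a ZIP code prefix.
ZIP_PREFIX_TO_STATE = {"30": "GA", "10": "NY"}

# Hypothetical list of fields every customer record must carry.
REQUIRED_FIELDS = ("name", "zip", "state")


def audit(record: dict) -> list[str]:
    """Return the data quality defects found in one customer record."""
    defects = []
    # Incomplete or missing data: a required field is absent or empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            defects.append(f"incomplete: {field}")
    # Invalid data: a value outside its legal domain (here, a 5-digit ZIP).
    zip_code = record.get("zip") or ""
    if zip_code and not (len(zip_code) == 5 and zip_code.isdigit()):
        defects.append("invalid: zip")
    # Inconsistent data: two fields that contradict each other.
    expected = ZIP_PREFIX_TO_STATE.get(zip_code[:2])
    if expected and record.get("state") and record["state"] != expected:
        defects.append("inconsistent: zip/state")
    return defects
```

An audit of this kind only reports defects and leaves the record untouched, so operations can continue on the data without making its quality worse.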
8. Dirty Data
Improving Data Quality
• Rule discovery, audit,
scrubbing/cleansing/purifying, defect
prevention
• Commercial offerings give partial solutions
to some aspects of identifying data quality
problems and some aspects of cleanup
(scrubbing)
9. Dirty Data
NASD Data Quality Toolset
Client-access tool: Cognos, SAS, Applix
Conversion tool: ETI* Extract
Metadata tool: Platinum Tech’s Repository
Auditing tool: Prism Solution’s QDB Solutions QDB/Connect
Problem: No integrated solution!
From L. Wilson, “NASD: Securing Data Quality”, DW/Software Magazine, Oct. 97
10. Dirty Data
More on Commercial Solutions
• Commercial solution providers: Information
Builders, Platinum Technologies, SAS
Institute, Group 1 Software, Vality
Technology, First Logic
• Hundreds of thousands of dollars: why?
11. Dirty Data
Issues reasonably addressed
• Conceptual framework -- MIT’s work gives a
very good start
• Most existing solutions apply to a single data
repository or database -- it is possible to use
remote data access solutions for one
database at a time
12. Dirty Data
Challenges to be addressed
• Most solutions deal with structured/relational
data only -- increasingly, data is in different
media
• Most solutions deal with the creation of a data
warehouse; OK for decision support, but what
about operational use?
13. Dirty Data
Data Quality Challenges
How to continue business operations
• by discounting the effect of poor-quality data
• without worsening data quality
“A Mediator for Approximate Consistency:
Supporting ‘Good Enough’ Materialized Views”,
Seligman-Kerschberg
14. Dirty Data
A Research Project: Q-Data
[Architecture diagram, top to bottom]
• GUI: define rules; invoke validation & cleanup; display results or consult
• Rules & Programs: declarative rules and procedural programs, LDL++ (LDL/Prolog/C++)
– Ref. Integrity
– Approx. Match
– Consistency
• Database Access Interface (to databases)
• Legacy System Interface (to legacy information systems)
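An "Approx. Match" rule of the kind listed above could be sketched as follows. This is not the Q-Data implementation (which the slide describes as LDL++ based); it is a hedged illustration in Python using the standard `difflib` module, and the abbreviation table, the 0.85 threshold, and the function names are all assumptions.

```python
import difflib
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(address: str) -> str:
    """Lowercase, drop punctuation, and expand common abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def best_match(address: str, inventory: list[str], threshold: float = 0.85):
    """Return the inventory address most similar to `address`, or None."""
    target = normalize(address)
    best, best_score = None, 0.0
    for candidate in inventory:
        matcher = difflib.SequenceMatcher(None, target, normalize(candidate))
        score = matcher.ratio()
        if score > best_score:
            best, best_score = candidate, score
    # Below the threshold the request falls out for manual assistance.
    return best if best_score >= threshold else None
```

With an inventory of `["123 Main Street", "45 Oak Avenue"]`, a service request for `"123 MAIN ST."` would resolve to the first entry, while an address with no close counterpart returns `None` and still needs a person to look at it.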
15. Dirty Data
Interested in More Information?
• Industry/Practice:
– www.sentrytech.com
– “Data Quality Maze”, DW/Software Magazine,
Oct. 1997
• MIS: Total Data Quality Research:
www.mit.edu/tqdm/www
• Computer Science Research:
Sheth-Wood-Kashyap, Ami Motro, ...
16. Interdependent Data
Interdependent Data and
Multidatabase Consistency
Function-oriented application systems were created independently
to automate different parts of the operation.
Hence, databases were developed independently, where:
• information about a subject is distributed across
multiple systems
• a new application manages existing data
independently
17. Interdependent Data
Interdependent Data and
Multidatabase Consistency
[Diagram: three systems sharing interdependent data]
Systems: Order Processing System, Billing System, Planning & Engineering System
Data: Customer Data, Inventory Data, Assignment Data, Reference Data
18. Interdependent Data
War Stories
• Data analysis: One data element was in 43
separate legacy system files, maintained by
43 separate programs.
• Telco: Customer information is probably in
over 100 information systems. Some
information may be overlapping, and in
different representational forms.
20. Interdependent Data
Lack of understanding and maintenance of data
interdependencies leads to data inconsistency and requires
• manual intervention to complete failed operations
• work-arounds/patches
• manual reconciliation
and results in
• incorrect and wasted operations, poor quality of work
• difficulty in interoperability, high costs
• lost business opportunities
21. Interdependent Data
A Framework for Specifying
Interdependent Data
Data dependency descriptor
• dependency
– structural
– control
• consistency
– data state
– temporal
• restoration
– coupled/decoupled
– vital/non-vital
Sheth and Rusinkiewicz 1990
22. Interdependent Data
A Case Study at Bellcore
[Diagram with the labels: Planning Apps., Inventory/Planning, Source, Reference, Engineering Design Data]
Karabatis and Sheth 92
23. Interdependent Data
An Example of Interdependent Data
YEAR (…, demand, …)   DMD_CAP (…, assigned, …)
ENTITY_JOB (…, capacity, …)
• Dependency: join and aggregation/sum over YEAR and ENTITY_JOB
• Consistency requirement:
– C1: demand/capacity > 0.9, or
– C2: (capacity - demand) < 5000
• Restoration procedure:
– when C1 then regular_planning_update as non-coupled
– when C2 then emergency_planning_update as coupled & vital
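The conditions and restoration procedures in this example can be sketched in a few lines. The sketch assumes `demand` and `capacity` have already been aggregated from YEAR and ENTITY_JOB (the join attributes are not shown on the slide), and the `Restoration` class and `triggered_restorations` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restoration:
    procedure: str  # procedure name taken from the slide
    coupled: bool   # True: runs together with the triggering transaction
    vital: bool     # True: its failure must also fail the triggering transaction


def triggered_restorations(demand: float, capacity: float) -> list[Restoration]:
    """Evaluate C1 and C2 on aggregated values and return what must run."""
    actions = []
    # C1: demand/capacity > 0.9. Written without division so capacity == 0
    # does not raise; equivalent to the slide's form for positive capacity.
    if demand > 0.9 * capacity:
        # Assumption: the slide says only "non-coupled"; vital=False is a guess.
        actions.append(Restoration("regular_planning_update", coupled=False, vital=False))
    # C2: (capacity - demand) < 5000
    if capacity - demand < 5000:
        actions.append(Restoration("emergency_planning_update", coupled=True, vital=True))
    return actions
```

For example, a demand of 96000 against a capacity of 100000 satisfies both C1 and C2, so both updates are returned; how the two should be ordered when both fire is not stated on the slide.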
24. Interdependent Data
Types of Dependency Specification
• Redundant data
– replicated data, primary-secondary copies
– vertical/horizontal partitions
• Semantic integrity constraints
– value, existential constraints
• Derived data
25. Interdependent Data
Types of Consistency Requirements
• Immediate consistency
• Eventual consistency
• Lagging consistency
– Temporal criteria
• at or before some time, within an interval, periodically
– Data state criteria
• number of operations or data items changed, value of the change,
before or after an operation
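The temporal and data state criteria above can be combined into a simple refresh test for a lagging copy. This is a minimal sketch, not a published protocol: the `LaggingCopy` class, its field names, and the choice of three bounds are assumptions made for illustration, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class LaggingCopy:
    """A secondary copy that may lag its primary within stated bounds."""
    max_pending_ops: int      # data state criterion: updates not yet propagated
    max_value_drift: float    # data state criterion: accumulated size of changes
    max_age_seconds: float    # temporal criterion: time since the last refresh
    pending_ops: int = 0
    value_drift: float = 0.0
    last_refresh: float = 0.0

    def record_update(self, delta: float) -> None:
        """Note one update applied to the primary but not yet to this copy."""
        self.pending_ops += 1
        self.value_drift += abs(delta)

    def must_refresh(self, now: float) -> bool:
        """True once any one of the consistency bounds has been reached."""
        return (
            self.pending_ops >= self.max_pending_ops
            or self.value_drift >= self.max_value_drift
            or now - self.last_refresh >= self.max_age_seconds
        )

    def refresh(self, now: float) -> None:
        """Bring the copy up to date and reset all counters."""
        self.pending_ops = 0
        self.value_drift = 0.0
        self.last_refresh = now
```

Setting every bound to its tightest value approximates immediate consistency, while dropping the bounds altogether leaves only eventual consistency.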
26. Interdependent Data
Some Relevant Work: Criteria
• replica control: primary-secondary copies, one-copy
serializability
• epsilon-serializability [Pu & Leff], N-ignorance
[Krishnakumar & Bernstein], k-completeness [Sarin et al]
• eventual and lagging consistency [Sheth et al]
27. Interdependent Data
Some Relevant Work: Modeling
• Identity Connections [Wiederhold & Qian]
• Demarcation Protocol [Barbara and Garcia-Molina]
• Data Dependency Descriptors
[Rusinkiewicz/Sheth/Karabatis]
• Existence/Value Dependency [Ceri & Widom],
Interdependencies (existence, structural,
behavioral, value) [Li and McLeod]
• Computational Invariants, PATH structure [Etzion]
• ECA Rules [Dayal]
28. Interdependent Data
Enforcement Strategies
• Application code
• Middleware: Transaction Monitors,
Replication Server [Notes]
• Quasi-copies [Barbara et al]
• Production Rules and Persistent Queues [Ceri
and Widom]
• Extended Distributed Transaction
Management
– Polytransactions [Sheth et al], Quasi-transactions
[Arizio et al]
29. Interdependent Data
Polytransactions
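[Diagram: a polytransaction as a tree of related transactions]
• root transaction (t1), with related transactions t2a, t2b, and t3
• t2a and t2b are coupled (labels: coupled-vital, coupled-non-vital); t3 is non-coupled
• each of three sites runs an Interdependent Data Manager on top of its Local DBMS; an IDS label also appears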
How are related transactions determined? => S,U,P
When is a related transaction created? => C, Policy
What does a related transaction do? => A
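One possible reading of the coupled/non-coupled and vital/non-vital attributes is sketched below. It is an illustration only, not the polytransaction mechanism from the cited work: the `Txn` class and `run` function are hypothetical, the interpretation (coupled children run before the parent finishes, a failed vital child fails its parent, non-coupled children are deferred) is an assumption, and undoing already-committed children is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Txn:
    name: str
    action: Callable[[], bool]   # returns True on commit, False on abort
    coupled: bool = True         # relation to the parent transaction
    vital: bool = False          # does a failure here fail the parent?
    children: list["Txn"] = field(default_factory=list)


def run(txn: Txn, deferred: list[Txn], log: list[str]) -> bool:
    """Execute txn and its coupled children; queue non-coupled children."""
    if not txn.action():
        log.append(f"{txn.name}: abort")
        return False
    queued = []
    for child in txn.children:
        if not child.coupled:
            queued.append(child)          # runs later, outside the parent
            continue
        ok = run(child, queued, log)
        if not ok and child.vital:
            log.append(f"{txn.name}: abort (vital child {child.name} failed)")
            return False
    deferred.extend(queued)               # released only if the parent commits
    log.append(f"{txn.name}: commit")
    return True
```

In this reading, a root t1 with a vital t2a, a non-vital t2b, and a non-coupled t3 commits even if t2b aborts, and t3 is handed to whatever scheduler processes the deferred queue. Which of t2a and t2b is vital is not recoverable from the slide, so that assignment is arbitrary here.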
30. Interdependent Data
Enforcement Policy
[Diagram: enforcement policies placed against the state of the data (current, consistent, inconsistent): eager restoration, partial restoration, late restoration or lazy restoration]
31. Workflow
Workflow Management
• Workflow Management (WFM) is the
automated coordination, control, and
communication of work, both of people and
computers, in the context of organizational
processes, through the execution of software in
a network of computers whose order of execution
is controlled by a computerized representation of
the business processes.
32. Workflow
What is workflow about?
• Effective coordination, control and
communications of work among human
participants and system/information resources
to orchestrate organizational processes
• Need to improve human/organization
productivity, efficiency, quality of work
• New paradigm for “Programming in the large”
33. METEOR Workflow Model
(very high level)
[Diagram: tasks connected from start to end, including a filter task; interfaces link the tasks to processing entities (proc. entity); an auxiliary system (aux. sys) is also shown]
35. A Complex Real-world Example
[Diagram: a CLINICAL SUBSYSTEM and a TRACKING SUBSYSTEM, annotated as follows]
• Generates alerts to identify patient’s needs, and contraindications to caution providers
• Reminders to parents
• Health providers can obtain up-to-date clinical and eligibility information
• Hospitals and clinics update central databases after encounters
• Reports to state
• Health agencies can use reports generated to track population’s needs
• SDOH and CHREF maintain databases, support EDI transactions
• State and HMOs can update patient’s eligibility data
• Hospitals and case workers can reach out to the population
• HMOs can keep track of performance
36. Implementation Testbed
[Diagram: implementation testbed spanning CHREF, CHREF/SDOH, hospital, and clinic sites connected over the Internet]
• User roles: Admit Clerk, Triage Nurse, Doctor/NP, Maternity Ward Administrator, Case Worker, etc.
• Infrastructure: CORBA (ORBeline)*, Web servers, Network File System
• Hosts: Iris (Pentium / Windows NT), Om (SunSparc 20 / Solaris), Optimus (SunSparc 2 / Solaris), Ra (SunSparc 20 / Solaris)
• DBMSs: Illustra DBMS, Oracle7 DBMS
• Data: MPI, MEI, Immunization Db, Detailed Encounter Db, Insurance Eligibility Db, files
• Other labels: POMS
37. Workflow
Data Integrity Challenges
• Workflows express application level
integrity needs
– e.g., customer information available to task 1 should be
consistent with the related information
available to task 2, even if both execute quite
independently
• In wake of -- inter-workflow requirements
• Integrity of specification for adaptive
workflows
38. Workflow
Weaknesses of
State-of-the-art WFMS
• Lack of clear theoretical basis
• Undefined correctness criteria
• Limited support for:
– Concurrency Control
– Interoperability between workflow systems
– Scalability
– Availability
– Recovery (no human-assisted recovery)
39. Workflow
Transactions to the rescue?
• DB transactions and DP transactions
address the correctness, consistency, and
recovery issues to different degrees, and
have a strong theoretical foundation
• BUT can they apply to Workflow
Management? Applications and
environments differ significantly!
40. Workflow
Transactions in WFMS
• Task specific:
– transactional tasks (e.g., database related)
– distributed transaction processing
• Domain specific:
– EDI, HL7
– business contracts
41. Workflow
Transactions in WFMS
• Business-process specific:
– workflow correctness and reliability from a
business process point of view
– roles, worklists, error handling
• Infrastructure specific (each with its own
notions):
– CTM, DOM (CORBA), WWW, TP-monitors,
Lotus Notes
42. Workflow
An intuitive argument -
why extended transactions don’t apply
• Advanced transaction models (ATMs) were often
motivated by a particular domain or a set of
applications ... too narrow a scope in many cases
• Workflow is more horizontal in nature;
many ATMs have been vertical in nature
(transaction concepts scale relatively well
with hierarchical decompositions)
• Significant human involvement, long
running, autonomous systems, ...
43. Workflow
Characteristics of Large-Scale
Real-World Workflow Applications
• Heterogeneous, autonomous, and distributed (HAD) computing environments
• Multiple communication paradigms
• Humans, legacy applications, and other non-
transactional tasks
• Organizational requirements (roles,
authentication, security, etc.)
• Heterogeneous multimedia data
• Dynamic and virtual enterprises
• Electronic commerce
44. Workflow
Our view
• In the context of workflows:
– basis for modeling transactional tasks … YES
– basis for modeling a group of tasks as a
transaction … MAYBE or YES
– basis for ensuring reliable communication
between workflow components … MAYBE or
YES
– basis for modeling workflows ?? … NO!
• Transactions -- yes, ATMs -- probably not
45. Workflow
Our view
• Notion of transactions in WFMS is more
generalized than in TP-systems and DBMSs
• Workflow systems should provide support for all
forms of transactions
• Strict transactional semantics not practical in
workflow systems
• Role of transactions in workflow systems:
– for tasks within the workflow process
– for implementing solutions to support fault-tolerance,
concurrency control, correctness, recovery
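The first role listed above, a transaction used for one task inside a workflow that is not itself a transaction, might look like the following sketch. Everything in it is an assumption made for illustration: the `capacity` table, the function names, and the use of Python's standard `sqlite3` module as the transactional resource.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical capacity counters."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE capacity (kind TEXT PRIMARY KEY, units INTEGER)")
    conn.executemany("INSERT INTO capacity VALUES (?, ?)",
                     [("available", 100), ("assigned", 0)])
    conn.commit()
    return conn


def assign_units(conn: sqlite3.Connection, units: int) -> None:
    """Transactional task: move units from 'available' to 'assigned' atomically."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE capacity SET units = units - ? WHERE kind = 'available'", (units,))
        conn.execute("UPDATE capacity SET units = units + ? WHERE kind = 'assigned'", (units,))
        (left,) = conn.execute(
            "SELECT units FROM capacity WHERE kind = 'available'").fetchone()
        if left < 0:
            raise ValueError("not enough capacity")


def run_workflow(conn: sqlite3.Connection, units: int, approve) -> list[str]:
    """Run the tasks in order; a failure stops the flow but undoes nothing earlier."""
    history = []
    # Non-transactional task: a human decision, stubbed here as a callable.
    if not approve(units):
        history.append("approval: rejected")
        return history
    history.append("approval: done")
    try:
        assign_units(conn, units)
        history.append("assignment: committed")
    except ValueError:
        history.append("assignment: rolled back, needs manual recovery")
    return history
```

Only the database task gets atomicity here. The approval step stays outside any transaction, and a failed assignment is recorded in the history for follow-up by a person instead of rolling back the whole workflow.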
46. Conclusions
• Neither the systems environment nor data
integrity requirements are as “simplistic”,
“clean”, or “well defined” as in research
• Research has taken a “black and white”
approach -- we need to deal with “shades of
gray”: how do you deal with the imperfect
world?
47. • We have to address issues that span
multiple heterogeneous systems
– numerous, more challenging, more complex
• Both data-level and application/process-level issues
48. For more information:
http://lsdis.cs.uga.edu
For publications: check corresponding areas at
http://lsdis.cs.uga.edu/publications