Improving data quality and regulatory compliance in global Information Technology
Harvey Robson
M.Sc., CEng, MIMechE, BEng
A thesis submitted in partial fulfilment of the
requirements of Leeds Metropolitan University
for the degree of Doctor of Engineering
This research programme was carried out
in collaboration with the global pharmaceutical company GSK
AUGUST 2010
Abstract
This thesis deals with the issues around data quality and how they affect a
company in areas such as financial reporting and the resolution of computer-related
incidents and problems. It also demonstrates that large savings can be made
using the methods outlined in the thesis. The literature review provides a thorough
analysis of previous and existing work in the area of data quality, and acts as a gap
analysis. The research chapter explains the reasoning behind the frameworks
developed for this thesis and provides a comparison and grading of the available
methodologies. There are two case studies: one on the reduction of data field
errors using the Six Sigma methodology, and a second detailing a framework for
reducing cycle time and gaining productivity savings around the Sarbanes-Oxley
regulations. The analysis chapter provides an overview of the contributions
resulting from this thesis. Finally, the conclusions resulting from this thesis and
recommendations for future research are outlined.
The main contribution of this thesis is a framework, based on empirical study, that
identifies and resolves causes of data quality issues. Process steps that have an
adverse effect on data quality are surfaced. The approaches to research and the
associated tools and techniques are discussed. Guidance is provided that aids
management in resolving process performance issues. Finally, further uses of the
tools and techniques identified in this thesis are outlined and discussed at the end
of the conclusions chapter. The results of the research carried out in this thesis can
potentially be used by different organizations to drive improvements. These tools
would allow organizations to use a structured approach to identify, and then focus
on, the most important areas for process improvement.
Acknowledgements
I would like to acknowledge the many people in my academic, business and
personal life who provided the opportunity to undertake this study, and to thank
them for their help, support, encouragement and guidance over the last four years.
Their support not only enabled me to complete my DEng dissertation but also
enabled me to keep striving to make this work as extensive as time and energy
would allow.
In particular, I would like to acknowledge the contribution of my principal
supervisor, Professor David Webb (Leeds Metropolitan University), who guided and
encouraged me from the beginning and throughout my whole DEng candidature.
I would also like to acknowledge the guidance of my industrial supervisors:
Professor Paul Drake (GSK), for his unswerving support, patience, academic
guidance and attention to detail; and Dr. Tim Dickens (Cambridge University), for
his detailed guidance and insight on management strategies.
Other friends and colleagues in the Innovation North (INN) Faculty of Information
and Technology and particularly in the School of Information Management
provided invaluable assistance, support and feedback throughout the course of my
studies.
I also wish to acknowledge the support and guidance of Kent Jensen (GSK) and
Vic Porcelli (GSK) for providing the opportunity and encouragement for me to
undertake, review and continue this study over the last four years.
Finally, I wish to express my gratitude and love to my parents and daughters for
their unreserved love, support and encouragement.
I confirm that the thesis is my own work; and that all published or other sources of
material consulted have been acknowledged in notes to the text or the
bibliography.
I confirm that the thesis has not been submitted for a comparable academic award.
Harvey Robson
February 2011
Table of Contents
Section Page
1 Introduction.................................................................................................................... 1
1.1 Research Overview........................................................................................... 7
1.2 Aim ................................................................................................................. 11
1.3 Overall aim ..................................................................................................... 11
1.4 Overall Objectives .......................................................................................... 12
1.4.1 Aim for case study 1 ....................................................................................... 12
1.4.2 Research questions case study 1 CMDB Data Quality................................... 13
1.4.3 Aim for Case study 2 ...................................................................................... 13
1.4.4 Sarbanes Oxley (SARBOX) background ....................................................... 13
1.4.5 Research Questions – case study 2 SARBOX................................................ 15
1.5 Background..................................................................................................... 16
1.6 What is data quality ........................................................................................ 16
1.7 Why are organizations such as GSK concerned about data quality?.............. 17
2 Literature Review......................................................................................................... 24
2.1 Introduction..................................................................................................... 24
2.1.1 Summary of Six Sigma and SARBOX........................................................... 29
2.2 Overview of Chapter Two .............................................................................. 31
2.3 Data Quality World......................................................................................... 32
2.4 Categorisation of the information................................................................... 37
2.5 Sort by significance and value add ................................................................. 38
2.6 Narrowing the focus........................................................................................ 39
2.7 Identify the gaps.............................................................................................. 39
2.8 Where value can be added .............................................................................. 40
2.8.1 Apply the existing methods and theory to a new area .................................... 40
2.9 Specific to Case studies one and two.............................................................. 41
2.9.1 Literature specific to case study one............................................................... 41
2.9.2 Literature specific to case study two............................................................... 51
2.10 Six Sigma – a discussion ................................................................................ 60
2.10.1 Development of Six Sigma – A timeline:....................................................... 61
2.10.2 The Pareto Principle and Pareto Chart............................................................ 62
2.11 Summary of Chapter Two............................................................................... 65
3 Research Chapter ......................................................................................................... 67
3.1 Introduction to Chapter 3................................................................................ 67
3.1.1 Overview of Research approaches.................................................................. 73
3.2 Two Research Paradigms - Positivism and Phenomenology.......................... 79
3.3 Choice of methodologies ................................................................................ 88
3.3.1 Selection of research methods and tools......................................................... 89
3.4 Output of the research chapter........................................................................ 92
4 Case Study One............................................................................................................ 96
4.1 Case Study One Introduction.......................................................................... 96
4.1.1 Research methodology for Case Study One ................................................... 98
4.1.2 Case study Summary ...................................................................................... 99
4.1.3 Exclusions..................................................................................................... 100
4.1.4 Definition of Exclusion................................................................................. 100
4.1.5 Exclusion Process ......................................................................................... 101
4.2 Define: the Process ...................................................................................... 101
4.2.1 Define: Process owner .................................................................................. 101
4.2.2 Define: Process improvement case study management................................ 102
4.2.3 Define: Process Customers ........................................................................... 102
4.2.4 Define: Purpose of Process........................................................................... 102
4.2.5 Define: Voice of the Customer - requirements............................................. 103
4.3 Define: Process Definition............................................................................ 105
4.3.1 Define: High Level Process .......................................................................... 105
4.3.2 Define: Determining the baseline ................................................................. 106
4.3.3 Define: CMDB Data Quality SIPOC............................................................ 107
4.3.4 Define: Integrated Flowchart........................................................................ 108
4.3.5 Define: Agree customer requirements .......................................................... 109
4.3.6 Survey metrics (telephone, face to face interviews and workshops)............ 110
4.4 Measure: Identify and Collect the data......................................................... 112
4.4.1 Measure: Define data collection ................................................................... 112
4.4.2 Measure: Customer and User feedback gathering ........................................ 112
4.4.3 Measure: How the data was identified.......................................................... 113
4.4.4 Measure: Why do we need this particular data............................................. 114
4.4.5 Measure: How the data was collected........................................................... 114
4.4.6 Measure: Listen to the voice of the process.................................................. 114
4.4.7 Measure: The original process:..................................................................... 116
4.5 Measure: The monthly metrics reported....................................................... 117
4.5.1 Measure: Field value error metrics ............................................................... 117
4.5.2 Measure: Metrics for ITMT Data Quality Infrastructure.............................. 118
4.5.3 Measure: Dependency errors ........................................................................ 119
4.5.4 Measure: Baseline measurements................................................................. 121
4.5.5 Measure: Field Value Errors Baseline Measurement ................................... 121
4.5.6 Measure: The Process Capability ................................................................. 122
4.5.7 Measure: Bar charts ...................................................................................... 125
4.6 Analyse: Dependency Errors Baseline Measurement.................................. 131
4.6.1 Analyse: Cause and Effect............................................................................ 131
4.6.2 Analyze: Voice of the customer – reporting results to the customer........... 132
4.6.3 Analyze: Bar charts....................................................................................... 134
4.6.4 Analyze: Identify the root causes.................................................................. 135
4.7 Improve: Process Improvement Approach ................................................... 141
4.7.1 Improve: What were the changes and what led to them? ............................. 141
4.7.2 Improve: Metrics focus................................................................................. 141
4.7.3 Improve: Control Chart before and after improvement combined ............... 143
4.7.4 Improve: Moving Range Chart before and after improvement combined.... 143
4.7.5 Improve: The Process Capability Diagram – after improvement................. 144
4.7.6 Improve: The 'to be' process at the successful conclusion of the case study 146
4.7.7 Improve: Control Chart representing before the data quality improvement. 147
4.7.8 Improve: Control Chart representing the data after quality improvement.... 147
4.7.9 Improve: Implementation of process Improvement..................................... 148
4.7.10 Improve: Process Improvement of dependencies ........................................ 150
4.7.11 Improve: Deep quality auditing.................................................................... 150
4.7.12 Improve: Continuous Improvement.............................................................. 151
4.7.13 Improve: Ideas for improvement in data quality: ......................................... 151
4.7.14 Improve: Key Learning and Future Plans..................................................... 154
4.7.15 Improve: Future Plans.................................................................................. 155
4.7.16 Result of the Future Plans updated 2010 ...................................................... 155
4.8 Control: Ensure that the problems seen are kept in control.......................... 158
4.9 Case Study One Conclusions........................................................................ 159
4.9.1 Relationship between the Process Sigma and Z values ................................ 161
4.10 Reclaim of servers – a follow on small project ............................................ 165
4.11 Model to be used for improvement case study one ...................................... 168
5 Case Study Two ......................................................................................................... 170
5.1 Introduction to chapter 5............................................................................... 170
5.1.1 Research methodology for Case Study Two................................................. 170
5.2 Summary of Case study Two........................................................................ 173
5.2.1 Summary: Case study Purpose ..................................................................... 173
5.2.2 Summary: the Process Purpose..................................................................... 174
5.2.3 Summary: Primary and Secondary Customers ............................................. 176
5.2.4 Summary: Process Owners ........................................................................... 176
5.2.5 Summary: Process improvement case study team........................................ 177
5.2.6 Summary: Mentors ....................................................................................... 177
5.2.7 Summary: Agree customer requirements ..................................................... 177
5.2.8 Summary: Baseline Measurement ................................................................ 178
5.2.9 Summary: Milestones ................................................................................... 178
5.3 Define: Voice of the Customer Critical To Quality Tree ............................. 179
5.4 Define: Process Definition........................................................................... 181
5.4.1 Define: Operational Definition ..................................................................... 181
5.4.2 Define: Select Priorities................................................................................ 181
5.4.3 Define: Learn about process: Project Charter............................................... 184
5.4.4 Define: Learn about process: Project Contract............................................. 188
5.4.5 Define: Learn about process: CMDB / SARBOX Data Quality SIPOC ...... 193
5.4.6 Define: Learn about process: Integrated Flowchart...................................... 195
5.4.7 Define: Learn about process: Customer / results measure matrix ................ 197
5.4.8 Define: Learn about process: Results / measures matrix.............................. 198
5.4.9 Define: Learn about process: Data Collection.............................................. 199
5.4.10 Survey metrics (telephone, face to face interviews and workshops)............ 199
5.5 Measure: Investigate Sources of Variation: Voice of process Baseline ....... 202
5.5.1 Measure: The Process Sigma – before improvement ................................... 204
5.5.2 Measure: Investigate Sources of variation.................................................... 205
5.5.3 Measure: Investigate Sources of Variation: Pareto...................................... 205
5.5.4 Measure: Investigate Sources of Variation: Cause and Effect .................... 206
5.5.5 Measure: The 5 whys................................................................... 206
5.5.6 Measure: Define Exclusions and exclusions process .................................. 208
5.5.7 Measure: Investigate Sources of Variation: MSA ........................................ 209
5.5.8 Triangulation of the testing results ............................................................... 209
5.5.9 Measure: Failure Mode Effects Analysis...................................................... 210
5.6 Analyze: Hypothesis Testing........................................................................ 212
5.6.1 Explanation of Hypothesis testing ................................................................ 212
5.7 Hypothesis test overview.............................................................................. 213
5.7.1 Hypothesis test 1 – number of records and errors ........................................ 218
5.7.2 Confirmation of the model by removing two outliers .................................. 222
5.7.3 Hypothesis test 2 – data distribution normality ............................................ 223
5.7.4 Hypothesis test 3 – completion cycle times of business units ...................... 226
5.7.5 Mood Median test for cycle times of different Business Units .................... 227
5.7.6 Hypothesis test 4 – Number of applications ................................................. 228
5.7.7 Hypothesis test 5 – Number of dependencies............................................... 231
5.7.8 Mood Median Test: Defect data versus Cycle by defect.............................. 233
5.7.9 Box plot of Defect data by cycle .................................................................. 234
5.7.10 Test for Equal Variances for Defect Data..................................................... 235
5.7.11 Analyze: Identify the root causes.................................................................. 237
5.7.12 Improvement in defect rates cycles II and III 2007 to Cycle I 2008 ............ 238
5.7.13 Design of Experiments – which approach to take ........................................ 241
5.7.14 Three factors used in the DOE...................................................................... 242
5.7.15 Analyze: Actions taken further to Hypothesis testing .................................. 247
5.8 Improve: Study Results................................................................................. 250
5.8.1 Improve: Study Results Control Chart.......................................................... 250
5.8.2 The Process Sigma – after improvement ...................................................... 252
5.8.3 Improve: Benefits delivered.......................................................................... 252
5.9 Control: Standardise ..................................................................................... 254
5.9.1 Control: Control plan.................................................................................... 254
5.9.2 Control: Review: Case study closure:........................................................... 256
5.9.3 Control: Review: Future Plans...................................................................... 257
5.10 Case Study Two Conclusions ....................................................................... 259
6 Conclusions................................................................................................................ 264
6.1 Introduction to conclusions chapter.............................................................. 264
6.2 Contribution made by this thesis .................................................................. 264
6.2.1 Conclusions from Case study one................................................................. 267
6.2.2 Conclusions from Case study two ................................................................ 270
6.3 Advantages of using this approach ............................................................... 274
6.4 Limitations of this approach ......................................................................... 275
6.5 Recommendations for further work.............................................................. 275
7 References.................................................................................................................. 279
Bibliography....................................................................................................................... 290
List of names...................................................................................................................... 294
Appendices......................................................................................................................... 301
Appendix One – Hypothesis Selection flowchart.......................................................... 301
Appendix Two – Factors for variables control charts.................................................... 302
Appendix Three – the Western electric rules................................................................. 303
Index................................................................................................................................... 306
Tables
Section Page
Table 1 The business impacts delivered by the five benefits ................................. 11
Table 2 Volume of data on users and transactions within the CMDB ................... 19
Table 3 Article evaluation form .................................................................................... 38
Table 4 Overview of research paradigms...................................................................... 80
Table 5 Selection of research methods and tools..................................................... 91
Table 6 Points to consider when selecting research method ................................. 92
Table 7 The Process Improvement Team................................................................ 102
Table 8 The Data Quality SIPOC .............................................................................. 107
Table 9 Email Survey (unique record owners) for case study one ...................... 111
Table 10 Meetings, interviews and workshops for case study one...................... 111
Table 11 Field value error metrics............................................................................... 117
Table 12 Field value error metrics example output................................................... 118
Table 13 Metrics for ITMT Data Quality Infrastructure............................................. 118
Table 14 Metrics for ITMT Data Quality Infrastructure example output................. 119
Table 15 Dependency errors........................................................................................ 120
Table 16 Dependency errors example output ........................................................... 120
Table 17 Table of Defects and resultant Sigma values ........................................... 163
Table 18 List of reclaimed servers by Business Unit................................................ 166
Table 19 List of savings by reclaimed server............................................................. 167
Table 20 Primary Customers........................................................................................ 176
Table 21 Secondary Customers .................................................................................. 176
Table 22 Process Owners............................................................................................. 176
Table 23 The Process Improvement Team................................................................ 177
Table 24 The case study mentors ............................................................................... 177
Table 25 Milestones....................................................................................................... 178
Table 26 Case study two project charter.................................................................... 187
Table 27 Case study two project contract .................................................................. 192
Table 28 The Data Quality SIPOC .............................................................................. 194
Table 29 Email Survey for case study two................................................................. 199
Table 30 Meetings and workshops for case study two ............................................ 200
Table 31 Failure Modes Effects Analysis ................................................................... 211
Table 32 Completion times........................................................................................... 233
Table 33 Defect rating by Department........................................................................ 239
Table 34 Error counts by department by year ........................................................... 240
Table 35 Design of Experiments – Factors, Runs and blocks table ...................... 245
Table 36 Randomized design Table ........................................................................... 245
Table 37 Design of Experiments – Factors................................................................ 245
Table 38 Design of Experiments – which approach to take, Run order table ...... 246
Table 39 Control plan .................................................................................................... 255
Table A2.1 Factors for variables control charts .................................................... 302
Figures
Section Page
Figure 1 Relationship of levels of data and how they support the business. ........ 3
Figure 2 Literature Review .......................................................................................... 31
Figure 3 Total articles by year reviewed for case studies one and two ............... 33
Figure 4 Key Milestones and contributors................................................................. 61
Figure 5 Approach taken to understand and select research methodology........ 68
Figure 6 Ontology, epistemology, methodology and methods .............................. 71
Figure 7 A scheme for analyzing assumptions about nature of social science .. 75
Figure 8 Matrix of research philosophies Easterby-Smith et al. (2006) ............... 83
Figure 9 Model of research approach Drake (2005) ............................................... 85
Figure 10 The Exclusions process ............................................................................. 101
Figure 11 The high Level Process.............................................................................. 105
Figure 12 CMDB Data Quality Improvement Integrated Flowchart....................... 109
Figure 13 Data collection Metric design process ..................................................... 113
Figure 14 Record Counts (records containing no errors) ....................................... 127
Figure 15 Record Percent Yield Error Reduction..................................................... 128
Figure 16 Error Counts................................................................................................. 129
Figure 17 Error Reduction ........................................................................................... 130
Figure 18 Data quality Fishbone / Ishikawa Diagram.............................................. 132
Figure 19 Bar chart-reduction Chameleon Field Value Errors-Record Counts... 133
Figure 20 Yield (inverse DPO) .................................................................................... 137
Figure 21 Error Corrections......................................................................................... 138
Figure 22 Top 10 Service Owners.............................................................................. 139
Figure 23 Top 10 Errors............................................................................................... 140
Figure 24 Control Chart before and after improvement combined........................ 143
Figure 25 Moving Range Chart before and after improvement combined........... 143
Figure 26 Control Chart representing before the data quality improvement ....... 147
Figure 27 Control Chart representing the data after quality improvement........... 147
Figure 28 Chameleon Field Value Errors - Total Field Error Counts.................... 149
Figure 29 Graphical example of Dependencies....................................................... 150
Figure 30 Normal distribution and Z values.............................................................. 161
Figure 31 Model resulting from case study one ....................................................... 168
Figure 32 Overview of Chapter Five .......................................................................... 172
Figure 33 Customer Critical Quality Tree.................................................................. 180
Figure 34 The high Level Process.............................................................................. 181
Figure 35 Priorities System Map................................................................................. 182
Figure 36 SARBOX record checks and verification process.................................. 196
Figure 37 Customer / results measures matrix ........................................................ 197
Figure 38 Results / measures matrix ......................................................................... 198
Figure 39 Data collection tracking example.............................................................. 201
Figure 40 Data collection Metric design process ..................................................... 201
Figure 41 SARBOX Record Errors baseline cycle 3 2006 to cycle 1 2007 ......... 204
Figure 42 Pareto of the Data Quality Error Types ................................................... 205
Figure 43 Cause and effect diagram.......................................................................... 206
Figure 44 The Exclusions process ............................................................................. 208
Figure 45 Triangulation of the script testing.............................................................. 210
Figure 46 Hypothesis test 1 – number of records and errors, fitted line plot....... 218
Figure 47 Hypothesis test 1 – Residuals versus the order of the data................. 220
Figure 48 Hypothesis test 1 record numbers fitted line plot, 2 outliers removed 223
Figure 49 Hypothesis test 2 – data distribution normality....................................... 225
Figure 50 Moods Median test for cycle times of different Business Units ........... 227
Figure 51 Residual plots for completion times of different Business Units.......... 229
Figure 52 Fitted Line Plot for completion times of different Business Units ........ 229
Figure 53 Adjusted Fitted Line Plot completion times different Business Units.. 230
Figure 54 Hypothesis test 5 – Residual plots completion times dependencies.. 231
Figure 55 Hypothesis test 5 – Fitted line plot completion time vs dependencies 232
Figure 56 Mood Median Test: Defect data versus Cycle by defect ...................... 233
Figure 57 Boxplot of Defect data by Cycle by defect .............................................. 234
Figure 58 Test for Equal Variances for Defect data ................................................ 236
Figure 59 Design of Experiments – Normal Probability of the effects.................. 243
Figure 60 Interaction Plot (data means) for Cycle Time ......................................... 244
Figure 61 Design of Experiments – 3D Cube Plot for Approach........................... 246
Figure 62 Control Chart before and after improvement combined........................ 251
Figure 63 Framework for process improvement ...................................................... 266
Figure A1.1 Hypothesis Selection Flowchart ................................................... 301
Figure A3.1 Example control chart and zones (Western Electric rules) .......... 303
Equations
Section Page
Equation 1 USL before improvement....................................................................... 123
Equation 2 Mean of the individual values ............................................................... 123
Equation 3 Mean of the moving ranges................................................................... 123
Equation 4 Process Capability Index, Upper.......................................................... 123
Equation 5 Defects Per Opportunity ........................................................................ 124
Equation 6 Defects Per Million Opportunity............................................................ 124
Equation 7 Process Sigma ........................................................................................ 124
Equation 8 Process Yield........................................................................................... 124
Equation 9 USL After.................................................................................................. 144
Equation 10 Mean of the individual values ............................................................... 144
Equation 11 Mean of the moving ranges................................................................... 144
Equation 12 Process Capability Index, Upper.......................................................... 145
Equation 13 Defects Per Opportunity ........................................................................ 145
Equation 14 Defects Per Million Opportunity............................................................ 145
Equation 15 Process Sigma ........................................................................................ 146
Equation 16 Process Yield........................................................................................... 146
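The equations listed above follow the standard Six Sigma definitions used later in the thesis. The sketch below shows one plausible Python rendering of those formulas; the sigma estimate from the mean moving range (divisor 1.128) and the conventional 1.5-sigma shift are standard textbook assumptions, and the sample figures are illustrative rather than taken from the case studies.

```python
from statistics import NormalDist, mean

def moving_ranges(xs):
    """Absolute differences between consecutive individual values."""
    return [abs(b - a) for a, b in zip(xs, xs[1:])]

def cpk_upper(xs, usl):
    """Upper process capability index, CpU = (USL - mean) / (3 * sigma),
    with sigma estimated from the mean moving range (MR-bar / 1.128)."""
    x_bar = mean(xs)
    sigma_hat = mean(moving_ranges(xs)) / 1.128
    return (usl - x_bar) / (3 * sigma_hat)

def dpmo(defects, units, opportunities_per_unit):
    """Defects Per Million Opportunities."""
    dpo = defects / (units * opportunities_per_unit)  # defects per opportunity
    return dpo * 1_000_000

def process_sigma(dpmo_value, shift=1.5):
    """Short-term process sigma, using the conventional 1.5-sigma shift."""
    return NormalDist().inv_cdf(1 - dpmo_value / 1_000_000) + shift

def process_yield(dpmo_value):
    """Yield = proportion of opportunities without a defect."""
    return 1 - dpmo_value / 1_000_000

# The classic six sigma benchmark of 3.4 defects per million
# corresponds to a process sigma of about 6.0:
benchmark_sigma = process_sigma(3.4)
```

The same helper functions cover both the "before improvement" and "after improvement" versions of the equations, since only the input data changes between the two.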
Glossary
The glossary lists the terms, abbreviations and symbols used in this
thesis. The meaning given is how the term is used in this thesis.
Abbreviation/Term Meaning
Accessibility Denotes the extent to which data are
available, or easily and quickly
retrievable (Strong and Yang, 2003)
ACS Affiliated Computer Services
Accuracy Denotes the extent to which data are
correct and free-of-error (Strong, Yang,
2003)
α Alpha - in statistics the formula for
expressing confidence level is 1-α
where α is the acceptable percentage of
error. Typical values of confidence
levels are 80%, 90% and 95% and so in
each of these cases the α value would
be 0.2, 0.1 and 0.05 respectively.
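The arithmetic described in this entry (α = 1 − confidence level) can be sketched in a few lines of Python; this is an illustrative aside, not material from the thesis itself:

```python
# Alpha is the acceptable probability of error: alpha = 1 - confidence level.
for confidence in (0.80, 0.90, 0.95):
    alpha = round(1 - confidence, 2)
    print(f"confidence {confidence:.0%} -> alpha {alpha}")
```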
α Risk Probability of committing a type I error
(q.v.), an example would be treating
common cause variation as assignable
cause.
Anti-Positivism (Epistemology (q.v.)) Approach based on observations and
unstructured interviews rather than
hard, measured scientific data; tends
to be used more in social theory and
research.
Applications End user software (q.v.) such as SAP.
Assignable cause (sometimes known as
special cause variation)
A cause that can be assigned to a one
off occurrence outside of normal
process conditions, such as an external
power outage that causes cessation of a
process but is not in control of the
process, a one off or rare occurrence.
Attribute Properties of a CI, such as device
name, device type, unique identifier.
Axiology The study of value and quality
(philosophy)
β Risk Probability of committing a type II error,
an example would be treating
assignable cause as common cause
variation.
BDR Benefit Delivery Report, an overview of
benefit that is provided in a standard
format for management review.
BI Intelligence gained from analysis of data
allowing understanding to be gained
about the data, such as trends, patterns
and cycles.
BIS Business Infrastructure Services
BSI British Standards Institute
Catalogue Catalogues hold records of specific
configuration item types; for example,
the Hardware catalogue holds
information specific to hardware devices
CI Configuration Item
CMDB Configuration Management Data Base
Completeness Denotes the extent to which data are
not missing and are of sufficient breadth
and depth for the task at hand. (Strong,
Yang, 2003)
Common cause variation Variation that is part of the process
under observation, such as data being
incorrectly entered into a system
because the input field control is not
set up to capture only the accurate
data required.
Constructivist A constructivist believes that there is a
semi-objective reality, they term this an
inter-subjective reality
CTQ Critical To Quality
Data Information about the devices
supporting the Information Technology
infrastructure
DB Data Base
DBA Data Base Administrator
Determinism (Human Nature) The deterministic approach dictates that
an individual has little or no choice in
their actions, instead their actions are
determined by the surrounding and
external factors and environment that
they exist in.
DMAIC A Six Sigma process improvement tool,
DMAIC stands for Define, Measure,
Analyse, Improve, and Control.
DPMO Defects Per Million Opportunity, a
measure of the rates of defect per
opportunity, a standard for a 6 sigma
process is 3.4 defects per million.
Epistemology Epistemology is a philosophical study
primarily based on the study of
knowledge and knowing
EU European Union
FVC Field Value Completeness – measures
whether the field has a value in it and is
correct and complete.
GITRM Global IT Risk Management
GMS Global Manufacturing and Supply
Grounded (tools and methods) Means the tools used in this thesis are
well proven in other areas (grounded in
experience empirically) e.g., Six Sigma.
GSA Global Services for Applications
GSK GlaxoSmithKline
Hardware Devices that typically host software, in
this thesis these are classified as
Servers(q.v.), Databases(q.v.)
Hypothesis A suggested idea, proposal or theory
that can be tested
Ho (Null Hypothesis) Null Hypothesis, the statement that two
things are equal or that there is no
relationship between them. See
Hypothesis (q.v.) test one in this
thesis for an example:
H0 (q.v.) there is no relationship
between the number of records owned
and the number of errors
H1 (q.v.) there is a relationship between
the number of records owned and the
number of errors.
H1 (Alternate Hypothesis) Alternative Hypothesis (q.v.), the
statement that two things are not equal.
Human Nature The relationship between human beings
and their environment
Ideographic (Methodology) The Ideographic approach is that for an
individual to fully understand something
they have to be part of it or live it.
Information Knowledge and understanding gained
from studying the IT processes and data
Information Technology The methods (q.v.) and processes used
to generate, modify, obtain, store and
transmit information using the
computing processes and devices such
as software (q.v.) and hardware (q.v.).
ITIL, Information Technology
Infrastructure Library
Information Technology Infrastructure
Library, a set of best practice guidelines
for managing IT infrastructure and
services.
ITMT IT Management Team
LIN Local Instruction, Documented
instructions for localized procedures sits
under the SOP (q.v.)
Lower Specification Limit, LSL A limit that is 3 standard deviations (3 σ)
below the centre line on a control chart.
Methods The tools used to gather the data and
make decisions about the data
Methodology The processes used for studying the
processes that control the data quality,
in this thesis elements of Positivist (q.v.)
and Phenomenological (q.v.)
methodologies are used.
MO Management Office
MSA Measurement Systems Analysis
MT Management Team
Nominalism (Ontology) A nominalist does not believe there is
any structure to the reality they exist in,
rather the reality exists in their mind,
and is the opposite view to realism.
Nomothetic (Methodology) The Nomothetic approach is structured
and controlled and employs the use of
standard tools and processes such as
questionnaires to gain an insight into a
situation.
Objectivist Believe that the world they live in exists
in itself, fully independent of them
Ontology Ontology is a philosophical study
primarily based on metaphysics. It is the
study of reality or being.
Ops Operations
P value (Probability value, Hypothesis
(q.v.) testing)
The P value is a probability, from 0 to 1,
used to reflect the strength of the
evidence being used to prove or
disprove the null hypothesis (q.v.).
Typically a threshold of 0.05 is used.
The use of the P value depends on the
type of test, but typically if it is <0.05
then the null is rejected (there is a
difference); if >0.05 then the null is not
rejected (no difference is shown); if
close to 0.05, gather more data and
re-test to be sure.
Example, section 5.7.4, Mood's median
test: the resultant P value was 0.025
and therefore one can be confident that
at least one of the samples has a
different median cycle time from the
others.
If the P value is more than α,
then Ho (q.v.) is not rejected
If the P value is less than α, then
Ho (q.v.) is rejected
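The decision rule described in this entry can be expressed directly in code; the function below is a minimal illustration (the name is mine, not from the thesis), using the P value of 0.025 from the Mood's median example above:

```python
def null_hypothesis_decision(p_value, alpha=0.05):
    """Compare a test's P value against alpha: below alpha the null
    hypothesis is rejected, otherwise it is not rejected."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# The Mood's median test example above reported P = 0.025:
print(null_hypothesis_decision(0.025))  # reject H0
```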
Paradigm Paradigm in terms of this thesis means
model.
Phenomenology Studying a situation from the viewpoint
of experiencing something first hand.
Positivist / Positivism (Epistemology) Studying a situation from the viewpoint
of measuring the outcomes of a
situation.
Power (of test in Hypothesis (q.v.)
testing)
Power is the probability of correctly
rejecting the null hypothesis (q.v.) when
it is indeed false, and is defined as
P=(1- β)
PwC Price Waterhouse Coopers
Qualitative Use of non-numerical measures,
typically used in the Social Sciences
and based on interviews, discussions
and experiences
Quantitative Use of quantifiable, numerical
measures, such as cycle time, number
correct
R&D Research and Development
Realism (Ontology) A realist believes that reality is universal
and external to the individual, who
exists within it; this is the opposite
viewpoint to nominalism.
Relevancy Denotes the extent to which data are
applicable and useful for the task at
hand(Strong, Yang, 2003)
RMS Remote Managed Services
RPN Risk Priority Number
SARBOX Sarbanes Oxley Act
SCS Systems Communication Services
SDCS Servers Data Centers and Storage
Server A type of hardware (q.v.) that hosts
software (q.v.) on a large hard drive
Six Sigma (6 Sigma) Process improvement
methodology (q.v.), named in part after
the +/- 3 standard deviations from a
centre line on a control chart, hence
the number of sigma (q.v.)
Sigma, σ Sigma is standard deviation, which is "a
measure of dispersion obtained by
extracting the square root of the mean
of the squared deviations of the
observed values from their mean in a
frequency distribution" (Collins, 1995).
In Six Sigma (q.v.) this is the number of
units of standard deviation from the
centre line of a control chart.
SME Subject Matter Expert, an individual that
is recognized as an expert in their field
SOP Standard Operating Procedure
Software Computer programs
SOX Sarbanes Oxley Act
Subjectivist Believe that there is no reality outside
the subject (human being) and, in the
extreme, that every subject has its own
image of reality.
TAH Traditional Application Hosting
Tampering Treating common cause variation as
assignable cause.
Timeliness Denotes the extent to which the data
are sufficiently up-to-date for the task at
hand (Strong, Yang, 2003)
Type I error (Hypothesis (q.v.) testing) Rejecting the null hypothesis (q.v.)
when in fact it was actually true, is a
type I error.
Type II error (Hypothesis (q.v.) testing) Not rejecting the null hypothesis (q.v.)
when in fact it was actually false, is a
type II error.
Upper Specification Limit, USL A limit that is 3 standard deviations (3 σ)
above the centre line on a control chart.
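The LSL and USL entries above describe limits three standard deviations either side of the centre line. For an individuals chart of the kind used in the case studies' before-and-after control charts, sigma is conventionally estimated from the mean moving range; the sketch below is a generic illustration of that arithmetic, not code from the thesis, and the sample counts are invented:

```python
from statistics import mean

def individuals_control_limits(xs):
    """Centre line and +/- 3 sigma limits for an individuals (I) chart.
    Sigma is estimated as the mean moving range / d2, with d2 = 1.128
    for moving ranges of two consecutive points."""
    centre = mean(xs)
    mr_bar = mean(abs(b - a) for a, b in zip(xs, xs[1:]))
    sigma_hat = mr_bar / 1.128
    return centre - 3 * sigma_hat, centre, centre + 3 * sigma_hat

# Hypothetical monthly error counts; the last point simulates an
# assignable-cause spike that falls outside the limits.
counts = [12, 15, 11, 14, 13, 16, 30]
lcl, cl, ucl = individuals_control_limits(counts[:-1])
out_of_control = [x for x in counts if not lcl <= x <= ucl]
```

Points flagged this way correspond to the first Western Electric rule (a single point beyond three sigma), the simplest of the zone tests referenced in Figure A3.1.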
Usability The data must be fit for purpose; for
example, the server type recorded is the
actual model and not an approximation.
Value The value of a field, in case study one
some fields were more critical than
others
VM Virtual Machine
Voluntarism (Human Nature) Based on the individual having free
rein to make voluntary decisions
1 Introduction
"Everything can always be done better than it is being done."
Henry Ford, 1922
This thesis is about Data Quality. It also demonstrates that large savings can be
made using the methods outlined in the thesis and the case studies within. Some
of the work in this thesis was previously presented at the Leeds Metropolitan
University Innovation North Research conferences in 2005 and 2006. The main
contributions from the thesis are reviewed at length in Chapter 6, Conclusions,
and there is a brief overview in the abstract of this thesis.
Although there are many definitions of Data Quality, for the purposes of this thesis
Data Quality is the Accuracy, Usability, Value and information relating to attributes
of configuration items (CIs) stored in a Configuration Management Database
(CMDB). According to the Information Technology Infrastructure Library (ITIL) V3
Foundation Handbook, Office of Government Commerce (2008, p.88), "the
Configuration Management Data Base (CMDB) stores configuration records
containing attributes of CIs and their relationships." The ITIL guidelines represent
best practice in the management of IT services and provide a framework for
how to arrange IT services to gain the best performance. According to the British
Standards Institution (BSI), (BS EN ISO 10007:1997), a CI is an:
"aggregation of hardware, software, processed materials, services, or any of its
discrete portions, that is designated for configuration management and treated as a
single entity in the configuration management process".
The CMDB has a number of catalogues containing all the data in the form of CIs.
This thesis is concerned with the Data Quality for the Information Technology
Department of a global Pharmaceutical company. The work demonstrates that
there are gaps in resolving Data Quality, and these gaps have been addressed
using a combination of Phenomenological and Positivist methodologies, based on
ITIL1 and the Six Sigma2 process improvement methodology.
There are currently a number of vendors selling products to provide Business
Intelligence (BI) to senior managers based on stored data. The problem is that if
the supporting data is not accurate then the BI may compromise any strategy
decisions based on it. This problem was recognized by a leading analyst of IT
market trends, the Butler Group (2004, p.95), in a discussion of the data quality
software market: 'This market was only too happy to sell customers software to
allow them to "make more effective decisions" without really taking any
responsibility for the quality of the data'. This is an important point and needs to
be noted when reviewing any off the shelf solutions.
Data quality software can also be expensive, and this thesis offers a solution to
data quality resolution using a set of grounded theories (tools) and
methodologies that are inexpensive to implement.
The importance of information and the data that supports it was recognized by
Johnson, Scholes and Whittington (2006, p.459), where the competitiveness of a
company was recognized as being in part dependent on the data it held:
'Data mining is the process of finding trends, patterns and connections in the data
in order to inform and improve competitive performance'.
1 http://www.itil-officialsite.com/home/home.asp
2 http://www.isixsigma.com/
As can be seen from the above articles, it is important for accurate data to be
available to senior managers. In order for strategy planning to be effective,
management must have access to data that allows trends and patterns to be
recognized: "Competence in information management is not enough", and also,
"Most organizations now have colossal amounts of raw data about these issues
and the IT processing capacity to analyze the data (both of which are necessary).
But they are not good at this data-mining process, which will convert data into
market data" (Johnson, Scholes and Whittington, 2006, p.459).
It is understood that accurate and organized data and information lead to
knowledge. This is illustrated in the figure below from the ITIL V3 Foundation
Handbook, Office of Government Commerce (2008, p.95), which is a very
simplified illustration of the relationship of the three levels of data and how they
support the business in making informed decisions.
[Figure: the Configuration Management Databases feed the Configuration
Management System, which in turn feeds the Service Knowledge Management
System, supporting business Decisions.]
Figure 1 Relationship of levels of data and how they support the business.
It follows that having accurate knowledge leads to making sound business
decisions; therefore accurate data must be a priority in business planning.
Business Intelligence is a term that refers to the gathering and manipulation of
data to be used in decision-making analyses. As Francis Bacon stated in
Meditationes Sacrae, Bacon (1597), 'For also knowledge itself is power', and in
business knowledge is primarily comprised of business intelligence, which
invariably relies on supporting data. It is this "supporting data" that this thesis is
concerned with. Data is described as a "series of observations, measurements, or
facts; information", Collins Concise Dictionary (1995).
A company will use data for different decision-making tasks, from improving
manufacturing data to strategic planning (data from sales forecasts, for example)
to financial planning (data on machinery longevity and replacement costs). There
are a number of process improvement methodologies available, for instance Total
Quality Management (TQM)3, the Capability Maturity Model (CMM)4 and ISO
90005. Each of these offers a different focus on process improvement. TQM is
summarized in The Encyclopædia Britannica (2010) as follows:
"Management practices designed to improve the performance of organizational
processes in business and industry. Based on concepts developed by statistician
and management theorist W. Edwards Deming, TQM includes techniques for
achieving efficiency, solving problems, imposing standardization and statistical
control, and regulating design, housekeeping, and other aspects of business or
production processes."
3 http://www.britannica.com/EBchecked/topic/1387320/Total-Quality-Management
4 http://www.sei.cmu.edu/cmmi/
5 http://www.iso.org/iso/iso_catalogue/management_standards.htm
TQM focuses on meeting given company standards and tolerances, whereas
CMM is focused on software quality and development, and ISO 9000 is a set of
standards for ensuring that quality is met. Although these are beneficial in their
own right, the case studies in this thesis are more concerned with reducing
defects and so are more focused on one of the most prevalent methodologies in
process improvement, Six Sigma. Pyzdek (2003, p.3) states that "Six Sigma is a
rigorous, focused and highly effective implementation of proven quality principles
and techniques." This is ideal for problems such as those studied in this thesis.
As data is the cornerstone of many important decisions, it is surprising that the
cataloguing and control of data and its properties is not given the attention it
deserves. "Many important corporate initiatives, such as Business-to-Business
commerce, integrated Supply Chain Management, and Enterprise Resource
Planning are at risk of failure unless data quality is seriously considered and
improved." (Wang, Ziad and Lee, 2001, p.1).
Data is used for many purposes and as such it needs to be understood but also
controlled and maintained. In the Open University article "Data Protection Part
One", senior lecturer Eby (2003, p.32) explains that "Data in manual systems have
to be changed under certain circumstances, and in certain ways, to retain
documented records of who made what changes, why and when".
In the year 2000 an analysis by the U.S. Government Accountability Office found
that "... the fiscal year 2000 plans failed to include discussions of strategies to
address known data limitations. We reported that when performance data are
unavailable or of low quality, a performance plan would be more useful to decision
makers if it briefly discussed how the agency plans to deal with such limitations.
Without such a discussion, decision makers will have difficulty determining the
implications of unavailable or low-quality data for assessing the subsequent
achievement of performance goals that agencies include in their performance
reports" (U.S. Government Accountability Office, 2000, p.3).
Therefore it can be seen that data really does have a vital role to play in business,
yet in the author's own experience it is seen as a low priority in terms of
investment in tools and strategies to improve it.
Even in professional organizations poor data quality has been found to cause
problems. For instance, the ASAE (American Society of Association Executives)
describes how the lack of data quality affects such items as complete member
records by applying quality control targets: "A high undeliverable rate with mail,
fax or email communication is the first indication that you may have a problem
with your data" (Association Management, 2003, p.1).
The previous examples are just a few illustrations of where poor data quality
affects many different types of business in different ways, from strategy planning
to membership lists.
This thesis seeks to prove that data quality can be improved by applying a blend of
the Positivist and Phenomenological paradigms and their associated
methodologies and methods. In this thesis 'methodologies' means 'the system of
methods and principles used in a particular discipline', Collins Concise Dictionary
(1995). 'Methods' means 'the techniques or arrangement of work for a particular
field or subject', Collins Concise Dictionary (1995). Following investigations and literature
searches, it has been determined that companies across varying industries have
difficulty in improving and maintaining the quality, and hence value, of the data that
they collect and own. The data does not always provide a credible or accurate
representation of the physical environment. As a result of the literature search
and the author's own research, it would appear that there is a gap in the research
carried out at Doctoral level for Data Quality in Asset Inventories in global
pharmaceutical companies. In this study two case studies are run in an IT
department within a global Pharmaceutical company. The data is stored in a
Configuration Management Data Base.
1.1 Research Overview
This thesis is concerned with the quality of data associated with a department
within GlaxoSmithKline (GSK) Information Technology (IT) group, namely
Systems and Communication Services (SCS) managed Infrastructure (Hardware &
Database) registered in the CMDB (Configuration Management Data Base).
A CMDB is a database for storing the supported infrastructure configuration and
changes to the infrastructure. The CMDB used in this thesis stores information for
configuration items ranging from software, databases, and servers to network
devices. There were two case studies carried out, one regarding the quality of data
contained within device records registered in an asset inventory.
The second case study concerned data quality and how this affects financial data
audits, specifically in terms of the Sarbanes-Oxley (SARBOX) legislation – see
section 1.4.4 for a further explanation of the Sarbanes-Oxley regulations. In
both cases the data was analyzed and found to be of a low quality.
The analysis considered many aspects of the data, such as completeness,
accessibility, accuracy and timeliness. A baseline was taken at the beginning,
improvements were then agreed, monthly reporting commenced and finally an
improvement was gained and reported to senior management.
Both case studies made use of the Six Sigma methodology (see the earlier note
on Pyzdek (2003) in this chapter), and this surfaced underlying issues not always
clear to the casual observer.
Note: between them, the two case studies covered in this thesis delivered five
benefits. These benefits were in the form of productivity savings, cycle time
reductions, improved and new processes for dealing with data, cost avoidance and
finally a marked increase in data quality.
The business impacts delivered by the five benefits are transferable6. The benefits
delivered by the case studies are outlined below in Table 1. The improvements
were monitored and quantified using control charts and statistical analysis such as
hypothesis testing. All of these techniques are reviewed in case studies one and
two.

6 More details on the benefits of using the transferable methods covered in this thesis will be submitted as a
publication to the Journal of Business Information Systems.

Improvement: Increases in Data Quality
Benefit delivered: (Case Study 1) 33% improvement in the data quality for
Infrastructure components in the Application Registry, from 65% to 98%.
Reduced linkage defects from a 25% error rate to a 0% error rate; implemented
in-process controls to eliminate the root cause. See Case Study One,
Introduction.

Improvement: Cost avoidance resulting from improved data quality
Benefit delivered: (Case Study 1) 233 servers were identified as temporarily
allocated. As a result of this exercise, 126 servers were identified as unused by
the Business and subsequently reclaimed. The 126 Shared Service servers
reclaimed from the "temporary" server pool resulted in a cost avoidance of
£1,144,410. See Case Study One, Section 4.9.

Improvement: Report Automation
Benefit delivered: (Case Study 1) 98.3% reduction in the time to produce monthly
reports for Chameleon data, from 15 hours per month to 0.2 hours per month.
See Case Study One conclusions, answer to research question 4.

Improvement: Productivity savings
Benefit delivered: (Case Study 2) Six Sigma saved 83 person days (Field Value
Completeness); 54 person days were saved by automation of a manual report.
See Case Study Two conclusions, answer to research question 4.

Improvement: Cycle time reduction
Benefit delivered: (Case Study 2) 57% reduction of cycle time. See Case Study
Two conclusions, answer to research question 4.

Improvement: New and improved processes for dealing with data
Benefit delivered: (Case Study 2) Root cause elimination for SARBOX
Applications infrastructure audits: 100% improvement in Application Intelligence
accuracy for SARBOX components listed in Chameleon (zero records confirmed
as accurate prior to the process; 114 of 114 applications confirmed accurate at
case study completion), which equates to 798 of 798 opportunities (application
attributes) confirmed correct. 25 out of 798 Application Configuration
Management errors were discovered, investigated and fixed. 7 applications that
had been missed off the Master Schedule of SARBOX Applications were added
as a result of this case study. Raised the level of awareness across the
company: service owners were made aware of who supports their applications
and how many they own. See Case Study Two conclusions, answer to research
question 4.

Table 1 The business impacts delivered by the five benefits
1.2 Aim
1.3 Overall aim
The aim of this research is to design and then empirically test a novel set of
frameworks, tools and processes to improve the quality, and hence, credibility of
the data used to aid strategic planning, cost analysis and general decision making.
This thesis examines the approaches available in terms of research methodologies, in combination with the available process improvement methodologies, methods and tools. In addition, their applicability to the two scenarios represented in case studies one and two, and to the more general case, is considered.
1.4 Overall Objectives
1. Research which tools and processes are being used to drive data
quality improvements in companies reliant on data, and evaluate their
suitability.
2. Determine what impact these improvements had on the quality and
effectiveness of the data.
3. Design a framework based on a blend of positivist and
phenomenological paradigms; apply and empirically test a set of
grounded tools and processes to improve and maintain data quality,
making the data more effective and accurate.
4. Measure the improvement of data quality; correlate against case
study 2.
5. Design, apply and empirically test an audit process to check that the
data accurately represents the physical infrastructure.
1.4.1 Aim for case study 1
Case study one deals with the quality of data contained within device records
registered in an asset inventory. The aim is to study the data, understand the baseline data quality, improve the data entry processes and improve the data quality to a target agreed with management: from 65% to 97% (or greater) of key fields completed, against an agreed set of measures. The process is to be measured on an ongoing basis and the data quality level maintained and controlled at 97% or greater. A demonstrable positive business impact to the company must also be achieved.
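The target above is expressed as a percentage of completed key fields across records. As a rough sketch of how such a measure can be computed (the field names and sample records below are invented for illustration and are not the actual CMDB schema):

```python
# Sketch: measure key-field completeness across asset records.
# The field names and sample records are invented, not the actual schema.
KEY_FIELDS = ["owner", "site", "os_version", "support_group"]

def completeness(records, fields=KEY_FIELDS):
    """Percentage of key fields that are populated across all records."""
    filled = sum(1 for r in records for f in fields if r.get(f))
    total = len(records) * len(fields)
    return 100.0 * filled / total if total else 0.0

records = [
    {"owner": "A. Smith", "site": "Leeds", "os_version": "", "support_group": "Wintel"},
    {"owner": "", "site": "London", "os_version": "2008", "support_group": ""},
]
print(f"{completeness(records):.1f}% of key fields completed")  # 62.5%
```

The same function applied to a baseline extract and a post-improvement extract gives the before/after percentages against which a 97% target can be tracked.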
1.4.2 Research questions case study 1 CMDB Data Quality
1. How can data quality be accurately and effectively measured?
2. How can the quality of data be improved by applying tools and
processes that are based on a set of grounded tools? (Six Sigma and
the IT Infrastructure Library)
3. To ensure quality in the data, which data should be collected, and how?
4. How effectively does the resulting model of the infrastructure reflect the
physical infrastructure?
1.4.3 Aim for case study 2
The second case study is concerned with data quality and how it affects financial data audits in terms of specific regulations, in this case adherence to the Sarbanes-Oxley (SARBOX) regulations.7
1.4.4 Sarbanes Oxley (SARBOX) background
The SARBOX regulations are intended to mandate financial accountability for companies that are publicly traded. The SARBOX act's official title is 'The Public Company Accounting Reform and Investor Protection Act of 2002'. The act consists of eleven sections, ranging from the Public Company Accounting Oversight Board to Corporate Fraud and Accountability. This case study deals with Section 404, 'Management Assessment of Internal Controls'.
7 For more information on the SARBOX act, refer to the US Securities and Exchange Commission website: http://www.sec.gov/about/laws.shtml#sox2002
This piece of legislation was created to ensure that companies can provide an assessment of their ability to control their financial reporting. It is a requirement that the report is audited by external auditors; in the case of GSK this is currently PricewaterhouseCoopers. To meet these demands, the processes supporting the compliance and audit need to be robust.
One of the main drivers in compliance with the SARBANES OXLEY regulations is the requirement for companies to be more transparent. A finding against any company in terms of misstatement of financial reporting will surface inadequate controls and can also negatively impact both a company's share price and its reputation.
GSK compliance with the Sarbanes Oxley Regulations requires IT infrastructure to
be linked to applications which support financial processes. The IT Infrastructure is
stored in a centralised CMDB which is the source of data that is used to identify,
test and manage the entire GSK IT Infrastructure.
The accuracy of the data in the CMDB repository is critical to ensuring an accurate
link between the infrastructure and those applications within scope for SARBOX.
The initial target for SARBOX was 97% or greater linkage correctness at the time of submission of results to auditors.
As a result of the processes designed and implemented in case study two, this was improved to 100% at the end of each cycle. Therefore, when the snapshot of data and linkages is taken, the data will be 100% correct from the time of audit to the time of submission. The importance of 97% or above, in terms of costs and resources, is that if GSK gets the data right in the first place then costly rework and audit findings can be avoided. If the supporting infrastructure is not accurately identified then GSK cannot demonstrate control
over the financial reporting process. Three times a year an audit is performed to
verify that the correct applications and linkages are in place.
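A linkage check of this kind can be sketched as follows. The data model here is a deliberately simplified, hypothetical one: the application names, and the rule that each in-scope application must be linked to at least one server and one database, are illustrative assumptions rather than the actual audit procedure.

```python
# Sketch: check that each in-scope financial application in a CMDB extract
# is linked to at least one server and one database. Hypothetical data model.
applications = {
    "GL-Reporting": {"servers": ["srv01"], "databases": ["findb1"]},
    "AP-Payments":  {"servers": [],        "databases": ["findb2"]},
}

def audit_linkages(apps):
    """Return (percentage of fully linked applications, failing names)."""
    failures = [name for name, links in sorted(apps.items())
                if not links["servers"] or not links["databases"]]
    pct = 100.0 * (len(apps) - len(failures)) / len(apps)
    return pct, failures

pct, failures = audit_linkages(applications)
print(f"{pct:.0f}% of applications fully linked; to investigate: {failures}")
```

Run against each audit cycle's extract, the failing list becomes the work queue for investigation, and the percentage is the linkage-correctness figure reported above.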
The aim of case study two was to improve the credibility of the CMDB record checks and verification in terms of an external audit carried out on GSK's financial reporting applications and their supporting underlying infrastructure (servers and databases). The main criterion for success was to complete the reviews, prior to audit, within three months and to reduce the number of issues found by the auditors during their investigations.
1.4.5 Research Questions – case study 2 SARBOX
1. How can the representation of the "as built" environment modeled
within the CMDB, that is to say the infrastructure supporting financial
applications, be accurately and effectively measured for SARBOX
audits?
2. How can the quality of stored data in terms of accuracy and
completeness of linkages between software, server and database, for
SARBOX audit requirements, be improved by applying tools and
processes that are based on a set of grounded tools? (Six Sigma and
the IT Infrastructure Library)
3. To ensure quality in the record checks and verification and resulting
audit success, which data should be collected, and how?
4. How effectively do the resulting processes, checks and control
mechanisms meet the requirements of achieving a positive audit
result, specifically for SARBOX audits?
1.5 Background
GSK is a global manufacturing concern with 99,913 employees according to the 2009 annual report; it has "its corporate head office in London and has its US headquarters in Research Triangle Park with operations in some 120 countries, and products sold in over 150 countries" (GSK annual report, 2009). The IT infrastructure to support this organisation is a large and complex system. Within IT there are many different technologies deployed, and these all need to be able to interact with each other and be registered in some form, so that the company knows what IT equipment and software it has, where it is and how up to date it is.
The ultimate aim of this investigation into data quality is to deliver a real return on
investment by creating and maintaining an accurate inventory of the GSK IT
infrastructure and supported applications. This will be achieved by reducing the cycle time to complete operations, enabling productivity savings and ultimately increasing the data quality so that re-work is not required in the future.
1.6 What is data quality?
Firstly, a definition of high data quality can be found in the proceedings of the Association for Computing Machinery: Strong, Lee and Wang (1997, p.103) tell us that "High data quality means data that are fit for use by data consumers".
There are a number of dimensions of data quality; according to Strong and Yang (2003), five of these are Relevancy, Accuracy, Timeliness, Completeness and Accessibility.
So 'data quality', or more specifically 'data of a high quality', is data that meets various criteria to satisfy the demands of the data collectors, custodians, consumers and regulators.
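As an illustration of how dimensions such as these can be turned into measurable checks (the field names, dates and the 90-day timeliness window below are invented for the sketch, not criteria taken from the case studies):

```python
from datetime import date, timedelta

# Sketch: score one record against two of the dimensions named above.
# Field names, dates and the 90-day window are invented for illustration.
def completeness_score(record, required_fields):
    """Fraction of required fields that are populated."""
    return sum(1 for f in required_fields if record.get(f)) / len(required_fields)

def is_timely(record, max_age_days=90, today=date(2010, 8, 1)):
    """True if the record was re-verified within the allowed window."""
    return (today - record["last_verified"]) <= timedelta(days=max_age_days)

record = {"owner": "J. Bloggs", "site": "", "last_verified": date(2010, 7, 1)}
print(completeness_score(record, ["owner", "site"]), is_timely(record))  # 0.5 True
```

Each dimension becomes a separate, auditable check, which is what allows quality to be reported per dimension rather than as a single opaque score.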
1.7 Why are organizations such as GSK concerned about
data quality?
Pharmaceutical organizations are bound by regulatory compliance. Increasing demands are being placed on all companies to manage, in a compliant way, the infrastructure that holds their data and fixed content (content that is not to be changed). Not only the data but also the infrastructure that supports the data has to meet the appropriate regulatory compliance. There are agencies that currently monitor and enforce compliance, such as the Medicines and Healthcare products Regulatory Agency (MHRA)8 and the Food and Drug Administration (FDA)9. Legislation is also in place to ensure compliance, such as the Sarbanes-Oxley Act (SOX)10 and the Freedom of Information (FOI) Act11.

8 http://www.mhra.gov.uk/index.htm
9 http://www.fda.gov/
10 http://www.sec.gov/about/laws.shtml#sox2002
11 http://www.ico.gov.uk/what_we_cover/freedom_of_information.aspx

The data is stored in a database known as a Configuration Management Database (CMDB), and this CMDB acts as a central registry. Until recently each business unit had its own CMDB, but this led to a lack of interface between business units. Additionally, many business units used Chameleon without
following guidelines and without clear and enforced standards in place, leading to poor-quality records and inconsistent data. The hope was that GSK would have a single source of information pertaining to all details relevant to the IT landscape.
The lack of interface led to delays in strategy planning and even to vital information being missed when planning downtime. For example, a piece of hardware may be taken out of service for two hours for an upgrade; however, the owner of the application software that runs on the server may not be aware of the downtime, and this situation would lead to an outage for the application users.
The CMDB is the registry for the whole IT infrastructure and this infrastructure
supports many global business units such as Corporate IT and also global
manufacturing and supply.
The importance of data quality can be seen when it affects processes that rely on it. For example, a server build (i.e. the storage and memory configuration a server uses) requires accurate ownership, site and software version data to be available; if this data is found to be incorrect it can affect processes such as security patch releases, as a patch may not be applied to a server that is not registered correctly in the CMDB.
Table 2 below provides a high-level overview of the metrics corresponding to the CMDB. As can be seen, there is a large volume of modifications to components, and these are controlled by a change control process. The number of users varies; at the time of the project it was 3,000, comprising component owners, service managers and business owners.

Number of component changes: 34,000 per annum
Number of change requests: 11,000 per annum
Current number of users: 3,000 users
Table 2 Volume of data on users and transactions within the CMDB
In summary, the execution of this research falls into the following stages:
1. Define the state of knowledge, issues and resolution in the area of data
quality.
2. Identify the areas of most impact that would benefit from this research in
GSK - i.e. the current projects that are higher priority in terms of business
needs.
3. Identify best practice by gathering information on the data quality world by
searching available literature, related theses and journals.
4. Carry out two distinct pilot case studies (based on stages one and two
above) to determine further factors affecting data quality which were not
identified in the literature search. An example of such a factor is the
lurking variable12, which according to Pyzdek (2003, p.3) "is often hidden in
a correlation between values (for example x and y) that appears perfect".
5. Another such factor is often referred to as the hidden plant, which Hoerl and
Snee (2002, p.72) sum up as "the system of non value added work
and complexity in our businesses".
6. Identify the areas of concern from the results of the two distinct pilot studies.
7. Implement improvements to the processes associated with the pilot studies
and re-run the case studies to test and confirm or disprove that the chosen
methods of improvement worked.
8. Identify the critical success factors from stage six and document them.
Structure of Thesis:
This thesis consists of 6 chapters.
Chapter one
Introduces the issues and sets the scene for this research including its purpose,
questions and approach.
12 From Pyzdek: "for example, both x and y might be determined by a third variable, z. In such situations, z is described as a lurking variable which 'hides' in the background, unknown to the experimenter."
Chapter two
Details the literature searches undertaken for both pilot and full studies and
summarises the research questions, framework and methodology resulting from
the literature search.
Chapter three
Examines the research methodologies, considering what models and approaches exist, and justifies the approach taken in the case studies, which is mainly of a positivist nature with some phenomenological methods.
Chapter Four
Chapter four describes the results and conclusions from case study 1. Included are
the cycle time and data quality results from the approach and models proposed
and then used. Chapter four also includes extensive reference to the techniques of
Six Sigma and statistical analysis.
Chapter Five
Describes the results and conclusions from case study 2. Included in this chapter
are improvements established plus outcomes of the work in terms of the research
approach and impact on the business. Chapter five also includes extensive
reference to the techniques of Six Sigma and statistical analysis.
Chapter Six
Presents the conclusions and answers the research questions posed in chapter one. The novel solution of this thesis is reviewed along with its contribution to the pool of knowledge. Finally, opportunities for further work in this area are identified.
Conclusion of Chapter One:
The purpose of chapter one was to outline and explain the purpose of this thesis; it introduced the research questions and the approach taken to resolve them. It also discussed the company at which the studies were carried out and explained the background to the case studies.
Introduction to the following Chapter.
The following chapter reviews the literature concerned with data quality and integrity, and also SARBOX-related data quality. The literature review was extensive and aimed to cover aspects of data quality relevant to both of the case studies in this thesis. The literature review surfaces the fact that no prior work was identified in the pool of knowledge on data quality in a global pharmaceutical IT infrastructure department.
45. Literature Review Chapter Two
24
2 Literature Review
"Workers work within a system that – try as they might – is beyond their control. It is the system, not the individual skills, that determines how they perform"
Dr W Edwards Deming, 1986
2.1 Introduction.
The first chapter in this thesis introduced the research questions and objectives
regarding data quality and how these were to be addressed as two case studies.
The purpose of this chapter is to review the current knowledge that is concerned
with data quality and highlight the relevant areas (i.e. themes and trends) in terms
of how data quality affects businesses in general and how it affects the IT
department at the company where the two case studies are carried out. Case
Study one was specifically concerned with registering components in the asset
inventory and planning for case studies using information in this register.
Literature review - Summary
When this thesis was started the main concern was IT configuration data, and as such the literature search commenced based on these terms. However, as the paper search continued it became clear that data quality principles and solutions were being applied to a wide range of data quality problems, not just IT configuration data.
The question was asked: 'could some of the previous work be continued or modified in some way to be of use in this case study?' It was apparent that data
quality related to IT has been covered in a number of papers. For instance, it has been recognized that there is a great deal of complexity in controlling data: Tayi and Ballou (1998, p.56) determined that although data quality had been addressed previously there was still a long way to go, and that "The problem of ensuring data quality is exacerbated by the multiplicity of potential problems with data". Although this paper made the point about potential problems, it was of limited use as it did not clearly state the problem statement or the future work that could address it, thus leaving a gap in knowledge in terms of solutions.
The importance of data quality (and this is still a concern of the author of this
thesis) is addressed by Redman (1998). Although Redman‘s paper was published
in 1998 the fact that this thesis was possible means that data quality is still not
given the consideration it deserves, as problems are still occurring in the
management and maintenance of data in the author‘s experience.
Redman states that "The softer impacts, including lower morale, organisational mistrust, difficulties in aligning the enterprise, and issues of ownership, may be even worse" (Redman, 1998, p.82). One of the drivers for both of the case studies is the lack of data quality in a large organization such as GSK; this thesis shows how that lack has been effectively addressed.
Further to the work carried out in the two case studies contained within this thesis,
it can be seen that Redman‘s concerns have been addressed in the IT department
of GSK.
In addition to the ideas of Redman are those of Wang (1998, p.58), who states that "To increase productivity organisations must manage information as they manage products".
This is borne out in the case studies: there is a great deal of wastage, in terms of hours lost and avoidable outages, due to poor data quality. For example, if the contact details for a server owner are out of date in the database, there could be a delay in reinstating that server if it goes out of service. The principles that Wang discusses in his paper can therefore be applied to the work described in this thesis, as there is clearly a need to manage the information in a more controlled manner.
In their paper, Gustafson and Koff (2005) describe how data is everywhere and affects us all in many ways. Although this is a more generalised paper it is important, as it reminds us that data quality matters in all systems and societies for many different reasons.
Another important paper discussed how data must be of good quality and how this
has been addressed. The U.S. Government Accountability Office (2005, p.2) states that "To be useful to both federal and state decision-makers, crash data must be complete, timely, accurate, and collected in a consistent manner." In this case the ideas postulated in the section "Data Quality Problems often reflect difficulties in collection and processing of crash reports" are in agreement with, and can be applied to, this area of research.
Therefore the review of 172 papers, articles and theses regarding data quality seeks to demonstrate that the area of work in this case study is new and in some cases continues from work already achieved, but that the case studies in this thesis are carried out in the environment of a global pharmaceutical IT department rather than a generalized context such as accounting or government offices. It is important to remember that the case studies and tools in this thesis can be used to tackle data quality problems outside a global pharmaceutical IT organization. This is a key point: the techniques presented in this thesis are transferrable to any process operations that are required to be repeatable and reproducible, and this is a big advantage of the work presented here.
The literature search was extensive (covering 172 texts), consisting of academic papers, journal articles and theses, and proved very informative in terms of information to support both case studies.
In the data quality area the papers cover a wide range of topics from air quality
through to demographic data. As for SARBOX, the majority of articles were concentrated on legal themes or on accountancy firms looking to improve their own processes to meet the SARBOX regulations; see section 1.4.4 in chapter one for more explanation of SARBOX.
Therefore the main themes in the SARBOX domain were legal and accountancy.
This was interesting as in the IT Department of GSK the SARBOX IT audits are an
important part of compliance to the overall SARBOX requirements and yet there
was a gap in the literature for this area.
In order to ensure that the main sources of data quality and SARBOX papers have been covered in this thesis, a wide range of sources was used. Links to the more important databases are included below; many archives were searched, including but not limited to the following:
The British Library EThOS DB13
The Journal of Management Information Systems14
The library of the Association for Computing Machinery15
Leeds Metropolitan University – online library plus ATHENS and EBSCO
The Gartner Group
The Forrester Group
The Leading Edge Forum
The United States (US) Government Accountability Office
Datamonitor Group (now known as Ovum)
ITIL (Information Technology Infrastructure Library)
British Standards Institute
The MIT Total Data Quality Management Program16

13 http://ethos.bl.uk/Home.do
14 http://www.jmis-web.org/
15 http://www.acm.org/
16 http://web.mit.edu/tdqm/www/index.shtml

A review of the number of documents and articles, plus the useful information gained from each database, demonstrated that the most useful source of information on data quality research was the Association for Computing Machinery (ACM) Digital Library. This was what the author expected to see, as the topics reviewed in this thesis are very closely linked to computing processes. The ATHENS service, available via the university's online library, was also indispensable as it offered access to other journals not contained in the ACM Digital Library. Finally, the British Library EThOS DB was very useful in terms of reviews of related theses.
As it is not practicable to review all the papers in this thesis, a discussion follows covering a selection of the most relevant papers. The discussion is the author's impression of the articles and of how they form the basis of the current trends in terms of data quality.
2.1.1 Summary of Six Sigma and SARBOX
The two case studies in this thesis use the tools from the process improvement
methodology of Six Sigma, which is based on investigating data to drive
improvements. The use of Six Sigma is intended to identify variations in repeatable
processes and allow the reduction of any variations that are leading to cost or
inefficiencies. Six Sigma uses problem solving and statistical tools to help surface
hidden causes of errors. For example brainstorming is used to define exactly what
the customer wants and how to meet those requirements. Agreements are drawn
up between the project teams and customers (such as charters and contracts). The
process is studied in detail and often improvements are identified by methods such
as hypothesis tests and improvements implemented. Once the improvements are
proven and returning to the value desired by the customer the final stage is for
control mechanisms to be implemented to maintain the process ongoing or identify
any deviations from targets that might occur over time.
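Six Sigma commonly summarises such error counts as defects per million opportunities (DPMO) and a corresponding sigma level. The thesis does not state its sigma arithmetic at this point; the sketch below applies the textbook formulas, for illustration only, to the 25 errors in 798 opportunities reported for case study two, using the conventional 1.5-sigma shift.

```python
from statistics import NormalDist

def dpmo(defects, opportunities):
    """Defects per million opportunities (a standard Six Sigma metric)."""
    return 1_000_000 * defects / opportunities

def sigma_level(dpmo_value, shift=1.5):
    """Long-term DPMO converted to a short-term sigma level using the
    conventional 1.5-sigma shift."""
    return NormalDist().inv_cdf(1 - dpmo_value / 1_000_000) + shift

# Figures reported for case study two: 25 errors in 798 opportunities.
d = dpmo(defects=25, opportunities=798)
print(f"DPMO = {d:.0f}, sigma level = {sigma_level(d):.2f}")
```

Expressing defect counts this way lets processes with very different opportunity counts be compared on a single scale.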
Further discussion of Six Sigma and its background can be found in sections 2.10 and 2.11.
The SARBANES-OXLEY regulations require the company to demonstrate that it has the necessary financial controls in place. One aspect of this is the ability to verify the hardware and databases that host the applications that handle the financial transactions. The literature review of SARBOX proved insightful and demonstrated that there was also a gap in the literature on the IT hardware compliance aspect of SARBOX.
2.2 Overview of Chapter Two
The diagram below, on the left-hand side, provides an overview of how the mass of information was assessed and then funneled into relevant data for review. It starts with the inputs being the available information on data quality and ends with the outputs being the two literature review sections, one for each of case studies one and two. The diagram on the right-hand side provides the chapter headings and illustrates the flow of the process and the direction taken by the literature review chapter.
(Figure 2 comprises two diagrams. The left-hand funnel takes as inputs the available literature on data quality – covering topics such as air quality, accounting, road safety, Six Sigma, process management, IT systems, census data, quality in education and manufacturing – and narrows it through categorisation of the information (2.4), sorting by significance and value added (2.5), narrowing the focus (2.6), identifying the gaps (2.7) and where value can be added (2.8) into the literature pertinent to projects one and two; the output is the additional knowledge added to the pool in the form of this thesis. The right-hand diagram lists the chapter headings, from 2.3 Data Quality World through 2.9 Specific to case studies One and Two and 2.10 Six Sigma – a discussion, to 2.11 Summary of Chapter Two.)
Figure 2 Literature Review
2.3 Data Quality World
As can be imagined, the volume of information regarding data and its quality is large and is apparent in a wide range of sources. The study of the available literature reveals that data about all manner of things is collected, collated and stored. The data discussed in the articles is typically used to monitor, track and draw conclusions on topics as diverse as population trends17, air quality18, traffic statistics19, manufacturing20, database performance21 and even soil quality22; for more information on these papers refer to the footnotes below.
The literature search demonstrated that there has been an increase in studies of data quality over time (see Figure 3 below in this chapter). In terms of this particular study the peak of articles is in 2004 and 2005, and this makes sense: as the knowledge pool grows, problems are solved and the practice of data quality evolves. Currently the data quality articles are more concerned with data warehousing, automation and reporting from federated databases.

17 Significant Problems of Critical Automation Program Contribute to Risks Facing 2010 Census, GAO
18 System demonstrations (a): Significance information for translation: air quality data integration, Andrew Philpot, Patrick Pantel, Eduard Hovy
19 Quality Guidelines Generally Followed for Police-Public Contact Surveys, but Opportunities Exist to Help Assure Agency Independence, GAO
20 Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04
21 Data bubbles: quality preserving performance boosting for hierarchical clustering
22 Environmental Information: EPA Needs Better Information to Manage Risks and Measure Results
Figure 3 Total articles by year reviewed for case studies one and two
The literature review surfaced the fact that there were more uses of, and impacts resulting from, data quality (both positive and negative) than were apparent to the author at the start of the case studies. When commencing the case studies there was already a focused view of data quality – that relating to IT configuration items. However, having explored the extant literature, it is noted that the importance of data quality has been recognized in fields of study that are very different from IT.
The articles provided a background to how far work has progressed in the area of data quality. Moreover, the impact of work completed previously (as evidenced by the articles reviewed) has enabled analysts like the author to study and better understand the impact of data on operations. For example, in his paper entitled Data quality and due process in large inter-organizational record systems,
Laudon (1986, p.10) studied the data quality of FBI records and stated that "In excess of 14,000 persons are at risk of being falsely detained and perhaps arrested, because of invalid warrants in this system". That was back in 1986; a study of the available articles shows that data quality has since improved, and evidence of this can be found in the 2005 article by the Computer Sciences Corporation (CSC) Leading Edge Forum, which recognized the importance of data quality (CSC Leading Edge Forum, 2005).
What was learnt by the author, as a result of carrying out the literature review, was that there are many impacts of poor data quality, such as the FBI example quoted above, and that data quality affects all people in all walks of life in one way or another. A positive finding was that, over time, clear progress has been made in improving data quality; this is best represented by the paper by Redman (1998) reviewed later in this chapter. A negative impact of data quality, where it is poor or lacking, is that referenced above in the paper by Laudon (1986).
A review of the papers (for both data quality and SARBANES OXLEY data) also revealed some key contributors, who became apparent as the author worked through the articles surfaced by the literature search. One key contributor is Wang, who, in a paper co-authored with Wand (Wang and Wand, 1996), recognized the importance of defining the data ahead of time. This is still very important and, in the author's opinion, is still a relevant model today. Across the literature studied, Wang published a number of papers focusing on data quality. Other key contributors are
Ballou and Tayi (Tayi and Ballou, 1998), who identified, in Implications of data quality for spreadsheet analysis, the impact of poor data quality on spreadsheet output. The contribution they made was the recognition that the impact of poor data on resulting calculations, and the cumulative effect this can have, had not been properly recognized previously. Their proposal of a framework for analyzing spreadsheet errors drew attention to the fact that an error early in the processing of data could lead to a larger error resulting in financial misstatement, albeit by accident. This is very relevant, as the SARBANES OXLEY Act of 2002 deals with financial misstatement and is also dependent on tracing the data used in financial processes.
The contribution from Lee (in collaboration with Pipino and Wang) is that "assessing data quality is an ongoing effort" and that a "one size fits all" set of metrics is not a solution (Pipino, Lee and Wang, 2002, p.218). Again this is very important for senior management to understand: as companies' data warehouses and databases grow ever larger, the measurement of the data and its quality becomes more complex.
Also reviewed were articles relating to the Six Sigma process improvement methodology. The author felt this was an important topic in its own right, and therefore a brief overview of Six Sigma is also provided as part of the literature review. The milestone foundation work for Six Sigma was carried out by the statisticians W Edwards Deming, Joseph M Juran and Walter A Shewhart during the 1920s and 1930s, and was in turn utilized and expanded at Motorola in the 1980s.
In his seminal paper, Shewhart (1925, p.1) stated that:
'By detecting the existence of trends, statistics plays one important role in
helping the manufacturer to maintain the quality of product.'
This formed the basis for the work that led to the present-day Six Sigma
methodology; the control charts used in chapter four are developed from
Shewhart's early work. Shewhart's contribution was to allow managers to
distinguish variation arising from common causes from variation with assignable
causes. Deming (1993) discussed some elements of the work previously completed
by Shewhart and highlighted two common mistakes:
Mistake 1.
To react to an outcome as if it came from a special cause, when actually it
came from a common cause of variation.
Mistake 2.
To treat an outcome as if it came from a common cause of variation, when
actually it came from a special cause.
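Shewhart's distinction between common-cause and assignable-cause (special-cause) variation can be illustrated with a minimal individuals control chart sketch. This is not the thesis's implementation; the error counts below are invented for illustration:

```python
def control_limits(samples):
    """Return (mean, lcl, ucl) for an individuals (X) chart.

    Sigma is estimated from the average moving range, the usual
    Shewhart approach, using the d2 constant for n=2 (1.128).
    """
    n = len(samples)
    mean = sum(samples) / n
    moving_ranges = [abs(samples[i] - samples[i - 1]) for i in range(1, n)]
    sigma = (sum(moving_ranges) / len(moving_ranges)) / 1.128
    return mean, mean - 3 * sigma, mean + 3 * sigma

def assignable_cause_points(samples):
    """Indices of points outside the 3-sigma control limits,
    i.e. candidates for an assignable (special) cause."""
    _, lcl, ucl = control_limits(samples)
    return [i for i, x in enumerate(samples) if x < lcl or x > ucl]

# Invented example: daily data-entry error counts with one spike.
errors = [4, 5, 3, 4, 6, 5, 4, 15, 5, 4]
print(assignable_cause_points(errors))  # [7] - only the spike is flagged
```

Points inside the limits are treated as common-cause variation and left alone (avoiding Deming's Mistake 1); points outside are investigated for an assignable cause (avoiding Mistake 2).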
Joseph Juran discussed the importance of process validation and of process
controls (Juran, 1978). The article covers two aspects that carry through into
Six Sigma: process validation is akin to the Measure and Analyze phases of
DMAIC (see chapter four for an explanation of the term), and process controls
are akin to the Control phase of DMAIC:
"In contrast, the Japanese emphasis is on the following:
- Process 'validation' – quantifying process capabilities so that it is known in
advance whether the vendor's processes can hold the tolerance.
- Process controls – the plans of control that will be used to keep the
processes doing good work." (Juran, 1978, p.26)
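Juran's notion of quantifying process capability in advance is commonly expressed through a capability index such as Cp, which compares the tolerance width to the natural process spread. The sketch below is illustrative only, with invented tolerance and sigma values; a Cp of roughly 2 corresponds to Six Sigma performance:

```python
def process_capability(usl, lsl, sigma):
    """Cp index: ratio of the tolerance width (USL - LSL) to the
    natural process spread (6 sigma). Cp >= 1.33 is a common
    minimum benchmark; Six Sigma targets roughly Cp = 2."""
    return (usl - lsl) / (6 * sigma)

# Invented example: specification of 10 +/- 0.3, process sigma 0.05.
cp = process_capability(usl=10.3, lsl=9.7, sigma=0.05)
print(cp)  # approximately 2.0
```

Computed before production starts, this index answers exactly the question Juran poses: whether the process can hold the tolerance at all.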
The articles also demonstrated that the work undertaken at Motorola forms the
basis for current thinking on Six Sigma. From a Six Sigma textbook standpoint,
the key contributors are Pyzdek (2003) and Breyfogle III (2003), whose works
offer a complete and in-depth technical and managerial overview of the subject.
The articles additionally covered the area of research methodologies, and
surfaced the fact that the seminal work for understanding research
methodologies is that of Burrell and Morgan (2008), which represents a
milestone in the field of social paradigms and organisational analysis. The
work of Easterby-Smith et al. (2006) on management research is also a key
contribution.
2.4 Categorisation of the information
As can be seen in section 2.2, there is a great deal of information on the use
of data and its differing types. It was therefore necessary to find a way to
categorise the literature into areas of interest, and to set aside outliers of
little relevance to the case studies.
As noted, this thesis is concerned with data associated with the IT operations
of a global pharmaceutical company and with the quality of that data. To
categorise the articles, a scoring table was used to record the papers and
grade their relevance to the work in the case studies. This made it possible to
determine the relevance of each piece of literature: for instance, whether it
concerned IT, data quality, Six Sigma, SARBOX, the methodological approaches
used, or the application of the study (i.e. in an IT or audit environment).
This provided a structured approach to grading the articles.
2.5 Sort by significance and value add
Once a basic list was created, an article evaluation form was developed based
on the example provided by Dunleavy (2003). It is important to note that
although the ranking system provides a method of sorting the papers, every
paper was reviewed to see whether any minor contribution could be gleaned from
it. It was not felt necessary to include a review of every paper in the thesis,
as this would only overcomplicate it for little or no improvement. The form is
as follows:
Measure | General | General Computing | Data quality in IT | Data quality & integrity | Six Sigma, data quality & integrity | Relevance (0 = not relevant, 25 = very relevant)
Reports: Report on the Dagstuhl Seminar: "data quality on the Web" | 1 | 2 | 3 | 2 | 2 | 10
Assessing data quality with control matrices | 4 | 2 | 4 | 4 | 2 | 16
Virtual extension: Data quality assessment | 3 | 2 | 4 | 4 | 2 | 15
Posters and Short Papers: A practical approach for modeling the quality of multimedia data | 2 | 3 | 2 | 2 | 2 | 11
Data quality in internet time, space, and communities (panel session) | 4 | 2 | 1 | 2 | 1 | 10
Table 3 Article evaluation form
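The relevance totals in Table 3 are simply the sum of the five category scores (each scored 0 to 5, giving a maximum of 25). A minimal sketch of the grading arithmetic, using the scores from Table 3 (the dictionary structure is illustrative, not from the thesis):

```python
# Each article is scored 0-5 in five categories (General, General
# Computing, Data quality in IT, Data quality & integrity, and
# Six Sigma with data quality & integrity); the relevance total
# is the simple sum, with a maximum of 25. Scores from Table 3.
articles = {
    'Report on the Dagstuhl Seminar: "data quality on the Web"': [1, 2, 3, 2, 2],
    "Assessing data quality with control matrices": [4, 2, 4, 4, 2],
    "Virtual extension: Data quality assessment": [3, 2, 4, 4, 2],
    "A practical approach for modeling the quality of multimedia data": [2, 3, 2, 2, 2],
    "Data quality in internet time, space, and communities": [4, 2, 1, 2, 1],
}

relevance = {title: sum(scores) for title, scores in articles.items()}

# List the papers most-relevant first.
for title, total in sorted(relevance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{total:2d}  {title}")
```

Sorting on the total surfaces the papers most relevant to the case studies (here, the control-matrices paper at 16) while still retaining every paper in the list for the minor-contribution review described above.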