The document discusses data quality success stories and provides an overview of a program on the topic. It introduces the program, which will discuss data quality as an engineering challenge, putting a price on data quality, how components of data management complement each other, savings-based and innovation-based success stories, and non-monetary success stories. The program aims to provide takeaways and allow for questions and answers.
Data Quality Success Stories
1. Data Quality Success Stories
Copyright 2019 by Data Blueprint
Peter Aiken, PhD
• DAMA International President 2009-2013 / 2018
• DAMA International Achievement Award 2001
(with Dr. E. F. "Ted" Codd)
• DAMA International Community Award 2005
Peter Aiken, Ph.D.
• I've been doing this a long time
• My work is recognized as useful
• Associate Professor of IS (vcu.edu)
• Founder, Data Blueprint (datablueprint.com)
• DAMA International (dama.org)
• 10 books and dozens of articles
• Experienced w/ 500+ data
management practices worldwide
• Multi-year immersions
– US DoD (DISA/Army/Marines/DLA)
– Nokia
– Deutsche Bank
– Wells Fargo
– Walmart
– …
PETER AIKEN WITH JUANITA BILLINGS
FOREWORD BY JOHN BOTTEGA
MONETIZING
DATA MANAGEMENT
Unlocking the Value in Your Organization’s
Most Important Asset.
3. About Us
Infogix Confidential Copyright 2019
• Obsessed with Data Integrity since 1982
• Headquartered in the heart of the Mid-West
• Mature and dedicated to customer success
• Trusted by some of the largest brands in the
most data intensive industries
4. Success Means Focusing on What Matters
Data
Selection of data at the system and source
level (tables and fields)
Information
As required to develop a common language for
important data
Insights & Process Excellence
To monitor the effectiveness of our analytics, metrics,
and processes
Business Goals & Objectives
Focus on Critical Data Elements required to support value drivers and key initiatives
All Available Data
100% of Data
Data We Use
40% of Data
Data We Should Govern
10% of Data
Data of High Value
< 5% of Data
Critical Data
Deliver Highest Value Data
• Metrics & Scoring
• Advanced Analytics Integration
• Data-Driven Authoring
Understand Your Data Context
• Business Glossary
• Policy Management
• Data Detection & Tagging
Know What Data You Have
• Data Catalog
• Data Lineage
• Data Transformation
• Data Integration
Know Your Data is Trustworthy
• Balancing, Reconciliation, & Statistical Controls
• Governance, Stewardship & Workflow
• Case Management
6. Operational Data Quality
Insurance
• For one line of business, saved 35 data analyst FTEs who were dedicated to
fixing data
• Reduced manual lost claims payments processing costs from $5M to
$500K annually
• Saved over $2M/year in audit costs through controls
Media &
Communications
• Prevented over $10M in under-billing errors in one year
• Achieved an excess of $3M within the first quarter and realized full ROI of
license within first month of implementation
• Shrank time to process from 6 months to 3 weeks through more agile
integrations and transformations.
Financial Services
• Caught a $9M erroneous transaction within 2 weeks of implementation go-live
• Saved over $2M in General Ledger reconciliation costs
• Monitors $20 billion in daily electronic payments in under 5 minutes
Other
• Resolved over $190M in inventory discrepancies (duplicates, incorrect)
• 30% reduction in M&A integration time through automated testing
• Prevented $16M in erroneous payments from being disbursed
Quoting
Policy
Admin
Billing
Onboarding
GL
Agency
Direct
• In support of
• Governance & Compliance
• Financial Reporting Accuracy
• 3rd Party Data Management
• System Migration
• Evolving Technology (e.g.
Streaming Data)
7. Who is Joan Smith?
http://www.dataflux.com
Challenges
• Purchased an A4
on June 15 2007
• Had not done
business with the
dealership prior
• "makes them
seem sleazy when
I get a letter in the
mail before I've
even made the
first payment on
the car advertising
lower payments
than I got"
8. Letter from the Bank
… so please continue to open
your mail from either Chase or
Bank One
P.S. Please be on the lookout for any
upcoming communications from
either Chase or Bank One regarding
your Bank One credit card and any
other Bank One product you may
have.
Problems
• I initially discarded the letter!
• I became upset after reading it
• It proclaimed that Chase has data
quality challenges
How to solve this data quality problem using just tools?
Retail price for the unit was $40
9. A congratulations
letter from another
bank
Problems
• Bank did not know
it made an error
• Tools alone could
not have prevented
this error
• Lost confidence in
the ability of the
bank to manage
customer funds
DropTable
10. Data Quality Success Stories - Program Overview
1. Data quality must be understood as
an engineering challenge
2. Putting a price on data quality
3. DM BoK components complement
each other well
4. Savings-based stories
5. Innovation-based stories
6. Non-monetary stories
7. Takeaways and Q&A
11. Four ways to make your data sparkle!
1. Prioritize the task
– Cleaning data is costly and time consuming
– Identify mission critical/non-mission critical data
2. Involve the data owners
– Seek input of business units on what constitutes "dirty" data
3. Keep future data clean
– Incorporate processes and technologies that check
every zip code and area code
4. Align your staff with business
– Align IT staff with business units
(Source: CIO JULY 1 2004)
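The "keep future data clean" point above can be sketched as a check applied to each incoming record. This is only an illustration: the field names (zip, area_code), the sample records, and the assumption of US ZIP codes and North American area codes are hypothetical, and a format check like this catches malformed values, not values that are well formed but wrong.

```python
import re

# Assumed formats (not from the slides): US ZIP codes are 5 digits with an
# optional 4-digit extension; North American area codes are 3 digits and
# do not start with 0 or 1.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")
AREA_CODE_RE = re.compile(r"[2-9]\d{2}")


def check_record(record):
    """Return the names of the fields that fail their format check."""
    problems = []
    if not ZIP_RE.fullmatch(record.get("zip", "")):
        problems.append("zip")
    if not AREA_CODE_RE.fullmatch(record.get("area_code", "")):
        problems.append("area_code")
    return problems


# Hypothetical incoming records
records = [
    {"zip": "23284", "area_code": "804"},
    {"zip": "2328", "area_code": "804"},        # ZIP too short
    {"zip": "23284-2000", "area_code": "104"},  # area code starts with 1
]

# Keep only the records that have at least one failing field
rejected = [(r, check_record(r)) for r in records if check_record(r)]
for record, problems in rejected:
    print(record, "failed:", problems)
```

Running the check at the point of entry, instead of during a later cleanup, is what keeps the dirty record from entering downstream systems in the first place.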
High Quality Data is Critical
• Information transparency
• Analytics
• Business Intelligence
• Increasing efficiencies
• Decreasing costs
• Driving holistic decision-making
across the organization
13. Data Footprints
• SQL Server
– 47,000,000,000,000 bytes
– Largest table 34 billion records
• Informix
– 1,800,000,000 queries/day
– 65,000,000 tables / 517,000 databases
• Teradata
– 117 billion records
– 23 TBs for one table
• DB2
– 29,838,518,078 daily queries
Repeat 100s, thousands, millions, billions of times ...
14. Death by 1000 Cuts
Garbage In ➜ Garbage Out!
My most profound lesson! (so far)
Perfect Model
Garbage Data
Garbage Results
Data Warehouse, Machine Learning, Business Intelligence, Block Chain, AI, MDM, Analytics, Technology, Data Governance
GI➜GO!
Perfect Model
Quality Data is foundational
Good Results
Data Warehouse, Machine Learning, Business Intelligence, Block Chain, AI, MDM, Analytics, Technology, Data Governance
Quality In ➜ Quality Out!
16. Data Knowledge is insufficient and informal
• Data management happens 'pretty well' at
the workgroup level
– Defining characteristic of a workgroup
– Without guidance, what are the chances that all
workgroups are pulling toward the same objectives?
– Consider the time spent attempting informal practices
• Data chaff becomes sand in the machinery
– Preventing smooth interoperation and exchanges
– Difficulties that have been hard to account for
• Organizations and individuals lack
– Skills
– Knowledge (architecture)
– Data Engineering (how)
– Data Strategy (why)
Making a Better Data Sandwich
Standard data
Data supply
Data literacy
This cannot happen without data engineering and architecture!
Quality data engineering/
architecture work products
do not happen accidentally!
18. Our barn had to pass a foundation inspection
Engineering Standards
19. USS Midway &
Pancakes
• It is tall
• It has a clutch
• It was built in 1942
• It is cemented to the floor
• It is still in regular use!
Why is this an excellent
engineering example?
Data Quality Success Stories - Program Overview
1. Data quality must be understood as
an engineering challenge
2. Putting a price on data quality
3. DM BoK components complement
each other well
4. Savings-based stories
5. Innovation-based stories
6. Non-monetary stories
7. Takeaways and Q&A
20. Hidden Data Factories are expensive https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
• Consider these two questions:
– Were your systems explicitly designed to
be integrated or otherwise work together?
– If not then what is the likelihood that they
will just happen to work well together?
• Data must function at the most granular
interaction or it results in things that:
– Take longer (end-of-day job runs 45 hours)
– Cost more (the wrong assets are transferred)
– Deliver less (features are not delivered)
– Present greater risk (billing delayed 30 days, monthly)
• 20-40% of IT budgets are spent evolving data:
– Data migration (changing the location from one place to another)
– Data conversion (changing it into another form, state, or product)
– Data improvement (inspecting, manipulating it, preparing for subsequent use)
"The choice of data structure and algorithm
can make the difference between software
running in a few seconds or many days."
http://slideplayer.com/slide/7664141/
DQ
challenges
are context
specific!
Much more analysis is
required before we can
implement repeatable
solutions to today's data
quality challenges!
Infographic (presented by Domo, 2018): how much data is generated every minute of the day
https://www.domo.com/learn/data-never-sleeps-6
How much Data,
by the minute!
For the entirety of 2018, every minute
of every day:
• 18 million weather forecast requests
• Netflix streams almost 100,000
hours of video
• LinkedIn adds 120+ individuals
• 1,300 Uber rides
• (almost) a half million tweets
• 7,000 Tinder matches
• 1.25 new cryptocurrencies are
created
• ...
22. Great inspiration towards valuation ...
• How to Measure Anything: Finding the Value of
Intangibles in Business by Douglas Hubbard (ISBN: 0470539399)
• Measurement is a reduction in uncertainty
• Formalizing stuff forces clarity
• Whatever your measurement problem is,
– it's been done before
• You have more data than you think
• You need less data than you think
• Getting data is more economical than you think
• You probably need different data than you think
• Special shout out to Chapter 7
– Measuring the value of additional information to a decision
Sheena's in color Activity-Based Costing Kills Someone
23. Enrico Fermi (Nobel Prize Physics 1938)
• Tuners in Chicago ≈ Population/people per household
times % households with tuned pianos
times tunings per year
divided by (tunings per tuner per day
times workdays/year)
• How many piano tuners in the city of Chicago?
– Without using existing lists such as yellow pages, google ...
– Current population of Chicago (3 million at the time)
– Average number of people per household (2 or 3)
– Share of households with regularly tuned pianos (1 in 3)
– Required frequency of tuning (1/year)
– How many pianos can a tuner tune daily? (4 or 5)
– How many days/year are worked (250)
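The estimate above can be worked through in a few lines. This is a minimal sketch of the formula on the slide; the function and parameter names are made up for illustration, and the low and high cases simply pick the ends of the ranges the slide gives ("2 or 3" people per household, "4 or 5" tunings per day).

```python
def piano_tuners(population, people_per_household, share_with_tuned_piano,
                 tunings_per_year, tunings_per_tuner_per_day, workdays_per_year):
    """Fermi estimate: tunings demanded per year / tunings one tuner supplies."""
    households = population / people_per_household
    tunings_needed = households * share_with_tuned_piano * tunings_per_year
    tunings_per_tuner = tunings_per_tuner_per_day * workdays_per_year
    return tunings_needed / tunings_per_tuner


# Low case: larger households, more productive tuners
low = piano_tuners(3_000_000, 3, 1 / 3, 1, 5, 250)
# High case: smaller households, less productive tuners
high = piano_tuners(3_000_000, 2, 1 / 3, 1, 4, 250)

print(f"Somewhere between {low:.0f} and {high:.0f} piano tuners")
```

With the slide's inputs the answer lands between roughly 270 and 500 tuners. The point is not the exact figure but that a defensible range falls out of a handful of rough, individually estimable quantities.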
Monetization: Time & Leave Tracking
At least 300 employees are
spending 15 minutes/week
tracking leave/time
24. Capture Cost of Labor/Category
District-L (as an example)   Leave Tracking   Time Accounting
Employees                    73               50
Number of documents          1000             2040
Timesheet/employee           13.7             40.8
Time spent                   0.08             0.25
Hourly Cost                  $6.92            $6.92
Additive Rate                $11.23           $11.23
Cost per timekeeper          $12.31           $114.56
Total timekeeper cost        $898.49          $5,727.89
Monthly cost                 $21,563.83       $137,469.40
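The timekeeper rows of the table can be approximately reproduced with simple arithmetic. The sketch below rests on an interpretation the slide does not state: that "Time spent" is hours per document and that "Additive Rate" is the loaded hourly rate applied to that time. The function name is hypothetical. The table's own figures differ from these results by less than a dollar, presumably because the slide used unrounded inputs.

```python
def timekeeping_cost(documents, hours_per_document, loaded_hourly_rate):
    """Labor cost of handling a batch of time or leave documents."""
    return documents * hours_per_document * loaded_hourly_rate


ADDITIVE_RATE = 11.23  # loaded hourly rate, as shown in the table

# Leave tracking: 1000 documents, 0.08 hours each, 73 employees
leave_total = timekeeping_cost(1000, 0.08, ADDITIVE_RATE)
leave_per_employee = leave_total / 73

# Time accounting: 2040 documents, 0.25 hours each, 50 employees
time_total = timekeeping_cost(2040, 0.25, ADDITIVE_RATE)
time_per_employee = time_total / 50

print(f"Leave tracking:  ${leave_total:,.2f} total, ${leave_per_employee:.2f} per timekeeper")
print(f"Time accounting: ${time_total:,.2f} total, ${time_per_employee:.2f} per timekeeper")
```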
Compute Labor Costs
25. Annual Organizational Totals
• $100,000 Salem
• $159,000 Lynchburg
• $100,000 Richmond
• $100,000 Suffolk
• $150,000 Fredericksburg
• $100,000 Staunton
• $100,000 NOVA
• $800,000/month or $9,600,000 annually
• Awareness of the cost of things considered overhead
Data Quality Success Stories - Program Overview
1. Data quality must be understood as
an engineering challenge
2. Putting a price on data quality
3. DM BoK components complement
each other well
4. Savings-based stories
5. Innovation-based stories
6. Non-monetary stories
7. Takeaways and Q&A
27. Definitions
• Quality Data
– Fit for purpose: meets the requirements of its authors, users,
and administrators (adapted from Martin Eppler)
– Synonymous with information quality, since poor data quality
results in inaccurate information and poor business performance
• Data Quality Management
– Planning, implementation and control activities that apply quality
management techniques to measure, assess, improve, and
ensure data quality
– Entails the "establishment and deployment of roles, responsibilities
concerning the acquisition, maintenance, dissemination, and
disposition of data" http://www2.sas.com/proceedings/sugi29/098-29.pdf
✓ Critical supporting process from change management
✓ Continuous process for defining acceptable levels of data quality to meet
business needs and for ensuring that data quality meets these levels
• Data Quality Engineering
– Recognition that data quality solutions cannot be managed but must be engineered
– Engineering is the application of scientific, economic, social, and practical knowledge
in order to design, build, and maintain solutions to data quality challenges
– Engineering concepts are generally not known and understood within IT or business!
Spinach/Popeye story from http://it.toolbox.com/blogs/infosphere/spinach-how-a-data-quality-mistake-created-a-myth-and-a-cartoon-character-10166
Why aren't my
data problems
solved by a data
warehouse?
28. Version 1
Data Strategy
Data Governance
Data Quality
Improving operations in 3 data management practice areas
BI Warehouse
Version 2
Data Strategy
Data Governance
BI Warehouse
Metadata
Improving operations in 3 data management practice areas
29. Version 3
Data Strategy
Data Governance
BI/Warehouse
Reference & Master Data
Perfecting operations in 3 data management practice areas
Data Quality Success Stories - Program Overview
1. Data quality must be understood as
an engineering challenge
2. Putting a price on data quality
3. DM BoK components complement
each other well
4. Savings-based stories
5. Innovation-based stories
6. Non-monetary stories
7. Takeaways and Q&A
31. Ubiquitous Mystery Object
32. Complex Data Quality Problems
• Agency manages 4,000,000 data items
– Executive in charge requested
a conversion update
– Was told verbally the conversion was "going well"
– Demanded specifics
• Question: "How many items did you attempt to convert?"
• Answer: "100 items"
• Question: "How many were actually converted?"
• Answer: "5"
• Problems
– Not reporting the "right results"
– These "problems" were discovered too late in the project
– Unsophisticated contractor
Improving Data Quality during System Migration
• Challenge
– Millions of NSN/SKUs
maintained in a catalog
– Key and other data stored in
clear text/comment fields
– Original suggestion was manual
approach to text extraction
– Left the data structuring problem unsolved
• Solution
– Proprietary, improvable text extraction process
– Converted non-tabular data into tabular data
– Saved a minimum of $5 million
– Literally person-centuries of work
Week #   Unmatched Items (% Total)   Ignorable Items (% Total)   Items Matched (% Total)
1        31.47%                      1.34%                       N/A
2        21.22%                      6.97%                       N/A
3        20.66%                      7.49%                       N/A
4        32.48%                      11.99%                      55.53%
…        …                           …                           …
14       9.02%                       22.62%                      68.36%
15       9.06%                       22.62%                      68.33%
16       9.53%                       22.62%                      67.85%
17       9.5%                        22.62%                      67.88%
18       7.46%                       22.62%                      69.92%
Determining Diminishing Returns
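One way to read the table above is to look at the week-over-week change in the matched percentage and flag a plateau when several consecutive weeks add almost nothing. The sketch below uses only the weeks 14 to 18 figures shown; the 1-point threshold and 3-week window are assumptions chosen for illustration, not values from the project.

```python
# Matched items (% of total) for the weeks listed in the table
matched = {14: 68.36, 15: 68.33, 16: 67.85, 17: 67.88, 18: 69.92}


def weekly_gains(series):
    """Change in matched % from each listed week to the next."""
    weeks = sorted(series)
    return {later: round(series[later] - series[earlier], 2)
            for earlier, later in zip(weeks, weeks[1:])}


def plateaued(series, threshold=1.0, window=3):
    """True when each of the last `window` weekly gains is below `threshold`."""
    recent = list(weekly_gains(series).values())[-window:]
    return len(recent) == window and all(gain < threshold for gain in recent)


print(weekly_gains(matched))

# Through week 17 the gains are flat; week 18 shows a fresh improvement.
through_week_17 = {week: pct for week, pct in matched.items() if week <= 17}
print(plateaued(through_week_17), plateaued(matched))
```

A rule like this only signals when to ask whether further effort is worth it; the decision itself still depends on what each remaining unmatched item costs the business.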
Before
After
Time needed to review all NSNs once over the life of the project:
NSNs 2,000,000
Average time to review & cleanse (in minutes) 5
Total Time (in minutes) 10,000,000
Time available per resource over a one year period of time:
Work weeks in a year 48
Work days in a week 5
Work hours in a day 7.5
Work minutes in a day 450
Total Work minutes/year 108,000
Person years required to cleanse each NSN once prior to migration:
Minutes needed 10,000,000
Minutes available person/year 108,000
Total Person-Years 92.6
Resource Cost to cleanse NSN's prior to migration:
Avg Salary for SME year (not including overhead) $60,000.00
Projected Years Required to Cleanse/Total DLA Person Year Saved 93
Total Cost to Cleanse/Total DLA Savings to Cleanse NSN's: $5.5 million
Quantitative Benefits
Time needed to review all NSNs once over the life of the project:
NSNs 150,000
Average time to review & cleanse (in minutes) 5
Total Time (in minutes) 750,000
Time available per resource over a one year period of time:
Work weeks in a year 48
Work days in a week 5
Work hours in a day 7.5
Work minutes in a day 450
Total Work minutes/year 108,000
Person years required to cleanse each NSN once prior to migration:
Minutes needed 750,000
Minutes available person/year 108,000
Total Person-Years 7
Resource Cost to cleanse NSN's prior to migration:
Avg Salary for SME year (not including overhead) $60,000.00
Projected Years Required to Cleanse/Total DLA Person Year Saved 7
Total Cost to Cleanse/Total DLA Savings to Cleanse NSN's: $420,000
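The arithmetic behind the two scenarios above (2,000,000 NSNs and 150,000 NSNs) can be written as one small function. This is a sketch that takes the slide's inputs at face value; the function and parameter names are invented for illustration, and the slide rounds person-years up before multiplying by salary, which is why its totals are slightly higher than the unrounded cost.

```python
def cleansing_estimate(nsn_count, minutes_per_nsn=5, weeks_per_year=48,
                       days_per_week=5, hours_per_day=7.5, salary=60_000):
    """Return (person_years, cost) to review every NSN once by hand."""
    total_minutes = nsn_count * minutes_per_nsn
    minutes_per_person_year = weeks_per_year * days_per_week * hours_per_day * 60
    person_years = total_minutes / minutes_per_person_year
    return person_years, person_years * salary


before_years, before_cost = cleansing_estimate(2_000_000)
after_years, after_cost = cleansing_estimate(150_000)

print(f"2,000,000 NSNs: {before_years:.1f} person-years, ${before_cost:,.0f}")
print(f"  150,000 NSNs: {after_years:.1f} person-years, ${after_cost:,.0f}")
```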
35. Why should a knowledge worker
• with a PhD in Chemical Engineering
• have to know whether this product was
Y2K compliant?
International Chemical Company Engine Testing
• $1 billion (+) chemical
company
• Develops/manufactures
additives enhancing the
performance of oils and
fuels ...
• ... to enhance engine/
machine performance
– Helps fuels burn cleaner
– Engines run smoother
– Machines last longer
• Tens of thousands of
tests annually
– Test costs range up to
$250,000!
1. Manual transfer of digital data
2. Manual file movement/duplication
3. Manual data manipulation
4. Disparate synonym reconciliation
5. Tribal knowledge requirements
6. Non-sustainable technology
Data Integration Solution
• Integrated the existing systems to
easily search on and find similar
or identical tests
• Results:
– Reduced expenses
– Improved competitive edge
and customer service
– Time savings and improved
operational capabilities
• According to our client’s internal
business case development, they
expect to realize a $25 million
gain each year thanks to this
data integration
37. Lockheed Martin
• 20 years of project email
– Example from Doug Laney
Logistics Company
• Fortune 450
• Room of 100 associates
• Manually correcting every
item on every customer invoice
• Upon noting this to the
responsible manager - the reply was:
– This is the best quarter
– Of the best year
– I've ever had
– Perhaps I need
to double the
number in
that room?
Data Quality Success Stories - Program Overview
1. Data quality must be understood as
an engineering challenge
2. Putting a price on data quality
3. DM BoK components complement
each other well
4. Savings-based stories
5. Innovation-based stories
6. Non-monetary stories
7. Takeaways and Q&A
US DoD Reverse Engineering Program Manager
• "Your first project is to keep me from
having to testify to a Congressional
Hearing!" (Belkis Leon-Hong former ASD-C3I)
• Problem:
– 37 systems paid personnel within DoD
– How many were needed?
– How many potential losers?
– What do you mean by employee?
• Process modeling
– Inconclusive results
• Data reverse engineering - definitive
– One legged engineer,
working in waist deep waters,
underneath rotating helicopter blades,
on overtime
39. Reverse Engineering New Systems
Reverse Engineering New Systems for Smooth Implementation. IEEE Software. March/April 1999 16(2):36-43
Platform: UniSys
OS: OS
1998 Age: 21
Data Structure: DMS (Network)
Physical Records: 4,950,000
Logical Records: 250,000
Relationships: 62
Entities: 57
Attributes: 1478
Predicting Engineering Problem Characteristics
New System
Legacy System
#1: Payroll
Legacy System
#2: Personnel
Platform: Amdahl
OS: MVS
1998 Age: 15
Data Structure: VSAM/virtual
database tables
Physical Records: 780,000
Logical Records: 60,000
Relationships: 64
Entities: 4/350
Attributes: 683
Platform: WinTel
OS: Win'95
1998 Age: new
Data Structure: Client/Server RDBMS
Characteristics: Logical / Physical
Records: 250,000 / 600,000
Relationships: 1,034 / 1,020
Entities: 1,600 / 2,706
Attributes: 15,000 / 7,073
40. Actual Bid From Systems Integrator
Extreme Data Engineers ...
2 person months = 40 person days
2,000 attributes mapped onto 15,000
2,000/40 person days = 50/person day
or 50/8 hours = 6.25 attributes/hour
and
15,000/40 person days = 375/person day
or 375/8 hours = 46.875 attributes/hour
Locate, identify, understand, map, transform, document
about 53 attributes/60 minutes
roughly 0.9 attributes/minute!
41. What did Rolls Royce Learn from Nascar?
• Old model
– Sell jet engines
• New model
– Sell hours of thrust power
– Power-by-the-hour
– No payment for down time
– Wing to wing
– When was it invented?
Fan Blade Sensor
• 1 Sensor
– Probabilistic (generalist) maintenance
forecasts
• 100 Sensors
– Establish optimal monitoring targets
– Finer tuned and safer maintenance
– Mission Readiness ???
– Storage $$$
– Handling $$$
– Opportunity $$$
– Systemic $$$
– Maintenance $$$
– Total > $1.5 Billion
Data Quality Success Stories - Program Overview
1. Data quality must be understood as
an engineering challenge
2. Putting a price on data quality
3. DM BoK components complement
each other well
4. Savings-based stories
5. Innovation-based stories
6. Non-monetary stories
7. Takeaways and Q&A
Armed Forces Example
• Lieutenant attempting to
correct a 4-year
underpayment
of his private's pay
– Significant impact on morale
– Immediate cash issues
– Cost tens of man-hours over
months of time to resolve
Nugee, R. and R. S. Seiner (2010, 6/1/2010). "TDAN.com Interview with Brigadier Richard Nugee – The British Army." 2013, from http://www.tdan.com/view-special-features/13897 and personal communications.
43. Friendly Fire deaths traced to Dead Battery
• Date: Tue, 26 Mar 2002 10:47:52 -0500
From:
Subject: Friendly Fire deaths traced to dead battery
In one of the more horrifying incidents I've read about, U.S. soldiers and
allies were killed in December 2001 because of a stunningly poor design of a
GPS receiver, plus "human error."
http://www.washingtonpost.com/wp-dyn/articles/A8853-2002Mar23.html
A U.S. Special Forces air controller was calling in GPS positioning from
some sort of battery-powered device. He "had used the GPS receiver to
calculate the latitude and longitude of the Taliban position in minutes and
seconds for an airstrike by a Navy F/A-18."
• According to the *Post* story, the bomber crew "required" a "second
calculation in 'degree decimals'" -- why the crew did not have equipment to
perform the minutes-seconds conversion themselves is not explained.
• The air controller had recorded the correct value in the GPS receiver when
the battery died. Upon replacing the battery, he called in the
degree-decimal position the unit was showing -- without realizing that the
unit is set up to reset to its *own* position when the battery is replaced.
The 2,000-pound bomb landed on his position, killing three Special Forces
soldiers and injuring 20 others.
• If the information in this story is accurate, the RISKS involve replacing
memory settings with an apparently-valid default value instead of blinking 0
or some other obviously-wrong display; not having a backup battery to hold
values in memory during battery replacement; not equipping users to
translate one coordinate system to another (reminiscent of the Mars Climate
Orbiter slamming into the planet when ground crews confused English with
metric); and using a device with such flaws in a combat situation
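The coordinate translation mentioned above, from degrees, minutes and seconds to decimal degrees, is a short calculation. The sketch below shows the standard conversion; the function name and the sample coordinate are made up for illustration and have nothing to do with the incident. It also rejects out-of-range input instead of silently returning a plausible-looking value, which is the design point the message raises.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative by convention
    return -value if hemisphere in ("S", "W") else value


# Hypothetical coordinate: 34 degrees, 30 minutes, 36 seconds north
latitude = dms_to_decimal(34, 30, 36, "N")
print(f"{latitude:.4f}")
```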
Formalizing the
Role of U.S. Army
Data Governance
44. How one inventory item proliferates data throughout the chain
555 Subassemblies & subcomponents
17,659 Repair parts or Consumables
System 1:
18,214 Total items
75 Attributes/item
1,366,050 Total attributes
System 2
47 Total items
15+ Attributes/item
720 Total attributes
System 3
16,594 Total items
73 Attributes/item
1,211,362 Total attributes
System 4
8,535 Total items
16 Attributes/item
136,560 Total attributes
System 5
15,959 Total items
22 Attributes/item
351,098 Total attributes
Total for the five systems shown above:
59,350 Items
179 Unique attributes
3,065,790 values
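The per-system figures above multiply out as items times attributes per item, and the five attribute totals add up to the stated number of values. A minimal sketch of that check follows; System 2 lists "15+" attributes per item, so its stated total of 720 is used directly instead of being recomputed.

```python
# (total items, attributes per item) for systems with an exact attribute count
systems = {
    "System 1": (18_214, 75),
    "System 3": (16_594, 73),
    "System 4": (8_535, 16),
    "System 5": (15_959, 22),
}

# items x attributes per item gives the attribute values held by each system
attribute_totals = {name: items * attrs for name, (items, attrs) in systems.items()}
attribute_totals["System 2"] = 720  # stated on the slide ("15+" attributes/item)

total_values = sum(attribute_totals.values())
for name in sorted(attribute_totals):
    print(f"{name}: {attribute_totals[name]:,} attribute values")
print(f"All five systems: {total_values:,} values")
```

One inventory item therefore fans out into millions of individual values, each of which has to stay consistent across five systems.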
45. Business Implications
• National Stock Number (NSN)
Discrepancies
– If NSNs in LUAF, GABF, and RTLS are
not present in the MHIF, these records
cannot be updated in SASSY
– Additional overhead is created to correct
data before performing the real
maintenance of records
• Serial Number Duplication
– If multiple items are assigned the same
serial number in RTLS, the traceability of
those items is severely impacted
– Approximately $531 million of SAC 3
items have duplicated serial numbers
• On-Hand Quantity Discrepancies
– If the LUAF O/H QTY and number of items serialized in RTLS conflict, there
can be no clear answer as to how many items a unit actually has on-hand
– Approximately $5 billion of equipment does not tie out between the systems
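Two of the checks described above, duplicated serial numbers and on-hand quantities that do not tie out, can be sketched as simple reconciliations. Everything in the example is hypothetical: the record layout, the unit and NSN identifiers, and the sample rows stand in for extracts from the serialized-item system (RTLS) and the on-hand quantity file (LUAF), whose real formats are not shown in the slides.

```python
from collections import Counter

# Hypothetical serialized-item extract: (unit, nsn, serial_number)
rtls_serialized = [
    ("UNIT-A", "NSN-0001", "SN100"),
    ("UNIT-A", "NSN-0001", "SN101"),
    ("UNIT-B", "NSN-0001", "SN101"),  # same serial number assigned twice
    ("UNIT-B", "NSN-0002", "SN200"),
]

# Hypothetical on-hand quantities: (unit, nsn) -> quantity
luaf_on_hand = {
    ("UNIT-A", "NSN-0001"): 2,
    ("UNIT-B", "NSN-0001"): 1,
    ("UNIT-B", "NSN-0002"): 3,  # only one item is serialized
}


def duplicate_serials(rows):
    """Serial numbers that appear on more than one item."""
    counts = Counter(serial for _, _, serial in rows)
    return sorted(serial for serial, n in counts.items() if n > 1)


def quantity_mismatches(on_hand, rows):
    """(unit, nsn) pairs whose on-hand quantity differs from the serialized count."""
    serialized = Counter((unit, nsn) for unit, nsn, _ in rows)
    keys = set(on_hand) | set(serialized)
    return {key: (on_hand.get(key, 0), serialized.get(key, 0))
            for key in keys
            if on_hand.get(key, 0) != serialized.get(key, 0)}


print(duplicate_serials(rtls_serialized))
print(quantity_mismatches(luaf_on_hand, rtls_serialized))
```

Checks like these find the discrepancies; deciding which system holds the correct value still takes a person who knows the equipment.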
The best approaches combine manual and automated work
Humans Generally Better
• Sense low level stimuli
• Detect stimuli in noisy background
• Recognize constant patterns in varying situations
• Sense unusual and unexpected events
• Remember principles and strategies
• Retrieve pertinent details without a priori connection
• Draw upon experience and adapt decision to situation
• Select alternatives if original approach fails
• Reason inductively; generalize from observations
• Act in unanticipated emergencies and novel situations
• Apply principles to solve varied problems
• Make subjective evaluations
• Develop new solutions
• Concentrate on important tasks when overload occurs
• Adapt physical response to changes in situation
Machines Generally Better
• Sense stimuli outside human's range
• Count or measure physical quantities
• Store quantities of coded information accurately
• Monitor prespecified events, especially infrequent
• Make rapid and consistent responses to input signals
• Recall quantities of detailed information accurately
• Retrieve pertinent details without a priori connection
• Process quantitative data in prespecified ways
• Perform repetitive preprogrammed actions reliably
• Exert great, highly controlled physical force
• Perform several activities simultaneously
• Maintain operations under heavy operation load
• Maintain performance over extended periods of time
46. Potential Data Sources
47. Data Mapping
[Diagram labels: Mental illness, Deployments, Work History, Soldier, Legal Issues, Abuse, Suicide Analysis; FAPDMSS, G1, DMDC, CID, MDR]
• Data objects complete?
• All sources identified?
• Best source for each object?
• How to reconcile differences between sources?
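
The mapping questions above can be turned into a small bookkeeping exercise. The sketch below assumes each data object has a set of candidate sources with a fitness score; the object names follow the slide's diagram, while the source names (`source_a` and so on), the scores, and the function `best_sources` are entirely hypothetical and say nothing about what the real systems hold.

```python
# Candidate sources per data object. Source names and scores are hypothetical.
candidates = {
    "mental illness": {"source_a": 0.9, "source_b": 0.6},
    "deployments": {"source_b": 0.8},
    "work history": {"source_b": 0.7, "source_c": 0.7},
    "legal issues": {"source_d": 0.95},
    "abuse": {},  # no source identified yet
}


def best_sources(candidates):
    """Pick the top-scoring source per object; report gaps and ties."""
    best, unsourced, ties = {}, [], []
    for obj, scored in candidates.items():
        if not scored:
            unsourced.append(obj)  # "All sources identified?" fails here
            continue
        top = max(scored.values())
        winners = sorted(s for s, v in scored.items() if v == top)
        best[obj] = winners[0]
        if len(winners) > 1:
            ties.append(obj)  # differences between sources need reconciling
    return best, unsourced, ties


if __name__ == "__main__":
    best, unsourced, ties = best_sources(candidates)
    print(best)
    print("no source:", unsourced)
    print("tied:", ties)
```

Objects with no source and objects with tied sources are the ones that need a steward's decision rather than a rule.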
48. Senior Army Official
• Room full of Stewards
• A very heavy dose of management support
• Advised the group of his opinion on the matter
• Any questions as to future direction
– "They should make an appointment to speak directly with
me!"
• Empower the team
– The conversation turned from "can this be done?" to "how are we going
to accomplish this?"
– Mistakes along the way would be tolerated
– Implement a workable solution in prototype form
49. Managing Data with Guidance?
• Federal employees
• 44 users from whitehouse.gov
• Thousands of military and
government e-mails
• Canadian citizens
• One-fifth of Quebec
50.
Ashley Madison: 37,000,000
OPM: 25,000,000
Target: 70,000,000
Target Corporation's Database Contents
• Your age
• Marital status
• Part of town you live in
• How long it takes you to drive
to work
• Estimated salary
• If you have recently moved
• Credit cards carried in your
wallet
• What websites you visit
• Your ethnicity
• Your job history
• The magazines you read
• Work commute
• Sexual preferences
• If you’ve ever declared
bankruptcy or got divorced
• The year you bought (or lost)
your house
• Where you went to school(s)
• What kinds of topics you talk
about online
• Whether you prefer certain
brands of coffee, paper
towels, cereal or applesauce
• Your political leanings, reading habits, and charitable giving
• The number of cars you own
51. How the Government Jeopardized Our National Security for More than a Generation
https://oversight.house.gov/report/opm-data-breach-government-jeopardized-national-security-generation/
Key Findings
• Preventable
• Leadership failed
– To heed repeated recommendations
– To sufficiently respond to growing threats of sophisticated cyber attacks, and
– To prioritize resources for cybersecurity
• 2014 data breaches were likely connected and possibly coordinated with the 2015 data breach
• OPM misled the public on the extent of the damage of the breach and made false statements to Congress
Data Quality Success Stories - Program Overview
1. Data quality must be understood as
an engineering challenge
2. Putting a price on data quality
3. DM BoK components complement
each other well
4. Savings based stories
5. Innovation based stories
6. Non-monetary stories
7. Takeaways and Q&A
52. Takeaways
• Quality data requires a context-specific definition
• Most business problems have data challenges (hidden data factories) at their root
• All advanced data practices depend on quality data
• AI/ML are suffering from a lack of training data
• Few 'easy' fixes exist
• Data quality engineering works well when combined with other DM BoK 'pie wedges'
• Successful data quality stories demonstrate
– Tangible ongoing savings
– Innovative data uses
– Outcomes more important than money
Questions?
53. 10124 W. Broad Street, Suite C
Glen Allen, Virginia 23060
804.521.4056