Presentation given to the BCS Data Management Specialist Group on 10th April 2018.
Data quality “tags” are a means of informing decision makers about the quality of the data they use within information systems. Unfortunately, these tags have not been successfully adopted because of the expense of maintaining them. This presentation will demonstrate an alternative approach that achieves improved decision making without the costly overheads.
Data quality in decision making - Dr. Philip Woodall, University of Cambridge
1. Dr Philip Woodall
Senior Research Associate
Distributed Information and Automation Lab
Department of Engineering, University of Cambridge
Data quality and decision making
3. Contents
• Data repurposing
– Data quality problems found when applying data
analytics in manufacturing
• ITALI – IT Architectures for Logistics
– New IT architecture for the sponsor company
– Aligning the physical process and the data
4. Related projects
• Data repurposing
– Data quality problems found when applying data
analytics in manufacturing
• ITALI – IT Architectures for Logistics
– New IT architecture for the sponsor company
– Aligning the physical process and the data
5. Retailers Face Inconsistent Product Data in E-Commerce Efforts
E.g. shoes described as “sneakers” on one website and as “trainers” on another
Poor data consequences: data scientists spend inordinate amounts of time correcting data before it can be used for analysis/decisions.
Some DQ problems arise because of the way we are now attempting to use data…
6. Analytics: a different use of data
• Reuse = using data for the same or similar task again
• Repurposing = using data for a completely different task
7. Data repurposing
• When the use of the data changes so do the
data quality requirements
• How do you know when data is good enough
quality to be used for analytics?
• We conducted a survey to investigate the issues.
8. A survey of the DQ problems that arise when data is repurposed in manufacturing
• What do manufacturers repurpose data for?
• Where do they get the data from?
• What data quality problems are faced when repurposing data?
• Solution: we produced a framework to help analyse the problems
9. Results: What do manufacturers repurpose data for?
• To calculate supplier performance, such as On-Time In Full (OTIF), using purchase order and goods receiving data.
• To perform a parts obsolescence risk assessment for all the parts on an aircraft using the bill of materials.
• To identify performance improvements to the production line and logistics operations.
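As an illustration of the first use, OTIF can be computed by joining purchase-order and goods-receiving records and counting orders that arrived both on time and in full. A minimal sketch, assuming illustrative field names (the talk does not specify a schema):

```python
from datetime import date

# Hypothetical joined purchase-order / goods-receiving records.
orders = [
    {"ordered_qty": 100, "received_qty": 100,
     "due": date(2018, 3, 1), "received": date(2018, 2, 28)},
    {"ordered_qty": 50, "received_qty": 45,
     "due": date(2018, 3, 5), "received": date(2018, 3, 4)},   # not in full
    {"ordered_qty": 20, "received_qty": 20,
     "due": date(2018, 3, 7), "received": date(2018, 3, 9)},   # not on time
]

def otif(orders):
    """Fraction of orders delivered both on time and in full."""
    hits = sum(1 for o in orders
               if o["received"] <= o["due"]
               and o["received_qty"] >= o["ordered_qty"])
    return hits / len(orders)

print(otif(orders))  # 1 of 3 orders qualifies
```

Note that this metric is exactly where the DQ problems described later bite: a dummy ‘actual delivery date’ makes every order look on time.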
11. Results: Where do they get the data from?
Transport mechanism: Data is extracted into a
spreadsheet and emailed to the analysts.
12. Results: example DQ problems arising when using repurposed data
• Dummy data: ‘actual delivery date’ is a copy of ‘expected delivery date’, making it appear that more data is available.
• No synchronisation: updates made in local spreadsheets are not sent back to the original system.
• Unknown assumptions: the analyst is not aware of how the data was collected or pre-processed.
• Unhelpful data cleansing: analysts must convert the data back again (e.g. cm to m, 1 box = 50 items).
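One simple screen for the dummy-data problem is to measure how often ‘actual’ equals ‘expected’ exactly; a rate near 100% suggests copied values rather than genuine measurements. A rough sketch (field names and the 90% threshold are illustrative assumptions):

```python
def dummy_date_rate(rows, expected="expected_delivery", actual="actual_delivery"):
    """Share of rows where the actual date exactly equals the expected date.
    A rate near 1.0 suggests the 'actual' field is a dummy copy."""
    matches = sum(1 for r in rows if r[actual] == r[expected])
    return matches / len(rows)

# Hypothetical extract where every actual date mirrors the expected date.
rows = [
    {"expected_delivery": "2018-03-01", "actual_delivery": "2018-03-01"},
    {"expected_delivery": "2018-03-05", "actual_delivery": "2018-03-05"},
    {"expected_delivery": "2018-03-07", "actual_delivery": "2018-03-07"},
    {"expected_delivery": "2018-03-09", "actual_delivery": "2018-03-09"},
]

rate = dummy_date_rate(rows)
if rate > 0.9:  # illustrative threshold
    print(f"Warning: {rate:.0%} exact matches - possible dummy data")
```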
13. A framework of DQ problems faced when
repurposing manufacturing data
14. Assessing and improving data quality
• Hybrid Approach:
– Discover and measure errors
• TIRM (Total Information Risk Management):
– Assess risks (costs) of poor data quality
– Simulate mitigation actions
– Select the most appropriate actions
15. A model for information risk analysis
16. Related projects
• Data repurposing
– Data quality problems found when applying data
analytics in manufacturing
• ITALI – IT Architectures for Logistics
– New IT architecture for the sponsor company
– Aligning the physical process and the data
17. ITALI - IT Architectures for Logistics Integration
ITALI aims to…
…investigate how existing logistics-related information systems must
evolve to address future logistics needs
Project sponsor
…by exploring
A: Mismatches between physical operations and data
B: How the existing IT systems can be organised into an architecture
that supports the next generation logistics issues (B2B->B2C).
Outputs
1: A new state of the art IT architecture for
logistics and warehousing
2: New concept: Potential Problem Data
Tagging.
For avoiding disruptions caused by data
mismatches
3: A framework for supporting both B2B and B2C commerce.
18. Key requirements for the architecture
• How to integrate data from differing systems
– To generate analytics reports for the organisation
• IT architecture must align with the CEO’s vision for the company
– Requires sophisticated (and flexible) connections between information systems
19. Outcomes
• Key barriers facing organisations when attempting to generate analytics reports:
1. Data must be integrated from different systems
2. Master data management needs to be in place
3. Differences in data models between systems
• Can make it very difficult to write queries to extract data when it is needed for another purpose (e.g. aggregated data in one field)
4. Trivial data quality problems at data entry can render the entire process useless
20. Related projects
• Data repurposing
– Data quality problems found when applying data
analytics in manufacturing
• ITALI – IT Architectures for Logistics
– New IT architecture for the sponsor company
– Aligning the physical process and the data
21. Potential Problem Data Tagging
• How to make sure that both process and data are aligned
• One example: pickers misplace items in the warehouse
23. Approach: Potential Problem Data Tagging
• Tag the data with a level of accuracy
– Count the number of times the data has been exposed to an event that could cause it to become inaccurate.
• Only pick from the most accurate locations

Location | Item type | Item quantity | Tag
1        | A         | 30            | 0.05
2        | A         | 20            | 0
3        | B         | 15            | 0.0975
4        | B         | 4             | 0.14265
5        | -         | 0             | 0
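The tags in the example table are consistent with treating each exposure as an independent chance of error: with per-event error probability p, a location exposed to n risky events gets tag = 1 − (1 − p)^n (with p = 0.05, n = 1, 2, 3 gives 0.05, 0.0975, ≈0.1426; the last table entry appears to round this). A sketch under that assumption, which is inferred from the numbers rather than stated in the talk:

```python
def tag(exposures, p=0.05):
    """Probability the location's data has become inaccurate after
    `exposures` independent error-prone events, each with probability p."""
    return 1 - (1 - p) ** exposures

# Locations as (location_id, item_type, quantity, exposure_count),
# mirroring the example table above.
locations = [
    (1, "A", 30, 1),   # tag 0.05
    (2, "A", 20, 0),   # tag 0
    (3, "B", 15, 2),   # tag 0.0975
    (4, "B", 4, 3),    # tag ~0.1426
    (5, None, 0, 0),
]

def best_location(item_type):
    """Pick the stocked location for an item with the lowest tag."""
    candidates = [(tag(n), loc) for loc, t, qty, n in locations
                  if t == item_type and qty > 0]
    return min(candidates)[1]

print(best_location("A"))  # location 2 (tag 0)
print(best_location("B"))  # location 3 (lower tag than location 4)
```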
24. Results of a simulation
• 100 picks
• An extra 3 to 4 disruptions avoided compared to normal picking
[Chart: mean number of disruptions encountered (0–8), “Normal” vs “Avoid”, across error rates of 1%, 5% and 20% and 2, 12 and 60 degrees of freedom]
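The comparison could be reproduced with a small Monte Carlo experiment along these lines; the exact set-up (how tags accumulate, how corruption occurs) is my assumption, not taken from the talk:

```python
import random

def simulate(picks=100, n_locations=50, error_rate=0.05, avoid=False, seed=0):
    """Count disruptions (picks from a location whose data has become
    inaccurate).  avoid=False picks at random; avoid=True picks the
    location with the fewest error-prone exposures, i.e. the lowest tag."""
    rng = random.Random(seed)
    exposures = [0] * n_locations    # per-location count of risky events
    inaccurate = [False] * n_locations
    disruptions = 0
    for _ in range(picks):
        if avoid:
            loc = min(range(n_locations), key=lambda i: exposures[i])
        else:
            loc = rng.randrange(n_locations)
        if inaccurate[loc]:
            disruptions += 1
        # each pick is itself an event that may corrupt the location's data
        exposures[loc] += 1
        if rng.random() < error_rate:
            inaccurate[loc] = True
    return disruptions

normal = simulate(avoid=False)
avoiding = simulate(avoid=True)
```

Averaged over many seeds, the tag-avoiding policy visits the least-exposed (least likely corrupted) locations and so encounters fewer disruptions, which is the effect the chart summarises.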
25. Results of a simulation
• The tags can also be used to find inaccuracies
• Even greater performance: an extra 6 to 7 inaccuracies can be found compared to normal
[Chart: mean number of inaccuracies found (0–15), “Normal” vs “Find”, across error rates of 1%, 5% and 20% and 2, 12 and 60 degrees of freedom]
26. Dr Philip Woodall
Senior Research Associate
Distributed Information and Automation Lab
Department of Engineering, University of Cambridge
Thank you
27. Related papers
Repurposing
Woodall, P. (2017). The Data Repurposing Challenge: New Pressures from Data Analytics. Journal of Data and Information Quality.
Assessing and improving data quality
Woodall, P., Borek, A. and Parlikad, A. (2013). Data quality assessment: The Hybrid Approach. Information & Management, 50 (7), pp. 369–382.
Borek, A. et al. (2014). A risk based model for quantifying the impact of information quality. Computers in Industry, 65 (2), pp. 354–366.
Potential Problem Data Tagging
Woodall, P. et al. (2016). Data State Tracking: labelling good quality data to improve warehouse operations. In International Conference on Information Quality (ICIQ). Ciudad Real, Spain.
Editor’s notes
Wall Street:
Problems can arise from the way search engines or e-commerce marketplaces pick up product data, Mr. Hogan says. For example, if a particular pair of shoes is described on one website as “sneakers” and on another as “trainers”, the customer faces the possibility of seeing the same product under two different names.