A lack of trust is inhibiting the adoption of #AI. This presentation discusses approaches to delivering trusted data pipelines for AI and machine learning.
5. Data delivers competitive advantage
“Compared with their peers, high
performers report a greater variety
of actions to monetize data – with
greater revenue impact”
- McKinsey Global Survey: Fueling growth through data
monetization
73.2%
Percentage of executives whose firms
have achieved measurable results from
Big Data and AI investments
- NewVantage Partners Big Data Executive Survey 2018
$1.8 Trillion
Projected annual revenue for
insights-driven businesses by 2021
- “Insights-Driven Businesses Set the Pace for Global
Growth,” Forrester, October 19, 2018
85%
Firms that leverage customer behavioral
insights outperform peers by 85 percent
in sales growth and 25 percent in gross
margin
- McKinsey Global Survey: Capturing value from your
customer data
6. Common machine learning applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
7. Why do you have a data lake?
Syncsort 2019 data trends survey
Analytics Use Cases
Drive Data Lakes
and Enterprise
Data Hubs
8. Most organisations not getting full value
Syncsort 2019 data trends survey
91% of organizations
have not yet reached
a “transformational”
level of maturity in
data and analytics
- Gartner
68% of IT professionals
state that data silos
negatively impact their
organization’s ability to
get value from their data
• Every part of the
business demands
sophisticated data
analysis
• Departments need
access to the
company’s many data
sets, combined in
different ways
• IT can’t be a bottleneck
• Data has outgrown the
data warehouse
• Data lakes can be
polluted and chaotic
• Data is inconsistent
across data marts
9. Key challenges
Syncsort 2019 data trends survey
only 9% “very effective” in
getting value from data
IT decision makers waste 2 hours
daily looking for relevant data
10. Three-pronged approach
Make data easier to find and understand
• Data governance
• Data catalog
Flexible data pipelines
• Batch and streaming
• Legacy, big data and cloud
Debug your data
• Manage bias
• Manage data quality at scale
• Governance / Traceability
11. Data Architecture
Metadata/Data Modelling
Data Security
Data Integration
MDM/Reference Data
Data Quality
Data Governance
Business Intelligence
Data Warehouse
Big Data
AI and ML
Business-driven
IT-driven
13. Data Governance and Catalog
AI, Big Data, and Data Governance // Stan Christiaens, Collibra
(FirstMark's Data Driven)
• The differentiator for #AI is DATA
• Bias is like “a snake in the data grass”
• Finding data is a “people and process” problem
• Data (if you treat it as a strategic asset) should
have its own business process
16. Data Scientist
• Expert in statistical analysis, machine
learning techniques, finding answers to
business questions buried in datasets.
• Does NOT want to spend 50 – 90% of their
time tinkering with data, getting it into
good shape to train models – but
frequently does, especially if there’s no
data engineer on their team.
• Once a machine learning model is trained,
tested, and proven to accomplish the
goal, turns it over to a data engineer to
productionize. Not skilled at taking the
model from a test sandbox into
production, especially not at large scale.
Data Engineer
• Expert in data structures, data
manipulation, and constructing production
data pipelines.
• WANTS to spend all of their time working
with data, but usually has more on their
plate than they can keep up with. Anything
that will speed up their work is helpful.
• In most successful companies, is involved
from the beginning. First gathers, cleans
and standardizes data, helps data scientist
with feature engineering, and provides
top-notch data, ready to train models.
• After the model is tested, builds robust,
high-scale data pipelines to feed the models
the data they need in the correct format in
production to provide ongoing business
value.
Data Engineer to the rescue
17. Identify and onboard all relevant data
Data Lake or Cloud
Raw Landing Zone
Access & Onboard – Elect to include data to understand
• What you don’t know CAN hurt you – e.g. bias
• If you’ve left it out, you cannot know it exists
• Data sets have more power to predict when combined
18. Ensure the quality
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Refine – cleanse, enrich, de-duplicate
• What data needs refinement? – use cases will determine
• Each data set should be refined once – don’t repeat work
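The refine step above (cleanse, enrich, de-duplicate) can be sketched in a few lines of Python. This is a minimal illustration only, not the tooling the presentation describes: the record fields, the `COUNTRY_NAMES` reference table, and the choice of email as the de-duplication key are all assumptions made for the example.

```python
import re

# Hypothetical raw records as they might arrive in the raw landing zone.
RAW = [
    {"name": "  Ada Lovelace ", "email": "ADA@Example.com", "country": "uk"},
    {"name": "Ada  Lovelace", "email": "ada@example.com ", "country": "UK"},
    {"name": "Alan Turing", "email": "alan@example.com", "country": "uk"},
]

# Hypothetical reference data used for enrichment.
COUNTRY_NAMES = {"UK": "United Kingdom"}


def cleanse(record):
    """Collapse repeated whitespace, trim, and normalize letter case."""
    return {
        "name": re.sub(r"\s+", " ", record["name"]).strip(),
        "email": record["email"].strip().lower(),
        "country": record["country"].strip().upper(),
    }


def enrich(record):
    """Add a derived field looked up from reference data."""
    enriched = dict(record)
    enriched["country_name"] = COUNTRY_NAMES.get(record["country"], "Unknown")
    return enriched


def deduplicate(records, key="email"):
    """Keep the first record seen for each value of the key field."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def refine(records):
    """Cleanse and enrich every record, then drop duplicates."""
    return deduplicate([enrich(cleanse(r)) for r in records])


REFINED = refine(RAW)
```

Cleansing runs before de-duplication on purpose: the two "Ada Lovelace" rows only become recognizable as duplicates once case and whitespace are normalized.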
19. Understand provenance
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Track Provenance
• Data lineage documentation is necessary for establishing that data can be
trusted, and for auditing and regulatory compliance
• Also, useful for reproducing steps in production machine learning
data pipelines
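One way to picture lineage that supports both auditing and reproduction is a log of named steps with a hash of each step's output. The sketch below is an assumption-laden illustration, not how any particular lineage product works: the step names, the `STEPS` registry, and the source label `crm_export.csv` are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable hash of a data set, so an auditor can verify what a step produced."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Registry of named transformation steps (hypothetical names).
STEPS = {
    "lowercase_email": lambda rows: [{**r, "email": r["email"].lower()} for r in rows],
    "drop_missing_email": lambda rows: [r for r in rows if r["email"]],
}


def run_with_lineage(rows, step_names, source):
    """Apply the steps in order and record one lineage entry per step."""
    lineage = [{"step": "source", "source": source, "output_hash": fingerprint(rows)}]
    for name in step_names:
        rows = STEPS[name](rows)
        lineage.append({"step": name, "output_hash": fingerprint(rows)})
    return rows, lineage


def replay(rows, lineage):
    """Apply the recorded steps, in the recorded order, to another data set."""
    for entry in lineage[1:]:  # skip the source entry
        rows = STEPS[entry["step"]](rows)
    return rows


training_rows = [{"email": "ADA@Example.com"}, {"email": ""}]
refined, lineage = run_with_lineage(
    training_rows, ["lowercase_email", "drop_missing_email"], source="crm_export.csv"
)

# In production, the same recorded steps are replayed on new data.
production_rows = replay([{"email": "Bob@Example.org"}], lineage)
```

The hashes serve the audit side (the recorded output can be re-checked later), while `replay` serves the reproduction side (production data passes through exactly the steps used in training).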
20. Enrich and grow
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Shop for data sets, features & validate against your questions
• Analyst, data scientist shops for data
• What do I need for my purpose?
• Quality is already assured, provenance documented
• Improves trust, saves time
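"Shopping" for data can be pictured as a search over catalog entries that already carry quality and provenance flags. The following is a toy sketch under stated assumptions: the `CatalogEntry` fields, the data set names, and the tag vocabulary are invented for illustration and do not reflect any specific catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: set = field(default_factory=set)
    quality_checked: bool = False
    lineage_documented: bool = False


# Hypothetical catalog contents.
CATALOG = [
    CatalogEntry("customers_refined", "De-duplicated customer master",
                 {"customer", "churn"}, True, True),
    CatalogEntry("web_clicks_raw", "Unrefined clickstream events",
                 {"customer", "web"}, False, False),
    CatalogEntry("transactions_refined", "Cleansed card transactions",
                 {"fraud", "aml"}, True, True),
]


def shop(catalog, tag, trusted_only=True):
    """Return names of data sets matching a tag, optionally only trusted ones."""
    results = []
    for entry in catalog:
        if tag not in entry.tags:
            continue
        trusted = entry.quality_checked and entry.lineage_documented
        if trusted_only and not trusted:
            continue
        results.append(entry.name)
    return results


churn_candidates = shop(CATALOG, "customer")
everything_tagged_customer = shop(CATALOG, "customer", trusted_only=False)
```

Defaulting to `trusted_only=True` reflects the slide's point: the analyst sees refined, documented data first and has to opt in to anything else.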
21. Challenges of Engineering Modern Data Pipelines
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS,
web clicks, etc., all in incompatible formats, making it difficult to gather and
prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at
scale. Most data quality tools are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific
entity (person, company, product, etc.) requires sophisticated multi-field
matching algorithms and a lot of compute power. Essentially everything has to
be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in
production, so that models make accurate predictions on new data
and the required audit trails exist. Complete lineage has to be captured, from
source to end point.
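To make the entity resolution challenge concrete, here is a small sketch of multi-field matching in Python. It adds blocking, a common technique for reducing the "everything compared to everything" cost by only comparing records that share a cheap key; blocking is not mentioned in the slides. The records, field weights, threshold, and the postcode-based blocking key are all assumptions for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records; ids 1 and 2 describe the same person.
RECORDS = [
    {"id": 1, "name": "Jonathan Smith", "city": "Leeds", "postcode": "LS1 4AP"},
    {"id": 2, "name": "Jonathon Smith", "city": "Leeds", "postcode": "LS1 4AP"},
    {"id": 3, "name": "Maria Garcia", "city": "Leeds", "postcode": "LS1 4AP"},
    {"id": 4, "name": "Jonathan Smith", "city": "York", "postcode": "YO1 7HH"},
]

# Assumed field weights and threshold; real values need tuning on labeled data.
WEIGHTS = {"name": 0.6, "city": 0.2, "postcode": 0.2}
THRESHOLD = 0.9


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(left, right):
    """Weighted multi-field similarity of two records."""
    return sum(w * similarity(left[f], right[f]) for f, w in WEIGHTS.items())


def block_key(record):
    """Blocking key: the first part of the postcode."""
    return record["postcode"].split()[0]


def find_matches(records):
    """Compare records only within the same block and return matching id pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    matches = []
    for group in blocks.values():
        for left, right in combinations(group, 2):
            if score(left, right) >= THRESHOLD:
                matches.append((left["id"], right["id"]))
    return matches


MATCHES = find_matches(RECORDS)
```

Record 4 shares a name with record 1 but sits in a different block, so the pair is never compared. That is the trade-off of blocking: far fewer comparisons, at the risk of missing matches whose blocking key differs.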
22. Onboard any data
Diagram labels: Data Sources, Data, Data Lake
• Access data from streaming and batch sources outside cluster.
• Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance.
24. Hybrid and Multi-Cloud
Strategies
• Ensure seamless data flow
to/from cloud, and among clouds
• Maximize choice for workload
optimization and interoperability
• Design once, deploy anywhere –
on premise and in the cloud
• Optimize cloud infrastructure for
cost and efficiency
• Minimize disruption and risk
• Build new skills to handle
different and emerging portfolios
Challenges
• Managing multiple clouds and
vendors
• Integrating data and applications
on-premise to cloud, across clouds
• Avoiding cloud lock-in
• Lack of skills to handle hybrid
multi-cloud world
• Cloud native or cloud first
for new applications
• Scalability and elasticity
• Hybrid: on-premises systems
and public and private
clouds
• Multi-cloud
• Cloud shifts focus from tech
details to business process
25. Seamlessly flow data to, from
and among clouds
Design Once, Deploy Anywhere – Public cloud, Private Cloud, Multi-Cloud, Hybrid or On-Prem
• Build a modern data pipeline with flexibility, agility
and elasticity
• Simplify accessing, integrating, governing your data
in a single software environment
• Get the most from the Cloud – no silos, no lock-in, no
re-work
• Move between on-premise and Cloud, or between
Clouds, with no re-design, no re-compile and no re-work
• Get excellent performance every time – without
tuning, load balancing, etc.
• Future-proof your applications
26. How dirty data hampers AI
Dimensional Research
• Cleanse, enrich, de-duplicate
• What data needs refinement? – use cases will determine
• Match records across massive datasets that indicate a single specific
entity (person, company, product, etc.)
27. Only 35% of senior
executives have a
high level of trust in
the accuracy of
their Big Data
Analytics*
92% of executives are
concerned about
the negative impact
of data and
analytics on
corporate
reputation*
Cost of poor data
quality rose by 50%
in 2017
(Gartner)
84% of CEOs
are concerned
about
the quality of the
data they’re basing
decisions on*
• Decision making – Trust the
data that drives your
business
• Machine learning & AI –
Train your models on
accurate data
• Customer centricity – Get a
single, complete and
accurate view of customer
for better sales, marketing
and service
• Compliance – Know your
data, and ensure its
accuracy to meet industry
and government regulations
The Modern Data Pipeline Needs Data Quality
*http://kpmg.com/guardiansoftrust
28. Common Data Quality Problems
• Many data records with different
layouts
• Lack of standardization of the
different fields
• Misspellings
• Data sourced from third parties
does not contain all the necessary
fields
• Inconsistent data formats
(measurements, languages,
postal conventions and dates)
• Names spelled differently
• Different number formatting
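Two of the problems listed above, inconsistent date formats and different number formatting, lend themselves to a short standardization sketch. The accepted date formats and the rule for ambiguous numbers below are assumptions chosen for the example; a real pipeline would take them from the documented conventions of each source.

```python
from datetime import datetime

# Assumed input formats, tried in order. The order is a policy decision:
# a value that fits more than one format resolves to the first that parses.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def standardize_date(text):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_number(text):
    """Parse '1,234.50' or '1.234,50' into a float; None if unparseable or ambiguous."""
    cleaned = text.strip().replace(" ", "")
    if "," in cleaned and "." in cleaned:
        # Assumption: whichever separator comes last is the decimal mark.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        tail = cleaned.rpartition(",")[2]
        if len(tail) == 3:
            # '1,234' could mean 1234 or 1.234, so flag it for review.
            return None
        cleaned = cleaned.replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Returning `None` for values that cannot be resolved safely keeps the bad records visible for review instead of silently guessing.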
29. Common Data Quality Problems at Scale
Common
Challenges
• Big Data projects require:
Massive scalability
Low latency
Many data sources for a
complete view
• Data Quality processing
using a standalone server
can’t keep up
Millions of business
transactions a day are
now common
Standalone quality projects
may take several hours;
unlikely to meet end user
SLAs and/or key success
factors
Solution
Trillium Quality for Big Data
enables you to leverage the
power and scalability of Big
Data frameworks like
Spark, MapReduce
Performs data quality jobs
natively on the cluster
Leverages Intelligent Execution
– design once, deploy
anywhere – cloud, multi-
cloud, hybrid or on prem
No need to move/copy data for
quality processing; Big Data
remains in place
No coding or tuning; jobs are
automatically optimized
Benefits
• Data Pipeline delivers trusted
data for analytics
• Robust data quality processing
at Big Data scale to meet SLAs,
support use cases like Anti-
Money Laundering or
Customer 360
• No coding or tuning saves
time and resources – and
helps address Big Data skills
shortages
• Save time and network
resources by keeping data in
place
30. Cleanse data in Hadoop / Cloud
Diagram labels: Data Sources, Data, Data Lake
• Access data from streaming and batch sources outside cluster.
• Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance.
• Transform, join, cleanse and enhance data in cluster with Spark or MapReduce. Excellent performance every time.
31. Get end-to-end data lineage
Diagram labels: Data Sources, Data, Data Lake, Data Lineage, REST API
• Access data from streaming and batch sources outside cluster.
• Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance.
• Transform, join, cleanse and enhance data in cluster with Spark or MapReduce. Excellent performance every time.
• Pass source-to-cluster data lineage info to REST API and Navigator or Atlas.
• Data changes separately made by MapReduce, Spark, HiveQL.
• Navigator or Atlas gathers any other changes made to data on cluster.
33. Analysts Get Complete Picture with Trusted Data Provenance
Diagram labels: Data Sources, Data, Data Lake, Data Lineage, REST API, Clean Complete Data, Analytics, Visualization, Machine Learning
• Access data from streaming and batch sources outside cluster.
• Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance.
• Transform, join, cleanse and enhance data in cluster with Spark or MapReduce. Excellent performance every time.
• Pass source-to-cluster data lineage info to REST API and Navigator or Atlas.
• Data changes separately made by MapReduce, Spark, HiveQL.
• Navigator or Atlas gathers any other changes made to data on cluster.
• Auditors get end-to-end data lineage.
• Analytics, visualizations, and machine learning algorithms get clean, complete data.
34. Forrester Research
The path to enterprise AI is full of twists
and turns, false starts, and lessons to
learn.
Surely without data quality, AI and
other advanced technologies cannot
live up to their expectations.
The Refined Zone may be another cluster, another part of the same cluster, a Cloud, an analytic database, wherever the data sets can be easily stored and found by the people who need them. Select data sets based on use cases. Start with a use case that requires relatively few data sets and/or has relatively high business value. Get immediate ROI for that use case, then move to the next. Once a data set has been refined, it’s there for other use cases that might need the same data. Build on that by refining additional data sets for the next use case. And so on.
That’s a data marketplace, and why you need one.
IT is transforming to handle a combination of on-premises systems, infrastructure-as-a-service, platform-as-a-service, and software-as-a-service. The best architecture keeps choices affordable, so that using multiple cloud vendors is just as easy and powerful as using a single cloud. Going all-in on one cloud architecture puts IT in the same weak, single-source position that many customers of companies such as Oracle find themselves in today. No matter what the current management of those vendors says, future managers will exploit this weakness to increase revenue.
It is crucial to do as much as possible of the detailed work (complex programming, rules, transformations, and other forms of coding) in ways that protect you from changes in the underlying infrastructure. Ideally, that logic is expressed in a system that can operate on-premises or in any cloud.
Syncsort Connect for Big Data is specifically designed to simplify the process of accessing, integrating, governing and securing all your enterprise data – batch and streaming – in a single software environment. With Connect for Big Data you can:
• Visually design your jobs once, and deploy them anywhere – MapReduce, Spark, Linux, Unix, Windows – on premise or in the cloud. No changes or tuning required.
• Easily move applications from standalone server environments and from MapReduce to Spark, as easily as clicking on a drop-down menu
• Future-proof job designs for emerging compute frameworks
• Avoid tuning: Intelligent Execution dynamically plans for applications at run-time based on the chosen compute framework
• Insulate your users from the underlying complexities of Hadoop and use existing ETL skills
• Cut development time in half