Organising The Data Lake
- Information Management In A Big Data World
Mike Ferguson
Managing Director
Intelligent Business Strategies
Hadoop Summit
Dublin, April 2016
Copyright © Intelligent Business Strategies 1992-2016
About Mike Ferguson
Mike Ferguson is Managing Director of
Intelligent Business Strategies Limited. As an
analyst and consultant he specialises in
business intelligence, data management and
enterprise business integration. With over 34
years of IT experience, Mike has consulted for
dozens of companies, spoken at events all over
the world and written numerous articles.
Formerly he was a principal and co-founder of
Codd and Date Europe Limited – the inventors
of the Relational Model, a Chief Architect at
Teradata on the Teradata DBMS and European
Managing Director of DataBase Associates.
www.intelligentbusiness.biz
mferguson@intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Topics
▪ The data integration complexity
▪ The siloed approach to managing and governing data
▪ A new inclusive approach to governing and managing data
▪ Introducing the data reservoir and data refinery
▪ How does a data reservoir and data refinery work?
▪ Mapping new data and insights into your shared business vocabulary
▪ The mission critical importance of an information catalog in a distributed data landscape
▪ Integrating data reservoirs and data refineries into your existing environment
The Changing Landscape – We Now Have Different Platforms Optimised For
Different Analytical Workloads
[Diagram: platforms shown: streaming data; Hadoop data store (advanced analytics on multi-structured data); EDW, DW and marts on a data warehouse RDBMS; DW appliance and analytical RDBMS (advanced analytics on structured data); NoSQL graph DBMS; MDM (CRUD on product, asset and customer data). Workloads shown: traditional query, reporting and analysis; real-time stream processing and decision management; data mining and model development; investigative analysis and data refinery; graph analysis.]
Big Data workloads result in multiple platforms now being needed for analytical processing
Data Integration Today Has Become Much More Complex
- Popular Data Integration Paths Between Platforms
[Diagram: data integration paths between sources (ERP, CRM, SCM and operational systems; XML and JSON; social, web and web-log data; cloud data may also be part of it) and platforms (EDW, data marts, DW appliance, analytical DBMS, MDM system with CRUD on product, asset and customer data, graph DBMS, column family and document NoSQL DBs), carrying transaction data and insights.]
Issues: Siloed Analytics - Different Tools To Manage And Integrate Data For
Each Type Of Analytical And MDM Store
[Diagram: five silos, each with its own analytical tools or applications and its own data management tools: EDW, DW and marts (structured data from CRM, ERP, SCM); a multi-structured data store; a DW appliance for advanced analytics on structured data (CRM, ERP, SCM); a NoSQL DB, e.g. graph DB (multi-structured and structured data); and MDM (master data management of product, asset and customer data for CRM, ERP and SCM applications).]
Issues: Data Deluge - Data Is Arriving Faster Than We Can Consume It
[Diagram labels: data filter; enterprise; enterprise systems.]
With Thousands Of Data Sources, IT And Business Need To Work Together As IT
Will Likely Become A Bottleneck
[Diagram: IT runs the DQ/DI jobs that connect OLTP systems, web logs, open data, IoT machine data, and social and web data to MDM (CRUD on product, customer and asset data), data warehousing, Big Data, cloud and data virtualisation platforms.]
Bottleneck? Should IT be expected to do everything?
Can business analysts and data scientists help?
Issues: Have You Got Self-Service Data Integration Causing Chaos In The
Enterprise?
[Diagram: ETL/DQ jobs built by IT load SCM, CRM and ERP data into the DW and marts, while data scientists use their own ETL/DQ to load social, web, web-log and cloud data into HDFS sandboxes, and self-service BI tools with built-in ETL produce further marts and new insights, including via SQL on Hadoop.]
Problems With The Current Approach
▪ Project-oriented, siloed approach to DI/DQ with limited collaboration
▪ Cost of data integration is too high
▪ Slow speed of development
▪ Multiple DI/DQ technologies and techniques being used that are not integrated
▪ Lots of re-invention rather than re-use
▪ Fractured metadata across multiple tools or no metadata at all in some cases
▪ Risk of duplicate, inconsistent DI/DQ rules for the same data
▪ Metadata lineage is unavailable in many places, especially with hand-coded Big Data DI/DQ applications
▪ Multiple skill sets fractured across different projects
▪ Repetition of our mistakes, e.g. Big Data preparation
[Diagram: separate DQ/DI jobs feeding the EDW, MDM (CRUD on product, asset and customer data), cloud, data virtualisation and self-service tools.]
There has to be a better, more governed
way to fuel productivity and agility without
causing data inconsistency and chaos
[Diagram: separate DQ/DI jobs feeding the EDW, MDM (CRUD on product, asset and customer data), cloud, data virtualisation and self-service tools.]
Tools are available but are not well integrated
Also the whole collaborative, metadata and information catalog piece is incomplete
IT IS NOT ENOUGH – THE WHOLE THING HAS TO BE CO-ORDINATED
We Are All In The Same Boat!
– Everyone For Themselves Is Not An Option
IT Data Architect, Data Scientist, IT Developer, Business Analyst
Information Management
– Introducing The Data Lake (Reservoir)
What Is A Data Reservoir? - A Collaborative, Governed Environment Aimed At
Rapidly Producing Information
[Diagram: communities of IT data architects, IT developers, data scientists, domain experts and business analysts.]
Need to work together for competitive advantage
Chaos Is NOT An Option – Business Alignment Of Information Being Produced
Is Critical To Success
[Diagram: Big Data, DW and MDM projects aligned to strategic objectives and business strategy.]
• What problem are you trying to solve?
• What data do you need?
• What kind(s) of analytic workload are needed?
We need co-ordinated “info producer” projects in a managed environment
Key Capabilities In A Managed Data Reservoir - 1
▪ Data collection
• Automated discovery of the structure and formatting
• Data structure inferred by machine learning
• Automated cataloging, infinite storage and processing
▪ Data classification
• Determines how data should be governed
• Support is needed for different types of classification schemes, e.g.
Retention: Unclassified, Temporary, Project Lifetime, Managed period, Permanent
Confidential: Unclassified, Internal use, Business confidential, Supplier confidential, Sensitive (PII), Sensitive (Financial), Sensitive (Operations), Restricted (Trade secret)
Confidence: Unclassified, Raw (original), Obsolete, Archived, Trusted
Business Value: Unclassified, Unimportant, Marginal, Important, Critical, Catastrophic
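As a rough illustration of how classification could determine governance, the sketch below encodes three of the schemes listed on this slide as Python enums and derives governance actions from them. The enum values come from the slide; the class names, the `governance_actions` function and the action names are hypothetical and are not taken from any product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Retention(Enum):
    UNCLASSIFIED = "Unclassified"
    TEMPORARY = "Temporary"
    PROJECT_LIFETIME = "Project Lifetime"
    MANAGED_PERIOD = "Managed period"
    PERMANENT = "Permanent"


class Confidentiality(Enum):
    UNCLASSIFIED = "Unclassified"
    INTERNAL_USE = "Internal use"
    BUSINESS_CONFIDENTIAL = "Business confidential"
    SUPPLIER_CONFIDENTIAL = "Supplier confidential"
    SENSITIVE_PII = "Sensitive (PII)"
    SENSITIVE_FINANCIAL = "Sensitive (Financial)"
    SENSITIVE_OPERATIONS = "Sensitive (Operations)"
    RESTRICTED_TRADE_SECRET = "Restricted (Trade secret)"


class Confidence(Enum):
    UNCLASSIFIED = "Unclassified"
    RAW = "Raw (original)"
    OBSOLETE = "Obsolete"
    ARCHIVED = "Archived"
    TRUSTED = "Trusted"


@dataclass(frozen=True)
class Classification:
    retention: Retention = Retention.UNCLASSIFIED
    confidentiality: Confidentiality = Confidentiality.UNCLASSIFIED
    confidence: Confidence = Confidence.UNCLASSIFIED


def governance_actions(c: Classification) -> List[str]:
    """Derive governance actions from a classification (illustrative rules only)."""
    actions = []
    if c.confidentiality.value.startswith("Sensitive"):
        actions.append("mask_on_access")
    if c.retention is Retention.TEMPORARY:
        actions.append("purge_after_use")
    if c.confidence is not Confidence.TRUSTED:
        actions.append("quality_check_before_publish")
    return actions
```

Keeping the rules in one function keyed on classification, rather than on individual data sets, means newly ingested data picks up its governance as soon as it is classified.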
Key Capabilities In A Managed Data Reservoir - 2
▪ Collaborative data governance
• Data quality
• Data trustworthiness (confidence)
• Data protection
– Data privacy, access authorisation, lifecycle management
• Compliance
▪ Data refinery
• Systematically clean and refine data through various stages
• Manual and guided data preparation
• “Sandbox” analyse data to produce high value insights
▪ Data as a Service (DaaS)
• Published high value insights available for consumption
• Search for and discover trusted insights, subscribe to receive them
▪ Data consumption
• Provision refined, trusted, commonly understood data into any tool or application
A Data Reservoir Is An Organised Collection Of Raw, In-Progress And Trusted
Data (Multiple Data Stores)
[Diagram: data sources (RDBMS, NoSQL, files, office documents, XML and JSON, social, cloud, clickstream, web logs, web services, IoT feeds) feed a data reservoir made up of multiple stores (DW, data marts, ODS, MDM, RDM code sets, archived DW data, Hive tables, cloud object storage, ECM, staging areas, text/image/video, filtered sensor data, search indexes, in-progress data, published trusted data). The stores range from raw, untrusted data with some governance to refined, trusted and integrated data with strong governance, together with data virtualisation services and an information catalog.]
Data Reservoir (not a data store but a collection of stores)
Data sources and ingested reservoir data are all known to the catalog
Raw Data Is Being Collected In Multiple Places Across The Enterprise – We
Need To Know What’s Happening!
[Diagram labels: replicate, streaming, batch load, archive.]
We need to avoid unconnected silos
But we HAVE TO know what is being collected and
filtered and where that is happening
Also who is doing it, for what business purpose?
If Multiple Collection Points Exist Then Something Has To Catalog What Data Is
Available, Its Status And Where It Is
All data entering a reservoir needs to be catalogued and organised
You need to know what data is available across the enterprise, where it came from, what state it is in, whether it can be trusted and whether it can be ordered
Information Catalogue
A Distributed Data Reservoir Requires Information Management Software To
Work Across Multiple Data Stores
Enterprise Information Management (Catalog, DQ, ETL, Security, Privacy…)
The Data Reservoir is distributed but it should be managed
and function as if it were centralised
Key requirements
Define once, execute anywhere
Centralised metadata
Distributed execution of policies associated with data quality, ETL, security and lifecycle
management across the landscape (multiple execution engines)
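The “define once, execute anywhere” requirement can be sketched as a single policy definition that different execution engines interpret in their own way. In this minimal sketch the policy is a masking rule; one function applies it to rows in memory and another renders it as SQL to be pushed down to a database. `MaskingPolicy`, `run_in_memory` and `render_sql` are hypothetical names chosen for illustration and do not represent any vendor's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class MaskingPolicy:
    """A policy defined once, in central metadata."""
    name: str
    masked_columns: Tuple[str, ...]


def run_in_memory(policy: MaskingPolicy, rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Engine 1: apply the policy directly to rows held in memory."""
    return [
        {col: ("***" if col in policy.masked_columns else val) for col, val in row.items()}
        for row in rows
    ]


def render_sql(policy: MaskingPolicy, table: str, columns: Sequence[str]) -> str:
    """Engine 2: render the same policy as SQL to run where the data lives."""
    select_list = ", ".join(
        f"'***' AS {col}" if col in policy.masked_columns else col for col in columns
    )
    return f"SELECT {select_list} FROM {table}"


policy = MaskingPolicy(name="mask-contact-details", masked_columns=("email",))
```

Both engines read the same `policy` object, so a change to the central definition takes effect everywhere without re-implementing the rule per platform.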
A Distributed Data Reservoir Requires Management And Governance As If It
Were Centralised
[Diagram labels: replicate, streaming, batch load, archive.]
The data in the reservoir is distributed but the reservoir
is managed and operated as if it were centralised
Information Production Is A Process That Involves Refining And Integrating Data
[Diagram: raw data in reservoir storage is refined through in-progress data into trusted data, producing high value information and/or insights available for consumption.]
Collaboration is needed to perform many tasks in producing information, e.g. selecting and transforming data
The Information Production Process Works Across Zones In The Reservoir –
Zones Created By Tagging Files
[Diagram: the data reservoir and its zones. Sources: IoT, RDBMS, NoSQL, files, office documents, social, cloud, clickstream, web logs, XML and JSON, web services, DW data, streams. Zones: data ingestion zone (transient data); raw data zone; refinery zone (prepare and analyse in-progress data, with sandboxes for exploratory analysis); trusted data zone (master and reference data, DW archive); refined data and insights zone; data marketplace. ETL/data prep and DQ tools, data reservoir management and the information catalog sit alongside the zones.]
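The idea of creating zones by tagging files, rather than by physically moving them, can be sketched as a catalog that records a zone tag against each file path. This is a minimal sketch under stated assumptions: the `InfoCatalog` class, its methods and the strict left-to-right zone order are hypothetical, since the slide names the zones but does not prescribe a sequence or an API.

```python
from typing import Dict, List

# Assumed zone order for illustration; the slide does not fix a strict sequence.
ZONES = ["ingestion", "raw", "refinery", "trusted", "refined_insights"]


class InfoCatalog:
    """Tracks which zone each file belongs to by tag, not by location."""

    def __init__(self) -> None:
        self._zone_of: Dict[str, str] = {}

    def register(self, path: str, zone: str = "ingestion") -> None:
        if zone not in ZONES:
            raise ValueError(f"unknown zone: {zone}")
        self._zone_of[path] = zone

    def promote(self, path: str) -> str:
        """Re-tag a file into the next zone; the file itself does not move."""
        current = ZONES.index(self._zone_of[path])
        if current == len(ZONES) - 1:
            raise ValueError(f"{path} is already in the final zone")
        self._zone_of[path] = ZONES[current + 1]
        return self._zone_of[path]

    def files_in(self, zone: str) -> List[str]:
        return sorted(p for p, z in self._zone_of.items() if z == zone)
```

Because a zone is only a tag in the catalog, promoting data is a metadata update, which is one way the same physical store can hold raw, in-progress and trusted data side by side.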
Organising Data In A Reservoir – The Catalog Knows About Data Sources Plus
Data In All Zones And Sandboxes
[Diagram: the data reservoir zones (data ingestion, raw data, refinery with sandboxes for exploratory analysis, trusted data, refined data and insights, data marketplace) fed by the data sources, with the information catalog, ETL/data prep and DQ tools, and data reservoir management alongside.]
Operating A Data Reservoir – The Information Production Process Is A
Production Line That Spans Reservoir Zones
[Diagram: the data reservoir zones (data ingestion, raw data, refinery, trusted data, refined data and insights, data marketplace) fed by the data sources, annotated with the following steps of the information production process.]
Nominate new data
Classify sensitivity, quality, retention
Tag data (what does it mean?)
Assign governance policies based on classification
Collaborate about processing
Track data freshness
Rate its value (★★★★)
Exploratory analysis
Analyse, consume
Map to shared business vocabulary
Reservoir operations are controlled via the catalog and workflow processes
Operating A Data Reservoir – Workflows Are Everywhere And Are Components
Of An Information Production Process
[Diagram: the data reservoir zones (data ingestion, raw data, refinery with sandboxes, trusted data, refined data and insights, data marketplace) connected by workflows: ingest, stream, movement, refinery, analytical, governance, publish and provision workflows.]
Trends – Data And Analytical Workflow (Pipeline) Products Requiring No
Programming Are Emerging Everywhere
[Screenshots: Talend, Alteryx, Microsoft Azure Data Factory, Hortonworks DataFlow (NiFi), Dell Statistica.]
Who is using what tools?
Any reinvention?
Operating A Data Reservoir – All Workflows Should Be Approved And
Registered In The Information Catalog
[Diagram: the data reservoir zones connected by ingest, stream, movement, refinery, analytical, governance, publish and provision workflows, with virtual views in front of the self-service workflows.]
Convert self-service data integration (SSDI) workflows to data virtualisation views to minimise re-invention and enforce governance
Data Strategy Requirements – We Need To Enable Information Producers And
Information Consumers
▪ Need to make use of
• A business glossary and information catalog
• Re-usable services to manage and process data
• Collaboration and social computing to manage, process and rate data
• Role-based data management tools aimed at IT AND business
[Diagram: information producers (data scientists, IT professionals) use clean and integrate services to turn raw data into trusted data, which is recorded in the information catalog. Information consumers (business analysts) search, find, shop, order and consume that data in a BI tool or application, like a “corporate iTunes” for data.]
A ‘Production Line’ Publish And Subscribe Approach Is Used To Accelerate
Information And Insight Production
[Diagram: a publish and subscribe production line. Data sources are crawled, discovered and profiled, and the discovered data is published to the info catalog. Data preparation (clean, transform, filter) and data integration steps acquire data and publish trusted, integrated data as a service to the info catalog. Subscribers consume it to analyse (e.g. score) and publish new predictive analytic pipelines (as a service) to an analytics catalog. Further subscribers consume those to visualise, decide and act, or to embed them in analytic applications, and publish new prescriptive analytic pipelines and new analytic applications to a solutions catalog for use.]
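The publish and subscribe production line can be sketched with a small in-memory catalog: producers publish items under a topic, and consumers that subscribed to that topic are called back and may publish their own output to another catalog. The `Catalog` class, the topic names and the scoring rule are hypothetical and only illustrate the chaining of stages.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


class Catalog:
    """A minimal in-memory catalog with publish and subscribe."""

    def __init__(self) -> None:
        self.published: Dict[str, List[dict]] = {}
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, item: dict) -> None:
        self.published.setdefault(topic, []).append(item)
        for callback in self._subscribers[topic]:
            callback(item)


info_catalog = Catalog()       # trusted data as a service
analytics_catalog = Catalog()  # output of the analytical stage


def score(record: dict) -> None:
    """Analytical stage: consume trusted data, publish a scored result."""
    scored = dict(record, high_value=record["spend"] > 100)
    analytics_catalog.publish("scored_customers", scored)


info_catalog.subscribe("trusted_customers", score)
info_catalog.publish("trusted_customers", {"customer": "c1", "spend": 120})
```

Each stage only knows the catalog topics it subscribes and publishes to, which is what lets separate teams extend the production line without re-doing upstream work.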
Cataloging, Automated Discovery And Collaboration Are All Needed When Data
Is Ingested
[Diagram: the data reservoir zones (data ingestion, raw data, refinery, trusted data, refined data and insights, data marketplace) fed by the data sources, annotated with the following activities at ingestion.]
Catalog, tag and describe data/files (what is it about?)
Automated relationship discovery, data profiling and document clustering
Collaborative appraisal
Descriptive metadata is critical to keeping things organised
Governance In A Data Reservoir Is Controlled By Classification And Metadata In
The Information Catalog
Classifications drive the governance
[Diagram: a metadata model linking policies, governance rules, information governance rules, classifications, business attributes, physical data descriptions, governance actions, engines, processes, metrics, operational logs, the IT landscape and data stores/documents/files/APIs, through relationships such as classified by, governs, implemented by, actioned by, mapped to, deployed to, describes, accesses, measures, feeds, assessed by and logs activity.]
Source: IBM
IBM Are Creating ‘Governance Aware’ Runtimes To Verify And Enforce Policies
In A Data Reservoir
Source: IBM
They access the information
catalog to determine what to
do at run time
We Need A Data Refinery To Process, Clean And Analyse Data To Produce
Consumable High Value Insight
[Diagram: places where data can be refined, in the cloud and on-premises: DW, analytical RDBMS, ETL server, data virtualisation server.]
A data refinery should be able to choose where to best refine data to produce the information needed
A Key Requirement In A Distributed Data Reservoir Is Centralised Development,
Distributed Execution
[Diagram: the data reservoir (not a data store but a collection of stores: DW, data marts, ODS, MDM, RDM code sets, archived DW data, Hive tables, cloud object storage, ECM, staging areas, text/image/video, filtered sensor data, search indexes, in-progress data, published trusted data), with an execution engine at each store. An EIM tool suite (profiling, cleansing, ELT) with an IT user interface and a self-service UI, together with the information catalog and data virtualisation services, drives those execution engines.]
If A Data Reservoir Is Distributed With Data Too Big To Move Then Processing
Needs To Go To The Data
[Diagram: tasks are sent to execution engines located at on-premises storage, the DW staging area and cloud storage.]
Not centralised, not distributed, but federated
Options For Refining Data
▪ IT developed ETL processing using EIM tool suites
▪ Self-service data integration
▪ Multi-role EIM tool suites
• Can be used by both IT AND business users
▪ Data virtualisation server
▪ A combination of the above
Scaling ETL Transformations For In-Hadoop ELT Processing
Data Cleansing and Integration Tool: Extract, Parse, Clean, Transform, Analyse, Load, Insights
Option 1
ETL tool generates HQL, or converts generated SQL to HQL
Option 2
ETL tool generates Pig (compiler converts every transform to a MapReduce job) or JAQL
Option 3
ETL tool generates 3GL MapReduce or Spark code
Option 4 – Other
Native massively parallel transformation and integration, bypassing any Hadoop execution engine
E.g. Talend, IBM BigIntegrate, Informatica
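Option 1 amounts to code generation: the tool holds a declarative description of a transform and emits HiveQL so that the work runs inside Hadoop instead of on a separate ETL server. The sketch below is a deliberately small illustration of that idea; the `to_hiveql` function and the table and column names are hypothetical, and real ETL tools generate far richer statements.

```python
from typing import Dict, Optional


def to_hiveql(source: str, target: str, columns: Dict[str, str],
              where: Optional[str] = None) -> str:
    """Render a simple cleanse-and-load transform as a HiveQL statement.

    `columns` maps each output column name to the expression that computes it.
    """
    select_list = ", ".join(f"{expr} AS {alias}" for alias, expr in columns.items())
    statement = f"INSERT OVERWRITE TABLE {target} SELECT {select_list} FROM {source}"
    if where:
        statement += f" WHERE {where}"
    return statement


hql = to_hiveql(
    source="raw_customers",
    target="clean_customers",
    columns={"customer_id": "id", "email": "LOWER(TRIM(email))"},
    where="id IS NOT NULL",
)
```

The generated statement would then be submitted to Hive by whatever client the tool uses, so the data never has to leave the cluster to be cleansed.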
Self-Service Data Integration Tool Vendors
▪ Actian Dataflow
▪ Alteryx
▪ ClearStory Data
▪ Datameer
▪ IBM DataWorks
▪ Informatica Rev
▪ Paxata
▪ SAS Data Loader for Hadoop
▪ Tamr
▪ Trifacta
[Diagram: two tool scopes shown against the pipeline acquire, data preparation (clean, transform, filter), data integration, analyse (e.g. score), visualise, decide, act and embed: tools covering data preparation, integration, analysis and visualisation, and tools covering data preparation and integration only.]
Some Data Management Vendors Are Trying To Cover All Roles And Integrate
With Other Vendors, e.g. Informatica
[Diagram: Informatica example. An EIM tool suite (data and metadata relationship discovery services; data quality profiling and monitoring services; data modeling services; data cleansing and matching services; data integration services; business glossary/info catalog services; data privacy and lifecycle management services; data audit and protection services; data governance/management console) shares metadata with Informatica Catalog and Live Data Map, the Analyst tool, and the Informatica Rev self-service cloud DI tool, serving IT data architects, data scientists and business analysts.]
Some Vendors Are Opening Up Their Service Oriented Data Management
Platforms To IT AND Business Users
[Diagram: IT data architects, IT developers and business users (via applications and self-service tools) reach the same data management services over an enterprise service bus (ESB) as information services, with workflow and shared metadata: data and metadata relationship discovery; data quality profiling and monitoring; data modeling; data cleansing and matching; data integration (virtualisation and ETL); business glossary/info catalog; data privacy and lifecycle management; data audit and protection; data governance/management console. The services work against MDM (CRUD on product, customer and asset data), data warehousing, Big Data, cloud and data virtualisation.]
Role-based UIs to the same data management platform
Alternatively Interoperability Is Needed Across Tools To Use Data Preparation
Jobs Developed By Different Users
[Diagram: interoperability through shared metadata between an EIM tool suite (data and metadata relationship discovery, data quality profiling and monitoring, data modeling, data cleansing and matching, data integration, business glossary/info catalog, data privacy and lifecycle management, data audit and protection services, and a data governance/management console) and other tools: stand-alone data wrangling tools; self-service DI embedded in self-service BI tools (e.g. PowerQuery); and cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev). Users: IT data architects, data scientists, business analysts.]
Metadata Management In A Data Reservoir
- EIM Platform Information Catalog And Apache Atlas
[Diagram: the EIM tool suite (services, data governance/management console) and its information catalog exchange metadata with stand-alone data wrangling tools, self-service DI embedded in self-service BI tools (e.g. PowerQuery) and cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev), and with Apache Atlas and its graph store. Users: IT data architects, data scientists, business analysts.]
Metadata Management In A Data Reservoir
- Stand-Alone Information Catalog And Apache Atlas
[Diagram: a stand-alone information catalog exchanges metadata with the EIM tool suite (services, data governance/management console), stand-alone data wrangling tools, self-service DI embedded in self-service BI tools (e.g. PowerQuery) and cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev), and with Apache Atlas and its graph store. Users: IT data architects, data scientists, business analysts.]
New Trusted Data Produced By Refining Un-Modelled Data Should Be Defined
In A Business Glossary
[Diagram: inside the corporate firewall, a data refinery sandbox takes raw (untrusted) data through in-progress data to refined (trusted) data that is fit for use, which is defined in the business glossary and exposed through data virtualisation.]
Could implement the shared business vocabulary (SBV) in a data virtualisation server
The Critical Importance Of An Information Catalog
– We MUST Be Able To Answer This Question
Business user: What information exists about ...?
Where is that likely to be documented?
An Information Catalogue
The Information Catalog
- What Else Do I Want To Know?
Can I search for information? (faceted search via your SBV)
Does the data exist?
Is the data trusted? (what is the rating?)
Is the data sensitive? (what is the rating?)
Is it of high business value? (what is the rating?)
Can I order it?
Can I specify where to deliver it to and in what format?
Can I see where it is used and who owns it?
Information Catalogue
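Several of the questions on this slide (is it trusted, is it sensitive, what is its business value, who owns it, where is it used) map naturally onto facets of a catalog entry. The sketch below shows one possible shape for such an entry and a simple faceted search over a list of entries. The `CatalogEntry` fields, the 1 to 5 trust scale and the sample entries are hypothetical and are not based on any particular catalog product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    owner: str
    trust: int            # hypothetical 1-5 rating
    sensitive: bool
    business_value: str   # e.g. "Marginal", "Important", "Critical"
    used_by: Tuple[str, ...] = ()


def faceted_search(entries: List[CatalogEntry], **facets) -> List[CatalogEntry]:
    """Return the entries whose attributes match every requested facet value."""
    return [
        entry for entry in entries
        if all(getattr(entry, facet) == value for facet, value in facets.items())
    ]


ENTRIES = [
    CatalogEntry("customer_master", "MDM team", 5, True, "Critical", ("CRM", "DW")),
    CatalogEntry("web_clickstream", "Digital team", 2, False, "Marginal"),
    CatalogEntry("product_master", "MDM team", 5, False, "Critical", ("ERP",)),
]
```

For example, `faceted_search(ENTRIES, business_value="Critical", sensitive=False)` narrows the catalog to critical data that is not sensitive, in the same way a shopper narrows a product list by ticking facets.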
Information Catalog Example - Waterline Data
Faceted Navigation Used In E-Commerce (e.g. Amazon) Is About To Get A
Much Bigger Role In Data Management
Add it to
your cart
Select the
products you
want
Ordered Parcel Delivery – The Same Thing Will Happen To Provision Ordered
Data
Ordered data
Virtual Information Provisioning Needs Policy Awareness At Runtime To Create
Virtual Views That Enforce Governance
[Diagram: an information provisioning service sits in front of the “finished-goods” refined data in the data reservoir (all data has a shared business vocabulary, exposed through data virtualisation) and applies policies at runtime. Under a security policy it provides either a virtual full data set (all data permitted to be seen) or a virtual data subset (some data not permitted to be seen). Under a compliance policy it provides either a virtual full data set (all data provisioned inside the country) or a virtual data subset (some data not allowed to be provisioned outside the country).]
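The two policies on this slide can be sketched as a function that builds a virtual view at request time: a compliance rule decides which rows may be provisioned to the requester's country, and a security rule decides which columns the requester may see. This is a minimal sketch; the `virtual_view` function, the `country` field and the `PII_COLUMNS` set are hypothetical stand-ins for policies that would normally be looked up in the information catalog.

```python
from typing import Dict, List

# Hypothetical policy metadata; in practice this would come from the catalog.
PII_COLUMNS = {"email"}


def virtual_view(rows: List[Dict[str, str]], requester_country: str,
                 may_see_pii: bool) -> List[Dict[str, str]]:
    """Build a per-request view without copying or altering the stored data."""
    view = []
    for row in rows:
        # Compliance policy: data is not provisioned outside its own country.
        if row["country"] != requester_country:
            continue
        # Security policy: hide columns the requester is not permitted to see.
        if may_see_pii:
            view.append(dict(row))
        else:
            view.append({k: v for k, v in row.items() if k not in PII_COLUMNS})
    return view


REFINED = [
    {"customer": "c1", "country": "IE", "email": "c1@example.org"},
    {"customer": "c2", "country": "DE", "email": "c2@example.org"},
]
```

The same refined data yields a full data set for one requester and a subset for another, which is the behaviour the slide describes as policy awareness at runtime.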
Conclusions
▪ The challenge is now to manage data in the entire analytical ecosystem
▪ Invest in the new skills and training needed in this environment
▪ Data needs to be organised in a data reservoir to prevent chaos
▪ Hadoop is becoming a platform to accelerate cleansing and ETL processing to conduct exploratory analytics
▪ Multiple options exist to allow IT and business users to clean and integrate data in preparation for analysis
• Data integration vendors have added functionality to support Hadoop
• Self-service data cleansing and integration tools also exist
▪ The ideal solution is a single platform that supports IT and business user self-service data integration
▪ An information catalog is critical for end-to-end data governance
• Understanding what data is available (descriptive metadata)
• Understanding how it was transformed (metadata lineage)
▪ Data virtualisation is needed to see across multiple data reservoirs
▪ Start small and build out incrementally – don’t just load data and hope
www.intelligentbusiness.biz
mferguson@intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Thank You!

Contenu connexe

Tendances

Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7mmathipra
 
The 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeThe 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeDataWorks Summit
 
Demystifying Data Virtualization: Why it’s Now Critical for Your Data Strategy
Demystifying Data Virtualization: Why it’s Now Critical for Your Data StrategyDemystifying Data Virtualization: Why it’s Now Critical for Your Data Strategy
Demystifying Data Virtualization: Why it’s Now Critical for Your Data StrategyDenodo
 
Open Source in the Energy Industry - Creating a New Operational Model for Dat...
Open Source in the Energy Industry - Creating a New Operational Model for Dat...Open Source in the Energy Industry - Creating a New Operational Model for Dat...
Open Source in the Energy Industry - Creating a New Operational Model for Dat...DataWorks Summit
 
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN)
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN)Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN)
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN)Denodo
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Denodo
 
Analyst Webinar: Best Practices In Enabling Data-Driven Decision Making
Analyst Webinar: Best Practices In Enabling Data-Driven Decision MakingAnalyst Webinar: Best Practices In Enabling Data-Driven Decision Making
Analyst Webinar: Best Practices In Enabling Data-Driven Decision MakingDenodo
 
Using Hadoop as a platform for Master Data Management
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Organising the Data Lake - Information Management in a Big Data World

  • 1. Organising The Data Lake - Information Management In A Big Data World Mike Ferguson Managing Director Intelligent Business Strategies Hadoop Summit Dublin, April 2016
  • 2. 2Copyright © Intelligent Business Strategies 1992-2016! About Mike Ferguson Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence, data management and enterprise business integration. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates. www.intelligentbusiness.biz mferguson@intelligentbusiness.biz Twitter: @mikeferguson1 Tel/Fax (+44)1625 520700
  • 3. Topics ■ The data integration complexity ■ The siloed approach to managing and governing data ■ A new inclusive approach to governing and managing data ■ Introducing the data reservoir and data refinery ■ How do a data reservoir and data refinery work? ■ Mapping new data and insights into your shared business vocabulary ■ The mission critical importance of an information catalog in a distributed data landscape ■ Integrating data reservoirs and data refineries into your existing environment
  • 4. The Changing Landscape – We Now Have Different Platforms Optimised For Different Analytical Workloads Streaming data Hadoop data store Data Warehouse RDBMS NoSQL DBMS EDW DW & marts NoSQL Graph DB Advanced Analytic (multi-structured data) mart DW Appliance Advanced Analytics (structured data) Analytical RDBMS Big Data workloads result in multiple platforms now being needed for analytical processing C R U D Prod Asset Cust MDM Traditional query, reporting & analysis Real-time stream processing & decision m’gmt Data mining, model development Investigative analysis, Data refinery Data mining, model development Graph analysis Graph analysis
  • 5. Data Integration Today Has Become Much More Complex - Popular Data Integration Paths Between Platforms EDW DW Appliance Analytical DBMS MDM System C R U D Prod Asset Cust XML, JSON social Web logs ERP CRM SCM Ops Graph DBMS NoSQL DB Column Fam DB Document DB NoSQL DB web Data marts Transaction data Cloud data may also be part of it insights Txns
  • 6. Issues: Siloed Analytics - Different Tools To Manage And Integrate Data For Each Type Of Analytical And MDM Store Analytical tools Data management tools EDW mart Structured data CRM ERP SCM Silo DW & marts Analytical tools/apps Data management tools Multi-structured data Silo DW Appliance Advanced Analytics (structured data) Data management tools Structured data CRM ERP SCM Analytical tools Silo Analytical tools/apps Data management tools NoSQL DB e.g. graph DB Silo Multi-structured & structured data Silo C R U D Prod Asset Cust MDM Applications Data management tools Master data management CRM ERP SCM
  • 7. Issues: Data Deluge - Data Is Arriving Faster Than We Can Consume It DATA FILTER Enterprise Enterprise systems
  • 8. With 000’s Of Data Sources, IT And Business Need To Work Together As IT Will Likely Become A Bottleneck IT OLTP systems Web logs web DQ/DI job DQ/DI job DQ/DI job Open data IoT machine data social & web C R U prod cust asset D MDM DW Data warehousing cloud Data virtualisation Can business analysts & Data Scientists help? DQ/DI job DQ/DI job DQ/DI job ??? Bottleneck? Should IT be expected to do everything? Big Data
  • 9. Issues: Have You Got Self-Service Data Integration Causing Chaos In The Enterprise? social Web logs web cloud sandbox Data Scientists sandbox Data Scientists sandbox Data Scientists HDFS ETL / DQ Self-service BI tools with ETL ETL new insights SQL on Hadoop DW ETL / DQ DW marts ETL SCM CRM ERP ETL/DQ marts Self-service BI tools with ETL ETL/DQ Built by IT ETL/DQ ETL/DQ ETL/DQ
  • 10. Problems With The Current Approach ■ Project oriented siloed approach to DI/DQ with limited collaboration ■ Cost of data integration is too high ■ Slow speed of development ■ Multiple DI/DQ technologies and techniques being used that are not integrated ■ Lots of re-invention rather than re-use ■ Fractured metadata across multiple tools or no metadata at all in some cases ■ Risk of duplicate inconsistent DI/DQ rules for same data ■ Metadata lineage is unavailable in many places especially with hand-coded Big Data DI/DQ applications ■ Multiple skill sets fractured across different projects ■ Repetition of our mistakes, e.g. Big Data preparation EDW C R U D Prod Asset Cust MDM DQ/DI DQ/DI DQ/DI DQ/DI DQ/DI cloud Data virtualisation DQ/DI DQ/DI DQ/DI Self-service
  • 11. There has to be a better, more governed way to fuel productivity and agility without causing data inconsistency and chaos EDW DQ/DI C R U D Prod Asset Cust MDM DQ/DI DQ/DI DQ/DI cloud Data virtualisation DQ/DI DQ/DI DQ/DI DQ/DI Self-service Tools are available but are not well integrated Also the whole collaborative, metadata and information catalog piece is incomplete IT IS NOT ENOUGH – THE WHOLE THING HAS TO BE CO-ORDINATED
  • 12. We Are All In The Same Boat! – Everyone For Themselves Is Not An Option IT Data Architect Data Scientist IT Developer Business analyst
  • 13. Information Management – Introducing The Data Lake Reservoir Reservoir
  • 14. What Is A Data Reservoir? - A Collaborative, Governed Environment Aimed At Rapidly Producing Information IT Data Architect Data Scientist Domain Expert community Bus. analyst Need to work together for competitive advantage Data Scientist IT Developer community Data Architect Data Scientist Domain Expert community Domain Expert Data Scientist Domain Expert community Bus. analyst Bus. analyst Data Architect community
  • 15. Chaos Is NOT An Option – Business Alignment Of Information Being Produced Is Critical To Success Big Data Project Big Data Project DW Project MDM Project Project Strategic Objectives Business Strategy • What problem are you trying to solve? • What data do you need? • What kind(s) of analytic workload are needed? We need co-ordinated “info producer” projects in a managed environment
  • 16. Key Capabilities In A Managed Data Reservoir - 1 ■ Data collection • Automated discovery of the structure and formatting • Data structure inferred by machine learning • Automated cataloging, infinite storage and processing ■ Data classification • Determines how data should be governed • Support is needed for different types of classification schemes, e.g. Retention: Unclassified, Temporary, Project Lifetime, Managed period, Permanent. Confidential: Unclassified, Internal use, Business confidential, Supplier confidential, Sensitive (PII), Sensitive (Financial), Sensitive (Operations), Restricted (Trade secret). Confidence: Unclassified, Raw (original), Obsolete, Archived, Trusted. Business Value: Unclassified, Unimportant, Marginal, Important, Critical, Catastrophic.
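The four classification schemes on slide 16 could be modelled as enumerations attached to each data set. The sketch below is illustrative only: the class names, the `Classification` record and the `needs_review` helper are assumptions made here, not part of the deck or of any product. The scheme the slide labels "Confidential" is named `Confidentiality` for readability.

```python
from dataclasses import dataclass
from enum import Enum


class Retention(Enum):
    UNCLASSIFIED = "Unclassified"
    TEMPORARY = "Temporary"
    PROJECT_LIFETIME = "Project Lifetime"
    MANAGED_PERIOD = "Managed period"
    PERMANENT = "Permanent"


class Confidentiality(Enum):
    UNCLASSIFIED = "Unclassified"
    INTERNAL_USE = "Internal use"
    BUSINESS_CONFIDENTIAL = "Business confidential"
    SUPPLIER_CONFIDENTIAL = "Supplier confidential"
    SENSITIVE_PII = "Sensitive (PII)"
    SENSITIVE_FINANCIAL = "Sensitive (Financial)"
    SENSITIVE_OPERATIONS = "Sensitive (Operations)"
    RESTRICTED_TRADE_SECRET = "Restricted (Trade secret)"


class Confidence(Enum):
    UNCLASSIFIED = "Unclassified"
    RAW = "Raw (original)"
    OBSOLETE = "Obsolete"
    ARCHIVED = "Archived"
    TRUSTED = "Trusted"


class BusinessValue(Enum):
    UNCLASSIFIED = "Unclassified"
    UNIMPORTANT = "Unimportant"
    MARGINAL = "Marginal"
    IMPORTANT = "Important"
    CRITICAL = "Critical"
    CATASTROPHIC = "Catastrophic"


@dataclass(frozen=True)
class Classification:
    """Hypothetical record holding one value per scheme for a data set."""
    retention: Retention = Retention.UNCLASSIFIED
    confidentiality: Confidentiality = Confidentiality.UNCLASSIFIED
    confidence: Confidence = Confidence.UNCLASSIFIED
    business_value: BusinessValue = BusinessValue.UNCLASSIFIED

    def needs_review(self) -> bool:
        """True while at least one scheme is still unclassified."""
        values = (self.retention, self.confidentiality,
                  self.confidence, self.business_value)
        return any(v.name == "UNCLASSIFIED" for v in values)
```

Defaulting every scheme to "Unclassified" means newly ingested data is flagged for review until someone classifies it, which is one plausible reading of "determines how data should be governed".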
  • 17. Key Capabilities In A Managed Data Reservoir - 2 ■ Collaborative data governance • Data quality • Data trustworthiness (confidence) • Data protection – Data privacy, access authorisation, lifecycle management • Compliance ■ Data refinery • Systematically clean and refine data through various stages • Manual and guided data preparation • “Sandbox” analyse data to produce high value insights ■ Data as a Service (DaaS) • Published high value insights available for consumption • Search for and discover trusted insights, subscribe to receive it ■ Data consumption • Provision refined, trusted commonly understood data into any tool or application
  • 18. Data virtualisation services A Data Reservoir Is An Organised Collection Of Raw, In-Progress And Trusted Data (Multiple Data Stores) DW MDM C R U D Prod Asset Cust Data marts Cloud object storage Refined trusted & integrated data Strong governance Raw untrusted data some governance ECM Staging areas ODS RDM C R U D Code sets Archived DW data Hive tables feeds IoT XML, JSON RDBMS Files office docs social Cloud clickstream web logs web services NoSQL ODS ODS DW Text / Image / Video Filtered sensor data Published trusted data Search indexes In-progress data Data Reservoir (not a data store but a collection of stores) Data sources and ingested reservoir data are all known to the catalog Info Catalog
  • 19. Replicate Streaming Batch Load Archive Raw Data Is Being Collected In Multiple Places Across The Enterprise – We Need To Know What’s Happening! We need to avoid unconnected silos But we HAVE TO know what is being collected and filtered and where that is happening Also who is doing it, for what business purpose?
  • 20. If Multiple Collection Points Exist Then Something Has To Catalog What Data Is Available, Its Status And Where It Is All data entering a reservoir needs to be catalogued and organised You need to know what data is available across the enterprise, where it came from, what state it is in, whether we should trust it, and whether we can order it Information Catalogue
  • 21. A Distributed Data Reservoir Requires Information Management Software To Work Across Multiple Data Stores Enterprise Information Management (Catalog, DQ, ETL, Security, Privacy…) The Data Reservoir is distributed but it should be managed and function as if it were centralised Key requirements Define once, execute anywhere Centralised metadata Distributed execution of policies associated with data quality, ETL, security, lifecycle management across the landscape (multiple execution engines)
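As a rough illustration of "define once, execute anywhere", the sketch below keeps a single policy definition in one place and hands it to several execution engines, each working on its own local data. `Policy`, `ExecutionEngine` and `execute_everywhere` are hypothetical names invented for this example; real EIM suites push work down to engines such as Hadoop or a DW in their own ways.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Policy:
    """A data quality policy defined once, centrally."""
    name: str
    rule: Callable[[Row], bool]  # returns True when a row passes


class ExecutionEngine:
    """Hypothetical stand-in for an engine co-located with one data store."""

    def __init__(self, store_name: str, rows: List[Row]) -> None:
        self.store_name = store_name
        self.rows = rows

    def run(self, policy: Policy) -> int:
        """Apply the policy locally and return the number of failing rows."""
        return sum(1 for row in self.rows if not policy.rule(row))


def execute_everywhere(policy: Policy,
                       engines: List[ExecutionEngine]) -> Dict[str, int]:
    """Send the same policy to every engine; only the results travel back."""
    return {engine.store_name: engine.run(policy) for engine in engines}
```

For example, a policy such as `Policy("customer_id_present", lambda r: r.get("customer_id") is not None)` could be run against a Hadoop engine and a warehouse engine in one call, returning a failure count per store without moving the data.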
  • 22. Replicate Streaming Batch Load Archive A Distributed Data Reservoir Requires Management And Governance As If It Was Centralised The data in the reservoir is distributed but the reservoir is managed and operated as if it were centralised
  • 23. Information Production Is A Process That Involves Refining And Integrating Data High value information and/or insights available for consumption Raw data Raw data Trusted data Collaboration is needed to perform many tasks in producing information, e.g. selecting & transforming data Reservoir storage Raw data Raw data In-progress data Trusted data
  • 24. The Information Production Process Works Across Zones In The Reservoir – Zones Created By Tagging Files sandbox Trusted Data Zone Raw Data Zone Info Catalog master ref data DW archive sandbox Refinery zone (prepare & analyse data) In-progress data Refined data & Insights zone Data marketplace Data reservoir management ETL/Data prep DQ Data Ingestion zone (transient data) IoT RDBMS office docs social Cloud clickstream web logs XML, JSON web services NoSQL Files DW data streams Data Reservoir Exploratory analysis
  • 25. Organising Data In A Reservoir – The Catalog Knows About Data Sources Plus Data In All Zones And Sandboxes sandbox Trusted Data Zone Raw Data Zone Info Catalog master ref data DW archive sandbox Refinery zone (prepare & analyse data) In-progress data Refined data & Insights zone Data marketplace Data reservoir management ETL/Data prep DQ Data Ingestion zone (transient data) IoT RDBMS office docs social Cloud clickstream web logs XML, JSON web services NoSQL Files DW data streams Data Reservoir Exploratory analysis
  • 26. Operating A Data Reservoir – The Information Production Process Is A Production Line That Spans Reservoir Zones Trusted Data Zone Raw Data Zone Info Catalog master ref data DW archive Refinery zone (prepare & analyse data) In-progress data Refined data & Insights zone Data marketplace Data reservoir management ETL/Data prep DQ Data Ingestion zone (transient data) IoT RDBMS office docs social Cloud clickstream web logs XML, JSON web services NoSQL Files DW data streams Data Reservoir Nominate new data Classify sensitivity, quality, retention Tag data (what’s it mean?) Assign governance policies based on classification Collaborate about processing Track data freshness Rate its value ★★★★ Exploratory analysis Analyse consume Reservoir operations are controlled via the catalog and workflow processes Info Catalog Map to shared business vocabulary
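Slides 24 to 26 describe zones that exist as tags in the catalog rather than as physical locations, with data moving along a production line. A minimal sketch of that idea follows. The zone names are a simplified, linear reading of the diagram, and `InfoCatalog`, `nominate`, `tag` and `promote` are hypothetical names chosen here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Assumed linear ordering of zones, simplified from the slide diagram.
ZONES = ("ingestion", "raw", "refinery", "trusted")


@dataclass
class CatalogEntry:
    path: str
    zone: str = "ingestion"
    tags: Set[str] = field(default_factory=set)


class InfoCatalog:
    """Hypothetical catalog: a file's zone is metadata, not a directory."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def nominate(self, path: str) -> CatalogEntry:
        """Register new data; it starts in the ingestion zone."""
        entry = CatalogEntry(path)
        self._entries[path] = entry
        return entry

    def tag(self, path: str, *tags: str) -> None:
        """Attach descriptive tags (what is this data about?)."""
        self._entries[path].tags.update(tags)

    def promote(self, path: str) -> str:
        """Move an entry to the next zone by re-tagging it, not copying it."""
        entry = self._entries[path]
        index = ZONES.index(entry.zone)
        if index == len(ZONES) - 1:
            raise ValueError(f"{path} is already in the final zone")
        entry.zone = ZONES[index + 1]
        return entry.zone

    def in_zone(self, zone: str) -> List[str]:
        """List the paths currently tagged with the given zone."""
        return sorted(p for p, e in self._entries.items() if e.zone == zone)
```

Because promotion only changes metadata, the same file can sit in one store throughout while the catalog records which stage of the production line it has reached.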
  • 27. Operating A Data Reservoir – Workflows Are Everywhere And Are Components Of An Information Production Process sandbox Trusted Data Zone Raw Data Zone Info Catalog master ref data DW archive sandbox Refinery zone (prepare & analyse data) In-progress data Refined data & Insights zone Data marketplace Data reservoir management ETL/Data prep DQ Data Ingestion zone (transient data) IoT RDBMS office docs social Cloud clickstream web logs XML, JSON web services NoSQL Files DW data streams Data Reservoir Exploratory analysis Ingest w/flow movement w/flow movement w/flow Publish w/flow Publish w/flow Provision w/flow Refinery w/flow Analytical w/flow Gov w/flow Gov w/flow Stream w/flow
  • 28. Trends – Data And Analytical Workflow (Pipeline) Products Requiring No Programming Are Emerging Everywhere Talend Alteryx Microsoft Azure Data Factory Hortonworks DataFlow (NiFi) Dell Statistica Who is using what tools? Any reinvention?
  • 29. Operating A Data Reservoir – All Workflows Should Be Approved And Registered In The Information Catalog sandbox Trusted Data Zone Raw Data Zone Info Catalog master ref data DW archive sandbox Refinery zone (prepare & analyse data) In-progress data Refined data & Insights zone Data marketplace Data reservoir management ETL/Data prep DQ Data Ingestion zone (transient data) IoT RDBMS office docs social Cloud clickstream web logs XML, JSON web services NoSQL Files DW data streams Data Reservoir Exploratory analysis Ingest w/flow Publish w/flow Publish w/flow movement w/flow movement w/flow Provision w/flow Refinery w/flow Analytical w/flow Gov w/flow Gov w/flow Stream w/flow Convert SSDI workflows to data virtualisation views to minimise re-invention and enforce governance virtual view virtual view virtual view
  • 30. Data Strategy Requirements – We Need To Enable Information Producers And Information Consumers ■ Need to make use of • A business glossary and information catalog • Re-usable services to manage and process data • Collaboration and social computing to manage, process and rate data • Role-based data management tools aimed at IT AND business clean & integrate service raw data trusted data Information catalog BI tool or application search find shop order consume data scientist IT professional information producers clean & integrate service raw data business analysts information consumers like a “corporate iTunes” for data
  • 31. A ‘Production Line’ Publish And Subscribe Approach Is Used To Accelerate Information And Insight Production data source Data Integration publish Info catalog trusted data as a service publish Info catalog trusted, integrated data as a service subscribe Analyse (e.g. score) consume publish Analytics catalog New predictive analytic pipelines (as a service) consume subscribe Visualise Decide Act Other, e.g. embed analytic applications consume subscribe publish Solutions catalog New prescriptive analytic pipelines publish New analytic applications use crawl discover profile publish Info catalog discovered data Acquire Acquire Acquire Data Preparation (clean, transform, filter)
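The publish and subscribe production line on slide 31 can be sketched as a catalog that notifies downstream consumers whenever a producer publishes a new asset. `PublishingCatalog` and its topic names are assumptions made for this example and do not correspond to a specific product API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# A subscriber receives the topic and the name of the published asset.
Subscriber = Callable[[str, str], None]


class PublishingCatalog:
    """Hypothetical catalog where producers publish and consumers subscribe."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Subscriber]] = defaultdict(list)
        self.published: List[Tuple[str, str]] = []

    def subscribe(self, topic: str, callback: Subscriber) -> None:
        """Register interest in a topic such as 'trusted-data'."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, asset: str) -> int:
        """Record the asset, notify subscribers, return how many were told."""
        self.published.append((topic, asset))
        subscribers = self._subscribers[topic]
        for callback in subscribers:
            callback(topic, asset)
        return len(subscribers)
```

Each stage of the line (data preparation, integration, analysis) would be both a subscriber to the previous stage and a publisher to the next, so assets published with no subscribers are still recorded and discoverable later.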
  • 32. Cataloging, Automated Discovery And Collaboration Are All Needed When Data Is Ingested Trusted Data Zone Raw Data Zone Info Catalog master ref data DW archive Refinery zone (prepare & analyse data) In-progress data Refined data & Insights zone Data marketplace Data reservoir management ETL/Data prep DQ Data Ingestion zone (transient data) IoT RDBMS office docs social Cloud clickstream web logs XML, JSON web services NoSQL Files DW data streams Data Reservoir Exploratory analysis Analyse consume Automated relationship discovery, data profiling, and document clustering Descriptive metadata is critical to keeping things organised Info Catalog Catalog, tag and describe data/files (what’s it about?) collaborative appraisal
  • 33. Governance In A Data Reservoir Is Controlled By Classification And Metadata In The Information Catalog Classifications drive the governance Governance Rule Governance Rule Governance Rule Classification Classification Information Rule Information Governance Rule Classified by Actioned by Physical Data Description Policy Governs Implemented by Policy Process Assessed by Business Attribute Classified by Mapped to Governs Sensitive IT Landscape Deployed to Governance Action Describes by Engine Accesses Metrics Measures Process Assessed by Feeds Operational Log Logs activity Describes Data store Data store/ Document/ File/API Measures Measures Source: IBM
  • 34. IBM Are Creating ‘Governance Aware’ Runtimes To Verify And Enforce Policies In A Data Reservoir Source: IBM They access the information catalog to determine what to do at run time
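Slides 33 and 34 describe classifications driving governance rules that a runtime looks up in the catalog before acting. The following sketch shows that lookup in miniature. The rule table, the catalog contents, the action names and the "quarantine" default for unclassified data are all assumptions for illustration; this is not IBM's implementation.

```python
from typing import Dict, List

# Assumed mapping from classification to governance actions.
GOVERNANCE_RULES: Dict[str, List[str]] = {
    "Sensitive (PII)": ["mask", "audit"],
    "Business confidential": ["audit"],
    "Internal use": [],
}

# Assumed catalog metadata: column name -> classification.
CATALOG: Dict[str, str] = {
    "customer.email": "Sensitive (PII)",
    "customer.segment": "Internal use",
}


def actions_for(column: str) -> List[str]:
    """Resolve governance actions for a column at run time via the catalog."""
    classification = CATALOG.get(column, "Unclassified")
    # Data with no matching rule is quarantined until someone classifies it.
    return GOVERNANCE_RULES.get(classification, ["quarantine"])
```

Keeping the rules keyed on classification rather than on individual data sets means that reclassifying a column in the catalog changes its treatment everywhere without editing any pipeline.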
  • 35. We Need A Data Refinery To Process, Clean And Analyse Data To Produce Consumable High Value Insight cloud On-premises DW Analytical RDBMS ETL Server Data Virtualisation Server A data refinery should be able to choose where to best refine data to produce the information needed
  • 36. Data virtualisation services A Key Requirement In A Distributed Data Reservoir Is Centralised Development, Distributed Execution MDM C R U D Prod Asset Cust Data marts Cloud object storage Refined trusted & integrated data Strong governance Raw untrusted data some governance ECM Staging areas RDM C R U D Code sets Archived DW data Hive tables feeds IoT XML, JSON RDBMS Files office docs social Cloud clickstream web logs web services NoSQL Text / Image / Video Filtered sensor data Published trusted data Search indexes In-progress data Data Reservoir (not a data store but a collection of stores) Info Catalog ODS DW staging area EIM Tool Suite (Profiling, cleansing, ELT) ODS ODS Execution engine Execution engine Execution engine Execution engine Execution engine Execution engine IT User Interface Self-service UI Execution engine Execution engine Execution engine Execution engine Execution engine Execution engine Execution engine
  • 37. On-premises storage DW staging area Cloud storage Execution engine Execution engine Execution engine Execution engine Execution engine If A Data Reservoir Is Distributed With Data Too Big To Move Then Processing Needs To Go To The Data Not centralised, Not distributed But Federated Task Task Task Task Task
  • 38. Options For Refining Data ■ IT developed ETL processing using EIM tool suites ■ Self-service data integration ■ Multi-role EIM tool suites • Can be used by both IT AND business users ■ Data virtualisation server ■ A combination of the above
  • 39. Scaling ETL Transformations For In-Hadoop ELT Processing Data Cleansing and Integration Tool Extract Parse Clean Transform Analyse Load Insights Option 1 ETL tool generates HQL or convert generated SQL to HQL Option 2 ETL tool generates Pig (compiler converts every transform to a map reduce job) or JAQL Option 3 ETL tool generates 3GL MR or Spark code Option 4 – Other Native massively parallel transformation and integration bypassing any Hadoop execution engine E.g. Talend, IBM BigIntegrate, Informatica
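Option 1 on slide 39 has the ETL tool generate HQL so that the transformation runs inside Hadoop. A toy version of such a generator is sketched below. `generate_hql` and the table and column names in the usage note are hypothetical, and the function uses plain string formatting, so it should only be fed trusted, tool-generated identifiers and expressions.

```python
from typing import Dict


def generate_hql(source: str, target: str,
                 columns: Dict[str, str], where: str = "") -> str:
    """Build a HiveQL INSERT OVERWRITE statement from a simple mapping.

    `columns` maps each target column name to the expression that derives
    it from the source table.
    """
    select_list = ",\n  ".join(
        f"{expression} AS {alias}" for alias, expression in columns.items()
    )
    hql = (
        f"INSERT OVERWRITE TABLE {target}\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {source}"
    )
    if where:
        hql += f"\nWHERE {where}"
    return hql
```

Calling `generate_hql("raw.customers", "trusted.customers", {"customer_id": "id", "email": "LOWER(TRIM(email))"}, "id IS NOT NULL")` yields a statement that cleans and filters the data where it lives, which is the ELT pattern the slide describes.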
  • 40. Self-Service Data Integration Tool Vendors ■ Actian Dataflow ■ Alteryx ■ ClearStory Data ■ Datameer ■ IBM DataWorks ■ Informatica Rev ■ Paxata ■ SAS Data Loader for Hadoop ■ Tamr ■ Trifacta Acquire Data Preparation (clean, transform, filter) Analyse (e.g. Score) Visualise Decide Act Data Integration data Embed Acquire Data Preparation (clean, transform, filter) Analyse (e.g. Score) Visualise Decide Act Data Integration data Embed Data preparation, integration, analysis & visualisation Data preparation and integration
  • 41. Some Data Management Vendors Are Trying To Cover All Roles And Integrate With Other Vendors, e.g. Informatica Informatica Catalog & Live Data Map Analyst tool Data & Metadata Relationship Discovery Services Data Quality Profiling & Monitoring Services Data Modeling Services Data Cleansing & Matching Services Data Integration Services Business Glossary / Info Catalog Services Data Governance/Management Console Data Privacy & Lifecycle Management Services Data Audit & Protection Services EIM Tool Suite IT Data Architect Data Scientist Business Analyst Informatica Rev Self-service Cloud DI metadata metadata
  • 42. Data & Metadata Relationship Discovery Services Data Quality Profiling & Monitoring Services Data Modeling Data Cleansing & Matching Services Data Integration Services (virt & ETL) Business Glossary / Info Catalog Services Data Governance/Management Console metadata Data Privacy & Lifecycle Management Services Data Audit & Protection Services ESB Information services C R U prod cust asset D MDM DW Data warehousing Big Data Data virtualisation cloud Business User IT Developer IT Data Architect App Self-Service Enterprise Service Bus Some Vendors Are Opening Up Their Service Oriented Data Management Platforms To IT AND Business Users Role-based UIs to the same data management platform Workflow
  • 43. Alternatively Interoperability Is Needed Across Tools To Use Data Preparation Jobs Developed By Different Users Stand-alone Data Wrangling tools Data & Metadata Relationship Discovery Services Data Quality Profiling & Monitoring Services Data Modeling Services Data Cleansing & Matching Services Data Integration Services Business Glossary / Info Catalog Services Data Governance/Management Console Data Privacy & Lifecycle Management Services Data Audit & Protection Services EIM Tool Suite IT Data Architect Data Scientist Business Analyst PowerQuery Self-Service DI embedded in Self-Service BI tools Microsoft Data Factory Dell Boomi SnapLogic IBM DataWorks Informatica Rev Cloud DI Interoperability metadata metadata metadata metadata
  • 44. Metadata Management In A Data Reservoir - EIM Platform Information Catalog And Apache Atlas Stand-alone Data Wrangling tools Services Data Governance/Management Console EIM Tool Suite IT Data Architect Data Scientist Business Analyst PowerQuery Self-Service DI embedded in Self-Service BI tools Microsoft Data Factory Dell Boomi SnapLogic IBM DataWorks Informatica Rev Cloud DI metadata metadata metadata metadata atlas Graph store atlas atlas Information Catalog
  • 45. Metadata Management In A Data Reservoir - Stand-Alone Information Catalog And Apache Atlas Stand-alone Data Wrangling tools Services Data Governance/Management Console EIM Tool Suite IT Data Architect Data Scientist Business Analyst PowerQuery Self-Service DI embedded in Self-Service BI tools Microsoft Data Factory Dell Boomi SnapLogic IBM DataWorks Informatica Rev Cloud DI metadata metadata metadata metadata atlas Graph store atlas atlas Information Catalog metadata atlas
  • 46. New Trusted Data Produced By Refining Un-Modelled Data Should Be Defined In A Business Glossary Raw data In-Progress data Refined data Untrusted Trusted corporate firewall Fit for use Data Refinery sandbox Business Glossary Data Virtualisation Could implement the SBV in a data virtualisation server
  • 47. The Critical Importance Of An Information Catalog – We MUST Be Able To Answer This Question Business user What information exists about……….? An Information Catalogue Where is that likely to be documented?
  • 48. The Information Catalog - What Else Do I Want To Know? Can I search for information? (faceted search via your SBV) Does the data exist? Is the data trusted? (what is the rating) Is the data sensitive? (what is the rating) Is it high business value? (what is the rating) Can I order it? Can I specify where to deliver it to and in what format? Can I see where it is used and who owns it? Information Catalogue
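Several of the questions on slide 48 (is it trusted, is it sensitive, what is its business value) amount to filtering catalog entries by facet. A small sketch of faceted search over an in-memory catalog follows. The entry fields and facet names are assumptions for illustration and are not taken from any catalog product.

```python
from typing import Dict, List

Entry = Dict[str, str]


def faceted_search(entries: List[Entry], **facets: str) -> List[Entry]:
    """Return the entries whose metadata matches every requested facet."""
    return [
        entry for entry in entries
        if all(entry.get(name) == value for name, value in facets.items())
    ]


def facet_counts(entries: List[Entry], facet: str) -> Dict[str, int]:
    """Count entries per facet value, as shown beside e-commerce filters."""
    counts: Dict[str, int] = {}
    for entry in entries:
        value = entry.get(facet, "Unclassified")
        counts[value] = counts.get(value, 0) + 1
    return counts
```

A business user could then narrow a result set step by step, for example `faceted_search(entries, confidence="Trusted", subject="Customer")`, while `facet_counts` supplies the numbers shown next to each filter option.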
  • 49. Information Catalog Example - Waterline Data
  • 50. Faceted Navigation Used In E-Commerce (e.g. Amazon) Is About To Get A Much Bigger Role In Data Management Add it to your cart Select the products you want
  • 51. Ordered Parcel Delivery – The Same Thing Will Happen To Provision Ordered Data Ordered data
  • 52. Virtual Information Provisioning Needs Policy Awareness At Runtime To Create Virtual Views That Enforce Governance Information provisioning service Virtual data subset Virtual full data set security policy (some data not permitted to be seen) (all data permitted to be seen) “Finished-Goods” Refined data Information provisioning service Virtual data subset Virtual full data set compliance policy (some data not allowed to be provisioned outside the country) (all data provisioned inside the country) Data reservoir All data has SBV Data Virtualisation
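The policy-aware provisioning on slide 52 can be illustrated with a function that builds a virtual view at request time: a security policy hides columns the consumer may not see, and a compliance policy withholds rows that may not leave their country. `ProvisioningPolicy`, `provision` and the `country` column are hypothetical names used only in this sketch.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List

Row = Dict[str, object]


@dataclass(frozen=True)
class ProvisioningPolicy:
    """Assumed policy record looked up at run time for a consumer."""
    hidden_columns: FrozenSet[str] = frozenset()  # security policy
    residency_column: str = "country"             # compliance policy


def provision(rows: Iterable[Row], policy: ProvisioningPolicy,
              consumer_country: str) -> List[Row]:
    """Build a virtual view without copying or altering the source rows."""
    view: List[Row] = []
    for row in rows:
        # Compliance: only provision data inside its own country.
        if row.get(policy.residency_column) != consumer_country:
            continue
        # Security: drop columns the consumer is not permitted to see.
        view.append({k: v for k, v in row.items()
                     if k not in policy.hidden_columns})
    return view
```

A consumer with an empty policy in the same country receives the full data set, while a restricted consumer receives a subset, matching the "virtual full data set" and "virtual data subset" boxes on the slide.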
  • 53. Conclusions ■ The challenge is now to manage data in the entire analytical ecosystem ■ Invest in new skills and training needed in this environment ■ Data needs to be organised in a data reservoir to prevent chaos ■ Hadoop is becoming a platform to accelerate cleansing and ETL processing to conduct exploratory analytics ■ Multiple options exist to allow IT and business users to clean and integrate data in preparation for analysis • Data integration vendors have added functionality to support Hadoop • Self-service data cleansing and integration tools also exist ■ The ideal solution is a single platform that supports IT and business user self-service data integration ■ An information catalog is critical for end-to-end data governance • Understanding what data is available (descriptive metadata) • Understanding how it was transformed (metadata lineage) ■ Data virtualisation is needed to see across multiple data reservoirs ■ Start small and build out incrementally – don’t just load data and hope
  • 54. www.intelligentbusiness.biz mferguson@intelligentbusiness.biz Twitter: @mikeferguson1 Tel/Fax (+44)1625 520700 Thank You!