
Organising the Data Lake - Information Management in a Big Data World


  1. Organising The Data Lake - Information Management In A Big Data World. Mike Ferguson, Managing Director, Intelligent Business Strategies. Hadoop Summit Dublin, April 2016.
  2. About Mike Ferguson. Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence, data management and enterprise business integration. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates. Contact: www.intelligentbusiness.biz, mferguson@intelligentbusiness.biz, Twitter: @mikeferguson1, Tel/Fax (+44)1625 520700. Copyright © Intelligent Business Strategies 1992-2016.
  3. Topics
     - The data integration complexity
     - The siloed approach to managing and governing data
     - A new inclusive approach to governing and managing data
     - Introducing the data reservoir and data refinery
     - How do a data reservoir and data refinery work?
     - Mapping new data and insights into your shared business vocabulary
     - The mission critical importance of an information catalog in a distributed data landscape
     - Integrating data reservoirs and data refineries into your existing environment
  4. The Changing Landscape – We Now Have Different Platforms Optimised For Different Analytical Workloads
     Key point: Big Data workloads result in multiple platforms now being needed for analytical processing.
     - Platforms shown: streaming data; Hadoop data store; data warehouse RDBMS (EDW, DW & marts); NoSQL DBMS (graph DB); DW appliance; analytical RDBMS; MDM (CRUD on product, asset and customer data).
     - Workloads shown: traditional query, reporting & analysis; real-time stream processing & decision management; advanced analytics on structured data; advanced analytics on multi-structured data; data mining and model development; investigative analysis and data refinery; graph analysis.
  5. Data Integration Today Has Become Much More Complex - Popular Data Integration Paths Between Platforms
     - Sources shown: ERP, CRM, SCM, Ops, web logs, web, social, XML and JSON, transaction data.
     - Stores shown: EDW, data marts, DW appliance, analytical DBMS, MDM system (CRUD on product, asset and customer data), graph DBMS, NoSQL DBs (column family DB, document DB).
     - Flows shown: insights and transactions (Txns) moving between platforms.
     - Note: cloud data may also be part of it.
  6. Issues: Siloed Analytics - Different Tools To Manage And Integrate Data For Each Type Of Analytical And MDM Store
     Five silos are shown, each with its own data management tools:
     - EDW and marts: structured data from CRM, ERP and SCM, with analytical tools.
     - Multi-structured data: analytical tools/apps.
     - DW appliance for advanced analytics (structured data): structured data from CRM, ERP and SCM, with analytical tools.
     - NoSQL DB, e.g. graph DB: multi-structured & structured data, with analytical tools/apps.
     - MDM (CRUD on product, asset and customer data): master data management fed by CRM, ERP and SCM, with applications.
  7. Issues: Data Deluge - Data Is Arriving Faster Than We Can Consume It
     Diagram labels: data, filter, enterprise, enterprise systems.
  8. With 000’s Of Data Sources, IT And Business Need To Work Together As IT Will Likely Become A Bottleneck
     - Questions raised: Is IT a bottleneck? Should IT be expected to do everything? Can business analysts & data scientists help?
     - Sources shown: OLTP systems, web logs, web, open data, IoT machine data, social & web, cloud.
     - Processing shown: DQ/DI jobs built by IT.
     - Targets shown: MDM (CRUD on product, customer and asset data), data warehousing (DW), Big Data, data virtualisation.
  9. Issues: Have You Got Self-Service Data Integration Causing Chaos In The Enterprise?
     Diagram labels: sources (social, web logs, web, cloud, SCM, CRM, ERP); data scientists' sandboxes on HDFS; ETL/DQ built by IT feeding the DW and marts; self-service BI tools with their own ETL; SQL on Hadoop; new insights.
  10. Problems With The Current Approach
     - Project oriented siloed approach to DI/DQ with limited collaboration
     - Cost of data integration is too high
     - Slow speed of development
     - Multiple DI/DQ technologies and techniques being used that are not integrated
     - Lots of re-invention rather than re-use
     - Fractured metadata across multiple tools or no metadata at all in some cases
     - Risk of duplicate inconsistent DI/DQ rules for same data
     - Metadata lineage is unavailable in many places especially with hand-coded Big Data DI/DQ applications
     - Multiple skill sets fractured across different projects
     - Repetition of our mistakes, e.g. Big Data preparation
     Diagram labels: separate DQ/DI jobs attached to the EDW, MDM, cloud, data virtualisation and self-service tools.
  11. There has to be a better, more governed way to fuel productivity and agility without causing data inconsistency and chaos
     - Tools are available but are not well integrated.
     - Also the whole collaborative, metadata and information catalog piece is incomplete.
     - IT IS NOT ENOUGH – THE WHOLE THING HAS TO BE CO-ORDINATED.
     Diagram labels: separate DQ/DI jobs attached to the EDW, MDM, cloud, data virtualisation and self-service tools.
  12. We Are All In The Same Boat! – Everyone For Themselves Is Not An Option
     Roles shown: IT data architect, data scientist, IT developer, business analyst.
  13. Information Management – Introducing The Data Lake (Reservoir)
  14. What Is A Data Reservoir? - A Collaborative, Governed Environment Aimed At Rapidly Producing Information
     Key point: people need to work together for competitive advantage.
     Diagram labels: communities of IT data architects, IT developers, data scientists, domain experts and business analysts.
  15. Chaos Is NOT An Option – Business Alignment Of Information Being Produced Is Critical To Success
     Key point: we need co-ordinated “info producer” projects in a managed environment.
     - What problem are you trying to solve?
     - What data do you need?
     - What kind(s) of analytic workload are needed?
     Diagram labels: business strategy, strategic objectives, and the projects aligned to them (Big Data projects, DW project, MDM project).
  16. Key Capabilities In A Managed Data Reservoir - 1
     - Data collection
       - Automated discovery of the structure and formatting
       - Data structure inferred by machine learning
       - Automated cataloging, infinite storage and processing
     - Data classification
       - Determines how data should be governed
       - Support is needed for different types of classification schemes, e.g.
         - Retention: Unclassified, Temporary, Project Lifetime, Managed period, Permanent
         - Confidential: Unclassified, Internal use, Business confidential, Supplier confidential, Sensitive (PII), Sensitive (Financial), Sensitive (Operations), Restricted (Trade secret)
         - Confidence: Unclassified, Raw (original), Obsolete, Archived, Trusted
         - Business Value: Unclassified, Unimportant, Marginal, Important, Critical, Catastrophic
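
The four classification schemes on slide 16 could be represented in code roughly as follows. This is only an illustrative sketch: the enum values come from the slide, while the class names, the `DataSetClassification` record and the `is_fully_classified` helper are assumptions made for this example, not part of any product mentioned in the deck.

```python
from dataclasses import dataclass
from enum import Enum


class Retention(Enum):
    UNCLASSIFIED = "Unclassified"
    TEMPORARY = "Temporary"
    PROJECT_LIFETIME = "Project Lifetime"
    MANAGED_PERIOD = "Managed period"
    PERMANENT = "Permanent"


class Confidentiality(Enum):
    UNCLASSIFIED = "Unclassified"
    INTERNAL_USE = "Internal use"
    BUSINESS_CONFIDENTIAL = "Business confidential"
    SUPPLIER_CONFIDENTIAL = "Supplier confidential"
    SENSITIVE_PII = "Sensitive (PII)"
    SENSITIVE_FINANCIAL = "Sensitive (Financial)"
    SENSITIVE_OPERATIONS = "Sensitive (Operations)"
    RESTRICTED_TRADE_SECRET = "Restricted (Trade secret)"


class Confidence(Enum):
    UNCLASSIFIED = "Unclassified"
    RAW = "Raw (original)"
    OBSOLETE = "Obsolete"
    ARCHIVED = "Archived"
    TRUSTED = "Trusted"


class BusinessValue(Enum):
    UNCLASSIFIED = "Unclassified"
    UNIMPORTANT = "Unimportant"
    MARGINAL = "Marginal"
    IMPORTANT = "Important"
    CRITICAL = "Critical"
    CATASTROPHIC = "Catastrophic"


@dataclass
class DataSetClassification:
    """Hypothetical record: one value per scheme, defaulting to Unclassified."""
    retention: Retention = Retention.UNCLASSIFIED
    confidentiality: Confidentiality = Confidentiality.UNCLASSIFIED
    confidence: Confidence = Confidence.UNCLASSIFIED
    business_value: BusinessValue = BusinessValue.UNCLASSIFIED

    def is_fully_classified(self) -> bool:
        """True once no scheme is still left at Unclassified."""
        values = (self.retention, self.confidentiality,
                  self.confidence, self.business_value)
        return all(v.name != "UNCLASSIFIED" for v in values)
```

Defaulting every scheme to Unclassified mirrors the slide, where each scheme starts with that value; newly ingested data can then be flagged until someone classifies it.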
  17. Key Capabilities In A Managed Data Reservoir - 2
     - Collaborative data governance
       - Data quality
       - Data trustworthiness (confidence)
       - Data protection: data privacy, access authorisation, lifecycle management
       - Compliance
     - Data refinery
       - Systematically clean and refine data through various stages
       - Manual and guided data preparation
       - “Sandbox” analyse data to produce high value insights
     - Data as a Service (DaaS)
       - Published high value insights available for consumption
       - Search for and discover trusted insights, subscribe to receive them
     - Data consumption
       - Provision refined, trusted, commonly understood data into any tool or application
  18. A Data Reservoir Is An Organised Collection Of Raw, In-Progress And Trusted Data (Multiple Data Stores)
     - The data reservoir is not a data store but a collection of stores.
     - Data sources and ingested reservoir data are all known to the info catalog.
     - Governance levels shown: raw untrusted data (some governance); in-progress data; refined, trusted & integrated data (strong governance); published trusted data.
     - Stores shown: DW, MDM (CRUD on product, asset and customer data), RDM (CRUD on code sets), data marts, ODS, staging areas, ECM, cloud object storage, archived DW data, Hive tables, NoSQL, search indexes, filtered sensor data, text/image/video.
     - Sources shown: feeds, IoT, XML and JSON, RDBMS, files, office docs, social, cloud, clickstream, web logs, web services.
     - Access shown: data virtualisation services.
  19. Raw Data Is Being Collected In Multiple Places Across The Enterprise – We Need To Know What’s Happening!
     - Collection methods shown: replicate, streaming, batch load, archive.
     - We need to avoid unconnected silos.
     - But we HAVE TO know what is being collected and filtered and where that is happening.
     - Also who is doing it, and for what business purpose?
  20. If Multiple Collection Points Exist Then Something Has To Catalog What Data Is Available, Its Status And Where It Is
     - All data entering a reservoir needs to be catalogued and organised.
     - You need to know what data is available across the enterprise, where it came from, what state it is in, whether we should trust it and whether we can order it.
     Diagram label: information catalogue.
  21. A Distributed Data Reservoir Requires Information Management Software To Work Across Multiple Data Stores
     - Enterprise Information Management (catalog, DQ, ETL, security, privacy…)
     - The data reservoir is distributed but it should be managed and function as if it were centralised.
     - Key requirements
       - Define once, execute anywhere
       - Centralised metadata
       - Distributed execution of policies associated with data quality, ETL, security and lifecycle management across the landscape (multiple execution engines)
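
The "define once, execute anywhere" requirement on slide 21 might be sketched like this. Everything here is hypothetical (the `Policy` and `ExecutionEngine` classes, the store names and the sample records); the point is only that a single policy definition held in central metadata is handed unchanged to an engine running beside each data store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Policy:
    """A rule defined once in central metadata (hypothetical structure)."""
    name: str
    is_compliant: Callable[[dict], bool]


@dataclass
class ExecutionEngine:
    """Stand-in for an engine that runs next to one data store."""
    store: str
    records: List[dict] = field(default_factory=list)

    def violations(self, policy: Policy) -> List[dict]:
        # The engine evaluates the shared policy against its local data.
        return [r for r in self.records if not policy.is_compliant(r)]


def run_everywhere(policy: Policy, engines: List[ExecutionEngine]) -> Dict[str, int]:
    """Push the same policy to every engine and count violations per store."""
    return {e.store: len(e.violations(policy)) for e in engines}


# One data quality policy, defined once.
email_required = Policy("email_required", lambda r: bool(r.get("email")))

# Two illustrative stores, each with its own engine and data.
engines = [
    ExecutionEngine("hadoop", [{"email": "a@example.com"}, {"email": ""}]),
    ExecutionEngine("dw", [{"email": "b@example.com"}]),
]
report = run_everywhere(email_required, engines)
```

Here the data never leaves its store; only the policy travels, which is the property the slide asks for.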
  22. A Distributed Data Reservoir Requires Management And Governance As If It Was Centralised
     - Collection methods shown: replicate, streaming, batch load, archive.
     - The data in the reservoir is distributed but the reservoir is managed and operated as if it were centralised.
  23. Information Production Is A Process That Involves Refining And Integrating Data
     - Flow shown in reservoir storage: raw data, then in-progress data, then trusted data.
     - The output is high value information and/or insights available for consumption.
     - Collaboration is needed to perform many tasks in producing information, e.g. selecting & transforming data.
  24. The Information Production Process Works Across Zones In The Reservoir – Zones Created By Tagging Files
     - Sources shown: IoT, RDBMS, office docs, social, cloud, clickstream, web logs, XML and JSON, web services, NoSQL, files, DW, data streams.
     - Zones shown: data ingestion zone (transient data); raw data zone; refinery zone (prepare & analyse data) with sandboxes, in-progress data and exploratory analysis; trusted data zone; refined data & insights zone; data marketplace.
     - Other labels: master and reference data, DW archive, info catalog, data reservoir management (ETL/data prep, DQ).
  25. Organising Data In A Reservoir – The Catalog Knows About Data Sources Plus Data In All Zones And Sandboxes
     Diagram: the reservoir zone diagram from slide 24.
  26. Operating A Data Reservoir – The Information Production Process Is A Production Line That Spans Reservoir Zones
     Key point: reservoir operations are controlled via the catalog and workflow processes.
     Steps shown against the reservoir zone diagram from slide 24:
     - Nominate new data
     - Classify sensitivity, quality, retention
     - Tag data (what’s it mean?)
     - Assign governance policies based on classification
     - Collaborate about processing
     - Track data freshness
     - Rate its value (★★★★)
     - Map to shared business vocabulary
     - Exploratory analysis, analyse, consume
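
A few of the production-line steps on slides 24 to 26 (nominate, classify, tag, rate, map to vocabulary, move between zones) can be illustrated with a toy catalog. This is a sketch under stated assumptions: the `InfoCatalog` and `CatalogEntry` classes, the linear zone order and the rule that data must be classified before it is promoted are all invented for the example. It reflects the slide's idea that zones are created by tagging, so promotion changes the zone tag and leaves the file where it is.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Zone order assumed for this sketch; the slides show more zones than this.
ZONES = ["ingestion", "raw", "refinery", "trusted"]


@dataclass
class CatalogEntry:
    path: str
    zone: str = "ingestion"
    sensitivity: Optional[str] = None          # set by the classify step
    tags: Set[str] = field(default_factory=set)
    business_terms: Set[str] = field(default_factory=set)
    rating: Optional[int] = None               # 1-5 stars


class InfoCatalog:
    def __init__(self) -> None:
        self.entries: Dict[str, CatalogEntry] = {}

    def nominate(self, path: str) -> CatalogEntry:
        """Register new data; it starts in the ingestion zone."""
        entry = CatalogEntry(path)
        self.entries[path] = entry
        return entry

    def promote(self, path: str) -> str:
        """Retag the entry with the next zone; the file itself is not moved."""
        entry = self.entries[path]
        if entry.sensitivity is None:
            raise ValueError("classify the data before promoting it")
        position = ZONES.index(entry.zone)
        if position < len(ZONES) - 1:
            entry.zone = ZONES[position + 1]
        return entry.zone


catalog = InfoCatalog()
clicks = catalog.nominate("/landing/clickstream/2016-04-13.json")
clicks.sensitivity = "Internal use"                # classify
clicks.tags.add("clickstream")                     # tag
clicks.business_terms.add("Customer Web Session")  # map to shared vocabulary
clicks.rating = 4                                  # rate its value
new_zone = catalog.promote(clicks.path)
```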
  27. Operating A Data Reservoir – Workflows Are Everywhere And Are Components Of An Information Production Process
     Workflows shown against the reservoir zone diagram from slide 24: ingest, stream, movement, refinery, analytical, governance, publish and provision workflows.
  28. Trends – Data And Analytical Workflow (Pipeline) Products Requiring No Programming Are Emerging Everywhere
     - Examples shown: Talend, Alteryx, Microsoft Azure Data Factory, Hortonworks DataFlow (NiFi), Dell Statistica.
     - Questions raised: Who is using what tools? Any reinvention?
  29. Operating A Data Reservoir – All Workflows Should Be Approved And Registered In The Information Catalog
     - Workflows shown against the reservoir zone diagram from slide 24: ingest, stream, movement, refinery, analytical, governance, publish and provision workflows.
     - Convert SSDI workflows to data virtualisation views (virtual views) to minimise re-invention and enforce governance.
  30. Data Strategy Requirements – We Need To Enable Information Producers And Information Consumers
     - Need to make use of
       - A business glossary and information catalog
       - Re-usable services to manage and process data
       - Collaboration and social computing to manage, process and rate data
       - Role-based data management tools aimed at IT AND business
     - Information producers (data scientists, IT professionals) use clean & integrate services to turn raw data into trusted data and publish it to the information catalog.
     - Information consumers (business analysts) search, find, shop, order and consume data in a BI tool or application, like a “corporate iTunes” for data.
  31. A ‘Production Line’ Publish And Subscribe Approach Is Used To Accelerate Information And Insight Production
     Stages shown, each publishing to a catalog that later stages subscribe to and consume from:
     - Crawl, discover and profile data sources; publish discovered data to the info catalog.
     - Acquire and prepare data (clean, transform, filter); publish trusted data as a service to the info catalog.
     - Acquire and integrate data; publish trusted, integrated data as a service to the info catalog.
     - Analyse (e.g. score); publish new predictive analytic pipelines (as a service) to the analytics catalog.
     - Visualise, decide, act; publish new prescriptive analytic pipelines and new analytic applications to the solutions catalog.
     - Other uses, e.g. embedding in analytic applications.
  32. Cataloging, Automated Discovery And Collaboration Are All Needed When Data Is Ingested
     Points shown against the reservoir zone diagram from slide 24:
     - Automated relationship discovery, data profiling, and document clustering
     - Catalog, tag and describe data/files (what’s it about?)
     - Collaborative appraisal
     - Descriptive metadata is critical to keeping things organised.
  33. Governance In A Data Reservoir Is Controlled By Classification And Metadata In The Information Catalog
     Key point: classifications drive the governance.
     Diagram (source: IBM), a metadata model with these elements:
     - Entities: policy, governance rule, information governance rule, classification, business attribute, physical data description, governance action, engine, process, metrics, operational log, IT landscape, data store/document/file/API.
     - Relationships: classified by, actioned by, governs, implemented by, assessed by, mapped to, deployed to, describes, accesses, measures, feeds, logs activity.
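
One way to read slide 33's "classifications drive the governance" is as a chain of lookups: a physical column is mapped to a business attribute, the attribute is classified, and the classification selects the governance rules that are actioned. The sketch below follows that reading. The table contents, the column names and the `mask` action are hypothetical and are not taken from the IBM model.

```python
from typing import Callable, Dict, List


def mask(value: str) -> str:
    """Governance action: hide all but the last two characters."""
    return "*" * max(len(value) - 2, 0) + value[-2:]


# Governance rules keyed by classification (illustrative only).
GOVERNANCE_RULES: Dict[str, List[Callable[[str], str]]] = {
    "Sensitive (PII)": [mask],
    "Internal use": [],
}

# Business attributes are classified...
ATTRIBUTE_CLASSIFICATION: Dict[str, str] = {
    "Customer Email": "Sensitive (PII)",
    "Product Name": "Internal use",
}

# ...and physical columns are mapped to business attributes.
COLUMN_TO_ATTRIBUTE: Dict[str, str] = {
    "cust.email_addr": "Customer Email",
    "prod.name": "Product Name",
}


def govern(column: str, value: str) -> str:
    """Apply whatever rules the column's classification calls for."""
    classification = ATTRIBUTE_CLASSIFICATION[COLUMN_TO_ATTRIBUTE[column]]
    for action in GOVERNANCE_RULES.get(classification, []):
        value = action(value)
    return value
```

Because the rule is attached to the classification and not to the column, reclassifying "Customer Email" would change the treatment of every column mapped to it without touching the columns themselves.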
  34. IBM Are Creating ‘Governance Aware’ Runtimes To Verify And Enforce Policies In A Data Reservoir
     - They access the information catalog to determine what to do at run time.
     - Source: IBM
  35. We Need A Data Refinery To Process, Clean And Analyse Data To Produce Consumable High Value Insight
     - A data refinery should be able to choose where best to refine data to produce the information needed.
     - Places shown: cloud, on-premises, DW, analytical RDBMS, ETL server, data virtualisation server.
  36. A Key Requirement In A Distributed Data Reservoir Is Centralised Development, Distributed Execution
     - Development shown: an EIM tool suite (profiling, cleansing, ELT) with an IT user interface and a self-service UI, working with the info catalog.
     - Execution shown: an execution engine beside each store in the reservoir.
     - The data reservoir is not a data store but a collection of stores.
     - Governance levels shown: raw untrusted data (some governance); in-progress data; refined, trusted & integrated data (strong governance); published trusted data.
     - Stores shown: DW staging area, MDM (CRUD on product, asset and customer data), RDM (CRUD on code sets), data marts, ODS, staging areas, ECM, cloud object storage, archived DW data, Hive tables, NoSQL, search indexes, filtered sensor data, text/image/video.
     - Sources shown: feeds, IoT, XML and JSON, RDBMS, files, office docs, social, cloud, clickstream, web logs, web services.
     - Access shown: data virtualisation services.
  37. If A Data Reservoir Is Distributed With Data Too Big To Move Then Processing Needs To Go To The Data
     - Not centralised, not distributed, but federated.
     - Diagram labels: tasks sent to execution engines running beside on-premises storage, the DW staging area and cloud storage.
  38. Options For Refining Data
     - IT developed ETL processing using EIM tool suites
     - Self-service data integration
     - Multi-role EIM tool suites
       - Can be used by both IT AND business users
     - Data virtualisation server
     - A combination of the above
  39. Scaling ETL Transformations For In-Hadoop ELT Processing
     - Pipeline shown in the data cleansing and integration tool: extract, parse, clean, transform, analyse, load, insights.
     - Option 1: ETL tool generates HQL, or generated SQL is converted to HQL.
     - Option 2: ETL tool generates Pig (the compiler converts every transform to a MapReduce job) or JAQL.
     - Option 3: ETL tool generates 3GL MapReduce or Spark code.
     - Option 4 (other): native massively parallel transformation and integration bypassing any Hadoop execution engine.
     - Examples: Talend, IBM BigIntegrate, Informatica.
  40. Self-Service Data Integration Tool Vendors
     - Actian Dataflow
     - Alteryx
     - ClearStory Data
     - Datameer
     - IBM DataWorks
     - Informatica Rev
     - Paxata
     - SAS Data Loader for Hadoop
     - Tamr
     - Trifacta
     Pipeline shown: acquire data; data preparation (clean, transform, filter); data integration; analyse (e.g. score); visualise; decide; act; embed.
     Two tool categories are marked on that pipeline: tools covering data preparation, integration, analysis & visualisation, and tools covering data preparation and integration only.
  41. Some Data Management Vendors Are Trying To Cover All Roles And Integrate With Other Vendors, e.g. Informatica
     - Roles shown: IT data architect, data scientist, business analyst.
     - Tools shown: Informatica Catalog & Live Data Map, Analyst tool, Informatica Rev (self-service), cloud DI, all sharing metadata with the EIM tool suite.
     - EIM tool suite services: data & metadata relationship discovery; data quality profiling & monitoring; data modeling; data cleansing & matching; data integration; business glossary / info catalog; data privacy & lifecycle management; data audit & protection; data governance/management console.
  42. Some Vendors Are Opening Up Their Service Oriented Data Management Platforms To IT AND Business Users
     - Role-based UIs to the same data management platform: business user (self-service), IT developer, IT data architect, applications.
     - Platform services: data & metadata relationship discovery; data quality profiling & monitoring; data modeling; data cleansing & matching; data integration (virtualisation & ETL); business glossary / info catalog; data privacy & lifecycle management; data audit & protection; data governance/management console; workflow; shared metadata.
     - Connectivity shown: information services on an enterprise service bus (ESB), reaching MDM (CRUD on product, customer and asset data), data warehousing (DW), Big Data, data virtualisation and cloud.
  43. Alternatively Interoperability Is Needed Across Tools To Use Data Preparation Jobs Developed By Different Users
     - Roles shown: IT data architect, data scientist, business analyst.
     - Tools that need metadata interoperability: the EIM tool suite; stand-alone data wrangling tools; self-service DI embedded in self-service BI tools (PowerQuery); cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev).
     - EIM tool suite services: data & metadata relationship discovery; data quality profiling & monitoring; data modeling; data cleansing & matching; data integration; business glossary / info catalog; data privacy & lifecycle management; data audit & protection; data governance/management console.
  44. Metadata Management In A Data Reservoir - EIM Platform Information Catalog And Apache Atlas
     - Roles shown: IT data architect, data scientist, business analyst.
     - Tools shown exchanging metadata through Apache Atlas (graph store): the EIM tool suite with its information catalog and data governance/management console; stand-alone data wrangling tools; self-service DI embedded in self-service BI tools (PowerQuery); cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev).
  45. Metadata Management In A Data Reservoir - Stand-Alone Information Catalog And Apache Atlas
     - Roles shown: IT data architect, data scientist, business analyst.
     - Tools shown exchanging metadata through Apache Atlas (graph store): a stand-alone information catalog; the EIM tool suite with its data governance/management console; stand-alone data wrangling tools; self-service DI embedded in self-service BI tools (PowerQuery); cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev).
  46. New Trusted Data Produced By Refining Un-Modelled Data Should Be Defined In A Business Glossary
     - Flow shown inside the corporate firewall: raw data (untrusted), in-progress data, refined data (trusted, fit for use), produced in the data refinery sandbox.
     - Refined data is defined in the business glossary.
     - Could implement the SBV in a data virtualisation server.
  47. The Critical Importance Of An Information Catalog – We MUST Be Able To Answer This Question
     - Business user: “What information exists about…?”
     - Where is that likely to be documented? An information catalogue.
  48. The Information Catalog - What Else Do I Want To Know?
     - Can I search for information? (faceted search via your SBV)
     - Does the data exist?
     - Is the data trusted? (what is the rating)
     - Is the data sensitive? (what is the rating)
     - Is it high business value? (what is the rating)
     - Can I order it?
     - Can I specify where to deliver it to and in what format?
     - Can I see where it is used and who owns it?
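
Several of the questions on slide 48 (does the data exist, is it trusted, is it sensitive, who owns it) amount to faceted search over catalog metadata. A minimal sketch follows, assuming a hypothetical `CatalogItem` record with a handful of facets; real catalog products will have their own models and query interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CatalogItem:
    """Hypothetical catalog record; each field doubles as a search facet."""
    name: str
    business_term: str   # term from the shared business vocabulary
    trust_rating: int    # e.g. 1-5
    sensitivity: str
    owner: str


def faceted_search(items: List[CatalogItem], **facets) -> List[CatalogItem]:
    """Return the items whose fields equal every requested facet value."""
    return [
        item for item in items
        if all(getattr(item, key) == value for key, value in facets.items())
    ]


def facet_counts(items: List[CatalogItem], facet: str) -> Dict[object, int]:
    """Count items per facet value, as shown beside e-commerce style filters."""
    counts: Dict[object, int] = {}
    for item in items:
        value = getattr(item, facet)
        counts[value] = counts.get(value, 0) + 1
    return counts


ITEMS = [
    CatalogItem("crm_customers", "Customer", 5, "Sensitive (PII)", "Sales"),
    CatalogItem("web_sessions", "Customer", 3, "Internal use", "Marketing"),
    CatalogItem("product_master", "Product", 5, "Internal use", "Operations"),
]

trusted_customer_data = faceted_search(ITEMS, business_term="Customer", trust_rating=5)
```

An empty result answers "does the data exist?" with no, and `facet_counts` gives the per-value totals a user would narrow down from.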
  49. Information Catalog Example - Waterline Data
  50. Faceted Navigation Used In E-Commerce (e.g. Amazon) Is About To Get A Much Bigger Role In Data Management
     - Select the products you want.
     - Add it to your cart.
  51. Ordered Parcel Delivery – The Same Thing Will Happen To Provision Ordered Data
     Diagram label: ordered data.
  52. Virtual Information Provisioning Needs Policy Awareness At Runtime To Create Virtual Views That Enforce Governance
     - Source: “finished-goods” refined data in the data reservoir, accessed through data virtualisation. All data has an SBV.
     - Security policy: the information provisioning service creates a virtual data subset (some data not permitted to be seen) or a virtual full data set (all data permitted to be seen).
     - Compliance policy: the information provisioning service creates a virtual data subset (some data not allowed to be provisioned outside the country) or a virtual full data set (all data provisioned inside the country).
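
The policy-aware virtual views on slide 52 can be illustrated with a small function that builds a view at request time. This is a sketch, not how any data virtualisation server works internally: the sample rows, the `virtual_view` function and its parameters are assumptions. The security policy is modelled as a set of hidden columns, and the compliance policy as a restriction to rows from the consumer's own country.

```python
from typing import Dict, List, Optional, Set

# Illustrative "finished-goods" refined data.
REFINED_DATA: List[Dict[str, object]] = [
    {"customer": "C1", "country": "IE", "revenue": 120.0, "email": "c1@example.com"},
    {"customer": "C2", "country": "DE", "revenue": 80.0, "email": "c2@example.com"},
]


def virtual_view(
    rows: List[Dict[str, object]],
    hidden_columns: Set[str],
    consumer_country: Optional[str] = None,
) -> List[Dict[str, object]]:
    """Build the view on request without copying or changing the source rows.

    hidden_columns models the security policy (data not permitted to be seen).
    consumer_country models the compliance policy: when set, only rows from
    that country are provisioned.
    """
    view = []
    for row in rows:
        if consumer_country is not None and row["country"] != consumer_country:
            continue
        view.append({k: v for k, v in row.items() if k not in hidden_columns})
    return view


# Full data set: no policy restricts this consumer.
full = virtual_view(REFINED_DATA, hidden_columns=set())

# Subset: email is hidden and only Irish rows may be provisioned.
subset = virtual_view(REFINED_DATA, hidden_columns={"email"}, consumer_country="IE")
```

The same refined data serves both consumers; only the view differs, which is what lets governance be enforced at run time.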
  53. Conclusions
     - The challenge is now to manage data in the entire analytical ecosystem
     - Invest in new skills and training needed in this environment
     - Data needs to be organised in a data reservoir to prevent chaos
     - Hadoop is becoming a platform to accelerate cleansing and ETL processing to conduct exploratory analytics
     - Multiple options exist to allow IT and business users to clean and integrate data in preparation for analysis
       - Data integration vendors have added functionality to support Hadoop
       - Self-service data cleansing and integration tools also exist
     - The ideal solution is a single platform that supports IT and business user self-service data integration
     - An information catalog is critical for end-to-end data governance
       - Understanding what data is available (descriptive metadata)
       - Understanding how it was transformed (metadata lineage)
     - Data virtualisation is needed to see across multiple data reservoirs
     - Start small and build out incrementally – don’t just load data and hope
  54. Thank You! www.intelligentbusiness.biz, mferguson@intelligentbusiness.biz, Twitter: @mikeferguson1, Tel/Fax (+44)1625 520700
