How to Build a
Successful Data Lake
Alex Gorelik
Waterline Data
Founder and CEO
Data Lakes Power Data-Driven Decision Making
Maximize Business Value With a Data
Lake
How Do You Democratize the Data Lake to Maximize Business
Value?
[Chart: Business Value (No Value to Enterprise Impact) plotted against Data Democratization (Tight Control, “Governed”, to Self-Service). Plotted items: Data Lake, Data Puddle, Data Swamp, DW Offloading]
Data Swamps
Raw data
Can’t find or use
data
Can’t allow access
without protecting
sensitive data
Data Warehouse Offloading: Cost Savings
I prefer a data warehouse; it’s more predictable
It takes IT 3 months of data architecture and ETL work to add new data to the data lake
I can’t get the original data

Low variety of data and low adoption
• Focused use case (e.g., fraud detection)
• Fully automated programs (e.g., ETL off-loading)
• Small user community (e.g., data science sand box)
Strong technical skill set requirement
Data Puddles: Limited Scope and Value
What Makes a Successful Data Lake?
Right Platform + Right Data + Right Interface
Right Platform:
• Volume—Massively scalable
• Variety—Schema on read
• Future proof—modular—same data can be used by
many different projects and technologies
• Platform cost – extremely attractive cost structure
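The "schema on read" bullet above can be illustrated with a minimal Python sketch. It assumes raw events are kept as JSON lines; the sample records, field names, and the two schemas are hypothetical and do not come from the slides. The raw lines are never modified, and each project applies its own schema only when it reads them.

```python
import json

# Raw records are stored exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2016-06-28T10:00:00", "country": "US"}',
    '{"user": "b2", "amount": "5", "ts": "2016-06-28T10:05:00"}',
]

# Each project declares only the fields it needs and how to parse them.
BILLING_SCHEMA = {"user": str, "amount": float}
GEO_SCHEMA = {"user": str, "country": str}


def read_with_schema(lines, schema):
    """Parse raw JSON lines, applying the given schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of blocking ingestion.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


billing_rows = read_with_schema(RAW_LINES, BILLING_SCHEMA)
geo_rows = read_with_schema(RAW_LINES, GEO_SCHEMA)
```

Because the same raw lines serve both schemas, a new project can add a third schema later without re-ingesting anything.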
Right Data Challenges
Most Data is Lost, So it Can’t Be Analyzed Later
Only a small portion of data in enterprises today
is saved in data warehouses
Data Exhaust
Right Data: Save Raw Data Now to Analyze Later
• Don’t know now what data will be
needed later
• Save as much data as possible now
to analyze later
• Save raw data, so it can be treated
correctly for each use case
• Departments hoard and protect
their data and do not share it with
the rest of the enterprise
• Frictionless ingestion does not
depend on data owners
Right Data Challenges: Data Silos and Data Hoarding
Right Interface: Key to Broad Adoption
• Data marketplace for
data self-service
• Providing data at the
right level of expertise
Providing Data at the Right Level of Expertise
Data scientists: raw data
Business analysts: clean, trusted, prepared data
Roadmap to Data Lake Success
Organize the lake
Set up for self-service
Open the lake to the users
Organize the Data Lake into Zones (roadmap step: Organize the lake)
Multi-modal IT – Different Governance
Levels for Different Zones
Raw or
Landing Sensitive
Gold or
Curated
Work
Data Stewards
Data Scientists
Data Engineers
Data Scientists, Business Analysts
• Minimal governance
• Make sure there is no
sensitive data
• Minimal governance
• Make sure there is no
sensitive data
• Heavy governance
• Trusted, curated data
• Lineage, data quality
• Heavy governance
• Restricted access
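One way to picture different governance levels per zone is a small policy table. This is a sketch only: the flattened slide text does not show which governance bullets belong to which zone, so the pairing below (Raw and Work minimal, Gold and Sensitive heavy, only Sensitive restricted) is an assumption, and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    governance: str         # "minimal" or "heavy"
    restricted: bool        # True if reading requires explicit approval
    allows_sensitive: bool  # True if sensitive data may be stored here


# Assumed pairing of zones and governance levels (illustrative only).
ZONES = {
    "raw": Zone("Raw or Landing", "minimal", False, False),
    "work": Zone("Work", "minimal", False, False),
    "gold": Zone("Gold or Curated", "heavy", False, False),
    "sensitive": Zone("Sensitive", "heavy", True, True),
}


def can_store(zone_key, has_sensitive_data):
    """Sensitive data may only land in a zone that allows it."""
    zone = ZONES[zone_key]
    return zone.allows_sensitive or not has_sensitive_data


def can_read(zone_key, user_is_approved):
    """Restricted zones require approval; other zones are open."""
    zone = ZONES[zone_key]
    return user_is_approved or not zone.restricted
```

Keeping the rules in one table means a zone's policy can change without touching the checks that enforce it.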
Business Analyst Self-Service Workflow
Find and Understand → Provision → Prep → Analyze (roadmap step: Set up for self-service)
Finding, understanding and governing data in
a data lake is like shopping at a flea market
“We have 100 million fields of data – how can anyone find or trust
anything?” – Telco Executive
Botond Horvath / Shutterstock.com
DATA SCIENTIST / BUSINESS ANALYST: Need data to use with self-service tools but can’t explore everything manually to find and understand data
DATA STEWARD: Can’t govern and trust data (unknown metadata, data quality, PII, data lineage)
BIG DATA ARCHITECT: Can’t catalog all the data manually and keep up with data provisioning
Instead, Imagine Shopping on Amazon.com
Catalog
Find, Understand And
Collaborate
Provision
Catalog
Find, Understand And
Collaborate
Provision
Waterline Data is like Amazon for Data in Hadoop
Finding and Understanding Data
• Crowdsource metadata and automate
creation of a catalog
• Institutionalize tribal data knowledge
• Automate discovery to cover all data
sets
• Establish trust
• Curated annotated data sets
• Lineage
• Data quality
• Governance
Find and
Understand
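The idea of automating discovery so that every field gets catalogued can be sketched in a few lines of Python. This is an illustration only, not how any particular catalog product works: the regular expressions, the 0.8 threshold, and the column names are all assumptions. Each column's sample values are matched against a few patterns, and a tag is suggested when most values fit.

```python
import re

# Hypothetical tag patterns; a real catalog would use far richer profiling.
TAG_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def suggest_tags(values, threshold=0.8):
    """Return tags whose pattern fully matches at least `threshold` of the non-empty values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in TAG_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            tags.append(tag)
    return tags


def build_catalog(table):
    """Map each column name to its suggested tags."""
    return {column: suggest_tags(values) for column, values in table.items()}


# Cryptic column names stand in for fields nobody has documented.
sample = {
    "col_17": ["ann@example.com", "bob@example.org", ""],
    "col_18": ["2016-06-28", "2016-06-29", "n/a"],
    "col_19": ["123-45-6789", "987-65-4321"],
}
catalog = build_catalog(sample)
```

Suggestions like these are only a starting point; a data steward or analyst would still confirm or reject each tag, which is where crowdsourced knowledge enters the catalog.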
Accessing and Provisioning Data
You cannot give all access to all users
You must protect PII data and sensitive business information
Provision
Agile/Self-service
approach
Create a metadata-only catalog
When users request access,
data is de-identified and
provisioned
Top down approach
Find and de-identify all
sensitive data
Provide access to every user for
every dataset as needed
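The agile, self-service approach above can be sketched as follows: the catalog holds metadata only, and values are de-identified at the moment a user requests a dataset. The dataset, column tags, salt, and the hash-based token are hypothetical choices for illustration, not a vetted anonymization scheme.

```python
import hashlib

# Metadata-only catalog: describes datasets without exposing values.
# Dataset name, column names, and tags are hypothetical.
CATALOG = {
    "customers": {"columns": {"name": "pii", "email": "pii", "city": "public"}},
}

# Stand-in for data stored in the lake.
LAKE = {
    "customers": [
        {"name": "Ann Lee", "email": "ann@example.com", "city": "Austin"},
    ],
}


def de_identify(value, salt="demo-salt"):
    """Replace a value with a salted hash token (illustrative only)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def provision(dataset):
    """De-identify tagged columns only when a user requests the dataset."""
    tags = CATALOG[dataset]["columns"]
    result = []
    for row in LAKE[dataset]:
        result.append({
            col: de_identify(val) if tags.get(col) == "pii" else val
            for col, val in row.items()
        })
    return result


provisioned = provision("customers")
```

The stored rows are left unchanged, so the de-identification cost is paid only for datasets that someone actually asks for, which is the contrast with the top-down approach.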
Provide a Self-Service Interface to Find,
Understand, and Provision Data
Prepare data for analytics Prep
Clean data
Remove or fix bad data, fill in
missing values, convert to
common units of measure
Shape data
Combine (join, concatenate)
Resolve entities (create a single
customer record from multiple
records or sources)
Transform (aggregate, bucketize,
filter, convert codes to names, etc.)
Blend data
Harmonize data from multiple
sources to a common schema
or model
Tooling
Many great dedicated data
wrangling tools on the horizon
Some capabilities in BI and data
visualization tools
SQL and scripting languages for
the more technical analysts
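For the more technical analysts who script their own preparation, the clean and shape steps might look like the sketch below. The records, customer names, and unit table are hypothetical, and filling missing weights with 0.0 is a deliberately simple choice that a real project would revisit.

```python
from collections import defaultdict

# Hypothetical order records with mixed units, a gap, and a bad value.
orders = [
    {"customer_id": 1, "weight": 2.0, "unit": "kg"},
    {"customer_id": 1, "weight": 1000.0, "unit": "g"},
    {"customer_id": 2, "weight": None, "unit": "kg"},
    {"customer_id": 3, "weight": -5.0, "unit": "kg"},
]
customers = {1: "Acme", 2: "Globex", 3: "Initech"}

# Conversion factors to the common unit of measure (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001}


def clean(rows):
    """Drop negative weights, fill missing weights with 0.0, convert to kg."""
    cleaned = []
    for row in rows:
        weight = 0.0 if row["weight"] is None else row["weight"]
        if weight < 0:
            continue  # remove bad data
        cleaned.append({
            "customer_id": row["customer_id"],
            "weight_kg": weight * TO_KG[row["unit"]],
        })
    return cleaned


def shape(rows, names):
    """Join rows to customer names and aggregate total weight per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[names[row["customer_id"]]] += row["weight_kg"]
    return dict(totals)


totals = shape(clean(orders), customers)
```

Keeping the clean and shape steps as separate functions mirrors the slide's split and lets each one be checked on its own.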
Data Analysis
• Many wonderful self-service BI and data visualization tools
• Mature space with many
established and
innovative vendors
Magic Quadrant for Business Intelligence and Analytics Platforms
04 February 2016 | ID:G00275847
Analyst(s): Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich
Analyze
Unlock the Value of the Data Lake with the
Waterline Data Smart Data Catalog
Time To Value Tribal Knowledge Sharing Trust
Waterline Data Is The Only Smart Data
Catalog For The Data Lake
“Use an INFORMATION
CATALOG TO MAXIMIZE
BUSINESS VALUE From
Information Assets”
“automatically identify, profile,
and metatag files in HDFS and
make them available for
analysis and exploration”
“tapped into an important and
underserved opportunity”
“comprehensive big data
governance and discovery
platform”
“opens the data to a
wider variety of people”
“fills a critical gap in big data
exploratory analytics by
automating the tagging and
cataloging of data”
Current Customers
Healthcare
Insurance
Life Sciences
Aerospace
Automotive
Banking
Government
Marketing
"Opening up a data lake for self-service analytics requires a
data catalog that's smart enough to automatically catalog every
field of data so business analysts can maximize time to value” --
Jerry Megaro, Global Head Of Data Analytics, Merck KGaA
“Understanding where your data came from and what it means
in context is vital to making a data lake initiative successful and
not just another data quagmire – the catalog plays a critical
component in this” -- Global Head of Data Governance, Risk,
and Standard, International Multi-Line Insurer
“A governed yet agile data catalog is key to open up the data
lake to business people” -- Paolo Arvati, Big Data, CSI-
Piemonte
We Run Natively On Hadoop And Integrate
With Existing Tools
Workflow of Enabling Self-Service Analytics With Hortonworks
[Diagram: workflow stages with Hortonworks Atlas and Ranger]
• Smart Data Discovery: profiling, sensitive data and data lineage discovery, automated tagging
• Data Stewardship: curate tags
• Self-Service Data Catalog: find, collaborate, and take action
• Data Prep, then Analytics & Visualization
• Connector labels: Metadata, Tags, Data Lineage; Metadata, Tags, Roles & Access Control; Roles & Access Control
A Successful Data Lake
Right Platform + Right Data + Right Interface
Come to Booth 303 to see a demo
and talk to us about your data lake
Come to the Atlas session at 4:00 PM on
Thursday in room 210C
Waterline Data
The Smart Data Catalog Company

 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 

Build a Successful Data Lake with a Smart Data Catalog

  • 1. How to Build a Successful Data Lake. Alex Gorelik, Founder and CEO, Waterline Data
  • 2. Data Lakes Power Data-Driven Decision Making
  • 3. Maximize Business Value With a Data Lake. How Do You Democratize the Data Lake to Maximize Business Value? [Diagram labels: Data Lake, Data Puddle, Data Swamp, DW Off-loading, Data Democratization, Business Value, No Value, Enterprise Impact, Tight Control, “Governed” Self-Service]
  • 4. Data Swamps: raw data; can’t find or use data; can’t allow access without protecting sensitive data
  • 5. Data Warehouse Offloading: Cost Savings. “I prefer a data warehouse; it’s more predictable.” “It takes IT 3 months of data architecture and ETL work to add new data to the data lake.” “I can’t get the original data.”
  • 6. Data Puddles: Limited Scope and Value. Low variety of data and low adoption: • Focused use case (e.g., fraud detection) • Fully automated programs (e.g., ETL off-loading) • Small user community (e.g., data science sandbox). Strong technical skill set requirement.
  • 7. What Makes a Successful Data Lake? Right Platform + Right Data + Right Interface
  • 8. Right Platform: • Volume—Massively scalable • Variety—Schema on read • Future proof—modular—same data can be used by many different projects and technologies • Platform cost – extremely attractive cost structure
  • 9. Right Data Challenges: Most Data Is Lost, So It Can’t Be Analyzed Later. Only a small portion of data in enterprises today is saved in data warehouses. [Diagram label: Data Exhaust]
  • 10. Right Data: Save Raw Data Now to Analyze Later • Don’t know now what data will be needed later • Save as much data as possible now to analyze later
  • 11. Right Data: Save Raw Data Now to Analyze Later • Don’t know now what data will be needed later • Save as much data as possible now to analyze later • Save raw data, so it can be treated correctly for each use case
  • 12. Right Data Challenges: Data Silos and Data Hoarding • Departments hoard and protect their data and do not share it with the rest of the enterprise • Frictionless ingestion does not depend on data owners
  • 13. Right Interface: Key to Broad Adoption • Data marketplace for data self-service • Providing data at the right level of expertise
  • 14. Providing Data at the Right Level of Expertise. Data scientists: raw data. Business analysts: clean, trusted, prepared data.
  • 15. Roadmap to Data Lake Success: (1) Organize the lake, (2) Set up for self-service, (3) Open the lake to the users
  • 16. Organize the Data Lake into Zones (roadmap step: Organize the lake)
  • 17. Multi-modal IT: Different Governance Levels for Different Zones. Zones: Raw or Landing; Sensitive; Gold or Curated; Work. Users: Data Stewards; Data Scientists; Data Engineers; Data Scientists and Business Analysts. Governance levels: • Minimal governance: make sure there is no sensitive data (shown for two zones) • Heavy governance: trusted, curated data; lineage, data quality • Heavy governance: restricted access
  • 18. Business Analyst Self-Service Workflow (roadmap step: Set up for self-service): Find and Understand, Provision, Prep, Analyze
  • 19. Finding, understanding, and governing data in a data lake is like shopping at a flea market. “We have 100 million fields of data – how can anyone find or trust anything?” (Telco Executive)
  • 20. DATA SCIENTIST / BUSINESS ANALYST: Need data to use with self-service tools but can’t explore everything manually to find and understand data. DATA STEWARD: Can’t govern and trust data (unknown metadata, data quality, PII, data lineage). BIG DATA ARCHITECT: Can’t catalog all the data manually and keep up with data provisioning. (Image credit: Botond Horvath / Shutterstock.com)
  • 21. Instead, Imagine Shopping on Amazon.com: Catalog; Find, Understand, and Collaborate; Provision
  • 22. Waterline Data Is Like Amazon for Data in Hadoop: Catalog; Find, Understand, and Collaborate; Provision
  • 23. Finding and Understanding Data (workflow step: Find and Understand) • Crowdsource metadata and automate creation of a catalog • Institutionalize tribal data knowledge • Automate discovery to cover all data sets • Establish trust: curated annotated data sets, lineage, data quality, governance
  • 24. Accessing and Provisioning Data (workflow step: Provision). You cannot give all access to all users. You must protect PII data and sensitive business information. Top-down approach: find and de-identify all sensitive data; provide access to every user for every dataset as needed. Agile/self-service approach: create a metadata-only catalog; when users request access, data is de-identified and provisioned.
  • 25. Provide a Self-Service Interface to Find, Understand, and Provision Data
  • 26. Prepare Data for Analytics (workflow step: Prep). Clean data: remove or fix bad data, fill in missing values, convert to common units of measure. Shape data: combine (join, concatenate); resolve entities (create a single customer record from multiple records or sources); transform (aggregate, bucketize, filter, convert codes to names, etc.). Blend data: harmonize data from multiple sources to a common schema or model. Tooling: many great dedicated data wrangling tools on the horizon; some capabilities in BI and data visualization tools; SQL and scripting languages for the more technical analysts.
  • 27. Data Analysis (workflow step: Analyze) • Many wonderful self-service BI and data visualization tools • Mature space with many established and innovative vendors. [Figure: Magic Quadrant for Business Intelligence and Analytics Platforms, 04 February 2016, ID: G00275847, Analyst(s): Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich]
  • 28. Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog: Time to Value; Tribal Knowledge Sharing; Trust
  • 29. Waterline Data Is the Only Smart Data Catalog for the Data Lake. “Use an Information Catalog to Maximize Business Value From Information Assets”; “automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration”; “tapped into an important and underserved opportunity”; “comprehensive big data governance and discovery platform”; “opens the data to a wider variety of people”; “fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data”
  • 30. Current Customers: Healthcare, Insurance, Life Sciences, Aerospace, Automotive, Banking, Government, Marketing. “Opening up a data lake for self-service analytics requires a data catalog that's smart enough to automatically catalog every field of data so business analysts can maximize time to value” (Jerry Megaro, Global Head of Data Analytics, Merck KGaA). “Understanding where your data came from and what it means in context is vital to making a data lake initiative successful and not just another data quagmire – the catalog plays a critical component in this” (Global Head of Data Governance, Risk, and Standard, International Multi-Line Insurer). “A governed yet agile data catalog is key to open up the data lake to business people” (Paolo Arvati, Big Data, CSI-Piemonte).
  • 31. We Run Natively On Hadoop And Integrate With Existing Tools
  • 32. Workflow of Enabling Self-Service Analytics With Hortonworks. [Diagram components: Smart Data Discovery (profiling, sensitive data and data lineage discovery, automated tagging); Data Stewardship (curate tags); Self-Service Data Catalog (find, collaborate, and take action); Data Prep; Analytics & Visualization; Hortonworks Atlas and Ranger. Flow labels: “Metadata, Tags, Data Lineage”; “Metadata, Tags, Roles & Access Control”; “Roles & Access Control”.]
  • 33. A Successful Data Lake: Right Platform + Right Data + Right Interface
  • 34. Come to Booth 303 to see a demo and talk to us about your data lake. Come to the Atlas session at 4:00 PM on Thursday in room 210C.
  • 35. Waterline Data: The Smart Data Catalog Company
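
The agile approach on slide 24 (a metadata-only catalog, with data de-identified when a user requests access) could be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not Waterline Data's implementation: `CatalogEntry`, `mask`, and `provision` are hypothetical names, and the truncated SHA-256 hash is only a stand-in for whatever de-identification method a real deployment would use.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Metadata-only catalog record: it holds no data, only column flags."""
    name: str
    sensitive: dict  # column name -> True if the column is sensitive (hypothetical flag)


def mask(value: str) -> str:
    """Stand-in de-identification: replace the value with a short hash token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def provision(entry: CatalogEntry, rows: list) -> list:
    """De-identify rows at request time; columns missing from the catalog are masked too."""
    return [
        {
            col: mask(str(val)) if entry.sensitive.get(col, True) else val
            for col, val in row.items()
        }
        for row in rows
    ]


if __name__ == "__main__":
    entry = CatalogEntry("customers", {"email": True, "region": False})
    raw_rows = [{"email": "pat@example.com", "region": "EU"}]
    print(provision(entry, raw_rows))
```

Masking happens only inside `provision`, so nothing has to be de-identified up front, which is the contrast with the top-down approach on the same slide. Treating uncataloged columns as sensitive is a conservative default chosen for this sketch.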

Editor's Notes

  1. End-user tools only provide the last mile to leverage data; in and of themselves, they don’t know where the right data is. The right data has to be found, quickly and securely.
  2. The opposite of a flea market is Amazon. It gives the consumer self-service, but it functions as a managed application.
  3. Like Amazon, we offer a solution that catalogs the data assets, provides a front-end to find, understand, and share, and provides a way to take action and quickly open the data in any end-user tool to wrangle, visualize, or analyze the data.
  4. A data lake provides one place where any data can be saved and used by business analysts and data scientists to mash up data in new ways to answer new business questions. Waterline Data enables you to open up the data lake to business analysts and data scientists so they can do data prep, analytics, or modeling. Our product delivers value along 3 dimensions (i.e., the 3 T’s). Time to value: we catalog every field of data for the entire data lake and we provide an interface to quickly find, understand, and take action on the data (e.g., you can provision or open the data in Trifacta). The end result is faster time to uncover value. Tribal knowledge: we don’t just discover what the data means; we also empower subject matter experts to augment the data catalog with additional tags and comments to capture additional information, such as the intended use of the data, to help accelerate future projects. Trust: we facilitate data governance by tagging data based on approved business glossaries and data stewardship curation, as well as by providing secure self-service access to the data based on roles and visibility rules.
  5. Waterline Data has been acknowledged as filling an important gap in opening up data lakes for self-service data preparation and analytics. The need for a data catalog has been recognized as key to enabling a data democracy and self-service by the business. For instance, Gartner just released a paper on how CDOs can leverage an information catalog to get more business value from data assets. We are the only company that can build a data catalog automatically, and at scale, for a data lake.
  6. We have customers in production across many industries. They realize value by being able to catalog all the data quickly and make it easily available to the business to do self-service data preparation and analytics. They also get value from the fact that the data catalog supports agile data governance, by enabling data stewards to quickly curate tags, and by providing several levels of access control based on the data governance policies (e.g., access to sensitive data is protected). (If they ask: data lakes range from smaller 5-node clusters to over 100 nodes, so our product can be used right away even when the lake is small, and grow to a large lake.)
  7. Our product runs natively on the major platforms like AWS, Cloudera, Hortonworks, MapR, and Pivotal. We are also in the process of certifying on IIP. We integrate with existing data management tools: we can import and export data lineage and tag information with Atlas and Navigator; we support access control policies and integrate with LDAP, Ranger, and Sentry; we can import existing business glossaries from Collibra, Informatica, or IBM (note this is done through our API, so we should be able to import from any business glossary); we can integrate with ETL tools to import metadata; and we integrate with end-user tools through an open framework (we provide the ability to generate Hive tables automatically, as well as the ability to open the data directly in end-user tools).
  8. Waterline Data accelerates the creation of the data catalog at big data scale. We parse, profile, and discover sensitive data and data lineage, and automatically tag fields based on an integrated business glossary and tagging rules. We empower data stewards to quickly curate tags. We empower business analysts and data scientists to quickly find the right data they need and take immediate action with the data by being able to open it with the desired end-user tool.
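
To make the idea of tagging fields from rules (note 8) more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the product's algorithm: the `TAG_RULES` patterns, the `suggest_tags` function, and the 0.8 match threshold are all hypothetical, and a real glossary would carry far richer rules than two regular expressions.

```python
import re

# Hypothetical tagging rules: tag name -> pattern a field value should match.
TAG_RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}


def suggest_tags(values, threshold=0.8):
    """Suggest tags for one field from a sample of its values.

    A tag is suggested when at least `threshold` of the non-empty
    sample values match that tag's pattern.
    """
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in TAG_RULES.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            tags.append(tag)
    return tags


if __name__ == "__main__":
    print(suggest_tags(["pat@example.com", "lee@example.org", ""]))
    print(suggest_tags(["94105", "10001-1234"]))
```

Suggestions like these would still go to a data steward for curation, as the note describes; the threshold only controls how much inconsistent data a field may contain before a tag stops being proposed.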