© 2015 MapR Technologies
Deploying a Governed Data Lake
Welcome
• Event will be recorded
• Ask your questions in the Q&A Panel in the lower right-hand corner of your screen
• Tweet us @mapr during the event
Key Points
• The data lake is becoming a “real-time” shared service to provide data to the business to support data science and big data analytics needs
• As the data lake becomes a trusted source of data to drive big data analytics, security and data governance have to be addressed
• Security and data governance policies need to be implemented in a way that still enables self-service and quick time to value vs. creating 3-6 month delays
Deliver Data Discovery Agility with a Governed “Data Layer”
• Adhere to security, compliance and data governance policies
• Catalog data assets at scale, with secure provisioning to the business
• Find and understand best-suited and most trusted data
The danger of the data lake becoming a flea market
• Big Data Architect: Can’t create and maintain an inventory fast enough
• Data Engineer/Data Scientist/Business Analyst: Can’t explore everything to find the best item
• CDO/Data Steward: Can’t tell what’s what and what can be trusted
Imagine shopping on Amazon.com
GOVERNANCE
Inventory
Find and Understand
Provision
Governed data lake is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
The Governed Data Lake on Apache Hadoop
• Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data
• Analytics: search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics
• Operational Apps: recommendation; fraud detection; logistics
• MapR Data Platform (MapR-DB, MapR-FS); Distribution including Apache Hadoop
• Data Inventory: find, understand and govern
The Governed Data Lake
Define, Ingest, Inventory, Explore, Provision, Wrangle/Model/Visualize
• Critical data elements
• Sensitive data elements
• Security and data governance policies
• Load
• Profile
• Automatic tagging
• Discover metadata and generate tags
• Discover data lineage
• Manage tags
• Browse/search inventory
• Inspect data quality
• Tag and annotate
• Bookmark
• Copy
• Authorized view
Governed data lake as a shared service
Data Governance and Data Discovery Agility: can you achieve both?
Data protection, authentication, authorization, auditing
Find, understand and govern data in Hadoop
Waterline Data is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
Inventory
Find and Understand
Provision
Future: Generate Drill Views
Governance
The Governed Data Lake on Apache Hadoop with MapR
• Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data
• Analytics: search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics
• Operational Apps: recommendation; fraud detection; logistics
• MapR Data Platform (MapR-DB, MapR-FS); Distribution including Apache Hadoop
• Data Inventory: find, understand and govern
Separate Distinct Data Sets via MapR Volumes
Volumes dramatically simplify management:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• User access and tracking
• Administrative permissions
/projects
    /tahoe
    /yosemite
/user
    /msmith
    /bjohnson
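The idea that each volume carries its own management settings can be sketched in a few lines of Python. This is a toy model, not the MapR API: the `VolumePolicy` class, its field names, and every value in `VOLUMES` are assumptions made for illustration; only the directory names come from the slide.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class VolumePolicy:
    """Hypothetical per-volume settings; field names are illustrative only."""
    mount_path: str
    replication_factor: int
    snapshot_schedule: str  # free-form label in this sketch, e.g. "daily"
    mirror_schedule: str    # free-form label in this sketch, e.g. "none"


# Illustrative values only; the slide does not give any settings.
VOLUMES = [
    VolumePolicy("/projects/tahoe", 3, "daily", "weekly"),
    VolumePolicy("/projects/yosemite", 3, "hourly", "daily"),
    VolumePolicy("/user/msmith", 2, "daily", "none"),
    VolumePolicy("/user/bjohnson", 2, "daily", "none"),
]


def policy_for(path, volumes=VOLUMES):
    """Return the policy of the volume that contains `path`.

    The deepest matching mount path wins; None means no volume matches.
    """
    target = PurePosixPath(path)
    best = None
    best_depth = -1
    for vol in volumes:
        mount = PurePosixPath(vol.mount_path)
        if target == mount or mount in target.parents:
            if len(mount.parts) > best_depth:
                best, best_depth = vol, len(mount.parts)
    return best


if __name__ == "__main__":
    print(policy_for("/projects/tahoe/raw/events.csv"))
    print(policy_for("/tmp/scratch"))
```

The point of the sketch is that a file inherits replication, snapshot and mirroring behavior from the volume it lives in, so separating data sets into volumes separates their policies.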
MapR Trust Model (Product Security)
1. Flexible Authentication
• Wire-level authentication for all services in the cluster
• NSA-level cryptographic algorithms
• Integration with LDAP, Active Directory and other third party directory services
• Kerberos or username/password authentication
2. Granular Authorization
• Access Control Expressions
• Protect files, tables, column families, columns, and management objects
• Extend to role-based access control (RBAC) with custom role functions
• Drill Views
3. Robust Auditing
• All events recorded immediately in JSON log files
• Includes data access and administrative actions
• Ad-hoc queries and custom reports on audit logs via SQL and standard BI tools
4. Ubiquitous Data Protection
• Encryption for Data in Motion
    • Within a Cluster
    • Between Clusters
    • Between Client and Cluster
• Encryption for Data at Rest
    • LUKS
    • Self-Encrypting Disk
    • Partners
MapR Comprehensive Auditing
Serving Security Analysts… (Security, Monitoring, Incident Response)
• Who touched customer records outside of business hours?
• What actions did users take in the days before leaving the company?
• What operations were performed without following change control?
• Are users accessing sensitive files from protected/secured source IPs?
• Why do my reports look different, despite sourcing from the same underlying data?
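As a rough illustration of the first question, the sketch below filters audit records for accesses under a path outside business hours. Everything here is an assumption for illustration: the records are made up, the field names follow the shape of the JSON audit example shown in this deck, the `READ` operation name and the `/customers/` path are hypothetical, and hours are compared in the timestamp's own zone (UTC) for simplicity.

```python
from datetime import datetime

# Hypothetical, simplified audit records (not real log output).
RECORDS = [
    {"timestamp": "2015-06-01T05:24:58.231Z", "operation": "GETATTR",
     "user": "root", "srcPath": "/customers/accounts.csv"},
    {"timestamp": "2015-06-01T14:02:10.000Z", "operation": "READ",
     "user": "msmith", "srcPath": "/customers/accounts.csv"},
    {"timestamp": "2015-06-01T23:41:07.500Z", "operation": "READ",
     "user": "bjohnson", "srcPath": "/customers/accounts.csv"},
    {"timestamp": "2015-06-01T22:15:00.000Z", "operation": "READ",
     "user": "msmith", "srcPath": "/projects/tahoe/notes.txt"},
]


def outside_business_hours(records, path_prefix, start_hour=9, end_hour=17):
    """Return (user, operation, timestamp) for matching after-hours accesses.

    Business hours are [start_hour, end_hour) in the timestamp's zone.
    """
    hits = []
    for rec in records:
        ts = datetime.strptime(rec["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        in_hours = start_hour <= ts.hour < end_hour
        if rec["srcPath"].startswith(path_prefix) and not in_hours:
            hits.append((rec["user"], rec["operation"], rec["timestamp"]))
    return hits


if __name__ == "__main__":
    for user, op, when in outside_business_hours(RECORDS, "/customers/"):
        print(user, op, when)
```

In practice the deck suggests running such questions as SQL over the audit logs; the Python version only shows the shape of the filter.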
MapR Comprehensive Auditing (cont.)
…And Data Scientists Too (Predictive Analytics)
• Which data is used most frequently? Implication: High Value; Share More Broadly
• Which data is least commonly used? Implication: Low Value; Candidate for Purge
• Which data should be used more? Implication: Underutilized; Increase Awareness
• What administrative actions are most commonly performed? Implication: Candidate for automation
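The usage questions above reduce to counting accesses per data set. A minimal sketch, assuming the accessed paths have already been extracted from the audit log into a list; the paths and the `usage_report` helper are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical list of accessed paths pulled from audit records.
ACCESSED = [
    "/projects/tahoe/sales.csv",
    "/projects/tahoe/sales.csv",
    "/projects/yosemite/clicks.json",
    "/projects/tahoe/sales.csv",
]

# Hypothetical catalog of data sets known to exist in the lake.
KNOWN = [
    "/projects/tahoe/sales.csv",
    "/projects/yosemite/clicks.json",
    "/projects/yosemite/archive_2012.csv",
]


def usage_report(accessed, known):
    """Rank known data sets by access count and list the ones never accessed."""
    counts = Counter(accessed)
    # Most used first; ties broken by path so the order is stable.
    ranked = sorted(known, key=lambda path: (-counts[path], path))
    unused = [path for path in known if counts[path] == 0]
    return ranked, unused


if __name__ == "__main__":
    ranked, unused = usage_report(ACCESSED, KNOWN)
    print("most used first:", ranked)
    print("never accessed:", unused)
```

The head of the ranking corresponds to "high value; share more broadly" and the never-accessed list to "candidate for purge"; deciding which data "should be used more" still needs human judgment.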
MapR Audits – Key Features
Data Access
• Files
• MapR-DB Tables
Cluster Operations
• Administrative Operations
• Maprcli commands
Authentication Requests
Secure
High Performance
Flexible
• Retention Period
• Maxsize
• Coalesce Interval
JSON Format
{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR",
"user":"root","uid":"0","ipAddress":"10.10.x.x",
"nfsServer":"10.10.x.x","srcPath":"/dbtest.0/",
"srcFid":"2147.16.2","VolumeName":"mktg_files",
"volumeId":"mktg_files","status":"0"}
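Because each audit event is a JSON object, it can be read with any JSON parser. The sketch below parses the sample record from the slide, with its curly quotes and line wraps normalized, and pulls out who did what, where and when. The `summarize` helper and its output keys are hypothetical, and treating `"status":"0"` as success is an assumption, not something the slide states.

```python
import json

# The slide's sample record, joined onto one logical line.
RAW = (
    '{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR",'
    '"user":"root","uid":"0","ipAddress":"10.10.x.x",'
    '"nfsServer":"10.10.x.x","srcPath":"/dbtest.0/",'
    '"srcFid":"2147.16.2","VolumeName":"mktg_files",'
    '"volumeId":"mktg_files","status":"0"}'
)


def summarize(line):
    """Parse one audit line and return a small who/what/where/when summary."""
    rec = json.loads(line)
    ts = rec["timestamp"]
    # The slide shows the time wrapped as "{$date=...}"; strip it if present.
    if ts.startswith("{$date=") and ts.endswith("}"):
        ts = ts[len("{$date="):-1]
    return {
        "when": ts,
        "who": rec["user"],
        "what": rec["operation"],
        "where": rec["srcPath"],
        "ok": rec["status"] == "0",  # assumption: "0" means success
    }


if __name__ == "__main__":
    print(summarize(RAW))
```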
Access Control that Scales
• PAM Authentication + User Impersonation
• Fine-grained row and column level access control with Drill Views; no centralized security repository required
(Diagram: users reach Files, HBase and Hive through Drill View 1 and Drill View 2.)
Ownership Chaining
Combine Self Service Exploration with Data Governance
Raw File (/raw/cards.csv), John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist), Jane (Read), John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst), Jack (Read), Jane (Owner)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
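The three tables can be read as a raw file plus two layered views: one that masks the card number and one that drops it. In Drill these would be SQL views; the Python functions below are only a stand-in to show the layering, and the function names and masking rule (keep the first group, overwrite the rest with 1111) are inferred from the slide's tables rather than taken from any view definition.

```python
RAW_ROWS = [
    {"Name": "Dave", "City": "San Jose", "State": "CA",
     "Credit Card #": "1374-7914-3865-4817"},
    {"Name": "John", "City": "Boulder", "State": "CO",
     "Credit Card #": "1374-9735-1794-9711"},
]


def v_scientist(rows):
    """Stand-in for V_Scientist: keep the first card group, mask the rest."""
    masked = []
    for row in rows:
        first_group = row["Credit Card #"].split("-")[0]
        masked.append({**row, "Credit Card #": f"{first_group}-1111-1111-1111"})
    return masked


def v_analyst(rows):
    """Stand-in for V_Analyst: built on v_scientist, without the card column."""
    return [
        {key: value for key, value in row.items() if key != "Credit Card #"}
        for row in v_scientist(rows)
    ]


if __name__ == "__main__":
    for row in v_scientist(RAW_ROWS):
        print(row)
    for row in v_analyst(RAW_ROWS):
        print(row)
```

Because `v_analyst` reads through `v_scientist` rather than the raw rows, the analyst view never touches unmasked card numbers, which mirrors the layering on the slide.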
Jack queries the view V_Analyst. Access path: V_Analyst -> V_Scientist -> RAWFILE
• Does Jack have access to V_Analyst? -> Yes
• Who is the owner of V_Analyst? -> Jane
• Drill accesses V_Analyst as Jane (impersonation hop 1)
• Does Jane have access to V_Scientist? -> Yes
• Who is the owner of V_Scientist? -> John
• Drill accesses V_Scientist as John (impersonation hop 2)
• Does John have permissions on the raw file? -> Yes
• Who is the owner of the raw file? -> John
• Drill accesses the source file as John (no impersonation here)
* Ownership chain length (# hops) is configurable
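The walk from V_Analyst down to the raw file can be modeled as a loop that checks access, switches to the owner, and counts hops. This is a simplified model of the slide's example, not Drill's implementation: the `OBJECTS` table, the `resolve` function and the `max_hops` parameter are hypothetical names, and the hop-counting rule is inferred from the three steps shown.

```python
# Each object: its owner, who else may read it, and what it is built on.
OBJECTS = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "/raw/cards.csv"},
    "/raw/cards.csv": {"owner": "John", "readers": set(), "source": None},
}


def resolve(user, name, max_hops=2, objects=OBJECTS):
    """Follow the ownership chain and return [(object, accessed_as), ...].

    Raises PermissionError if a user in the chain lacks access or if the
    number of impersonation hops exceeds max_hops.
    """
    trace = []
    hops = 0
    current_user = user
    while name is not None:
        obj = objects[name]
        if current_user != obj["owner"] and current_user not in obj["readers"]:
            raise PermissionError(f"{current_user} cannot read {name}")
        if current_user != obj["owner"]:
            # Switching identity to the object's owner costs one hop.
            hops += 1
            if hops > max_hops:
                raise PermissionError(
                    f"ownership chain exceeds {max_hops} hops at {name}"
                )
            current_user = obj["owner"]
        trace.append((name, current_user))
        name = obj["source"]
    return trace


if __name__ == "__main__":
    for obj_name, accessed_as in resolve("Jack", "V_Analyst"):
        print(f"{obj_name} accessed as {accessed_as}")
```

With `max_hops=2` Jack's query succeeds exactly as on the slide (Jane, then John, then John again with no further hop); lowering the limit to 1 makes the same query fail at V_Scientist, which is the effect of a configurable chain length.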
Find, Understand and Govern Data in Hadoop at Scale and in Real-Time
• CDO/Data Steward: Discover and protect sensitive data, audit and authorize access to the data lake, discover data lineage, and provide data stewardship
• Big Data Architect: Automate cataloging of data assets at scale, with secure provisioning to business users
• Data Engineer/Data Scientist/Business Analyst: Find and understand best-suited and most trusted data without having to explore every file manually
Learn More
www.waterlinedata.com
• Watch the solution video
• Read analyst papers
• Download the free Waterline Data / MapR sandbox
• Request a demo
• Download and evaluate the product
www.mapr.com
• Get free On-Demand Training for Hadoop
• Download the free Waterline Data / MapR sandbox

Best Practices to Deploy a Governed Data Lake
