Hadoop has enabled a new scale of data processing that is paving the way for data-driven businesses. However, business data is often riddled with compliance and regulatory requirements that can easily be lost as data is manipulated, transformed, and rewritten within the Hadoop ecosystem. Furthermore, enterprise data is often scattered across a wide array of systems, each with its own techniques for policy management. As data from these disparate systems is loaded into Hadoop, all of the carefully crafted policy is immediately lost, creating a potential risk for the business. Data provenance is widely recognized as a technique for applying policy in more traditional fields such as storage, databases, and high-performance computing. By tracking data from its origin and across various transformations and computations, provenance tracking systems can answer questions such as: Who has seen a given piece of data? Where did this data come from? What policies existed on this data? In this talk, we will discuss traditional data management solutions, the challenges of bringing them to an ecosystem like Hadoop, and approaches to enable data management in the growing Big Data world.
2. The problem with data management
• Hadoop is a collection of tools
– Not tightly integrated
– Everyone’s stack looks a little different
– Everything falls back to files
4. What is data management?
• What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
• Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?
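The questions above map naturally onto a per-dataset metadata record. A minimal sketch in Python (all field and dataset names here are illustrative, not taken from any particular tool):

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """Metadata needed to answer the basic data-management questions."""
    name: str                                        # what data sets exist?
    location: str                                    # where are they stored?
    schema: dict                                     # what properties do they have?
    readers: set = field(default_factory=set)        # who can access the data?
    access_log: list = field(default_factory=list)   # who accessed it, and what did they do?
    policies: list = field(default_factory=list)     # what rules apply to it?

# Hypothetical example record for a shared cluster.
record = DatasetRecord(
    name="sales_2013",
    location="hdfs:///warehouse/sales_2013",
    schema={"customer_id": "int", "amount": "decimal"},
    readers={"analyst"},
    policies=["PII: customer_id"],
)
record.access_log.append(("analyst", "read", "2013-06-01"))
```

The point is that every question on this slide is unanswerable unless some system is recording these fields as data moves through the cluster.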
6. Key lessons of traditional systems
• Data requires the right abstraction
– Schemas have value
– Tables are easy to reason about
• Referenced by name, not location
• Narrow interface
– SQL defines the data sources and the processing
• But not where and how the data is kept!
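One way to read "referenced by name, not location": consumers go through a catalog that maps table names to storage, so the narrow interface hides where and how the data is kept. A hypothetical sketch (the catalog entries and `scan` helper are invented for illustration):

```python
# Hypothetical catalog: users reference tables by name; only the
# processing layer ever resolves the physical location.
CATALOG = {
    "sales": "hdfs:///data/2013/06/sales-part-*.avro",
    "users": "hdfs:///data/profiles/users.parquet",
}

def scan(table_name: str) -> str:
    """The only entry point: resolves name -> location internally."""
    location = CATALOG[table_name]   # callers never see this value
    return f"reading {location}"     # stand-in for the real scan

# Storage can be moved or reorganized without touching any query:
CATALOG["sales"] = "hdfs:///data/relocated/sales.parquet"
print(scan("sales"))  # prints "reading hdfs:///data/relocated/sales.parquet"
```

Because every access passes through `scan`, the system has a single choke point at which to check permissions and record provenance, which is exactly what raw file access in Hadoop lacks.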
12. Steps to data management
• Provide access at the right level
• Limit the processing interfaces
• Schemas and provenance provide control
• Enforce policy
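Combining "schemas and provenance provide control" with "enforce policy" suggests a simple rule: a derived dataset inherits the union of its sources' policies, so policy set once on the sources follows the data through every transformation. A minimal illustration (this is a sketch of the idea, not any existing Hadoop API):

```python
def derive(name, sources):
    """Create a derived dataset whose policies cascade from its sources."""
    policies = set()
    for src in sources:
        policies |= src["policies"]
    return {
        "name": name,
        "sources": [s["name"] for s in sources],  # provenance link
        "policies": policies,                     # cascaded policy
    }

# Hypothetical source datasets with policies attached at ingest time.
raw_logs = {"name": "raw_logs", "sources": [], "policies": {"retain-90-days"}}
crm = {"name": "crm", "sources": [], "policies": {"PII"}}

# The report carries both source policies, even though nobody set a
# policy on it directly.
report = derive("weekly_report", [raw_logs, crm])
```

This is why limiting the processing interfaces matters: the cascade only works if every derivation goes through a path that records provenance.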
13. Case study: Natero
• Cloud-based analytics service
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
• Single shared Hadoop ecosystem
– Need customer-level isolation and user-level access controls
• Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn’t have to think about policy
• Source-driven policy management
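In a shared cluster, customer-level isolation plus user-level access control reduces to a two-step check, and the source-driven model attaches that policy to the sources rather than asking each user to think about it. A hypothetical sketch (users, customers, and datasets are invented):

```python
# Hypothetical multi-tenant access check: users belong to a customer,
# datasets belong to a customer, and per-dataset reader lists refine access.
USERS = {"alice": "acme", "bob": "globex"}
DATASETS = {
    "acme_events": {"customer": "acme", "readers": {"alice"}},
    "globex_events": {"customer": "globex", "readers": {"bob"}},
}

def can_read(user: str, dataset: str) -> bool:
    ds = DATASETS[dataset]
    # Customer-level isolation first, then user-level access control.
    return USERS[user] == ds["customer"] and user in ds["readers"]
```

Because the check lives with the dataset, not in each workflow, users never have to reason about policy themselves.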
16. The right level of abstraction
• Our abstraction comes with trade-offs
– More control, compliance
– No more raw Map-Reduce
• Possible to integrate with Pig/Hive
• What’s the right level of abstraction for you?
– Kinds of execution
18. Lessons learned
• If you want control over your data, you also need control over data processing
• File-based access control is not enough
• Metadata is crucial
• Users aren’t motivated by policy
– Policy shouldn’t get in the way of use
– But you might get IT to reason about the sources
Editor's notes
Data is abstracted to tables. SQL is a narrow interface that describes processing. It has taken a while for traditional systems to get here. The structured data world has evolved into the enterprise… molded to fit its needs.
Pig, HiveQL, Mahout, MR: different processing interfaces. Oozie: workflow dependency management and automation. Cloudera Navigator: centralized access control and auditing. Hive Metastore / HCatalog: centralized data access and schema management.
We don’t need to be so restrictive on use, because it scales out… the DW might get over-provisioned. More people are doing different kinds of things… more of a shift towards exploration. People are using Hadoop as an ETL into something else, but what policies should be in place for that data when it goes into the DW?
Make this slide about the fact that there’s one access control mechanism: files. If you have access to a file, then you do. Contrast that with the DW world, where having access to a table still doesn’t grant access to the storage; there’s an abstraction that forces everything to pass through the processing layer. These tools live within an open ecosystem, which prevents them from being complete solutions (at least today). And the shared access control mechanisms mean that control beyond file-read access is difficult or impossible. Workflows make the data very stepped.
Step 1: Integrate processing and storage. This prevents direct access to storage… all access has to go through the data processing tools. Without this you can’t effectively track data as it moves through the system, or enable fine-grained or source-based access control. Example: Oracle + NetApp with a single Oracle user. Step 2: Limit the interfaces. The many entry points to Hadoop increase the IT complexity of data management, and in some cases completely defeat step 1. Reducing the entry points provides a way to control access and properly track behavior. Step 3: Metadata collection and tracking. Understanding what is in your data enables fine-grained access control and prevents the incorrect declassification of data. Provenance is crucial to effective policy enforcement. Examples: table schemas, logging, column-level access control. Step 4: Policy enforcement. We apply policy to the sources and allow those policies to cascade through the provenance to the result data. There’s an enormous body of work on this, but currently nothing in Hadoop enables it. A handful of companies (e.g., Cloudera) are working on some stuff, but it’s still under wraps.
Watch what’s said to make it clear that the devs still have control over who sees what and what runs; it’s more about helping them enforce it. Put the numbers from the steps onto the diagram.