Keynote slides from Big Data Spain Nov 2016. Has some thoughts on how Hadoop ecosystem is growing and changing to support the enterprise, including Hive, Spark, NiFi, security and governance, streaming, and the cloud.
Speaker: Alan Gates, Hortonworks Co-Founder
Title: 10 Years of Apache Hadoop and Beyond
Duration: 40 minutes
Abstract:
In 2006, Apache Hadoop had its first line of code committed to what has become a breakthrough technology. A decade later, we are witness to open source innovation that has literally changed the face of business. Hadoop and related technologies have become the enterprise data platform, fueled by a rich ecosystem capable of supporting any application, any data, anywhere. Join Hortonworks Co-Founder Alan Gates as he drills down into the current and future state of Hadoop and reviews community initiatives aimed at enabling the next wave of modern data applications that are well governed and easy to deploy on-premises and in the cloud.
Our Hadoop journey began in 2006 focused on executing batch MapReduce jobs on petabytes of data.
Yahoo’s decision to contribute Hadoop to the Apache Software Foundation was critical because a vibrant set of related technologies began to appear around Hadoop.
[NEXT]
Fast forward to 2011 and the concept of YARN began to emerge.
Its goal? Enable Hadoop to move from its batch-only roots and become a data platform capable of running batch, interactive, and real-time applications.
The emergence of YARN further accelerated the innovation around Hadoop with the emergence of Spark, Kafka, Storm, and many other projects that started life as Apache Incubator proposals.
[NEXT]
I want to focus for a minute on one area of how Hadoop has developed. Apache Hive has participated in that move from batch to interactive, from ETL-only to enterprise-ready EDW.
Swiss Army knife of big data: can do streaming, SQL, ETL, ML.
Available from multiple languages (Python, Java, Scala).
So the enterprise has invested in integrating Hadoop into its data lake architecture.
Landing petabytes of data from streams, pipelines, and data feeds into HDFS files, Hive and HBase tables, etc.
The question arises of how we can set up policies for these data sets that enable us to secure and govern access to them.
[NEXT]
The community has been hard at work on integrating Apache Atlas as a metadata catalog and Apache Ranger as the centralized security system to address this need.
The result is a tag-based authorization model driven by the metadata catalog (i.e. Atlas) with access and audit policies applied to those tags (via Ranger).
This enables a more flexible way to govern access to data and data sets than traditional role/group based access policies.
Ex. as data pipelines land data, they can tag it as it lands, and the access policies set up for those tags immediately apply.
Moreover, Ranger has added the notions of time-based and location-based access policies, so users can do things like limit access to data that’s older than 90 days (for example) or limit access to data from certain geographies.
This provides important enterprise-focused capabilities that will help businesses deploy more modern data applications in a way where they have the confidence their data is secure and well-governed.
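To make the tag-based model concrete, here is a hypothetical sketch of building such a policy. The field names are illustrative of Ranger's policy model, not the exact Ranger REST schema, and the tag, group, and access names are made up:

```python
# Hypothetical sketch of a tag-based access policy in the spirit of the
# Apache Ranger tag service. Field names are illustrative, not the exact
# Ranger REST payload -- consult the Ranger docs for the real schema.

def make_tag_policy(tag, groups, accesses):
    """Build a policy dict granting `accesses` on any asset tagged `tag`."""
    return {
        "serviceType": "tag",
        "name": f"allow-{tag.lower()}",
        "resources": {"tag": {"values": [tag]}},
        "policyItems": [{
            "groups": groups,
            "accesses": [{"type": a, "isAllowed": True} for a in accesses],
        }],
    }

# As a pipeline lands data and Atlas tags it PII, a policy like this
# applies immediately to everything carrying the tag:
policy = make_tag_policy("PII", groups=["compliance-team"], accesses=["select"])
```

The key point of the design shows up here: the policy references the tag, not individual tables or files, so newly landed data is covered the moment it is tagged.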
[NEXT]
TALK TRACK
People are no longer willing to wait until data is in the store before processing it
Hortonworks DataFlow is a platform for data in motion.
It is powered by Apache NiFi, Kafka, and Storm for dataflow management and stream processing.
MiNiFi/NiFi: create dynamic, configurable data pipelines.
Kafka supports adaptation to differing rates of data creation and delivery.
Storm provides real-time stream processing to create immediate insights at massive scale.
There are scenarios where NiFi will provide all that you need – especially in situations that only require dataflow management – but you will notice the orange and blue horizontal triangles provide a continuum of capability from edge to core, indicating varying degrees of need for the different products.
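A real Storm topology needs a running cluster, so as a stand-in, here is a toy tumbling-window count in plain Python that shows the kind of computation a stream-processing bolt performs over data in motion (the event data is made up):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs):
    """Count events per key within fixed, non-overlapping time windows.
    `events` is an iterable of (timestamp_secs, key) pairs."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        windows[int(ts // window_secs)][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}

# Made-up clickstream events: window 0 covers [0, 5), window 1 covers [5, 10).
events = [(0.5, "click"), (1.2, "click"), (4.9, "view"), (5.1, "click")]
counts = tumbling_window_counts(events, window_secs=5)
```

In a real deployment, NiFi would route the events, Kafka would buffer them against rate mismatches, and Storm would run this aggregation continuously rather than over a finite list.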
So after 10 years, the Hadoop ecosystem is available everywhere.
In the Data Center, within appliances, across public and private clouds.
This maximizes choice for people interested in getting started with Hadoop and deploying it at scale for transformational use cases.
[NEXT]
While there are a range of great choices in the market today, there’s more that we, as a community, can and should do to make Hadoop in the cloud better and first class.
I’ll spend the remainder of this talk on the key architectural considerations:
Shared Data & Storage – the shared data lake is on cloud storage; it is not HDFS.
Also, memory and local storage play a different role – that of caching.
An important distinction in the cloud is On-Demand Ephemeral Workloads – this changes a number of things in fundamental ways.
Shared Metadata, Security, and Governance remain important but need to be adjusted in the face of ephemeral clusters.
And finally, I’ll touch on Elastic Resource Management
We need to shift our thinking away from cluster resource management and more towards SLA-driven workloads
[NEXT]
In the cloud, the shared DataLake is on cloud storage. It is not the HDFS of a specific Hadoop cluster.
Note this is very different from a traditional on-premises cluster, where each cluster has an internal shared store representing its internal DataLake.
Moreover, it’s desirable to have this shared data be accessible by all apps, not just Hadoop apps – Cloud Native and 3rd party
Good news: geo-distribution is built into the cloud storage, and DR becomes simpler.
Cloud storage has two limitations:
eventual consistency, and an API that does not match the filesystem API expected by Hadoop and normal apps.
Addressing these two issues is a key area of ongoing investment. I encourage you to attend today’s breakout session by my fellow Hortonworkers that’s focused on this topic.
Cloud storage is designed for low cost and scale – unfortunately, performance is not its strong point because it is separated from compute.
Memory and local storage play a different role in the cloud – that of a cache to enhance performance.
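As one concrete point, Hadoop's s3a connector is the usual bridge between the filesystem API and cloud object storage. A minimal sketch of the relevant configuration keys, with placeholder credentials and an illustrative tuning value (the bucket name is also made up):

```python
# Hadoop configuration keys for the s3a cloud-storage connector.
# The fs.s3a.* property names come from Hadoop's s3a connector; the
# credential values are placeholders and the tuning number is illustrative.
s3a_conf = {
    "fs.s3a.access.key": "<ACCESS_KEY>",
    "fs.s3a.secret.key": "<SECRET_KEY>",
    "fs.s3a.connection.maximum": "100",  # allow many concurrent readers
}

# Jobs then address the shared data lake by URI rather than a cluster's HDFS:
input_path = "s3a://shared-data-lake/events/2016/11/"
```

Because the data lake is addressed by URI, any ephemeral cluster (or non-Hadoop app) can point at the same store, which is exactly the shared-data property described above.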
[NEXT]
With respect to caching, we need to consider both tabular and non-tabular data.
For tabular data, LLAP comes to the rescue – it provides a tabular cache that's shared across jobs, apps, and engines such as Hive and Spark.
LLAP only caches the needed columns, so it's very efficient in its use of memory.
Further, data is stored in an internal serialized form to optimize compute.
The design center is anti-caching – keep it all in memory and spill to disk/SSD when memory is full.
LLAP currently provides read caching, but is being extended to support a write-through cache.
And LLAP addresses a key security gap for the Hadoop ecosystem: it provides a convenient place to enforce column-level and row-level access control
that works across all kinds of engines: Hive, Spark, Flink, or even old-fashioned MapReduce. Note this was not previously possible.
From a non-tabular data perspective, HDFS can be used to cache cloud data – both a read cache and a write-through cache.
This essentially evolves HDFS to play a different role: a place to store intermediate data
and also to be a finely-tuned caching layer between the applications and the cloud storage.
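The anti-caching idea can be sketched with a toy in-memory cache that spills its oldest entries to disk when a memory budget is exceeded. This is only an illustration of the design center; LLAP's actual implementation is far more sophisticated (columnar, shared across engines):

```python
import os
import pickle
import tempfile

class SpillingCache:
    """Toy illustration of anti-caching: entries live in memory first,
    and the oldest entries spill to disk once the in-memory budget is
    exceeded. Not LLAP's actual design -- just the core idea."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = {}                  # key -> value (hot, in-memory tier)
        self.spill_dir = tempfile.mkdtemp()
        self.on_disk = {}                 # key -> path of spilled value

    def put(self, key, value):
        self.memory[key] = value
        while len(self.memory) > self.max_in_memory:
            # Spill the oldest entry (dicts preserve insertion order).
            old_key = next(iter(self.memory))
            path = os.path.join(self.spill_dir, f"{old_key}.bin")
            with open(path, "wb") as f:
                pickle.dump(self.memory.pop(old_key), f)
            self.on_disk[old_key] = path

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        with open(self.on_disk[key], "rb") as f:
            return pickle.load(f)

cache = SpillingCache(max_in_memory=2)
for i in range(3):
    cache.put(f"col{i}", list(range(i, i + 3)))
# col0 has spilled to disk; col1 and col2 remain in memory, yet all
# three are still readable through the same interface.
```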
[NEXT]
Always-on multitenant clusters are important for a range of mission critical use cases.
However, bringing forth an ephemeral cluster to support a specific workload is game changing.
The agile nature of the cloud allows us to create prescriptive workload environments.
For someone interested in modeling and analyzing data sets,
- they simply want to interact with a PRE-TUNED environment optimized for the application.
- The complexities of configuring Spark, Hive and Hadoop need to be hidden under the hood.
Whether it’s data science, data warehouse, ETL, or other common workload types,
- provide pre-configured and pre-tuned compute environments
- Further, we need to be able to manage them in an ephemeral fashion.
The NET: deliver user experiences that are focused on business agility,
- rather than infinite configurability and cluster management.
[NEXT]
So far, I’ve covered shared data and storage, and how to optimize performance by caching.
Shared data fundamentally requires a shared approach to metadata, security and governance.
The metadata is not just the classic Hive metadata that describes the tabular data;
- it’s also about storing and tracking the lineage and provenance of data,
- and about details related to data pipeline processing and job management.
Tabular data needs to be available to all applications so that SQL is an option regardless of where your data is
Also, as data is ingested and processed, metadata needs to be created and adjusted
Governing and securing the data remain critical, and the metadata needs to be managed across all workloads.
- The work done by projects such as Ranger and Atlas needs to be evolved to fit the cloud environment.
If we don’t do this, the cloud will not be adopted aggressively for enterprise use.
Getting back to the shared metadata – each ephemeral cluster cannot have its own private copy of the metadata.
In the cloud world, metadata must be centrally stored so it is used across all ephemeral clusters.
[NEXT]
Final area: resource management.
We need to up-level resource management.
- So far, YARN has focused on optimizing resources in the context of a cluster.
- The cloud is not about the cluster; it is about the workloads. And further, resources are elastic.
The scheduler needs to change its focus to managing resources in the context of a workload and meeting the workload’s SLA
- It may need to get extra resources from the cloud – the right resources to match the needs of the workload.
Sometimes adding compute power is not sufficient to meet an SLA, because latency/bandwidth to data may be the bottleneck
– e.g. spin up more LLAP memory in order to improve caching and hence meet the SLA.
Cloud offers another dimension – that of cost and budgets.
There are different costs tied to CPU, memory and data access bandwidth, so elasticity and Spot pricing tradeoffs should be factored in
Resource management in this new dimension is important if you want to reap the benefits of low-cost cloud computing.
To summarize: the better one understands the nature of a workload, the more we are able to take advantage of elasticity and spot pricing.
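The elasticity/spot tradeoff can be sketched as a toy provisioning model. The prices, rates, and the linear-scaling assumption are all made up; the point is that knowing the workload (its size, rate, and deadline) lets a scheduler buy exactly the capacity the SLA needs, at the cheapest available tier:

```python
import math

# Toy model of SLA-driven, cost-aware provisioning. Prices, rates, and
# the linear-scaling assumption are invented for illustration.

def choose_nodes(work_units, deadline_secs, node_rate,
                 spot_price, on_demand_price, spot_available):
    """Pick the node count that meets the SLA, priced at the cheapest tier.
    Assumes throughput scales linearly: each node does `node_rate` units/sec,
    and prices are per node-hour."""
    nodes = math.ceil(work_units / (node_rate * deadline_secs))
    hourly = spot_price if spot_available else on_demand_price
    cost = nodes * hourly * deadline_secs / 3600
    return nodes, cost

# 10,000 units due in 600 s at 2 units/s per node -> 9 nodes either way,
# but spot capacity makes meeting the same SLA four times cheaper.
nodes, cost = choose_nodes(10_000, 600, node_rate=2.0,
                           spot_price=0.10, on_demand_price=0.40,
                           spot_available=True)
```

A cluster-centric scheduler can't make this decision, because the answer depends on the workload's SLA and the cloud's current prices, not on a fixed pool of machines.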
CONCLUDE: While one could lift and shift Hadoop onto the cloud, I hope I have convinced you that we really need to evolve Hadoop to run first class in the cloud
and also to take advantage of the unique cloud features such as elasticity.
We at Hortonworks have been working on this over the past few months, and Ram will show you a quick demo of the tech preview we are releasing this week.
[NEXT]
Today we have talked about evolving Hadoop to run well in the cloud.
At Hortonworks, we are focused on enabling a connected data architecture that seamlessly spans the cloud and data center.
This is illustrated on the screen.
It stresses two important points: the connectedness of the cloud and the on-premises infrastructure and data,
and the connectedness of data in motion and data at rest.
The Era of the Internet-of-Things demands that we manage the entire lifecycle of all data
- (data in motion and data at rest)
It’s about being able to collect and curate data across traditional silos so the various groups and lines of business can have a place where they can assemble a single view of data in order to drive deep historical insights.
It’s also about proactively managing data from its point of inception and securely acquiring and delivering it. Moreover, it’s not just about point-to-point delivery, but it’s also about enabling bi-directional data flows that can leverage both real-time and historical insights to help shape and prioritize the flow of data.
So in this diagram, for example, the upper-left edge could represent the connected car, whereas the lower-left edge can represent data from the manufacturing line. Having a connected data architecture that enables you to deal with all of this data unlocks the ability to figure out what manufacturing line issues may be causing operational issues in cars in the field, for example.
In this world of next-generation applications, I am excited about evolving the Hadoop ecosystem to enable these types of use cases and usage models.
[NEXT SLIDE]