This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspects and then correlating them to the skill sets of current Hadoop adopters.
1. Data Processing in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
Big Data & Data Science Israel Meetup – 21.03.2017
Analytics and Data Pipelines in Practice
2. About Me
• Partner & Co-Founder at OpenCore
• Before, EMEA Chief Architect at Cloudera
• 5+ years
• Hadoop since 2007
• Apache Committer
• HBase and Whirr
• O’Reilly Author: HBase – The Definitive Guide
• Also in Japanese, Korean & Chinese
• 2nd edition out soon!
• Contact
• lars@opencore.com
• @larsgeorge
日本語版も出ました! (The Japanese edition is out, too!)
7. A Decade of Hadoop History on One Slide
Ten years ago, “Hadoop” referred to a scalable, fault-tolerant
filesystem (HDFS) and programming framework (MapReduce)
for distributed computing.
Today, it refers to both a kernel containing the aforementioned
pieces, as well as a constantly evolving ecosystem of 25+ data
stores, execution engines, programming and data access
frameworks, and other componentry.
Recognize this guy?
10. Popular by Demand
• More resources are poured into
Hadoop than many other
projects
• Vibrant community with many
commercial entities backing the
development
• The list on the right shows the separate projects that are combined into Hadoop distributions
• The total would far exceed anything else
• Literally no alternatives!
12. Data Pipeline Components
• Pipelines need data and CPUs
• Continuous ingest lands new
data in various ways
• Access to data allows for
consumers to build products
• All of this needs to be
• Automated & managed
• Done in a secure manner
• Finally, pipelines need to be
properly onboarded
• Discovery is necessary to find
schemas, data sources, etc.
[Diagram: Onboarding & Discovery over the pipeline Ingest → Storage → Processing → Access, resting on Automation + Data & Resource Management, Authentication/Authorization/Audits, and the Physical Systems]
16. Example: Cloudera
Batch, Interactive, and Real-Time. Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
[Platform diagram — Security and Administration spanning all of:]
• Process
• Ingest: Sqoop, Flume, NiFi
• Transform: MapReduce, Hive, Pig, Spark
• Discover
• Analytic Database: Impala
• Search: Solr
• Model
• Machine Learning: SAS, R, Spark, Mahout
• Serve
• NoSQL Database: HBase
• Streaming: Spark Streaming
• Unlimited Storage: HDFS, HBase
• YARN, Cloudera Manager, Cloudera Navigator
One Platform, Many Workloads
17. Hadoop: One Platform
• Unlike silo’ed, monolithic databases, Hadoop is a single, shared platform with multiple entry points (access engines)
• Scale and resilience are inherently built in
• There are no silos; everything is just a directory with data inside
But…
• How do you know what is where?
• Access needs to be tightly controlled, down to the field level!
18. Analogy: The Universal Flatbed
• Hadoop is a powerful engine exposed as a platform to carry loads
• Initially the platform is bare and beckons for customization
• You can convert the flatbed to what is needed
But…
• Once converted, how do you switch between workloads?
• How do you share the engine with different users?
19. Hadoop Architecture Today
• Components are selected to
match customer demands
• A platform has many
advantages, including paid
QA time
• Some newer components
can be added later on
• Labs etc.
• Many buzzwords that need
to be carefully vetted…
22. Hadoop - The Movie: “Divergent”
[Timeline diagram, 2006–2017: the distributions diverge from Hadoop Core (2006) — CDH and HDP; CM and Navigator vs. Ambari; Sentry vs. Ranger; plus Impala, Solr, Spark, Kafka, Kudu, YARN, Knox, Atlas, Zeppelin, and CDSW]
23. So, Hadoop is both complicated and divergent? How can we build data
pipelines then, using its components? What else is needed?
24. Data Processing In Hadoop Today
Coasting through the "Trough of Disillusionment"
25. Wait! Before we can look at the aspects of building a data pipeline, a bit
more context on where users are coming from and what their needs are: The
Waves of Adoption.
26. Waves of Adoption #1
• The “AllSpark” (as in the Transformers movie)
• First companies to adopt Hadoop as a way to mirror Google’s approach
• Early Adopters
• Inspired by early success stories, these engineering focused companies extended on
Hadoop
• Followers
• Companies that are OK to try out new things
• Still engineering driven
• Late Bloomers
• First Enterprises
• New Wave
• Everyone else…
[Timeline: AllSpark → Early Adopters → Followers → Late Bloomers → Enterprises (TODAY!)]
27. Waves of Adoption #2
• Batch: simple logic at bulk (batch processing of petabytes)
• What: Reporting
• With: SQL (Hive), Pig
• Who: Analysts, Developers
• Lambda: streaming logic, likely in a Lambda architecture
• What: Decision support
• With: OLAP Analytics, Druid, Oryx
• Who: Data architects, DevOps
• Kappa(?): complex analytics
• What: Machine Learning, AI
• With: Notebooks, DS Workbench, …
• Who: Data Scientists
30. Storage & Processing
Storage
• Reliable and scalable systems:
HDFS, Kafka, HBase
• What about Kudu, Cassandra, …
MongoDB?
• Data laid out in a structured
manner
• Information Architecture
• Physical storage (e.g. columnar)
Processing
• Generic framework: YARN
• What about Mesos? Non-batch
jobs?
• Resource management hooks
• Pluggable engines
• MapReduce, Spark, …
• MPP Systems?
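The “physical storage (e.g. columnar)” point can be illustrated with a toy sketch in plain Python — real systems use formats like Parquet or ORC; this only shows the idea, with made-up sample data:

```python
# Toy illustration of columnar layout: one contiguous list per field,
# instead of one record per row. Parquet/ORC apply this idea at scale.

rows = [
    {"user": "alice", "clicks": 3, "country": "IL"},
    {"user": "bob",   "clicks": 7, "country": "DE"},
    {"user": "carol", "clicks": 1, "country": "IL"},
]

# Pivot the row-oriented records into columns.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query touching a single field now scans only that field's values,
# which is what makes analytical scans on columnar storage cheap.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 11
```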
31. Information Architecture
• There is a need to define how data
flows through the system and is
organized
• This simplifies the onboarding
process
• Can be simple, or arbitrarily
complex
• Needs to be enforced as it is used
• A living system; it may need to adapt
• Define batch and stream interfaces
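As one hedged illustration (the talk prescribes no particular scheme), a common convention encodes zones and partitions directly in the directory layout; the zone and source names below are assumptions, not a standard:

```python
# Sketch of a directory-naming convention for an information architecture.
# Zone names like raw/cleansed/curated are a common pattern, not a standard.

def landing_path(zone, source, dataset, year, month, day):
    """Build the HDFS path where a daily batch of a dataset lands."""
    return (f"/data/{zone}/{source}/{dataset}"
            f"/year={year:04d}/month={month:02d}/day={day:02d}")

print(landing_path("raw", "crm", "contacts", 2017, 3, 21))
# /data/raw/crm/contacts/year=2017/month=03/day=21
```

Having every pipeline derive its paths from one function like this is what makes onboarding and enforcement simple: the convention lives in code, not in tribal knowledge.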
32. Example: YARN Services?
• Little progress in years
• Still batch oriented
• Projects shoehorn the service idea into YARN using kludges
• Examples: Slider, Twill
34. Ingest
• Purpose
• Receive data from heterogeneous sources
• Save as-is, or do first pass processing
• Store data in best format, aggregate small files
• Comply to stack rules (security, IA)
• One of the most active areas
• Vibrant third-party ecosystem
• Streamsets, Tamr, Waterline Data, Trifacta, IBM, …
• Often a generic task, with Hadoop being only one target
• Open-source frameworks
• NiFi
• Flume (with Kafka)?
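The “aggregate small files” first-pass step can be sketched as a toy in plain Python (a real pipeline would write a container format such as SequenceFile, Avro, or Parquet instead of concatenating text):

```python
# Toy first-pass ingest step: merge many small landed files into one
# larger file, since HDFS handles a few large files far better than
# many small ones (NameNode memory, task overhead).

import os
import tempfile

def aggregate_small_files(src_dir, dst_file):
    """Concatenate every file in src_dir into dst_file, in name order."""
    with open(dst_file, "w") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name)) as f:
                out.write(f.read())

# Demo with three tiny "landed" files.
src = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src, f"part-{i}.txt"), "w") as f:
        f.write(f"record-{i}\n")

dst = os.path.join(tempfile.mkdtemp(), "merged.txt")
aggregate_small_files(src, dst)
print(open(dst).read())
```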
36. Access
• Hadoop traditionally has only a few interfaces
• Interactive SQL
• Shell, Notebooks, Hue
• JDBC/ODBC
• File Access
• WebHDFS/HttpFs
• Gateways
• REST, Knox
• Needs to be set up based on the use-case
• Throughput vs Latency
• Must apply security rules
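For the file-access path, the WebHDFS REST interface is addressed via URLs of the form `/webhdfs/v1/<path>?op=…`; a minimal sketch of the URL construction (host name and user are placeholders, and actually issuing the request needs a cluster):

```python
# Sketch: build a WebHDFS REST URL. Operations like LISTSTATUS, OPEN,
# and GETFILESTATUS are part of the documented WebHDFS API; the host,
# port, and user below are illustrative placeholders.

def webhdfs_url(host, port, path, op, user=None):
    """Build a WebHDFS URL for the given HDFS path and operation."""
    url = f"http://{host}:{port}/webhdfs/v1{path}?op={op}"
    if user:
        url += f"&user.name={user}"  # simple auth; Kerberos setups differ
    return url

print(webhdfs_url("namenode.example.com", 50070, "/data/raw", "LISTSTATUS", "etl"))
# http://namenode.example.com:50070/webhdfs/v1/data/raw?op=LISTSTATUS&user.name=etl
```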
38. Automation & Management
• PoCs and prototyping are not production grade!
• Need to automate the pipelines with monitoring and alerting
• Full development lifecycle needs to be established
• Precious resources need to be managed
• Easier if use-cases all fall into the same category
• Difficult when they span many systems
• One of the remaining topics not addressed at all in Hadoop
• Change management should handle dynamic reconfiguration
39. Automation
• Directed acyclic graphs (DAGs)
• Define the actions and link them
• Schedules based on various events (time or data)
• Handle errors and maintenance
• Examples
• Apache Oozie [2007, 2010 O/S, 2012 Apache]
• Java
• XML or Hue
• Azkaban (LinkedIn) [2010]
• Java
• Luigi (Spotify) [2012]
• Python
• Apache Airflow (Airbnb) [2015]
• Python
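The tools above differ in their APIs, but the underlying DAG mechanics are the same; a toy runner in plain Python for illustration only (no tool-specific API, and no cycle detection, retries, or event triggers, which real schedulers provide):

```python
# Toy pipeline DAG runner: declare named actions and their upstream
# dependencies, then execute them in a valid topological order.

def run_dag(actions, deps):
    """actions: {name: callable}; deps: {name: [upstream names]}."""
    done, order = set(), []
    def visit(name):
        if name in done:
            return
        for up in deps.get(name, []):
            visit(up)            # run all upstream actions first
        actions[name]()
        done.add(name)
        order.append(name)
    for name in actions:
        visit(name)
    return order

log = []
dag = {
    "ingest":    lambda: log.append("ingest"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
}
order = run_dag(dag, {"transform": ["ingest"], "load": ["transform"]})
print(order)  # ['ingest', 'transform', 'load']
```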
40. Example: Notebooks
• Data scientists like prototyping
• But how to bring the results into
production?
• One attempt is to boost notebooks
with a framework that can handle
their chaining and execution
• Shared resources used
• Depends on notebook backends
Source: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html
42. Security
• Many moving parts
• Kerberos
• RPC Level
• ACLs
• RBAC
• UIs
• Data
• Encryption (at-rest and in-transit)
• Hard to configure properly
• Management software helps
to a degree
44. Onboarding Use-Cases
• Ask the necessary questions ahead of time
• Use the answers to set (initially) strict limits
• Use HDFS quotas, YARN queues, etc.
• Initialize the system with the defaults
• Communicate to other teams what the expected impact might be
• During onboarding explain the shared nature of Hadoop
• Avoid “long faces” due to changes (change management)
• Define costs and chargeback models
• Automate into self-service if possible
• Push updated configuration and notifications
46. Stack Architecture
• Combine the reliable components into a
whole stack
• Organize interfaces to outside systems
by users and purpose
• Separate components for ease of
maintenance
• Layer the network to fit the data flow
• Tight security control at vital points
51. Technology Waves
• Hadoop is just one part of the hype curve
• Technologies that follow may (heavily, or even solely) depend on it
• “Shaky foundations”?
• But… most (if not all) technologies are initially oversold and overhyped
• What happens in practice?
52. Hype Curve – The Hadoop Version
[Hype-curve diagram (Visibility vs. Time): “Big Data is strategic for us!” → first PoC → allocate more resources & budget → “Where are the results?” → “Darn, Hadoop is difficult!” → “Security? Multitenancy? Development? Lifecycle? Environments?” → “Maybe Hadoop is not for us?” → first use-case in production → Hadoop team productivity. Meanwhile…]
53. Summary
• Data Pipelines span many levels of architectures
• Hardware, Networking, Information, Security, Data Management
• Core Hadoop itself provides little in that regard
• Vendors offer some support (closed or open source)
• Use-cases are often unknown
• Guess as well as possible, generalize
• Careful planning is vital, mistakes are costly
• Mixed workloads are a nightmare for resource management
• Keep things simple (KISS principle)
• Knowledge needs to be built upfront
• Hire someone in the know!
As a very basic explanation, Hadoop was originally an open source implementation of internal systems built by Google in the early ‘00s to deal with the extraordinarily resource-intensive problem of indexing the Internet every night. Those systems were first described in Google’s published papers, and Cutting and Cafarella, who faced similar problems with Nutch, took notice of them quickly. (Later, Google also published its “Bigtable” paper, which led other developers to create HBase.) As Cutting puts it, periodically, “Google sends us messages from the future.”
In the beginning, the word “Hadoop” referred to just two components. Fast forward a decade, and that word now refers to that “kernel” (aka Core Hadoop) as well as to a growing ecosystem of related projects. In that sense, Hadoop now has much in common with Linux, which is also both a kernel and an ecosystem.
Cutting & Cafarella’s initial implementation of these systems consisted of just 2 components: MapReduce and HDFS.
What’s really significant about this architecture is how it unifies diverse access to common data.
In traditional approaches, you’d have separate systems to collect, store, process, explore, model, and serve data. Different teams would use different systems for each workload, and users whose roles span multiple systems would have to use several of them to achieve their objectives.
With Cloudera’s enterprise data hub:
You can perform end-to-end data workflows in a single system, dramatically lowering time to value.
Each workload can access unlimited data, thanks to the underlying data platform, enhancing the value of each workload.
Power users can now access their data in new ways: SQL, search, machine learning, programming, etc.
At the same time, new users are enabled by these diverse workloads to interact with data.
Cloudera Enterprise provides comprehensive support for batch, interactive, and real-time workloads:
Batch
Data integration with Apache Sqoop
Data processing with MapReduce, Apache Hive, Apache Pig
Memory-centric processing with Apache Spark
Interactive
Analytic SQL with Impala
Search with Apache Solr
Machine Learning with Apache Spark
Real-Time
Data integration with Apache Kafka, Apache Flume
Stream processing with Apache Spark
Data serving with Apache HBase
Shared resource management ensures that each workload is handled appropriately and abides by IT policy.
What’s more, 3rd party tools, such as SAS or Informatica can run as native workloads inside Cloudera’s enterprise data hub.
With the expansion of that ecosystem, “Hadoop” has grown much, much bigger than its original “core.”
The rapid expansion of the Hadoop ecosystem is further evidence of its meteoric adoption.