This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspects and then correlating them to the skill sets of current Hadoop adopters.
1. Data Processing in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
Big Data & Data Science Israel Meetup – 21.03.2017
Analytics and Data Pipelines in Practice
2. About Me
• Partner & Co-Founder at OpenCore
• Before, EMEA Chief Architect at Cloudera
• 5+ years
• Hadoop since 2007
• Apache Committer
• HBase and Whirr
• O’Reilly Author: HBase – The Definitive Guide
• Also in Japanese, Korean & Chinese
• 2nd edition out soon!
• Contact
• lars@opencore.com
• @larsgeorge
日本語版も出ました! (The Japanese edition is out, too!)
7. A Decade of Hadoop History on One Slide
Ten years ago, “Hadoop” referred to a scalable, fault-tolerant
filesystem (HDFS) and programming framework (MapReduce)
for distributed computing.
Today, it refers to both a kernel containing the aforementioned
pieces, as well as a constantly evolving ecosystem of 25+ data
stores, execution engines, programming and data access
frameworks, and other componentry.
Recognize this guy?
10. Popular by Demand
• More resources are poured into
Hadoop than many other
projects
• Vibrant community with many
commercial entities backing the
development
• The list on the right shows the separate projects that are combined into Hadoop distributions
• The total would far exceed anything else
• Literally no alternatives!
12. Data Pipeline Components
• Pipelines need data and CPUs
• Continuous ingest lands new
data in various ways
• Access to data allows for
consumers to build products
• All of this needs to be
• Automated & managed
• Done in a secure manner
• Finally, pipelines need to be
properly onboarded
• Discovery is necessary to find
schemas, data sources, etc.
[Diagram: Onboarding & Discovery over the pipeline Ingest → Storage → Processing → Access, resting on Automation + Data & Resource Management, Authentication/Authorization/Audits, and the Physical Systems]
16. Example: Cloudera
Batch, Interactive, and Real-Time. Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
[Platform diagram — Security and Administration spanning all of:]
• Process
• Ingest: Sqoop, Flume, NiFi
• Transform: MapReduce, Hive, Pig, Spark
• Discover
• Analytic Database: Impala
• Search: Solr
• Model
• Machine Learning: SAS, R, Spark, Mahout
• Serve
• NoSQL Database: HBase
• Streaming: Spark Streaming
• Unlimited Storage: HDFS, HBase
• YARN, Cloudera Manager, Cloudera Navigator
One Platform, Many Workloads
17. Hadoop: One Platform
• Unlike silo’ed, monolithic databases, Hadoop is a single, shared platform with multiple entry points (access engines)
• Scale and resilience are inherently built in
• There are no silos; everything is just a directory with data inside
But…
• How do you know what is where?
• Access needs to be tightly controlled, down to the field level!
18. Analogy: The Universal Flatbed
• Hadoop is a powerful engine exposed as a platform to carry loads
• Initially the platform is bare and beckons for customization
• You can convert the flatbed to what is needed
But…
• Once converted, how do you switch between workloads?
• How do you share the engine with different users?
19. Hadoop Architecture Today
• Components are selected to
match customer demands
• A platform has many
advantages, including paid
QA time
• Some newer components
can be added later on
• Labs etc.
• Many buzzwords that need
to be carefully vetted…
22. Hadoop - The Movie: “Divergent”
[Timeline diagram, 2006–2017: the distributions diverge from Hadoop Core (2006) — CDH and HDP; CM and Navigator vs. Ambari; Sentry vs. Ranger; plus Impala, Solr, Spark, Kafka, Kudu, YARN, Knox, Atlas, Zeppelin, and CDSW]
23. So, Hadoop is both complicated and divergent? How can we build data
pipelines then, using its components? What else is needed?
24. Data Processing In Hadoop Today
Coasting through the "Trough of Disillusionment"
25. Wait! Before we can look at the aspects of building a data pipeline, a bit
more context on where users are coming from and what their needs are: The
Waves of Adoption.
26. Waves of Adoption #1
• The “AllSpark” (as in the Transformers movie)
• First companies to adopt Hadoop as a way to mirror Google’s approach
• Early Adopters
• Inspired by early success stories, these engineering focused companies extended on
Hadoop
• Followers
• Companies that are OK to try out new things
• Still engineering driven
• Late Bloomers
• First Enterprises
• New Wave
• Everyone else…
[Timeline: AllSpark → Early Adopters → Followers → Late Bloomers → Enterprises (TODAY!)]
27. Waves of Adoption #2
• Batch: simple logic at bulk (batch processing of petabytes)
• What: Reporting
• With: SQL (Hive), Pig
• Who: Analysts, Developers
• Lambda: streaming logic, likely in a Lambda architecture
• What: Decision support
• With: OLAP Analytics, Druid, Oryx
• Who: Data architects, DevOps
• Kappa(?): complex analytics
• What: Machine Learning, AI
• With: Notebooks, DS Workbench, …
• Who: Data Scientists
30. Storage & Processing
Storage
• Reliable and scalable systems:
HDFS, Kafka, HBase
• What about Kudu, Cassandra, …
MongoDB?
• Data laid out in a structured
manner
• Information Architecture
• Physical storage (e.g. columnar)
Processing
• Generic framework: YARN
• What about Mesos? Non-batch
jobs?
• Resource management hooks
• Pluggable engines
• MapReduce, Spark, …
• MPP Systems?
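The “physical storage (e.g. columnar)” point can be illustrated with a toy sketch in plain Python — real systems use formats like Parquet or ORC; this only shows the idea, with made-up sample data:

```python
# Toy illustration of columnar layout: one contiguous list per field,
# instead of one record per row. Parquet/ORC apply this idea at scale.

rows = [
    {"user": "alice", "clicks": 3, "country": "IL"},
    {"user": "bob",   "clicks": 7, "country": "DE"},
    {"user": "carol", "clicks": 1, "country": "IL"},
]

# Pivot the row-oriented records into columns.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query touching a single field now scans only that field's values,
# which is what makes analytical scans on columnar storage cheap.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 11
```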
31. Information Architecture
• There is a need to define how data
flows through the system and is
organized
• This simplifies the onboarding
process
• Can be simple, or arbitrarily
complex
• Needs to be enforced as it is used
• A living system; it may need to adapt
• Define batch and stream interfaces
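As one hedged illustration (the talk prescribes no particular scheme), a common convention encodes zones and partitions directly in the directory layout; the zone and source names below are assumptions, not a standard:

```python
# Sketch of a directory-naming convention for an information architecture.
# Zone names like raw/cleansed/curated are a common pattern, not a standard.

def landing_path(zone, source, dataset, year, month, day):
    """Build the HDFS path where a daily batch of a dataset lands."""
    return (f"/data/{zone}/{source}/{dataset}"
            f"/year={year:04d}/month={month:02d}/day={day:02d}")

print(landing_path("raw", "crm", "contacts", 2017, 3, 21))
# /data/raw/crm/contacts/year=2017/month=03/day=21
```

Having every pipeline derive its paths from one function like this is what makes onboarding and enforcement simple: the convention lives in code, not in tribal knowledge.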
32. Example: YARN Services?
• Little progress in years
• Still batch oriented
• Projects shoehorn the service idea into YARN using kludges
• Examples: Slider, Twill
34. Ingest
• Purpose
• Receive data from heterogeneous sources
• Save as-is, or do first pass processing
• Store data in best format, aggregate small files
• Comply to stack rules (security, IA)
• One of the most active areas
• Vibrant third-party ecosystem
• Streamsets, Tamr, Waterline Data, Trifacta, IBM, …
• Often a generic task, with Hadoop being only one target
• Open-source frameworks
• NiFi
• Flume (with Kafka)?
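The “aggregate small files” first-pass step can be sketched as a toy in plain Python (a real pipeline would write a container format such as SequenceFile, Avro, or Parquet instead of concatenating text):

```python
# Toy first-pass ingest step: merge many small landed files into one
# larger file, since HDFS handles a few large files far better than
# many small ones (NameNode memory, task overhead).

import os
import tempfile

def aggregate_small_files(src_dir, dst_file):
    """Concatenate every file in src_dir into dst_file, in name order."""
    with open(dst_file, "w") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name)) as f:
                out.write(f.read())

# Demo with three tiny "landed" files.
src = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src, f"part-{i}.txt"), "w") as f:
        f.write(f"record-{i}\n")

dst = os.path.join(tempfile.mkdtemp(), "merged.txt")
aggregate_small_files(src, dst)
print(open(dst).read())
```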
36. Access
• Hadoop traditionally has only a few interfaces
• Interactive SQL
• Shell, Notebooks, Hue
• JDBC/ODBC
• File Access
• WebHDFS/HttpFs
• Gateways
• REST, Knox
• Needs to be set up based on the use-case
• Throughput vs Latency
• Must apply security rules
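For the file-access path, the WebHDFS REST interface is addressed via URLs of the form `/webhdfs/v1/<path>?op=…`; a minimal sketch of the URL construction (host name and user are placeholders, and actually issuing the request needs a cluster):

```python
# Sketch: build a WebHDFS REST URL. Operations like LISTSTATUS, OPEN,
# and GETFILESTATUS are part of the documented WebHDFS API; the host,
# port, and user below are illustrative placeholders.

def webhdfs_url(host, port, path, op, user=None):
    """Build a WebHDFS URL for the given HDFS path and operation."""
    url = f"http://{host}:{port}/webhdfs/v1{path}?op={op}"
    if user:
        url += f"&user.name={user}"  # simple auth; Kerberos setups differ
    return url

print(webhdfs_url("namenode.example.com", 50070, "/data/raw", "LISTSTATUS", "etl"))
# http://namenode.example.com:50070/webhdfs/v1/data/raw?op=LISTSTATUS&user.name=etl
```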
38. Automation & Management
• PoCs and prototyping are not production grade!
• Need to automate the pipelines with monitoring and alerting
• Full development lifecycle needs to be established
• Precious resources need to be managed
• Easier if use-cases all fall into the same category
• Difficult when they span many systems
• One of the remaining topics not addressed at all in Hadoop
• Change management should handle dynamic reconfiguration
39. Automation
• Directed acyclic graphs (DAGs)
• Define the actions and link them
• Schedules based on various events (time or data)
• Handle errors and maintenance
• Examples
• Apache Oozie [2007, 2010 O/S, 2012 Apache]
• Java
• XML or Hue
• Azkaban (LinkedIn) [2010]
• Java
• Luigi (Spotify) [2012]
• Python
• Apache Airflow (Airbnb) [2015]
• Python
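The tools above differ in their APIs, but the underlying DAG mechanics are the same; a toy runner in plain Python for illustration only (no tool-specific API, and no cycle detection, retries, or event triggers, which real schedulers provide):

```python
# Toy pipeline DAG runner: declare named actions and their upstream
# dependencies, then execute them in a valid topological order.

def run_dag(actions, deps):
    """actions: {name: callable}; deps: {name: [upstream names]}."""
    done, order = set(), []
    def visit(name):
        if name in done:
            return
        for up in deps.get(name, []):
            visit(up)            # run all upstream actions first
        actions[name]()
        done.add(name)
        order.append(name)
    for name in actions:
        visit(name)
    return order

log = []
dag = {
    "ingest":    lambda: log.append("ingest"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
}
order = run_dag(dag, {"transform": ["ingest"], "load": ["transform"]})
print(order)  # ['ingest', 'transform', 'load']
```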
40. Example: Notebooks
• Data scientists like prototyping
• But how to bring the results into
production?
• One attempt is to boost notebooks
with a framework that can handle
their chaining and execution
• Shared resources used
• Depends on notebook backends
Source: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html
42. Security
• Many moving parts
• Kerberos
• RPC Level
• ACLs
• RBAC
• UIs
• Data
• Encryption (at-rest and in-transit)
• Hard to configure properly
• Management software helps
to a degree
44. Onboarding Use-Cases
• Ask the necessary questions ahead of time
• Use the answers to set (initially) strict limits
• Use HDFS quotas, YARN queues, etc.
• Initialize the system with the defaults
• Communicate to other teams what the expected impact might be
• During onboarding explain the shared nature of Hadoop
• Avoid “long faces” due to changes (change management)
• Define costs and chargeback models
• Automate into self-service if possible
• Push updated configuration and notifications
46. Stack Architecture
• Combine the reliable components into a
whole stack
• Organize interfaces to outside systems
by users and purpose
• Separate components for ease of
maintenance
• Layer the network to fit the data flow
• Tight security control at vital points
51. Technology Waves
• Hadoop is just one part of the hype curve
• Technologies that follow may (heavily, or even solely) depend on it
• “Shaky foundations”?
• But… most (if not all) technologies are initially oversold and overhyped
• What happens in practice?
52. Hype Curve – The Hadoop Version
[Hype-curve diagram (Visibility vs. Time): “Big Data is strategic for us!” → first PoC → allocate more resources & budget → “Where are the results?” → “Darn, Hadoop is difficult!” → “Security? Multitenancy? Development? Lifecycle? Environments?” → “Maybe Hadoop is not for us?” → first use-case in production → Hadoop team productivity. Meanwhile…]
53. Summary
• Data Pipelines span many levels of architectures
• Hardware, Networking, Information, Security, Data Management
• Core Hadoop itself provides little in that regard
• Vendors offer some support (closed or open source)
• Use-cases are often unknown
• Guess as well as possible, generalize
• Careful planning is vital, mistakes are costly
• Mixed workloads are a nightmare for resource management
• Keep things simple (KISS principle)
• Knowledge needs to be built upfront
• Hire someone in the know!
As a very basic explanation, Hadoop was originally an open source implementation of internal systems built by Google in the early ‘00s to deal with the extraordinarily resource-intensive problem of indexing the Internet every night. Those systems were first described in Google’s published papers, and Cutting and Cafarella, who faced similar problems with Nutch, took notice of them quickly. (Later, Google also published its “Bigtable” paper, which led other developers to create HBase.) As Cutting puts it, periodically, “Google sends us messages from the future.”
In the beginning, the word “Hadoop” referred to just two components. Fast forward a decade, and that word now refers to that “kernel” (aka Core Hadoop) as well as to a growing ecosystem of related projects. In that sense, Hadoop now has much in common with Linux, which is also both a kernel and an ecosystem.
Cutting & Cafarella’s initial implementation of these systems consisted of just 2 components: MapReduce and HDFS.
What’s really significant about this architecture is how it unifies diverse access to common data.
In traditional approaches, you’d have separate systems to collect, store, process, explore, model, and serve data. Different teams would use different systems for each workload, and users whose roles span multiple systems would have to use several of them to achieve their objectives.
With Cloudera’s enterprise data hub:
You can perform end-to-end data workflows in a single system, dramatically lowering time to value.
Each workload can access unlimited data, thanks to the underlying data platform, enhancing the value of each workload.
Power users can now access their data in new ways: SQL, search, machine learning, programming, etc.
At the same time, new users are enabled by these diverse workloads to interact with data.
Cloudera Enterprise provides comprehensive support for batch, interactive, and real-time workloads:
Batch
Data integration with Apache Sqoop
Data processing with MapReduce, Apache Hive, Apache Pig
Memory-centric processing with Apache Spark
Interactive
Analytic SQL with Impala
Search with Apache Solr
Machine Learning with Apache Spark
Real-Time
Data integration with Apache Kafka, Apache Flume
Stream processing with Apache Spark
Data serving with Apache HBase
Shared resource management ensures that each workload is handled appropriately and abides by IT policy.
What’s more, 3rd party tools, such as SAS or Informatica can run as native workloads inside Cloudera’s enterprise data hub.
With the expansion of that ecosystem, “Hadoop” has grown much, much bigger than its original “core.”
The rapid expansion of the Hadoop ecosystem is further evidence of its meteoric adoption.