Data Processing in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
Big Data & Data Science Israel Meetup – 21.03.2017
Analytics and Data Pipelines in Practice
About Me
• Partner & Co-Founder at OpenCore
• Before, EMEA Chief Architect at Cloudera
• 5+ years
• Hadoop since 2007
• Apache Committer
• HBase and Whirr
• O’Reilly Author: HBase – The Definitive Guide
• Also in Japanese, Korean & Chinese
• 2nd edition out soon!
• Contact
• lars@opencore.com
• @larsgeorge (the Japanese edition is out, too!)
Agenda
• Hadoop History
• Data Pipelines
• Hadoop Components
• Data Processing
• Summary
Hadoop History
A walk through time…
Tectonic Shifting: Prevalent Data Inertia
The Original Inspirations for Hadoop
2003 2004
A Decade of Hadoop History on One Slide
Ten years ago, “Hadoop” referred to a scalable, fault-tolerant
filesystem (HDFS) and programming framework (MapReduce)
for distributed computing.
Today, it refers to both a kernel containing the aforementioned
pieces, as well as a constantly evolving ecosystem of 25+ data
stores, execution engines, programming and data access
frameworks, and other componentry.
Recognize this guy?
Hadoop’s Original Architecture
MapReduce
(Data Processing and Resource Management)
HDFS
(Filesystem/Storage)
Hadoop's Architecture Today
MapReduce
(Data Processing)
YARN
(Resource Management)
HDFS
(Storage)
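To make the MapReduce programming model mentioned above more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real job would run these phases distributed across a cluster with HDFS as input and output.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for a single word."""
    return key, sum(values)


def word_count(lines):
    """Chain the three phases over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The point of the split is that map_phase and reduce_phase only ever see one record or one key at a time, which is what allows a framework to spread them over many machines.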
Popular by Demand
• More resources are poured into
Hadoop than many other
projects
• Vibrant community with many
commercial entities backing the
development
• The list on the right shows separate
projects, which are combined in
Hadoop distributions
• Total would far exceed anything
else
• Literally no alternatives!
Data Pipelines
From deluge to insight
Data Pipeline Components
• Pipelines need data and CPUs
• Continuous ingest lands new
data in various ways
• Access to data allows for
consumers to build products
• All of this needs to be
• Automated & managed
• Done in a secure manner
• Finally, pipelines need to be
properly onboarded
• Discovery is necessary to find
schemas, data sources, etc.
[Diagram: pipeline layers, labeled Ingest, Storage, Processing, Access, Automation + Data & Resource Management, Authentication/Authorization/Audits, Onboarding & Discovery, and Physical Systems]
Pipelines Increase Value of Data
Now that we know how data pipelines span many layers in both hardware
and software, we can look at what Hadoop has to offer in more detail…
Hadoop Components
Growth and Controversies
Example: Cloudera
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
• Process: Ingest (Sqoop, Flume, NiFi); Transform (MapReduce, Hive, Pig, Spark)
• Discover: Analytic Database (Impala); Search (Solr)
• Model: Machine Learning (SAS, R, Spark, Mahout)
• Serve: NoSQL Database (HBase); Streaming (Spark Streaming)
• Unlimited Storage: HDFS, HBase
• Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
One Platform, Many Workloads
Hadoop: One Platform
• Unlike siloed, monolithic databases, Hadoop is a single, shared
platform with multiple entry points (access engines)
• Scale and resilience are inherently built in
• There are no silos, everything is just a directory with data inside
But…
• How do you know what is where?
• Access needs to be tightly controlled, down to the field level!
Analogy: The Universal Flatbed
• Hadoop is a powerful engine exposed as a platform to carry loads
• Initially the platform is bare and beckons for customization
• You can convert the flatbed to what is needed
But…
• Once converted, how to switch
between workloads?
• How do you share the engine with
different users?
Hadoop Architecture Today
• Components are selected to
match customer demands
• A platform has many
advantages, including paid
QA time
• Some newer components
can be added later on
• Labs etc.
• Many buzzwords that need
to be carefully vetted…
Evolution of the Hadoop Platform
The stack is continually evolving and growing! Components added per year (each year's stack includes all earlier ones):
• 2006: Core Hadoop (HDFS, MapReduce)
• 2007: Solr, Pig
• 2008: HBase, ZooKeeper
• 2009: Hive, Mahout
• 2010: Sqoop, Avro
• 2011: Flume, Bigtop, Oozie, HCatalog, Hue, YARN
• 2012: Spark, Tez, Impala, Kafka, Drill
• 2013: Parquet, Sentry
• 2014: Knox, Flink
• 2015: Kudu, RecordService, Ibis, Falcon
And There Is More
Hadoop - The Movie: “Divergent”
[Figure: timeline from 2006 to 2017 showing Hadoop Core (2006) diverging into the CDH and HDP distributions. Labels on the diagram: CM, Navigator, Sentry, Impala, CDSW, Ranger, Ambari, Zeppelin, Atlas, Knox, Solr, Spark, Kafka, Kudu, YARN]
So, Hadoop is both complicated and divergent? How can we build data
pipelines then, using its components? What else is needed?
Data Processing In Hadoop Today
Coasting through the "Trough of Disillusionment"
Wait! Before we can look at the aspects of building a data pipeline, a bit
more context on where users are coming from and what their needs are: The
Waves of Adoption.
Waves of Adoption #1
• The “AllSpark” (as in the Transformers movie)
• First companies to adopt Hadoop as a way to mirror Google’s approach
• Early Adopters
• Inspired by early success stories, these engineering focused companies extended on
Hadoop
• Followers
• Companies that are OK to try out new things
• Still engineering driven
• Late Bloomers
• First Enterprises
• New Wave
• Everyone else…
[Figure: adoption waves over time, labeled AllSpark, Early Adopters, Followers, Late Bloomers, Enterprises, with a “TODAY!” marker]
Waves of Adoption #2
• Simple logic at bulk (batch processing of petabytes)
• What: Reporting
• With: SQL (Hive), Pig
• Who: Analysts, Developers
• Streaming logic, likely in Lambda architecture
• What: Decision support
• With: OLAP Analytics, Druid, Oryx
• Who: Data architects, DevOps
• Complex analytics
• What: Machine Learning, AI
• With: Notebooks, DS Workbench
• Who: Data Scientists
Batch
Lambda
Kappa?
Hybrids: Lambda FTW?
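As a rough illustration of the Lambda idea named above, the sketch below combines a bulk-computed batch view with a speed view that covers only recent events, and merges both at query time. This is a toy model in plain Python under simplifying assumptions: the event shape (a dict with "key" and "ts") and the function names are hypothetical, and a real system would keep the speed view up to date incrementally in a streaming engine.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Recomputed in bulk: counts for all events up to and including the cutoff."""
    return Counter(e["key"] for e in events if e["ts"] <= cutoff)


def speed_view(events, cutoff):
    """Covers only what the last batch run has not seen yet."""
    return Counter(e["key"] for e in events if e["ts"] > cutoff)


def query(key, batch, speed):
    """Serving layer: merge both views to get complete and fresh counts."""
    return batch[key] + speed[key]
```

A Kappa-style design would drop batch_view entirely and rebuild state by replaying the event log through the streaming path.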
Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Security
• Onboarding & Discovery
• Physical Systems
Storage & Processing
Storage
• Reliable and scalable systems:
HDFS, Kafka, HBase
• What about Kudu, Cassandra, …
MongoDB?
• Data laid out in a structured
manner
• Information Architecture
• Physical storage (e.g. columnar)
Processing
• Generic framework: YARN
• What about Mesos? Non-batch
jobs?
• Resource management hooks
• Pluggable engines
• MapReduce, Spark, …
• MPP Systems?
Information Architecture
• There is a need to define how data
flows through the system and is
organized
• This simplifies the onboarding
process
• Can be simple, or arbitrarily
complex
• Needs to be enforced as it is used
• Living system, may need to adapt
• Define batch and stream
interfaces
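One way to enforce an information architecture "as it is used" is to validate every path before data lands. The sketch below assumes a purely hypothetical layout, /data/<zone>/<source>/<dataset>/dt=YYYY-MM-DD, with three made-up zones. Neither the layout nor the zone names come from the talk; they only show the shape of such a check.

```python
import re

# Hypothetical convention: /data/<zone>/<source>/<dataset>/dt=YYYY-MM-DD
ZONES = ("raw", "staged", "curated")
PATH_RE = re.compile(
    r"^/data/(?P<zone>[a-z]+)/(?P<source>[a-z0-9_]+)/"
    r"(?P<dataset>[a-z0-9_]+)/dt=(?P<date>\d{4}-\d{2}-\d{2})$"
)


def validate_path(path):
    """Return the parsed parts of a conforming path, or raise ValueError."""
    match = PATH_RE.match(path)
    if match is None:
        raise ValueError(f"path does not follow the layout: {path}")
    parts = match.groupdict()
    if parts["zone"] not in ZONES:
        raise ValueError(f"unknown zone: {parts['zone']}")
    return parts
```

An ingest job could call validate_path before writing, so that violations surface at onboarding time rather than when a consumer fails to find the data.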
Example: YARN Services?
• Little progress in years
• Still batch oriented
• Projects shoehorn the service idea into YARN using kludges
• Examples: Slider, Twill
Ingest
• Purpose
• Receive data from heterogeneous sources
• Save as-is, or do first pass processing
• Store data in best format, aggregate small files
• Comply with stack rules (security, IA)
• One of the most active areas
• Vibrant third-party ecosystem
• Streamsets, Tamr, Waterline Data, Trifacta, IBM, …
• Often a generic task, with Hadoop being only one target
• Open-source frameworks
• NiFi
• Flume (with Kafka)?
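The "aggregate small files" step above can be sketched as a simple planning function: group incoming files into bundles that approach a target size before they are written out. This is a greedy sketch, not how any particular ingest tool works. The function name plan_bundles is hypothetical, and the 128 MiB default is an assumption chosen to match a common HDFS block size.

```python
def plan_bundles(files, target_bytes=128 * 1024 * 1024):
    """Greedily group (name, size) pairs into bundles of at most target_bytes.

    Input order is preserved. A file larger than the target ends up in a
    bundle of its own.
    """
    bundles, current, current_size = [], [], 0
    for name, size in files:
        # Close the current bundle if adding this file would overshoot.
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles
```

Each resulting bundle would then be written as one larger file (for example in a container format), which keeps the number of objects the storage layer has to track low.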
Access
• Hadoop has traditionally offered only a few interfaces
• Interactive SQL
• Shell, Notebooks, Hue
• JDBC/ODBC
• File Access
• WebHDFS/HttpFs
• Gateways
• REST, Knox
• Needs to be set up based on the use-case
• Throughput vs Latency
• Must apply security rules
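For the file-access interface listed above, WebHDFS exposes HDFS operations as REST calls under the /webhdfs/v1 prefix, selected with an op query parameter. The helper below only builds such a URL and does not send a request. The host name and port in the usage note are placeholders, and the helper name webhdfs_url is hypothetical. On a secured cluster the request would additionally need authentication, which this sketch ignores.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
```

For example, webhdfs_url("namenode.example.com", 50070, "/data/raw", "LISTSTATUS") yields a URL that lists a directory when fetched with an HTTP GET. Whether this or a gateway such as Knox is the right entry point depends on the throughput and latency needs of the use case.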
Automation & Management
• PoCs and prototyping are not production grade!
• Need to automate the pipelines with monitoring and alerting
• Full development lifecycle needs to be established
• Precious resources need to be managed
• Easier if use-cases all fall into the same category
• Difficult when they span many systems
• One of the remaining topics not addressed at all in Hadoop
• Change management should handle dynamic reconfiguration
Automation
• Directed acyclic graphs (DAGs)
• Define the actions and link them
• Schedules based on various events (time or data)
• Handle errors and maintenance
• Examples
• Apache Oozie [2007, 2010 O/S, 2012 Apache]
• Java
• XML or Hue
• Azkaban (LinkedIn) [2010]
• Java
• Luigi (Spotify) [2012]
• Python
• Apache Airflow (Airbnb) [2015]
• Python
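The core idea shared by the schedulers listed above (define actions, link them into a DAG, handle errors) can be sketched in a few lines using the Python standard library (graphlib, available from Python 3.9). This is not the API of Oozie, Azkaban, Luigi, or Airflow. The function run_pipeline and the task names are hypothetical, and real schedulers add triggers, retries, and persistence on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, actions):
    """Run actions in dependency order and stop at the first failure.

    dependencies maps each task name to the set of tasks it depends on.
    actions maps each task name to a zero-argument callable.
    """
    completed = []
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for task in TopologicalSorter(dependencies).static_order():
        try:
            actions[task]()
        except Exception as exc:
            raise RuntimeError(f"task {task!r} failed after {completed}") from exc
        completed.append(task)
    return completed
```

A pipeline with tasks "ingest", "transform", and "report", where each depends on the previous one, would run in exactly that order, and a failure in "transform" would prevent "report" from running.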
Example: Notebooks
• Data scientists like prototyping
• But how to bring the results into
production?
• One attempt is to boost notebooks
with a framework that can handle
their chaining and execution
• Shared resources used
• Depends on notebook backends
Source: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html
Security
• Many moving parts
• Kerberos
• RPC Level
• ACLs
• RBAC
• UIs
• Data
• Encryption (at-rest and in-transit)
• Hard to configure properly
• Management software helps to a degree
Onboarding Use-Cases
• Ask the necessary questions ahead of time
• Use the answers to set (initially) strict limits
• Use HDFS quotas, YARN queues, etc.
• Initialize the system with the defaults
• Communicate to other teams what the expected impact might be
• During onboarding explain the shared nature of Hadoop
• Avoid “long faces” due to changes (change management)
• Define costs and chargeback models
• Automate into self-service if possible
• Push updated configuration and notifications
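Turning onboarding answers into initial limits lends itself to automation. The sketch below generates the HDFS commands for a new project directory with a name quota and a space quota. It only returns the command strings rather than running them. The /projects/<name> layout and the function name are assumptions made for illustration, and YARN queue setup is left out because it depends on the scheduler in use.

```python
import shlex


def onboarding_commands(project, space_quota, file_quota):
    """Return HDFS commands that apply initial limits for a project.

    space_quota uses the dfsadmin size syntax, for example "10t".
    file_quota is the maximum number of names (files and directories).
    """
    path = shlex.quote(f"/projects/{project}")  # hypothetical directory layout
    return [
        f"hdfs dfs -mkdir -p {path}",
        f"hdfs dfsadmin -setQuota {int(file_quota)} {path}",
        f"hdfs dfsadmin -setSpaceQuota {shlex.quote(str(space_quota))} {path}",
    ]
```

Wrapped in a self-service form, such a generator gives every team the same strict defaults, which can then be relaxed once real usage is known.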
Stack Architecture
• Combine the reliable components into a
whole stack
• Organize interfaces to outside systems
by users and purpose
• Separate components for ease of
maintenance
• Layer network to fit data flow
• Tight security control at vital points
Network Architecture
Wrap Up
Data pipeline deconstruction
“Oh… and I thought I just add Hadoop to our technology
landscape… you know, like a database or an appliance.”
– Misled Decision Maker
Hype Curve
[Figure: hype cycle plotting visibility over time, with the phases Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, and Plateau of Productivity]
Technology Waves
• Hadoop is just one part of the hype curve
• Technologies that follow may (heavily, or even solely) depend on it
• “Shaky foundations”?
• But… most (if not all) technologies are initially oversold and overhyped
• What happens in practice?
Hype Curve – The Hadoop Version
Visibility
Time
“Big Data is
Strategic for us!”
First PoC
“Where are the results?”
“Darn, Hadoop
is difficult!”
“Security? Multitenancy?
Development? Lifecycle?
Environments?”
“Maybe Hadoop
is not for us?”
Allocate more
Resources & Budget
First use-case in production
Hadoop Team Productivity
Meanwhile…
Summary
• Data Pipelines span many levels of architectures
• Hardware, Networking, Information, Security, Data Management
• Core Hadoop itself provides little in that regard
• Vendors offer some support (closed or open source)
• Use cases are often unknown
• Guess as well as possible, generalize
• Careful planning is vital, mistakes are costly
• Mixed workloads are a nightmare for resource management
• Keep things simple (KISS principle)
• Knowledge needs to be built upfront
• Hire someone in the know!
Thank You!
@larsgeorge

 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 

Dernier (20)

WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

  • 1. Data Processing in Hadoop Lars George – Partner and Co-Founder @ OpenCore Big Data & Data Science Israel Meetup – 21.03.2017 Analytics and Data Pipelines in Practice
  • 2. About Me • Partner & Co-Founder at OpenCore • Before, EMEA Chief Architect at Cloudera • 5+ years • Hadoop since 2007 • Apache Committer • HBase and Whirr • O’Reilly Author: HBase – The Definitive Guide • Also in Japanese, Korean & Chinese • 2nd edition out soon! • Contact • lars@opencore.com • @larsgeorge 日本語版も出ました!
  • 3. Agenda • Hadoop History • Data Pipelines • Hadoop Components • Data Processing • Summary
  • 4. Hadoop History A walk through time…
  • 6. The Original Inspirations for Hadoop 2003 2004
  • 7. A Decade of Hadoop History on One Slide Ten years ago, “Hadoop” referred to a scalable, fault-tolerant filesystem (HDFS) and programming framework (MapReduce) for distributed computing. Today, it refers to both a kernel containing the aforementioned pieces, as well as a constantly evolving ecosystem of 25+ data stores, execution engines, programming and data access frameworks, and other componentry. Recognize this guy?
  • 8. Hadoop’s Original Architecture MapReduce (Data Processing and Resource Management) HDFS (Filesystem/Storage)
  • 9. Hadoop's Architecture Today MapReduce (Data Processing) YARN (Resource Management) HDFS (Storage)
  • 10. Popular by Demand • More resources are poured into Hadoop than many other projects • Vibrant community with many commercial entities backing the development • List on the right lists separate projects, which are combined in Hadoop distributions • Total would far exceed anything else • Literally no alternatives!
  • 12. Data Pipeline Components • Pipelines need data and CPUs • Continuous ingest lands new data in various ways • Access to data allows for consumers to build products • All of this needs to be • Automated & managed • Done in a secure manner • Finally, pipelines need to be properly onboarded • Discovery is necessary to find schemas, data sources, etc. (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 14. Now that we know how data pipelines span many layers in both hardware and software, we can look at what Hadoop has to offer in more detail…
  • 16. Example: Cloudera Batch, Interactive, and Real-Time. Leading performance and usability in one platform. • End-to-end analytic workflows • Access more data • Work with data in new ways • Enable new users • One Platform, Many Workloads • Process: Ingest (Sqoop, Flume, NiFi), Transform (MapReduce, Hive, Pig, Spark) • Discover: Analytic Database (Impala), Search (Solr) • Model: Machine Learning (SAS, R, Spark, Mahout) • Serve: NoSQL Database (HBase), Streaming (Spark Streaming) • Unlimited Storage: HDFS, HBase • Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
  • 17. Hadoop: One Platform • Unlike siloed, monolithic databases, Hadoop is a single, shared platform with multiple entry points (access engines) • Scale and resilience are inherently built in • There are no silos, everything is just a directory with data inside But… • How do you know what is where? • Access needs to be tightly controlled, down to the field level!
  • 18. Analogy: The Universal Flatbed • Hadoop is a powerful engine exposed as a platform to carry loads • Initially the platform is bare and beckons for customization • You can convert the flatbed to what is needed But… • Once converted, how to switch between workloads? • How do you share the engine with different users?
  • 19. Hadoop Architecture Today • Components are selected to match customer demands • A platform has many advantages, including paid QA time • Some newer components can be added later on • Labs etc. • Many buzzwords that need to be carefully vetted…
  • 20. Evolution of the Hadoop Platform • The stack is continually evolving and growing! Each year's stack includes everything from the years before; additions by year: • 2006: Core Hadoop (HDFS, MapReduce) • 2007: Solr, Pig • 2008: HBase, ZooKeeper • 2009: Hive, Mahout • 2010: Sqoop, Avro • 2011: Flume, Bigtop, Oozie, HCatalog, Hue, YARN • 2012: Spark, Tez, Impala, Kafka, Drill • 2013: Parquet, Sentry • 2014: Knox, Flink • 2015: Kudu, RecordService, Ibis, Falcon
  • 21. And There Is More
  • 22. Hadoop - The Movie: “Divergent” (Timeline diagram labels: Hadoop Core 2006 | HDP | CDH | 2008 | 2011 | CM | Navigator | 2013 | Sentry | 2014 | Ranger | Ambari | Impala | 2016 | CDSW | 2015 | 2017 | Zeppelin | Atlas | Knox | Solr | Spark | Kafka | Kudu | YARN)
  • 23. So, Hadoop is both complicated and divergent? How can we build data pipelines then, using its components? What else is needed?
  • 24. Data Processing In Hadoop Today Coasting through the "Trough of Disillusionment"
  • 25. Wait! Before we can look at the aspects of building a data pipeline, a bit more context on where users are coming from and what their needs are: The Waves of Adoption.
  • 26. Waves of Adoption #1 • The “AllSpark” (as in the Transformers movie) • First companies to adopt Hadoop as a way to mirror Google’s approach • Early Adopters • Inspired by early success stories, these engineering focused companies extended on Hadoop • Followers • Companies that are OK to try out new things • Still engineering driven • Late Bloomers • First Enterprises • New Wave • Everyone else… AllSpark Early Adopters Followers Late Bloomers Enterprises TODAY!
  • 27. Waves of Adoption #2 • Simple logic at bulk (batch processing of petabytes) • What: Reporting • With: SQL (Hive), Pig • Who: Analysts, Developers • Streaming logic, likely in Lambda architecture • What: Decision support • With: OLAP Analytics, Druid, Oryx • Who: Data architects, DevOps • Complex analytics • What: Machine Learning, AI • With: Notebooks, DS Workbench • Who: Data Scientists (Diagram labels: Batch | Lambda | Kappa?)
  • 29. Stage: • Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 30. Storage & Processing Storage • Reliable and scalable systems: HDFS, Kafka, HBase • What about Kudu, Cassandra, … MongoDB? • Data laid out in a structured manner • Information Architecture • Physical storage (e.g. columnar) Processing • Generic framework: YARN • What about Mesos? Non-batch jobs? • Resource management hooks • Pluggable engines • MapReduce, Spark, … • MPP Systems?
  • 31. Information Architecture • There is a need to define how data flows through the system and is organized • This simplifies the onboarding process • Can be simple, or arbitrarily complex • Needs to be enforced as it is used • Living system, may need to adapt • Define batch and stream interfaces
  • 32. Example: YARN Services? • Little progress in years • Still batch oriented • Projects shoehorn service idea into YARN using kludges • Example: Slider, Twill
  • 33. Stage: • Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 34. Ingest • Purpose • Receive data from heterogeneous sources • Save as-is, or do first pass processing • Store data in best format, aggregate small files • Comply with stack rules (security, IA) • One of the most active areas • Vibrant third-party ecosystem • Streamsets, Tamr, Waterline Data, Trifacta, IBM, … • Often a generic task, with Hadoop being only one target • Open-source frameworks • NiFi • Flume (with Kafka)?
  • 35. Stage: • Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 36. Access • Hadoop has traditionally offered only a few interfaces • Interactive SQL • Shell, Notebooks, Hue • JDBC/ODBC • File Access • WebHDFS/HttpFs • Gateways • REST, Knox • Needs to be set up based on the use-case • Throughput vs Latency • Must apply security rules
  • 37. Stage: • Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 38. Automation & Management • PoCs and prototyping are not production grade! • Need to automate the pipelines with monitoring and alerting • Full development lifecycle needs to be established • Precious resources need to be managed • Easier if use-cases all fall into the same category • Difficult when they span many systems • One of the remaining topics not addressed at all in Hadoop • Change management should handle dynamic reconfiguration
  • 39. Automation • Directed acyclic graphs (DAGs) • Define the actions and link them • Schedules based on various events (time or data) • Handle errors and maintenance • Examples • Apache Oozie [2007, 2010 O/S, 2012 Apache] • Java • XML or Hue • Azkaban (LinkedIn) [2010] • Java • Luigi (Spotify) [2012] • Python • Apache Airflow (Airbnb) [2015] • Python
  • 40. Example: Notebooks • Data scientists like prototyping • But how to bring the results into production? • One attempt is to boost notebooks with a framework that can handle their chaining and execution • Shared resources used • Depends on notebook backends Source: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html
  • 41. Stage: • Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 42. Security • Many moving parts • Kerberos • RPC Level • ACLs • RBAC • UIs • Data • Encryption (at-rest and in-transit) • Hard to configure properly • Management software helps to a degree
  • 43. Stage: • Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 44. Onboarding Use-Cases • Ask the necessary questions ahead of time • Use the answer to set (initially) strict limits • Use HDFS quotas, YARN queues, etc. • Initialize the system with the defaults • Communicate to other teams what the expected impact might be • During onboarding explain the shared nature of Hadoop • Avoid “long faces” due to changes (change management) • Define costs and chargeback models • Automate into self-service if possible • Push updated configuration and notifications
  • 45. Stage: • Storage & Processing • Ingest • Access • Automation & Management • Onboarding & Discovery • Physical Systems (Diagram labels: Storage | Processing | Ingest | Automation + Data & Resource Management | Authentication, Authorization, Audits | Access | Onboarding & Discovery | Physical Systems)
  • 46. Stack Architecture • Combine the reliable components into a whole stack • Organize interfaces to outside systems by users and purpose • Separate components for ease of maintenance • Layer network to fit data flow • Tight security control at vital points
  • 48. Wrap Up Data pipeline deconstruction
  • 49. “Oh… and I thought I just add Hadoop to our technology landscape… you know, like a database or an appliance.” – Misled Decision Maker
  • 50. Hype Curve (axes: Visibility over Time) • Technology Trigger • Peak of Inflated Expectations • Trough of Disillusionment • Slope of Enlightenment • Plateau of Productivity
  • 51. Technology Waves • Hadoop is just one part of the hype curve • Technologies that follow may (heavily, or even solely) depend on it • “Shaky foundations”? • But… most (if not all) technologies are initially oversold and overhyped • What happens in practice?
  • 52. Hype Curve – The Hadoop Version Visibility Time “Big Data is Strategic for us!” First PoC “Where are the results?” “Darn, Hadoop is difficult!” “Security? Multitenancy? Development? Lifecycle? Environments?” “Maybe Hadoop is not for us?” Allocate more Resources & Budget First use-case in production Hadoop Team Productivity Meanwhile…
  • 53. Summary • Data Pipelines span many levels of architectures • Hardware, Networking, Information, Security, Data Management • Core Hadoop itself provides only a little in that regard • Vendors offer some support (closed or open source) • Use cases are often unknown • Guess as well as possible, generalize • Careful planning is vital, mistakes are costly • Mixed workloads are a nightmare for resource management • Keep things simple (KISS principle) • Knowledge needs to be built upfront • Hire someone in the know!
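
Slide 44 (Onboarding Use-Cases) suggests turning the answers gathered during onboarding into strict initial limits using HDFS quotas and YARN queues. The following is a minimal sketch of that idea, not a vendor tool: the `UseCase` fields, the `/data/<name>` directory layout, and the 20% headroom are assumptions made for illustration. It only builds the `hdfs dfsadmin -setSpaceQuota` command and the Capacity Scheduler property name as strings; it does not talk to a cluster.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Answers collected during onboarding (hypothetical fields)."""
    name: str
    expected_data_gb: int    # logical data size, before replication
    cluster_share_pct: int   # requested share of YARN capacity


def initial_limits(uc, replication=3, headroom_pct=20):
    """Derive strict starting limits for a newly onboarded use case."""
    # HDFS space quotas count every replica of a block, so the logical
    # size is scaled by the replication factor plus some headroom.
    quota_gb = uc.expected_data_gb * replication * (100 + headroom_pct) // 100
    return {
        # /data/<name> is an assumed directory convention.
        "hdfs_quota_cmd": f"hdfs dfsadmin -setSpaceQuota {quota_gb}g /data/{uc.name}",
        # Capacity Scheduler queue share; the queue must also be listed
        # in yarn.scheduler.capacity.root.queues.
        "yarn_property": f"yarn.scheduler.capacity.root.{uc.name}.capacity",
        "yarn_value": str(uc.cluster_share_pct),
    }


limits = initial_limits(UseCase("clickstream", expected_data_gb=100, cluster_share_pct=10))
for key, value in limits.items():
    print(f"{key}: {value}")
```

Generating the limits from a small record like this is one way to move towards the self-service onboarding the slide mentions, since the same function can run for every new team with reviewed defaults.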

Editor's notes

  1. As a very basic explanation, Hadoop was originally an open source implementation of internal systems built by Google in the early ‘00s to deal with the extraordinarily resource-intensive problem of indexing the Internet every night. Those systems were first described in these papers, and Cutting and Cafarella, who faced similar problems with Nutch, took notice of them quickly. (Later, Google also published its “Bigtable” paper, which led other developers to create HBase.) As Cutting puts it, periodically, “Google sends us messages from the future.”
  2. In the beginning, the word “Hadoop” referred to just two components. Fast forward a decade, and that word now refers to that “kernel” (aka Core Hadoop) as well as to a growing ecosystem of related projects. In that sense, Hadoop now has much in common with Linux, which is also both a kernel and an ecosystem.
  3. Cutting & Cafarella’s initial implementation of these systems consisted of just 2 components: MapReduce and HDFS.
  4. What’s really significant about this architecture is how it unifies diverse access to common data. In traditional approaches, you’d have separate systems to collect, store, process, explore, model, and serve data. Different teams would use different systems for each workload, and users whose roles span multiple systems would have to use several of them to achieve their objectives. With Cloudera’s enterprise data hub, you can perform end-to-end data workflows in a single system, dramatically lowering time to value. Each workload can access unlimited data, thanks to the underlying data platform, enhancing the value of each workload. Power users can now access their data in new ways: SQL, search, machine learning, programming, etc. At the same time, new users are enabled by these diverse workloads to interact with data. Cloudera Enterprise provides comprehensive support for batch, interactive, and real-time workloads. Batch: data integration with Apache Sqoop; data processing with MapReduce, Apache Hive, and Apache Pig; memory-centric processing with Apache Spark. Interactive: analytic SQL with Impala; search with Apache Solr; machine learning with Apache Spark. Real-time: data integration with Apache Kafka and Apache Flume; stream processing with Apache Spark; data serving with Apache HBase. Shared resource management ensures that each workload is handled appropriately and abides by IT policy. What’s more, third-party tools such as SAS or Informatica can run as native workloads inside Cloudera’s enterprise data hub.
  5. With the expansion of that ecosystem, “Hadoop” has grown much, much bigger than its original “core.”
  6. The rapid expansion of the Hadoop ecosystem is further evidence of its meteoric adoption.
  7. Azkaban: https://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010 Luigi: https://developer.spotify.com/news-stories/2012/09/24/hello-world/
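
The Automation slide (39) describes workflows as directed acyclic graphs: define the actions, link them, and handle errors. The schedulers it names (Oozie, Azkaban, Luigi, Airflow) each have their own APIs, which are not shown here. As a rough illustration of the underlying idea only, the sketch below uses Python's standard-library `graphlib` to order actions by their dependencies and to skip anything downstream of a failure. The function `run_pipeline` and the action names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(actions, depends_on):
    """Run actions in dependency order; skip anything downstream of a failure.

    actions:    dict mapping action name -> zero-argument callable
    depends_on: dict mapping action name -> set of upstream action names
    """
    unknown = {up for ups in depends_on.values() for up in ups} - set(actions)
    if unknown:
        raise ValueError(f"dependencies without an action: {sorted(unknown)}")

    graph = {name: depends_on.get(name, ()) for name in actions}
    # static_order() yields upstream actions before the actions that need
    # them and raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(graph).static_order())

    status = {}
    for name in order:
        if any(status[up] != "ok" for up in graph[name]):
            status[name] = "skipped"
            continue
        try:
            actions[name]()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return order, status


log = []


def failing_clean():
    raise RuntimeError("bad input")


actions = {
    "ingest": lambda: log.append("ingest"),
    "clean": failing_clean,
    "report": lambda: log.append("report"),
}
depends_on = {"clean": {"ingest"}, "report": {"clean"}}

order, status = run_pipeline(actions, depends_on)
print(order)
print(status)
```

A production scheduler adds what this sketch leaves out: triggers based on time or data arrival, retries, and monitoring and alerting.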