Page 1 © Hortonworks Inc. 2014
Hortonworks Technical Workshop:
In-memory Processing with Apache Spark
Dhruv Kumar and Ajay Singh
Hortonworks. We do Hadoop.
March 12, 2015
Page 2 © Hortonworks Inc. 2014
About the presenters
Ajay Singh
Director Technical Channels.
Hortonworks Inc.
Dhruv Kumar
Partner Solutions Engineer.
Hortonworks Inc.
Page 3 © Hortonworks Inc. 2014
In this workshop
•  Introduction to HDP and Spark
•  Installing Spark on HDP
•  Spark Programming
•  Core Spark: working with RDDs
•  Spark SQL: structured data access
•  Spark MLlib: predictive analytics
•  Spark Streaming: real-time data processing
•  Spark Application Demo: Twitter Language Classifier using MLlib and Streaming
•  Tuning and Debugging Spark
•  Conclusion and Further Reading, Q/A
Page 4 © Hortonworks Inc. 2014
Introduction to HDP and Spark
Page 5 © Hortonworks Inc. 2014
HDP delivers a comprehensive data management platform
HDP 2.2 – Hortonworks Data Platform
[Architecture diagram: YARN, the data operating system, sits on HDFS (Hadoop Distributed File System) and hosts the data access engines – Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark) and other ISV engines – alongside data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS), provisioning, management & monitoring (Ambari, Zookeeper), scheduling (Oozie), and security (authentication, authorization, accounting, data protection) spanning storage, resources, access, pipeline and cluster; deployable on Linux or Windows, on-premise or in the cloud.]
YARN is the architectural center of HDP
•  Enables batch, interactive and real-time workloads
•  Single SQL engine for both batch and interactive
•  Enables best-of-breed ISV tools to deeply integrate into Hadoop via YARN
Provides comprehensive enterprise capabilities
•  Governance
•  Security
•  Operations
The widest range of deployment options
•  Linux & Windows
•  On premise & cloud
Page 6 © Hortonworks Inc. 2014
Let’s drill into one workload … Spark
HDP 2.1 – Hortonworks Data Platform
[Same architecture diagram as the previous slide, with Spark highlighted as the in-memory data access engine running on YARN alongside Pig, Hive/Tez, HBase, Accumulo, Storm, Solr and ISV engines.]
As on the previous slide: YARN is the architectural center of HDP, which provides comprehensive enterprise capabilities (governance, security, operations) and the widest range of deployment options (Linux & Windows, on-premise & cloud).
Page 7 © Hortonworks Inc. 2014
What is Spark?
•  Spark is
–  an open-source software solution that performs rapid calculations
on in-memory datasets
-  Open Source [Apache hosted & licensed]
•  Free to download and use in production
•  Developed by a community of developers
-  In-memory datasets
•  RDD (Resilient Distributed Dataset) is the basis for what Spark enables
•  Resilient – the models can be recreated on the fly from known state
•  Distributed – the dataset is often partitioned across multiple nodes for
increased scalability and parallelism
Page 8 © Hortonworks Inc. 2014
Spark Components
Spark allows you to do data processing, ETL, machine learning,
stream processing and SQL querying from one framework
Page 9 © Hortonworks Inc. 2014
Why Spark?
•  One tool for data engineering and data science tasks
•  Native integration with Hive, HDFS and any Hadoop FileSystem
implementation
•  Faster development: concise API, Scala (~3x less code than Java)
•  Faster execution: for iterative jobs because of in-memory caching (not
all workloads are faster in Spark)
•  Promotes code reuse: APIs and data types are similar for batch and
streaming
Page 10 © Hortonworks Inc. 2014
Hortonworks Commitment to Spark
Hortonworks is focused on making
Apache Spark enterprise-ready so
you can depend on it for mission-
critical applications
[Diagram: Spark shown as the in-memory data access engine on YARN, alongside Pig, Hive/Tez, HBase, Accumulo, Storm, Solr and other ISV engines, with security, governance & integration, and operations spanning the stack.]
1.  YARN enables Spark to co-exist with other engines
Spark is “YARN Ready” so its memory- and CPU-intensive apps can work with predictable performance alongside other engines, all on the same set(s) of data.
2.  Extend Spark with enterprise capabilities
Ensure Spark can be managed, secured and governed via a single set of frameworks to ensure consistency. Ensure reliability and quality of service of Spark alongside other engines.
3.  Actively collaborate within the open community
As with everything we do at Hortonworks, we work entirely within the open community across Spark and all related projects to improve this key Hadoop technology.
Page 11 © Hortonworks Inc. 2014
Spark on HDP: history and what’s coming
Spark Tech Previews we’ve done so far
Spark 0.9.1
Spark 1.0.1
Spark 1.2.0
Spark GA: Q1/Q2 of 2015
Install Spark 1.2.1 with HDP 2.2.2 & Ambari
2.0.0
Install with Ambari/ Manual package
for DEB/RPM/MSI
Spark on Kerberos enabled cluster
Spark on YARN
Spark job history Server
Full ORC support
Built with Hive 0.14 dependency
Spark Thrift Server for JDBC/ODBC
Spark embedded Beeline
Spark Simba ODBC Driver (coming)
Page 12 © Hortonworks Inc. 2014
Reference Deployment Architecture
[Reference architecture diagram – Hortonworks Data Platform]
•  Data sources: batch sources, streaming sources, reference data, and ad hoc / on-demand sources
•  Data processing, storage & analytics: a data pipeline (Hive/Pig/Spark), stream processing (Storm/Spark Streaming), a long-term data warehouse (Hive + ORC), and data science models (Spark ML, Spark SQL, ISV)
•  Data access: data discovery, operational reporting, business intelligence, and advanced analytics
Page 13 © Hortonworks Inc. 2014
Installing Spark on HDP
Page 14 © Hortonworks Inc. 2014
Installing Spark on HDP
•  GA of Spark 1.2.1 in Q1 2015
–  Fully supported by Hortonworks
–  Install with Ambari 2.0.0 & HDP 2.2.2. Other combinations are unsupported.
•  Can try tech preview now
Page 15 © Hortonworks Inc. 2014
Spark Deployment Modes
•  Spark Standalone Cluster
–  For developing Spark apps against a local Spark cluster (similar to developing/deploying in an IDE)
•  Spark on YARN (the mode set up with Ambari)
–  Spark driver (SparkContext) runs in the YARN AM (yarn-cluster)
–  Spark driver (SparkContext) runs locally in the client (yarn-client)
•  The Spark Shell runs in yarn-client mode only
[Diagram: in yarn-client mode the Spark driver runs in the client process and the YARN Application Master only negotiates executors; in yarn-cluster mode the driver runs inside the Application Master on the cluster.]
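A minimal sketch of targeting YARN from application code, assuming the Spark 1.2-era API (the app name is illustrative; in practice the master is usually passed to spark-submit rather than hard-coded):

import org.apache.spark.{SparkConf, SparkContext}

// "yarn-client" keeps the driver in this JVM; launching with "yarn-cluster"
// instead runs the driver inside the YARN Application Master
val conf = new SparkConf()
  .setAppName("DeploymentModeDemo")
  .setMaster("yarn-client")
val sc = new SparkContext(conf)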
Page 16 © Hortonworks Inc. 2014
Overview of Spark Install with Ambari
Hadoop admin: Add Service → select Spark → assign nodes for the Spark History Server & Spark clients → Spark is ready.
Spark user: go to a node with a Spark client and submit Spark jobs.
Page 17 © Hortonworks Inc. 2014
Overview of Spark App Lifecycle
Developer: write the Spark app → deploy it in Spark Standalone → test/develop in a REPL loop → the Spark app is ready.
Then take the Spark app → deploy it on YARN in a staging/production cluster → monitor and debug the Spark job.
Page 18 © Hortonworks Inc. 2014
Spark on YARN
[Screenshot: a Spark application in the YARN ResourceManager UI, showing its Application Master and a link to the Spark monitoring UI.]
Page 19 © Hortonworks Inc. 2014
Programming Spark
Page 20 © Hortonworks Inc. 2014
How Does Spark Work?
•  RDD
•  Your data is loaded in parallel into structured collections
•  Actions
•  Manipulate the state of the working model by forming new RDDs
and performing calculations upon them
•  Persistence
•  Long-term storage of an RDD’s state
Page 21 © Hortonworks Inc. 2014
Example RDD Transformations
• map(func)
• filter(func)
• distinct()
•  All create a new dataset from an existing one
•  The new dataset is not computed until an action is performed (lazy)
•  Each element in an RDD is passed to the target function and the result forms a new RDD
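A minimal sketch of these transformations chained together (the input values are illustrative):

// transformations are lazy: nothing executes until an action (here, collect) is called
val nums    = sc.parallelize(Seq(1, 2, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)              // 2, 4, 4, 6, 8, 10
val evens   = doubled.filter(_ % 4 == 0)   // 4, 4, 8
val unique  = evens.distinct()             // 4, 8
unique.collect().foreach(println)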
Page 22 © Hortonworks Inc. 2014
Example Action Operations
• count()
• reduce(func)
• collect()
• take(n)
•  Either:
•  Returns a value to the driver program
•  Exports state to an external system
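A minimal sketch of these actions (the values are illustrative):

val nums  = sc.parallelize(1 to 10)
val n     = nums.count()            // 10 – returned to the driver
val total = nums.reduce(_ + _)      // 55
val all   = nums.collect()          // Array(1, 2, ..., 10) – beware of very large RDDs
val first = nums.take(3)            // Array(1, 2, 3)
println(s"count=$n total=$total firstThree=${first.mkString(",")}")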
Page 23 © Hortonworks Inc. 2014
Example Persistence Operations
• persist() -- takes storage-level options
• cache() -- only one option: in-memory
•  Stores RDD values
•  in memory (what doesn't fit is recalculated when necessary); replication is an option for in-memory
•  to disk
•  blended (memory and disk)
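A minimal persistence sketch; the storage levels shown are standard Spark options and the path is illustrative:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://…/logs")

// pick one storage level per RDD – it cannot be changed once assigned
logs.persist(StorageLevel.MEMORY_AND_DISK)    // blended: spill to disk what doesn't fit in memory
// logs.cache()                               // shorthand for persist(StorageLevel.MEMORY_ONLY)
// logs.persist(StorageLevel.MEMORY_ONLY_2)   // in-memory with 2x replication

logs.count()        // the first action materializes and caches the RDD
logs.unpersist()    // drop it from memory/disk when no longer needed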
Page 24 © Hortonworks Inc. 2014
1. Resilient Distributed Dataset [RDD] Graph
val v = sc.textFile("hdfs://…some-hdfs-data")

v.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _, 3)
 .collect()

Lineage: textFile → flatMap → map → reduceByKey → collect
Result types along the chain: RDD[String] → RDD[String] → RDD[(String, Int)] → RDD[(String, Int)] → Array[(String, Int)]
Page 25 © Hortonworks Inc. 2014
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
Page 26 © Hortonworks Inc. 2014
Looking at the State in the Machine
//run debug command to inspect RDD:
scala> fltr.toDebugString
//simplified output:
res1: String =
FilteredRDD[2] at filter at <console>:14
MappedRDD[1] at textFile at <console>:12
HadoopRDD[0] at textFile at <console>:12
Page 27 © Hortonworks Inc. 2014
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions as can
be seen in the code:
flatMap( line => line.split(" ") )
In flatMap(line => line.split(" ")), line is the argument to the function and line.split(" ") is the body of the function.
Page 28 © Hortonworks Inc. 2014
Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") )
flatMap((line:String) => line.split(" "))
flatMap(_.split(" "))
In the first form the argument type is inferred; in the second the argument type is explicit; in the third no argument is declared – the placeholder _ stands in for it. A body using placeholders allows exactly one use of one argument for each _ present; _ essentially means ‘whatever you pass me’.
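For example, the placeholder style also covers two-argument functions – a small sketch with illustrative data:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// these two lines are equivalent: each _ stands for exactly one argument, in order
pairs.reduceByKey((x, y) => x + y).collect()
pairs.reduceByKey(_ + _).collect()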
Page 29 © Hortonworks Inc. 2014
And Finally – the Formal ‘def’
def myFunc(line:String): Array[String]={
return line.split(",")
}
//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)
Here line: String is the argument to the function, Array[String] is the return type of the function, and line.split(",") is the body of the function.
Page 30 © Hortonworks Inc. 2014
Things You Can Do With RDDs
•  RDDs are objects and expose a rich set of methods:
filter – Return a new RDD containing only those elements that satisfy a predicate
collect – Return an array containing all the elements of this RDD
count – Return the number of elements in this RDD
first – Return the first element of this RDD
foreach – Applies a function to all elements of this RDD (does not return an RDD)
reduce – Reduces the contents of this RDD
subtract – Return an RDD without duplicates of elements found in the passed-in RDD
union – Return an RDD that is a union of the passed-in RDD and this one
Page 31 © Hortonworks Inc. 2014
More Things You Can Do With RDDs
•  More stuff you can do…
flatMap – Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
checkpoint – Mark this RDD for checkpointing (its state will be saved so it need not be recreated from scratch)
cache – Load the RDD into memory (what doesn't fit will be recalculated as needed)
countByValue – Return the count of each unique value in this RDD as a map of (value, count) pairs
distinct – Return a new RDD containing the distinct elements in this RDD
persist – Store the RDD to memory, disk, or a hybrid according to the passed-in value
sample – Return a sampled subset of this RDD
unpersist – Clear any record of the RDD from disk/memory
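A short sketch exercising a few of these methods (the input data is illustrative):

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn", "hive"))
val stop  = sc.parallelize(Seq("hive"))

words.distinct().collect()                   // Array(spark, hadoop, yarn, hive) – order may vary
words.countByValue()                         // Map(spark -> 2, hadoop -> 1, yarn -> 1, hive -> 1)
words.subtract(stop).collect()               // everything except "hive"
words.union(stop).count()                    // 6
words.sample(withReplacement = false, 0.5)   // a random ~50% subset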
Page 32 © Hortonworks Inc. 2014
Code ‘select count’
Equivalent SQL Statement:
Select count(*) from pagecounts WHERE state = 'FL'
Scala statement:
val file = sc.textFile("hdfs://…/log.txt")
val numFL = file.filter(line =>
line.contains("fl")).count()
scala> println(numFL)
1. Load the page as an RDD
2. Filter the lines of the page
eliminating any that do not
contain "fl"
3. Count those lines that
remain
4. Print the value of the
counted lines containing ‘fl’
Page 33 © Hortonworks Inc. 2014
Spark SQL
Page 34 © Hortonworks Inc. 2014
What About Integration With Hive?
scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println)
…
[omniture]
[omniturelogs]
[orc_table]
[raw_products]
[raw_users]
…
Page 35 © Hortonworks Inc. 2014
More Integration With Hive:
scala> hCTX.hql("DESCRIBE raw_users").collect().foreach(println)
[swid,string,null]
[birth_date,string,null]
[gender_cd,string,null]
scala> hCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT
5").collect().foreach(println)
[0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F]
[00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F]
[00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F]
[000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F]
[00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
Page 36 © Hortonworks Inc. 2014
Querying RDD Using SQL
// SQL statements can be run directly on RDDs
val teenagers = sqlC.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support normal RDD operations:
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (à la LINQ)
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
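A minimal sketch of how the sqlC and people values above might be created, assuming the Spark 1.2-era SchemaRDD API (the file path and Person fields are illustrative):

import org.apache.spark.sql.SQLContext

val sqlC = new SQLContext(sc)
import sqlC.createSchemaRDD   // implicit conversion from an RDD of case classes to a SchemaRDD

case class Person(name: String, age: Int)
val people = sc.textFile("hdfs://…/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")   // makes the RDD queryable as the "people" table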
Page 37 © Hortonworks Inc. 2014
Spark MLlib – Algorithms Offered
•  Classification: logistic regression, linear SVM,
– naïve Bayes, least squares, classification tree
•  Regression: generalized linear models (GLMs),
– regression tree
•  Collaborative filtering: alternating least squares (ALS),
– non-negative matrix factorization (NMF)
•  Clustering: k-means
•  Decomposition: SVD, PCA
•  Optimization: stochastic gradient descent, L-BFGS
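A minimal MLlib sketch using one of the algorithms above, k-means (the input path and parameters are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data   = sc.textFile("hdfs://…/kmeans_data.txt")   // e.g. "1.0 2.0 3.0" per line
val points = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
println("Within-set sum of squared errors = " + model.computeCost(points))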
Page 38 © Hortonworks Inc. 2014
MicroBatch Spark Streams
Page 39 © Hortonworks Inc. 2014
Physical Execution
Page 40 © Hortonworks Inc. 2014
Spark Streaming 101
•  Spark has significant library support for streaming applications
val ssc = new StreamingContext(sc, Seconds(5))
val tweetStream = TwitterUtils.createStream(ssc, Some(auth))
•  Allows combining streaming with batch/ETL, SQL & ML
•  Read data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom sources
•  Chop input data stream into batches
•  Spark processes batches & results published in batches
•  Fundamental unit is Discretized Streams (DStreams)
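A minimal DStream sketch – word counts over 5-second micro-batches from a socket source (the host and port are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(5))
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()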
Page 41 © Hortonworks Inc. 2014
Spark on HDP Demo
Page 42 © Hortonworks Inc. 2014
Twitter Language Classifier
Goal: connect to the real-time Twitter stream and print only those tweets whose language matches our chosen language.
Main issue: how do we detect the language at run time?
Solution: build a language classifier model offline that can detect the language of a tweet (MLlib), then apply it to the real-time Twitter stream and filter (Spark Streaming).
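A hedged sketch of that structure; the featurization, model loading and cluster id are placeholders, not the exact demo code:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

def featurize(text: String): Vector = ???   // e.g. hashed character-bigram frequencies
val model: KMeansModel = ???                // produced by the offline MLlib training job
val targetCluster = 3                       // the cluster observed to hold the chosen language

val ssc    = new StreamingContext(sc, Seconds(5))
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText)
      .filter(text => model.predict(featurize(text)) == targetCluster)
      .print()
ssc.start()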
Page 43 © Hortonworks Inc. 2014
Tuning and Debugging Spark
Page 44 © Hortonworks Inc. 2014
Spark Tuning - 1
• Understand how Spark works
– Spark Transformations create new RDDs
– Transformations are executed lazily, when an action is called.
– collect() may cause an out-of-memory error when large data is sent from the executors to the driver
– Prefer reduceByKey over groupByKey (see the sketch after this list)
• Understand your application
– ML is CPU intensive
– ETL is IO intensive
– Avoid shuffle
– Use the right join for the Table (ShuffledHashJoin vs BroadcastHashJoin)
– Manual Configuration
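A minimal sketch of the reduceByKey-vs-groupByKey point above (the data is illustrative): reduceByKey combines values map-side before the shuffle, while groupByKey ships every value across the network.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

val summed  = pairs.reduceByKey(_ + _)               // preferred: partial aggregation before the shuffle
val grouped = pairs.groupByKey().mapValues(_.sum)    // same result, but shuffles every value

summed.collect().foreach(println)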
Page 45 © Hortonworks Inc. 2014
Spark Tuning - 2
•  Understand where time is spent
–  See http://<host>:8088/proxy/<app_id>/stages/
•  Use Kryo serializer
–  Needs configuration and registering classes (see the sketch after this list)
•  Use toDebugString on RDD
–  counts.toDebugString
res2: String =
(2) ShuffledRDD[4] at reduceByKey at <console>:14 []
+-(2) MappedRDD[3] at map at <console>:14 []
| FlatMappedRDD[2] at flatMap at <console>:14 []
| hdfs://red1:8020/tmp/data MappedRDD[1] at textFile at <console>:12 []
| hdfs://red1:8020/tmp/data HadoopRDD[0] at textFile at <console>:12 []
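A minimal Kryo configuration sketch, assuming Spark 1.2+ (the registered classes are hypothetical application classes):

import org.apache.spark.SparkConf

case class MyKey(id: Long)
case class MyRecord(key: MyKey, text: String)

val conf = new SparkConf()
  .setAppName("KryoDemo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))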
Page 46 © Hortonworks Inc. 2014
When Things go wrong
•  Where to look
–  yarn application -list (lists the running applications)
–  yarn logs -applicationId <app_id>
–  Check Spark Environment : http://<host>:8088/proxy/<job_id>/environment/
•  Common Issues
–  Submitted a job but nothing happens: the job stays in the ACCEPTED state when it requests more memory/cores than are available
–  May need to kill unresponsive/stale jobs
–  Insufficient HDFS access (grant the user/group the necessary HDFS access); may lead to failures such as:
“Loading data to table default.testtable
Failed with exception Unable to move source hdfs://red1:8020/tmp/hive-spark/hive_2015-03-04_12-45-42_404_3643812080461575333-1/-ext-10000/kv1.txt to destination hdfs://red1:8020/apps/hive/warehouse/testtable/kv1.txt”
–  Wrong host in Beeline shows an invalid URL error:
“Error: Invalid URL: jdbc:hive2://localhost:10001 (state=08S01,code=0)”
–  Error about a closed SQLContext: restart the Thrift Server
Page 47 © Hortonworks Inc. 2014
Conclusion and Resources
Page 48 © Hortonworks Inc. 2014
Conclusion
•  Spark is a unified framework for data engineering and data
science
•  Spark can be programmed in Scala, Java and Python.
•  Spark will be supported by Hortonworks on HDP in Q1 2015
•  Certain workloads are faster in Spark because of in-memory
caching.
Page 49 © Hortonworks Inc. 2014
References and Further Reading
•  Apache Spark website: https://spark.apache.org/
•  Hortonworks Spark website: http://hortonworks.com/hadoop/spark/
•  Databricks Developer Resources: https://databricks.com/spark/
developer-resources
•  “Learning Spark” by O’Reilly Publishers
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Hortonworks tech workshop in-memory processing with spark

  • 1. Page 1 © Hortonworks Inc. 2014 Hortonworks Technical Workshop: In-memory Processing with Apache Spark Dhruv Kumar and Ajay Singh Hortonworks. We do Hadoop. March 12, 2015
  • 2. Page 2 © Hortonworks Inc. 2014 About the presenters Ajay Singh Director Technical Channels. Hortonworks Inc. Dhruv Kumar Partner Solutions Engineer. Hortonworks Inc.
  • 3. Page 3 © Hortonworks Inc. 2014 In this workshop •  Introduction to HDP and Spark •  Installing Spark on HDP •  Spark Programming •  Core Spark: working with RDDs •  Spark SQL: structured data access •  Spark Mlib: predictive analytics •  Spark Streaming: real time data processing •  Spark Application Demo: Twitter Language Classifier using Mlib and Streaming •  Tuning and Debugging Spark •  Conclusion and Further Reading, Q/A
  • 4. Page 4 © Hortonworks Inc. 2014 Introduction to HDP and Spark
  • 5. Page 5 © Hortonworks Inc. 2014 HDP delivers a comprehensive data management platform HDP 2.2 Hortonworks Data Platform Provision, Manage & Monitor Ambari Zookeeper Scheduling Oozie Data Workflow, Lifecycle & Governance Falcon Sqoop Flume NFS WebHDFS YARN: Data Operating System DATA MANAGEMENT SECURITY BATCH, INTERACTIVE & REAL-TIME DATA ACCESS GOVERNANCE & INTEGRATION Authentication Authorization Accounting Data Protection Storage: HDFS Resources: YARN Access: Hive, … Pipeline: Falcon Cluster: Knox OPERATIONS Script Pig Search Solr SQL Hive HCatalog NoSQL HBase Accumulo Stream Storm Other ISVs 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N HDFS (Hadoop Distributed File System) In-Memory Spark Deployment Choice Linux Windows On-Premise Cloud YARN is the architectural center of HDP •  Enables batch, interactive and real-time workloads •  Single SQL engine for both batch and interactive •  Enables best of breed ISV tools to deeply integrate into Hadoop via YARN Provides comprehensive enterprise capabilities •  Governance •  Security •  Operations The widest range of deployment options •  Linux & Windows •  On premise & cloud Tez Tez
  • 6. Page 6 © Hortonworks Inc. 2014 Let’s drill into one workload … Spark HDP 2.1 Hortonworks Data Platform Provision, Manage & Monitor Ambari Zookeeper Scheduling Oozie Data Workflow, Lifecycle & Governance Falcon Sqoop Flume NFS WebHDFS YARN: Data Operating System DATA MANAGEMENT SECURITY BATCH, INTERACTIVE & REAL-TIME DATA ACCESS GOVERNANCE & INTEGRATION Authentication Authorization Accounting Data Protection Storage: HDFS Resources: YARN Access: Hive, … Pipeline: Falcon Cluster: Knox OPERATIONS Script Pig Search Solr SQL Hive HCatalog NoSQL HBase Accumulo Stream Storm Other ISVs 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N HDFS (Hadoop Distributed File System) Deployment Choice Linux Windows On-Premise Cloud YARN is the architectural center of HDP •  Enables batch, interactive and real-time workloads •  Single SQL engine for both batch and interactive •  Enables best of breed ISV tools to deeply integrate into Hadoop via YARN Provides comprehensive enterprise capabilities •  Governance •  Security •  Operations The widest range of deployment options •  Linux & Windows •  On premise & cloud Tez Tez In-Memory
  • 7. Page 7 © Hortonworks Inc. 2014 What is Spark? •  Spark is –  an open-source software solution that performs rapid calculations on in-memory datasets -  Open Source [Apache hosted & licensed] •  Free to download and use in production •  Developed by a community of developers -  In-memory datasets •  RDD (Resilient Distributed Data) is the basis for what Spark enables •  Resilient – the models can be recreated on the fly from known state •  Distributed – the dataset is often partitioned across multiple nodes for increased scalability and parallelism
  • 8. Page 8 © Hortonworks Inc. 2014 Spark Components Spark lets you do data processing, ETL, machine learning, stream processing, and SQL querying from a single framework
  • 9. Page 9 © Hortonworks Inc. 2014 Why Spark? •  One tool for data engineering and data science tasks •  Native integration with Hive, HDFS and any Hadoop FileSystem implementation •  Faster development: concise API, Scala (~3x less code than Java) •  Faster execution for iterative jobs because of in-memory caching (not all workloads are faster in Spark) •  Promotes code reuse: APIs and data types are similar for batch and streaming
  • 10. Page 10 © Hortonworks Inc. 2014 Hortonworks Commitment to Spark Hortonworks is focused on making Apache Spark enterprise ready so you can depend on it for mission-critical applications (HDP stack diagram: YARN as the data operating system, with Spark alongside Pig, Hive, HBase, Storm and other engines) 1.  YARN enables Spark to co-exist with other engines Spark is "YARN Ready," so its memory- and CPU-intensive apps can work with predictable performance alongside other engines, all on the same set(s) of data. 2.  Extend Spark with enterprise capabilities Ensure Spark can be managed, secured and governed via a single set of frameworks to ensure consistency. Ensure reliability and quality of service of Spark alongside other engines. 3.  Actively collaborate within the open community As with everything we do at Hortonworks, we work entirely within the open community across Spark and all related projects to improve this key Hadoop technology.
  • 11. Page 11 © Hortonworks Inc. 2014 Spark on HDP: history and what’s coming Spark Tech Previews we’ve done so far Spark 0.9.1 Spark 1.0.1 Spark 1.2.0 Spark GA: Q1/Q2 of 2015 Install Spark 1.2.1 with HDP 2.2.2 & Ambari 2.0.0 Install with Ambari/ Manual package for DEB/RPM/MSI Spark on Kerberos enabled cluster Spark on YARN Spark job history Server Full ORC support Built with Hive 0.14 dependency Spark Thrift Server for JDBC/ODBC Spark embedded Beeline Spark Simba ODBC Driver (coming)
  • 12. Page 12 © Hortonworks Inc. 2014 Reference Deployment Architecture Batch Source Streaming Source Reference Data Stream Processing Storm/Spark-Streaming Data Pipeline Hive/Pig/Spark Long Term Data Warehouse Hive + ORC Data Discovery Operational Reporting Business Intelligence Ad Hoc/On Demand Source Data Science Models Spark-ML, Spark-SQL, ISV Advanced Analytics Data Sources Data Processing, Storage & Analytics Data Access Hortonworks Data Platform
  • 13. Page 13 © Hortonworks Inc. 2014 Installing Spark on HDP
  • 14. Page 14 © Hortonworks Inc. 2014 Installing Spark on HDP •  GA of Spark 1.2.1 in Q1 2015 –  Fully supported by Hortonworks –  Install with Ambari 2.0.0 & HDP 2.2.2; other combinations are unsupported •  You can try the tech preview now
  • 15. Page 15 © Hortonworks Inc. 2014 Spark Deployment Modes Modes set up with Ambari •  Spark Standalone Cluster –  For developing Spark apps against a local Spark (similar to developing/deploying in an IDE) •  Spark on YARN –  Spark driver (SparkContext) in the YARN AM (yarn-cluster) –  Spark driver (SparkContext) local to the client (yarn-client) •  Spark Shell runs in yarn-client only (diagram: placement of client, executors, App Master and Spark Driver in yarn-client vs. yarn-cluster)
  • 16. Page 16 © Hortonworks Inc. 2014 Overview of Spark Install with Ambari Select Spark Assign nodes for Spark History Server & Spark Client Add Service Go to a node with Spark Client Submit spark jobs Hadoop Admin Spark is Ready Spark User
  • 17. Page 17 © Hortonworks Inc. 2014 Overview of Spark App Lifecycle Deploy in Spark Standalone Test/ Develop/ REPL loop Write Spark App Deploy Spark Apps on YARN in a Staging/ Production Cluster Monitor Debug Spark Job Developer Spark App is Ready Take Spark App
  • 18. Page 18 © Hortonworks Inc. 2014 Spark on YARN (diagram: YARN ResourceManager, Application Master, monitoring UI)
  • 19. Page 19 © Hortonworks Inc. 2014 Programming Spark
  • 20. Page 20 © Hortonworks Inc. 2014 How Does Spark Work? •  RDD •  Your data is loaded in parallel into structured collections •  Actions •  Manipulate the state of the working model by forming new RDDs and performing calculations upon them •  Persistence •  Long-term storage of an RDD’s state
  • 21. Page 21 © Hortonworks Inc. 2014 Example RDD Transformations • map(func) • filter(func) • distinct() •  All create a new dataset from an existing one •  Do not create the dataset until an action is performed (lazy) •  Each element in an RDD is passed to the target function and the result forms a new RDD
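As a minimal sketch of the above (the SparkContext sc is assumed to exist and the input path is made up), note that nothing executes until the final action:
// Hypothetical input file with one word or phrase per line
val lines = sc.textFile("hdfs:///tmp/words.txt")
// Transformations: each returns a new RDD; all of them are lazy
val lower    = lines.map(_.toLowerCase)     // map(func)
val nonEmpty = lower.filter(_.nonEmpty)     // filter(func)
val unique   = nonEmpty.distinct()          // distinct()
// Only this action triggers execution of the whole chain
println(unique.count())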
  • 22. Page 22 © Hortonworks Inc. 2014 Example Action Operations • count() • reduce(func) • collect() • take() •  Either: •  Returns a value to the driver program •  Exports state to external system
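A tiny illustrative example (the data is made up) of actions returning values to the driver:
val nums = sc.parallelize(1 to 100)
println(nums.count())          // count(): 100
println(nums.reduce(_ + _))    // reduce(func): 5050
val firstTen = nums.take(10)   // take(): brings a small sample to the driver
val all = nums.collect()       // collect(): the whole RDD to the driver; use with care on large data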
  • 23. Page 23 © Hortonworks Inc. 2014 Example Persistence Operations • persist() -- takes options • cache() -- only one option: in-memory •  Stores RDD Values •  in memory (what doesn’t fit is recalculated when necessary) •  Replication is an option for in-memory •  to disk •  blended
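A short sketch of both calls, assuming two RDDs (the paths are made up) that will be reused across several actions:
import org.apache.spark.storage.StorageLevel
val logs = sc.textFile("hdfs:///tmp/app-logs")   // hypothetical path
logs.cache()                                     // shorthand for persist(StorageLevel.MEMORY_ONLY)
val events = sc.textFile("hdfs:///tmp/events")   // second RDD, persisted with an explicit level
events.persist(StorageLevel.MEMORY_AND_DISK)     // partitions that don't fit in memory spill to disk
logs.unpersist()                                 // drop the cached data when no longer needed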
  • 24. Page 24 © Hortonworks Inc. 2014 1. Resilient Distributed Dataset [RDD] Graph
val v = sc.textFile("hdfs://…some-hdfs-data")
v.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _, 3)
 .collect()
// Lineage: textFile -> flatMap -> map -> reduceByKey -> collect
// RDD types along the graph (as labeled on the slide): RDD[String], RDD[List[String]], RDD[(String, Int)], RDD[(String, Int)], Array[(String, Int)]
  • 25. Page 25 © Hortonworks Inc. 2014 Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
  • 26. Page 26 © Hortonworks Inc. 2014 Looking at the State in the Machine
//run debug command to inspect RDD:
scala> fltr.toDebugString
//simplified output:
res1: String =
FilteredRDD[2] at filter at <console>:14
 MappedRDD[1] at textFile at <console>:12
  HadoopRDD[0] at textFile at <console>:12
  • 27. Page 27 © Hortonworks Inc. 2014 A Word on Anonymous Functions Scala programmers make great use of anonymous functions, as can be seen in the code:
flatMap( line => line.split(" ") )
// left of => : the argument to the function; right of => : the body of the function
  • 28. Page 28 © Hortonworks Inc. 2014 Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") )          // argument type inferred
flatMap((line:String) => line.split(" "))   // argument type explicit
flatMap(_.split(" "))                       // no argument declared; the placeholder _ is used instead
The placeholder form allows exactly one use of one argument for each _ present; _ essentially means ‘whatever you pass me’.
  • 29. Page 29 © Hortonworks Inc. 2014 And Finally – the Formal ‘def’
// argument: line of type String; return type: Array[String]; body: split on commas
def myFunc(line:String): Array[String] = {
  return line.split(",")
}
//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)
  • 30. Page 30 © Hortonworks Inc. 2014 Things You Can Do With RDDs •  RDDs are objects and expose a rich set of methods:
–  filter: Return a new RDD containing only those elements that satisfy a predicate
–  collect: Return an array containing all the elements of this RDD
–  count: Return the number of elements in this RDD
–  first: Return the first element of this RDD
–  foreach: Applies a function to all elements of this RDD (does not return an RDD)
–  reduce: Reduces the contents of this RDD
–  subtract: Return an RDD without duplicates of elements found in the passed-in RDD
–  union: Return an RDD that is a union of the passed-in RDD and this one
  • 31. Page 31 © Hortonworks Inc. 2014 More Things You Can Do With RDDs •  More stuff you can do…
–  flatMap: Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
–  checkpoint: Mark this RDD for checkpointing (its state will be saved so it need not be recreated from scratch)
–  cache: Load the RDD into memory (what doesn’t fit will be calculated as needed)
–  countByValue: Return the count of each unique value in this RDD as a map of (value, count) pairs
–  distinct: Return a new RDD containing the distinct elements in this RDD
–  persist: Store the RDD to either memory, disk, or a hybrid according to the passed-in value
–  sample: Return a sampled subset of this RDD
–  unpersist: Clear any record of the RDD from disk/memory
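A quick, made-up example of a few of the methods above (output order may vary):
val a = sc.parallelize(Seq("spark", "hive", "hbase"))
val b = sc.parallelize(Seq("hive", "storm"))
a.union(b).collect()                  // Array(spark, hive, hbase, hive, storm) -- duplicates kept
a.union(b).distinct().collect()       // duplicates removed
a.subtract(b).collect()               // Array(spark, hbase)
a.union(b).countByValue()             // Map(hive -> 2, spark -> 1, hbase -> 1, storm -> 1)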
  • 32. Page 32 © Hortonworks Inc. 2014 Code ‘select count’
Equivalent SQL statement: SELECT count(*) FROM pagecounts WHERE state = ‘FL’
Scala statement:
val file = sc.textFile("hdfs://…/log.txt")
val numFL = file.filter(line => line.contains("fl")).count()
scala> println(numFL)
1. Load the page as an RDD
2. Filter the lines of the page, eliminating any that do not contain “fl“
3. Count those lines that remain
4. Print the value of the counted lines containing ‘fl’
  • 33. Page 33 © Hortonworks Inc. 2014 Spark SQL
  • 34. Page 34 © Hortonworks Inc. 2014 What About Integration With Hive?
scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println)
…
[omniture]
[omniturelogs]
[orc_table]
[raw_products]
[raw_users]
…
  • 35. Page 35 © Hortonworks Inc. 2014 More Integration With Hive:
scala> hCTX.hql("DESCRIBE raw_users").collect().foreach(println)
[swid,string,null]
[birth_date,string,null]
[gender_cd,string,null]
scala> hCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT 5").collect().foreach(println)
[0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F]
[00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F]
[00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F]
[000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F]
[00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
  • 36. Page 36 © Hortonworks Inc. 2014 Querying RDD Using SQL
// SQL statements can be run directly on RDDs
val teenagers = sqlC.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support normal RDD operations:
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (à la LINQ)
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
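For the snippet above to run, the people table has to be registered first; a sketch of how that might look in Spark 1.2 (the Person case class and the input file are assumptions for illustration):
import org.apache.spark.sql.SQLContext
case class Person(name: String, age: Int)
val sqlC = new SQLContext(sc)
import sqlC.createSchemaRDD             // implicit conversion from RDD[Person] to SchemaRDD
// Hypothetical "name,age" CSV
val people = sc.textFile("hdfs:///tmp/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")      // now SQL and the query DSL can refer to "people"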
  • 37. Page 37 © Hortonworks Inc. 2014 Spark MLlib – Algorithms Offered •  Classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree •  Regression: generalized linear models (GLMs), regression tree •  Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) •  Clustering: k-means •  Decomposition: SVD, PCA •  Optimization: stochastic gradient descent, L-BFGS
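To give a feel for the MLlib API, a minimal k-means sketch (the input path and parameters are made up; the file is assumed to hold one space-separated feature vector per line):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs:///tmp/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(data, 3, 20)            // k = 3 clusters, 20 iterations
println("Within-set sum of squared errors = " + model.computeCost(data))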
  • 38. Page 38 © Hortonworks Inc. 2014 MicroBatch Spark Streams
  • 39. Page 39 © Hortonworks Inc. 2014 Physical Execution
  • 40. Page 40 © Hortonworks Inc. 2014 Spark Streaming 101 •  Spark has significant library support for streaming applications:
val ssc = new StreamingContext(sc, Seconds(5))
val tweetStream = TwitterUtils.createStream(ssc, Some(auth))
•  Allows combining streaming with batch/ETL, SQL & ML •  Reads data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom sources •  Chops the input data stream into batches •  Spark processes the batches & publishes results in batches •  The fundamental unit is the Discretized Stream (DStream)
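A minimal DStream sketch (the socket source, host and port are placeholders; Kafka, Flume or Twitter receivers plug in the same way) that counts words in 5-second micro-batches:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()            // print the first few counts of every batch
ssc.start()
ssc.awaitTermination()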
  • 41. Page 41 © Hortonworks Inc. 2014 Spark on HDP Demo
  • 42. Page 42 © Hortonworks Inc. 2014 Twitter Language Classifier Goal: connect to the real-time Twitter stream and print only those tweets whose language matches our chosen language. Main issue: how do we detect the language at run time? Solution: build a language classifier model offline that can detect the language of a tweet (MLlib), then apply it to the real-time Twitter stream and filter (Spark Streaming).
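Very roughly, and with the feature extraction, corpus path and cluster number all being assumptions rather than the exact demo code, the offline/online split looks like this:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
// Hypothetical featurizer: hash character bigrams of the tweet text into a fixed-size vector
val tf = new HashingTF(1000)
def featurize(text: String) = tf.transform(text.sliding(2).toSeq)
// Offline: train on a pre-collected corpus of tweet texts
val corpus = sc.textFile("hdfs:///tmp/tweets.txt")
val model = KMeans.train(corpus.map(featurize).cache(), 10, 20)
val myCluster = 3          // the cluster observed to correspond to the chosen language
// Online: keep only tweets the model assigns to that cluster
val ssc = new StreamingContext(sc, Seconds(5))
val tweets = TwitterUtils.createStream(ssc, None).map(_.getText)
tweets.filter(t => model.predict(featurize(t)) == myCluster).print()
ssc.start()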
  • 43. Page 43 © Hortonworks Inc. 2014 Tuning and Debugging Spark
  • 44. Page 44 © Hortonworks Inc. 2014 Spark Tuning - 1 • Understand how Spark works – Spark transformations create new RDDs – Transformations are executed lazily, only when an action is called – collect() may cause an OutOfMemoryError when large data is sent from the executors to the driver – Prefer reduceByKey over groupByKey (see the sketch below) • Understand your application – ML is CPU intensive – ETL is IO intensive – Avoid shuffles – Use the right join for the table (ShuffledHashJoin vs. BroadcastHashJoin) – Manual configuration
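To make the shuffle point concrete, a small word-count sketch (the input path is made up): reduceByKey combines values on each executor before the shuffle, while the groupByKey variant ships every (word, 1) pair across the network first:
val words = sc.textFile("hdfs:///tmp/words.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
// Preferred: map-side combine, far less data shuffled
val counts = words.reduceByKey(_ + _)
// Avoid for simple aggregations: shuffles all records, then sums on the reduce side
val countsViaGroup = words.groupByKey().mapValues(_.sum)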
  • 45. Page 45 © Hortonworks Inc. 2014 Spark Tuning - 2 •  Understand where time is spent –  See http://<host>:8088/proxy/<app_id>/stages/ •  Use the Kryo serializer (a configuration sketch follows below) –  Needs configuration and registration of your classes •  Use toDebugString on an RDD –  counts.toDebugString
res2: String =
(2) ShuffledRDD[4] at reduceByKey at <console>:14 []
 +-(2) MappedRDD[3] at map at <console>:14 []
    | FlatMappedRDD[2] at flatMap at <console>:14 []
    | hdfs://red1:8020/tmp/data MappedRDD[1] at textFile at <console>:12 []
    | hdfs://red1:8020/tmp/data HadoopRDD[0] at textFile at <console>:12 []
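A sketch of enabling Kryo in Spark 1.2 (the registered Click class is a made-up example; registerKryoClasses is available from Spark 1.2.0 onward):
import org.apache.spark.{SparkConf, SparkContext}
case class Click(userId: String, url: String)      // hypothetical class that flows through shuffles
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Click]))      // registering classes keeps serialized data compact
val sc = new SparkContext(conf)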
  • 46. Page 46 © Hortonworks Inc. 2014 When Things Go Wrong •  Where to look –  yarn application -list (get the list of running applications) –  yarn logs -applicationId <app_id> –  Check the Spark environment: http://<host>:8088/proxy/<job_id>/environment/ •  Common issues –  Submitted a job but nothing happens –  Job stays in the ACCEPTED state when allocated more memory/cores than is available; may need to kill unresponsive/stale jobs –  Insufficient HDFS access may lead to failures such as “Loading data to table default.testtable Failed with exception Unable to move source hdfs://red1:8020/tmp/hive-spark/hive_2015-03-04_12-45-42_404_3643812080461575333-1/-ext-10000/kv1.txt to destination hdfs://red1:8020/apps/hive/warehouse/testtable/kv1.txt”; grant the user/group the necessary HDFS access –  Wrong host in Beeline shows an invalid-URL error: “Error: Invalid URL: jdbc:hive2://localhost:10001 (state=08S01,code=0)” –  Errors about a closed SQLContext: restart the Thrift Server
  • 47. Page 47 © Hortonworks Inc. 2014 Conclusion and Resources
  • 48. Page 48 © Hortonworks Inc. 2014 Conclusion •  Spark is a unified framework for data engineering and data science •  Spark can be programmed in Scala, Java and Python. •  Spark will be supported by Hortonworks on HDP in Q1 2015 •  Certain workloads are faster in Spark because of in-memory caching.
  • 49. Page 49 © Hortonworks Inc. 2014 References and Further Reading •  Apache Spark website: https://spark.apache.org/ •  Hortonworks Spark website: http://hortonworks.com/hadoop/spark/ •  Databricks Developer Resources: https://databricks.com/spark/developer-resources •  “Learning Spark,” published by O’Reilly