1. Page1 © Hortonworks Inc. 2014
Introduction to Big Data Analytics using
Apache Spark on HDInsights on Azure
(SaaS) and/or HDP on Azure(PaaS)
Hortonworks. We do Hadoop.
Alex Zeltov
@azeltov
2. Page2 © Hortonworks Inc. 2014
In this workshop
• Introduction to HDP and Spark
• Build a Data analytics application:
- Spark Programming: Scala, Python, R
- Core Spark: working with RDDs, DataFrames
- Spark SQL: structured data access
• Conclusion and Q/A
3. Page3 © Hortonworks Inc. 2014
Introduction to HDP and Spark
http://hortonworks.com/hadoop/spark/
4. Page4 © Hortonworks Inc. 2014
What is Spark?
• Spark is
– an open-source software solution that performs rapid calculations
on in-memory datasets
- Open source [Apache hosted & licensed]
• Free to download and use in production
• Developed by a community of developers
- Spark supports well-known languages such as Scala, Python, R and Java
- Spark SQL: seamlessly mix SQL queries with Spark programs
- In-memory datasets
• The RDD (Resilient Distributed Dataset) is the basis for what Spark enables
• Resilient – lost partitions can be recreated on the fly from known state
• Distributed – the dataset is often partitioned across multiple nodes for
increased scalability and parallelism (see the sketch below)
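A minimal sketch of both properties, assuming the sc context provided by the Spark shell or Zeppelin:
val nums = sc.parallelize(1 to 100, 4)  // distributed: data split across 4 partitions
val squares = nums.map(n => n * n)      // lineage is recorded; nothing is computed yet
squares.sum()                           // resilient: a lost partition is recomputed from lineage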
5. Page5 © Hortonworks Inc. 2014
Spark is certified as YARN Ready and is a part of HDP.
Hortonworks Data Platform 2.4
[Diagram: the HDP 2.4 stack]
• YARN: Data Operating System (cluster resource management), running over HDFS (Hadoop Distributed File System)
• Batch, interactive & real-time data access: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, ISV engines
• Governance & integration: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas
• Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS encryption
• Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
• Deployment choice: Linux, Windows, on-premises, cloud
6. Page6 © Hortonworks Inc. 2014
Spark Components
Spark allows you to do data processing, ETL, machine learning,
stream processing and SQL querying from one framework
7. Page7 © Hortonworks Inc. 2014
Ease of Use
• Write applications quickly in Java, Scala, Python, R.
• Spark offers over 80 high-level operators that make it easy to
build parallel apps. And you can use it interactively from the
Scala, Python and R shells.
8. Page8 © Hortonworks Inc. 2014
Generality
• Combine SQL, streaming, and complex analytics.
• Spark powers a stack of libraries including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
Runs Everywhere:
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access
diverse data sources including HDFS, Cassandra, HBase, S3, WASB
• https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
9. Page9 © Hortonworks Inc. 2014
Emerging Spark Patterns
• Spark as a query federation engine
Bring data from multiple sources to join/query in Spark (see the sketch below)
• Use multiple Spark libraries together
It is common to see Core, ML & SQL used together
• Use Spark with various Hadoop ecosystem projects
Spark & Hive together
Spark & HBase together
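As a hedged sketch of the federation pattern, a Hive table can be joined with a JSON file in one query (the table name, column names and path below are hypothetical):
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val orders = hc.table("orders")                 // a Hive table
val customers = hc.read.json("/tmp/customers.json")  // a JSON source
orders.join(customers, orders("cust_id") === customers("id"))
      .groupBy(customers("country")).count().show()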
10. Page10 © Hortonworks Inc. 2014
Why We Love Spark at Hortonworks
• Elegant developer APIs: DataFrames, machine learning, and SQL
• Made for data science: all apps need to get predictive at scale and at fine granularity
• Democratizes machine learning: Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop
• Community: broad developer, customer and partner interest
• Realizes the value of the data operating system: a key tool in the Hadoop toolbox
[Diagram: Spark Core Engine exposing Spark SQL, Spark Streaming, MLlib and GraphX, with Scala, Java, Python and R APIs, running on YARN over HDFS across N nodes]
12. Page12 © Hortonworks Inc. 2014
Spark Motivation
• MapReduce – involves lots of disk I/O, and disk I/O is very slow
• Spark – keeps more data in memory between steps
[Diagram: input flowing through chained MapReduce jobs via disk vs. Spark holding intermediate results in memory]
13. Page13 © Hortonworks Inc. 2014
What is Hadoop?
Apache Hadoop is an open-source software framework
written in Java for distributed storage and distributed
processing of very large data sets on computer clusters built
from commodity hardware.
The core of Apache Hadoop consists of a storage part, the
Hadoop Distributed File System (HDFS), and a processing
part, MapReduce.
15. Page15 © Hortonworks Inc. 2014
Interacting with Spark
• Spark’s interactive REPL shell (in Python or Scala)
• Web-based Notebooks:
• Zeppelin: A web-based notebook that enables interactive data
analytics.
• Jupyter: Evolved from the IPython Project
• SparkNotebook: forked from the scala-notebook
• RStudio: for SparkR; Zeppelin support coming soon
https://community.hortonworks.com/articles/25558/running-sparkr-in-rstudio-using-hdp-24.html
16. Page16 © Hortonworks Inc. 2014
Apache Zeppelin
• A web-based notebook that enables interactive data
analytics.
• Multiple language backend
• A multi-purpose notebook, the place for all your needs:
- Data ingestion
- Data discovery
- Data analytics
- Data visualization
- Collaboration
17. Page17 © Hortonworks Inc. 2014
Zeppelin – Multiple Language Backends
Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell, as sketched below.
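For example, consecutive paragraphs in a single note can switch interpreters with a % directive (the paragraph contents here are illustrative; the path and table are hypothetical):
%md ## Word count demo
%spark
val lines = sc.textFile("/tmp/logs.txt")
lines.count()
%sql
select name, age from people limit 10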
18. Page18 © Hortonworks Inc. 2014
Zeppelin – Dependency Management
• Load libraries recursively from Maven repository
• Load libraries from local filesystem
%dep
// add a maven repository
z.addRepo("RepoName").url("RepoURL")
// add an artifact from the local filesystem
z.load("/path/to.jar")
// add an artifact from a maven repository, excluding its dependencies
z.load("groupId:artifactId:version").excludeAll()
19. Page19 © Hortonworks Inc. 2014 19
Community Plugins
• 100+ connectors
http://spark-packages.org/
21. Page21 © Hortonworks Inc. 2014
How Does Spark Work?
• RDD
• Your data is loaded in parallel into structured collections
• Actions
• Manipulate the state of the working model by forming new RDDs
and performing calculations upon them
• Persistence
• Long-term storage of an RDD’s state
22. Page22 © Hortonworks Inc. 2014
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel
• You construct RDDs (see the sketches below):
» by parallelizing existing collections (lists)
» by transforming an existing RDD
» from files in HDFS or any other storage system
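Minimal sketches of the three construction routes (the HDFS path is hypothetical):
val fromList = sc.parallelize(List(1, 2, 3, 4))     // from an existing collection
val doubled = fromList.map(_ * 2)                   // by transforming an existing RDD
val fromFile = sc.textFile("hdfs:///tmp/data.txt")  // from HDFS or another storage system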
23. Page23 © Hortonworks Inc. 2014
RDDs
• The programmer specifies the number of partitions for an RDD (a default value is used if unspecified)
• More partitions = more parallelism
[Diagram: an RDD of 25 items split into 5 partitions, spread across three workers, each running a Spark executor]
24. Page24 © Hortonworks Inc. 2014
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• Persist (cache) RDDs in memory or on disk (see the sketch below)
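A small sketch of this lazy-then-eager flow (the path is hypothetical):
val lines = sc.textFile("hdfs:///tmp/app.log")   // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // still lazy: only the lineage grows
errors.cache()                                   // ask Spark to keep the result in memory
val n = errors.count()                           // action: the whole chain executes now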
25. Page25 © Hortonworks Inc. 2014
Example RDD Transformations
•map(func)
•filter(func)
•distinct()
• Each creates a new dataset from an existing one
• The dataset is not created until an action is performed (lazy)
• Each element in an RDD is passed to the target function and the
result forms a new RDD (see the sketch below)
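A quick sketch chaining all three (nothing runs until the final action):
val words = sc.parallelize(Seq("a", "b", "a", "c"))
val upper = words.map(_.toUpperCase)   // map: transform each element
val noA = upper.filter(_ != "A")       // filter: keep matching elements
val uniq = noA.distinct()              // distinct: drop duplicates
uniq.collect()                         // the action that triggers the chain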
26. Page26 © Hortonworks Inc. 2014
Example Action Operations
•count()
•reduce(func)
•collect()
•take(n)
• Each action either:
• returns a value to the driver program, or
• exports state to an external system
(see the sketch below)
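A quick sketch of each action on a small RDD:
val nums = sc.parallelize(1 to 10)
nums.count()        // 10, returned to the driver
nums.reduce(_ + _)  // 55
nums.collect()      // all elements as a local Array
nums.take(3)        // the first 3 elements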
27. Page27 © Hortonworks Inc. 2014
Example Persistence Operations
•persist() -- takes a StorageLevel option
•cache() -- only one option: in-memory
• Stores RDD values (see the sketch below):
• in memory (what doesn’t fit is recalculated when necessary)
• replication is an option for in-memory storage
• on disk
• or blended
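A minimal sketch of choosing a storage level (path hypothetical); note that a level cannot be changed once assigned:
import org.apache.spark.storage.StorageLevel
val logs = sc.textFile("hdfs:///tmp/app.log")
logs.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk whatever does not fit in memory
// alternatives: StorageLevel.MEMORY_ONLY (same as cache()),
//               StorageLevel.MEMORY_ONLY_2 (in-memory, replicated on two nodes)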
28. Page28 © Hortonworks Inc. 2014
Spark Applications
Are a definition in code of
• RDD creation
• Actions
• Persistence
Results in the creation of a DAG (Directed Acyclic Graph) [workflow]
• Each DAG is compiled into stages
• Each Stage is executed as a series of Tasks
• Each Task operates in parallel on assigned partitions
29. Page29 © Hortonworks Inc. 2014
Spark Context
• A Spark program first creates a SparkContext object
• The SparkContext tells Spark how and where to access a cluster
• Use the SparkContext to create RDDs
• SparkContext, SQLContext, ZeppelinContext:
• are automatically created and exposed as the variables 'sc', 'sqlContext' and
'z', respectively, in both the Scala and Python environments under Zeppelin
• iPython and standalone programs must use a constructor to create a new SparkContext (see the sketch below)
Note: the Scala and Python environments share the same SparkContext, SQLContext and
ZeppelinContext instances.
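A minimal sketch of the standalone case (the app name and master value are assumptions that depend on your deployment):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
val sc = new SparkContext(conf)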
30. Page30 © Hortonworks Inc. 2014
1. Resilient Distributed Dataset [RDD] Graph
val v = sc.textFile("hdfs://…some-hdfs-data")  // RDD[String]
v.flatMap(line => line.split(" "))  // RDD[String]
 .map(word => (word, 1))            // RDD[(String, Int)]
 .reduceByKey(_ + _, 3)             // RDD[(String, Int)]
 .collect()                         // Array[(String, Int)] on the driver
[Diagram: textFile → flatMap → map → reduceByKey → collect]
31. Page31 © Hortonworks Inc. 2014
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
32. Page32 © Hortonworks Inc. 2014
Looking at the State in the Machine
//run debug command to inspect RDD:
scala> fltr.toDebugString
//simplified output:
res1: String =
FilteredRDD[2] at filter at <console>:14
MappedRDD[1] at textFile at <console>:12
HadoopRDD[0] at textFile at <console>:12
33. Page33 © Hortonworks Inc. 2014
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions as can
be seen in the code:
flatMap( line => line.split(" ") )
In this expression, line (before the =>) is the argument to the function, and line.split(" ") (after the =>) is the body of the function.
34. Page34 © Hortonworks Inc. 2014
Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") )             // argument with its type inferred
flatMap( (line: String) => line.split(" ") )   // argument with an explicit type
flatMap( _.split(" ") )                        // no argument declared; placeholder instead
In the placeholder form, the body includes the placeholder _, which allows exactly one use of
one argument for each _ present. _ essentially means ‘whatever you pass me’.
35. Page35 © Hortonworks Inc. 2014
And Finally – the Formal ‘def’
def myFunc(line: String): Array[String] = {
return line.split(",")
}
//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)
Here line: String is the argument to the function, Array[String] is the return type, and line.split(",") is the body.
38. Page38 © Hortonworks Inc. 2014
What are DataFrames?
• A distributed collection of data organized into columns
• Equivalent to tables in databases or data frames in R/Python
• Much richer optimization than any other DataFrame implementation
• Can be constructed from a wide variety of sources and APIs
Why DataFrames?
• Greater accessibility
• Declarative rather than imperative
• Catalyst optimizer
39. Page39 © Hortonworks Inc. 2014
Working with a DataFrame
val df = sqlContext.jsonFile("/tmp/people.json")
df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name", "Age").show()
df.filter(df("age") > 40).show()
df.groupBy("age").count().show()
40. Page40 © Hortonworks Inc. 2014
Querying RDD Using SQL
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// schemaString holds the space-separated column names (assumed here to be "name age"):
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName,
StringType, true)))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sc.textFile("/tmp/people.txt")
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
41. Page41 © Hortonworks Inc. 2014
Querying RDD Using SQL
// SQL statements can be run directly against registered tables
// (sqlC is the SQLContext from the previous slide)
val teenagers = sqlC.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support
// normal RDD operations:
val nameList = teenagers.map(t => "Name: " + t(0)).collect()
// Language-integrated queries (a la LINQ)
val teens = people.where('age >= 13).where('age <= 19).select('name)
42. Page42 © Hortonworks Inc. 2014
Dataframes for Apache Spark
[Chart: time to aggregate 10 million integer pairs (in seconds), comparing DataFrame SQL, DataFrame R, DataFrame Python and DataFrame Scala against RDD Python and RDD Scala]
DataFrames can be significantly faster than RDDs. And they
perform the same, regardless of language.
43. Page43 © Hortonworks Inc. 2014
Dataframes – Transformations & Actions
Transformations (contribute to the query plan): filter, select, drop, join
Actions (trigger execution): count, collect, show, take
Nothing is executed until an action is called; a sketch follows.
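A small sketch against the df loaded from people.json earlier (the column names here are assumptions about that file):
val adults = df.filter(df("age") > 18)  // transformation: extends the query plan
               .select("name", "age")   // transformation: still nothing executed
adults.show(5)                          // action: the optimized plan runs now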
44. Page44 © Hortonworks Inc. 2014
LAB: DataFrames
http://sandbox.hortonworks.com:8081/#/notebook/2B4B7EWY7
http://sandbox.hortonworks.com:8081/#/notebook/2B5RMG4AM
DataFrames + SQL
DataFrames JSON
45. Page45 © Hortonworks Inc. 2014
DataFrames and JDBC
val jdbc_attendees = sqlContext.load("jdbc", Map("url" ->
"jdbc:mysql://localhost:3306/db1?user=root&password=xxx","dbtable" -> "attendees"))
jdbc_attendees.show()
jdbc_attendees.count()
jdbc_attendees.registerTempTable("jdbc_attendees")
val countall = sqlContext.sql("select count(*) from jdbc_attendees")
countall.map(t=>"Records count is "+t(0)).collect().foreach(println)
46. Page46 © Hortonworks Inc. 2014
Code ‘select count’
Equivalent SQL Statement:
SELECT count(*) FROM pagecounts WHERE state = 'fl'
Scala statement:
val file = sc.textFile("hdfs://…/log.txt")
val numFL = file.filter(line =>
line.contains("fl")).count()
scala> println(numFL)
1. Load the file as an RDD
2. Filter the lines, eliminating any that do not contain "fl"
3. Count the lines that remain
4. Print the value of the counted lines containing "fl"
48. Page48 © Hortonworks Inc. 2014 48
Platform APIs
• Joining data from different sources
• Access data using DataFrames / SQL
49. Page49 © Hortonworks Inc. 2014
LAB: JDBC and 3rd party packages
http://sandbox.hortonworks.com:8081/#/notebook/2B2P8RE82
50. Page50 © Hortonworks Inc. 2014
What About Integration With Hive?
scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveCTX.sql("SHOW TABLES").collect().foreach(println)
…
[omniture]
[omniturelogs]
[orc_table]
[raw_products]
[raw_users]
…
51. Page51 © Hortonworks Inc. 2014
More Integration With Hive:
scala> hCTX.hql("DESCRIBE raw_users").collect().foreach(println)
[swid,string,null]
[birth_date,string,null]
[gender_cd,string,null]
scala> hCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT
5").collect().foreach(println)
[0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F]
[00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F]
[00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F]
[000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F]
[00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
52. Page52 © Hortonworks Inc. 2014
LAB: HIVE ORC
http://sandbox.hortonworks.com:8081/#/notebook/2B6KUW16Z
56. Page56 © Hortonworks Inc. 2014
Spark Streaming 101
• Spark has significant library support for streaming applications
val ssc = new StreamingContext(sc, Seconds(5))
val tweetStream = TwitterUtils.createStream(ssc, Some(auth))
• Lets you combine streaming with batch/ETL, SQL & ML
• Read data from HDFS, Flume, Kafka, Twitter, ZeroMQ and custom sources
• Chops the input data stream into batches
• Spark processes each batch and publishes results in batches
• The fundamental unit is the Discretized Stream (DStream) – see the sketch below
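A minimal DStream sketch: word count over 5-second batches from a socket source (the host and port are hypothetical):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()          // emit each batch's counts
ssc.start()
ssc.awaitTermination()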
58. Page58 © Hortonworks Inc. 2014
Spark MLlib – Algorithms Offered
• Classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
• Regression: generalized linear models (GLMs), regression tree
• Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
• Clustering: k-means (see the sketch after this list)
• Decomposition: SVD, PCA
• Optimization: stochastic gradient descent, L-BFGS
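As a hedged k-means sketch (the input path and its format of space-separated doubles are assumptions):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("/tmp/kmeans_data.txt")
val parsed = data.map(line => Vectors.dense(line.split(" ").map(_.toDouble))).cache()
val model = KMeans.train(parsed, 2, 20)  // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)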
59. Page59 © Hortonworks Inc. 2014 59
ML - Pipelines
• New algorithms: KMeans [SPARK-7879], Naive Bayes [SPARK-8600], Bisecting KMeans [SPARK-6517], Multi-layer Perceptron (ANN) [SPARK-2352], Weighting for Linear Models [SPARK-7685]
• New transformers (close to parity with scikit-learn): CountVectorizer [SPARK-8703], PCA [SPARK-8664], DCT [SPARK-8471], N-Grams [SPARK-8455]
• Calling into single-machine solvers (coming soon as a package)
60. Page60 © Hortonworks Inc. 2014
Twitter Language Classifier
Goal: connect to the real-time Twitter stream and print only
those tweets whose language matches our chosen language.
Main issue: how do we detect the language at run time?
Solution: build a language classifier model offline that can detect
the language of a tweet (MLlib), then apply it to the real-time
Twitter stream and filter (Spark Streaming). A sketch of that shape follows.
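A hedged sketch of that shape (featurize, the trained model and englishClusterId are hypothetical pieces built in the offline step; ssc and auth are as on the earlier streaming slide):
import org.apache.spark.streaming.twitter.TwitterUtils
val tweets = TwitterUtils.createStream(ssc, Some(auth))
val english = tweets.map(_.getText)   // tweet text from each status
                    .filter(text => model.predict(featurize(text)) == englishClusterId)
english.print()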
62. Page62 © Hortonworks Inc. 2014
Learn More Spark + Hadoop Perfect Together
HDP Spark General Info:
http://hortonworks.com/hadoop/spark/
Learn more about our Focus on Spark:
http://hortonworks.com/hadoop/spark/#section_6
Get the HDP Spark 1.5.1 Tech Preview:
http://hortonworks.com/hadoop/spark/#section_5
Get started with Spark and Zeppelin and download the Sandbox:
http://hortonworks.com/sandbox
Try these tutorials:
http://hortonworks.com/hadoop/spark/#tutorials
http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
Learn more about GeoSpatial Spark processing with Magellan:
http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/