Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin

Robert Hryniewicz
Data Evangelist
@RobertH8z
Hands-on Intro to Spark & Zeppelin
Crash Course

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
• Quick Demo
• Spark Overview
• Zeppelin + HDP
• Lab ~ 1hr
• Spark 2.0
• Q/A

“Big Data”
Where does “Big Data” come from?
• Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
• User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo
44 ZB in 2020

The “Big Data” Problem
Problem
• A single machine cannot process or even store all the data!
Solution
• Distribute data over large clusters
Difficulty
• How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?

Spark Background

History of Hadoop & Spark

Access Rates
At least an order of magnitude difference between memory and hard drive / network speed
FAST slower slowest

What is Spark?
 Apache Open Source Project
– originally developed at AMPLab (University of California Berkeley)
 Data Processing Engine
– In-memory computation – FAST!
 Elegant Developer-friendly APIs
– Supports: Scala, Python, Java and R
– Single environment for Data Wrangling, Machine Learning (ML), SQL Queries, Streaming Apps

Spark Ecosystem
Spark Core
Spark SQL Spark Streaming Spark MLlib GraphX

Apache Spark Basics

Spark Context
What is it?
• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code
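A minimal sketch of working with sc, assuming a spark-shell session or a Zeppelin %spark paragraph where sc is already defined. The number range is made up for illustration.

```scala
// Assumes spark-shell or a Zeppelin %spark paragraph, where `sc` is predefined.
val numbers = sc.parallelize(1 to 100)   // distribute a local range as an RDD
val evens = numbers.filter(_ % 2 == 0)   // transformation: lazy, nothing runs yet
val total = evens.reduce(_ + _)          // action: triggers the job on the cluster

println(s"Sum of the even numbers: $total")
```

The filter is only recorded until reduce asks for a result, so several transformations can be chained before any work is scheduled.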

Spark SQL

Spark SQL Overview
 Spark module for structured data processing (e.g. DB tables, JSON files)
 Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

DataFrames
 Distributed collection of data organized into named columns
 Conceptually equivalent to a table in relational DB or data frame in R/Python
– rows, columns, and schema
 API available in Scala, Java, Python, and R

DataFrames
Created from Various Sources
[Figure: sources such as CSV, Avro, HIVE, Spark SQL and Text feed a DataFrame made of rows and columns Col1, Col2, …, ColN]
 DataFrames from HIVE:
– Reading and writing HIVE tables
 DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBASE, Avro
Data is described as a DataFrame
with rows, columns and a schema
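As a rough illustration of "rows, columns and a schema", the sketch below builds a small DataFrame from an in-memory collection and prints its schema. It assumes a Spark 1.6 shell or Zeppelin %spark paragraph where sqlContext is predefined. The file paths in the comments are hypothetical placeholders, not locations from the lab.

```scala
// Assumes `sqlContext` is predefined (Spark 1.6 shell or Zeppelin %spark).
import sqlContext.implicits._

// Two illustrative rows; column names follow the flight examples in this deck.
val sampleDf = Seq(
  ("IAD", "TPA", 8),
  ("IND", "BWI", -4)
).toDF("Origin", "Dest", "DepDelay")

sampleDf.printSchema()   // column names and the types Spark inferred
sampleDf.show()          // the rows in tabular form

// Built-in file sources follow the same pattern (paths are hypothetical):
// val fromJson    = sqlContext.read.json("/tmp/flights.json")
// val fromParquet = sqlContext.read.parquet("/tmp/flights.parquet")
```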

SQL Context and Hive Context
SQLContext
• Entry point into all functionality in Spark SQL
• All you need is SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
HiveContext
• Superset of functionality provided by basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
• Use when your data resides in Hive
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

Spark SQL Examples

DataFrame Example
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
Reading Data From Table
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+

DataFrame Example
df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)
Using DataFrame API to Filter Data (show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

SQL Example
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("""SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5""").show
Using SQL to Query and Filter Data (again, show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
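To try both styles without the lab's flightsTbl table, the sketch below recreates a few of the rows shown above in memory and runs the same filter through the DataFrame API and through SQL. It assumes a Spark 1.6 shell or Zeppelin %spark paragraph with sqlContext predefined. The table name flights_sample is made up for this example.

```scala
// Assumes `sqlContext` is predefined (Spark 1.6 shell or Zeppelin %spark).
import sqlContext.implicits._

// Rows copied from the first output table in this deck.
val sample = Seq(
  ("IAD", "TPA", 8),
  ("IAD", "TPA", 19),
  ("IND", "BWI", 8),
  ("IND", "BWI", -4),
  ("IND", "BWI", 34)
).toDF("Origin", "Dest", "DepDelay")

// DataFrame API version of the filter
val viaApi = sample.filter($"DepDelay" > 15)

// SQL version of the same filter, against a temporary table
sample.registerTempTable("flights_sample")
val viaSql = sqlContext.sql(
  "SELECT Origin, Dest, DepDelay FROM flights_sample WHERE DepDelay > 15")

viaApi.show()
viaSql.show()
```

Both queries go through the same Spark SQL engine, so the choice between them is mostly a matter of which reads better in a given notebook.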

Spark Streaming

Spark Streaming
Overview
• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
[Figure: input sources shown include ZeroMQ and MQTT]

Spark Streaming

Spark Streaming
Window Operations
• Apply transformations over a sliding window of data, e.g. rolling average
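A rolling average over a sliding window might be sketched as follows. This assumes sc is predefined and that some process writes one number per line to a socket on localhost:9999 (for example a netcat listener). The host, port and the 10/30-second durations are arbitrary choices for illustration.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes `sc` is predefined and a text source emits one number per line
// on localhost:9999; host and port are placeholders.
val ssc = new StreamingContext(sc, Seconds(10))   // 10-second batches

val readings = ssc.socketTextStream("localhost", 9999)
  .map(_.trim)
  .filter(_.nonEmpty)
  .map(_.toDouble)                                // assumes well-formed numeric lines

// Keep the last 30 seconds of data, re-evaluated every 10 seconds.
val windowed = readings.window(Seconds(30), Seconds(10))

windowed.foreachRDD { rdd =>
  if (!rdd.isEmpty()) println(f"Rolling average: ${rdd.mean()}%.2f")
}

ssc.start()
ssc.awaitTermination()   // blocks until the stream is stopped
```

The window and slide durations must be multiples of the batch interval, which is why all three values here are multiples of 10 seconds.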

Spark MLlib

Where Can We Use Data Science / Machine Learning
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Spark ML: Spark API for building ML pipelines
Train (Pipeline): Input DataFrame → Feature transform 1 → Feature transform 2 → Combine features → Linear Regression
Predict (Pipeline Model): Input DataFrame → Pipeline Model → Output DataFrame
Export Model
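The pipeline shape above could look roughly like this in code. The sketch assumes a Spark 1.6 shell or Zeppelin %spark paragraph with sqlContext predefined. The training rows, column names and the choice of StringIndexer for the two feature transforms are illustrative assumptions, not the lab's actual pipeline.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical training rows: origin, carrier, distance (miles), delay (minutes).
val training = sqlContext.createDataFrame(Seq(
  ("IAD", "WN", 810.0, 8.0),
  ("IAD", "AA", 810.0, 19.0),
  ("IND", "WN", 515.0, 8.0),
  ("IND", "UA", 515.0, -4.0),
  ("IND", "AA", 1591.0, 34.0),
  ("IAD", "UA", 1591.0, 20.0)
)).toDF("origin", "carrier", "distance", "label")

// Feature transforms 1 and 2: turn string columns into numeric indices.
val originIndexer = new StringIndexer().setInputCol("origin").setOutputCol("originIdx")
val carrierIndexer = new StringIndexer().setInputCol("carrier").setOutputCol("carrierIdx")

// Combine features into the single vector column the estimator expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("originIdx", "carrierIdx", "distance"))
  .setOutputCol("features")

val lr = new LinearRegression().setMaxIter(10).setRegParam(0.1)

val pipeline = new Pipeline()
  .setStages(Array(originIndexer, carrierIndexer, assembler, lr))

val model = pipeline.fit(training)            // Train: produces a PipelineModel
val predictions = model.transform(training)   // Predict: adds a "prediction" column
predictions.select("origin", "carrier", "distance", "label", "prediction").show()
```

Fitting the pipeline once yields a single model object that replays every stage on new data, so the same feature preparation is applied at training and prediction time.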

Spark GraphX

GraphX
 Page Rank
 Topic Modeling (LDA)
 Community Detection
Source: ampcamp.berkeley.edu
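As one concrete case, PageRank in GraphX can be sketched in a few lines. This assumes sc is predefined. The four-page link graph and the 0.001 tolerance are made up for illustration.

```scala
import org.apache.spark.graphx.Graph

// Hypothetical link graph: each pair is (source page id, destination page id).
val links = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L), (4L, 1L)))

// Build a graph from the edge pairs; every vertex gets the default attribute 1.
val graph = Graph.fromEdgeTuples(links, 1)

// Iterate until ranks change by less than the tolerance.
val ranks = graph.pageRank(0.001).vertices

ranks.collect().sortBy(-_._2).foreach { case (id, rank) =>
  println(f"page $id%d rank $rank%.3f")
}
```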

Apache Zeppelin & HDP Sandbox

What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features - Sharing/Collaboration/Reports/Import/Export

How does Zeppelin work?
[Figure: a notebook author and collaborators/report viewers work in Zeppelin, which runs code on a cluster (Spark | Hive | HBase, or any of 30+ back ends)]

Big Data Lifecycle
Collect → ETL / Process → Analysis → Report → Data Product
Roles: Data Engineer, Data Scientist, Business user, Customer

HDP Sandbox
What’s included in the Sandbox?
 Zeppelin
 Latest Hortonworks Data Platform (HDP)
– Spark
– YARN: Resource Management
– HDFS: Distributed Storage Layer
– And many more components...
[Figure: Scala, Java, Python and R APIs over the Spark Core Engine (Spark SQL, Spark Streaming, MLlib, GraphX), running on YARN and HDFS across nodes 1 … N]

There’s more to HDP
Hortonworks Data Platform 2.4.x
• GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
• DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory
– Others: ISV Engines
– Tez, Slider
• DATA MANAGEMENT
– YARN: Data Operating System
– HDFS: Hadoop Distributed File System (nodes 1 … N)
• SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection
– Ranger, Knox, Atlas, HDFS Encryption
• OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie
• Deployment Choice: Linux, Windows, On-Premise, Cloud

Spark 2.0

What’s New in Spark 2.0
 API Unification
– DataFrame is an alias for Dataset[Row]
– SparkSession (%spark) replaces SparkContext, SQLContext, and HiveContext
• spark is the new entry point to all Spark features
 Structured Streaming
– DataFrame/Dataset API for manipulating stream data
– Real-time incremental processing
– Attempt to unify streaming, interactive, and batch processing
 Performance Improvements
– Tungsten - “bare metal” code generation
– ORC & Parquet file formats

Hortonworks Community Connection

Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

Lab Preview

Link to Lab Setup Instructions
http://tinyurl.com/hwx-spark-intro

Robert Hryniewicz
rhryniewicz@hortonworks.com
@RobertH8z
Thanks!
