SlideShare une entreprise Scribd logo
1  sur  44
Robert Hryniewicz
Data Evangelist
@RobertH8z
Hands-on Intro to Spark & Zeppelin
Crash Course
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
• Quick Demo
• Spark Overview
• Zeppelin + HDP
• Lab ~ 1hr
• Spark 2.0
• Q/A
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
“Big Data”
 Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
 User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: Paypal, Venmo
Where does “Big Data” come from?
44ZB in 2020
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The “Big Data” Problem
 A single machine cannot process or even store all the data!
Problem
Solution
 Distribute data over large clusters
Difficulty
 How to split work across machines?
 Moving data over network is expensive
 Must consider data & network locality
 How to deal with failures?
 How to deal with slow nodes?
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Background
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
History of Hadoop & Spark
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Access Rates
At least an order of magnitude difference between memory and hard drive / network speed
FAST slower slowest
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Spark?
 Apache Open Source Project
– originally developed at AMPLab (University of California Berkeley)
 Data Processing Engine
– In-memory computation – FAST!
 Elegant Developer-friendly APIs
– Supports: Scala, Python, Java and R
– Single environment for Data Wrangling, Machine Learning (ML), SQL Queries, Streaming Apps
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Ecosystem
Spark Core
Spark SQL Spark Streaming Spark MLlib GraphX
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark Basics
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Context
 Main entry point for Spark functionality
 Represents a connection to a Spark cluster
 Represented as sc in your code
What is it?
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL Overview
 Spark module for structured data processing (e.g. DB tables, JSON files)
 Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrames
 Distributed collection of data organized into named columns
 Conceptually equivalent to a table in relational DB or data frame in R/Python
– rows, columns, and schema
 API available in Scala, Java, Python, and R
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrames
CSVAvro
HIVE
Spark SQL
Text
Col1 Col2 … … ColN
DataFrame
Column
Row
Created from Various Sources
 DataFrames from HIVE:
– Reading and writing HIVE tables
 DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBASE, Avro
Data is described as a DataFrame
with rows, columns and a schema
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SQL Context and Hive Context
 Entry point into all functionality in Spark SQL
 All you need is SparkContext
val sqlContext = SQLContext(sc)
SQLContext
 Superset of functionality provided by basic SQLContext
– Read data from Hive tables
– Access to Hive Functions  UDFs
HiveContext
val hc = HiveContext(sc)
Use when your
data resides in
Hive
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark SQL Examples
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrame Example
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
Reading Data From Table
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 8|
| IAD| TPA| 19|
| IND| BWI| 8|
| IND| BWI| -4|
| IND| BWI| 34|
+------+----+--------+
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrame Example
df.select("Origin", "Dest", "DepDelay”).filter($"DepDelay" > 15).show(5)
Using DataFrame API to Filter Data (show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SQL Example
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5").show
Using SQL to Query and Filter Data (again, show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
 Extension of Spark Core API
 Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
Overview
ZeroMQ
MQTT
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
 Apply transformations over a sliding window of data, e.g. rolling average
Window Operations
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark MLlib
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Where Can We Use Data Science / Machine Learning
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark ML: Spark API for building ML pipelines
Feature
transform
1
Feature
transform
2
Combine
features
Linear
Regression
Input
DataFrame
Input
DataFrame
Output
DataFrame
Pipeline
Pipeline Model
Train
Predict
Export Model
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark GraphX
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
GraphX
 Page Rank
 Topic Modeling (LDA)
 Community Detection
Source: ampcamp.berkeley.edu
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin & HDP Sandbox
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What’s Apache Zeppelin?
Web-based notebook
that enables interactive
data analytics.
You can make beautiful
data-driven, interactive
and collaborative
documents with SQL,
Scala and more
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is a Note/Notebook?
• A web base GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features - Sharing/Collaboration/Reports/Import/Export
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How does Zeppelin work?
Notebook
Author
Collaborators/
Report viewers
Zeppelin
Cluster
Spark | Hive | HBase
Any of 30+ back ends
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Big Data Lifecycle
Collect
ETL /
Process
Analysis
Report
Data
Product
Business user
Customer
Data ScientistData Engineer
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDP Sandbox
What’s included in the Sandbox?
 Zeppelin
 Latest Hortonworks Data Platform (HDP)
– Spark
– YARN  Resource Management
– HDFS  Distributed Storage Layer
– And many more components... YARN
Scala
Java
Python
R
APIs
Spark Core Engine
Spark
SQL
Spark
Streaming
MLlib GraphX
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
°
N
HDFS
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
There’s more to HDP
YARN : Data Operating System
DATA ACCESS SECURITY
GOVERNANCE &
INTEGRATION
OPERATIONS
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
°
N
Data Lifecycle &
Governance
Falcon
Atlas
Administration
Authentication
Authorization
Auditing
Data Protection
Ranger
Knox
Atlas
HDFS EncryptionData Workflow
Sqoop
Flume
Kafka
NFS
WebHDFS
Provisioning,
Managing, &
Monitoring
Ambari
Cloudbreak
Zookeeper
Scheduling
Oozie
Batch
MapReduce
Script
Pig
Search
Solr
SQL
Hive
NoSQL
HBase
Accumulo
Phoenix
Stream
Storm
In-memory Others
ISV Engines
Tez Tez Slider Slider
DATA MANAGEMENT
Hortonworks Data Platform 2.4.x
Deployment ChoiceLinux Windows On-Premise Cloud
HDFS Hadoop Distributed File System
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 2.0
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What’s New in Spark 2.0
 API Unification
– DataFrame  alias for DataSet[Row]
– SparkSession (%spark) replaces SparkContext, SQLContext, and HiveContext
• spark is the new entry point to all Spark features
 Structured Streaming
– DataFrame/DataSet for manipulating stream data
– Real-time incremental processing
– Attempt to unify streaming, interactive, and batch processing
 Performance Improvements
– Tungsten - “bare metal” code generation
– ORC & Parquet file formats
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Community Connection
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Community Engagement
Participate now at: community.hortonworks.com© Hortonworks Inc. 2011 – 2015. All Rights Reserved
7,500+
Registered Users
15,000+
Answers
20,000+
Technical Assets
One Website!
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Lab Preview
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Link to Lab Setup Instructions
http://tinyurl.com/hwx-spark-intro
Robert Hryniewicz
rhryniewicz@hortonworks.com
@RobertH8z
Thanks!

Contenu connexe

Tendances

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsDataWorks Summit
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureDataWorks Summit/Hadoop Summit
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jDataWorks Summit
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...DataWorks Summit/Hadoop Summit
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersDataWorks Summit
 
Benefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business IntelligenceBenefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business IntelligenceDataWorks Summit/Hadoop Summit
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 

Tendances (20)

What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
Creating the Internet of Your Things
Creating the Internet of Your ThingsCreating the Internet of Your Things
Creating the Internet of Your Things
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
#HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course #HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
 
Benefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business IntelligenceBenefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business Intelligence
 
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJIntro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJ
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 

Similaire à Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin

Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJDaniel Madrigal
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit
 
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方HortonworksJapan
 
Apache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop SummitApache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop SummitAldrin Piri
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Hortonworks
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopPOSSCON
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark Summit
 
Spark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's KeynoteSpark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's KeynoteHortonworks
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5Yan Zhou
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseAldrin Piri
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationAbdelkrim Hadjidj
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationHortonworks
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationIsheeta Sanghi
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies DataWorks Summit/Hadoop Summit
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionMilind Pandit
 

Similaire à Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin (20)

Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve Loughran
 
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方
 
Apache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop SummitApache NiFi Crash Course - San Jose Hadoop Summit
Apache NiFi Crash Course - San Jose Hadoop Summit
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
 
Spark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's KeynoteSpark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's Keynote
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 

Plus de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Plus de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Dernier

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Dernier (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin

  • 1. Robert Hryniewicz Data Evangelist @RobertH8z Hands-on Intro to Spark & Zeppelin Crash Course
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda • Quick Demo • Spark Overview • Zeppelin + HDP • Lab ~ 1hr • Spark 2.0 • Q/A
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved “Big Data”  Internet of Anything (IoAT) – Wind Turbines, Oil Rigs, Cars – Weather Stations, Smart Grids – RFID Tags, Beacons, Wearables  User Generated Content (Web & Mobile) – Twitter, Facebook, Snapchat, YouTube – Clickstream, Ads, User Engagement – Payments: Paypal, Venmo Where does “Big Data” come from? 44ZB in 2020
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The “Big Data” Problem  A single machine cannot process or even store all the data! Problem Solution  Distribute data over large clusters Difficulty  How to split work across machines?  Moving data over network is expensive  Must consider data & network locality  How to deal with failures?  How to deal with slow nodes?
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Background
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved History of Hadoop & Spark
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Access Rates At least an order of magnitude difference between memory and hard drive / network speed FAST slower slowest
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What is Spark?  Apache Open Source Project – originally developed at AMPLab (University of California Berkeley)  Data Processing Engine – In-memory computation – FAST!  Elegant Developer-friendly APIs – Supports: Scala, Python, Java and R – Single environment for Data Wrangling, Machine Learning (ML), SQL Queries, Streaming Apps
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Ecosystem Spark Core Spark SQL Spark Streaming Spark MLlib GraphX
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Spark Basics
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Context  Main entry point for Spark functionality  Represents a connection to a Spark cluster  Represented as sc in your code What is it?
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Overview  Spark module for structured data processing (e.g. DB tables, JSON files)  Three ways to manipulate data: – DataFrames API – SQL queries – Datasets API
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DataFrames  Distributed collection of data organized into named columns  Conceptually equivalent to a table in relational DB or data frame in R/Python – rows, columns, and schema  API available in Scala, Java, Python, and R
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DataFrames CSVAvro HIVE Spark SQL Text Col1 Col2 … … ColN DataFrame Column Row Created from Various Sources  DataFrames from HIVE: – Reading and writing HIVE tables  DataFrames from files: – Built-in: JSON, JDBC, ORC, Parquet, HDFS – External plug-in: CSV, HBASE, Avro Data is described as a DataFrame with rows, columns and a schema
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved SQL Context and Hive Context  Entry point into all functionality in Spark SQL  All you need is SparkContext val sqlContext = SQLContext(sc) SQLContext  Superset of functionality provided by basic SQLContext – Read data from Hive tables – Access to Hive Functions  UDFs HiveContext val hc = HiveContext(sc) Use when your data resides in Hive
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Examples
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DataFrame Example val df = sqlContext.table("flightsTbl") df.select("Origin", "Dest", "DepDelay").show(5) Reading Data From Table +------+----+--------+ |Origin|Dest|DepDelay| +------+----+--------+ | IAD| TPA| 8| | IAD| TPA| 19| | IND| BWI| 8| | IND| BWI| -4| | IND| BWI| 34| +------+----+--------+
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DataFrame Example df.select("Origin", "Dest", "DepDelay”).filter($"DepDelay" > 15).show(5) Using DataFrame API to Filter Data (show delays more than 15 min) +------+----+--------+ |Origin|Dest|DepDelay| +------+----+--------+ | IAD| TPA| 19| | IND| BWI| 34| | IND| JAX| 25| | IND| LAS| 67| | IND| MCO| 94| +------+----+--------+
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved SQL Example // Register Temporary Table df.registerTempTable("flights") // Use SQL to Query Dataset sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show Using SQL to Query and Filter Data (again, show delays more than 15 min) +------+----+--------+ |Origin|Dest|DepDelay| +------+----+--------+ | IAD| TPA| 19| | IND| BWI| 34| | IND| JAX| 25| | IND| LAS| 67| | IND| MCO| 94| +------+----+--------+
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming  Extension of Spark Core API  Stream processing of live data streams – Scalable – High-throughput – Fault-tolerant Overview ZeroMQ MQTT
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Streaming  Apply transformations over a sliding window of data, e.g. rolling average Window Operations
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark MLlib
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Where Can We Use Data Science / Machine Learning Healthcare • Predict diagnosis • Prioritize screenings • Reduce re-admittance rates Financial services • Fraud Detection/prevention • Predict underwriting risk • New account risk screens Public Sector • Analyze public sentiment • Optimize resource allocation • Law enforcement & security Retail • Product recommendation • Inventory management • Price optimization Telco/mobile • Predict customer churn • Predict equipment failure • Customer behavior analysis Oil & Gas • Predictive maintenance • Seismic data management • Predict well production levels
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark ML: Spark API for building ML pipelines Feature transform 1 Feature transform 2 Combine features Linear Regression Input DataFrame Input DataFrame Output DataFrame Pipeline Pipeline Model Train Predict Export Model
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark GraphX
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved GraphX  Page Rank  Topic Modeling (LDA)  Community Detection Source: ampcamp.berkeley.edu
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin & HDP Sandbox
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What’s Apache Zeppelin? Web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What is a Note/Notebook? • A web base GUI for small code snippets • Write code snippets in browser • Zeppelin sends code to backend for execution • Zeppelin gets data back from backend • Zeppelin visualizes data • Zeppelin Note = Set of (Paragraphs/Cells) • Other Features - Sharing/Collaboration/Reports/Import/Export
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How does Zeppelin work? Notebook Author Collaborators/ Report viewers Zeppelin Cluster Spark | Hive | HBase Any of 30+ back ends
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Big Data Lifecycle Collect ETL / Process Analysis Report Data Product Business user Customer Data ScientistData Engineer
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDP Sandbox What’s included in the Sandbox?  Zeppelin  Latest Hortonworks Data Platform (HDP) – Spark – YARN  Resource Management – HDFS  Distributed Storage Layer – And many more components... YARN Scala Java Python R APIs Spark Core Engine Spark SQL Spark Streaming MLlib GraphX 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N HDFS
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved There’s more to HDP YARN : Data Operating System DATA ACCESS SECURITY GOVERNANCE & INTEGRATION OPERATIONS 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N Data Lifecycle & Governance Falcon Atlas Administration Authentication Authorization Auditing Data Protection Ranger Knox Atlas HDFS EncryptionData Workflow Sqoop Flume Kafka NFS WebHDFS Provisioning, Managing, & Monitoring Ambari Cloudbreak Zookeeper Scheduling Oozie Batch MapReduce Script Pig Search Solr SQL Hive NoSQL HBase Accumulo Phoenix Stream Storm In-memory Others ISV Engines Tez Tez Slider Slider DATA MANAGEMENT Hortonworks Data Platform 2.4.x Deployment ChoiceLinux Windows On-Premise Cloud HDFS Hadoop Distributed File System
  • 37. 37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark 2.0
  • 38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What’s New in Spark 2.0  API Unification – DataFrame  alias for DataSet[Row] – SparkSession (%spark) replaces SparkContext, SQLContext, and HiveContext • spark is the new entry point to all Spark features  Structured Streaming – DataFrame/DataSet for manipulating stream data – Real-time incremental processing – Attempt to unify streaming, interactive, and batch processing  Performance Improvements – Tungsten - “bare metal” code generation – ORC & Parquet file formats
  • 39. 39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hortonworks Community Connection
  • 40. 40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hortonworks Community Connection Read access for everyone, join to participate and be recognized • Full Q&A Platform (like StackOverflow) • Knowledge Base Articles • Code Samples and Repositories
  • 41. 41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Community Engagement Participate now at: community.hortonworks.com© Hortonworks Inc. 2011 – 2015. All Rights Reserved 7,500+ Registered Users 15,000+ Answers 20,000+ Technical Assets One Website!
  • 42. 42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Lab Preview
  • 43. 43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Link to Lab Setup Instructions http://tinyurl.com/hwx-spark-intro

Notes de l'éditeur

  1. 1ZB = 1000EB All of Wikipedia (text and thumbnails) fits on a 128GB USB stick!
  2. Core modules
  3. Same execution engine for all three Spark SQL interfaces provide more information about both structure and computation being performed than basic Spark RDD API
  4. Richer optimizations (significantly faster than RDDs) Underneath is an RDD
  5. Extra: To create DataFrames from existing RDDs use toDF() function
  6. Spark-ML is a uniform API for building and tuning ML pipelines Built on top of DataFrames
  7. Data exploration, visualization, and discovery Deeply integrated with Spark and Hadoop Pluggable interpreters Multiple languages in one notebook: R, Python, Scala
  8. It’s not an IDE It’s not a BI tool
  9. Blogs: http://hortonworks.com/blog/welcome-apache-spark-2-0/