Robert Hryniewicz
Data Advocate
Twitter: @RobH8z
Email: rhryniewicz@hortonworks.com
Apache Spark Crash Course
Hadoop Summit Tokyo 2016
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda
• Background
• Spark Overview
• Zeppelin Overview
• Components of HDP
• Lab ~ 45 min

Data Sources
→ Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo
44 ZB in 2020

The “Big Data” Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?

Spark Background

History of Hadoop & Spark

Access Rates
At least an order of magnitude difference between memory and hard drive / network speed
FAST | slower | slowest

What Is Apache Spark?
→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?
→ Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
→ In-memory computation model – Fast!
– Effective for iterative computations and ML
→ Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)

Spark Ecosystem
Spark Core
Spark SQL | Spark Streaming | Spark MLlib | GraphX

Apache Spark Basics

Spark Context
What is it?
→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code (in Zeppelin)

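Outside Zeppelin, where sc is not created for you, a SparkContext can be built by hand. The following is a minimal sketch, assuming Spark is on the classpath; the object name, the app name, the local[*] master, and the sum-of-squares job are illustrative choices, not part of the original deck.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  // Sums the squares of 1..n on whatever cluster `sc` is connected to.
  def sumOfSquares(sc: SparkContext, n: Int): Long =
    sc.parallelize(1 to n).map(x => x.toLong * x).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: local[*] runs Spark inside this JVM. On a real cluster
    // the master is usually supplied by spark-submit instead.
    val conf = new SparkConf().setAppName("spark-context-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sumOfSquares(sc, 10)) // 1^2 + ... + 10^2 = 385
    } finally {
      sc.stop() // release the connection to the cluster
    }
  }
}
```

In a Zeppelin paragraph the same job would use the existing sc directly, without the SparkConf setup or the call to stop().
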
Spark SQL

Spark SQL Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
[Figure: a DataFrame drawn as a grid of rows and columns Col1, Col2, …, ColN]
Data is described as a DataFrame with rows, columns, and a schema

DataFrames
Created from Various Sources
→ DataFrames from Hive:
– Reading and writing Hive tables
→ DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBase, Avro
[Figure: CSV, Avro, Hive, Spark SQL, Text, and JSON sources feeding a DataFrame of rows and columns Col1, Col2, …, ColN]

SQL Context
SQLContext
→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext
val sqlContext = new SQLContext(sc)
HiveContext
→ Superset of functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions → UDFs
val hc = new HiveContext(sc)
Use when your data resides in Hive

Spark SQL Examples

Setting up DataFrame API
Create a DataFrame
val flightsDF = … ← Create from CSV, JSON, Hive etc.
Example:
val path = "examples/flights.json"
val flightsDF = sqlContext.read.json(path)

Setting up SQL API
Register a Temporary Table
flightsDF.registerTempTable("flights")

Two API Examples: DataFrame and SQL APIs
DataFrame API
flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

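The two queries above can be tried end to end without the lab's JSON file. This is a minimal sketch, assuming Spark with the SQLContext API used in these slides is on the classpath; the Flight case class, the object name, and the sample rows are made up for illustration and are not the lab data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FlightsSketch {
  // Hypothetical record type standing in for one row of flights.json.
  case class Flight(Origin: String, Dest: String, DepDelay: Int)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flights-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables toDF() and the $"column" syntax

    // Made-up sample rows; the lab reads its rows from a JSON file instead.
    val flightsDF = sc.parallelize(Seq(
      Flight("IAD", "TPA", 19),
      Flight("IND", "BWI", 34),
      Flight("IND", "JAX", 25),
      Flight("SFO", "LAX", 3)
    )).toDF()

    // DataFrame API: keep flights delayed by more than 15 minutes.
    flightsDF.select("Origin", "Dest", "DepDelay")
      .filter($"DepDelay" > 15)
      .show(5)

    // SQL API: the same question asked against a registered temporary table.
    flightsDF.registerTempTable("flights")
    sqlContext.sql(
      "SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5"
    ).show()

    sc.stop()
  }
}
```

Both calls to show() should list the same three delayed flights, since the SFO to LAX row falls below the 15-minute threshold.
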
Spark Streaming

What is Stream Processing?
Batch Processing
• Ability to process and analyze data at rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in motion in real time or near real time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics

Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
– Structure built to store data
– Start with hypothesis
– Test against selected data
– Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
– Data is the structure
– Data leads the way
– Explore all data, identify correlations
– Analyze in motion…

Spark Streaming
Overview
→ Extension of Spark Core API
→ Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
[Figure: input sources, including ZeroMQ and MQTT]

Spark Streaming
Discretized Streams (DStreams)
→ High-level abstraction representing a continuous stream of data
→ Internally represented as a sequence of RDDs
→ An operation applied on a DStream translates to operations on the underlying RDDs

Spark Streaming
Example: flatMap operation

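A flatMap on a DStream can be sketched with a queue of RDDs standing in for a live source. This is a minimal sketch, assuming Spark Streaming is on the classpath; the object name, the one-second batch interval, the sample sentences, and the five-second run time are illustrative assumptions.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlatMapStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flatmap-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one micro-batch per second

    // Assumption: a queue of RDDs replaces a real source such as a socket.
    // Each queued RDD becomes the content of one batch.
    val batches = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("to be or", "not to be")),
      ssc.sparkContext.parallelize(Seq("that is the question"))
    )
    val lines = ssc.queueStream(batches)

    // flatMap on the DStream runs flatMap on the RDD behind every batch:
    // each line yields zero or more words.
    val words = lines.flatMap(_.split(" "))
    words.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Each batch prints its own list of words, which shows the DStream operation being applied one underlying RDD at a time.
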
Spark Streaming
Window Operations
→ Apply transformations over a sliding window of data, e.g. rolling average

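The rolling average idea can be seen without a cluster. This is a minimal sketch on a plain Scala collection, not the Spark Streaming API; the object and function names are hypothetical, and windowLength and slide are counted in readings here, whereas Spark Streaming window operations take a window duration and a slide duration.

```scala
object RollingAverageSketch {
  // Average of every `windowLength` consecutive readings,
  // advancing `slide` readings at a time.
  def rollingAverage(readings: Seq[Double], windowLength: Int, slide: Int): Seq[Double] =
    readings
      .sliding(windowLength, slide)
      .filter(_.size == windowLength) // drop a trailing partial window
      .map(window => window.sum / window.size)
      .toList

  def main(args: Array[String]): Unit = {
    val readings = Seq(10.0, 20.0, 30.0, 40.0, 50.0)
    // Windows: (10, 20, 30), (20, 30, 40), (30, 40, 50)
    println(rollingAverage(readings, windowLength = 3, slide = 1)) // List(20.0, 30.0, 40.0)
  }
}
```

A larger slide makes the windows overlap less: with windowLength = 3 and slide = 2 the same readings give only two averages.
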
Spark MLlib

Where Can We Use Machine Learning (Data Science)
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Linear Regression Model Training (one feature)
Training
Result
Coefficients: 2.81, Intercept: 3.05
y = 2.81x + 3.05

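How a single coefficient and intercept come out of training data can be sketched with textbook closed-form least squares. This is a minimal sketch in plain Scala and is not how Spark ML fits its models; the object name and the sample points are hypothetical, and only the line y = 2.81x + 3.05 is taken from the slide.

```scala
object LeastSquaresSketch {
  // Closed-form ordinary least squares for one feature.
  // Returns (coefficient, intercept) of the best-fit line.
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    val covXY = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val coefficient = covXY / varX
    (coefficient, meanY - coefficient * meanX)
  }

  // Prediction with the line reported on the slide: y = 2.81x + 3.05.
  def predictWithSlideModel(x: Double): Double = 2.81 * x + 3.05

  def main(args: Array[String]): Unit = {
    // Hypothetical points that lie exactly on y = 2x + 1.
    val (coefficient, intercept) =
      fit(Seq((0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)))
    println(s"Coefficient: $coefficient, Intercept: $intercept")
    println(predictWithSlideModel(1.0))
  }
}
```

Because the sample points sit exactly on a line, the fit recovers a coefficient of 2 and an intercept of 1; real data such as the scatter plot only gets the closest line.
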
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563

Spark API for building ML pipelines
Pipeline (Train): Input DataFrame → Feature transform 1 → Feature transform 2 → Combine features → Linear Regression
Pipeline Model (Predict): Input DataFrame → Output DataFrame
Export Model

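The train and predict flow can be sketched with the Spark ML Pipeline API. This is a minimal sketch, assuming Spark 2.0 or later with spark-mllib on the classpath; the object name, the column names x1 and x2, and the data are made up, and a single VectorAssembler stands in for the feature steps of the diagram.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up training rows where label = x1 + 2 * x2.
    val training = Seq(
      (1.0, 1.0, 3.0), (2.0, 1.0, 4.0), (3.0, 2.0, 7.0),
      (4.0, 3.0, 10.0), (5.0, 5.0, 15.0)
    ).toDF("x1", "x2", "label")

    // Combine features: pack the input columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
    val regression = new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")

    // Train: fitting the Pipeline returns a PipelineModel.
    val pipeline = new Pipeline().setStages(Array(assembler, regression))
    val model: PipelineModel = pipeline.fit(training)

    // Predict: the fitted model maps an input DataFrame to an output DataFrame.
    val unseen = Seq((6.0, 1.0), (7.0, 4.0)).toDF("x1", "x2")
    model.transform(unseen).select("x1", "x2", "prediction").show()

    spark.stop()
  }
}
```

A fitted PipelineModel can also be saved with model.write.overwrite().save(path), for a path of your choosing, which corresponds to the Export Model step.
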
Spark GraphX

GraphX
→ PageRank
→ Topic Modeling (LDA)
→ Community Detection
Source: ampcamp.berkeley.edu

Apache Zeppelin & HDP Sandbox

What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features - Sharing/Collaboration/Reports/Import/Export

Big Data Lifecycle
Collect → ETL / Process → Analysis → Report → Data Product
[Figure: roles along the lifecycle: Data Engineer, Data Scientist, Business user, Customer]
All in one place in Zeppelin!

How does Zeppelin work?
[Figure: Notebook Author and Collaborators/Report viewers → Zeppelin → Cluster (Spark | Hive | HBase, any of 30+ back ends)]

HDP Sandbox
What’s included in the HDP Sandbox?
→ Zeppelin
→ Spark
→ YARN → Resource Management
→ HDFS → Distributed Storage Layer
→ And many more components: Hive, Solr etc.
[Figure: APIs (Scala, Java, Python, R) on top of the Spark Core Engine with Spark SQL, Spark Streaming, MLlib, and GraphX, running on YARN over HDFS nodes 1 … N]

Access patterns enabled by YARN
Applications: Batch | Interactive | Real-Time
Batch: needs to happen, but with no timeframe limitations
Interactive: needs to happen at human time
Real-Time: needs to happen at machine execution time
[Figure: applications running on YARN (Data Operating System) over HDFS (Hadoop Distributed File System) nodes 1 … N]

Why Apache Spark on YARN?
→ Resource management
– Share Spark workloads with other workloads (Hive, Solr, etc.)
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues
[Figure: a Client with the Spark Driver, a Spark Application Master in a YARN container, and three Spark Executors, each in its own YARN container running two tasks]

Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
[Figure: a logical file split into blocks 1 to 4, with three copies of each block spread across the cluster]

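The block-and-replica idea above can be sketched in plain Scala. This is a toy model, not the HDFS implementation: the object and function names are hypothetical, the 500 MB file and six nodes are made up, and the purely random placement follows the slide, whereas the real default HDFS policy also takes racks into account.

```scala
import scala.util.Random

object BlockPlacementSketch {
  // Number of fixed-size blocks needed to hold a file of the given size.
  def blockCount(fileSizeBytes: Long, blockSizeBytes: Long): Int =
    ((fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes).toInt

  // For each block, pick `replication` distinct nodes at random.
  def placeBlocks(blocks: Int, nodes: Seq[String], replication: Int,
                  rng: Random): Map[Int, Seq[String]] = {
    require(replication <= nodes.size, "need at least as many nodes as replicas")
    (1 to blocks).map(block => block -> rng.shuffle(nodes).take(replication)).toMap
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // Assumption: a 500 MB file and a 128 MB block size give 4 blocks.
    val blocks = blockCount(fileSizeBytes = 500 * mb, blockSizeBytes = 128 * mb)
    val nodes = (1 to 6).map(i => s"node$i")
    placeBlocks(blocks, nodes, replication = 3, new Random(42))
      .toSeq.sortBy(_._1)
      .foreach { case (block, replicas) =>
        println(s"block $block -> ${replicas.mkString(", ")}")
      }
  }
}
```

Losing any single node still leaves at least two copies of every block, which is the fault tolerance the slide refers to, and a task can be scheduled on a node that already holds the block it needs.
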
There’s more to HDP
Hortonworks Data Platform 2.4.x
GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory
– Others: ISV Engines
– Tez, Slider
DATA MANAGEMENT
– YARN: Data Operating System
– HDFS: Hadoop Distributed File System (nodes 1 … N)
SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection
– Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie
Deployment Choice: Linux | Windows | On-Premise | Cloud

Hortonworks Data Cloud

Bringing Multitenancy to Apache Zeppelin

Introducing Livy
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as Spark Ambari Service
[Figure: a Livy Client talks over HTTP to the Livy Server, which talks over HTTP (RPC) to a Spark Interactive Session and a Spark Batch Session, each with its own SparkContext]

Security Across Zeppelin-Livy-Spark
[Figure labels: LDAP, Shiro, Zeppelin, Ispark Group Interpreter, SPNego: Kerberos, Livy APIs, Livy Server, Kerberos, Spark on YARN, Driver]

Reasons to Integrate with Livy
→ Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing
→ Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN

SparkContext Sharing
[Figure: on the Livy Server, Client 1 and Client 2 share Session-1 and Client 3 uses Session-2; Session-1 maps to SparkSession-1 and Session-2 to SparkSession-2, each with its own SparkContext]

Sample Architecture

Managed Dataflow
SOURCES → REGIONAL INFRASTRUCTURE → CORE INFRASTRUCTURE

High-Level Overview
[Figure: IoT Devices → IoT Edge (single node) → NiFi Hub → Data Broker → Data Store (HDFS/S3) and Column DB (HBase/Cassandra) in the Data Center (on prem/cloud), with a Live Dashboard]

What’s new in Spark 2.0

Spark 2.0
→ API Improvements
– SparkSession (spark) – new entry point (replaces SQLContext and HiveContext)
– Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
– Structured Streaming/Continuous Application (concept of an infinite DataFrame)
– Temporary Table → Temporary View
→ Performance Improvements
– Tungsten Phase 2 - Multi stage code gen
– ORC & Parquet file improvements
→ Machine Learning
– ML pipeline (DataFrame-based) is the primary API; the RDD-based MLlib API is in maintenance mode
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
→ SparkSQL
– More SQL support (new ANSI SQL parser, subquery support)

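The API changes listed above can be sketched in a few lines. This is a minimal sketch, assuming Spark 2.0 or later on the classpath; the Flight case class, the object name, and the two sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object Spark2Sketch {
  // Hypothetical record type used only for this example.
  case class Flight(Origin: String, Dest: String, DepDelay: Int)

  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point; no separate SQLContext or HiveContext.
    val spark = SparkSession.builder()
      .appName("spark2-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // A typed Dataset built from made-up rows.
    val flights: Dataset[Flight] =
      Seq(Flight("IAD", "TPA", 19), Flight("SFO", "LAX", 3)).toDS()

    // DataFrame is a type alias for Dataset[Row], so this assignment compiles as-is.
    val asRows: Dataset[Row] = flights.toDF()
    val asDataFrame: DataFrame = asRows

    // Temporary views take the place of temporary tables.
    asDataFrame.createOrReplaceTempView("flights")
    spark.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15").show()

    spark.stop()
  }
}
```

In Zeppelin with a Spark 2.x interpreter, the session is typically already available as spark, so only the lines after the builder call would be needed.
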
What’s the latest at Hortonworks?
→ HDP 2.5
– Batch Processing
→ HDF 2.0
– Streaming Apps
[Figure: Modern Data Applications: DATA AT REST, DATA IN MOTION, ACTIONABLE INTELLIGENCE]

Lab Preview

Lab Setup Instructions
http://tinyurl.com/hwx-spark-intro
Lab Options
- Local Sandbox (8 GB RAM required):
- VirtualBox or VMware
- Amazon AWS Cloud:
- Hortonworks Data Cloud
→ Setup info: http://hortonworks.github.io/hdp-aws/index.html

Hortonworks Community Connection

Community Engagement
Participate now at: community.hortonworks.com
9,500+ Registered Users
21,000+ Answers
32,500+ Technical Assets
One Website!

Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

Robert Hryniewicz
E: rhryniewicz@hortonworks.com
T: @RobH8z
Thanks!

Contenu connexe

Tendances

Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFiIntelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
DataWorks Summit
 

Tendances (19)

Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Scalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and TesseractScalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and Tesseract
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Hadoop and Spark – Perfect Together
Hadoop and Spark – Perfect TogetherHadoop and Spark – Perfect Together
Hadoop and Spark – Perfect Together
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
 
Apache Hadoop Crash Course - HS16SJ
Apache Hadoop Crash Course - HS16SJApache Hadoop Crash Course - HS16SJ
Apache Hadoop Crash Course - HS16SJ
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Webinar Series Part 5 New Features of HDF 5
Webinar Series Part 5 New Features of HDF 5Webinar Series Part 5 New Features of HDF 5
Webinar Series Part 5 New Features of HDF 5
 
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJIntro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJ
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Apache deep learning 101
Apache deep learning 101Apache deep learning 101
Apache deep learning 101
 
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFiIntelligently Collecting Data at the Edge - Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
 

En vedette

En vedette (20)

Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
 
Comparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBaseComparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBase
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
 
Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.
 
Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop
 
The real world use of Big Data to change business
The real world use of Big Data to change businessThe real world use of Big Data to change business
The real world use of Big Data to change business
 
The truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on HadoopThe truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on Hadoop
 
Use case and Live demo : Agile data integration from Legacy system to Hadoop ...
Use case and Live demo : Agile data integration from Legacy system to Hadoop ...Use case and Live demo : Agile data integration from Legacy system to Hadoop ...
Use case and Live demo : Agile data integration from Legacy system to Hadoop ...
 
Rebuilding Web Tracking Infrastructure for Scale
Rebuilding Web Tracking Infrastructure for ScaleRebuilding Web Tracking Infrastructure for Scale
Rebuilding Web Tracking Infrastructure for Scale
 
SEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile gamesSEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile games
 
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
Introduction to Hadoop and Spark (before joining the other talk) and An Overv...
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 

Similaire à #HSTokyo16 Apache Spark Crash Course

Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Daniel Madrigal
 

Similaire à #HSTokyo16 Apache Spark Crash Course (20)

Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
 
Unlocking insights in streaming data
Unlocking insights in streaming dataUnlocking insights in streaming data
Unlocking insights in streaming data
 
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldS2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
Apache Metron in the Real World
Apache Metron in the Real WorldApache Metron in the Real World
Apache Metron in the Real World
 
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureData in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with Druid
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop Summit
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
Hortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptxHortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptx
 

Plus de DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

Plus de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger



#HSTokyo16 Apache Spark Crash Course