Hadoop	Crash	Course
Rafael	Coss
Community	Team	
Developer	Advocate
@racoss
rafael@hortonworks.com
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Evolve with Data Trends by Using Open Source
→ Keep up with rapidly evolving Data Trends
→ Differentiate your biz with Insights from Data
→ Speed up your adaptation to constant change by using open source
→ A Modern Data Architecture is a Journey
Agenda
Hadoop	Use	Cases
Traditional	Data	Architectures
What’s	Apache	Hadoop?
Data	Access	with	Hadoop
Lab	Intro
[Figure: use-case map. Axis labels: INNOVATE, RENOVATE; EXPLORE, OPTIMIZE, TRANSFORM]
Categories: Active Archive, ETL Onboard, Data Enrichment, Data Discovery, Single View, Predictive Analytics
Use cases: Payment Tracking, Due Diligence, Social Mapping, Product Design, M & A, Call Analysis, Machine Data, Defect Detecting, Factory Yields, Customer Support, Basket Analysis, Segments, Customer Retention, Sentiment Analysis, Optimize Inventories, Supply Chain, Cross-Sell, Vendor Scorecards, Ad Placement, OPEX Reduction, Historical Records, Mainframe Offloads, Device Data Ingest, Rapid Reporting, Digital Protection, Data as a Service, Fraud Prevention, Public Data Capture, Cyber Security, Disaster Mitigation, Investment Planning, Risk Modeling, Proactive Repair, Inventory Predictions, Next Product Recs
Customers are building Modern Data
Applications to transform their industries,
renovating their IT architectures and
innovating their Data in Motion and Data at
Rest platforms to power actionable
intelligence.
Must	do	to	make	modern	data	applications	possible
Powerful	means	to	
optimize current	business
Pathway	to	transform for	
strategic	advantage	and	
new	revenue	streams
Future	of	Data
INTERNET OF ANYTHING
The Future of Data is about actionable intelligence derived from all your data coming from the Internet of Anything
What are all the dimensions of all your data?
Structured ---- Unstructured
At Rest ---- In-motion
KPI ---- Data Exhaust
Core ---- Jagged Edge
Within Your Firewall ---- External Data
On-prem ---- Cloud
The Future of Data
Connected Data Architecture across your Data Plane is powering Actionable Intelligence
[Figure: the Internet of Anything feeds DATA IN MOTION and DATA AT REST (storage, groups 1 to 4), which yield Actionable Intelligence]
• Any and all data from sensors, machines, geolocation, clicks, files, social.
• Secure point-to-point and bi-directional data flows
• Collect and curate all data.
New Data Paradigm Opens Up New Opportunity
Data volume: 2.8 zettabytes in 2012; 44 zettabytes in 2020
1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research
Data sources (labelled NEW and TRADITIONAL in the figure): Clickstream; ERP, CRM, SCM; Web & social; Geolocation; Internet of Things; Server logs; Files, emails
Opportunity: Transform every industry via full fidelity of data and analytics
[Figure labels: LEADERS, LAGGARDS, Ability to Consume Data, Enterprise Blind Spot]
What	disrupted	the	data	center?
?
Data?
Traditional	Architecture	and	its	gaps
Observation
Interaction
Intelligence
Systems of Intelligence
[Figure labels: Systems of Engagements, Systems of Interactions, Data Systems, Systems of Record, Systems of Insight, Events, Actionable Intelligence, Operators, Developers, Products, Analytics; legend entries "In Gray" and "In Green"]
Modern Data Applications Data Scope
Traditional Analytics | Next Generation Analytics
Structured & Repeatable | Iterative & Exploratory
Structure built to store data | Data is the structure
Capacity constrained down sampling of available information | Whole population analytics connects the dots
Carefully cleanse all information before any analysis | Analyze information as is & cleanse as needed
[Figure: traditional data architecture. Labels: POS, CRM, ERP, ECM, Sales, RDBMS, NoSQL, Unstructured, File Server, App Server, Message Bus, Documents, Clickstream Logs, Web & Social Logs, Audio, Video, Logs, Geolocation, JSON, Filter, ETL, EDW, Data Marts, Archive, Statistics, OLAP, Business Analytics, Visualization & Dashboards]
Traditional Data Architecture Challenges with Big Data
→ Too expensive and slow as data growth keeps accelerating
→ Too slow to get the data prepared for analytics
→ Analytics is only leveraging a limited data set
→ Cold data becomes archived and is no longer usable for analytics
→ Data ingest is rigid and slow for new IoAT data types
→ Limited real time insights
What’s	Apache	Hadoop?
Hadoop Architecture
[Figure: layered stack]
Applications
Data Access Engines: Batch, Interactive, Real-time
Data Operating System: Resource Management, Data Locality
Distributed Compute Framework
Distributed Reliable Storage
Governance & Integration
Security
Deploy Anywhere
Hadoop Ecosystem
[Figure: project logos, each with a one-line role; the project names were images and did not survive extraction]
Roles listed: Distributed Storage & Processing Framework; ETL; RDBMS Import/Export; Secure NoSQL DB; SQL on HBase; NoSQL DB; Workflow Management; SQL; Streaming Data Ingestion; Cluster System Operations; Secure Gateway; Distributed Registry; ETL; Search & Indexing; Even Faster Data Processing; Data Management; Machine Learning
Open Enterprise Hadoop Capabilities
Hortonworks Data Platform
GOVERNANCE & INTEGRATION
  Data Lifecycle & Governance: Falcon, Atlas
  Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS (on YARN: Data Operating System)
  Batch: MapReduce
  Script: Pig
  Search: Solr
  SQL: Hive
  NoSQL: HBase, Accumulo, Phoenix
  Stream: Storm
  In-memory: Spark
  Others: ISV Engines
  (Tez and Slider appear as supporting layers beneath the engines)
DATA MANAGEMENT
  HDFS: Hadoop Distributed File System, nodes 1 ... N
SECURITY
  Administration, Authentication, Authorization, Auditing, Data Protection
  Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
  Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
  Scheduling: Oozie
Deployment Choice: Linux, Windows, On-Premise, Cloud
What is Apache Hadoop?
Hadoop Core: HDFS, YARN, MapReduce (the DATA MGMT layer of the HORTONWORKS DATA PLATFORM)
[Timeline figure: Ongoing Innovation in Apache]
Milestone labels: Google publishes GFS & MapReduce papers (2004-2005); Yahoo!; Yahoo! starts to focus on multiple Hadoop apps & clusters; Contributes Hadoop to Apache; Hortonworks (Oct 2011); HDP 1.0 (Oct 2012); Apache Hadoop v2 YARN. Year markers also shown: 2006, 2008.
HDP release | Date | Apache Hadoop version
HDP 2.0 | Oct 2013 | 2.2.0
HDP 2.1 | April 2014 | 2.4.0
HDP 2.2 | Dec 2014 | 2.6.0
HDP 2.3 | July 2015 | 2.7.1
HDP 2.4 | March 2016 | 2.7.1
HDP 2.5 | Aug 2016 | 2.7.3
Apache Hadoop = Storage + Compute
Storage: Hadoop Distributed File System (HDFS)
Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)
[Figure: core Hadoop daemons]
Core: NameNode (NN, HDFS), shown with /directory/structure/in/memory.txt; ResourceManager (RM, YARN), resource management + scheduling
Worker Node: DataNode (HDFS) + NodeManager (YARN), providing disk, CPU, memory
Legend: Hadoop daemon, User application
Hadoop Distributed File System (HDFS)
Fault Tolerant Distributed Storage
• Divide	files	into	big	blocks	and	distribute	3	copies	randomly across	the	cluster
• Processing	Data	Locality
• Not	Just	storage	but	computation
[Figure: a logical file is split into blocks 1 to 4; each block is stored three times on different nodes of the cluster]
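The block-and-replica idea on this slide can be sketched in a few lines of Python. This is only an illustration: the 128 MB block size is a common HDFS default, the node names are made up, and the uniformly random placement is a simplification (real HDFS placement also takes racks into account).

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (a common HDFS default)
REPLICATION = 3                 # the "3 copies" from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION, seed=0):
    """Pick `replication` distinct nodes for every block (toy random placement)."""
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication) for block in range(num_blocks)}


# Hypothetical 5-node cluster and a 500 MB logical file.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(500 * 1024 * 1024)
placement = place_replicas(len(blocks), nodes)
for block, holders in placement.items():
    print(f"block {block}: {holders}")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block, and a computation can be scheduled on whichever node already holds the data.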
HDFS Storage Architecture - Before
• DataNode is a single storage
• Storage is uniform: the only storage type is Disk
• Storage types hidden from the file system
[Figure: all disks as single storage]
HDFS Storage Architecture - Now
New Architecture
• DataNode is a collection of storages
• Support different types of storages
  – Disk, SSDs, Memory
Block Storage Policies
  – Describes how to store data blocks in HDFS
[Figure: collection of tiered storage, including Cloud Storage]
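To make "block storage policies" concrete, here is a small Python sketch of how a policy could map the replicas of a block to storage types. The policy names and mappings follow my reading of the HDFS archival storage documentation (HOT, WARM, COLD, ONE_SSD, ALL_SSD, LAZY_PERSIST); the ARCHIVE type does not appear on the slide, fallback rules are ignored, and the helper function is hypothetical.

```python
# Policy name -> (storage type of the first replica, storage type of the others).
# Simplified view; real policies also define fallbacks when a type is unavailable.
POLICIES = {
    "HOT": ("DISK", "DISK"),
    "WARM": ("DISK", "ARCHIVE"),
    "COLD": ("ARCHIVE", "ARCHIVE"),
    "ONE_SSD": ("SSD", "DISK"),
    "ALL_SSD": ("SSD", "SSD"),
    "LAZY_PERSIST": ("RAM_DISK", "DISK"),
}


def replica_storage_types(policy, replication=3):
    """Return the storage type chosen for each replica of one block."""
    first, rest = POLICIES[policy]
    return [first] + [rest] * (replication - 1)


for name in POLICIES:
    print(name, replica_storage_types(name))
```

The point of the design is that the DataNode no longer looks like one uniform disk: the file system can see the tiers and decide, per file or directory, where each replica should live.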
It Looks Like a File System
Batch Processing in Hadoop
MapReduce: Batch Access to Data
Original data access mechanism for Hadoop
• Framework
  Made for developing distributed applications to process vast amounts of data in parallel on large clusters
• Proven
  Reliable interface to Hadoop which works from GB to PB. But it is batch oriented: speed is not its strong point.
• Ecosystem
  Ported to Hadoop 2 to run on YARN. Supports original investments in Hadoop by customers and partner ecosystem.
MapReduce Job Lifecycle
[Figure: Map Phase (a Mapper on DataNode1, DataNode2, DataNode3) → Shuffle/Sort (data is shuffled across the network & sorted) → Reduce Phase (a Reducer on DataNode1, DataNode2, DataNode3)]
Saying that MapReduce is dead is preposterous
- Would limit us to only new workloads
- ALL Hadoop clusters use MapReduce
- Proven at Enterprise Scale
YARN: Data Operating System (Batch, Interactive, Real-Time)
What is MapReduce?
Break a large problem into sub-solutions
Map
• Iterate over a large # of records
• Extract something of interest from
each record
Shuffle
• Sort Intermediate results
Reduce
• Aggregate, summarize, filter or
transform intermediate results
• Generate final output
[Figure: Data → Map Processes (Read & ETL) → Shuffle & Sort → Reduce Processes (Aggregation) → Data]
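The three phases described on this slide can be imitated on a single machine in plain Python. The sketch below is a word count, the customary MapReduce example; the function names and the sample records are invented for illustration, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(records):
    """Map: iterate over the records and emit (word, 1) for every word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group the intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Reduce: aggregate the values of each key into the final output."""
    return {key: sum(values) for key, values in groups}


records = ["big data big clusters", "data at rest", "data in motion"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

On a real cluster the map calls run in parallel next to the data blocks, the shuffle moves intermediate pairs across the network, and each reducer handles a subset of the keys.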
1st Gen Hadoop: Cost Effective Batch at Scale
HADOOP 1.0: Built for Web-Scale Batch Apps
Silos created for distinct use cases
[Figure: separate clusters, each running a single app: BATCH on HDFS, INTERACTIVE, ONLINE]
Hadoop	emerged	as	foundation	of	new	data	architecture
Apache	Hadoop	is	an	open	source	data	platform	for	managing	large	
volumes	of	high	velocity	and	variety	of	data
• Built	by	Yahoo!	to	be	the	heartbeat	of	its	ad	&	search	business
• Donated	to	Apache	Software	Foundation	in	2005	with	rapid	adoption	by	large	
web	properties	&	early	adopter	enterprises
• Incredibly	disruptive	to	current	platform	economics
Traditional Hadoop Advantages
✓ Manages new data paradigm
✓ Handles data at scale
✓ Cost effective
✓ Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
[Figure: Application; Batch Processing: MapReduce; Storage: HDFS]
YARN extends Hadoop into data center leaders
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions
[Figure: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Batch: MapReduce; Script: Pig; Search: Solr; SQL: Hive; NoSQL: HBase, Accumulo, Phoenix; Stream: Storm; In-memory: Spark; Others: ISV Engines; Tez and Slider as supporting layers) on YARN: Data Operating System, on DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 ... N]
What do iOS 6 and Windows 3.1 have in common?
Hadoop Beyond Batch with YARN
A shift from the old to the new…
HADOOP 1: Single Use System (Batch Apps), MapReduce as the Base
  Data Flow: Pig; SQL: Hive; Others
  MapReduce (cluster resource management & data processing): API, Engine, and System
  HDFS (redundant, reliable storage), nodes 1 ... N
HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …), Apache YARN as a Base
  APIs: Data Flow: Pig; SQL: Hive; Other: ISV
  Engine: Batch: MapReduce; Tez
  System: YARN (Data Operating System: resource management, etc.)
  HDFS (redundant, reliable storage), nodes 1 ... N
Hadoop Workload Evolution
A shift from the old to the new…
HADOOP 1: Single Use System (Batch Apps)
  MapReduce on HDFS, nodes 1 ... N
HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
  MR2, Multiple (Script, SQL, NoSQL, …), Others (ISV Engines) on YARN, on HDFS (redundant, reliable storage)
HADOOP.Next: Multi Use Platform (Data & Beyond)
  DATA ACCESS: MR2, Multiple (Script, SQL, NoSQL, …), Others (ISV Engines)
  APPS: Docker (MySQL), Docker (Tomcat), Docker (Other)
  on YARN, on HDFS (redundant, reliable storage)
What’s new in Apache Hadoop 3.0?
Storage Optimization
  HDFS: Erasure codes
Improved Utilization
  YARN: Long Running Services
  YARN: Schedule Enhancements
Additional Workloads
  YARN: Docker & Isolation
Easier to Use
  New User Interface
Refactor Base
  Lots of Trunk content
  JDK8 and newer dependent libraries
Release Timeline
- 3.0.0-alpha1 - Sep/3/2016
- Alpha2 - Jan/25/2017
- Alpha3 - Q2 2017 (Estimated)
- Beta/GA - Q3/Q4 2017 (Estimated)
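Why erasure codes count as a storage optimization can be shown with a little arithmetic. The sketch below compares 3-way replication with a Reed-Solomon layout of 6 data units and 3 parity units; that layout is an assumption chosen for illustration (the slide does not name one), and the function names are hypothetical.

```python
def replication_overhead(copies):
    """Extra raw storage per byte of user data when keeping `copies` full replicas."""
    return float(copies - 1)


def erasure_overhead(data_units, parity_units):
    """Extra raw storage per byte of user data for a (data, parity) erasure code."""
    return parity_units / data_units


def raw_storage(user_bytes, overhead):
    """Total raw bytes needed to store user_bytes at the given overhead."""
    return user_bytes * (1 + overhead)


PB = 10 ** 15
print("3x replication:", raw_storage(PB, replication_overhead(3)) / PB, "PB raw per PB of data")
print("RS(6,3):       ", raw_storage(PB, erasure_overhead(6, 3)) / PB, "PB raw per PB of data")
```

Under these assumptions replication costs 200% extra storage and tolerates the loss of two copies, while the 6+3 layout costs 50% extra and tolerates the loss of any three units, at the price of more CPU and network work when data has to be reconstructed.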
Gartner: What is Hadoop?
→ Common Apache Projects
  – ALL = 7 (6)
  – Except for 1 = 3 (5)
  – Except for 2 = 4 (4)
  • About 14 Common Projects
→ Uncommon Projects
  – Only 1 = 9 (1)
  – Only 2 = 7 (2)
  – Only 3 = 6 (3)
  • About 22 Uncommon Projects
http://blogs.gartner.com/merv-adrian/2015/07/02/now-what-is-hadoop/
ODPi
Hortonworks.com/tutorials
HORTONWORKS DATA PLATFORM: Ongoing Innovation in Apache
[Table: Apache component versions per HDP release. The per-component version cells were fused during extraction and are not reproduced here.]
Column groups: DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY
Components: Hadoop & YARN, Flume, Oozie, Pig, Hive, Tez, Sqoop, Ambari, Slider, Kafka, Knox, Solr, Zookeeper, Spark, Falcon, Ranger, HBase, Atlas, Accumulo, Storm, Phoenix, Zeppelin, Druid
Releases: HDP 2.0 (Oct 2013), HDP 2.1 (April 2014), HDP 2.2 (Dec 2014), HDP 2.3 (Oct 2015), HDP 2.4 (Mar 2016), HDP 2.5 (Aug 2016), HDP 2.6* (1H2017)
* HDP 2.6 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.
** Spark 1.6.3 + Spark 2.1 – HDP 2.6 supports both Spark 1.6.3 and Spark 2.1 as GA
*** Hive 2.1 is GA within HDP 2.6.
Next Generation Data Vendors Investment for the Enterprise
Vertical Integration with YARN and HDFS
  Ensure engines can run reliably and respectfully in a YARN based cluster
Horizontal Integration for Enterprise Services
  Ensure consistent enterprise services are applied across the Hadoop stack
GOVERNANCE: Load data and manage according to policy
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
  Provision, Manage & Monitor: Ambari, Zookeeper
  Scheduling: Oozie
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
  Script: Pig; SQL: Hive; Java, Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines
  (Tez and Slider appear as supporting layers beneath the engines)
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System), nodes 1 ... N
What do distributions do?
→ Define a stack of components
  • Rich and latest set of Apache Projects (open source & open community) without lock in
→ Vertical and Horizontal integration of components
  • Vertical: Best Speed and Scale
  • Horizontal: Open Enterprise Ready
→ Provision and Upgrade stack
  • Robust, Easy and Anywhere
→ Accelerate time to value (ease of use)
  • New Face of Hadoop with UIs from Ambari, Ambari Views, Ranger, Falcon, Atlas
→ Partner Ecosystem
  • Rich and Deep
→ Support
  • Industry’s best, SmartSense and influence community
Hadoop Operations & Tools
How Do You Operate a Hadoop Cluster?
Apache™ Ambari is	a	platform	
to	provision,	manage	and	
monitor	Hadoop	clusters
Ambari Core Features and Extensibility
Ambari provides core services for operations and development, and extension points for both
Personas: Cluster Admin, Developer, Data Architect
How? | Core Features | Extensibility Features
Install & Configure | Install Wizard & Web | Stacks, Blueprints & REST APIs
Operate, Manage & Administer | Web, Operator Views, Metrics & Alerts | Views Framework & REST APIs
Develop | User Views | Views Framework
Optimize & Tune | User Views | Views Framework
New user interface enables fast & easy SQL definition and execution.
Data Worker and DevOps Tooling
Ambari Views
→ A pluggable way to provide a common user experience across multiple user personas.
→ Single point of entry for all users.
→ Open Community
[Figure: AMBARI on HDP, serving System Admin/operators, Data Workers, Application Developers]
New User Views for DevOps
Capacity	Scheduler	View
Browse	and	manage	YARN	queues
Tez View
View	information	related	to	Tez jobs	that	
are	executing	on	the	cluster
New	User	Views	for	Development
Pig	View
Author	and	execute	Pig	Scripts.
Hive	View
Author,	execute	and	debug	Hive	
queries.
Files	View
Browse	HDFS	file	system.
Apache	Zeppelin
• Web-based	notebook	for	data	engineers,	data	
analysts	and	data	scientists
• Brings	interactive	data	ingestion,	data	
exploration,	visualization,	sharing	and	
collaboration	features	to	Hadoop	and	
Spark
• Modern	data	science	studio
• Scala	with	Spark
• Python	with	Spark
• SparkSQL
• Apache	Hive,	and	more.
Hadoop Data Access
Access patterns enabled by YARN
[Figure: Applications (Batch, Interactive, Real-Time) on YARN: Data Operating System, on HDFS (Hadoop Distributed File System), nodes 1 ... N]
Batch: Needs to happen, but with no timeframe limitations
Interactive: Needs to happen at human time
Real-Time: Needs to happen at machine execution time
Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop
• Quickly find value in raw data files
• Proven at petabyte scale
• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy,
Business Objects, etc…
[Figure: SQL queries over Sensor, Mobile, Weblog, and Operational / MPP data]
Hive Architecture
1. User issues SQL query (through the Web UI, JDBC / ODBC, or the CLI) to HiveServer2
2. Hive parses and plans the query (Compiler, Optimizer, Executor), using the Hive MetaStore (MySQL, PostgreSQL, Oracle)
3. Query is converted to a MapReduce, Tez or Spark job and executed on Hadoop with data-local processing
Hive and the Stinger Initiative
Base Optimizations: Generate simplified DAGs; In-memory Hash Joins
Vector Query Engine: Optimized for modern processor architectures
Tez: Express tasks more simply; Eliminate disk writes; Pre-warmed Containers
ORCFile: Column Store; High Compression; Predicate / Filter Pushdowns
YARN: Next-gen Hadoop data processing framework
Query Planner: Intelligent Cost-Based Optimizer
Performance Optimizations: 100x+ faster time to insight; Deeper analytical capabilities
Stinger.next and Sub-Second SQL
Emergence of LLAP brings Sub-Second SQL response times within reach with Hive.
 | STINGER FINAL (DELIVERED) | PROGRESS (DELIVERED) | STINGER NEXT, FUTURE (IN DEVELOPMENT)
DELIVERY | HDP 2.1, version 0.13 | HDP 2.3, version 1.2.1 |
SPEED | BATCH & INTERACTIVE | BATCH & INTERACTIVE | BATCH, INTERACTIVE & SUB-SECOND
SQL | SQL:2003+ | SQL:2011 SUBSET | COMPREHENSIVE SQL:2011 BASED ANALYTICS
UPDATES | READ-ONLY SQL | INSERT/UPDATE/DELETE | COMPLETE ACID SUPPORT INCLUDING MERGE
ENGINES | MR, TEZ | MR, TEZ | MR, TEZ, LLAP
Other roadmap labels: Tiered Data Storage; Stinger.next Phase 3; YARN: Containerized Applications
Apache Hive: Journey to SQL:2011 Analytics
Data Types
  Numeric: FLOAT/DOUBLE, DECIMAL, INT/TINYINT/SMALLINT/BIGINT, BOOLEAN
  String: CHAR / VARCHAR, STRING, BINARY
  Date, Time: DATE, TIMESTAMP, Interval Types
  Complex Types: ARRAY, MAP, STRUCT, UNION
SQL Features
  Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
  Advanced Analytics: OLAP and Windowing Functions; CUBE and Grouping Sets
  Nested Data Analytics: Nested Data Traversal; Lateral Views
  ACID Transactions: INSERT / UPDATE / DELETE
File Formats
  Columnar: ORCFile, Parquet
  Text: CSV, Logfile
  Nested / Complex: Avro, JSON, XML
  Custom Formats
Other Features
  XPath Analytics
Latest Additions…
  Scalable Cross Product; Primary Key / Foreign Key; Non-Equijoin
  Tech Preview: Proc. Extensions (PL/SQL)
Future
  ACID MERGE; Multi Subquery; Comparison to sub-select; INTERSECT and EXCEPT
Legend: Existing; Future; New with Hive 2.0
Apache Hive: Modern Architecture
Storage
  Columnar Storage: ORCFile, Parquet
  Unstructured Data: JSON, CSV, Text, Avro, Weblog, Custom
SQL Engines: Row Engine, Vector Engine
SQL Support: SQL:2011, Optimizer, HCatalog, HiveServer2
Cache: Block Cache (Linux Cache), Vector Cache
Distributed Execution
  Hadoop 1: MapReduce
  Hadoop 2: Tez, Spark
  Persistent Server: LLAP
Legend: Historical, Current, In-development
Why is Tez Important?
Apache Tez is a critical innovation of the Stinger Initiative.
• Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop: Pig, Cascading…
• More efficient processing than MapReduce
  • Reduces operations and complexity of back-end processing
  • Allows Map-Reduce-Reduce pipelines, which save hard disk operations
  • Implements a “service” which is always on, decreasing start times of jobs
  • Allows caching of data in memory
[Figure: applications (SQL: Hive; Scripting: Pig; Dev: Cascading/Scalding) on Tez, on YARN: Data Operating System (Batch, Interactive, Real-Time), on HDFS (Hadoop Distributed File System), nodes 1 ... N]
Apache	Tez
Hive	– MapReduce Hive	– Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Figure: the plan for this query under Hive on MapReduce versus Hive on Tez. Plan steps: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, b); JOIN (a, c); GROUP BY a.state with COUNT(*) and AVG(c.price). The MapReduce plan chains several map (M) / reduce (R) jobs with HDFS writes between them; the Tez plan runs the same steps as one DAG of M and R tasks.]
Tez avoids unneeded writes to HDFS
Tez allows Reducer-only jobs within the DAG
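To see what the query on this slide computes, independent of how Hive executes it, the same statement can be run on a few invented rows. The sketch below uses Python's built-in sqlite3 module purely as a stand-in SQL engine; it is not Hive, and the table contents are made up. Only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
""")
# Hypothetical sample rows.
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [(1, 10, "CA"), (2, 11, "CA"), (3, 10, "NY")])
conn.executemany("INSERT INTO b VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO c VALUES (?, ?)", [(10, 4.0), (11, 6.0)])

# The query from the slide, unchanged.
QUERY = """
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
"""
rows = sorted(conn.execute(QUERY).fetchall())
for state, n, avg_price in rows:
    print(state, n, avg_price)
```

The two joins and the final aggregation are the stages that Hive on MapReduce would run as separate jobs with intermediate HDFS writes, and that Tez can chain inside a single DAG.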
Sub-second Queries in Hive: LLAP (Live Long and Process)
→ Persistent daemons
  – Saves time on process start-up (eliminates container allocation and JVM start-up time)
  – All code is JIT-compiled within a query or two
→ Data caching with an async I/O elevator
  – Hot data cached in memory (columnar aware, so only hot columns are cached)
  – When possible, work is scheduled on a node that has the data cached; if not, the work runs on another node
→ Operators can be executed inside LLAP when it makes sense
  – Large, ETL-style queries usually don’t make sense
  – User code is not run in LLAP, for security
→ Working on an interface to allow other data engines to read securely in parallel
→ Beta in 2.0
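The cache-aware scheduling bullet above ("work scheduled on node with data cached, if not work will be run in other node") can be pictured with a toy placement function. This is a hypothetical sketch of the idea only, not LLAP's scheduler: the node names, the column sets and the tie-breaking rule are all assumptions.

```python
def pick_node(needed_columns, cache_by_node, busy_nodes=()):
    """Prefer the free node that already caches the most needed columns.

    Falls back to any other free node when the best-cached node is busy,
    and returns None when no node is free at all.
    """
    free = [node for node in cache_by_node if node not in busy_nodes]
    if not free:
        return None
    return max(free, key=lambda node: len(cache_by_node[node] & needed_columns))


# Hypothetical cluster state: which hot columns each daemon holds in memory.
cache = {
    "node1": {"price"},
    "node2": {"state", "price"},
    "node3": set(),
}
query_columns = {"state", "price"}

print(pick_node(query_columns, cache))                        # best cache hit
print(pick_node(query_columns, cache, busy_nodes={"node2"}))  # falls back
```

Running the work elsewhere is still correct, it only costs a read from deep storage instead of a memory hit, which is why the slide treats locality as a preference rather than a requirement.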
Hive 2 with LLAP: Architecture Overview
[Figure: SQL queries arrive over ODBC / JDBC at HiveServer2 (query endpoint); query coordinators hand work to LLAP daemons in the YARN cluster, each daemon holding query executors and an in-memory cache; deep storage is HDFS, S3 and other HDFS-compatible filesystems]
Hive With LLAP Execution Options
[Figure: three execution options, Tez Only, LLAP + Tez, and LLAP only, each drawn as a DAG of tasks (labelled M, R, T), with an application master (AM) shown for two of them]
Scripting Data Pipeline & ETL
Apache Pig
• Data flow engine and scripting language (Pig Latin)
• Allows you to transform data and datasets
Advantages over MapReduce
• Reduces time to write jobs
• Community support
• Piggybank has a significant number of UDFs to help adoption
• There are a large number of existing shops using Pig
YARN: Data Operating System (Batch, Interactive, Real-Time)
Why	use	Pig?
• Maybe	we	want	to	join	two	datasets,	from	different	sources,	on	a	
common	value,	and	want	to	filter,	and	sort,	and	get	top	5	sites
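The data flow described above (join two datasets on a common value, filter, sort, keep the top 5 sites) looks like this when written out by hand in plain Python. The datasets, field names and age range are invented for illustration; in Pig Latin each step would be one relational operator such as FILTER, JOIN, GROUP, ORDER and LIMIT.

```python
from collections import Counter

# Two hypothetical datasets from different sources, sharing the user name.
users = [("ann", 23), ("bob", 41), ("cat", 19), ("dan", 22)]            # (name, age)
visits = [("ann", "a.com"), ("ann", "b.com"), ("bob", "a.com"),
          ("cat", "a.com"), ("cat", "c.com"), ("dan", "b.com"),
          ("dan", "a.com")]                                             # (name, site)


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited sites among users in the given age range."""
    selected = {name for name, age in users if min_age <= age <= max_age}  # filter
    joined = [site for name, site in visits if name in selected]           # join on name
    counts = Counter(joined)                                               # group + count
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))        # sort
    return ranked[:n]                                                      # top n


print(top_sites(users, visits))
```

Even this small version mixes the what with the how; a data flow language lets the same pipeline be stated step by step while the engine decides how to parallelise it.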
Apache Spark enthusiasm
Elegant Developer APIs
• DataFrames, Machine Learning, and SQL
Made for Data Science
• All apps need to get predictive at scale and fine granularity
Democratize Machine Learning
• Spark is doing to ML on Hadoop what Hive did for SQL on Hadoop
Community
• Broad developer, customer and partner interest
Realize Value of Data Operating System
• A key tool in the Hadoop toolbox
[Diagram: applications on the Spark Core Engine, with Scala, Java, and Python APIs and the libraries MLlib (machine learning), Spark SQL*, and Spark Streaming*, above resource management and storage]
Apache Spark & Apache Hadoop Perfect Together
General Purpose Data Access Engine for fast, large-scale data processing
Designed for Iterative, In-Memory computations and interactive data mining
Expressive Multi-Language APIs for Java, Scala, Python and R
Built-in Libraries enable data workers to rapidly iterate over data for ETL, Machine Learning, SQL and Stream processing
[Diagram: Scala, Java, Python, and R APIs on the Spark Core Engine with Spark SQL, Spark Streaming, MLlib, and GraphX, running on YARN over HDFS across nodes 1 to N]
Apache Projects Enable Access Patterns
Various open source projects have incubated in order to meet these access pattern needs
Today, they can all run on a single cluster on a single set of data because of YARN
All powered by a broad open community
[Diagram: applications (MapReduce, Pig, Hive, Spark, Solr, HBase, Accumulo, Storm, Kafka) covering batch, interactive, and real-time access, all running on YARN (Data Operating System) over HDFS (Hadoop Distributed File System) across nodes 1 to N]
Evolve with Data Trends by Using Open Source
→ Keep up with rapidly evolving Data Trends
→ Differentiate your biz with Insights from Data
→ Speed up your adaptation to constant change by using open source
→ A Modern Data Architecture is a Journey
Hands-on Lab Overview
HDP 2.5 Sandbox
→ Provides a free, preconfigured HDP
– Runs in a virtual machine, Docker, or Azure
Hortonworks.com/sandbox
→ Easy to use
– Operations
• Ambari
– Dev and DevOps
• Ambari User Views
– Web notebook
• Zeppelin
→ Works with 60+ free tutorials
Hortonworks.com/tutorials
Sandbox Start
→ Import Sandbox image
→ Takes about 4 min
→ Splash screen
– VirtualBox sample splash screen
– Save the host IP
http://localhost:8888
Data Discovery Lab
• Elefante Wine Company has a fleet of over 100 trucks.
• The geolocation data collected from the trucks contains events generated while the truck drivers are driving.
• The company's goal with Hadoop is to mitigate risk:
o Understand correlations between miles driven and events
o Compute the risk factor for each driver based on mileage and events
o Lab Env: Sandbox 2.5
o Lab Doc: http://tinyurl.com/hello-hdp
o Load Data
o Query Data
o Process Data
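As a rough illustration of "compute the risk factor for each driver based on mileage and events", the sketch below counts each driver's non-normal events and relates them to miles driven. The formula (violations per 1,000 miles), the record layout, and the event names are assumptions made for this example; the lab document defines its own calculation, which may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def risk_factors(events: Iterable[Tuple[str, str]],
                 mileage: Dict[str, float]) -> Dict[str, float]:
    """Return violations per 1,000 miles for each driver (assumed metric).

    events  -- (driver_id, event_type) records; "normal" is not a violation
    mileage -- total miles driven per driver_id
    """
    violations: Dict[str, int] = defaultdict(int)
    for driver_id, event_type in events:
        if event_type.strip().lower() != "normal":
            violations[driver_id] += 1

    result: Dict[str, float] = {}
    for driver_id, miles in mileage.items():
        if miles <= 0:
            # No usable mileage: skip rather than divide by zero.
            continue
        result[driver_id] = 1000.0 * violations[driver_id] / miles
    return result
```

A driver with 2 violations over 5,000 miles gets a value of 0.4 under this assumed metric, and a driver with only normal events gets 0.0.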
Elefante Wine Current Challenges
The Company
Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine
in a highly-regulated industry with stringent transportation requirements.
The Situation
Recently, a number of driver violations led to fines and increased insurance rates.
The Challenges
• Rising Operational Costs
• Driver Safety
• Risk Management
• Logistics Optimization
Elefante Wine Company has a large fleet of trucks in the USA
A truck generates millions of events for a given route; an event could be:
• 'Normal' events: starting / stopping of the vehicle
• 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
The company uses an application that monitors truck locations and violations from the truck/driver in real time to calculate risk
Route? Truck? Driver?
Analysts query a broad history to understand if today's violations are part of a larger problem with specific routes, trucks, or drivers
Elefante Wine Risk and Driver Safety Challenges
Trucks outfitted with new sensors generate large volumes of new data:
• Location
• Speed
• Driver Violations
Need to integrate real-time & historical data
Increase safety and reduce liabilities
Anticipate driver violations BEFORE they happen and take precautionary actions
Find predictive correlations in driver behavior over large volumes of real-time data
Difficult to deliver timely insights to the right people and systems to take action
Data Discovery: Uncover new findings
Predictive Analytics: Identify your next best action
Better Understanding of the Past
Better Prediction of the Future
What's our goal?
→ Solution:
– Collect additional data via sensors in trucks to better understand Risk Factors
→ How:
– Quickly store new sensor data in a common repository
– Prepare the data for analysis
– Explore the data
– Calculate Risk
– Generate a report
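The "prepare" and "explore" steps above can be sketched on a small scale in plain Python: parse raw CSV text into typed records, drop malformed rows, and summarize events by type. The column names (`truckid`, `driverid`, `event`, `velocity`) and the sample rows are hypothetical and only stand in for the lab's sensor data; in the lab itself these steps run in Hive, Pig, or Spark on the Sandbox.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def prepare(csv_text: str) -> List[Dict[str, object]]:
    """Parse raw CSV text into typed records, skipping malformed rows."""
    rows: List[Dict[str, object]] = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        try:
            velocity = float(raw["velocity"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric velocity: drop the row
        rows.append({
            "truckid": (raw.get("truckid") or "").strip(),
            "driverid": (raw.get("driverid") or "").strip(),
            "event": (raw.get("event") or "").strip().lower(),
            "velocity": velocity,
        })
    return rows


def explore(rows: List[Dict[str, object]]) -> Counter:
    """Count how often each event type occurs."""
    return Counter(r["event"] for r in rows)


# Hypothetical sample data for illustration only.
SAMPLE = (
    "truckid,driverid,event,velocity\n"
    "T1,D1,normal,55\n"
    "T1,D1,Overspeed,88\n"
    "T2,D2,normal,abc\n"
    "T2,D2,Lane Departure,40\n"
)
```

Here `explore(prepare(SAMPLE))` counts one event each for `normal`, `overspeed`, and `lane departure`; the row with the non-numeric velocity is dropped during preparation.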
Trucking Risk Analysis – Hadoop ELT
[Diagram: geolocation.csv and trucks.csv are LOADed into temporary CSV tables, which SQL converts into the ORC tables Geolocation and Trucks; the temporary tables are created, loaded, and then deleted. SQL derives the ORC tables Truck_mileage, Avg_mileage, DriverMileage, and Events, and a Pig or Spark risk calculation produces the ORC table RiskFactor.]
Calculate	Risk
Getting	Started	Resources
Connected Data Architecture with HDC for AWS Marketplace
CLOUD
Cloud Data Processing (HDC for AWS)
Ideal Use Cases:
• Data Science and Exploration (Spark, Zeppelin)
• ETL and Data Preparation (Hive, Spark)
• Analytics and Reporting (Hive2 w/LLAP, Zeppelin)
Big Data Tutorials
→ Get Started
– hortonworks.com/tutorials
– Apache Hadoop & Ecosystem
• hor.tn/hello-hdp
– Apache Spark
• hor.tn/spark-zep-intro
– Apache NiFi
• hor.tn/nifi-intro
– Use Case
• IoT
• Social Media
developer.hortonworks.com
Hortonworks Nourishes the Community
HORTONWORKS COMMUNITY CONNECTION
HORTONWORKS PARTNERWORKS
https://community.hortonworks.com
Want to continue the technical introduction?
→ Hadoop Summit Crash Courses
– Replays
– Free
→ hadoopsummit.org/san-jose/agenda
– Apache Hadoop
– Apache Spark
– Apache NiFi
– IoT & Streaming
– Data Science
Thank	you!
rafael@hortonworks.com
@racoss
Protecting the Elephant in the Castle…
Kerberos, Wire Encryption
HDFS Encryption
Apache Ranger
Network Segmentation, Firewalls
LDAP/AD
Apache Knox

Contenu connexe

Tendances

Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetThiago Santiago
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceThiago Santiago
 
HPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHortonworks
 
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...DataWorks Summit/Hadoop Summit
 
Real-time Analytics in Financial
Real-time Analytics in FinancialReal-time Analytics in Financial
Real-time Analytics in FinancialYifeng Jiang
 
The Implacable advance of the data
The Implacable advance of the dataThe Implacable advance of the data
The Implacable advance of the dataDataWorks Summit
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash CourseDataWorks Summit
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopHortonworks
 
The Power of your Data Achieved - Next Gen Modernization
The Power of your Data Achieved - Next Gen ModernizationThe Power of your Data Achieved - Next Gen Modernization
The Power of your Data Achieved - Next Gen ModernizationHortonworks
 
Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Hortonworks
 
Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Hortonworks
 
Make Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for YouMake Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for YouHortonworks
 
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic EcosystemsHortonworks
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalHortonworks
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark Summit
 
Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...DataWorks Summit
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysiszafarali1981
 

Tendances (20)

Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and Superset
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data Science
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
HPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare Transformation
 
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
 
Real-time Analytics in Financial
Real-time Analytics in FinancialReal-time Analytics in Financial
Real-time Analytics in Financial
 
The Implacable advance of the data
The Implacable advance of the dataThe Implacable advance of the data
The Implacable advance of the data
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
 
The Power of your Data Achieved - Next Gen Modernization
The Power of your Data Achieved - Next Gen ModernizationThe Power of your Data Achieved - Next Gen Modernization
The Power of your Data Achieved - Next Gen Modernization
 
Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25
 
Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09
 
Make Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for YouMake Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for You
 
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
 
Using Hadoop for Cognitive Analytics
Using Hadoop for Cognitive AnalyticsUsing Hadoop for Cognitive Analytics
Using Hadoop for Cognitive Analytics
 
Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
 

Similaire à Hadoop Crash Course: Introduction to Modern Data Applications

Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Daniel Madrigal
 
Hortonworks - How Hadoop makes the successful Retailer.
Hortonworks - How Hadoop makes the successful Retailer. Hortonworks - How Hadoop makes the successful Retailer.
Hortonworks - How Hadoop makes the successful Retailer. Mats Johansson
 
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureData in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureMats Johansson
 
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataBig Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataMatt Stubbs
 
Hortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks
 
Hortonworks laurie maclachlan
Hortonworks laurie maclachlanHortonworks laurie maclachlan
Hortonworks laurie maclachlanBigDataExpo
 
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...DataWorks Summit
 
IoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJIoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJDaniel Madrigal
 
Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4Hortonworks
 
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1Hortonworks
 
Hortonworks & Bilot Data Driven Transformations with Hadoop
Hortonworks & Bilot Data Driven Transformations with HadoopHortonworks & Bilot Data Driven Transformations with Hadoop
Hortonworks & Bilot Data Driven Transformations with HadoopMats Johansson
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...DataWorks Summit
 
Reinvent Your Data Management Strategy for Successful Digital Transformation
Reinvent Your Data Management Strategy for Successful Digital TransformationReinvent Your Data Management Strategy for Successful Digital Transformation
Reinvent Your Data Management Strategy for Successful Digital TransformationDenodo
 
Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingDataWorks Summit
 
HP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEn
HP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEnHP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEn
HP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEnHP Enterprise Italia
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Hortonworks
 
4. Big data & analytics HP
4. Big data & analytics HP4. Big data & analytics HP
4. Big data & analytics HPMITEF México
 

Similaire à Hadoop Crash Course: Introduction to Modern Data Applications (20)

Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ
 
Hortonworks - How Hadoop makes the successful Retailer.
Hortonworks - How Hadoop makes the successful Retailer. Hortonworks - How Hadoop makes the successful Retailer.
Hortonworks - How Hadoop makes the successful Retailer.
 
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern ArchitectureData in Motion - Data at Rest - Hortonworks a Modern Architecture
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
 
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataBig Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
 
Hortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks and HP Vertica Webinar
Hortonworks and HP Vertica Webinar
 
Hortonworks laurie maclachlan
Hortonworks laurie maclachlanHortonworks laurie maclachlan
Hortonworks laurie maclachlan
 
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
 
Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks
 
IoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJIoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJ
 
Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4Hortonworks Data In Motion Series Part 4
Hortonworks Data In Motion Series Part 4
 
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
 
Hortonworks & Bilot Data Driven Transformations with Hadoop
Hortonworks & Bilot Data Driven Transformations with HadoopHortonworks & Bilot Data Driven Transformations with Hadoop
Hortonworks & Bilot Data Driven Transformations with Hadoop
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
 
Reinvent Your Data Management Strategy for Successful Digital Transformation
Reinvent Your Data Management Strategy for Successful Digital TransformationReinvent Your Data Management Strategy for Successful Digital Transformation
Reinvent Your Data Management Strategy for Successful Digital Transformation
 
Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturing
 
HP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEn
HP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEnHP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEn
HP Software Performance Tour 2014 - Vincere i Big Data con HP HAVEn
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
 
4. Big data & analytics HP
4. Big data & analytics HP4. Big data & analytics HP
4. Big data & analytics HP
 

Plus de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...DataWorks Summit/Hadoop Summit
 

Plus de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
 

Hadoop Crash Course: Introduction to Modern Data Applications