Storm Demo Talk - Colorado Springs May 2015
1. Real-Time Processing in Hadoop
Colorado Springs OSS Meet-up
Shane Kumpf & Mac Moore, Solutions Engineers, Hortonworks
May 2015
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
2. Agenda
© Hortonworks Inc. 2012, Professional Services
§ Introduction & about Hortonworks HDP
§ Overview of logistics industry scenario
§ Overview of streaming architecture on HDP
§ Streaming Demo #1
§ Integrating Predictive Analytics in streaming scenarios
§ Streaming Demo with Predictive additions
§ Q & A
3. Preface: Enabling Technologies
• Problems solved at scale, via fundamentally new approaches…
• Make it possible, even simple, to produce new products/applications that would have been too cost-prohibitive, or simply impossible, beforehand.
• Foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras (from smartphones) has enabled electric cars, quad-copters, VR displays, & more…
• Hadoop has similarly led to breakthroughs in big data scale & capability, and enables new real-time advanced analytic applications.
4. Why did Hadoop emerge?
April 2015
5. Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale
[Figure: Business Value, Industry Leaders vs. Laggards. (1) Traditional data: ERP, CRM, SCM. (2) New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs. Data volume: 2.8 zettabytes in 2012, 40 zettabytes in 2020.]
6. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Spring 2015
Hortonworks. We do Hadoop.
7. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Customer Momentum
• 330+ customers (as of year-end 2014)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world-class support
• Founded in 2011
• Original 24 architects, developers, operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners
8. Customer Partnerships matter
Driving our innovation through Apache Software Foundation Projects
Apache Project | Committers | PMC Members
Hadoop | 27 | 21
Pig | 5 | 5
Hive | 18 | 6
Tez | 16 | 15
HBase | 6 | 4
Phoenix | 4 | 4
Accumulo | 2 | 2
Storm | 3 | 2
Slider | 11 | 11
Falcon | 5 | 3
Flume | 1 | 1
Sqoop | 1 | 1
Ambari | 34 | 27
Oozie | 3 | 2
Zookeeper | 2 | 1
Knox | 13 | 3
Ranger | 10 | n/a
TOTAL | 161 | 108
Source: Apache Software Foundation. As of 11/7/2014.
Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF
• Expertise: Uniquely capable to solve the most complex issues & ensure success with latest features
• Connection: Provide customers & partners direct input into the community roadmap
• Partnership: We partner with customers with subscription offering. Our success is predicated on yours.
[Chart values: 27 (label not shown); Cloudera: 11; Facebook: 5; LinkedIn: 2; IBM: 2; Others: 23; Yahoo: 10]
9. Technology Partnerships matter
[Table: partner relationship matrix. Column headers: Apache Project, Hortonworks Relationship, Named Partner, Certified Solution, Resells, Joint Engr. Check marks per partner: Microsoft 4, HP 4, SAS 3, SAP 4, IBM 3, Pivotal 3, Redhat 3, Teradata 4, Informatica 3, Oracle 2.]
It is not just about packaging and certifying software…
Our joint engineering with our partners drives open source standards for Apache Hadoop
HDP is Apache Hadoop
10. HDP delivers a Centralized Architecture
Modern Data Architecture
• Unifies data and processing.
• Enables applications to have access to all your enterprise data through an efficient centralized platform
• Supported with a centralized approach to governance, security and operations
• Versatile to handle any applications and datasets no matter the size or type
[Figure: Modern Data Architecture. Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured. Platform: HDFS (Hadoop Distributed File System) and YARN: Data Operating System, running batch, interactive, real-time and partner ISV workloads. Also shown: MPP, EDW. Analytics: data marts, applications, business analytics, visualization & dashboards.]
11. HDP delivers a completely open data platform
Hortonworks Data Platform 2.2
Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.
Completely Open
• HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations
• All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.
[Figure: HDP 2.2 component stack on HDFS (Hadoop Distributed File System) and YARN: Data Operating System (Cluster Resource Management).
Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka.
Batch, interactive & real-time data access: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm.
Security: Apache Ranger, Apache Knox, Apache Falcon.
Operations: Apache Ambari, Apache Zookeeper, Apache Oozie.]
12. Real World Use Case: Trucking Company
Spring 2015
Hortonworks. We do Hadoop.
13. Scenario Overview
14. Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a given route; an event could be:
§ 'Normal' events: starting / stopping of the vehicle
§ ‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance
Company uses an application that monitors truck locations and violations from the truck/driver in real-time
Route? Truck? Driver?
Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
15. Trucking Company’s YARN-enabled Architecture
[Figure: architecture diagram. Components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time User Interface; SQL Interactive Query (Hive on Tez). Many Workloads: YARN. Distributed Storage: HDFS. One cluster with consistent security, governance & operations.]
16. Trucking Company’s YARN-enabled Architecture
[Figure: architecture diagram repeated from slide 15.]
17. What is Kafka?
APACHE KAFKA
§ High throughput distributed messaging system
§ Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes
§ Kafka @LinkedIn:
– 800 billion messages per day
– 175 terabytes of data written per day
– 650 terabytes of data read per day
– Over 13 million messages/2.75GB of data per second
[Figure: three producers, the Kafka Cluster, three consumers.]
18. Kafka: Anatomy of a Topic
APACHE KAFKA
[Figure: a topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered sequence of numbered messages running from old to new, starting at 0; the partitions have different lengths (last offsets 9, 11 and 12). Writes are appended at the new end.]
§ Partitioning allows topics to scale beyond a single machine/node
§ Topics can also be replicated, for high availability.
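The partition layout above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Kafka's client API: the `Topic` class, its method names, and the CRC32-based key hashing are all hypothetical choices made for illustration (Kafka's own partitioner works differently).

```python
import zlib


class Topic:
    """Toy model of a partitioned topic: one append-only log per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 modulo the partition count, used only because it
        # is deterministic. This is not Kafka's actual partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append at the 'new' end of the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Read everything from an offset onward; reading does not remove messages."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=3)
first = topic.append("truck-42", "start")
second = topic.append("truck-42", "overspeed")
```

Messages with the same key land in the same partition and receive increasing offsets, so ordering is preserved per partition rather than across the whole topic.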
19. Trucking Company’s YARN-enabled Architecture
[Figure: architecture diagram repeated from slide 15.]
20. Apache Storm
• Distributed, real-time, fault-tolerant Stream Processing platform.
• Provides processing guarantees.
• Key concepts include:
– Tuples
– Streams
– Spouts
– Bolts
– Topology
21. Tuples and Streams
• What is a Tuple?
– Fundamental data structure in Storm: a named list of values that can be of any data type.
• What is a Stream?
– An unbounded sequence of tuples.
– The core abstraction in Storm; streams are what you “process” in Storm.
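As a rough illustration of these two definitions, the sketch below models a tuple as a named list of values and a stream as a generator that never ends. It is plain Python rather than Storm's API, and the field names (`driver_id`, `event_type`, `speed`) and sample values are hypothetical.

```python
import itertools
from collections import namedtuple

# A tuple: a named list of values (the field names are hypothetical).
TruckEvent = namedtuple("TruckEvent", ["driver_id", "event_type", "speed"])


def event_stream():
    """An unbounded stream: yields tuples forever by cycling over sample data."""
    samples = [
        TruckEvent(11, "Normal", 55),
        TruckEvent(12, "Overspeed", 82),
    ]
    for event in itertools.cycle(samples):
        yield event


# A consumer processes the stream incrementally; here only the first five tuples.
first_five = list(itertools.islice(event_stream(), 5))
violations = [e for e in first_five if e.event_type != "Normal"]
```

Because the stream has no end, a consumer never waits for "all" the data; it handles each tuple as it arrives.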
22. Spouts
• What is a Spout?
– Generates or is a source of Streams
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed
23. Bolts
• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
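The alerting rule described for the MonitoringBolt can be sketched as follows. This is a minimal stand-alone illustration, not the demo's code: a dictionary stands in for the HBase counters, a list stands in for the email and ActiveMQ alerts, and the class name, threshold value, and event names are hypothetical.

```python
from collections import defaultdict


class MonitoringSketch:
    """Counts violations per driver and records an alert above a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = defaultdict(int)  # stand-in for counters kept in HBase
        self.alerts = []                # stand-in for email / ActiveMQ messages

    def process(self, driver_id, event_type):
        if event_type == "Normal":
            return  # only violation events are counted
        self.counts[driver_id] += 1
        if self.counts[driver_id] > self.threshold:
            self.alerts.append((driver_id, self.counts[driver_id]))


monitor = MonitoringSketch(threshold=2)
events = [
    (7, "Overspeed"),
    (7, "Normal"),
    (7, "Unsafe tail distance"),
    (7, "Overspeed"),
    (8, "Overspeed"),
]
for driver, kind in events:
    monitor.process(driver, kind)
```

Driver 7 triggers an alert on the third violation, while driver 8 stays below the threshold.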
24. Topology
• What is a Topology?
– A network of spouts and bolts wired together into a workflow
[Figure: Truck-Event-Processor Topology. Components connected by streams: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt, WebSocket Bolt.]
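To make "wired together into a workflow" concrete, here is a small in-memory sketch of spouts and bolts connected by subscriptions. It is deliberately not Storm's builder API; the `Topology` class, its methods, and the bolt names in the usage are hypothetical, and the wiring shown is illustrative rather than the demo's exact graph.

```python
class Topology:
    """Toy topology: spouts emit tuples, bolts subscribe to an upstream component."""

    def __init__(self):
        self.spouts = {}
        self.bolts = {}
        self.subscriptions = {}  # component name -> list of downstream bolt names

    def set_spout(self, name, source):
        self.spouts[name] = source
        self.subscriptions[name] = []

    def set_bolt(self, name, func, upstream):
        self.bolts[name] = func
        self.subscriptions[name] = []
        self.subscriptions[upstream].append(name)

    def run(self):
        for name, source in self.spouts.items():
            for tup in source:
                self._emit(name, tup)

    def _emit(self, component, tup):
        # Deliver the tuple to every subscriber; a bolt that returns a value
        # emits it further downstream, a bolt that returns None is a sink.
        for bolt_name in self.subscriptions[component]:
            out = self.bolts[bolt_name](tup)
            if out is not None:
                self._emit(bolt_name, out)


stored, alerts = [], []
topology = Topology()
topology.set_spout("kafka-spout", [("driver-1", "Normal"), ("driver-2", "Overspeed")])
topology.set_bolt("hdfs-bolt", lambda t: stored.append(t), upstream="kafka-spout")
topology.set_bolt("filter-bolt", lambda t: t if t[1] != "Normal" else None, upstream="kafka-spout")
topology.set_bolt("alert-bolt", lambda t: alerts.append(t), upstream="filter-bolt")
topology.run()
```

One spout can feed several bolts at once, which is how a single event stream can be persisted and monitored in parallel.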
25. Trucking Company’s YARN-enabled Architecture
[Figure: architecture diagram repeated from slide 15.]
26. Key Constructs in Apache HBase
• HBase = Key /Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned Records
– Distributed across a cluster of machines
– Low Latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps
27. Data Assignment
[Figure: HBase Table. Keys within HBase are divided among different RegionServers.]
28. Data Access
• Get
– Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a
matching rowkey
• Put
– Inserts a new version of a cell.
• Scan
– The whole table, row by row, or a section of that table starting at a particular start key and ending
at a particular end key
• Delete
– Actually a form of put: adds a new version of the cell that carries a deletion marker
• SQL via Apache Phoenix
– Unique capability in the NoSQL market
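The four operations above can be sketched with a small in-memory table. This is an illustrative model only, not the HBase client API: the `VersionedTable` class, the integer version counter, the half-open scan range, and the row and column names are assumptions made for the example.

```python
import itertools


class VersionedTable:
    """Toy key/value table: every put adds a version; delete adds a marker version."""

    TOMBSTONE = object()

    def __init__(self):
        self.cells = {}                   # (rowkey, column) -> list of (version, value)
        self._clock = itertools.count(1)  # stand-in for timestamps

    def put(self, rowkey, column, value):
        self.cells.setdefault((rowkey, column), []).append((next(self._clock), value))

    def delete(self, rowkey, column):
        # A delete is a put of a deletion marker, as described above.
        self.put(rowkey, column, self.TOMBSTONE)

    def get(self, rowkey, column):
        versions = self.cells.get((rowkey, column))
        if not versions or versions[-1][1] is self.TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_key, stop_key):
        """Live cells with start_key <= rowkey < stop_key, in key order."""
        result = []
        for rowkey, column in sorted(self.cells):
            if start_key <= rowkey < stop_key:
                value = self.get(rowkey, column)
                if value is not None:
                    result.append((rowkey, column, value))
        return result


table = VersionedTable()
table.put("driver-001", "events:count", 1)
table.put("driver-001", "events:count", 2)
table.put("driver-002", "events:count", 5)
table.put("driver-100", "events:count", 9)
table.delete("driver-002", "events:count")
```

Older versions remain stored after an update or delete; only the newest version decides what a read returns.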
29. Trucking Company’s YARN-enabled Architecture
[Figure: architecture diagram repeated from slide 15.]
30. Hadoop2 & YARN based Architecture
[Figure: evolution of Hadoop.
2006, Hadoop w/ MapReduce: MapReduce (largely batch processing) on HDFS (Hadoop Distributed File System), nodes 1…N. Silo’d clusters; largely batch system; difficult to integrate.
2009, MR-279: YARN.
October 23, 2013, Hadoop 2 & YARN: batch, interactive and real-time workloads on YARN: Data Operating System over HDFS (Hadoop Distributed File System), nodes 1…N.]
Architected & led development of YARN to enable the Modern Data Architecture
31. Benefits of YARN as the Data Operating System
• The container-based model allows for running nearly any workload.
– Enables the centralized architecture.
– MapReduce is no longer the only data processing engine.
– Docker containers managed by YARN. Yes please!
• Decouples resource scheduling from application lifecycle.
– Improved scalability and fault tolerance
• Dynamically allocated resources, resulting in HUGE utilization gains
– Versus static allocation of “slots” in Hadoop 1.0
Yahoo has over 30,000 nodes running YARN across over 365 PB of data.
They calculate running about 400,000 jobs per day for about 10 million hours of compute time.
They also have estimated a 60% – 150% improvement in node usage per day since moving to YARN.
32. Trucking Company’s YARN-enabled Architecture
[Figure: architecture diagram repeated from slide 15.]
33. Apache HDFS – Hadoop Distributed File System
• Very large scale distributed file system
• 10K nodes, tens of millions of files and PBs of data
• Supports large files
• Designed to run on commodity hardware, assumes hardware failures
• Files are replicated to handle hardware failure
• Detects failures and recovers from them automatically
• Optimized for Large Scale Processing
• Data locations are exposed so that the computations can move to where data resides
• Data Coherency
• Write once and read many times access pattern
• Files are broken up into chunks called ‘blocks’
• Blocks are distributed over nodes
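A small sketch may help picture blocks and replication. It assumes a tiny block size, a replication factor of 3, and simple round-robin placement; these values and the function names are illustrative assumptions, and HDFS's real placement policy is more involved.

```python
def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` nodes, round-robin (an assumed policy)."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block keeps at least one replica elsewhere."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(
    len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3
)
```

Because every block lives on several nodes, losing any single node leaves the file readable, and the placement map is what lets computation be scheduled where the data resides.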
34. Streaming Demo - High Level Architecture
[Figure: Truck Streaming Data from trucks T(1), T(2), … T(N); Inbound Messaging (Kafka) with the Truck Events Topic; Storm Stream Processing on YARN with Kafka Spout, HBase Bolt, HDFS Bolt and Monitoring Bolt; HBase Dangerous Events Table; Truck Events on Distributed Storage: HDFS; ActiveMQ; Web App.]
35. Demo – Streaming Dashboard
36. A New Challenge
37. CDO’s vision: Build a Predictive Business, not a Reactive one
CDO’s Requirements
§ Offline predictions
– Identify investments that will increase safety and reduce company’s liabilities
§ Real-time predictions
– Anticipate driver violations before they happen and take precautionary actions
Data Scientist’s Response
§ Need to explore data & form a hypothesis
§ Verify trends against TBs of events data via machine learning
§ Generate predictive models with Spark MLlib on HDP
§ Plug models into the Storm topology to predict driver violations in real-time
♬ I’ve been waiting for this moment all my life ♬
38. Demo – Analyzing Events with Tableau
39. Analyzing Raw Events – dangerous drivers
40. Analyzing Raw Events – dangerous routes
41. Analyzing Raw Events – violations by location
42. Enriching truck events for analysis with Pig
[Figure: Pig pipeline over HDFS with HCatalog (Metadata). Inputs: Raw Truck Events; Raw Weather Data (Weather Data Sets); Payroll Data (HR & Payroll DBs). Steps: Load Raw Truck Events; Clean & Filter (Cleaned Events); Transform (Transformed Events); Join with HR & weather data (Enriched Events); Store Enriched Events. Consumer: Tableau.]
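The clean, transform and join steps in this pipeline can be sketched in plain Python. The demo itself uses Pig; the snippet below only mirrors the shape of the data flow, and every field name and sample record in it is hypothetical.

```python
# Hypothetical sample inputs standing in for the raw truck events,
# the HR & payroll data, and the weather data sets.
raw_events = [
    {"driver_id": 1, "event_type": "Overspeed", "route_id": 10},
    {"driver_id": None, "event_type": "Normal", "route_id": 10},  # incomplete record
    {"driver_id": 2, "event_type": "Normal", "route_id": 20},
]
payroll = {
    1: {"certified": False, "wage_plan": "Miles"},
    2: {"certified": True, "wage_plan": "Hourly"},
}
weather = {10: {"foggy": True}, 20: {"foggy": False}}


def clean(events):
    """Clean & filter: drop records that lack a driver id."""
    return [e for e in events if e["driver_id"] is not None]


def enrich(events, payroll, weather):
    """Join each event with payroll data (by driver) and weather data (by route)."""
    enriched = []
    for event in events:
        row = dict(event)
        row.update(payroll[event["driver_id"]])
        row.update(weather[event["route_id"]])
        enriched.append(row)
    return enriched


enriched_events = enrich(clean(raw_events), payroll, weather)
```

Each enriched row now carries driver and weather attributes alongside the original event, which is what makes the later analysis by certification, wage plan and weather possible.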
43. Analyzing Enriched Events – noncertified and fatigued drivers more dangerous
44. Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers
45. Analyzing Enriched Events – foggy weather leads to violations
46. Analyzing Enriched Events – but top 3 safest routes are also foggy
47. Integrating Predictive Analytics
48. Building the Predictive Model on HDP
1. Explore small subset of events (Tableau) to identify predictive features and make a hypothesis. E.g. hypothesis: “foggy weather causes driver violations”
2. Identify suitable ML algorithms to train a model – we will use classification algorithms as we have labeled events data
3. Transform enriched events data to a format that is friendly to Spark MLlib – many ML libs expect training data in a certain format
4. Train a logistic regression model in Spark on YARN, with above events as training input, and iterate to fine tune the generated model
5. Integrate Spark MLlib model in a Storm bolt to predict violations in real time
49. Integrate Predictive Analytics in Stream Processing
[Figure: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm) with a Prediction Bolt; Interactive Query (Hive on Tez); Real-time Serving (HBase); Machine Learning (Spark); millions of Enriched Truck Events; all on YARN and HDFS.]
• Train Spark ML model with millions of truck events
• Plug Spark model into Storm bolt
50. Streaming Demo - Updated Architecture
[Figure: Truck Streaming Data from trucks T(1), T(2), … T(N); Inbound Messaging (Kafka) with the Truck Events Topic; Storm Stream Processing on YARN with Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt and Prediction Bolt; HBase PayRoll Table; Truck Events on Distributed Storage: HDFS; ActiveMQ; Web App.]
• Enrich Event
• Predict violation in real time & alert via MQ
• Render real-time predictions on UI
51. Transforming training data for Spark MLlib
Enriched Events Data
Event Type | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Longitude | Latitude | Weather Foggy | Weather Rainy | Weather Windy
Normal | Yes | Hourly | 45 | 2721 | -91.3 | 38.14 | No | No | No
Overspeed | No | Miles | 72 | 4152 | -94.23 | 37.09 | Yes | Yes | No
… | … | … | … | … | … | … | … | … | …
Spark MLlib Training Data
Label | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Weather Foggy | Weather Rainy | Weather Windy
0 | 1 | 1 | 0.45 | 0.2721 | 0 | 0 | 0
1 | 0 | 0 | 0.72 | 0.4152 | 1 | 1 | 0
… | … | … | … | … | … | … | …
• Normal events labeled as 0 and violation events as 1
• Feature scaling applied to hours and miles to improve algorithm performance
• Features with binary values denoted as 0 and 1
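The transformation between the two tables can be sketched as one function. The mapping below is inferred from the two example rows on the slide (hours divided by 100, miles divided by 10,000, Hourly encoded as 1 and Miles as 0, longitude and latitude dropped); the divisors and the dictionary field names are assumptions, not the demo's actual code.

```python
YES_NO = {"Yes": 1, "No": 0}


def to_training_row(event):
    """Turn one enriched event into [label, features...] as in the training table."""
    return [
        0 if event["event_type"] == "Normal" else 1,  # label: violation = 1
        YES_NO[event["certified"]],
        1 if event["wage_plan"] == "Hourly" else 0,
        event["hours_driven"] / 100,     # assumed scaling, inferred from 45 -> 0.45
        event["miles_driven"] / 10000,   # assumed scaling, inferred from 2721 -> 0.2721
        YES_NO[event["foggy"]],
        YES_NO[event["rainy"]],
        YES_NO[event["windy"]],
    ]


normal_event = {
    "event_type": "Normal", "certified": "Yes", "wage_plan": "Hourly",
    "hours_driven": 45, "miles_driven": 2721,
    "foggy": "No", "rainy": "No", "windy": "No",
}
overspeed_event = {
    "event_type": "Overspeed", "certified": "No", "wage_plan": "Miles",
    "hours_driven": 72, "miles_driven": 4152,
    "foggy": "Yes", "rainy": "Yes", "windy": "No",
}
row_normal = to_training_row(normal_event)
row_overspeed = to_training_row(overspeed_event)
```

Scaling hours and miles into the same 0 to 1 range as the binary features keeps any single feature from dominating the regression purely because of its units.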
52. Running Spark ML on YARN
1. Run spark-submit script to launch a Spark job on YARN.
spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 truckml.jar --algorithm LR --regType L2 --regParam 1.0 /user/root/truck_training --numIterations 100
(/user/root/truck_training is the training data location on HDFS)
2. Monitor progress of Spark job in YARN Resource Mgr UI
53. Interpreting Spark Logistic Regression Results
Precision: 87.5%
Recall: 88%
Top three predictors of violations
1. Foggy Weather
2. Rainy Weather
3. Driver Certification
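For readers less familiar with the two metrics, the sketch below shows how precision and recall are computed from prediction counts. The counts used here are hypothetical, chosen only so that the arithmetic reproduces the percentages on the slide; they are not the demo's actual results.

```python
def precision(true_positives, false_positives):
    """Of all events predicted as violations, the share that really were."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Of all real violations, the share the model caught."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts: 154 violations correctly predicted, 22 false alarms,
# 21 violations missed.
p = precision(true_positives=154, false_positives=22)
r = recall(true_positives=154, false_negatives=21)
```

With these counts, `p` works out to 0.875 and `r` to 0.88: high precision means few false alarms to drivers, and high recall means few missed violations.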
54. Integrating Spark model in Storm
Storm Prediction Bolt
§ Initialize Spark model
§ Parse truck event
§ Enrich event with HBase data
§ Predict violation with model
§ Send Alert if violation predicted
[Figure: Kafka Spout; Storm Prediction Bolt; Real-time Serving (HBase); ActiveMQ; Ops Center; LOB Dashboards.]
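The bolt's steps (parse, enrich, predict, alert) can be sketched with a hand-rolled logistic regression scorer. Everything specific here is an assumption for illustration: the weights and intercept are made up rather than taken from the trained model, the comma-separated event format is hypothetical, and a dictionary stands in for the HBase lookup.

```python
import math

# Hypothetical model parameters for the features [foggy, rainy, not_certified].
WEIGHTS = [2.0, 1.0, 0.5]
INTERCEPT = -2.0


def predict_probability(features, weights, intercept):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def handle_event(raw_line, driver_info, threshold=0.5):
    """Parse an event, enrich it with driver data, score it, and flag an alert."""
    driver_id, foggy, rainy = raw_line.split(",")      # parse truck event
    certified = driver_info[driver_id]["certified"]    # enrich (stand-in for HBase)
    features = [float(foggy), float(rainy), 0.0 if certified else 1.0]
    probability = predict_probability(features, WEIGHTS, INTERCEPT)
    return {
        "driver_id": driver_id,
        "probability": probability,
        "alert": probability >= threshold,             # send alert if predicted
    }


drivers = {"d1": {"certified": False}, "d2": {"certified": True}}
risky = handle_event("d1,1,1", drivers)
safe = handle_event("d2,0,0", drivers)
```

Scoring a trained logistic regression model needs only the weights and a sigmoid, which is why it is cheap enough to run per event inside a bolt.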
55. Summary: Solution Value
56. Value of large scale ML on HDP
§ Accelerate time to market/value
– Test out multiple ML algorithms against TBs of training data in reasonable time frames
§ Confirm hypothesis against TBs of training data with confidence
– We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise
§ Easily integrate predictive models in data driven apps
– Run predictive models in Storm or any other app in your enterprise
§ Run all of the above in a multi-tenant YARN cluster
– Large scale ML on YARN respects other tenants in an HDP cluster
57. Recommendations to CDO
§ Investment recommendations, in order of priority
1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
2. Invest in slip resistant tires to fight rainy conditions
3. Invest in certifying drivers to reduce violation probability
§ Power of real time predictions
– 40% reduction in violation rates by predicting high risk situations in real-time and sending immediate alerts to drivers
58. Predictive Demo
59. Q & A