Apache Spark Crash Course
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational database or a data frame in R/Python
• API available in Scala, Java, Python, and R
[Figure: a DataFrame drawn as a table with columns Col1, Col2, …, ColN, with a Column and a Row labeled]
Data is described as a DataFrame with rows, columns, and a schema.
Two API Examples: DataFrame and SQL APIs
DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
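The DataFrame call and the SQL statement express the same select-and-filter. As a rough way to see this without a Spark cluster, the sketch below runs the slide's SQL against an in-memory SQLite table and compares it with a plain-Python filter. This is not Spark code: the `sqlite3` setup, the name `delayed_python` and the two on-time rows (ORD/SFO, JFK/BOS) are assumptions added for illustration; the five delayed rows are taken from the Results table.

```python
import sqlite3

# Rows as (Origin, Dest, DepDelay). The first five come from the Results table;
# the last two are hypothetical on-time flights so the filter has rows to drop.
flights = [
    ("IAD", "TPA", 19),
    ("IND", "BWI", 34),
    ("IND", "JAX", 25),
    ("IND", "LAS", 67),
    ("IND", "MCO", 94),
    ("ORD", "SFO", 3),
    ("JFK", "BOS", -2),
]

# DataFrame-style: keep rows whose DepDelay exceeds 15, at most 5 of them.
delayed_python = [row for row in flights if row[2] > 15][:5]

# SQL-style: the same query text as on the slide, run on an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flightsView (Origin TEXT, Dest TEXT, DepDelay INTEGER)")
conn.executemany("INSERT INTO flightsView VALUES (?, ?, ?)", flights)
delayed_sql = conn.execute(
    "SELECT Origin, Dest, DepDelay FROM flightsView WHERE DepDelay > 15 LIMIT 5"
).fetchall()
conn.close()

# Both approaches select the same set of delayed flights.
print(sorted(delayed_sql) == sorted(delayed_python))
```

In Spark itself, the choice between the two APIs is largely one of style: the SQL form suits ad hoc queries against a registered view, while the DataFrame form composes more naturally inside a program.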
What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real time or near real time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
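The contrast between bulk evaluation and micro-batch driven, continuous evaluation can be sketched in plain Python. This is a minimal illustration rather than Spark Streaming code: the function names (`batch_average`, `micro_batches`, `streaming_averages`), the batch size, and the sample delay values are all assumptions made for the example.

```python
from itertools import islice

def batch_average(stored_delays):
    """Batch: one bulk evaluation over data at rest."""
    return sum(stored_delays) / len(stored_delays)

def micro_batches(events, batch_size):
    """Group an incoming event stream into small fixed-size chunks."""
    it = iter(events)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

def streaming_averages(events, batch_size):
    """Stream: keep running state and emit an updated result per micro-batch."""
    count, total = 0, 0
    for chunk in micro_batches(events, batch_size):
        count += len(chunk)
        total += sum(chunk)
        yield total / count

# Hypothetical departure delays in minutes, arriving one after another.
delays = [19, 34, 25, 67, 94, 3]

# One updated average after every two events.
running = list(streaming_averages(delays, batch_size=2))

# Once all events have arrived, the streaming result matches the batch result.
print(running[-1] == batch_average(delays))
```

The streaming version produces an answer after every micro-batch instead of waiting for the full data set, which is what makes the processing long-lived and continuous.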
Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…
Machine Learning use cases
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
What is an ML Model?
• A mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
• E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input (1, 0, 7, 2, …) → Model → Output (7, 5, 19, 9, …)
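To make "model training" concrete, here is a minimal sketch that learns m and c from the slide's four input/output pairs using ordinary least squares. The slide does not say which fitting method is used, so least squares is an assumption, as is the helper name `fit_line`.

```python
def fit_line(xs, ys):
    """Fit y = m*x + c by ordinary least squares (closed-form solution)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    c = (sum_y - m * sum_x) / n
    return m, c

# Input and output values shown on the slide.
xs = [1, 0, 7, 2]
ys = [7, 5, 19, 9]

# "Training" means learning the parameters m and c from the data.
m, c = fit_line(xs, ys)
print(f"y = {m:g}x + {c:g}")  # prints: y = 2x + 5

# The trained model can then predict an output for an unseen input.
predict = lambda x: m * x + c
```

Because the slide's points lie exactly on a line, the fit recovers m = 2 and c = 5 with no error; with noisy data the same procedure would return the best-fitting line instead.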
Sample Spark ML Pipeline
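from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)          # Train the model on the training set
results = model.transform(testData)  # Apply the trained model to the test set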
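The fit/transform pattern behind such a pipeline can be sketched without Spark. The classes below are hypothetical stand-ins written for this illustration (they are not the pyspark.ml classes and greatly simplify them): a stage with only `transform` plays the role of a transformer, a stage with `fit` plays the role of an estimator, and fitting the pipeline fits each estimator in order on the output of the stages before it.

```python
class Tokenizer:
    """Transformer stand-in: adds a 'words' field to each row."""
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]

class MajorityLabel:
    """Estimator stand-in: 'learns' the most frequent label."""
    def fit(self, rows):
        labels = [r["label"] for r in rows]
        return MajorityLabelModel(max(set(labels), key=labels.count))

class MajorityLabelModel:
    """Fitted model: predicts the learned label for every row."""
    def __init__(self, label):
        self.label = label
    def transform(self, rows):
        return [dict(r, prediction=self.label) for r in rows]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: train it first
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # feed output to the next stage
        return PipelineModel(fitted)

class PipelineModel:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

# Hypothetical training and test rows.
train_data = [
    {"text": "Late again", "label": 1},
    {"text": "Delayed departure", "label": 1},
    {"text": "On time", "label": 0},
]
test_data = [{"text": "Gate Changed"}]

model = Pipeline(stages=[Tokenizer(), MajorityLabel()]).fit(train_data)
results = model.transform(test_data)
```

Bundling the stages this way means the test data passes through exactly the same preprocessing as the training data, which is the main reason to use a pipeline rather than calling each stage by hand.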
Reasons to Integrate with Livy
• Bring sessions to Apache Zeppelin
– Isolation
– Session sharing
• Enable efficient cluster resource utilization
– Default Spark interpreter keeps the YARN/Spark job running forever
– Livy interpreter is recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
• Identity propagation
– Send user identity from Zeppelin > Livy > Spark on YARN
• Zeppelin → Interactive notebook
• Spark
• YARN → Resource Management
• HDFS → Distributed Storage Layer
[Figure: stack diagram. APIs (Scala, Java, Python, R) sit on top of Spark SQL, Spark Streaming, MLlib and GraphX, which run on the Spark Core Engine, on YARN, over HDFS nodes 1 … N]
Access patterns enabled by YARN
[Figure: Batch, Interactive and Real-Time applications running on YARN (Data Operating System), over HDFS (Hadoop Distributed File System) nodes 1 … N]
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing data locality
• Not just storage but computation
[Figure: a logical file is divided into blocks 1 to 4, and three copies of each block are spread across the nodes of the cluster]
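The block-and-replica idea can be sketched in a few lines of Python. This follows the slide's simplified description of random placement; real HDFS placement is rack-aware rather than purely random. The 500 MB file size, 128 MB block size, node names, and function names are assumptions chosen so the example yields four blocks, as in the figure.

```python
import math
import random

def split_into_blocks(file_size_mb, block_size_mb):
    """Return block ids 1..n for a file cut into fixed-size blocks."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return list(range(1, n_blocks + 1))

def place_replicas(block_ids, nodes, replication=3, seed=None):
    """Assign each block to `replication` distinct nodes chosen at random."""
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication) for block in block_ids}

# Hypothetical cluster of five nodes and a 500 MB file with 128 MB blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(file_size_mb=500, block_size_mb=128)
placement = place_replicas(blocks, nodes, replication=3, seed=42)

for block, holders in placement.items():
    print(f"block {block}: {holders}")
```

Because every block lives on three different nodes, losing one node leaves two copies of each affected block, and a scheduler can run a task on any node that already holds the block it needs, which is the data locality point made above.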