#HSTokyo16 Apache Spark Crash Course
- 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Sources
→ Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo
44ZB in 2020
- 15.
DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational database or a data frame in R/Python
→ API available in Scala, Java, Python, and R
[Figure: a DataFrame drawn as a grid of rows and columns (Col1, Col2, …, ColN). Data is described as a DataFrame with rows, columns, and a schema.]
- 17.
SQL Context
SQLContext
→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
HiveContext
→ Superset of the functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions → UDFs
→ Use when your data resides in Hive
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
- 21.
Two API Examples: DataFrame and SQL APIs
DataFrame API
flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
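Both queries on this slide can be sketched as one small program. This is a minimal sketch, assuming the Spark 1.x-era API the slides use (SQLContext and registerTempTable) and a local master. The Flight case class, the FlightDelays object, and the third sample row (IND to ORD, 3 minutes) are hypothetical and exist only to give the filter something to exclude.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical record type; field names mirror the columns shown on the slide.
case class Flight(Origin: String, Dest: String, DepDelay: Int)

object FlightDelays {
  // Tiny in-memory stand-in for flightsDF. The last row is invented for illustration.
  def sampleFlights(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._ // brings toDF() into scope
    sqlContext.sparkContext
      .parallelize(Seq(Flight("IAD", "TPA", 19), Flight("IND", "BWI", 34), Flight("IND", "ORD", 3)))
      .toDF()
  }

  // DataFrame API: the slide's query, returned as a DataFrame instead of shown.
  def delayedDF(sqlContext: SQLContext, flightsDF: DataFrame): DataFrame = {
    import sqlContext.implicits._ // enables the $"column" syntax
    flightsDF.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15)
  }

  // SQL API: register the DataFrame under a table name, then query it with SQL.
  def delayedSQL(sqlContext: SQLContext, flightsDF: DataFrame): DataFrame = {
    flightsDF.registerTempTable("flights")
    sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flight-delays").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    val flightsDF = sampleFlights(sqlContext)
    delayedDF(sqlContext, flightsDF).show(5)
    delayedSQL(sqlContext, flightsDF).show(5)
    sc.stop()
  }
}
```

Both functions describe the same logical query, so they should return the same rows. Which style to use is largely a matter of preference and of who writes the query.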
- 23.
What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
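The contrast between bulk evaluation of data at rest and micro-batch evaluation of data in motion can be illustrated without Spark. The following is a rough sketch in plain Scala. The Reading type, the BatchVsStream object, and both function names are hypothetical, and a real streaming engine would add scheduling, fault tolerance, and windowing that this sketch ignores.

```scala
object BatchVsStream {
  // Hypothetical event type: one sensor reading with its arrival time.
  final case class Reading(timestampMs: Long, value: Double)

  // Batch: a single, short-lived evaluation over all stored data.
  def batchAverage(stored: Seq[Reading]): Double =
    if (stored.isEmpty) 0.0 else stored.map(_.value).sum / stored.size

  // Micro-batch streaming: consume readings in groups of batchSize as they
  // arrive and emit the running average after each group. The returned
  // iterator is lazy, so evaluation continues for as long as data keeps coming.
  def streamingAverages(incoming: Iterator[Reading], batchSize: Int): Iterator[Double] = {
    var count = 0L
    var sum = 0.0
    incoming.grouped(batchSize).map { batch =>
      count += batch.size
      sum += batch.map(_.value).sum
      sum / count
    }
  }
}
```

The batch function answers one question about the past, while the streaming function keeps producing an updated answer as each micro-batch lands.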
- 24.
Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…
- 31.
Where Can We Use Machine Learning (Data Science)?
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce readmission rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
- 32.
Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
...
...
- 40.
What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = set of paragraphs (cells)
• Other Features - Sharing/Collaboration/Reports/Import/Export
- 44.
Access patterns enabled by YARN
[Figure: Batch, Interactive, and Real-Time applications running on YARN (Data Operating System), on top of HDFS (Hadoop Distributed File System), across cluster nodes 1 … N.]
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
- 45.
Why Apache Spark on YARN?
→ Resource management
– Share cluster resources between Spark and other workloads (Hive, Solr, etc.)
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues
[Figure: Spark on YARN. Components shown: Client, Spark Driver, Spark Application Master (in a YARN container), and three Spark Executors, each in its own YARN container running two tasks.]
- 46.
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies of each block (the default replication factor) across the cluster
• Data locality for processing
• Not just storage but also computation
[Figure: a logical file is split into blocks 1 to 4, and each block is stored 3 times on different nodes of the cluster.]
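The splitting and replication idea on this slide can be sketched in plain Scala. This is only an illustration under simplifying assumptions: the BlockPlacement object and its function names are hypothetical, and replicas are placed uniformly at random, whereas real HDFS placement is rack-aware and handled by the NameNode.

```scala
import scala.util.Random

object BlockPlacement {
  // Split a file of fileSize bytes into block lengths of at most blockSize bytes.
  def splitIntoBlocks(fileSize: Long, blockSize: Long): Vector[Long] = {
    require(fileSize >= 0 && blockSize > 0, "sizes must be non-negative and blockSize positive")
    val fullBlocks = (fileSize / blockSize).toInt
    val remainder = fileSize % blockSize
    Vector.fill(fullBlocks)(blockSize) ++ (if (remainder > 0) Vector(remainder) else Vector.empty)
  }

  // For each block number (starting at 1), pick `replication` distinct nodes at random.
  def placeReplicas(numBlocks: Int, nodes: Vector[String], replication: Int, rng: Random): Map[Int, Vector[String]] = {
    require(replication <= nodes.size, "need at least as many nodes as replicas")
    (1 to numBlocks).map(block => block -> rng.shuffle(nodes).take(replication)).toMap
  }
}
```

Because every block lives on several nodes, losing one node does not lose data, and a scheduler can often run a task on a node that already holds the block it needs.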
- 47.
There’s more to HDP
Hortonworks Data Platform 2.4.x
GOVERNANCE & INTEGRATION
• Data Lifecycle & Governance: Falcon, Atlas
• Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS (on YARN: Data Operating System, cluster nodes 1 … N)
• Batch: MapReduce
• Script: Pig
• Search: Solr
• SQL: Hive
• NoSQL: HBase, Accumulo, Phoenix
• Stream: Storm
• In-memory
• Others: ISV Engines
• Tez and Slider appear as layers beneath several of these engines
DATA MANAGEMENT
• HDFS (Hadoop Distributed File System)
SECURITY
• Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
• Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
• Scheduling: Oozie
Deployment Choice: Linux, Windows, On-Premise, Cloud
- 54.
Reasons to Integrate with Livy
→ Bring sessions to Apache Zeppelin
– Isolation
– Session sharing
→ Enable efficient cluster resource utilization
– Default Spark interpreter keeps the YARN/Spark job running forever
– Livy interpreter is recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity propagation
– Send user identity from Zeppelin > Livy > Spark on YARN
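The resource-utilization point rests on recycling idle sessions. The sketch below models that idea in plain Scala. It is not Livy's implementation: the Session type, the SessionRecycling object, and the sweep function are hypothetical, and the 60-minute figure is simply the value quoted on the slide.

```scala
object SessionRecycling {
  // Hypothetical model of an interpreter session; not Livy's actual data structure.
  final case class Session(id: String, lastActivityMs: Long)

  // 60 minutes, the inactivity limit quoted on the slide, in milliseconds.
  val SixtyMinutesMs: Long = 60L * 60L * 1000L

  // A session counts as expired once it has been idle for longer than the timeout.
  def isExpired(session: Session, nowMs: Long, timeoutMs: Long): Boolean =
    nowMs - session.lastActivityMs > timeoutMs

  // Split sessions into (kept, recycled). Recycled sessions would release
  // their cluster resources; active ones keep running.
  def sweep(sessions: Seq[Session], nowMs: Long, timeoutMs: Long): (Seq[Session], Seq[Session]) =
    sessions.partition(s => !isExpired(s, nowMs, timeoutMs))
}
```

A periodic sweep like this is what lets idle notebooks hand their containers back to the cluster instead of holding them indefinitely.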
- 60.
Spark 2.0
→ API Improvements
– SparkSession (spark): new entry point (replaces SQLContext and HiveContext)
– Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
– Structured Streaming/Continuous Applications (concept of an infinite DataFrame)
– Temporary Table → Temporary View
→ Performance Improvements
– Tungsten Phase 2: whole-stage (multi-stage) code generation
– ORC & Parquet file improvements
→ Machine Learning
– ML pipeline API (spark.ml) is the new primary API; the RDD-based MLlib API (spark.mllib) is in maintenance mode
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
→ SparkSQL
– More SQL support (new ANSI SQL parser, subquery support)
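The API changes listed above can be seen together in a short program. This is a minimal sketch, assuming Spark 2.x on the classpath and a local master. The Delay case class, the Spark2EntryPoint object, and the sample rows are hypothetical (the third row is invented so the filter has something to exclude).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type used only for this illustration.
case class Delay(Origin: String, Dest: String, DepDelay: Int)

object Spark2EntryPoint {
  // SparkSession is the single entry point; for Hive-backed tables the
  // builder also offers enableHiveSupport().
  def localSession(): SparkSession =
    SparkSession.builder().appName("spark2-sketch").master("local[*]").getOrCreate()

  // Build a typed Dataset, expose it as a temporary view (the 2.0 name for a
  // temporary table), and query it with SQL. The result is a DataFrame,
  // which in 2.0 is an alias for Dataset[Row].
  def delayedOver(spark: SparkSession, minutes: Int): DataFrame = {
    import spark.implicits._ // encoders needed by toDS()
    val flights = Seq(Delay("IAD", "TPA", 19), Delay("IND", "BWI", 34), Delay("IND", "ORD", 3)).toDS()
    flights.createOrReplaceTempView("flights")
    spark.sql(s"SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > $minutes")
  }

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    delayedOver(spark, 15).show()
    spark.stop()
  }
}
```

Code written against SQLContext generally keeps working in 2.x, so moving to SparkSession can be done incrementally.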