Melbourne / April 2015
Introducing DataFrames
R on Spark
Large Scale Machine Learning
Speakers
Ned Shawa
Mark Moloney
Dr. Zhen He
Agenda
• Apache Spark 1.3 Overview 6:15 – 6:30 (Ned)
• DataFrames for Apache Spark 6:30 – 7:00 (Ned)
• R on Apache Spark 7:00 – 7:30 (Mark)
• Large Scale Machine Learning on Spark 7:30 – 8:15 (Zhen)
News
• Hadoop + Strata
• Jobs
• Meetup Update
• Personal Announcement
• Call for Contribution
Contributions so far…..
• Mark with R on Spark, Scala 101
• Tim with Building Spark on IDEs
• Con with building Spark with Gradle
• More?
What's new in Spark 1.3
• Multi-level aggregation trees
• Improved error reporting
• SSL encryption for control messages and the Web UI
• DataFrames API
• Backward compatibility for Hive
• Write support in the data sources API
• JDBC data source
• New algorithms for MLlib
• Direct Kafka API
More Data Sources APIs
What are DataFrames?
• Distributed collection of data organized into named columns
• Equivalent to a table in a relational database or a data frame in R/Python
• Much richer optimization under the hood than other data frame implementations
• Can be constructed from a wide variety of sources and APIs
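The Scala examples on the following slides assume a sqlContext is already in scope (the Spark shell creates one automatically). A minimal setup sketch for a standalone Spark 1.3 application, with a placeholder application name:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.3: the DataFrame entry point is SQLContext, built on top of a SparkContext
val sc = new SparkContext(new SparkConf().setAppName("dataframe-demo"))
val sqlContext = new SQLContext(sc)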
Working with a DataFrame
val df = sqlContext.jsonFile("/home/ned/attendees.json")
df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name","Age").show()
df.filter(df("age")>40).show()
df.groupBy("age").count().show()
DataFrame with RDD
case class attendees_class(first_name: String, last_name: String, age: Int)
import sqlContext.implicits._  // needed in Spark 1.3 so toDF() is available on the RDD
val attendees = sc.textFile("/home/ned/attendees.csv").map(_.split(",")).map(p => attendees_class(p(0), p(1), p(2).trim.toInt)).toDF()
attendees.registerTempTable("attendees")
val youngppl = sqlContext.sql("select first_name,last_name from attendees where age < 35")
youngppl.map(t => "Name: " + t(0) + " " + t(1)).collect().foreach(println)
DataFrames and Parquet
attendees.saveAsParquetFile("/home/ned/attendees.parquet")
val pfile = sqlContext.parquetFile("/home/ned/attendees.parquet")
pfile.printSchema()
pfile.registerTempTable("attendees_parquet")
val old_ppl=sqlContext.sql("select first_name,last_name,age from attendees_parquet where age >=35 order by age desc")
old_ppl.map(t=>"Name: " + t(0)+" "+t(1)+ " Age " +t(2)).collect().foreach(println)
DataFrames and JDBC
val jdbc_attendees = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx", "dbtable" -> "attendees"))
jdbc_attendees.show()
jdbc_attendees.count()
jdbc_attendees.registerTempTable("jdbc_attendees")
val countall = sqlContext.sql("select count(*) from jdbc_attendees")
countall.map(t=>"Records count is "+t(0)).collect().foreach(println)
DataFrames for Apache Spark
Spark Components
Introduction to SparkR
Mark Moloney
April 2015
https://github.com/markmo/sparkr-meetup-sparkr-demo
R
• A language that targets statistical and general data analysis
• A package for nearly everything in this space
• Great for exploratory analysis – rapid statistics and plots
• Single threaded
• Datasets limited to memory
[Chart: Rexer Analytics survey of primary analytic tools and tool-selection preferences; the cluster labels contrast cost sensitivity ("cost is important", "everything is important") with ease-of-use / interface / ability to write one's own code.]
Data miners are a diverse group who are looking for
different things from their data mining tools. They report
using multiple tools to meet their analytic needs, and
even the most popular tool is identified as their primary
tool by just 24% of data miners. Over the years, R and
Rapid Miner have shown substantial increases.
Cluster analysis* reveals that, in their tool-selection
preferences, data miners fall into 5 groups. The primary
dimensions that distinguish them are price sensitivity and
code-writing / interface / ease-of-use preferences.
2013 Rexer Analytics Survey of 1,259 analytics professionals from 75 countries.
Spark
• An evolutionary step up from Map-Reduce programming on Hadoop
• Do more with less work
• Simpler API than Map-Reduce. Apply functional transforms to datasets. The
framework takes care of distribution of work across multiple machines.
• Can cache interim results in memory, which speeds up iterative procedures
SparkR
• An R API to Spark’s RDDs (Resilient Distributed Datasets)
• Work with massive datasets in R
• Works on top of YARN or Mesos
• Interactively run jobs on a cluster from the R shell
• Exposes the RDD API of Spark as distributed lists in R
• Packages and ships variables in the closure to each node
• Use includePackage to include third-party packages on other nodes
• Currently only supports R lists and vectors; data frame support is in the works.
Exposed operations include: map / lapply, mapPartitions / lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, textFile, parallelize, broadcast, includePackage
How does it work?
• The R executable must be installed on each node
• Work is sent to a Spark Executor (Java) on each node
• Some overhead in starting the R interpreter – a persistent background R process, as in PySpark, is being looked at
[Architecture diagram: a local R process drives a Java SparkContext in a local JVM via the Java Native Interface (JNI) using rJava; tasks, broadcast variables, and packages are shipped to Spark Executors on remote machines, each of which launches R to run the tasks. The R environment of the variables used is sent with each task, and the closure is serialized with R's save() function.]
Roadmap
• Feature-complete DataFrame API
• MLlib integration
DataFrame Methods
• Filter – filter(df, df$col1 > 0)
• Sort – sortDF(df, asc(df$col1), desc(df$col2))
• Join – join(df1, df2, df1$col1 == df2$col2, "right_outer")
• GroupBy – groupBy(df, df$col1)
• Agg – agg(groupDF, sum(groupDF$col2), max(groupDF$col3))
Demos
Document Similarity Example
• Collection of inaugural speeches of US presidents
• Using Shingles and Jaccard similarity
– A k-shingle for a document is a sequence of k characters that appears in the document
– Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca}
– Intuitively, documents that are similar will have many shingles in common
– Robust to small changes, e.g. reordering a paragraph only affects the 2k shingles that cross paragraph boundaries
– Jaccard similarity is intersection / union
k = 6
doc1: "The cat sat"
doc2: "The cat ate"

Characteristic matrix:

shingle     doc1  doc2
'The ca'     1     1
'he cat'     1     1
'e cat '     1     1
' cat s'     1     0
'cat sa'     1     0
'at sat'     1     0
' cat a'     0     1
'cat at'     0     1
'at ate'     0     1

Jaccard similarity = 3 shared shingles / 9 distinct shingles = 1/3.
The characteristic matrix is too large in real-world problems, therefore use minhashing.
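For illustration only, a minimal Scala sketch of character k-shingling and Jaccard similarity for two strings; this is not the demo's SparkR code, and the helper names shingles and jaccard are made up here:

// Character k-shingles of a document: every substring of length k
def shingles(doc: String, k: Int): Set[String] = doc.sliding(k).toSet

// Jaccard similarity: |intersection| / |union|
def jaccard(a: Set[String], b: Set[String]): Double =
  if (a.isEmpty && b.isEmpty) 1.0
  else (a intersect b).size.toDouble / (a union b).size

val s1 = shingles("The cat sat", 6)
val s2 = shingles("The cat ate", 6)
println(jaccard(s1, s2))  // 3 shared of 9 distinct shingles => 0.333...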
Machine Learning Example
Digit Classification (MNIST database)
Digits consist of 784 pixel values (28 x 28 images)
Training set: 60,000 images; Test set: 10,000 images
Citations
• http://amplab-extras.github.io/SparkR-pkg/ - main site
• http://ampcamp.berkeley.edu/5/exercises/sparkr.html - hands-on exercises
• http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2 - MNIST example
• https://github.com/kiendang/sparkr-naivebayes-example - how to integrate with MLLib now
• https://github.com/cafreeman/Demo_SparkR - New API
Spark Machine Learning Experiences
Speaker: Zhen He
Talk Outline
● Spark is awesome!
● Example of Spark in action
● Why is Spark good for machine learning?
● Big Data versus big parameters
● What I wish you could do in Spark
● Competing scalable machine learning systems
● Dispelling some common myths about Spark and Scala
● Current Spark machine learning projects from our group
● Conclusion
● Demonstration
Spark is Awesome!
Focus of Talk is on Performance
Example of Spark in Action
● Recently did some work with an Australian Government Agency
● Taught them the Mastering Hadoop and Spark course
● Students loved it
● Brain dead at the end :)
● They use Spark for machine learning
● They made excellent use of the skills learnt in the course for their project.
Summary
● Initial R solution
– Never finished
● Approximate R + C single-core solution
– A couple of days
● Accurate Spark parallel solution
– 18 minutes
● The model worked very well in the initial trial.
Lessons Learnt
● First reduce the complexity of the problem
● Pre- and post-processing performance is very important
● Spark is excellent for both the modelling and the pre- and post-processing
Why is Spark good for machine learning compared to MapReduce?
● Machine learning algorithms are iterative
● With MapReduce, every iteration reads its input from HDFS and writes its output back to HDFS
● With Spark, intermediate results are read from and written to RAM instead of HDFS
[Diagram: MapReduce does an HDFS read and an HDFS write around each iteration; Spark reads the input once and keeps iterating in memory.]
Make sure you actually cache the data!

// assumes org.apache.spark.mllib.util.MLUtils and
// org.apache.spark.mllib.classification.LogisticRegressionWithSGD are imported
val data = MLUtils.loadLibSVMFile(sc, "datafile.txt")
val LR = new LogisticRegressionWithSGD()
LR.optimizer.setNumIterations(10)
val model = LR.run(data)

The above runs 10x slower than the code below!

val data = MLUtils.loadLibSVMFile(sc, "datafile.txt").cache
val LR = new LogisticRegressionWithSGD()
LR.optimizer.setNumIterations(10)
val model = LR.run(data)
Big Data versus Big Parameters
● Spark is great at machine learning on Big Data as long as the parameter set is small.
● Big parameter sets are hard for Spark to handle efficiently.
What are Big Parameters?
● Big parameter sets come from high-dimensional data
● Modelling high-dimensional data requires a large number of parameters
● In the slide's example (a regression equation, shown as an image), the parameters are b0, b1, b2, … bn
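The regression equation itself did not survive the slide export. As a hedged reconstruction, a logistic regression model with parameters b0 … bn over an n-dimensional input x has the form

P(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + b2*x2 + … + bn*xn))

so the number of parameters grows linearly with the dimensionality n of the data.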
Examples of High-Dimensional Data
● Time series
● Text
● Image
● Speech
Varying dimensionality (100 – 10,000,000)
● Logistic regression on 30 cores, 10 iterations
● Dimensionality shown in [ ]; the second number is the number of training instances
● Total data size is constant
● As dimensionality increases, execution time increases significantly!
[Chart: execution time (secs) across datasets, from low-dimensional to high-dimensional data.]
Varying number of cores
● For high-dimensional data (10 million dimensions)
– Using more cores actually slows execution!
– 1 core is optimal!
● For low-dimensional data
– Execution time decreases as the number of cores increases.
[Charts: time (secs) versus number of cores for 10-million-dimensional data and for 100-dimensional data.]
Why are big parameters / high-dimensional data so hard for Spark?
Why are big parameters hard for Spark?
[Diagram: each worker trains on a mini-batch of data; the master node combines the separate models and broadcasts the combined model back to the workers.]
Reason 1: Requires all nodes to synchronize
[Diagram: as above – workers process mini-batches, the master node combines the separate models and broadcasts the combined model, and every node must wait for this step.]
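Schematically (this is only a sketch of the pattern in the diagram, not MLlib's exact update rule), combining K per-worker models by averaging and re-broadcasting looks like

w_combined = (1/K) * (w_1 + w_2 + … + w_K)

where each w_k has one entry per input dimension. At 10 million dimensions, every such synchronization point moves vectors of tens of megabytes or more between the workers and the master.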
Reason 2: Inefficient memory usage
● Although all cores are working on the same set of parameters, they each have their own copy
● It would be nice if they could all share the same copy
[Diagram: within one node, Core 1 – Core 4 each hold a separate copy of the parameters in RAM.]
Reason 3: High shuffle cost
● Parameters can be multi-GB in size
[Diagram: workers process mini-batches; the master node combines the separate models and broadcasts the combined model, so multi-GB models cross the network every iteration.]
Reason 4: Master node is a bottleneck
[Diagram: all workers send their mini-batch models to a single master node, which must combine them and broadcast the combined model.]
What I wish we could do on Spark
● All machines share a single copy of the model in shared RAM.
● All cores can update the model asynchronously.
● No bottleneck master node
[Diagram: Node 1 and Node 2 each have Cores 1–4 sharing one copy of the model in RAM, and both asynchronously update parameters held by a parameter server.]
Such a system already exists
● Google's DistBelief system
– J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M.A. Ranzato, A. Senior, P. Tucker, K. Yang, A.Y. Ng. Large Scale Distributed Deep Networks. NIPS, 2012.
● Closed source
– Only people in Google can use it
● Used to train a deep learning system with 1 billion parameters on 16,000 cores
Comparison of some key machine learning algorithms in Spark
● Linear models
– Logistic regression
– Linear regression
– Support vector machines (linear kernel)
● Random forest
Logistic Regression, Linear Regression, Linear Support Vector Machines (SVM)
● Need to repeatedly merge parameters
● High-dimensional data => big parameters => high communication costs
Random Forest
● Can be trained very efficiently
– Each task trains on the data of its own partition
– Each tree can be trained separately
– No communication or synchronization needed
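As a hedged illustration, a minimal MLlib (Spark 1.3 RDD API) random forest training sketch; the file path and parameter values are placeholders, not the speaker's actual experiment:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// LabeledPoint data in LIBSVM format; the path is a placeholder
val data = MLUtils.loadLibSVMFile(sc, "data.libsvm").cache()

// Each tree is built largely independently from the partitioned data,
// so there is far less parameter merging than for the linear models above
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)

val prediction = model.predict(data.first().features)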
Competing Scalable Machine Learning Systems
● Greenplum
– Machine learning via SQL
● Mahout
– MapReduce based
– Moving to Spark as the underlying engine
● Vowpal Wabbit
– Good performance
– Specialized system
● GraphLab
– Good performance
– Need to turn everything into graphs
Spark Misconceptions

Common Misconception: Spark is only good if data fits in RAM
● From its inception Spark was designed to be a general execution engine that works both in-memory and on-disk
● Almost all operators perform external (on-disk) operations when data does not fit in RAM.
● Spark broke the large-scale sort record
Results of Large-scale Sort
● Spark is 3X faster than Hadoop using 10X fewer machines
– Spark sorted 1 PB in 4 hours on 190 machines
– Compared to 16 hours for Hadoop using 3,800 machines
● All sorting was on disk (HDFS), with no use of Spark's in-memory cache
How did they do it?
● More efficient shuffle
● Very efficient scheduling of tasks

A little bit of RAM can go a long way
● Performance degrades gracefully with decreasing RAM size
A little bit of RAM can go a long way
● Just cache the small parts of the data that you will use again
– Use filter and projection to reduce the amount that is cached.
● Spark SQL stores data compressed in columns
– Fast compression/decompression
– Column-stores have been proven to be far better than row-stores for analytics (up to 100x better)
● For normal Spark code (non-Spark SQL), you can store data serialized and compressed in RAM (see the sketch after this list).
● RAM is cheap nowadays
– Aggregate RAM on a large cluster can be very large
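A hedged sketch of serialized (and optionally compressed) caching for plain RDD code, using the standard Spark 1.x API; the RDD here is just an example:

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER keeps each cached partition as serialized bytes instead of
// deserialized Java objects; setting spark.rdd.compress=true (e.g. via --conf
// on spark-submit) additionally compresses those bytes.
val nums = sc.parallelize(1 to 1000000)
nums.persist(StorageLevel.MEMORY_ONLY_SER)
println(nums.sum())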
Common Misconceptions
● "I need to learn a new language, Scala"
– Scala is easy to learn
– You can program in Java and Python
– Distributed DataFrames
– SparkR
● "I have to rewrite all my code"
– Spark will run MapReduce code unmodified
– Spark SQL runs HiveQL
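For example, a minimal sketch of running existing HiveQL through Spark SQL in Spark 1.3; the table name is hypothetical:

import org.apache.spark.sql.hive.HiveContext

// HiveContext understands HiveQL and can query existing Hive metastore tables
val hiveContext = new HiveContext(sc)
val result = hiveContext.sql("SELECT first_name, COUNT(*) FROM attendees GROUP BY first_name")
result.show()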
Some Real Problems
● Scala allocates a lot of objects
– High GC overhead
● Nice functional Scala code is much slower than writing C-style while loops
● For example:

val y = x.map(x => x * x)

is slower than

var i = 0
while (i < x.length) {
  y(i) = x(i) * x(i)
  i += 1
}
Our Work

Spark API Examples
● There are almost no good examples of the Spark API calls.
● Matthias and I have written examples for over 110 Spark API calls
– http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
● Great reference
– Currently over 33,000 hits
– Often the top Google result when searching for Spark API calls
Deep Learning on Spark
● Deep learning is currently very popular
● State-of-the-art performance in areas such as
– Image recognition
– Speech recognition
– Natural language processing
– etc.
● No good distributed open source deep learning implementation exists
● We are implementing the first serious deep learning implementation on Spark.
– Implemented in Scala
Features of our Open Source Spark Deep Learning System
● Stacked auto-encoders
● Convolutional neural networks
● RICA
● Sparse coding
● Fully connected networks
● Many different state-of-the-art optimization tricks, such as
– Dropout, ReLU, AdaGrad, momentum, RMSProp, etc.
Other Current Projects
● Zendesk
● Australian Institute of Sport
● Precision agriculture
● More welcome
Mastering Big Data Analytics with Hadoop Course
● 3-day course on Hadoop and Spark
● The only pre-requisite is Java programming experience
● More than 35 programming exercises
● Contents
– Hadoop and MapReduce
– Hive
– Hadoop 2 ecosystem (Storm, YARN, Giraph)
– Spark and Scala
– Spark on Amazon
Recruiting PhD Students
● We have a lot of real-world projects to do.
● A lot of projects from industry
– Zendesk
– Australian Institute of Sport
– Precision agriculture
– UXC Professional Solutions
– More welcome
● We need to expand our research group.
● Topics:
– Text mining
– Time series mining
– Reinforcement learning combined with deep learning
– Video mining
– etc.
Conclusion
● Spark is the best open source software for distributed machine learning on Big Data
● Be careful when using Spark for high-dimensional data
– Random forest does not suffer from this performance penalty
● Spark programming in Scala is great for ease of use
● Spark is good even if the data does not fit in RAM
Demonstration
R versus Spark Fight!
Score
Round 1 – Load + count: R 56.8s, Spark 10.6s
Round 2 – Selection: R 2.5s, Spark 0.86s
Round 3 – Sample 50% of data: R 7.1s, Spark 0.38s
Round 4 – K-means clustering (10 centers, 5 iterations): R 100.0s, Spark 53.0s
● Results are for single-core Spark versus single-core R
● Size of data is around 250 MB
● Data has 5 dimensions
● Clustering done on 3 dimensions
Questions?
● Name: Zhen He
● Email Address: z.he@latrobe.edu.au


Editor's notes
1. MNIST = Mixed National Institute of Standards and Technology database