PRACTICAL LARGE SCALE EXPERIENCES WITH SPARK 2.1 MACHINE LEARNING
Berni Schiefer
IBM, Spark Technology Center
Agenda
• How IBM is leveraging SparkML
• Our experimental environment
– hardware and benchmark/workload
• Focus areas for scalability exploration
• Initial results
• Future work
2
IBM Data Science Experience
• Learn: built-in learning to get started or go the distance with advanced tutorials
• Create: the best of open source and IBM value-add to create state-of-the-art data products
• Collaborate: community and social features that provide meaningful collaboration
Sign up today! - http://datascience.ibm.com
3
The Machine Learning Workflow
[Figure: workflow diagram. Data: History Data, Feedback Data, Operational Data, Predictions. Artifacts: Pipeline, Model. Stages: deploying, scoring, monitoring, retraining, redeploying. Roles: Data Scientist, Developer/stakeholder. ML Pipeline: data visualization, feature engineering, model training, model evaluation.]
IBM Watson Machine Learning
4
Key Model Training Questions:
• Which machine learning algorithm should I use?
• For a chosen machine learning algorithm, which
hyper-parameters should be tuned?
Explosive search space!
5
© 2015 IBM Corporation
Data Scientist Workflow With CADS
6
[Figure: CADS architecture diagram with numbered steps 1 to 6. Labels: User Interface (or REST API directly), Learning Controller, Tactical Planner, Orchestrator and Scheduler, Science of Analytics Repository, Knowledge Acquisition, External Knowledge about Analytics, Analytic Platforms, Analytic Monitoring and Adaptation, Deployed Analytic]
• Input: Supervised prediction problem
• Submit to CADS
• AI technology automatically determines best analytics pipeline
• Data Scientist can interact with System
• Cross-Platform Deployment and Evaluation
Model Selection via “Data Allocation using Upper Bounds” (DAUB)
7
Candidate pipelines: Logistic Regression, A3, SVM, Random Forest, …
[Figure: for each built model, prediction accuracy versus number of data points; training data allocations labeled “500” and “500” additional data points]
Ranking based on upper bound estimate on performance of each pipeline (‘slope’ of learning curve)
https://arxiv.org/pdf/1601.00024.pdf
7
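The ranking idea on this slide can be sketched in a few lines. This is a simplified illustration, not the CADS implementation: it assumes each pipeline has been trained at two allocation sizes, takes the slope between those two points, and projects linearly to the full training set size to get an optimistic bound. The function names, the accuracy values, and the cap at 1.0 are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def upper_bound(points: List[Tuple[int, float]], full_size: int) -> float:
    """Optimistic accuracy at full_size, projected from the last two (n, accuracy) points."""
    (n_prev, acc_prev), (n_last, acc_last) = points[-2], points[-1]
    # A negative slope is treated as flat: more data is assumed not to hurt.
    slope = max(0.0, (acc_last - acc_prev) / (n_last - n_prev))
    # Linear projection along the latest slope, capped at a perfect score.
    return min(1.0, acc_last + slope * (full_size - n_last))


def rank_pipelines(curves: Dict[str, List[Tuple[int, float]]], full_size: int) -> List[str]:
    """Order pipelines by projected upper bound, most promising first."""
    return sorted(curves, key=lambda name: upper_bound(curves[name], full_size), reverse=True)


# Hypothetical learning-curve samples: (number of training points, accuracy).
curves = {
    "Logistic Regression": [(500, 0.70), (1000, 0.71)],
    "SVM": [(500, 0.60), (1000, 0.66)],
    "Random Forest": [(500, 0.65), (1000, 0.68)],
}
ranking = rank_pipelines(curves, full_size=3000)
print(ranking)
```

Note that in this made-up data the pipeline with the lowest current accuracy ranks first, because its curve is still rising fastest. That is the point of ranking on the bound rather than on the current score.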
Our “F1” Spark SQL Cluster
• A 28-node cluster
– 2 management nodes (co-exist
with 2 data nodes)
– 28 data nodes
• Lenovo x3650 M5
– 2 sockets (18 cores/socket)
– 1.5TB RAM
– 8x2TB SSD
– 2 racks
• 20x 2U servers per rack (42U racks)
– 1 switch, 100GbE, 32 ports, 1U
• Mellanox SN2700
8
• Each data node
– CPU: 2x E5-2697 v4 @ 2.3GHz (Broadwell) (18c)
– Memory: 1.5TB per server 24x 64GB DDR4 2400MHz
– Flash Storage: 8x 2TB SSD PCIe NVMe (Intel DC P3600), 16TB per server
– Network: 100GbE adapter (Mellanox ConnectX-4 EN)
– IO Bandwidth per server: 16GB/s, Network bandwidth 12.5GB/s
– IO Bandwidth per cluster: 480GB/s, Network bandwidth 375GB/s
• Totals per Cluster (counting 28 data nodes)
– 1,008 cores, 2,016 threads (hyperthreading)
– 42TB memory
– 448TB raw, 224 NVMe
“F1” Spark Platform Details
9
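The per-cluster totals on this slide follow directly from the per-node figures. The short sketch below reproduces that arithmetic for the 28 data nodes; the class and function names are made up for the example, and only the core, memory, and flash totals are derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    sockets: int = 2
    cores_per_socket: int = 18
    threads_per_core: int = 2   # hyperthreading
    memory_tb: float = 1.5
    nvme_drives: int = 8
    tb_per_drive: int = 2


def cluster_totals(node: DataNode, data_nodes: int) -> dict:
    """Multiply the per-node figures by the number of data nodes."""
    cores = node.sockets * node.cores_per_socket * data_nodes
    return {
        "cores": cores,
        "threads": cores * node.threads_per_core,
        "memory_tb": node.memory_tb * data_nodes,
        "nvme_drives": node.nvme_drives * data_nodes,
        "raw_flash_tb": node.nvme_drives * node.tb_per_drive * data_nodes,
    }


totals = cluster_totals(DataNode(), data_nodes=28)
print(totals)
```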
What about the Data Set?
• Desired Data Set / Data Generator Properties
– Realistic (we want to realistically exercise ML algorithms)
– Synthetic (no issues with data privacy/ownership)
– Scalable (desire to scale data volumes up/down)
– Source Available (to make changes)
• In particular we wanted to be able to “salt” the data (if needed)
• Selected Social Network Benchmark from LDBC
10
What is the LDBC?
Linked Data Benchmark Council = LDBC
• Industry entity similar to TPC (www.tpc.org)
• Focusing on graph and RDF store benchmarking
+ non-profit members (FORTH) & personal members
+ Task Forces, volunteers developing benchmarks
+ TUC: Technical User Community
http://ldbcouncil.org
11
LDBC benchmarks consist of...
• Four main elements:
– data schema: defines the structure of the data
– workloads: define the set of operations to perform
– performance metrics: used to measure (quantitatively) the performance of the systems
– execution rules: defined to ensure that the results from different executions of the benchmark are valid and comparable
• Software as Open Source (GitHub)
– data generator, query drivers, validation tools, ...
12
Social Network Benchmark
www.cwi.nl/~boncz/graphta.ppt
Focus of our machine learning
13
Initial Focus Areas
• Prediction System Scalability
– Evaluate 6 different algorithms to determine which can best
predict interest in a “topic class”
• Recommendation System Scalability
– Collaborative filtering using the ALS (Alternating Least Squares)
algorithm, evaluating multiple hyper-parameters, each with multiple
values, to recommend a topic to a person
• Using Watson Machine Learning with embedded
Cognitive Automation of Data Science (CADS)
14
Prediction Algorithms
• Goal: Given information about which “topic classes” a person is
known to be interested in, how well can we predict whether the
person will be interested in another topic class?
• Algorithms competing
– Logistic Regression
– Support Vector Machine (SVMWithSGD)
– Decision Tree
– Random Forest
– Gradient-Boosted Trees
– Multilayer Perceptron
• Experiment: How well can we scale the evaluation of multiple
machine learning algorithms to reduce elapsed time?
15
Data Preparation for Prediction
Person.id Tag.id
1 6
1 138
1 523
1 573
1 576
1 775
1 777
1 973
2 3
… …
Table: person_hasInterest_tag
(generated, 98.4GB, 5.47 billion rows)
Tag.id TagClass.id
0 349
1 211
2 98
3 13
4 13
5 13
6 3
7 82
8 88
… …
Table: tag_hasType_tagclass
(generated, 145KB, 16080 rows)
Person.id TagClass.id
1 3
2 13
… …
Table: person_hasInterest_tagclass
(derived, 1.8 billion rows)
Person.id TagClass.id.0 TagClass.id.3 TagClass.id.13 … TagClass.id.115 … TagClass.id.355
1 0.0 1.0 0.0 … 0.0 … 0.0
2 0.0 0.0 1.0 … 0.0 … 0.0
… … … … … … … …
(derived, 234 million rows, 63 columns, 30.1 GB CSV file stored on HDFS)
Step 1: “distinct rows” join on “Tag.id” column
Step 2: replace “TagClass.id” column with columns for each unique TagClass.id
Using the 100TB LDBC-SNB data set
16
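The two preparation steps on this slide (a distinct join on Tag.id, then one column per TagClass.id) can be illustrated on a handful of rows. This is a plain-Python sketch of the transformation, not the Spark job used for the experiment. The Tag.id to TagClass.id pairs come from the sample table; the row (2, 4) is a hypothetical addition so that the distinct step has a duplicate to remove.

```python
# (Person.id, Tag.id) rows; (2, 4) is hypothetical, added to exercise "distinct".
person_has_interest_tag = [(1, 6), (2, 3), (2, 4)]

# Tag.id -> TagClass.id, taken from the sample rows of tag_hasType_tagclass.
tag_has_type_tagclass = {3: 13, 4: 13, 6: 3}

# Step 1: join on Tag.id and keep distinct (Person.id, TagClass.id) rows.
person_tagclass = sorted({
    (person, tag_has_type_tagclass[tag])
    for person, tag in person_has_interest_tag
    if tag in tag_has_type_tagclass
})

# Step 2: one column per unique TagClass.id, 1.0 where the person has that interest.
tagclasses = sorted({tagclass for _, tagclass in person_tagclass})
features = {}
for person, tagclass in person_tagclass:
    row = features.setdefault(person, {tc: 0.0 for tc in tagclasses})
    row[tagclass] = 1.0

for person, row in sorted(features.items()):
    print(person, row)
```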
Classification workload – Elapsed
time by cluster size
• We wanted to assess the scalability of the CADS algorithm using SparkML as we
increased the node count from 1 to 14
• Elapsed time was shortest with 4 Data Nodes (144 cores)
Spark tuning:
Executors per node: 36
Executor memory: 32GB
Executor cores: 2
Driver memory: 32GB
Driver cores: 4
17
Classification workload – Elapsed
time by cluster size
• Next, we assessed the scalability of the CADS algorithm using SparkML as we
increased the node count from 1 to 14, with a fixed number of Spark executors (144)
• With 144 Spark executors, elapsed time decreases as we add Data Nodes
• Conclusion: Too many Spark executors can hurt performance
Spark tuning:
Executors: 144
Executor memory: 32GB
Executor cores: 2
Driver memory: 32GB
Driver cores: 4
18
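The difference between the two experiments is only in how the executor count is chosen. The sketch below expresses the listed tuning values as standard Spark configuration properties and computes the executor count for each setup. The helper name and the intermediate node counts (2, 4, 8) are assumptions; the slides only state the range 1 to 14.

```python
def spark_conf(num_executors: int) -> dict:
    """Spark properties for the tuning values listed on the slides."""
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.memory": "32g",
        "spark.executor.cores": "2",
        "spark.driver.memory": "32g",
        "spark.driver.cores": "4",
    }


EXECUTORS_PER_NODE = 36
NODE_COUNTS = (1, 2, 4, 8, 14)  # assumed steps between 1 and 14

# First experiment: the executor count grows with the cluster.
per_node_runs = {nodes: nodes * EXECUTORS_PER_NODE for nodes in NODE_COUNTS}

# Second experiment: the executor count stays at 144 while nodes are added.
fixed_runs = {nodes: 144 for nodes in NODE_COUNTS}

for nodes in NODE_COUNTS:
    print(nodes, per_node_runs[nodes], fixed_runs[nodes])

conf = spark_conf(fixed_runs[14])
```

With 36 executors per node, 4 nodes give 144 executors, which is the count held fixed in the second experiment; at 14 nodes the first setup would run 504.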
Speeding up Algorithm Selection
• Here we compare CADS evaluating 6 classification algorithms individually (separate
Spark jobs), versus doing all 6 algorithms concurrently (single Spark job)
• Recommendation: Let CADS compare all algorithms at the same time
19
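As an illustration of the "all at once" pattern, the sketch below submits every candidate evaluation from one driver program through a thread pool instead of running six separate programs. It is not how CADS is implemented; `evaluate` is a placeholder that returns a dummy score, and the pool size is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

ALGORITHMS = [
    "Logistic Regression",
    "Support Vector Machine",
    "Decision Tree",
    "Random Forest",
    "Gradient-Boosted Trees",
    "Multilayer Perceptron",
]


def evaluate(algorithm: str) -> float:
    """Placeholder for fit-and-score; returns a dummy score derived from the name."""
    return (len(algorithm) % 10) / 10.0


def evaluate_all(algorithms, max_workers=6):
    """Run every evaluation concurrently and collect scores by algorithm name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(evaluate, algorithms))
    return dict(zip(algorithms, scores))


scores = evaluate_all(ALGORITHMS)
best = max(scores, key=scores.get)
print(best, scores[best])
```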
Recommendation Algorithm
• Goal: Given a matrix of people and topics, and information
about which people like which topics, can we recommend
other topics that they may be interested in?
• Algorithm chosen: ALS (Alternating Least Squares)
• Hyper-parameters
– Regularization Parameter (0.1, 0.01)
– Rank (5, 10)
– Alpha (1.0, 100.0)
• Experiment: How well can we parallelize and scale the
evaluation of the combinations of Hyper-parameter values of
ALS?
20
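Two values for each of three hyper-parameters give 2 x 2 x 2 = 8 combinations to evaluate. A minimal sketch of enumerating that grid follows; the ALS-0 to ALS-7 labels match the numbering used later in the deck, and the dictionary keys are chosen for the example.

```python
from itertools import product

reg_params = (0.1, 0.01)
ranks = (5, 10)
alphas = (1.0, 100.0)

# Cartesian product of the three value lists, labeled in enumeration order.
grid = {
    f"ALS-{i}": {"regParam": reg, "rank": rank, "alpha": alpha}
    for i, (reg, rank, alpha) in enumerate(product(reg_params, ranks, alphas))
}

for name, params in grid.items():
    print(name, params)
```

Adding a third value to any one parameter would raise the count to 12, which is why the search space grows so quickly.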
Data Preparation for ALS (1 of 2)
Id (type: Long) firstName lastName gender birthday creationDate locationIP browserUsed
933 Mahinda Perera male 1989-12-04 2010-03-17T23:32:10.447 192.248.2.123 Firefox
… … … … … … … …
35184381044707 Dặng Dinh Doan female 1981-08-22 2012-10-14T08:38:49.331 82.236.112.234 Chrome
… … … … … … … …
Id (type: Long) new_id (type : Int) firstName lastName gender birthday creationDate locationIP browserUsed
933 1 Mahinda Perera male 1989-12-04 2010-03-17T23:32:10.447 192.248.2.123 Firefox
… … … … … … … … …
35184381044707 8099641 Dặng Dinh Doan female 1981-08-22 2012-10-14T08:38:49.331 82.236.112.234 Chrome
… … … … … … … … …
The person ids generated by the LDBC-SNB data generator are wide and need type “Long”, but the “user”
member of the Spark MLlib “Rating” class is of type Int. So that the ALS algorithm can be tested with either
Spark MLlib or SparkML, we create an alternate id of type Int for each person.
Step 1: Add alternate person id (type Int) that the ALS algorithm will use
Table: person (generated, 747 MB, 8.1 million rows)
Using the 3TB LDBC-SNB data set
21
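The id remapping can be sketched as follows. This is an illustration of the idea only, assuming ids are numbered densely from 1 in sorted order; the slide does not say how new_id was actually assigned, and the function name is made up.

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed Int

# Two person ids taken from the slide's sample rows.
person_ids = [933, 35184381044707]


def assign_int_ids(ids):
    """Map each distinct id to a dense 1-based surrogate that fits in an Int."""
    mapping = {pid: new_id for new_id, pid in enumerate(sorted(set(ids)), start=1)}
    if len(mapping) > INT_MAX:
        raise OverflowError("more distinct ids than an Int can index")
    return mapping


new_ids = assign_int_ids(person_ids)
print(new_ids)
print(35184381044707 > INT_MAX)
```

With only two sample rows the second person gets new_id 2; in the full 8.1 million row table the same person has new_id 8099641.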
Data Preparation for ALS (2 of 2)
id new_id …
933 1 …
… … …
35184381044707 8099641 …
… … …
Table: person (with “new_id” added, 8.1 million rows)
Person.id Tag.id
933 6
933 138
933 523
933 573
933 576
933 775
933 777
933 787
933 973
… …
Table: person_hasInterest_tag
(generated, 3.39 GB, 189 million rows)
new_id Tag.id rating
1 6 1.0
1 138 1.0
1 523 1.0
1 573 1.0
1 576 1.0
1 775 1.0
1 777 1.0
1 787 1.0
1 973 1.0
… … …
Step 2: Join on “Person.id”, add “rating” column (value in “rating” column is always 1.0)
Table: ratings (derived, 189 million rows, 8.1 million distinct users, 16080 items,
3.3 GB CSV file stored on HDFS)
22
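The join that produces the ratings table can be shown on the sample rows. This is a plain-Python sketch of the transformation rather than the Spark code used in the experiment; the variable names are chosen for the example.

```python
# id -> new_id, and (Person.id, Tag.id) pairs, from the slide's sample rows.
person = {933: 1}
person_has_interest_tag = [(933, 6), (933, 138), (933, 523)]

# Join on Person.id, swap in new_id, and add the constant rating of 1.0.
ratings = [
    (person[person_id], tag_id, 1.0)
    for person_id, tag_id in person_has_interest_tag
    if person_id in person
]

for row in ratings:
    print(row)
```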
ALS hyper-parameter tuning – Elapsed
time by cluster size
• We wanted to assess the scalability of CADS-HPO with the ALS algorithm
using SparkML as we increased the node count from 1 to 14
• Elapsed time was shortest with 14 Data Nodes
Spark tuning:
Executors per node: 36
Executor memory: 32GB
Executor cores: 2
Driver memory: 32GB
Driver cores: 4
23
“Error” of competing ALS
hyper-parameter combinations
• Used error metric “root mean squared error” (RMSE), lower is better
– ALS-0, ALS-2, ALS-4, ALS-6 drop out first
– ALS-1, ALS-5 still left after Iter-4
CADS-HPO Iteration, Percentage of training data
Iter-0 10%
Iter-1 20%
Iter-2 40%
Iter-3 80%
Iter-4 100%
RegParam Rank Alpha
ALS-0 0.1 5 1.0
ALS-1 0.1 5 100.0
ALS-2 0.1 10 1.0
ALS-3 0.1 10 100.0
ALS-4 0.01 5 1.0
ALS-5 0.01 5 100.0
ALS-6 0.01 10 1.0
ALS-7 0.01 10 100.0
24
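Two pieces of this slide are easy to make concrete: the error metric and the data allocation schedule. The sketch below computes RMSE and reproduces the 10%, 20%, 40%, 80%, 100% sequence by doubling up to a cap. The doubling rule is inferred from the listed percentages, and the candidate names and prediction values are hypothetical, not results from the experiment.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def allocation_schedule(start=10, cap=100):
    """Percentages of training data per iteration: double each time, end at cap."""
    schedule, pct = [], start
    while pct < cap:
        schedule.append(pct)
        pct *= 2
    schedule.append(cap)
    return schedule


# Hypothetical predictions scored against the constant rating of 1.0.
actual = [1.0, 1.0, 1.0, 1.0]
candidates = {
    "candidate_a": [0.9, 1.0, 0.8, 1.1],
    "candidate_b": [0.5, 0.4, 0.7, 0.6],
}
errors = {name: rmse(pred, actual) for name, pred in candidates.items()}
best = min(errors, key=errors.get)  # lower RMSE is better

print(allocation_schedule())
print(best)
```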
Future Work / Next Steps
• Retest as additional classification algorithms are added to
SparkML (e.g. Linear Support Vector Classifier)
• Develop additional SparkML scenarios using LDBC-SNB
• Continue exploring how to best tune SparkML algorithms
• Try Spark’s Dynamic Resource Allocation
• Evaluate automated feature selection
• Assess model evolution
• Optimize scoring performance
• Track Spark evolution
– SPARK-13857 (Feature parity for ALS ML with MLLIB)
– SPARK-14489 (RegressionEvaluator returns NaN for ALS in Spark ml)
– SPARK-19071 (Optimizations for ML Pipeline Tuning)
25
Summary & Conclusion
• Machine Learning Algorithm selection and tuning is
difficult and resource intensive
• Watson Machine Learning can make it easier
• A large computational cluster can accelerate high
quality model building
• We can leverage synthetic data generation
• A Spark-Optimized cluster can assist greatly
• There is LOTS more work to be done
26
Try Watson Machine Learning
http://datascience.ibm.com/features#machinelearning
27
Thank You.
Berni Schiefer
schiefer@ca.ibm.com
Training data volume and ALS
algorithm accuracy
• Assess the impact of training data volume on ALS recommendation accuracy
• Accuracy improves as we increase the training data set size
Data Split:
60% training
20% validation
20% test
29
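A 60/20/20 split like the one listed here can be sketched in plain Python. The function name and the seed are assumptions for the example; in Spark the equivalent step would typically use `DataFrame.randomSplit` with weights 0.6, 0.2, 0.2, which gives approximate rather than exact proportions.

```python
import random


def split_60_20_20(rows, seed=42):
    """Shuffle rows and cut them into 60% training, 20% validation, 20% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    train_end = len(shuffled) * 60 // 100
    valid_end = len(shuffled) * 80 // 100
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


train, validation, test = split_60_20_20(range(100))
print(len(train), len(validation), len(test))
```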

 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Recently uploaded (20)

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summit East talk by Berni Schiefer

  • 1. PRACTICAL LARGE SCALE EXPERIENCES WITH SPARK 2.1 MACHINE LEARNING Berni Schiefer IBM, Spark Technology Center
  • 2. Agenda • How IBM is leveraging SparkML • Our experimental environment – hardware and benchmark/workload • Focus areas for scalability exploration • Initial results • Future work
  • 3. IBM Data Science Experience • Learn: Built-in learning to get started or go the distance with advanced tutorials • Create: The best of open source and IBM value-add to create state-of-the-art data products • Collaborate: Community and social features that provide meaningful collaboration • Sign up today! - http://datascience.ibm.com
  • 4. The Machine Learning Workflow (IBM Watson Machine Learning) [Diagram labels: Data Scientist; ML Pipeline (data visualization, feature engineering, model training, model evaluation); History Data; Pipeline Model; deploying; Operational Data; scoring; Predictions; Developer/stakeholder; monitoring; Feedback Data; retraining; redeploying]
  • 5. Key Model Training Questions: • Which machine learning algorithm should I use? • For a chosen machine learning algorithm, what hyper-parameters should be tuned? Explosive search space!
  • 6. Data Scientist Workflow With CADS (© 2015 IBM Corporation) • Input: Supervised prediction problem, submitted to CADS • AI technology automatically determines best analytics pipeline • Data Scientist can interact with System • Cross-Platform Deployment and Evaluation [Diagram labels: User Interface (or REST-API directly); Learning Controller; Tactical Planner; Orchestrator and Scheduler; Analytic Monitoring and Adaptation; Analytic Platforms; Science of Analytics Repository; Knowledge Acquisition; External Knowledge about Analytics; Deployed Analytic]
  • 7. Model Selection via “Data Allocation using Upper Bounds” (DAUB) (© 2015 IBM Corporation) • Ranking based on upper bound estimate on performance of each pipeline (‘slope’ of learning curve) • https://arxiv.org/pdf/1601.00024.pdf [Diagram labels: Training Data; Logistic Regression, SVM, Random Forest, …; # Additional Data points (500, 500); Built Model; Prediction Accuracy versus # Data Points]
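The ranking idea on this slide can be illustrated with a deliberately simplified sketch. It assumes the upper bound is a straight-line extrapolation of the most recent learning-curve slope to the full training set size; the function names and the accuracy numbers are invented for illustration and are not taken from the talk or from the DAUB paper's full algorithm.

```python
def upper_bound_estimate(curve, full_size):
    """Optimistic accuracy at full_size, extrapolated from the last two points.

    curve: list of (n_points, accuracy) pairs sorted by n_points (at least two).
    Assumption: the bound is the latest accuracy plus the latest slope times
    the number of data points not yet used, capped at 1.0.
    """
    (n_prev, acc_prev), (n_last, acc_last) = curve[-2], curve[-1]
    slope = max(0.0, (acc_last - acc_prev) / (n_last - n_prev))
    return min(1.0, acc_last + slope * (full_size - n_last))


def rank_pipelines(curves, full_size):
    """Return pipeline names ordered from highest to lowest upper bound."""
    return sorted(
        curves,
        key=lambda name: upper_bound_estimate(curves[name], full_size),
        reverse=True,
    )


# Made-up learning curves after two allocations of 500 data points each.
example_curves = {
    "pipeline_a": [(500, 0.60), (1000, 0.70)],  # lower accuracy, steep slope
    "pipeline_b": [(500, 0.72), (1000, 0.73)],  # higher accuracy, flat slope
}
ranking = rank_pipelines(example_curves, full_size=2000)
```

In this invented example the steeper curve ranks first even though its current accuracy is lower, which is the behaviour a slope-based upper bound is meant to capture.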
  • 8. Our “F1” Spark SQL Cluster • A 28-node cluster – 2 management nodes (co-exist with 2 data nodes) – 28 data nodes • Lenovo x3650 M5 – 2 sockets (18 cores/socket) – 1.5TB RAM – 8x 2TB SSD – 2 racks • 20x 2U servers per rack (42U racks) – 1 switch, 100GbE, 32 ports, 1U • Mellanox SN2700
  • 9. “F1” Spark Platform Details • Each data node – CPU: 2x E5-2697 v4 @ 2.3GHz (Broadwell) (18c) – Memory: 1.5TB per server (24x 64GB DDR4 2400MHz) – Flash Storage: 8x 2TB SSD PCIe NVMe (Intel DC P3600), 16TB per server – Network: 100GbE adapter (Mellanox ConnectX-4 EN) – IO Bandwidth per server: 16GB/s, Network bandwidth 12.5GB/s – IO Bandwidth per cluster: 480GB/s, Network bandwidth 375GB/s • Totals per Cluster (counting 28 data nodes) – 1,008 cores, 2,016 threads (hyperthreading) – 42TB memory – 448TB raw flash, 224 NVMe drives
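The per-cluster totals follow from the per-node figures by multiplication. The sketch below reproduces that arithmetic; the constant and function names are illustrative. Note that the quoted per-cluster bandwidth figures (480GB/s IO, 375GB/s network) match 30 servers rather than 28; the slide does not explain the difference.

```python
# Per-data-node figures as quoted on the slide.
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 18
THREADS_PER_CORE = 2        # hyperthreading
MEMORY_TB_PER_NODE = 1.5
SSDS_PER_NODE = 8
SSD_TB = 2
IO_GB_PER_S_PER_NODE = 16.0
NET_GB_PER_S_PER_NODE = 12.5  # 100GbE


def cluster_totals(nodes):
    """Scale the per-node figures to a cluster of `nodes` servers."""
    cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
    return {
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "memory_tb": nodes * MEMORY_TB_PER_NODE,
        "nvme_drives": nodes * SSDS_PER_NODE,
        "raw_flash_tb": nodes * SSDS_PER_NODE * SSD_TB,
        "io_gb_per_s": nodes * IO_GB_PER_S_PER_NODE,
        "network_gb_per_s": nodes * NET_GB_PER_S_PER_NODE,
    }


totals_28 = cluster_totals(28)  # matches the "Totals per Cluster" bullets
totals_30 = cluster_totals(30)  # matches the quoted per-cluster bandwidth
```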
  • 10. What about the Data Set? • Desired Data Set / Data Generator Properties – Realistic (we want to realistically exercise ML algorithms) – Synthetic (no issues with data privacy/ownership) – Scalable (desire to scale data volumes up/down) – Source Available (to make changes) • In particular we wanted to be able to “salt” the data (if needed) • Selected Social Network Benchmark from LDBC
  • 11. What is the LDBC? Linked Data Benchmark Council = LDBC (http://ldbcouncil.org) • Industry entity similar to TPC (www.tpc.org) • Focusing on graph and RDF store benchmarking • + non-profit members (FORTH) & personal members • + Task Forces, volunteers developing benchmarks • + TUC: Technical User Community
  • 12. LDBC benchmarks consist of... • Four main elements: – data schema: defines the structure of the data – workloads: define the set of operations to perform – performance metrics: used to measure (quantitatively) the performance of the systems – execution rules: defined to ensure that the results from different executions of the benchmark are valid and comparable • Software as Open Source (GitHub) – data generator, query drivers, validation tools, ...
  • 14. Initial Focus Areas • Prediction System Scalability – Evaluate 6 different algorithms to determine which can best predict interest in a “topic class” • Recommendation System Scalability – Collaborative Filtering using the ALS (Alternating Least Squares) algorithm, evaluating multiple parameters, each with multiple values, to recommend a topic to a person • Using Watson Machine Learning with embedded Cognitive Automation of Data Science (CADS)
  • 15. Prediction Algorithms • Goal: Given information about what “topic classes” a person is known to be interested in, how well can we predict whether the person will be interested in another topic class? • Algorithms competing – Logistic Regression – Support Vector Machine (SVMWithSGD) – Decision Tree – Random Forest – Gradient-Boosted Trees – Multilayer Perceptron • Experiment: How well can we scale the evaluation of multiple machine learning algorithms to reduce elapsed time?
  • 16. Data Preparation for Prediction (using the 100TB LDBC-SNB data set)
    Table: person_hasInterest_tag (generated, 98.4GB, 5.47 billion rows)
    | Person.id | Tag.id |
    | 1 | 6 |
    | 1 | 138 |
    | 1 | 523 |
    | 1 | 573 |
    | 1 | 576 |
    | 1 | 775 |
    | 1 | 777 |
    | 1 | 973 |
    | 2 | 3 |
    | … | … |
    Table: tag_hasType_tagclass (generated, 145KB, 16080 rows)
    | Tag.id | TagClass.id |
    | 0 | 349 |
    | 1 | 211 |
    | 2 | 98 |
    | 3 | 13 |
    | 4 | 13 |
    | 5 | 13 |
    | 6 | 3 |
    | 7 | 82 |
    | 8 | 88 |
    | … | … |
    Step 1: join on “Tag.id” column, keep “distinct rows”
    Table: person_hasInterest_tagclass (derived, 1.8 billion rows)
    | Person.id | TagClass.id |
    | 1 | 3 |
    | 2 | 13 |
    | … | … |
    Step 2: Replace “TagClass.id” column with columns for each unique TagClass.id (derived, 234 million rows, 63 columns, 30.1 GB CSV file stored on HDFS)
    | Person.id | TagClass.id.0 | TagClass.id.3 | TagClass.id.13 | … | TagClass.id.115 | … | TagClass.id.355 |
    | 1 | 0.0 | 1.0 | 0.0 | … | 0.0 | … | 0.0 |
    | 2 | 0.0 | 0.0 | 1.0 | … | 0.0 | … | 0.0 |
    | … | … | … | … | … | … | … | … |
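The join-and-distinct step and the column-expansion step can be sketched in plain Python on a few rows. This is not the Spark job used in the talk; it only illustrates the logic on in-memory data. The function names are hypothetical, the row (2, 4) is added for illustration so that the de-duplication has something to collapse, and the sketch assumes one output row per person.

```python
from collections import defaultdict

# (Person.id, Tag.id) rows; (1, 6) and (2, 3) appear on the slide,
# (2, 4) is a hypothetical extra row used to show de-duplication.
person_has_interest_tag = [(1, 6), (2, 3), (2, 4)]

# Tag.id -> TagClass.id, values taken from the slide.
tag_has_type_tagclass = {3: 13, 4: 13, 6: 3}


def join_distinct(person_tag, tag_to_class):
    """Join on Tag.id and keep distinct (Person.id, TagClass.id) rows."""
    return sorted({(p, tag_to_class[t]) for p, t in person_tag if t in tag_to_class})


def pivot_tagclasses(person_class):
    """Replace the TagClass.id column with one 0.0/1.0 column per class."""
    classes = sorted({c for _, c in person_class})
    interests = defaultdict(set)
    for person, tag_class in person_class:
        interests[person].add(tag_class)
    header = ["Person.id"] + [f"TagClass.id.{c}" for c in classes]
    rows = [
        [person] + [1.0 if c in interests[person] else 0.0 for c in classes]
        for person in sorted(interests)
    ]
    return header, rows


person_has_interest_tagclass = join_distinct(person_has_interest_tag, tag_has_type_tagclass)
header, rows = pivot_tagclasses(person_has_interest_tagclass)
```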
  • 17. Classification workload – Elapsed time by cluster size • We wanted to assess the scalability of the CADS algorithm using SparkML as we increased the node count from 1 to 14 • Elapsed time was shortest with 4 Data Nodes (144 cores) • Spark tuning – Executors per node: 36 – Executor memory: 32GB – Executor cores: 2 – Driver memory: 32GB – Driver cores: 4
  • 18. Classification workload – Elapsed time by cluster size • Next, we assessed the scalability of the CADS algorithm using SparkML as we increased the node count from 1 to 14, with a fixed number of Spark executors (144) • With 144 Spark executors, elapsed time decreases as we add Data Nodes • Conclusion: Too many Spark executors can hurt performance • Spark tuning – Executors: 144 – Executor memory: 32GB – Executor cores: 2 – Driver memory: 32GB – Driver cores: 4
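The two classification experiments differ only in how the executor count is chosen: 36 executors per node in the first, a fixed 144 in the second. A minimal sketch of that difference, expressed with standard Spark configuration property names; the helper functions are hypothetical and the talk does not show how the settings were actually applied.

```python
def executor_count(data_nodes, executors_per_node=None, fixed_executors=None):
    """Total executors for a run; exactly one sizing policy must be given."""
    if (executors_per_node is None) == (fixed_executors is None):
        raise ValueError("set exactly one of executors_per_node or fixed_executors")
    if fixed_executors is not None:
        return fixed_executors
    return data_nodes * executors_per_node


def spark_conf(num_executors):
    """Spark properties corresponding to the tuning values on the slides."""
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.memory": "32g",
        "spark.executor.cores": "2",
        "spark.driver.memory": "32g",
        "spark.driver.cores": "4",
    }


# Per-node policy: the executor count grows with the cluster.
per_node = {n: executor_count(n, executors_per_node=36) for n in (1, 4, 14)}
# Fixed policy: the executor count stays at 144 regardless of node count.
fixed = {n: executor_count(n, fixed_executors=144) for n in (1, 4, 14)}
conf_fixed = spark_conf(fixed[14])
```

With 36 executors per node, 4 nodes give 144 executors and 14 nodes give 504, so the fixed-144 experiment keeps the executor count at the level where the per-node experiment ran fastest.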
  • 19. Speeding up Algorithm Selection • Here we compare CADS evaluating 6 classification algorithms individually (separate Spark jobs), versus doing all 6 algorithms concurrently (single Spark job) • Recommendation: Let CADS compare all algorithms at the same time
  • 20. Recommendation Algorithm • Goal: Given a matrix of people and topics, and information about which people like which topics, can we recommend other topics that they may be interested in? • Algorithm chosen: ALS (Alternating Least Squares) • Hyper-parameters – Regularization Parameter (0.1, 0.01) – Rank (5, 10) – Alpha (1.0, 100.0) • Experiment: How well can we parallelize and scale the evaluation of the combinations of Hyper-parameter values of ALS?
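Three hyper-parameters with two values each give 2 x 2 x 2 = 8 combinations to evaluate. A short sketch of enumerating them; the dictionary keys follow Spark ML's ALS parameter names (regParam, rank, alpha), and the "ALS-n" labels follow the numbering used in the deck's error-comparison slide. The helper name is illustrative.

```python
from itertools import product

PARAM_GRID = {
    "regParam": [0.1, 0.01],
    "rank": [5, 10],
    "alpha": [1.0, 100.0],
}


def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


combinations = expand_grid(PARAM_GRID)
labelled = {f"ALS-{i}": combo for i, combo in enumerate(combinations)}
```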
  • 21. Data Preparation for ALS (1 of 2) (using the 3TB LDBC-SNB data set)
    Table: person (generated, 747 MB, 8.1 million rows)
    | Id (type: Long) | firstName | lastName | gender | birthday | creationDate | locationIP | browserUsed |
    | 933 | Mahinda | Perera | male | 1989-12-04 | 2010-03-17T23:32:10.447 | 192.248.2.123 | Firefox |
    | … | … | … | … | … | … | … | … |
    | 35184381044707 | Dặng Dinh | Doan | female | 1981-08-22 | 2012-10-14T08:38:49.331 | 82.236.112.234 | Chrome |
    | … | … | … | … | … | … | … | … |
    Step 1: Add alternate person id (type Int) that the ALS algorithm will use
    | Id (type: Long) | new_id (type: Int) | firstName | lastName | gender | birthday | creationDate | locationIP | browserUsed |
    | 933 | 1 | Mahinda | Perera | male | 1989-12-04 | 2010-03-17T23:32:10.447 | 192.248.2.123 | Firefox |
    | … | … | … | … | … | … | … | … | … |
    | 35184381044707 | 8099641 | Dặng Dinh | Doan | female | 1981-08-22 | 2012-10-14T08:38:49.331 | 82.236.112.234 | Chrome |
    | … | … | … | … | … | … | … | … | … |
    We need the “Long” type to hold the wide person ids generated by the LDBC-SNB data generator, but the “user” member of the Spark MLlib “Rating” class is of type Int. For compatibility, so that the ALS algorithm can be tested with either Spark MLlib or SparkML, we create an alternate id for each person, stored as type Int.
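One way to build such an alternate id is to number the distinct person ids densely starting from 1. The sketch below assumes ids are numbered in ascending order of the original id; the slide does not say how new_id was actually assigned, and the function name is hypothetical.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit Int


def assign_int_ids(long_ids):
    """Map each distinct 64-bit id to a dense 1-based id that fits in an Int."""
    mapping = {}
    for new_id, long_id in enumerate(sorted(set(long_ids)), start=1):
        if new_id > INT_MAX:
            raise OverflowError("more distinct ids than a 32-bit Int can hold")
        mapping[long_id] = new_id
    return mapping


# Two person ids from the slide; the duplicate shows that ids are de-duplicated.
new_id_by_person = assign_int_ids([35184381044707, 933, 933])
```

With only two ids in this toy input, the wide id maps to 2 rather than the 8099641 shown on the slide, which reflects its position among all 8.1 million persons.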
  • 22. Data Preparation for ALS (2 of 2)
    Table: person (with “new_id” added, 8.1 million rows)
    | id | new_id | … |
    | 933 | 1 | … |
    | … | … | … |
    | 35184381044707 | 8099641 | … |
    | … | … | … |
    Table: person_hasInterest_tag (generated, 3.39 GB, 189 million rows)
    | Person.id | Tag.id |
    | 933 | 6 |
    | 933 | 138 |
    | 933 | 523 |
    | 933 | 573 |
    | 933 | 576 |
    | 933 | 775 |
    | 933 | 777 |
    | 933 | 787 |
    | 933 | 973 |
    | … | … |
    Step 2: Join on “Person.id”, add “rating” column (value in “rating” column is always 1.0)
    Table: ratings (derived, 189 million rows, 8.1 million distinct users, 16080 items, 3.3 GB CSV file stored on HDFS)
    | new_id | Tag.id | rating |
    | 1 | 6 | 1.0 |
    | 1 | 138 | 1.0 |
    | 1 | 523 | 1.0 |
    | 1 | 573 | 1.0 |
    | 1 | 576 | 1.0 |
    | 1 | 775 | 1.0 |
    | 1 | 777 | 1.0 |
    | 1 | 787 | 1.0 |
    | 1 | 973 | 1.0 |
    | … | … | … |
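The derivation of the ratings table can be sketched on a few in-memory rows: look up each person's new_id and attach a constant rating of 1.0. This illustrates the join logic only and is not the Spark job from the talk; the function name is hypothetical.

```python
def build_ratings(person_tag, new_id_by_person, rating=1.0):
    """Join (Person.id, Tag.id) rows to new_id and add a constant rating."""
    return [
        (new_id_by_person[person], tag, rating)
        for person, tag in person_tag
        if person in new_id_by_person  # inner join: drop unknown persons
    ]


# First rows of person_hasInterest_tag and the id mapping from the slide.
sample_person_tag = [(933, 6), (933, 138), (933, 523)]
sample_new_ids = {933: 1}
ratings = build_ratings(sample_person_tag, sample_new_ids)
```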
  • 23. ALS hyper-parameter tuning – Elapsed time by cluster size • We wanted to assess the scalability of CADS-HPO with the ALS algorithm using SparkML as we increased the node count from 1 to 14 • Elapsed time was shortest with 14 Data Nodes • Spark tuning – Executors per node: 36 – Executor memory: 32GB – Executor cores: 2 – Driver memory: 32GB – Driver cores: 4
  • 24. “Error” of competing ALS hyper-parameter combinations
    • Used error metric “root mean squared error” (RMSE), lower is better
      – ALS-0, ALS-2, ALS-4, ALS-6 drop out first
      – ALS-1, ALS-5 still left after Iter-4
    | CADS-HPO Iteration | Percentage of training data |
    | Iter-0 | 10% |
    | Iter-1 | 20% |
    | Iter-2 | 40% |
    | Iter-3 | 80% |
    | Iter-4 | 100% |
    | Combination | RegParam | Rank | Alpha |
    | ALS-0 | 0.1 | 5 | 1.0 |
    | ALS-1 | 0.1 | 5 | 100.0 |
    | ALS-2 | 0.1 | 10 | 1.0 |
    | ALS-3 | 0.1 | 10 | 100.0 |
    | ALS-4 | 0.01 | 5 | 1.0 |
    | ALS-5 | 0.01 | 5 | 100.0 |
    | ALS-6 | 0.01 | 10 | 1.0 |
    | ALS-7 | 0.01 | 10 | 100.0 |
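Two pieces of this slide lend themselves to a small sketch: the RMSE metric and the growth of the training-data share across iterations. The percentages in the table are consistent with doubling from 10% and capping at 100%; whether CADS-HPO computes them this way is an assumption, and the function names are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def doubling_schedule(start_pct=10, cap_pct=100):
    """Percentages of training data per iteration: double until the cap."""
    schedule = []
    pct = start_pct
    while pct < cap_pct:
        schedule.append(pct)
        pct *= 2
    schedule.append(cap_pct)
    return schedule


schedule = doubling_schedule()          # one entry per Iter-0 .. Iter-4
example_error = rmse([1.0, 0.0], [1.0, 1.0])  # toy values, not from the talk
```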
  • 25. Future Work / Next Steps • Retest as additional classification algorithms are added to SparkML (e.g. Linear Support Vector Classifier) • Develop additional SparkML scenarios using LDBC-SNB • Continue exploring how to best tune SparkML algorithms • Try Spark’s Dynamic Resource Allocation • Evaluate automated feature selection • Assess model evolution • Optimize scoring performance • Track Spark evolution – SPARK-13857 (Feature parity for ALS ML with MLLIB) – SPARK-14489 (RegressionEvaluator returns NaN for ALS in Spark ml) – SPARK-19071 (Optimizations for ML Pipeline Tuning)
  • 26. Summary & Conclusion • Machine Learning Algorithm selection and tuning is difficult and resource intensive • Watson Machine Learning can make it easier • A large computational cluster can accelerate high quality model building • We can leverage synthetic data generation • A Spark-Optimized cluster can assist greatly • There is LOTS more work to be done
  • 27. Try Watson Machine Learning http://datascience.ibm.com/features#machinelearning
  • 29. Training data volume and ALS algorithm accuracy • Assess impact of training data volume on ALS recommendation accuracy • Accuracy improves as we increase the training data set size • Data split: 60% training, 20% validation, 20% test
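A 60/20/20 split like the one quoted can be sketched as a seeded shuffle followed by slicing. How the split was actually produced in the experiments is not stated in the talk; the function name and seed are illustrative.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle rows reproducibly and split into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100
    n_validation = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # remainder goes to test
    return train, validation, test


train, validation, test = split_60_20_20(range(100))
```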