Machine Learning (ML) and
TACC Supercomputers
A little about me
• Data Scientist at the Texas Advanced Computing Center (TACC)
• My contact: atrivedi@tacc.utexas.edu
• TACC: an independent research center at UT Austin
• TACC: one of the largest HIPAA-compliant supercomputing centers
• ~250 faculty, researchers, students and staff
• We provide support for large-scale computing problems
Some Basic Observations
• There are fundamental differences in data access patterns between Data Intensive Computing and High Performance Computing (HPC)
• Today, most ML researchers want or need to work with big data, vectorization, code optimization, etc.
Data Intensive Computing
• Specializes in dealing effectively with vast quantities of data in distributed environments
• Generates high demand for computational resources, e.g. storage capacity, processing power, etc.
Data Intensive Computing & Big Data
• Big data plays the key role in the popularity and growth of data intensive computing
• Increases the volume of data
• Improves the accuracy of existing algorithms
• Helps create better predictive models
• Increases the complexity
What’s the challenge with big data analysis?
• Big data analysis requires even more computational resources
  • Storage is triple the standard data size
  • Algorithms use large numbers of data points and are memory intensive
• Big data analysis takes much longer
  • A typical hard drive read speed is about 150 MB/sec
  • At that speed, reading 1 TB takes about 2 hours
  • Analysis could require processing time proportional to the size of the data
  • Data analysis at a rate of 1 MB/second would require about 11 days to finish for 1 TB of data (equivalently, 1 GB/second for 1 PB)
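The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a constant transfer rate with no seek or protocol overhead; the helper name `transfer_seconds` is ours, not from the talk.

```python
# Back-of-the-envelope timings, assuming decimal units and a constant rate.
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15

def transfer_seconds(size_bytes, rate_bytes_per_sec):
    """Time to stream size_bytes at a fixed rate, ignoring seeks and overhead."""
    return size_bytes / rate_bytes_per_sec

read_hours = transfer_seconds(TB, 150 * MB) / 3600        # disk read of 1 TB
days_1tb_at_1mb = transfer_seconds(TB, 1 * MB) / 86400    # analysis at 1 MB/s
minutes_1tb_at_1gb = transfer_seconds(TB, 1 * GB) / 60    # analysis at 1 GB/s
days_1pb_at_1gb = transfer_seconds(PB, 1 * GB) / 86400    # 1 PB at 1 GB/s

print(f"1 TB at 150 MB/s: {read_hours:.2f} hours")
print(f"1 TB at 1 MB/s:   {days_1tb_at_1mb:.1f} days")
print(f"1 TB at 1 GB/s:   {minutes_1tb_at_1gb:.1f} minutes")
print(f"1 PB at 1 GB/s:   {days_1pb_at_1gb:.1f} days")
```

An 11-day run therefore corresponds to about 1 MB/second for 1 TB, or 1 GB/second for 1 PB; 1 TB at 1 GB/second takes well under an hour.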
High Performance Computing (HPC)
• Hardware with more computational power per compute node
• Computation can be done across multiple nodes
• Provides highly efficient numeric processing in distributed environments
• HPC has seen recent growth in shared memory architectures
Sample TACC Computing Cluster
Combine HPC & Data Intensive Computing
• The intersection of these two domains is mainly driven by the use of machine learning (ML)
• ML methodologies help extract knowledge from big data
• These hybrid environments:
  • take advantage of data locality
  • keep data exchanges over the network at a manageable level
  • offer high performance through distributed libraries
TACC Ecosystem
• Stampede – Traditional cluster HPC system
• Stockyard and Corral – 25 petabytes of combined disk storage for all data needs
• Ranch – 160 petabytes of tape archive storage
• Maverick/Rustler/Rodeo – “Niche” systems with GPU clusters, great for data analytics and visualization
• Wrangler – A new generation of data-intensive supercomputer
TACC Ecosystem Goals
• Goal to address the data problem in multiple dimensions
  • Supports data in large and small scales
  • Supports data reliability
  • Supports data security
  • Supports multiple data types: structured and unstructured
  • Supports sequential access
  • Fast for large files
• Goal to support a wide range of applications and interfaces
  • Hadoop (and Mahout) & Spark (and MLlib)
  • Traditional R, GIS, DBs, and other HPC-style workflows
• Goal to support the full data lifecycle
  • Metadata and collection management support
Why use TACC Supercomputers?
• Need to analyze large datasets quickly
• Need a more on-demand, interactive analysis environment
• Need to work with databases at high transaction rates
• Have a Hadoop or Spark workflow that needs a large HDFS datastore
• Have a dataset that many users will compute with or analyze
• Need a system with data management capabilities
• Have a job that is currently I/O bound
TACC Success Stories
Available ML tools/libraries in TACC Supercomputers
• Scikit-learn
• Caffe
• Theano
• CUDA/cuDNN
• Hadoop
• PyHadoop
• RHadoop
• Mahout
• Spark
• PySpark
• SparkR
• MLlib
Two Sample ML workflows in TACC Supercomputers
• GPU-powered deep learning on MRI images with NVIDIA DIGITS on the Maverick supercomputer
• Pubmed recommender system on the Wrangler supercomputer
Deep Learning on Images
• Deep neural networks are computationally quite demanding
• The input is large even at a small image resolution
  • A 256 x 256 RGB image implies 196,608 input neurons (256 x 256 x 3)
• Many of the floating-point matrix operations involved can be handled by GPUs
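As a quick illustration of why the input size matters, the sketch below reproduces the 196,608 figure and counts the parameters of a single fully connected layer on top of it. The layer width of 1000 hidden units is a hypothetical value chosen only for illustration; it does not come from the talk.

```python
def input_neurons(height, width, channels=3):
    """Number of input values for an image with the given shape."""
    return height * width * channels

def dense_layer_parameters(n_inputs, hidden_units):
    """One weight per (input, hidden unit) pair plus one bias per hidden unit."""
    return n_inputs * hidden_units + hidden_units

n_in = input_neurons(256, 256)                             # 256 x 256 RGB image
n_params = dense_layer_parameters(n_in, hidden_units=1000) # hypothetical width

print(n_in)      # 196608
print(n_params)  # 196609000
```

Each forward pass through such a layer is one large matrix-vector product, which is the kind of operation GPUs are built for.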
Deep Learning on MRI using TACC Supercomputers
• Maverick has large GPU clusters
• There are three major GPU-enabled deep learning frameworks available: Theano, Torch and Caffe
• We use NVIDIA DIGITS (based on Caffe), a web server that provides a convenient web interface for training and testing deep neural networks
• For classification of MRI/images we use a convolutional DNN to learn the features
• We use CUDA 7, cuDNN, Caffe and DIGITS on Maverick to classify our MRI/images
• Over the course of 30 epochs, our classification accuracy ranges from 74.21% to 82.09%
Pubmed Recommender System in Wrangler
What is a Recommendation System?
• A recommender system helps match users with items
• It uses implicit or explicit user feedback or item suggestions
• Our recommendation system:
  • We try to build a model that recommends Pubmed documents to users, based on the user's search profile
Types of Recommender Systems
Type | Pros | Cons
Knowledge-based (i.e., search) | Deterministic recommendations, assured quality, no cold start | Knowledge engineering effort to bootstrap, basically static
Content-based | No community required, comparison between items possible | Content descriptions necessary, cold start for new users
Collaborative | No knowledge-engineering effort, serendipity of results | Requires some form of rating feedback, cold start for new users and new items
Using Vector Space Model (VSM) for Pubmed
• Given:
  • A set of Pubmed documents
  • N features (unique terms) describing the documents in the set
• VSM builds an N-dimensional vector space
• Each item/document is represented as a point in the vector space
• Information retrieval based on search
  • Query: a point in the vector space
• We apply TF-IDF to the tokenized documents to weight the terms and convert the documents to vectors
• We compute cosine similarity between the document vectors and the query vector
• We select the top 3 documents matching our query
• We weight the query term in the sparse matrix and rank documents
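The retrieval steps on this slide can be sketched in plain Python. This is a minimal illustration, not the code used in the talk: the whitespace tokenizer, the smoothed IDF formula, the helper names and the five short "documents" are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_idf(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_matches(query, docs, k=3):
    """Return the k best (score, document index) pairs for the query."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vectors = [tfidf_vector(t, idf) for t in tokenized]
    query_vector = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(query_vector, dv), i) for i, dv in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]

# Made-up stand-ins for Pubmed abstracts.
docs = [
    "mri imaging of brain tumors",
    "deep learning for mri image classification",
    "protein folding simulation on supercomputers",
    "recommender systems for biomedical literature",
    "brain tumor classification with deep learning",
]
ranked = top_matches("mri brain tumor classification", docs)
for score, index in ranked:
    print(f"{score:.3f}  {docs[index]}")
```

Dict-based sparse vectors keep the example dependency-free; a library vectorizer would normally replace the hand-written TF-IDF on a real corpus.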
MPI or Hadoop or Spark?
Which is really more suitable for this ML problem in an HPC system?
Message Passing in HPC
• Message Passing Interface (MPI) was one of the key factors that supported the initial growth of cluster computing
• MPI helped shape what the HPC world has become today
• MPI has supported a substantial majority of all supercomputing work
• Scientists and engineers have relied upon MPI for decades
• MPI works great for data intensive computing in a GPU cluster
Why MPI is not the best tool for ML
• A researcher/developer working with MPI needs to manually decompose the common data structures across processors
• Every update of the data structure needs to be recast into a flurry of messages, syncs, and data exchanges
• Programming at the transport layer is an awkward fit for numerical application developers
• This led to the advent of other techniques
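To give a feel for the bookkeeping that manual decomposition involves, here is a small sketch in plain Python. It does not use a real MPI binding: the ranks are simulated with a loop, and `block_range` is a hypothetical helper written for this example.

```python
def block_range(n_items, n_ranks, rank):
    """Half-open [start, stop) slice owned by `rank` in a block decomposition.

    The first (n_items % n_ranks) ranks each receive one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

data = list(range(10))
n_ranks = 4  # simulated process count

# Each simulated rank works only on the slice it owns...
partial_sums = []
for rank in range(n_ranks):
    start, stop = block_range(len(data), n_ranks, rank)
    partial_sums.append(sum(data[start:stop]))

# ...and a reduction combines the partial results. With MPI this step
# would be an explicit message exchange between processes.
total = sum(partial_sums)
print(partial_sums, total)
```

Even for a one-dimensional sum, the index arithmetic and the combine step are the developer's responsibility; richer data structures multiply that effort.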
Choosing Hadoop over MPI
• Hadoop is an open source implementation of the MapReduce programming model in Java
• It has interfaces to other programming languages such as R, Python, etc.
• Hadoop includes:
  • HDFS: a distributed file system based on the Google File System (GFS)
  • YARN: a resource manager that assigns resources to the computational tasks
  • MapReduce: a library that makes efficient distributed data processing easy
• Mahout: scalable machine learning and data mining library
• Hadoop streaming: a generic API that allows writing mappers and reducers in any language
• Hadoop is a good fit for large single-pass data processing, but has its own limitations
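The MapReduce model named on this slide can be illustrated with a word count that runs in a single Python process. This is only a sketch of the programming model: the `run_job` driver stands in for the framework's shuffle and sort, and the function names and input lines are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Sum all counts emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    # Map phase: every input line is processed independently.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: the framework groups values by key between stages.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )

lines = ["big data meets HPC", "big data needs big storage"]
word_counts = run_job(lines)
print(word_counts)
```

Because the mapper and reducer share no state, the same two functions could be distributed across many nodes, which is what makes the model a good fit for single-pass processing.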
Limitations of Hadoop in HPC
• Hadoop's MapReduce mandates writing output to disk after every Map/Reduce stage
  • In HPC, writing output to disk could be sped up with caching or SSDs
• In general, this rendered Hadoop unusable for many ML approaches that require iteration or interactive use
• The real issue with Hadoop was its HDFS file system
  • HDFS was intimately tied to Hadoop cluster scheduling
• The large-scale ML community sought in-memory approaches to avoid this problem
Spark
• For large-scale technical computing, one very promising in-memory approach is Spark
• Spark does not impose MapReduce-style requirements
• Spark can run standalone, without a scheduler like YARN
• It has interfaces to other programming languages such as R, Python, etc.
• Spark supports HDFS through YARN
• MLlib: scalable machine learning and data mining library
• Spark Streaming: enables stream processing of live data streams
Our Recommendation Model
• We apply collaborative filtering on the weighted/ranked documents
• We use Alternating Least Squares (pyspark.mllib.recommendation.ALS) for recommending Pubmed documents
  • MatrixFactorizationModel.recommendProducts(user, num), where num is the number of products to recommend
• We use collaborative filtering in Scikit-learn and Hadoop as baselines
  • We use the python-recsys library along with Python Scikit-learn
    • svd.recommend(int product_id)
  • We use Mahout's Alternating Least Squares for Hadoop
• A comparative study of our model shows improved performance in Spark
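To show the idea behind Alternating Least Squares without depending on Spark, the sketch below factorizes a tiny rating matrix with a single latent factor in plain Python. It is not MLlib's implementation: the rank is fixed at 1 so each least-squares solve has a closed form, and the ratings, the regularization value and all function names are made up for illustration.

```python
import math

def als_rank1(ratings, n_users, n_items, reg=0.01, n_iters=50):
    """Approximate each observed rating r[u, i] by user[u] * item[i]."""
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(n_iters):
        # Fix the item factors and solve a regularized least-squares
        # problem for each user.
        for u in range(n_users):
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            numerator = sum(r * item[i] for i, r in rated)
            user[u] = numerator / (reg + sum(item[i] ** 2 for i, _ in rated))
        # Fix the user factors and solve for each item.
        for i in range(n_items):
            raters = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            numerator = sum(r * user[u] for u, r in raters)
            item[i] = numerator / (reg + sum(user[u] ** 2 for u, _ in raters))
    return user, item

def rmse_observed(ratings, user, item):
    squared = [(r - user[u] * item[i]) ** 2 for (u, i), r in ratings.items()]
    return math.sqrt(sum(squared) / len(squared))

def recommend(user_id, num, ratings, user, item):
    """Top `num` items the user has not rated, as (item, predicted score)."""
    seen = {i for (u, i) in ratings if u == user_id}
    scores = [(user[user_id] * item[i], i) for i in range(len(item)) if i not in seen]
    scores.sort(reverse=True)
    return [(i, score) for score, i in scores[:num]]

# Hypothetical (user, document) -> rating data with three entries missing.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
    (1, 0): 2.0, (1, 1): 4.0, (1, 3): 8.0,
    (2, 1): 3.0, (2, 2): 4.5, (2, 3): 6.0,
}
user_f, item_f = als_rank1(ratings, n_users=3, n_items=4)
error = rmse_observed(ratings, user_f, item_f)
recs = recommend(0, 1, ratings, user_f, item_f)
print(f"RMSE on observed ratings: {error:.4f}")
print("Recommendation for user 0:", recs)
```

The second argument of `recommend` mirrors the role of `num` in `recommendProducts`: it caps how many items are returned, not how many training iterations are run.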
Performance Evaluation of Pubmed Recommendation Model
• We evaluate our recommendation model using Python Scikit-learn, Apache Mahout and PySpark MLlib on Wrangler
• The recommendation model is evaluated with Root Mean Square Error (RMSE) and Mean Absolute Error (MAE)
• The lower the errors, the more accurate the model
• The lower the time taken to train/test the model, the better the performance
Algo: Type | Public Dataset | Python ML library | Eval Test | Model Training Time | Model Test Time
Recommendation | Weighted Pubmed Documents | Python Scikit | RMSE=17.96%, MAE=16.53% | 42 secs | 19 secs
Recommendation | Weighted Pubmed Documents | Hadoop Mahout | RMSE=16.02%, MAE=14.98% | 38 secs | 14 secs
Recommendation | Weighted Pubmed Documents | PySpark MLlib | RMSE=15.88%, MAE=14.23% | 34 secs | 11 secs
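For reference, the two error metrics used in this evaluation can be computed as follows. The rating values are invented for the example and are unrelated to the results reported above.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean Absolute Error: the average size of the error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical held-out ratings and model predictions.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE = {rmse_value:.3f}")  # RMSE = 0.750
print(f"MAE  = {mae_value:.3f}")   # MAE  = 0.625
```

RMSE is never smaller than MAE on the same data, which matches the ordering of the two figures in every row of the results.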
THANK YOU!
Questions?

Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT Austin at MLconf ATL - 9/18/15

 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Dernier

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Dernier (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT Austin at MLconf ATL - 9/18/15

  • 1. Machine Learning (ML) and TACC Supercomputers
  • 2. A little about me
      • Data Scientist at Texas Advanced Computing Center (TACC)
      • Contact: atrivedi@tacc.utexas.edu
      • TACC is an independent research center at UT Austin
      • TACC is one of the largest HIPAA-compliant supercomputer centers
      • ~250 faculty, researchers, students and staff
      • We provide support for large-scale computing problems
  • 3. Some Basic Observations
      • There are fundamental differences in data access patterns between Data Intensive Computing and High Performance Computing (HPC)
      • Today, most ML researchers want or need to work with Big Data, vectorization, code optimization, etc.
  • 4. Data Intensive Computing
      • Specializes in dealing effectively with vast quantities of data in distributed environments
      • Generates high demand for computational resources, e.g. storage capacity and processing power
  • 5. Data Intensive Computing & Big Data
      • Big data plays the key role in the popularity and growth of data intensive computing
      • Increases the volume of data
      • Improves accuracy of existing algorithms
      • Helps create better predictive models
      • Increases the complexity
  • 6. What's the challenge with big data analysis?
  • 7. Big data analysis requires even more computational resources
      • Storage is triple the standard data size
      • Algorithms use large numbers of data points and are memory intensive
      • Big data analysis takes much longer
      • Typical hard drive read speed is about 150 MB/sec
      • At that speed, reading 1 TB takes about 2 hours
      • Analysis could require processing time proportional to the size of the data
      • Data analysis at a rate of 1 MB/second would require about 11 days to finish for 1 TB of data
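The timing figures on this slide can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function and variable names are illustrative only. It suggests that the 11-day figure corresponds to a processing rate of roughly 1 MB/second, since 1 TB at 1 GB/second would take under 20 minutes.

```python
# Back-of-the-envelope processing times, assuming decimal units.
MB = 10**6
GB = 10**9
TB = 10**12


def seconds_to_process(size_bytes, rate_bytes_per_sec):
    """Time needed to stream size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_sec


# Reading 1 TB from a disk that sustains 150 MB/sec.
read_hours = seconds_to_process(TB, 150 * MB) / 3600

# Analysing 1 TB at 1 MB/sec versus 1 GB/sec.
slow_days = seconds_to_process(TB, 1 * MB) / 86400
fast_minutes = seconds_to_process(TB, 1 * GB) / 60

print(f"read 1 TB at 150 MB/s: {read_hours:.2f} hours")
print(f"analyse 1 TB at 1 MB/s: {slow_days:.2f} days")
print(f"analyse 1 TB at 1 GB/s: {fast_minutes:.2f} minutes")
```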
  • 8. High Performance Computing (HPC)
      • Hardware with more computational power per compute node
      • Computation can be done with multiple nodes
      • Provides highly efficient numeric processing in distributed environments
      • HPC has seen recent growth in shared-memory architectures
  • 10. Combine HPC & Data Intensive Computing
      • The intersection of these two domains is mainly driven by the use of machine learning (ML)
      • ML methodologies help extract knowledge from big data
      • These hybrid environments:
          • take advantage of data locality
          • keep data exchanges over the network at a manageable level
          • offer high performance through distributed libraries
  • 11. TACC Ecosystem
      • Stampede: traditional cluster HPC system
      • Stockyard and Corral: 25 petabytes of combined disk storage for all data needs
      • Ranch: 160 petabytes of tape archive storage
      • Maverick/Rustler/Rodeo: "niche" systems with GPU clusters, great for data analytics and visualization
      • Wrangler: a new generation of data-intensive supercomputer
  • 12. TACC Ecosystem Goals
      • Goal to address the data problem in multiple dimensions
          • Supports data at large and small scales
          • Supports data reliability
          • Supports data security
          • Supports multiple data types: structured and unstructured
          • Supports sequential access
          • Fast for large files
      • Goal to support a wide range of applications and interfaces
          • Hadoop (and Mahout) and Spark (and MLlib)
          • Traditional R, GIS, DBs, and other HPC-style workflows
      • Goal to support the full data lifecycle
          • Metadata and collection management support
  • 13. Why use TACC Supercomputers?
      • Need to analyze large datasets quickly
      • Need a more on-demand, interactive analysis environment
      • Need to work with databases at high transaction rates
      • Have a Hadoop or Spark workflow that needs a large HDFS datastore
      • Have a dataset that many users will compute with or analyze
      • Need a system with data management capabilities
      • Have a job that is currently I/O bound
  • 17. Available ML tools/libraries on TACC Supercomputers
      • Scikit-learn, Caffe, Theano, CUDA/cuDNN, Hadoop, PyHadoop, RHadoop, Mahout, Spark, PySpark, SparkR, MLlib
  • 18. Two Sample ML Workflows on TACC Supercomputers
      • GPU-powered deep learning on MRI images with NVIDIA DIGITS on the Maverick supercomputer
      • Pubmed recommender system on the Wrangler supercomputer
  • 19. Deep Learning on Images
      • Deep neural networks are computationally quite demanding
      • The input is large even at a small image resolution
      • A 256 x 256 RGB image implies 196,608 input neurons (256 x 256 x 3)
      • Many of the floating-point matrix operations involved can be handled by GPUs
  • 20. Deep Learning on MRI using TACC Supercomputers
      • Maverick has large GPU clusters
      • Three major GPU-based deep learning frameworks are available: Theano, Torch and Caffe
      • We use NVIDIA DIGITS (based on Caffe), a web server that provides a convenient web interface for training and testing deep neural networks
      • For classification of MRI images we use a convolutional DNN to learn the features
      • We use CUDA 7, cuDNN, Caffe and DIGITS on Maverick to classify our MRI images
      • Over the course of 30 epochs, our classification accuracy ranges from 74.21% to 82.09%
  • 21. Pubmed Recommender System in Wrangler
  • 22. What is a Recommendation System?
      • A recommender system helps match users with items
      • It relies on implicit or explicit user feedback or item suggestions
      • Our recommendation system:
          • We try to build a model that recommends Pubmed documents to users, based on the user search profile
  • 23. Types of Recommender System
      • Knowledge-based (i.e., search)
          • Pros: deterministic recommendations, assured quality, no cold start
          • Cons: knowledge engineering effort to bootstrap, basically static
      • Content-based
          • Pros: no community required, comparison between items possible
          • Cons: content descriptions necessary, cold start for new users
      • Collaborative
          • Pros: no knowledge-engineering effort, serendipity of results
          • Cons: requires some form of rating feedback, cold start for new users and new items
  • 24. Using Vector Space Model (VSM) for Pubmed
      • Given:
          • A set of Pubmed documents
          • N features (unique terms) describing the documents in the set
      • VSM builds an N-dimensional vector space
      • Each item/document is represented as a point in the vector space
      • Information retrieval based on search
          • Query: a point in the vector space
      • We apply TF-IDF to the tokenized documents to weight the documents and convert them to vectors
      • We compute cosine similarity between the tokenized documents and the query term
      • We select the top 3 documents matching our query
      • We weight the query term in the sparse matrix and rank documents
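The retrieval steps on this slide (TF-IDF weighting, cosine similarity against the query, top-3 selection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the whitespace tokenizer, the smoothed IDF formula, the helper names and the four sample documents are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_matches(query, docs, k=3):
    """Return (doc_index, score) for the k documents most similar to the query."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vectors = [tfidf_vector(t, idf) for t in tokenized]
    query_vector = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(query_vector, dv), i) for i, dv in enumerate(doc_vectors)]
    # Highest score first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]


# Hypothetical stand-ins for Pubmed abstracts.
docs = [
    "deep learning for mri image classification",
    "collaborative filtering for recommender systems",
    "mri imaging of brain tumors",
    "spark mllib for large scale machine learning",
]
results = top_matches("mri classification", docs, k=3)
for index, score in results:
    print(index, round(score, 3), docs[index])
```

On a real corpus the dict-based vectors would normally be replaced by a sparse matrix, as the slide mentions.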
  • 25. MPI or Hadoop or Spark? Which is really more suitable for this ML problem on an HPC system?
  • 26. Message Passing in HPC
      • Message Passing Interface (MPI) was one of the key factors that supported the initial growth of cluster computing
      • MPI helped shape what the HPC world has become today
      • MPI has supported a substantial majority of all supercomputing work
      • Scientists and engineers have relied upon MPI for decades
      • MPI works great for data intensive computing in a GPU cluster
  • 27. Why MPI is not the best tool for ML
      • A researcher/developer working with MPI needs to manually decompose the common data structures across processors
      • Every update of the data structure needs to be recast into a flurry of messages, syncs, and data exchanges
      • Programming at the transport layer is an awkward fit for numerical application developers
      • This led to the advent of other techniques
  • 28. Choosing Hadoop over MPI
      • Hadoop is an open source implementation of the MapReduce programming model in Java
      • It has interfaces to other programming languages such as R and Python
      • Hadoop includes:
          • HDFS: a distributed file system based on the Google File System (GFS)
          • YARN: a resource manager that assigns resources to computational tasks
          • MapReduce: a library that makes efficient distributed data processing easy
          • Mahout: a scalable machine learning and data mining library
          • Hadoop Streaming: a generic API that allows writing mappers and reducers in any language
      • Hadoop is a good fit for large single-pass data processing, but has its own limitations
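To illustrate the Hadoop Streaming model mentioned above, here is a minimal word-count sketch in Python. Streaming mappers and reducers exchange tab-separated key/value lines, and the framework sorts by key between the two stages; the sketch simulates that sort locally with `sorted()`. The function names and sample input are hypothetical, and the word-count task is only a stand-in for a real job.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    # sorted() stands in for the shuffle/sort step between map and reduce.
    return list(reducer(sorted(mapper(lines))))


sample = ["Spark and Hadoop", "Hadoop and MPI and Spark"]
counts = run_locally(sample)
print("\n".join(counts))
```

In an actual streaming job, each function would typically live in its own script that reads from `sys.stdin` and writes to standard output.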
  • 29. Limitations of Hadoop in HPC
      • Hadoop comes with mandatory MapReduce logging of output to disk after every Map/Reduce stage
          • In HPC, logging output to disk could be sped up with caching or SSDs
      • In general, this rendered Hadoop unusable for many ML approaches that require iteration or interactive use
      • The real issue with Hadoop was its HDFS file system
          • HDFS was intimately tied to Hadoop cluster scheduling
      • The large-scale ML community sought in-memory approaches to avoid this problem
  • 30. Spark
      • For large-scale technical computing, one very promising in-memory approach is Spark
      • Spark lacks Map/Reduce-style requirements
      • Spark can run standalone, without a scheduler like YARN
      • It has interfaces to other programming languages such as R and Python
      • Spark supports HDFS through YARN
      • MLlib: scalable machine learning and data mining library
      • Spark Streaming: enables stream processing of live data streams
  • 31. Our Recommendation Model
      • We apply collaborative filtering on the weighted/ranked documents
      • We use Alternating Least Squares (pyspark.mllib.recommendation.ALS) for recommending Pubmed documents
          • MatrixFactorizationModel.recommendProducts(user, num), where num is the number of products to recommend
      • We use collaborative filtering in Scikit-learn and Hadoop as baselines
          • We use the python-recsys library along with Python Scikit-learn
              • svd.recommend(int product_id)
          • We use Mahout's Alternating Least Squares for Hadoop
      • A comparative study of our model shows improved performance in Spark
  • 32. Performance Evaluation of Pubmed Recommendation Model
      • We evaluate our recommendation model using Python Scikit-learn, Apache Mahout and PySpark MLlib on Wrangler
      • The recommendation model uses Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) for evaluation
      • The lower the errors, the more accurate the model
      • The lower the time taken to train/test the model, the better the performance
      • Results (algorithm type: Recommendation; public dataset: Weighted Pubmed Documents):
          • Python Scikit: RMSE = 17.96%, MAE = 16.53%, model training time 42 secs, model test time 19 secs
          • Hadoop Mahout: RMSE = 16.02%, MAE = 14.98%, model training time 38 secs, model test time 14 secs
          • PySpark MLlib: RMSE = 15.88%, MAE = 14.23%, model training time 34 secs, model test time 11 secs
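For reference, the two error metrics named on this slide can be computed as follows. This is a minimal sketch using the standard definitions of RMSE and MAE; the rating values are made up for illustration, and the slides do not state how the errors were normalized into percentages, so that step is left out.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out ratings and model predictions.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE={rmse_value:.3f} MAE={mae_value:.3f}")
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason the two are often reported together.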