ParallelAccelerator.jl
High Performance Scripting in Julia
Ehsan Totoni
ehsan.totoni@intel.com
Programming Systems Lab, Intel Labs
December 17, 2015
Contributors: Todd Anderson, Raj Barik, Chunling Hu, Lindsey Kuper, Victor Lee, Hai Liu,
Geoff Lowney, Paul Petersen, Hongbo Rong, Tatiana Shpeisman, Youfeng Wu
Outline
§  Motivation
§  High Performance Scripting (HPS) Project at Intel Labs
§  ParallelAccelerator.jl
§  How It Works
§  Evaluation Results
§  Current Limitations
§  Get Involved
§  Future Steps
§  Deep Learning
§  Distributed-Memory HPC Cluster/Cloud
HPC is Everywhere
(Graphic: Traditional HPC on large parallel clusters — molecular biology, aerospace, cosmology, physics, chemistry, weather modeling. Scientific & technical computing on many-core workstations, small clusters, and clouds — medical visualization, financial analytics, visual effects, image analysis, perception & tracking, oil & gas exploration, design & engineering, predictive analytics, drug discovery.)
HPC Programming is an Expert Skill
§  Most college graduates know
Python or MATLAB®
§  HPC programming requires C
or FORTRAN with OpenMP,
MPI
§  “Prototype in MATLAB®, re-
write in C” workflow limits
HPC growth
Source: Survey by ACM, July 7, 2014
Most popular introductory teaching
languages at top-ranked U.S. universities
“As the performance of HPC machines approaches infinity, the number of people who
program them is approaching zero” - Dan Reed
from The National Strategic Computing Initiative presentation
High Performance Scripting
(Diagram: programmer population ordered by increasing technical skill and performance — high-level tool users (e.g., Julia, MATLAB®, Python, R), average HPC programmers, “ninja” programmers. The HPS target programmer base is the tool users, with the goal of productivity + performance + scalability.)
Why Julia?
§  Modern LLVM-based code
§  Easy compiler construction
§  Extendable (DSLs etc.)
§  Designed for performance
§  MIT license
§  Vibrant and growing user
community
§  Easy to port from MATLAB® or
Python
Source: http://pkg.julialang.org/pulse.html
•  Implemented as a package:
•  @acc macro to optimize Julia functions
•  Domain-specific Julia-to-C++ compiler written in
Julia
•  Parallel for loops translated to C++ with OpenMP
•  SIMD vectorization flags
•  Please try it out and report bugs!
ParallelAccelerator.jl
https://github.com/IntelLabs/ParallelAccelerator.jl
A compiler framework on top of the Julia compiler for high-
performance technical computing
Approach:
§  Identify implicit parallel patterns such as map, reduce,
comprehension, and stencil
§  Translate to data-parallel operations
§  Minimize runtime overheads
§  Eliminate array bounds checks
§  Aggressively fuse data-parallel operations
ParallelAccelerator.jl
ParallelAccelerator.jl Installation
•  Julia 0.4
•  Linux, Mac OS X
•  Compilers: icc, gcc, clang
•  Install, switch to master branch for up-to-date bug fixes
•  See examples/ folder
Pkg.add("ParallelAccelerator")
Pkg.checkout("ParallelAccelerator")
Pkg.checkout("CompilerTools")
Pkg.build("ParallelAccelerator")
ParallelAccelerator.jl Usage
•  Use high-level array operations (MATLAB®-style)
•  Unary functions: -, +, acos, cbrt, cos, cosh, exp10, exp2, exp, lgamma, log10, log, sin, sinh,
sqrt, tan, tanh, abs, copy, erf …
•  Binary functions: -, +, .+, .-, .*, ./, .>, .<, .==, .<<, .>>, .^, div, mod, &, |, min, max …
•  Reductions, comprehensions, stencils
•  minimum, maximum, sum, prod, any, all 
•  A = [ f(i) for i in 1:n]
•  runStencil(dst, src, N, :oob_skip) do b, a
       b[0,0] = (a[0,-1] + a[0,1] + a[-1,0] + a[1,0]) / 4
       return a, b
   end
•  Avoid sequential for-loops
•  Hard to analyze by ParallelAccelerator
using ParallelAccelerator
@acc function blackscholes(sptprice::Array{Float64,1},
                           strike::Array{Float64,1},
                           rate::Array{Float64,1},
                           volatility::Array{Float64,1},
                           time::Array{Float64,1})
    logterm = log10(sptprice ./ strike)
    powterm = .5 .* volatility .* volatility
    den = volatility .* sqrt(time)
    d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den
    d2 = d1 .- den
    NofXd1 = cndf2(d1)
    ...
    put = call .- futureValue .+ sptprice
end
put = blackscholes(sptprice, initStrike, rate, volatility, time)
Example (1): Black-Scholes
(Callouts: “Accelerate this function” points at @acc; “Implicit parallelism exploited” at the array operations)
using ParallelAccelerator
@acc function blur(img::Array{Float32,2}, iterations::Int)
    buf = Array(Float32, size(img)...)
    runStencil(buf, img, iterations, :oob_skip) do b, a
        b[0,0] =
            (a[-2,-2] * 0.003  + a[-1,-2] * 0.0133 + a[0,-2] * ...
             a[-2,-1] * 0.0133 + a[-1,-1] * 0.0596 + a[0,-1] * ...
             a[-2, 0] * 0.0219 + a[-1, 0] * 0.0983 + a[0, 0] * ...
             a[-2, 1] * 0.0133 + a[-1, 1] * 0.0596 + a[0, 1] * ...
             a[-2, 2] * 0.003  + a[-1, 2] * 0.0133 + a[0, 2] * ...
        return a, b
    end
    return img
end
img = blur(img, iterations)

Example (2): Gaussian blur
(Callout: the runStencil construct)
A quick preview of results
Data from 10/21/2015
Evaluation Platform:
Intel(R) Xeon(R) E5-2690 v2
20 cores
ParallelAccelerator is ~32x faster than MATLAB®
ParallelAccelerator is ~90x faster than Julia
Parallel Patterns: mmap
•  mmap & mmap!: element-wise map functions
(B1, B2, …) = mmap((x1, x2, …) → (e1, e2, …), A1, A2, …)
Examples:
log(A) ⇒ mmap(x → log(x), A)
A.*B ⇒ mmap((x, y) → x*y, A, B)
A .+ c ⇒ mmap(x → x+c, A)
A -= B ⇒ mmap!((x, y) → x-y, A, B)
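The mmap/mmap! semantics can be modeled in a few lines of Python (an illustrative sketch only — the real operation is an internal compiler node, and the function names here are hypothetical):

```python
def mmap(f, *arrays):
    """Element-wise map: apply f to corresponding elements of the inputs,
    producing a fresh output array."""
    return [f(*xs) for xs in zip(*arrays)]

def mmap_inplace(f, dest, *arrays):
    """mmap!-style in-place map: results overwrite the first array,
    so no new array is allocated."""
    for i, xs in enumerate(zip(dest, *arrays)):
        dest[i] = f(*xs)
    return dest
```

In this model, `A .+ c` corresponds to `mmap(lambda x: x + c, A)` and `A -= B` to `mmap_inplace(lambda x, y: x - y, A, B)`; each element is independent, which is what makes the pattern trivially parallel.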
Parallel Patterns: reduce
•  reduce: reduction function
r = reduce(Θ, Φ, A)
where Θ is the binary reduction operator and Φ is the initial (neutral) value for the reduction
Examples:
sum(A) ⇒ reduce(+, 0, A)
product(A) ⇒ reduce(*, 1, A)
any(A) ⇒ reduce(||, false, A)
all(A) ⇒ reduce(&&, true, A)
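A sequential Python model of `reduce(Θ, Φ, A)` (hypothetical name; the compiler's actual node is internal) makes the contract clear: Θ folded over A starting from the neutral value Φ. Because Θ is associative and Φ is neutral, the compiler is free to split A into chunks, reduce each chunk in parallel, and combine the partial results.

```python
def reduce_pattern(op, neutral, A):
    """Sequential model of the parallel reduction:
    fold op over A, starting from the neutral value."""
    acc = neutral
    for x in A:
        acc = op(acc, x)
    return acc
```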
Parallel Patterns: comprehension
•  Comprehension: creates a rank-n array that is the cartesian product of the range variables
A = [ f(x1, x2, …, xn) for x1 in r1, x2 in r2, …, xn in rn ]
where function f is applied over the cartesian product of points (x1, x2, …, xn) in the ranges (r1, r2, …, rn)
Example:
avg(x) = [ 0.25*x[i-1] + 0.5*x[i] + 0.25*x[i+1] for i in 2:length(x)-1 ]
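The avg example above translates directly into Python (a sketch for comparison only; note that Julia's 1-based `2:length(x)-1` becomes the 0-based `range(1, len(x)-1)`):

```python
def avg(x):
    """Three-point weighted average over the interior points of x,
    mirroring the Julia comprehension shown above."""
    return [0.25 * x[i-1] + 0.5 * x[i] + 0.25 * x[i+1]
            for i in range(1, len(x) - 1)]
```

Each output element depends only on fixed inputs, never on other outputs, so the comprehension can be evaluated as a parallel loop over the index space.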
Parallel Patterns: stencil
•  runStencil: user-facing language construct to perform a stencil operation
runStencil((A, B, …) → f(A, B, …), A, B, …, n, s)
where all arrays in function f are relatively indexed, n is the trip count for the iterative stencil, and s specifies how stencil borders are handled
Example:
runStencil(b, a, N, :oob_skip) do b, a
    b[0,0] = (a[-1,-1] + a[-1,0] + a[1, 0] + a[1, 1]) / 4
    return a, b
end
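A 1-D Python sketch of the runStencil behavior (hypothetical helper, simplified to one dimension) shows the two things the construct encapsulates — relative indexing with border handling, and buffer rotation between iterations:

```python
def run_stencil(kernel, b, a, n):
    """1-D model of runStencil: n iterations; each iteration writes b
    from relatively-indexed reads of a, skipping border points whose
    neighborhood would fall out of bounds (the :oob_skip policy),
    then rotates buffers as `return a, b` does in runStencil."""
    for _ in range(n):
        for i in range(1, len(a) - 1):   # :oob_skip -> interior points only
            b[i] = kernel(a, i)
        a, b = b, a                      # buffer rotation between iterations
    return a                             # buffer holding the latest results
```

For example, a two-neighbor average kernel would be `lambda a, i: (a[i-1] + a[i+1]) / 2`. Because the kernel reads only `a` and writes only `b`, every interior point can be computed in parallel within an iteration.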
•  DomainIR: replaces some of Julia AST with new “domain nodes” for
map, reduce, and stencil
•  ParallelIR: replaces some of Domain AST with new “parfor” nodes
representing parallel-for loops (parfor)
•  CGen: converts parfor nodes into OpenMP loops
ParallelAccelerator Compiler Pipeline
(Pipeline diagram: Julia Source → Julia Parser → Julia AST → Domain Transformations → Domain AST → Parallel Transformations → Parallel AST → C++ Backend (CGen) → Executable, linked with the array runtime and OpenMP)
•  Map fusion
•  Reordering of statements to enable fusion
•  Remove intermediate arrays
•  mmap to mmap! conversion
•  Hoisting of allocations out of loops
•  Other classical optimizations
•  Dead code and variable elimination
•  Loop invariant hoisting
•  Convert parfor nodes to OpenMP with SIMD code generation
Transformation Engine
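The payoff of map fusion can be sketched in Python (illustrative pseudocode for the optimization, not the compiler's own representation): fusing two element-wise maps removes a full array traversal and the intermediate allocation.

```python
import math

def unfused(A, B):
    """Two element-wise passes with an intermediate array,
    as naive evaluation of log(A) .* B would perform."""
    T = [math.log(x) for x in A]          # intermediate array allocated here
    return [t * y for t, y in zip(T, B)]

def fused(A, B):
    """After map fusion: a single pass, no intermediate array."""
    return [math.log(x) * y for x, y in zip(A, B)]
```

Both produce identical results; the fused form touches each element once, which also improves cache behavior and gives the backend one loop to vectorize.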
ParallelAccelerator vs. Julia
(Bar chart, speedup over plain Julia per benchmark: 24x, 146x, 169x, 25x, 63x, 36x, 14x, 33x)
ParallelAccelerator enables ∼5-100× speedup over MATLAB® and ∼10-250× speedup over plain Julia
Evaluation Platform: Intel(R) Xeon(R) E5-2690 v2, 20 cores
•  Julia-to-C++ translation (needed for OpenMP)
•  Not easy in general, many libraries fail
•  E.g. if is(a,Float64)…
•  Strings, I/O, ccalls, etc. may fail
•  Upcoming native Julia path with threading helps
•  Need full type information
•  Make sure there is no “Any” in AST of function
•  See @code_warntype
Current Limitations
•  Not everything parallelizable
•  Limited operators supported
•  Expanding over time
•  ParallelAccelerator’s compilation time
•  Type-inference for our package by Julia compiler
•  First use of package only
•  Use same Julia REPL
•  A solution: see ParallelAccelerator.embed()
•  Julia source needed
•  Compiler bugs…
•  Need more documentation
Current Limitations
•  Try ParallelAccelerator and let us know
•  Mailing list
•  https://groups.google.com/forum/#!forum/julia-hps
•  Chat room
•  https://gitter.im/IntelLabs/ParallelAccelerator.jl
•  GitHub issues
•  We are looking for collaborators
•  Application-driven computer science research
•  Compiler contributions
•  Interesting challenges
•  We need your help!
Get Involved
•  ParallelAccelerator lets you write code in a
scripting language without sacrificing efficiency
•  Identifies parallel patterns in the code and
compiles to run efficiently on parallel hardware
•  Eliminates many of the usual overheads of high-
level array languages
Summary
•  Make it real
•  Extend coverage
•  Improve performance
•  Enable native Julia threading
•  Apply to real world applications
•  Domain-specific features
•  E.g. DSL for Deep Learning
•  Distributed-Memory HPC Cluster/Cloud
Next Steps
•  Emerging applications are data/compute intensive
•  Machine Learning on large datasets
•  Enormous data and computation
•  Productivity is 1st priority
•  Not many know MPI/C
•  Goal: facilitate efficient distributed-memory execution without
sacrificing productivity
•  Same high-level code
•  Support parallel data source access
•  Parallel file I/O
Using Clusters is Necessary
(Image source: http://www.udel.edu/)
•  Distributed-IR phase after Parallel-IR
•  Distribute arrays and parfors
•  Handle parallel I/O
•  Call distributed-memory libraries
Implementation in ParallelAccelerator
(Pipeline diagram as before, with a DistributedIR phase inserted after the Parallel Transformations; the C++ backend (CGen) then targets MPI or Charm++)
@acc function blackscholes(iterations::Int64)
    sptprice = [ 42.0 for i=1:iterations]
    strike = [ 40.0+(i/iterations) for i=1:iterations]
    logterm = log10(sptprice ./ strike)
    powterm = .5 .* volatility .* volatility
    den = volatility .* sqrt(time)
    d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den
    d2 = d1 .- den
    NofXd1 = cndf2(d1)
    ...
    put = call .- futureValue .+ sptprice
    return sum(put)
end
checksum = blackscholes(iterations)
Example: Black-Scholes
(Callout: the comprehensions are the parallel initialization)
double blackscholes(int64_t iterations)
{
    int mpi_rank, mpi_nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    int mystart = mpi_rank * (iterations / mpi_nprocs);
    /* last rank picks up the remainder */
    int myend = mpi_rank == mpi_nprocs - 1 ? iterations :
                (mpi_rank + 1) * (iterations / mpi_nprocs);
    double *sptprice = (double*)malloc(
        (myend - mystart) * sizeof(double));
    ...
    for (i = mystart; i < myend; i++) {
        sptprice[i - mystart] = 42.0;
        strike[i - mystart] = 40.0 + (i / iterations);
        ...
        loc_put_sum += Put;
    }
    double all_put_sum;
    MPI_Reduce(&loc_put_sum, &all_put_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    return all_put_sum;
}
Example: Black-Scholes
•  Black-Scholes works
•  Generated code equivalent to
hand-written MPI
•  4 nodes, dual-socket Haswell
•  36 cores/node
•  MPI-OpenMP
•  2.03x faster on 4 nodes
vs. 1 node
•  33.09x compared to
sequential
•  MPI-only
•  1 rank/core, no OpenMP
•  91.6x speedup on 144 cores
vs. fast sequential
Initial Results
Questions

Contenu connexe

Tendances

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
Sri Prasanna
 

Tendances (20)

Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
 
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Odsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsOdsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on Hops
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
 
Property-based Testing and Generators (Lua)
Property-based Testing and Generators (Lua)Property-based Testing and Generators (Lua)
Property-based Testing and Generators (Lua)
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Boost.Compute GTC 2015
Boost.Compute GTC 2015Boost.Compute GTC 2015
Boost.Compute GTC 2015
 
Performance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismPerformance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive Parallelism
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
 
What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++What&rsquo;s new in Visual C++
What&rsquo;s new in Visual C++
 

Similaire à Ehsan parallel accelerator-dec2015

"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres..."The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
Edge AI and Vision Alliance
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
ActiveState
 

Similaire à Ehsan parallel accelerator-dec2015 (20)

Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
 
What to expect from Java 9
What to expect from Java 9What to expect from Java 9
What to expect from Java 9
 
MXNet Workshop
MXNet WorkshopMXNet Workshop
MXNet Workshop
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Software Engineering
Software EngineeringSoftware Engineering
Software Engineering
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Return of c++
Return of c++Return of c++
Return of c++
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R project
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres..."The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
"The OpenCV Open Source Computer Vision Library: Latest Developments," a Pres...
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
 
Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)
 
Aggregate Programming in Scala
Aggregate Programming in ScalaAggregate Programming in Scala
Aggregate Programming in Scala
 

Dernier

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 

Dernier (20)

HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdf
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
 
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 

Ehsan parallel accelerator-dec2015

  • 1. ParallelAccelerator.jl High Performance Scripting in Julia Ehsan Totoni ehsan.totoni@intel.com Programming Systems Lab, Intel Labs December 17, 2015 Contributors: Todd Anderson, Raj Barik, Chunling Hu, Lindsey Kuper, Victor Lee, Hai Liu, Geoff Lowney, Paul Petersen, Hongbo Rong, Tatiana Shpeisman, Youfeng Wu 1
  • 2. §  Motivation §  High Performance Scripting(HPS) Project at Intel Labs §  ParallelAccelerator.jl §  How It Works §  Evaluation Results §  Current Limitations §  Get Involved §  Future Steps §  Deep Learning §  Distributed-Memory HPC Cluster/Cloud 2 Outline
  • 3. HPC is Everywhere 3 Molecular Biology Aerospace Cosmology PhysicsChemistry Weather modeling TRADITIONAL HPC Medical   Visualization Financial Analytics Visual   Effects Image analysis Perception   &Tracking Oil    &  Gas   Exploration Scientific  &  Technical  Computing +Many-­‐core  workstations,  small  clusters,  and  clouds Design  & Engineering Predictive Analytics Drug Discovery Large parallel clusters
  • 4. HPC Programming is an Expert Skill §  Most college graduates know Python or MATLAB® §  HPC programming requires C or FORTRAN with OpenMP, MPI §  “Prototype in MATLAB®, re- write in C” workflow limits HPC growth Source: Survey by ACM, July 7, 2014 Most popular introductory teaching languages at top-ranked U.S. universities “As the performance of HPC machines approaches infinity, the number of people who program them is approaching zero” - Dan Reed from The National Strategic Computing Initiative presentation 4
  • 5. 5 High Performance Scripting High Functional Tool Users (e.g., Julia, MATLAB®, Python, R) Ninja Programmers Increasing Performance Increasing Technical Skills Target Programmer Base for HPS Average HPC Programmers Productivity + Performance + Scalability
  • 6. Why Julia? §  Modern LLVM-based code §  Easy compiler construction §  Extendable (DSLs etc.) §  Designed for performance §  MIT license §  Vibrant and growing user community §  Easy to port from MATLAB® or Python Source: http://pkg.julialang.org/pulse.html 6
  • 7. •  Implemented as a package: •  @acc macro to optimize Julia functions •  Domain-specific Julia-to-C++ compiler written in Julia •  Parallel for loops translated to C++ with OpenMP •  SIMD vectorization flags •  Please try it out and report bugs! 7 ParallelAccelerator.jl https://github.com/IntelLabs/ParallelAccelerator.jl
  • 8. A compiler framework on top of the Julia compiler for high- performance technical computing Approach: §  Identify implicit parallel patterns such as map, reduce, comprehension, and stencil §  Translate to data-parallel operations §  Minimize runtime overheads §  Eliminate array bounds checks §  Aggressively fuse data-parallel operations 8 ParallelAccelerator.jl
  • 9. 9 ParallelAccelerator.jl Installation •  Julia 0.4 •  Linux, Mac OS X •  Compilers: icc, gcc, clang •  Install, switch to master branch for up-to-date bug fixes •  See examples/ folder Pkg.add("ParallelAccelerator")   Pkg.checkout("ParallelAccelerator")   Pkg.checkout("CompilerTools")   Pkg.build("ParallelAccelerator")
  • 10. 10 ParallelAccelerator.jl Usage •  Use high-level array operations (MATLAB®-style) •  Unary functions: -, +, acos, cbrt, cos, cosh, exp10, exp2, exp, lgamma, log10, log, sin, sinh, sqrt, tan, tanh, abs, copy, erf … •  Binary functions: -, +, .+, .-, .*, ./, .,.>, .<,.==, .<<, .>>, .^, div, mod, &, |, min, max … •  Reductions, comprehensions, stencils •  minimum, maximum, sum, prod, any, all •  A = [ f(i) for i in 1:n] •  runStencil(dst, src, N, :oob_skip) do b, a
 b[0,0] = (a[0,-1] + a[0,1] + a[-1,0] + a[1,0]) / 4
 return a, b
 end •  Avoid sequential for-loops •  Hard to analyze by ParallelAccelerator
  • 11. using ParallelAccelerator @acc function blackscholes(sptprice::Array{Float64,1}, strike::Array{Float64,1}, rate::Array{Float64,1}, volatility::Array{Float64,1}, time::Array{Float64,1}) logterm = log10(sptprice ./ strike) powterm = .5 .* volatility .* volatility den = volatility .* sqrt(time) d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den d2 = d1 .- den NofXd1 = cndf2(d1) ... put = call .- futureValue .+ sptprice end put = blackscholes(sptprice, initStrike, rate, volatility, time) 11 Example (1): Black-Scholes Accelerate this function Implicit parallelism exploited
  • 12. using ParallelAccelerator @acc function blur(img::Array{Float32,2}, iterations::Int) buf = Array(Float32, size(img)...) runStencil(buf, img, iterations, :oob_skip) do b, a b[0,0] = (a[-2,-2] * 0.003 + a[-1,-2] * 0.0133 + a[0,-2] * ... a[-2,-1] * 0.0133 + a[-1,-1] * 0.0596 + a[0,-1] * ... a[-2, 0] * 0.0219 + a[-1, 0] * 0.0983 + a[0, 0] * ... a[-2, 1] * 0.0133 + a[-1, 1] * 0.0596 + a[0, 1] * ... a[-2, 2] * 0.003 + a[-1, 2] * 0.0133 + a[0, 2] * ... return a, b end return img end img = blur(img, iterations) 12 Example (2): Gaussian blur runStencil construct
  • 13. 13 A quick preview of results Data from 10/21/2015 Evaluation Platform: Intel(R) Xeon(R) E5-2690 v2 20 cores ParallelAccelerator is ~32x faster than MATLAB® ParallelAccelerator is ~90x faster than Julia
  • 14. •  mmap & mmap! : element-wise map function 14 Parallel Patterns: mmap (B1, B2, …) = mmap( (x1, x2, …) à (e1, e2, …), A1, A2, …) n mm n Examples: log(A) ⇒ mmap (x → log(x), A) A.*B ⇒ mmap ((x, y) → x*y, A, B) A .+ c ⇒ mmap (x→x+c,A) A -= B ⇒ mmap! ((x,y) → x-y, A, B)
  • 15. •  reduce: reduction function 15 Parallel Patterns: reduce r = reduce(Θ, Φ, A) Θ is the binary reduction operator Φ is the initial neutral value for reduction Examples: sum(A) ⇒ reduce (+, 0, A) product(A) ⇒ reduce (*, 1, A) any(A) ⇒ reduce (||, false, A) all(A) ⇒ reduce (&&, true, A)
  • 16. •  Comprehension: creates a rank-n array that is the cartesian product of the range of variables 16 Parallel Patterns: comprehension A = [ f(x1, x2, …, xn) for x1 in r1, x2 in r2, …, xn in rn] where, function f is applied over cartesian product of points (x1, x2, …, xn) in the ranges (r1, r2, …, rn) Example: avg(x) = [ 0.25*x[i-1]+0.5*x[i]+0.25*x[i+1] for i in 2:length(x)-1 ]
17. Parallel Patterns: stencil

§  runStencil: user-facing language construct to perform stencil operations

    runStencil((A, B, …) → f(A, B, …), A, B, …, n, s)

    All arrays in function f are relatively indexed,
    n is the trip count for an iterative stencil,
    s specifies how stencil borders are handled.

Example:

    runStencil(b, a, N, :oob_skip) do b, a
        b[0,0] = (a[-1,-1] + a[-1,0] + a[1,0] + a[1,1]) / 4
        return a, b
    end
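What the 4-point runStencil example computes can be written as explicit loops. A plain-Julia sketch of a single iteration with :oob_skip border handling (the name stencil_once is hypothetical, not part of ParallelAccelerator):

```julia
# One iteration of the slide's 4-point stencil, as explicit loops.
# :oob_skip semantics: cells whose neighborhood would go out of bounds
# are skipped, i.e. they keep their old values.
function stencil_once(a::Matrix{Float64})
    b = copy(a)
    for j in 2:size(a, 2) - 1, i in 2:size(a, 1) - 1
        # relative indices (-1,-1), (-1,0), (1,0), (1,1) from the slide
        b[i, j] = (a[i-1, j-1] + a[i-1, j] + a[i+1, j] + a[i+1, j+1]) / 4
    end
    return b
end

@assert stencil_once(ones(4, 4)) == ones(4, 4)  # a constant field is a fixed point
```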
18. ParallelAccelerator Compiler Pipeline

§  DomainIR: replaces parts of the Julia AST with new "domain nodes" for map, reduce, and stencil
§  ParallelIR: replaces parts of the Domain AST with new "parfor" nodes representing parallel-for loops
§  CGen: converts parfor nodes into OpenMP loops

Pipeline:

    Julia Source → Julia Parser → Julia AST → Domain Transformations → Domain AST
    → Parallel Transformations → Parallel AST → C++ Backend (CGen)
    → OpenMP + Array Runtime → Executable
19. Transformation Engine

§  Map fusion
    §  Reordering of statements to enable fusion
    §  Removal of intermediate arrays
§  mmap to mmap! conversion
§  Hoisting of allocations out of loops
§  Other classical optimizations
    §  Dead code and variable elimination
    §  Loop-invariant hoisting
§  Conversion of parfor nodes to OpenMP, with SIMD code generation
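Map fusion, the first item above, can be illustrated in plain Julia: two element-wise passes with an intermediate array become a single pass with none. This is a hand-written before/after sketch, not compiler output:

```julia
# Hand-written sketch of what map fusion does.
A = collect(1.0:5.0)

# Unfused: two traversals of A and an intermediate array `tmp`.
tmp = map(x -> x * x, A)
out = map(x -> x + 1.0, tmp)

# Fused: one traversal, no intermediate allocation.
out_fused = map(x -> x * x + 1.0, A)

@assert out == out_fused
```

Removing `tmp` is exactly the "remove intermediate arrays" step: on large arrays the fused form saves both an allocation and a pass over memory.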
20. ParallelAccelerator vs. Julia

[Bar chart: speedup over Julia per benchmark — 24x, 146x, 169x, 25x, 63x, 36x, 14x, 33x]

§  ParallelAccelerator enables ∼5-100× speedup over MATLAB® and ∼10-250× speedup over plain Julia
§  Evaluation platform: Intel(R) Xeon(R) E5-2690 v2, 20 cores
21. Current Limitations

§  Julia-to-C++ translation (needed for OpenMP)
    §  Not easy in general; many libraries fail
        §  e.g., if is(a, Float64)…
    §  Strings, I/O, ccalls, etc. may fail
    §  The upcoming native Julia path with threading helps
§  Full type information is needed
    §  Make sure there is no "Any" in the AST of the function
    §  See @code_warntype
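The "no Any in the AST" requirement is the usual Julia type-stability condition. A minimal example of the kind of function @code_warntype flags (the names unstable and stable are hypothetical):

```julia
# Type-unstable: the return type depends on a runtime value, so inference
# reports Union{Int, Float64} (highlighted by @code_warntype), and such a
# function cannot be cleanly translated to typed C++ code.
unstable(flag) = flag ? 1 : 1.0

# Type-stable rewrite: both branches return the same concrete type.
stable(flag) = flag ? 1.0 : 2.0

@assert typeof(unstable(true)) == Int
@assert typeof(unstable(false)) == Float64
@assert typeof(stable(true)) == typeof(stable(false)) == Float64
```

Running `@code_warntype unstable(true)` in the REPL shows the inferred Union return type; the stable version infers to a single concrete Float64.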
22. Current Limitations (continued)

§  Not everything is parallelizable
§  Limited operators supported
    §  Expanding over time
§  ParallelAccelerator's compilation time
    §  Type inference for our package by the Julia compiler
    §  First use of the package only
    §  Use the same Julia REPL
    §  A solution: see ParallelAccelerator.embed()
§  Julia source needed
§  Compiler bugs…
§  Need more documentation
23. Get Involved

§  Try ParallelAccelerator and let us know
    §  Mailing list: https://groups.google.com/forum/#!forum/julia-hps
    §  Chat room: https://gitter.im/IntelLabs/ParallelAccelerator.jl
    §  GitHub issues
§  We are looking for collaborators
    §  Application-driven computer science research
    §  Compiler contributions
    §  Interesting challenges
§  We need your help!
24. Summary

§  ParallelAccelerator lets you write code in a scripting language without sacrificing efficiency
§  It identifies parallel patterns in the code and compiles them to run efficiently on parallel hardware
§  It eliminates many of the usual overheads of high-level array languages
25. Next Steps

§  Make it real
    §  Extend coverage
    §  Improve performance
    §  Enable native Julia threading
    §  Apply to real-world applications
§  Domain-specific features
    §  e.g., a DSL for Deep Learning
§  Distributed-Memory HPC Cluster/Cloud
26. Using Clusters is Necessary

§  Emerging applications are data/compute intensive
    §  Machine Learning on large datasets
    §  Enormous data and computation
§  Productivity is the 1st priority
    §  Not many know MPI/C
§  Goal: facilitate efficient distributed-memory execution without sacrificing productivity
    §  Same high-level code
    §  Support parallel data source access
    §  Parallel file I/O

(Image credit: http://www.udel.edu/)
27. Implementation in ParallelAccelerator

§  DistributedIR phase after ParallelIR
    §  Distribute arrays and parfors
    §  Handle parallel I/O
    §  Call distributed-memory libraries (MPI, Charm++)

Pipeline:

    Julia Source → Julia Parser → Julia AST → Domain Transformations → Domain AST
    → Parallel Transformations → Parallel AST → DistributedIR → C++ Backend (CGen)
    → OpenMP + MPI/Charm++ + Array Runtime → Executable
28. Example: Black-Scholes

    @acc function blackscholes(iterations::Int64)
        sptprice = [ 42.0 for i=1:iterations ]
        strike = [ 40.0 + (i/iterations) for i=1:iterations ]
        logterm = log10(sptprice ./ strike)
        powterm = .5 .* volatility .* volatility
        den = volatility .* sqrt(time)
        d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den
        d2 = d1 .- den
        NofXd1 = cndf2(d1)
        ...
        put = call .- futureValue .+ sptprice
        return sum(put)
    end

    checksum = blackscholes(iterations)

The array comprehensions give parallel initialization.
29. Example: Black-Scholes (generated MPI code)

    double blackscholes(int64_t iterations) {
        int mpi_rank, mpi_nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
        int mystart = mpi_rank * (iterations / mpi_nprocs);
        // the last rank takes the remainder of the iteration space
        int myend = mpi_rank == mpi_nprocs - 1 ? iterations
                                               : (mpi_rank + 1) * (iterations / mpi_nprocs);
        double *sptprice = (double*)malloc((myend - mystart) * sizeof(double));
        ...
        for (i = mystart; i < myend; i++) {
            sptprice[i - mystart] = 42.0;
            strike[i - mystart] = 40.0 + (i / iterations);
            ...
            loc_put_sum += Put;
        }
        double all_put_sum;
        MPI_Reduce(&loc_put_sum, &all_put_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        return all_put_sum;
    }
30. Initial Results

§  Black-Scholes works
    §  Generated code equivalent to hand-written MPI
§  4 nodes, dual-socket Haswell, 36 cores/node
§  MPI-OpenMP
    §  2.03x faster on 4 nodes vs. 1 node
    §  33.09x compared to sequential
§  MPI-only (1 rank/core, no OpenMP)
    §  91.6x speedup on 144 cores vs. fast sequential