Despite Moore's "law", uniprocessor clock speeds have now stalled. Rather than single processors running at ever higher clock speeds, it is
common to find dual-, quad- or even hexa-core processors, even in consumer laptops and desktops.
Future hardware will not be slightly parallel, however, as in today's multicore systems, but will be
massively parallel, with manycore and perhaps even megacore systems
becoming mainstream.
This means that programmers need to start thinking parallel. To achieve this they must move away
from traditional programming models where parallelism is a
bolted-on afterthought. Rather, programmers must use languages where parallelism is deeply embedded into the programming model
from the outset.
By providing a high level model of computation, without explicit ordering of computations,
declarative languages in general, and functional languages in particular, offer many advantages for parallel
programming.
One of the most fundamental advantages of the functional paradigm is purity.
In a purely functional language, as exemplified by Haskell, there are simply no side effects: it is therefore impossible for parallel computations to interfere with each other in unexpected ways.
ParaForming aims to radically improve the process
of parallelising purely functional programs through a comprehensive set of high-level parallel refactoring patterns for Parallel Haskell,
supported by advanced refactoring tools.
By matching parallel design patterns with appropriate algorithmic skeletons
using advanced software refactoring techniques and novel cost information, we will bridge the gap between fully automatic
and fully explicit approaches to parallelisation, helping programmers "think parallel" in a systematic,
guided way. This talk introduces the ParaForming approach, gives some examples and shows how
effective parallel programs can be developed using advanced refactoring technology.
ParaForming - Patterns and Refactoring for Parallel Programming
1. Paraforming: Forming Parallel (Functional) Programs from High-Level Patterns using Advanced Refactoring
Kevin Hammond, Chris Brown, Vladimir Janjic
University of St Andrews, Scotland
Build Stuff, Vilnius, Lithuania, December 10 2013
T: @paraphrase_fp7, @khstandrews
E: kh@cs.st-andrews.ac.uk
W: http://www.paraphrase-ict.eu
4. What will “megacore” computers look like?
§ Probably not just scaled versions of today’s multicore
§ Perhaps hundreds of dedicated lightweight integer units
§ Hundreds of floating point units (enhanced GPU designs)
§ A few heavyweight general-purpose cores
§ Some specialised units for graphics, authentication, network etc.
§ Possibly soft cores (FPGAs etc.)
§ Highly heterogeneous
5. What will “megacore” computers look like?
§ Probably not uniform shared memory
§ NUMA is likely, even hardware distributed shared memory
§ or even message-passing systems on a chip
§ shared memory will not be a good abstraction

    int arr[x][y];
6. Laki (NEC Nehalem Cluster) and hermit (XE6)

Laki:
- 700 dual-socket Xeon 5560 (“Gainestown”) @ 2.8GHz, 4/6
- 12 GB DDR3 RAM / node
- Infiniband (QDR)
- Scientific Linux 6.0
- 32 nodes with additional Nvidia Tesla S1070

hermit (phase 1 step 1):
- 38 racks with 96 nodes each; 96 service nodes and 3552 compute nodes
- Each compute node will have 2 sockets, AMD Interlagos @ 2.3GHz with 16 cores each, leading to 113,664 cores
- Nodes with 32GB and 64GB memory, reflecting different user needs
- External Access Nodes, Pre-/Postprocessing Nodes, Remote Visualization Nodes
- 2.7PB storage capacity @ 150GB/s IO bandwidth

:: HLRS in ParaPhrase :: Turin, 4th/5th October 2011 ::
7. The Biggest Computer in the World
Tianhe-2, Chinese National University of Defence Technology
§ 33.86 petaflops (June 17, 2013)
§ 16,000 nodes, each with 2 Ivy Bridge multicores and 3 Xeon Phis
§ 3,120,000 x86 cores in total!!!
8. It’s not just about large systems
• Even mobile phones are multicore
§ the Samsung Exynos 5 Octa has 8 cores, 4 of which are “dark”
• Performance/energy tradeoffs mean systems will be increasingly parallel
• If we don’t solve the multicore challenge, then no other advances will matter!

ALL Future Programming will be Parallel!
9. The Manycore Challenge
“Ultimately, developers should start thinking about tens, hundreds, and thousands of cores now in their algorithmic development and deployment pipeline.”
Anwar Ghuloum, Principal Engineer, Intel Microprocessor Technology Lab

The ONLY important challenge in Computer Science (Intel)

“The dilemma is that a large percentage of mission-critical enterprise applications will not ``automagically'' run faster on multi-core servers. In fact, many will actually run slower. We must make it as easy as possible for applications programmers to exploit the latest developments in multi-core/many-core architectures, while still making it easy to target future (and perhaps unanticipated) hardware developments.”
Patrick Leonard, Vice President for Product Development, Rogue Wave Software

Also recognised as thematic priorities by EU and national funding bodies
10. But doesn’t that mean millions of threads on a megacore machine??
11. How to build a wall (with apologies to Ian Watson, Univ. Manchester)
13. How NOT to build a wall
Typical CONCURRENCY approaches require the programmer to solve these.
Task identification is not the only problem…
Must also consider: coordination, communication, placement, scheduling, …
14. We need structure. We need abstraction. We don’t need another brick in the wall.
15. Thinking Parallel
§ Fundamentally, programmers must learn to “think parallel”
§ this requires new high-level programming constructs
§ perhaps dealing with hundreds of millions of threads
§ You cannot program effectively while worrying about deadlocks etc.
§ they must be eliminated from the design!
§ You cannot program effectively while fiddling with communication etc.
§ this needs to be packaged/abstracted!
§ You cannot program effectively without performance information
§ this needs to be included as part of the design!
16. A Solution?
“The only thing that works for parallelism is functional programming”
Bob Harper, Carnegie Mellon University
17. Parallel Functional Programming
§ No explicit ordering of expressions
§ Purity means no side effects
§ Impossible for parallel processes to interfere with each other
§ Can debug sequentially but run in parallel
§ Enormous saving in effort
§ the programmer concentrates on solving the problem
§ not on porting a sequential algorithm into an (ill-defined) parallel domain
§ No locks, deadlocks or race conditions!!
§ Huge productivity gains!
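The “debug sequentially, run in parallel” point can be made concrete outside Haskell too. A minimal Python sketch (illustrative only, not part of ParaForming): a pure function mapped over inputs by a thread pool must give exactly the same answer as the sequential map, because no task can observe another’s side effects.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # pure: the result depends only on the argument
    return x * x

inputs = list(range(10))

# sequential run: what you debug
sequential = list(map(square, inputs))

# parallel run: what you deploy; Executor.map preserves input order
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, inputs))

assert parallel == sequential  # same results, by purity
```

With side effects (e.g. the worker mutating a shared counter), no such guarantee would hold; purity is what makes the two runs interchangeable.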
18. ParaPhrase Project: Parallel Patterns for Heterogeneous Multicore Systems
(ICT-288570), 2011-2014, €4.2M budget
13 partners, 8 European countries: UK, Italy, Germany, Austria, Ireland, Hungary, Poland, Israel
Coordinated by Kevin Hammond, St Andrews
19. The ParaPhrase Approach
§ Start bottom-up
§ identify (strongly hygienic) COMPONENTS
§ using semi-automated refactoring, for both legacy and new programs
§ Think about the PATTERN of parallelism
§ e.g. map(reduce), task farm, parallel search, parallel completion, ...
§ STRUCTURE the components into a parallel program
§ turn the patterns into concrete (skeleton) code
§ Take performance, energy etc. into account (multi-objective optimisation)
§ also using refactoring
§ RESTRUCTURE if necessary! (also using refactoring)
20. Some Common Patterns
§ High-level abstract patterns of common parallel algorithms
§ Google map-reduce combines two of these!
§ Generally, we need to nest/combine patterns in arbitrary ways
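Map-reduce is exactly the nesting of two of these patterns: a map over the items followed by a reduction of the results. A sequential Python sketch of that composition (the function names are illustrative, not from any of the tools discussed here):

```python
from functools import reduce

def map_reduce(mapper, reducer, items):
    """Compose the two patterns: apply 'mapper' to every item
    (the MAP pattern, parallelisable since 'mapper' is pure),
    then combine the results pairwise (the REDUCE pattern)."""
    mapped = [mapper(x) for x in items]
    return reduce(reducer, mapped)

# word-count flavour: map each document to its word count, then sum
docs = ["we need structure", "we need abstraction"]
total = map_reduce(lambda d: len(d.split()), lambda a, b: a + b, docs)
# total == 6
```

Because both stages are themselves patterns, either one can be nested further, e.g. farming out the map stage, which is the kind of arbitrary combination the slide refers to.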
21. The Skel Library for Erlang
§ Skeletons implement specific parallel patterns
§ pluggable templates
§ Skel is a new (AND ONLY!) skeleton library in Erlang
§ map, farm, reduce, pipeline, feedback
§ instantiated using skel:run
§ fully nestable
§ A DSL for parallelism

    OutputItems = skel:run(Skeleton, InputItems).

chrisb.host.cs.st-andrews.ac.uk/skel.html
https://github.com/ParaPhrase/skel
22. The Parallel Pipeline Skeleton
§ Each stage of the pipeline can be executed in parallel
§ The input and output are streams

    {pipe, [Skel1, Skel2, ..., SkelN]}

    Tn ··· T1 → Skel1 → Skel2 → ··· → SkelN → Tn ··· T1

    skel:run([{pipe, [Skel1, Skel2, ..., SkelN]}], Inputs).

    Inc    = {seq, fun(X) -> X + 1 end},
    Double = {seq, fun(X) -> X * 2 end},
    skel:run({pipe, [Inc, Double]}, [1,2,3,4,5,6]).
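What the pipeline computes can be simulated sequentially; here is a small Python sketch of its semantics (illustrative names, not the Skel implementation — in Skel the stages run concurrently over a stream of tasks):

```python
def seq(f):
    """A sequential stage: apply the pure function f to each task."""
    return lambda stream: [f(x) for x in stream]

def pipe(stages):
    """Pipeline: each task flows through every stage in turn."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

inc    = seq(lambda x: x + 1)
double = seq(lambda x: x * 2)
result = pipe([inc, double])([1, 2, 3, 4, 5, 6])
# result == [4, 6, 8, 10, 12, 14], matching the Erlang example above
```

The parallel version computes the same list; parallelism only overlaps the stages in time, feeding task T2 into Inc while T1 is already in Double.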
23. The Farm Skeleton
§ Each worker is executed in parallel
§ A bit like a 1-stage pipeline

    {farm, Skel, M}

    Tn ··· T1 → [Skel1 | Skel2 | ... | SkelM] → Tn ··· T1

    skel:do([{farm, Skel, M}], Inputs).

    Inc = {seq, fun(X) -> X + 1 end},
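A matching sequential Python sketch of the farm’s semantics (again illustrative, not the Skel code): M workers each apply the same operation to tasks dealt out from the input stream, and the results are merged back into input order.

```python
def farm(worker, m):
    """Farm: m copies of 'worker' consume tasks dealt round-robin;
    results are merged back into the original input order."""
    def run(stream):
        # deal tasks to the m workers
        queues = [stream[i::m] for i in range(m)]
        # each worker maps over its own queue (in parallel, in Skel)
        done = [[worker(x) for x in q] for q in queues]
        # merge the per-worker results back into input order
        out = [None] * len(stream)
        for i in range(m):
            out[i::m] = done[i]
        return out
    return run

inc_farm = farm(lambda x: x + 1, 3)
# inc_farm([1, 2, 3, 4, 5, 6]) == [2, 3, 4, 5, 6, 7]
```

Because the worker is pure, the task-to-worker assignment is invisible in the result, which is why the farm behaves like a one-stage pipeline replicated M times.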
24. Using the Right Pattern Matters
[Figure: Speedups for Matrix Multiplication, up to 24 cores — Naive Parallel vs. Farm vs. Farm with Chunk 16]
26. Refactoring
§ Refactoring changes the structure of the source code
§ using well-defined rules
§ semi-automatically, under programmer guidance
27. Refactoring: Farm Introduction

S1 ; S2  ≡  Pipe(S1, S2)                                 (pipe seq)
Map(S1 ∘ S2, d, r)  ≡  Map(S1, d, r) ∘ Map(S2, d, r)     (map fission/fusion)
S  ≡  Farm(S)                                            (farm intro/elim)
Map(F, d, r)  ≡  Pipe(Decomp(d), Farm(F), Recomp(r))     (data2stream)
S1  ≡  Map(S1′, d, r)                                    (map intro/elim)

Figure 3.3: Some Standard Skeleton Equivalences

The following describes each of the patterns in turn:
• a MAP is made up of three OPERATIONs: a worker, a partitioner, and a combiner, followed by an INPUT;
• a SEQ is made up of a single OPERATION denoting the sequential computation to be performed, followed by an INPUT;
• a FARM is made up of a single OPERATION denoting the worker, and an INPUT
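The farm intro/elim rule S ≡ Farm(S) says that wrapping a pure stage in a farm changes only the schedule, not the result. That can be checked on a small self-contained Python simulation (an illustrative encoding of the two sides of the rule, not the refactoring tool’s machinery):

```python
def run_seq(f, stream):
    """S: apply the pure operation f to every task in the stream."""
    return [f(x) for x in stream]

def run_farm(f, m, stream):
    """Farm(S): deal tasks round-robin to m workers and merge the
    per-worker results back into input order."""
    queues = [stream[i::m] for i in range(m)]
    done = [[f(x) for x in q] for q in queues]
    out = [None] * len(stream)
    for i in range(m):
        out[i::m] = done[i]
    return out

# farm intro/elim: for a pure stage, any farm size computes the
# same stream as the sequential stage
stream = list(range(20))
f = lambda x: x * x
assert all(run_farm(f, m, stream) == run_seq(f, stream)
           for m in (1, 2, 3, 7))
```

This is exactly what makes the rule safe to apply mechanically during refactoring: the tool can introduce or eliminate a farm without re-verifying the program’s observable behaviour.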
33. Large-Scale Demonstrator Applications
§ ParaPhrase tools are being used by commercial/end-user partners
§ SCCH (SME, Austria)
§ Erlang Solutions Ltd (SME, UK)
§ Mellanox (Israel)
§ ELTE-Soft, Hungary (SME)
§ AGH (University, Poland)
§ HLRS (High Performance Computing Centre, Germany)
34. Speedup Results (demonstrators)
[Figure: Speedups for Ant Colony, BasicN2 and Graphical Lasso, refactored vs. manual versions, for 1–24 workers]
Speedup close to or better than manual optimisation
35. Bowtie2: most widely used DNA alignment tool
[Figures: Speedup of the ParaPhrase version (Bt2FF-pin+int) vs. the original (Bt2), plotted against read length (20–110) and against quality (28–40)]
C. Misale. Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity. IEEE PDP 2014. To appear.
36. Comparison of Development Times

Use case        | Man. Time | Refac. Time | LOC Intro.
Convolution     | 3 days    | 3 hours     | 58
Ant Colony      | 1 day     | 1 hour      | 32
BasicN2         | 5 days    | 5 hours     | 40
Graphical Lasso | 15 hours  | 2 hours     | 53

Figure 3. Approximate manual implementation time of use-cases vs. refactoring time with lines of code introduced by refactoring tool

…linear scaling for higher numbers of cores, because of cache synchronisation (disjunct but interleaving memory regions are updated in the tasks), and an uneven size combined with a limited number of tasks (48). At the end of the computation, some cores will wait idly for the completion of remaining…
38. Example: Enumerate Skeleton Configurations for Image Convolution
r : read image file
p : process image file
Configurations: r p, r || p, Δ(r) p, r Δ(p), Δ(r) Δ(p), Δ(r p), r || Δ(p)
39. Results on Benchmark: Image Convolution
MCTS Mapping (C, G): (6, 0) || (0, 3), Speedup 39.12
Best Speedup: 40.91
40. Conclusions
§ The manycore revolution is upon us
§ Computer hardware is changing very rapidly (more than in the last 50 years)
§ The megacore era is here (aka exascale, BIG data)
§ Heterogeneity and energy are both important
§ Most programming models are too low-level
§ concurrency based
§ need to expose mass parallelism
§ Patterns and functional programming help with abstraction
§ millions of threads, easily controlled
41. Conclusions (2)
§ Functional programming makes it easy to introduce parallelism
§ no side effects means any computation could be parallel
§ matches pattern-based parallelism
§ much detail can be abstracted
§ Lots of problems can be avoided
§ e.g. freedom from deadlock
§ parallel programs give the same results as sequential ones!
§ Automation is very important
§ refactoring dramatically reduces development time (while keeping the programmer in the loop)
§ machine learning is very promising for determining complex performance settings
42. But isn’t this all just wishful thinking?
Rampant-Lambda-Men in St Andrews
43. NO!
§ C++11 has lambda functions (and some other nice functional-inspired features)
§ Java 8 will have lambdas (closures)
§ Apple uses closures in Grand Central Dispatch
44. ParaPhrase Parallel C++ Refactoring
§ Integrated into Eclipse
§ Supports the full C++(11) standard
§ Uses strongly hygienic components
§ functional encapsulation (closures)
48. Funded by
• ParaPhrase (EU FP7), Patterns for heterogeneous multicore, €4.2M, 2011-2014
• SCIEnce (EU FP6), Grid/Cloud/Multicore coordination, €3.2M, 2005-2012
• Advance (EU FP7), Multicore streaming, €2.7M, 2010-2013
• HPC-GAP (EPSRC), Legacy system on thousands of cores, £1.6M, 2010-2014
• Islay (EPSRC), Real-time FPGA streaming implementation, £1.4M, 2008-2011
• TACLE: European Cost Action on Timing Analysis, €300K, 2012-2015
49. Some of our Industrial Connections
Mellanox Inc.
Erlang Solutions Ltd
SAP GmbH, Karlsruhe
BAe Systems
Selex Galileo
BioId GmbH, Stuttgart
Philips Healthcare
Software Competence Centre, Hagenberg
Microsoft Research
Well-Typed LLC
50. ParaPhrase Needs You!
• Please join our mailing list and help grow our user community
§ news items
§ access to free development software
§ chat to the developers
§ free developer workshops
§ bug tracking and fixing
§ tools for both Erlang and C++
• Subscribe at https://mailman.cs.st-andrews.ac.uk/mailman/listinfo/paraphrase-news
• We’re also looking for open source developers...
• We also have 8 PhD studentships...
51. Further Reading
Chris Brown, Vladimir Janjic, Kevin Hammond, Mehdi Goli and John McCall. “Bridging the Divide: Intelligent Mapping for the Heterogeneous Parallel Programmer”. Submitted to IPDPS 2014.

Chris Brown, Marco Danelutto, Kevin Hammond, Peter Kilpatrick and Sam Elliot. “Cost-Directed Refactoring for Parallel Erlang Programs”. To appear in International Journal of Parallel Programming, 2013.

Vladimir Janjic, Chris Brown, Max Neunhoffer, Kevin Hammond, Steve Linton and Hans-Wolfgang Loidl. “Space Exploration using Parallel Orbits”. Proc. PARCO 2013: International Conf. on Parallel Computing, Munich, Sept. 2013.

Chris Brown, Hans-Wolfgang Loidl and Kevin Hammond. “ParaForming: Forming Parallel Haskell Programs using Novel Refactoring Techniques”. Proc. 2011 Trends in Functional Programming (TFP), Madrid, Spain, May 2011.

Henrique Ferreiro, David Castro, Vladimir Janjic and Kevin Hammond. “Repeating History: Execution Replay for Parallel Haskell Programs”. Proc. 2012 Trends in Functional Programming (TFP), St Andrews, UK, June 2012.

Ask me for copies! Many technical results are also on the project web site, free for download.