ESUG 2017
Video: https://youtu.be/yDKaHphbFow
At ESUG in Cambridge I introduced Sista, an optimizing JIT design for the Pharo VM. The current implementation now runs 1.5x faster on production applications, and up to 5x faster on specific benchmarks, than the production Pharo VM. In this talk, I will present the overall optimization pipeline and I will try to show the myriad of implementation details, including the interaction between Sista and other optimizations (context-to-stack mapping, closure optimizations, ...), pathological code patterns, and the problems related to stack deoptimization and closures.
Bio: Clement Bera implemented the Sista optimizing JIT in the Cog VM for Pharo. He has worked for 5 years with Eliot Miranda on improving the Cog VM.
21. Run

Run number  | Compilation time | Execution time
1 to 6      |                  | Slow
7           | Low              |
8 to 10,000 |                  | Average
10,000      | High             |
10,001+     |                  | Fast
Tier           | Speculations   | Executes
Interpreter    | Always correct | Bytecode
baseline JIT   | Always correct | Bytecode → Native
optimising JIT | Speculations   | Bytecode → Bytecode → Native
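The tiered promotion in the run-number table can be sketched as a simple hotness counter. This is an illustrative Python sketch, not the Cog/Sista implementation; the class, tier names, and the exact thresholds (7 and 10,000) are taken from the table above but simplified.

```python
# Illustrative sketch: a method is promoted through execution tiers once
# run-count thresholds are crossed, mirroring the run-number table above.

BASELINE_THRESHOLD = 7         # baseline JIT kicks in on run 7
OPTIMISING_THRESHOLD = 10_000  # optimising JIT kicks in around run 10,000

class Method:
    def __init__(self, name):
        self.name = name
        self.runs = 0
        self.tier = "interpreter"

    def run(self):
        self.runs += 1
        if self.tier == "interpreter" and self.runs >= BASELINE_THRESHOLD:
            self.tier = "baseline-jit"    # bytecode -> native
        elif self.tier == "baseline-jit" and self.runs >= OPTIMISING_THRESHOLD:
            self.tier = "optimising-jit"  # bytecode -> bytecode -> native
        return self.tier

m = Method("fib")
tiers = [m.run() for _ in range(10_001)]
print(tiers[0], tiers[6], tiers[9_999], tiers[10_000])
# → interpreter baseline-jit optimising-jit optimising-jit
```

The slow first runs pay no compilation cost; the one high-cost compilation happens only after the method has proven hot.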
49. Decompilation
Bytecode to Scorch IR (an SSA IR over a CFG)
LLVM-style IR, but higher level, with deoptimisation info
Annotations using send and branch data
50. Decompilation
Basic block ordering: reverse post order
(CFG diagram: basic blocks numbered 1–10 in reverse post order)
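Reverse post order itself is a short depth-first traversal. Below is a minimal sketch over a CFG given as an adjacency dict; the block names and edges are made up for illustration and are not the graph from the slide.

```python
# Minimal reverse-post-order numbering over a CFG (adjacency dict).
def reverse_post_order(cfg, entry):
    order, seen = [], set()
    def dfs(block):
        seen.add(block)
        for succ in cfg[block]:
            if succ not in seen:
                dfs(succ)
        order.append(block)   # post order: block appended after its successors
    dfs(entry)
    return order[::-1]        # reversing the post order gives RPO

# A small diamond followed by a loop (hypothetical graph):
cfg = {
    "entry": ["then", "else"],
    "then":  ["merge"],
    "else":  ["merge"],
    "merge": ["loop"],
    "loop":  ["merge", "exit"],   # back edge loop -> merge
    "exit":  [],
}
print(reverse_post_order(cfg, "entry"))
```

RPO guarantees that, back edges aside, every block is visited after its predecessors, which is what forward dataflow passes over the IR want.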
51. Decompilation
Basic block ordering: reverse post order, loop canonicalisation
(CFG diagram: basic blocks numbered 1–10 in reverse post order after loop canonicalisation)
54. Speculative inlining
What to inline?
Heuristics (closures, constants, escapes)
How to inline?
Non-local returns
Context access, exceptions, continuations …
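The "what to inline" decision can be caricatured as a check on the profiling data attached to a send site. Scorch's real heuristics are richer (closures, constants, escape information); in this hedged Python sketch, a send is inlined only when the recorded receiver types are monomorphic and the target is small. All names, fields, and the size threshold are illustrative assumptions.

```python
# Hypothetical speculative-inlining decision based on send-site type data.
MAX_INLINE_SIZE = 40  # assumed bytecode-size budget, not a real Scorch constant

def should_inline(send_site):
    types = send_site["receiver_types"]   # types observed by the lower tiers
    if len(types) != 1:
        return False                      # polymorphic: don't speculate
    return send_site["target_size"] <= MAX_INLINE_SIZE

mono = {"receiver_types": ["SmallInteger"], "target_size": 12}
poly = {"receiver_types": ["SmallInteger", "Float"], "target_size": 12}
big  = {"receiver_types": ["Array"], "target_size": 300}
print(should_inline(mono), should_inline(poly), should_inline(big))
# → True False False
```

Inlining on a single observed receiver type is a speculation: if a different receiver shows up later, the optimised code must deoptimise, which is why the deoptimisation metadata mentioned earlier is kept alongside the IR.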
57. Back-end
From SSA to Bytecode (stack-based)
Liveness analysis and graph colouring
Generate patterns that are efficient for Cogit
(Smi comparison followed by branch, …)
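The liveness analysis feeding the graph colouring reduces, at its core, to a backward pass computing `live = (live - defs) | uses`. Here is a toy Python sketch over straight-line three-address code; the real back-end works on SSA across basic blocks, and the temporaries below are made up.

```python
# Backward liveness over straight-line code: each instruction is (defs, uses).
def liveness(instructions):
    live = set()
    live_after = []
    for defs, uses in reversed(instructions):
        live_after.append(frozenset(live))       # what is live after this instr
        live = (live - set(defs)) | set(uses)    # kill definitions, add uses
    return list(reversed(live_after))

# t1 := a + b ; t2 := t1 * c ; return t2   (hypothetical temporaries)
code = [(("t1",), ("a", "b")),
        (("t2",), ("t1", "c")),
        ((),      ("t2",))]
print(liveness(code))
```

Two temporaries whose live ranges never overlap can share a colour, i.e. the same stack slot or register in the generated bytecode.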
58. Installation
Where to install?
Copy down?
Dependency management
Optimisations such as inlining track dependencies
Register optimised method
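Dependency tracking for inlining can be sketched as a reverse map from inlined callees to the optimised methods that embed them; editing a callee discards its dependents. The data structures and method names below are assumptions for illustration, not the Pharo implementation.

```python
from collections import defaultdict

class DependencyManager:
    """Hypothetical tracker: callee -> optimised methods that inlined it."""
    def __init__(self):
        self.dependents = defaultdict(set)
        self.installed = set()

    def register(self, optimised, inlined_callees):
        self.installed.add(optimised)
        for callee in inlined_callees:
            self.dependents[callee].add(optimised)

    def method_changed(self, callee):
        # Discard every optimised method that inlined the edited callee.
        for optimised in self.dependents.pop(callee, set()):
            self.installed.discard(optimised)

dm = DependencyManager()
dm.register("Point>>area (opt)", ["Number>>*", "Point>>x"])
dm.method_changed("Point>>x")   # editing Point>>x invalidates the optimised method
print(dm.installed)
# → set()
```

This is why "loading / editing code" can trigger a discard further down: the optimised method has baked assumptions about its callees into its own body.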
62. Register allocation
A cheap heuristic was enough for the baseline JIT
We now need a proper linear scan allocator
First version moved to production
Second version with control flow management
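A minimal linear scan allocator (Poletto & Sarkar style, without the control-flow handling the second version adds) fits in a few lines. This Python sketch uses made-up live intervals and a naive spill rule (the newcomer spills), which is simpler than a production allocator.

```python
# Linear scan over precomputed live intervals: {name: (start, end)}.
def linear_scan(intervals, num_regs):
    allocation, active = {}, []          # active: (end, name), kept sorted
    free = list(range(num_regs))
    for name, (start, end) in sorted(intervals.items(), key=lambda i: i[1][0]):
        # Expire intervals that ended before this one starts.
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(allocation[n])
        if free:
            allocation[name] = free.pop()
            active.append((end, name))
            active.sort()
        else:
            allocation[name] = "spill"   # naive: real linear scan spills the
                                         # interval with the furthest end
    return allocation

alloc = linear_scan({"a": (0, 4), "b": (1, 3), "c": (2, 6), "d": (5, 7)}, 2)
print(alloc)
```

With two registers, `a` and `b` get distinct registers, `c` arrives while both are busy and spills, and `d` reuses a register freed when `a` and `b` expire.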
69. Discard
If multiple deoptimisations occur on the same code,
discard the optimised code
Scorch may reoptimise it differently next time
Discard can also happen when loading / editing code
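The discard policy amounts to a per-method deoptimisation counter with a cut-off, after which the code is thrown away so Scorch can reoptimise it differently on the next hot run. The class, names, and the threshold value in this Python sketch are illustrative, not the actual mechanism.

```python
DEOPT_LIMIT = 2  # assumed cut-off for illustration

class OptimisedMethod:
    def __init__(self, name):
        self.name = name
        self.deopt_count = 0
        self.discarded = False

    def deoptimise(self):
        self.deopt_count += 1
        if self.deopt_count >= DEOPT_LIMIT:
            self.discarded = True   # next run reoptimises from scratch

m = OptimisedMethod("sortBlock (opt)")
m.deoptimise()
first = m.discarded   # one deopt: keep the code
m.deoptimise()
print(first, m.discarded)
# → False True
```

Keeping the code after a single deoptimisation avoids throwing away work on a one-off cold path; repeated deoptimisations signal that the speculations were wrong for this workload.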
70. Others
Improved closure implementation
Avoids the outerContext issues of the old implementation
Decreases metadata
Write barrier (Read-only objects)
Immutable literals
Compiler informed of object mutation (Global, …)
…
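The write barrier on read-only objects boils down to checking an immutability bit on every field store and signalling before mutating, which is what makes immutable literals enforceable and lets the compiler be told about mutation. The following is a pure Python analogy of that check, not the VM mechanism.

```python
class ModificationForbidden(Exception):
    """Raised when a store hits a read-only object (analogy of the VM signal)."""

class Obj:
    def __init__(self, read_only=False):
        object.__setattr__(self, "_read_only", read_only)

    def __setattr__(self, name, value):
        if self._read_only:                 # the write-barrier check
            raise ModificationForbidden(name)
        object.__setattr__(self, name, value)

literal = Obj(read_only=True)
try:
    literal.x = 1
    outcome = "mutated"
except ModificationForbidden:
    outcome = "trapped"
print(outcome)
# → trapped
```

In the VM the check is a cheap bit test performed by jitted stores, so ordinary mutable objects pay almost nothing for it.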
71. Research directions
Warm-up time to reach peak performance
Sista: Persistence of bytecode
Metacircular optimising JIT
On top of existing C VM
72. Conclusion & Questions
Readable and performant code.
Overall performance boost.
Alpha release: Sista works.
Moving to production is a long task.
High complexity,
Many details and edge cases,
We made it work.