Most high-level programming languages run on top of a virtual machine (VM) to abstract away the underlying hardware. To reach high performance, the VM typically relies on an optimising just-in-time (JIT) compiler, which speculates on the program's behaviour based on its first runs in order to generate efficient machine code at runtime and speed up execution. As multiple runs are required to speculate correctly on the program's behaviour, such a VM needs a certain amount of time after start-up to reach peak performance. The optimising JIT itself is usually compiled ahead-of-time to executable code as part of the VM.
The dissertation proposes Sista, an architecture for an optimising JIT in which the optimised state of the VM can be persisted across multiple VM start-ups and the optimising JIT runs in the same runtime as the program it optimises. To do so, the optimising JIT is split into two parts. One part is high-level: it performs optimisations specific to the programming language run by the VM and is written in a metacircular style. Staying away from low-level details, this part can be read, edited and debugged while the program is running, using the standard tool set of the language executed by the VM. The second part is low-level: it performs machine-specific optimisations and is compiled ahead-of-time to executable code as part of the VM. The two parts of the JIT share the code to optimise through a well-defined intermediate representation. This representation is machine-independent and can be persisted across multiple VM start-ups, allowing the VM to reach peak performance very quickly.
To validate the architecture, the dissertation includes the description of an implementation on top of Pharo Smalltalk and its VM. The implementation runs a large set of benchmarks, from large application benchmarks provided by industrial users to micro-benchmarks that measure the performance of specific code patterns. The optimising JIT, implemented according to the proposed architecture, shows significant speed-ups (up to 5x) over the current production VM. In addition, the large benchmarks show that peak performance can be reached almost immediately after VM start-up if the VM can reuse the optimised state persisted from another run.
2. Ph.D Setup
International collaboration
Goal: transfer new knowledge to RMoD
France (32 months): RMoD researchers Stéphane Ducasse and Marcus Denker (Inria)
California (4 months): domain expert Eliot Miranda (Stellect Systems LLC)
20. Tiered architecture

Run number    Compilation time   Execution time   Tier
1 to 6        -                  Slow             Interpreter
7             Low                -                baseline JIT
8 to 10,000   -                  Average          baseline JIT
10,000        High               -                optimising JIT
10,001+       -                  Fast             optimising JIT

Speculations: always correct (in this scenario).
(Diagram: each tier takes bytecode as input; the JIT tiers additionally produce native code.)
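The progression in the table can be sketched as a toy tier-promotion rule. This is only an illustration in Python: the thresholds 6 and 10,000 come from the table above, while the function and constant names are invented.

```python
# Toy model of tiered execution: a function is interpreted for its first
# runs, baseline-JIT-compiled on run 7, and recompiled by the optimising
# JIT once it has run 10,000 times. Thresholds follow the table above.
INTERPRETER, BASELINE_JIT, OPTIMISING_JIT = "Interpreter", "baseline JIT", "optimising JIT"

def tier_for_run(run_number):
    """Return the tier that executes the given run of a function."""
    if run_number <= 6:
        return INTERPRETER
    if run_number <= 10_000:
        return BASELINE_JIT
    return OPTIMISING_JIT

assert tier_for_run(1) == INTERPRETER
assert tier_for_run(7) == BASELINE_JIT
assert tier_for_run(10_001) == OPTIMISING_JIT
```

The point of the table survives in the model: the expensive optimising compilation only happens after thousands of runs, which is exactly the start-up cost Sista removes by persisting the optimised state.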
24. Existing proposal
Adaptive Optimiser for Smalltalk Architecture
Eliot Miranda, with contributions from Paolo Bonzini, Steve Dahl, David Griswold, Urs Hölzle, Ian Piumarta and David Simmons
25. Key ideas
Build on top of the existing VM
Split optimising JIT design
26. Build on top of existing VM
Market acceptance
Lower risk and investment
27. Split optimising JIT design
(Diagram: part 1 optimises bytecode to bytecode; part 2 compiles bytecode to native code.)
Lowering maintenance cost:
Part 1 is written in Smalltalk, so more open-source contributors can work on it.
Part 2 reuses the baseline JIT, so there is only one back-end to maintain.
32. Non-optimising tiers are slower to execute code
Runtime compilation time
Deoptimisation and re-optimisation
(Same run-number/tier table as slide 20: the time to reach peak performance spans the first 10,000 runs.)
34. Existing solutions
Many-tiers architecture: baseline JIT; 4 tiers in JavaScript WebKit
Cloneable JVMs: work of Kawachiya et al.; clone a running JVM, including native code
35. Existing solutions
Persistence of runtime information: saving inlining decisions (Strongtalk); shared profiling information between VMs (Arnold et al.)
Persistence of machine code: Azul, JRockit
43. Smalltalk runtime
Cogit (baseline JIT, part of the virtual machine): compiles v-functions to n-functions; machine-specific optimisations.
Scorch (optimising JIT): compiles non-optimised v-functions to an optimised v-function; Smalltalk-specific optimisations.
Virtual functions are persisted across start-ups; native functions are discarded on shut-down.
45. Smalltalk runtime
(Same diagram as slide 43, annotated with attribution: "My work alone" and "My work in collaboration with Stellect Systems LLC".)
46. Optimising JIT
Pipeline across the Smalltalk image and the virtual machine:
hot spot detection (Cogit) → what to optimise (Scorch) → optimisation → installation → optimised virtual function, then compiled to an optimised native function.
47. Hot spot detection
Profiling counters on branches
Counters stored in a pinned ByteArray on the heap:
no CPU i-cache flush
direct reference to the counter
Detection: VM call-back to Scorch
(Figure: a non-optimised native function referencing counters 1A/1B and 2A/2B inside a pinned ByteArray.)
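The detection scheme above can be sketched in a few lines. This is a Python illustration, not the VM's code: the class, the callback wiring and the threshold value are invented; in the real VM the counters live in a pinned ByteArray referenced directly from native code.

```python
# Sketch of hot spot detection via branch-profiling counters: the
# baseline JIT gives each conditional branch a counter, and the VM
# calls back into the optimiser (Scorch in the slides) once a counter
# trips a threshold. THRESHOLD is a hypothetical value.
THRESHOLD = 65_535

class BranchCounters:
    def __init__(self, n_branches, on_hot_spot):
        self.counters = [0] * n_branches   # stands in for the pinned ByteArray
        self.on_hot_spot = on_hot_spot     # VM call-back into the optimiser

    def branch_taken(self, index):
        self.counters[index] += 1
        if self.counters[index] >= THRESHOLD:
            self.on_hot_spot(index)        # detection: call back to Scorch
            self.counters[index] = 0       # reset after reporting

hot = []
counters = BranchCounters(1, hot.append)
for _ in range(THRESHOLD):
    counters.branch_taken(0)
assert hot == [0]
```

Keeping the counters in ordinary heap memory rather than in the generated code is what avoids the i-cache flush mentioned above: updating a counter never touches executable memory.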
48. What to optimise?
Example >> exampleArrayLoop
	array do: [ :item | item displayOnScreen ]

Array >> do: aClosure
	1 to: self size do: [ :index |
		aClosure value: (self at: index) ]
49. What to optimise?
(Same code; the stack, growing down: Example >> exampleArrayLoop, Array >> do:, [ :item | item displayOnScreen ].)
50. What to optimise?
(Same stack; the hot spot is detected on a hidden branch, the compiled loop inside Array >> do:.)
51. What to optimise?
(Same stack; the method to optimise is chosen from the frames around the detected hot spot: here Example >> exampleArrayLoop.)
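The choice made on slides 48-51 can be sketched as a small stack walk. Python illustration only: the frame representation and the `max_walk` parameter are invented; the idea shown is that the optimiser walks outwards from the frame holding the tripped counter to pick a root method into which the loop and the closure can both be inlined.

```python
# The stack at detection time, innermost frame last, as in the slides:
# the hot branch is the loop in Array>>do:, activated by the example
# method, and the closure is the innermost activation.
stack = ["Example>>exampleArrayLoop", "Array>>do:", "[:item | ...]"]

def method_to_optimise(stack, max_walk=2):
    # Walk at most max_walk frames outwards from the hot frame, so the
    # loop and the closure it activates end up in one optimised function.
    index = max(0, len(stack) - 1 - max_walk)
    return stack[index]

assert method_to_optimise(stack) == "Example>>exampleArrayLoop"
```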
54. Installation
Where to install? Copy down.
(Figure: the Collection hierarchy: SequenceableCollection with Array and OrderedCollection; HashedCollection with Dictionary and Set. The optimised method is copied down rather than installed in a superclass.)
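The "copy down" question can be made concrete with a toy class model. This is an illustration in Python, not the Pharo implementation: method dictionaries are plain dicts and the selector is hard-coded. The point is that an optimised method speculated for Array must not be inherited by sibling subclasses of Collection, so it is installed only in the class it was validated for.

```python
# Toy method dictionaries for part of the Collection hierarchy.
method_dicts = {
    "Collection": {"do:": "generic do:"},
    "Array": {},       # inherits do: from Collection
    "Set": {},         # inherits do: from Collection
}

def copy_down(optimised_method, target_class, method_dicts):
    # Install the optimised method only in the target class; sibling
    # classes keep looking up the inherited, unoptimised version.
    method_dicts[target_class]["do:"] = optimised_method

copy_down("optimised do: (speculated for Array)", "Array", method_dicts)
assert method_dicts["Array"]["do:"].startswith("optimised")
assert "do:" not in method_dicts["Set"]          # Set is unaffected
assert method_dicts["Collection"]["do:"] == "generic do:"
```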
58. New operations
• unsafe array access
• arithmetic without overflow checks
• inlined allocation
• efficient type-checks
• …
A bytecode set for adaptive optimizations, IWST’14 (Best Paper Award)
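What "unsafe array access" buys can be illustrated by contrasting a fully checked access with the unchecked one the optimiser may emit once its speculations hold. Python stands in for bytecode here; the function names are invented and a Python list stands in for a Smalltalk Array, but the checked/unchecked split is the point.

```python
# Checked access: what a generic at: send must verify at run time.
def checked_at(receiver, index):
    if not isinstance(receiver, list):
        raise TypeError("doesNotUnderstand: #at:")
    if not 1 <= index <= len(receiver):
        raise IndexError("index out of bounds")
    return receiver[index - 1]           # Smalltalk indices are 1-based

# Unsafe access: emitted only under guards the optimiser has already
# proven (receiver is an Array, index in bounds), so no run-time checks.
def unsafe_at(receiver, index):
    return receiver[index - 1]

data = [10, 20, 30]
assert checked_at(data, 2) == unsafe_at(data, 2) == 20
```

Inside an inlined loop such as the `Array>>do:` example, the bounds check is subsumed by the loop condition, so every iteration can use the unsafe form.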
68. Memory Manager
Old-style memory representation → improved memory representation
Old-style GC → efficient scavenger
A Partial Read Barrier for Efficient Support of Live Object-oriented Programming, ISMM’15 (top conference)
71. Literal Mutability
Read-only objects: any object can change mutability state; hook before object mutation.
Literals are now read-only; the optimiser is notified upon mutation.
A low Overhead Per Object Write Barrier for the Cog VM, IWST’16
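The per-object write barrier can be sketched with attribute interception. This is a Python illustration under invented names (`Obj`, `read_only`, `notify_optimiser`), not the Cog VM's API: a per-object flag marks literals read-only, and a hook fires before any mutation so the optimiser can discard code that speculated on the literal's value.

```python
notifications = []

def notify_optimiser(obj):
    # Stand-in for the VM hook: code speculating on obj must be invalidated.
    notifications.append(obj)

class Obj:
    def __init__(self, value):
        object.__setattr__(self, "value", value)
        object.__setattr__(self, "read_only", False)   # per-object mutability bit

    def __setattr__(self, name, value):
        if self.read_only:
            notify_optimiser(self)       # hook runs before the mutation
        object.__setattr__(self, name, value)

literal = Obj(42)
object.__setattr__(literal, "read_only", True)   # freeze the literal
literal.value = 43                   # the write still happens...
assert notifications == [literal]    # ...but the optimiser was told first
```

The design choice mirrored here is that mutation is intercepted, not forbidden: Smalltalk code may still change mutability state, and the barrier's job is only to keep the optimiser's assumptions honest.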
72. Making it work
Tools needed: debugging tools, VM simulator, accurate VM profiler.
Accurate VM profiler for the Cog VM, IWST’17 (Best Paper Award)
85. Benchmarks
Cog: production VM
Cog+Counters: production VM with profiling counters enabled
Sista (Cold): Sista runtime starting from non-optimised v-functions
Sista (Warm): Sista runtime starting from optimised v-functions
92. Deoptimiser validation
Practical Validation of Bytecode to Bytecode JIT Compiler Dynamic Deoptimization, JOT’16
Good validation = good symbolic execution = high engineering time
(Figure: the symbolic optimised stack is fed through the Scorch deoptimiser to produce a symbolic deoptimised stack, whose symbolic values are compared against the symbolic non-optimised stack.)
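The round-trip being validated can be sketched in miniature. This is a Python illustration, not the JOT'16 scheme itself: real validation symbolically executes the functions, whereas here the "stacks" are plain lists and the optimise/deoptimise pair is a trivial stand-in that collapses two frames into one and back.

```python
# Toy optimise/deoptimise pair: optimisation inlines two frames into a
# single optimised frame; the deoptimiser must reconstruct the original
# frames exactly, which is what the validation compares.
def optimise(stack):
    return [tuple(stack)]            # inlined frames collapse into one

def deoptimise(optimised_stack):
    return list(optimised_stack[0])  # Scorch deoptimiser's job: rebuild frames

symbolic_stack = ["frame: exampleArrayLoop", "frame: do:"]
assert deoptimise(optimise(symbolic_stack)) == symbolic_stack
```

The property checked, `deoptimise(optimise(s)) == s` on symbolic stacks, is exactly the comparison in the figure; the engineering cost quoted above comes from making the symbolic execution precise enough for that equality to be meaningful.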
93. Runtime information
Runtime information for type inference (collaboration with SCG, Oscar Nierstrasz)
Inferring Types by Mining Class Usage Frequency from Inline Caches, IWST’16
Mining Inline Cache Data to Order Inferred Types in Dynamic Languages, SCP’17
95. Contributions
1. Hot spot detection in Cogit
2. Support of extended instruction set
3. VM call-backs to trigger Scorch
4. Runtime information primitive
5. Scorch
6. Spur Memory Manager
7. New bytecode set
8. Register allocation
9. Read-only objects
10. Improved closure implementation
96. Contributions: production status
The ten contributions above are each marked either as in production ("Prod") or as "In progress"; five of the ten are in progress.
99. Publications
Conferences and journals:
1. A Partial Read Barrier for Efficient Support of Live Object-oriented Programming, ISMM’15 (top conference)
2. Practical Validation of Bytecode to Bytecode JIT Compiler Dynamic Deoptimization, JOT’16
3. Mining Inline Cache Data to Order Inferred Types in Dynamic Languages, SCP’17
4. Sista: Saving Optimized Code in Snapshots for Fast Start-Up, ManLang’17 (core of the thesis)
Workshops:
5. Towards a flexible Pharo Compiler, IWST’13
6. A bytecode set for adaptive optimizations, IWST’14 (Best Paper Award)
7. Inferring Types by Mining Class Usage Frequency from Inline Caches, IWST’16
8. A low Overhead Per Object Write Barrier for the Cog VM, IWST’16
9. Accurate VM profiler for the Cog VM, IWST’17 (Best Paper Award)
103. Conclusion
(The Smalltalk runtime diagram from slide 43: Cogit, the baseline JIT, compiles v-functions to n-functions with machine-specific optimisations; Scorch, the optimising JIT, compiles non-optimised v-functions to optimised v-functions with Smalltalk-specific optimisations; virtual functions are persisted across start-ups, native functions are discarded on shut-down.)
1. Split optimising JIT architecture
2. Persistence of optimised functions
3. Metacircular JIT optimising itself
A Partial Read Barrier for Efficient Support of Live Object-oriented Programming, ISMM’15
Sista: Saving Optimized Code in Snapshots for Fast Start-Up, ManLang’17
A bytecode set for adaptive optimizations, IWST’14, Best Paper Award
Accurate VM profiler for the Cog VM, IWST’17, Best Paper Award