Most high-level programming languages run on top of a virtual machine (VM) to abstract away the underlying hardware. To reach high performance, the VM typically relies on an optimising just-in-time (JIT) compiler, which speculates on the program's behaviour based on its first runs in order to generate efficient machine code at runtime and speed up execution. As multiple runs are required to speculate correctly on the program's behaviour, such a VM needs a certain amount of time after start-up to reach peak performance. The optimising JIT itself is usually compiled ahead-of-time to executable code as part of the VM.
The dissertation proposes Sista, an architecture for an optimising JIT in which the optimised state of the VM can be persisted across multiple VM start-ups and the optimising JIT runs in the same runtime as the program it optimises. To do so, the optimising JIT is split into two parts. One part is high-level: it performs optimisations specific to the programming language run by the VM and is written in a metacircular style. Staying away from low-level details, this part can be read, edited and debugged while the program is running, using the standard tool set of the language executed by the VM. The second part is low-level: it performs machine-specific optimisations and is compiled ahead-of-time to executable code as part of the VM. The two parts of the JIT share the code to optimise through a well-defined intermediate representation. This representation is machine-independent and can be persisted across multiple VM start-ups, allowing the VM to reach peak performance very quickly.
To validate the architecture, the dissertation includes the description of an implementation on top of Pharo Smalltalk and its VM. The implementation runs a large set of benchmarks, from large application benchmarks provided by industrial users to micro-benchmarks that measure the performance of specific code patterns. The optimising JIT, implemented according to the proposed architecture, shows significant speed-ups (up to 5x) over the current production VM. In addition, the large benchmarks show that peak performance can be reached almost immediately after VM start-up if the VM can reuse the optimised state persisted from another run.
2. Ph.D Setup
International collaboration
Goal: transfer new knowledge to RMoD
France (32 months): RMoD researchers Stéphane Ducasse and Marcus Denker (Inria)
California (4 months): domain expert Eliot Miranda (Stellect Systems LLC)
20. Tiered architecture

Run number    Compilation time   Execution time   Tier
1 to 6        -                  Slow             Interpreter
7             Low                -                baseline JIT
8 to 10,000   -                  Average          baseline JIT
10,000        High               -                optimising JIT
10,001+       -                  Fast             optimising JIT

Speculations: always correct (in this scenario).
(Diagram: each tier takes bytecode as input; the JIT tiers additionally produce native code.)
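The progression in the table can be sketched as a toy tier-promotion rule. This is only an illustration in Python: the thresholds 6 and 10,000 come from the table above, while the function and constant names are invented.

```python
# Toy model of tiered execution: a function is interpreted for its first
# runs, baseline-JIT-compiled on run 7, and recompiled by the optimising
# JIT once it has run 10,000 times. Thresholds follow the table above.
INTERPRETER, BASELINE_JIT, OPTIMISING_JIT = "Interpreter", "baseline JIT", "optimising JIT"

def tier_for_run(run_number):
    """Return the tier that executes the given run of a function."""
    if run_number <= 6:
        return INTERPRETER
    if run_number <= 10_000:
        return BASELINE_JIT
    return OPTIMISING_JIT

assert tier_for_run(1) == INTERPRETER
assert tier_for_run(7) == BASELINE_JIT
assert tier_for_run(10_001) == OPTIMISING_JIT
```

The point of the table survives in the model: the expensive optimising compilation only happens after thousands of runs, which is exactly the start-up cost Sista removes by persisting the optimised state.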
24. Existing proposal
Adaptive Optimiser for Smalltalk Architecture
Eliot Miranda, with contributions from Paolo Bonzini, Steve Dahl, David Griswold, Urs Hölzle, Ian Piumarta and David Simmons
25. Key ideas
Build on top of the existing VM
Split optimising JIT design
26. Build on top of existing VM
Market acceptance
Lower risk and investment
27. Split optimising JIT design
(Diagram: part 1 optimises bytecode to bytecode; part 2 compiles bytecode to native code.)
Lowering maintenance cost:
Part 1 is written in Smalltalk, so more open-source contributors can work on it.
Part 2 reuses the baseline JIT, so there is only one back-end to maintain.
32. Non-optimising tiers are slower to execute code
Runtime compilation time
Deoptimisation and re-optimisation
(Same run-number/tier table as slide 20: the time to reach peak performance spans the first 10,000 runs.)
34. Existing solutions
Many-tiers architecture: baseline JIT; 4 tiers in JavaScript WebKit
Cloneable JVMs: work of Kawachiya et al.; clone a running JVM, including native code
35. Existing solutions
Persistence of runtime information: saving inlining decisions (Strongtalk); shared profiling information between VMs (Arnold et al.)
Persistence of machine code: Azul, JRockit
43. Smalltalk runtime
Cogit (baseline JIT, part of the virtual machine): compiles v-functions to n-functions; machine-specific optimisations.
Scorch (optimising JIT): compiles non-optimised v-functions to an optimised v-function; Smalltalk-specific optimisations.
Virtual functions are persisted across start-ups; native functions are discarded on shut-down.
45. Smalltalk runtime
(Same diagram as slide 43, annotated with attribution: "My work alone" and "My work in collaboration with Stellect Systems LLC".)
46. Optimising JIT
Pipeline across the Smalltalk image and the virtual machine:
hot spot detection (Cogit) → what to optimise (Scorch) → optimisation → installation → optimised virtual function, then compiled to an optimised native function.
47. Hot spot detection
Profiling counters on branches
Counters stored in a pinned ByteArray on the heap:
no CPU i-cache flush
direct reference to the counter
Detection: VM call-back to Scorch
(Figure: a non-optimised native function referencing counters 1A/1B and 2A/2B inside a pinned ByteArray.)
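The detection scheme above can be sketched in a few lines. This is a Python illustration, not the VM's code: the class, the callback wiring and the threshold value are invented; in the real VM the counters live in a pinned ByteArray referenced directly from native code.

```python
# Sketch of hot spot detection via branch-profiling counters: the
# baseline JIT gives each conditional branch a counter, and the VM
# calls back into the optimiser (Scorch in the slides) once a counter
# trips a threshold. THRESHOLD is a hypothetical value.
THRESHOLD = 65_535

class BranchCounters:
    def __init__(self, n_branches, on_hot_spot):
        self.counters = [0] * n_branches   # stands in for the pinned ByteArray
        self.on_hot_spot = on_hot_spot     # VM call-back into the optimiser

    def branch_taken(self, index):
        self.counters[index] += 1
        if self.counters[index] >= THRESHOLD:
            self.on_hot_spot(index)        # detection: call back to Scorch
            self.counters[index] = 0       # reset after reporting

hot = []
counters = BranchCounters(1, hot.append)
for _ in range(THRESHOLD):
    counters.branch_taken(0)
assert hot == [0]
```

Keeping the counters in ordinary heap memory rather than in the generated code is what avoids the i-cache flush mentioned above: updating a counter never touches executable memory.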
48. What to optimise?
Example >> exampleArrayLoop
	array do: [ :item | item displayOnScreen ]

Array >> do: aClosure
	1 to: self size do: [ :index |
		aClosure value: (self at: index) ]
49. What to optimise?
(Same code; the stack, growing down: Example >> exampleArrayLoop, Array >> do:, [ :item | item displayOnScreen ].)
50. What to optimise?
(Same stack; the hot spot is detected on a hidden branch, the compiled loop inside Array >> do:.)
51. What to optimise?
(Same stack; the method to optimise is chosen from the frames around the detected hot spot: here Example >> exampleArrayLoop.)
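The choice made on slides 48-51 can be sketched as a small stack walk. Python illustration only: the frame representation and the `max_walk` parameter are invented; the idea shown is that the optimiser walks outwards from the frame holding the tripped counter to pick a root method into which the loop and the closure can both be inlined.

```python
# The stack at detection time, innermost frame last, as in the slides:
# the hot branch is the loop in Array>>do:, activated by the example
# method, and the closure is the innermost activation.
stack = ["Example>>exampleArrayLoop", "Array>>do:", "[:item | ...]"]

def method_to_optimise(stack, max_walk=2):
    # Walk at most max_walk frames outwards from the hot frame, so the
    # loop and the closure it activates end up in one optimised function.
    index = max(0, len(stack) - 1 - max_walk)
    return stack[index]

assert method_to_optimise(stack) == "Example>>exampleArrayLoop"
```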
54. Installation
Where to install? Copy down.
(Figure: the Collection hierarchy: SequenceableCollection with Array and OrderedCollection; HashedCollection with Dictionary and Set. The optimised method is copied down rather than installed in a superclass.)
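The "copy down" question can be made concrete with a toy class model. This is an illustration in Python, not the Pharo implementation: method dictionaries are plain dicts and the selector is hard-coded. The point is that an optimised method speculated for Array must not be inherited by sibling subclasses of Collection, so it is installed only in the class it was validated for.

```python
# Toy method dictionaries for part of the Collection hierarchy.
method_dicts = {
    "Collection": {"do:": "generic do:"},
    "Array": {},       # inherits do: from Collection
    "Set": {},         # inherits do: from Collection
}

def copy_down(optimised_method, target_class, method_dicts):
    # Install the optimised method only in the target class; sibling
    # classes keep looking up the inherited, unoptimised version.
    method_dicts[target_class]["do:"] = optimised_method

copy_down("optimised do: (speculated for Array)", "Array", method_dicts)
assert method_dicts["Array"]["do:"].startswith("optimised")
assert "do:" not in method_dicts["Set"]          # Set is unaffected
assert method_dicts["Collection"]["do:"] == "generic do:"
```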
58. New operations
• unsafe array access
• arithmetic without overflow checks
• inlined allocation
• efficient type-checks
• …
A bytecode set for adaptive optimizations, IWST’14 (Best Paper Award)
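What "unsafe array access" buys can be illustrated by contrasting a fully checked access with the unchecked one the optimiser may emit once its speculations hold. Python stands in for bytecode here; the function names are invented and a Python list stands in for a Smalltalk Array, but the checked/unchecked split is the point.

```python
# Checked access: what a generic at: send must verify at run time.
def checked_at(receiver, index):
    if not isinstance(receiver, list):
        raise TypeError("doesNotUnderstand: #at:")
    if not 1 <= index <= len(receiver):
        raise IndexError("index out of bounds")
    return receiver[index - 1]           # Smalltalk indices are 1-based

# Unsafe access: emitted only under guards the optimiser has already
# proven (receiver is an Array, index in bounds), so no run-time checks.
def unsafe_at(receiver, index):
    return receiver[index - 1]

data = [10, 20, 30]
assert checked_at(data, 2) == unsafe_at(data, 2) == 20
```

Inside an inlined loop such as the `Array>>do:` example, the bounds check is subsumed by the loop condition, so every iteration can use the unsafe form.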
68. Memory Manager
Old-style memory representation → improved memory representation
Old-style GC → efficient scavenger
A Partial Read Barrier for Efficient Support of Live Object-oriented Programming, ISMM’15 (top conference)
71. Literal Mutability
Read-only objects: any object can change mutability state; hook before object mutation.
Literals are now read-only; the optimiser is notified upon mutation.
A low Overhead Per Object Write Barrier for the Cog VM, IWST’16
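The per-object write barrier can be sketched with attribute interception. This is a Python illustration under invented names (`Obj`, `read_only`, `notify_optimiser`), not the Cog VM's API: a per-object flag marks literals read-only, and a hook fires before any mutation so the optimiser can discard code that speculated on the literal's value.

```python
notifications = []

def notify_optimiser(obj):
    # Stand-in for the VM hook: code speculating on obj must be invalidated.
    notifications.append(obj)

class Obj:
    def __init__(self, value):
        object.__setattr__(self, "value", value)
        object.__setattr__(self, "read_only", False)   # per-object mutability bit

    def __setattr__(self, name, value):
        if self.read_only:
            notify_optimiser(self)       # hook runs before the mutation
        object.__setattr__(self, name, value)

literal = Obj(42)
object.__setattr__(literal, "read_only", True)   # freeze the literal
literal.value = 43                   # the write still happens...
assert notifications == [literal]    # ...but the optimiser was told first
```

The design choice mirrored here is that mutation is intercepted, not forbidden: Smalltalk code may still change mutability state, and the barrier's job is only to keep the optimiser's assumptions honest.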
72. Making it work
Tools needed: debugging tools, VM simulator, accurate VM profiler.
Accurate VM profiler for the Cog VM, IWST’17 (Best Paper Award)
85. Benchmarks
Cog: production VM
Cog+Counters: production VM with profiling counters enabled
Sista (Cold): Sista runtime starting from non-optimised v-functions
Sista (Warm): Sista runtime starting from optimised v-functions
92. Deoptimiser validation
Practical Validation of Bytecode to Bytecode JIT Compiler Dynamic Deoptimization, JOT’16
Good validation = good symbolic execution = high engineering time
(Figure: the symbolic optimised stack is fed through the Scorch deoptimiser to produce a symbolic deoptimised stack, whose symbolic values are compared against the symbolic non-optimised stack.)
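The round-trip being validated can be sketched in miniature. This is a Python illustration, not the JOT'16 scheme itself: real validation symbolically executes the functions, whereas here the "stacks" are plain lists and the optimise/deoptimise pair is a trivial stand-in that collapses two frames into one and back.

```python
# Toy optimise/deoptimise pair: optimisation inlines two frames into a
# single optimised frame; the deoptimiser must reconstruct the original
# frames exactly, which is what the validation compares.
def optimise(stack):
    return [tuple(stack)]            # inlined frames collapse into one

def deoptimise(optimised_stack):
    return list(optimised_stack[0])  # Scorch deoptimiser's job: rebuild frames

symbolic_stack = ["frame: exampleArrayLoop", "frame: do:"]
assert deoptimise(optimise(symbolic_stack)) == symbolic_stack
```

The property checked, `deoptimise(optimise(s)) == s` on symbolic stacks, is exactly the comparison in the figure; the engineering cost quoted above comes from making the symbolic execution precise enough for that equality to be meaningful.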
93. Runtime information
Runtime information for type inference (collaboration with SCG, Oscar Nierstrasz)
Inferring Types by Mining Class Usage Frequency from Inline Caches, IWST’16
Mining Inline Cache Data to Order Inferred Types in Dynamic Languages, SCP’17
95. Contributions
1. Hot spot detection in Cogit
2. Support of extended instruction set
3. VM call-backs to trigger Scorch
4. Runtime information primitive
5. Scorch
6. Spur Memory Manager
7. New bytecode set
8. Register allocation
9. Read-only objects
10. Improved closure implementation
96. Contributions: production status
The ten contributions above are each marked either as in production ("Prod") or as "In progress"; five of the ten are in progress.
99. Publications
Conferences and journals:
1. A Partial Read Barrier for Efficient Support of Live Object-oriented Programming, ISMM’15 (top conference)
2. Practical Validation of Bytecode to Bytecode JIT Compiler Dynamic Deoptimization, JOT’16
3. Mining Inline Cache Data to Order Inferred Types in Dynamic Languages, SCP’17
4. Sista: Saving Optimized Code in Snapshots for Fast Start-Up, ManLang’17 (core of the thesis)
Workshops:
5. Towards a flexible Pharo Compiler, IWST’13
6. A bytecode set for adaptive optimizations, IWST’14 (Best Paper Award)
7. Inferring Types by Mining Class Usage Frequency from Inline Caches, IWST’16
8. A low Overhead Per Object Write Barrier for the Cog VM, IWST’16
9. Accurate VM profiler for the Cog VM, IWST’17 (Best Paper Award)
103. Conclusion
(The Smalltalk runtime diagram from slide 43: Cogit, the baseline JIT, compiles v-functions to n-functions with machine-specific optimisations; Scorch, the optimising JIT, compiles non-optimised v-functions to optimised v-functions with Smalltalk-specific optimisations; virtual functions are persisted across start-ups, native functions are discarded on shut-down.)
1. Split optimising JIT architecture
2. Persistence of optimised functions
3. Metacircular JIT optimising itself
A Partial Read Barrier for Efficient Support of Live Object-oriented Programming, ISMM’15
Sista: Saving Optimized Code in Snapshots for Fast Start-Up, ManLang’17
A bytecode set for adaptive optimizations, IWST’14, Best Paper Award
Accurate VM profiler for the Cog VM, IWST’17, Best Paper Award