3. Introduction:
The Java language is made to be interpreted to achieve the critical goal of
application portability.
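[Figure: HW.java (Java source file, written in the Java language, e.g. public class HW { void hello() { ... } }) is compiled by javac into HW.class (class file, bytecode beginning ca fe ba be ...). The java launcher runs HW.class, together with other classes, on the Java Virtual Machine.]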
Just as microprocessors have instruction sets that define the operations they can perform, so does
the VM. Java source compiles into VM instructions in a format known as bytecode.
It is through the VM that executable bytecode Java classes are executed and ultimately
routed to appropriate native system calls.
Problem:
“A Java program executing within the VM is executed a bytecode at a time”
4. Problem (Contd.):
The conventional approach resulted in significantly lower performance when
compared to compiled languages like C/C++, because of the additional processor and memory usage
during interpretation.
As a result, slow and space-constrained computing devices have tended not to include
virtual computing technology(i.e. JVM).
Initiatives:
JSR-30 : J2ME CLDC (Connected Limited Device Configuration) Specification
Reference implementation of the J2ME CLDC (Connected Limited Device Configuration)
in April 1999, got approval in August 1999
Final public release of CLDC 1.0 in May 2003
The HotSpot engine was developed to address the perception that Java virtual
machine performance was insufficient for many mainstream applications.
By implementing a host of performance-enhancing techniques that went beyond
innovations like just-in-time (JIT) compilers, HotSpot increased the performance of the
Java virtual machine by an order of magnitude.
5. Just-In-Time (JIT)/Dynamic Compilation :
The Just-In-Time (JIT) compiler is a component of the Java Runtime
Environment. It improves the performance of Java applications by compiling bytecodes
to native machine code at run time.
[Figure: Just-In-Time (JIT) compiler. Bytecodes enter the JVM. Components shown: Intermediate Representation Generator, Optimizer, Code Generator, Profiler, GC, and Runtime.]
6. Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
JIT Compilation Strategies:
With a JIT compiler, Java programs are compiled into the native processor's instructions
one block of code at a time, as they execute, to achieve higher performance.
The process involves generating an internal representation of a method that's
different from bytecodes but at a higher level than the target processor's native
instructions.
The compiler performs optimization to improve quality and efficiency
and finally a code-generation step to translate the optimized internal representation to
the target processor's native instructions
To avoid the overhead of compiling and optimizing all of an application's classes at once,
a number of incremental compilation strategies have evolved.
The general strategy of only compiling the “hot” parts of an application will often result
in only a small percentage of an application being compiled, thus saving considerable
compilation time.
“A continuously operating sampling profiler identifies the program's hot regions for code
reoptimization”
“The JIT compiler operates on a compilation thread that's separate from the application
threads so that the application doesn't need to wait for a compilation to occur”
7. Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
A Java class that has been loaded into memory by the VM contains a V-table
(virtual table), which is a list of the addresses for all the methods in the class.
[Figure: V-table with entries Method - 1 to Method - 4, each pointing to the corresponding bytecode (Method 1 Bytecode to Method 4 Bytecode).]
Each address in the V-table points to the executable bytecode for the particular method
8. Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
When the JIT is loaded, each bytecode address in the V-table is replaced with the
address of the JIT compiler itself.
[Figure: V-table with entries Method - 1 to Method - 5, each pointing to the Just-In-Time Compiler.]
When the VM calls a method through the address in the V-table, the JIT compiler is
executed instead.
9. Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
The JIT compiler steps in and compiles the Java bytecode into native code and
then patches the native code address back to the V-table.
[Figure: V-table with entries Method - 1 to Method - 5 and the Just-In-Time Compiler. The entry for Method - 5 now points to Method 5 Native Code.]
From now on, each call to the method results in a call to the native version.
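The V-table patching described on the last three slides can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name VTableSketch, the use of IntUnaryOperator to stand in for a method body, and the compile() stand-in (which merely counts invocations instead of generating machine code) are all hypothetical and not taken from any real VM.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: a "V-table" whose slots start out pointing at a JIT stub.
class VTableSketch {
    final IntUnaryOperator[] slots;   // one entry per method
    int compilations = 0;             // how many times the stand-in JIT ran

    VTableSketch(IntUnaryOperator[] bytecodeBodies) {
        slots = new IntUnaryOperator[bytecodeBodies.length];
        for (int i = 0; i < slots.length; i++) {
            final int index = i;
            // Initially every slot points at the JIT stub, not at the method body.
            slots[i] = arg -> {
                IntUnaryOperator nativeCode = compile(bytecodeBodies[index]);
                slots[index] = nativeCode;          // patch the table entry
                return nativeCode.applyAsInt(arg);  // run the freshly compiled code
            };
        }
    }

    // Stand-in for real code generation: count the compilation and reuse the body.
    IntUnaryOperator compile(IntUnaryOperator body) {
        compilations++;
        return body;
    }

    int call(int methodIndex, int arg) {
        return slots[methodIndex].applyAsInt(arg);
    }

    public static void main(String[] args) {
        VTableSketch vm = new VTableSketch(new IntUnaryOperator[] { x -> x * x, x -> x + 1 });
        System.out.println(vm.call(0, 5));    // first call goes through the stub
        System.out.println(vm.call(0, 6));    // second call hits the patched entry
        System.out.println(vm.compilations);  // the method was compiled only once
    }
}
```

The point of the design is that callers never test whether a method is compiled; they always jump through the table, and the table entry decides what runs.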
10. JIT Design :
Challenges (Price of Platform neutrality):
The time it takes to compile the code is added to the program's running time.
JIT typically causes a slight delay in initial execution of an application, due to the
time taken to load and compile the bytecode.
Optimizations:
Modern JIT compilers take one of two approaches
1. Compile all the code but without performing any expensive analyses or transformations so that the
code is generated quickly.
2. Devote compilation resources to only a small number of methods that execute frequently.
Combine interpretation and JIT compilation. The application code is initially
interpreted, but the JVM monitors which sequences of bytecode are frequently
executed and translates them to machine code for direct execution on the hardware.
11. JIT Design (Contd.) :
There are four reasons why a JIT for the complete bytecode set was not implemented and
why the combined use of interpreter and JIT became unavoidable.
1. If thread context switching had to be performed while executing generated
native code, this would have added complexity to code generation, runtime support, and
the base VM code. By only performing context switching in the interpreter, no changes
were needed to the way thread scheduling was done in the VM.
2. The generated machine code would have needed to be more rigorous in the way it
dealt with error conditions and other exceptional conditions. As it is, the machine code
only needs to check for error conditions. When they occur, the error-handling bytecodes
can then be executed by the interpreter, which can deal with the details of how the
error should be processed.
3. A complete JIT would have required more complicated interactions between the
generated machine code and the virtual machine as a whole. For example, the generated
machine code could cause the compiler, class loader, garbage collector, or native code to
run. In retrospect some of these restrictions were not strictly necessary, but the system
probably has fewer undiscovered bugs, and it does not seem to have limited the
performance of the type of compute-intensive software that is the target of the design.
12. JIT Design (Contd.) :
4. A debugging technique (discussed below) was used which could not have been
employed so easily with a complete JIT.
Therefore the system was designed to allow execution to pass from the compiled
code to the interpreter at any time, and also for the interpreter to be able to return to
generated code in a timely fashion.
Additionally, to keep the interpreter from getting trapped in a long loop of
bytecodes it was necessary to be able to return to compiled code in the middle of a
method as well as at the start.
“The JIT lets the interpreter deal with complex tasks such as class loading, exception
handling, synchronization, garbage collection, etc.”
The basic interpreter loop is as follows:
Start:
Try to enter compiled code.
Interpret the next bytecode.
goto Start.
If the current method has not been compiled then checks are performed to determine if
it can be.
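The basic interpreter loop above can be sketched as follows. This is a minimal illustration, not the design's actual code: the two-opcode instruction set (INC, DBL), the class name MixedModeLoop, and the map of compiled fragments keyed by entry pc are assumptions made up for the example. A fragment updates the machine state itself and returns the pc at which control passes back to the interpreter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

class MixedModeLoop {
    // Hypothetical bytecode: INC adds 1 to the accumulator, DBL doubles it.
    static final int INC = 0, DBL = 1;

    int acc = 0;
    int interpreted = 0;   // number of bytecodes executed by the interpreter
    // Compiled fragments keyed by the pc at which they can be entered.
    final Map<Integer, IntUnaryOperator> compiled = new HashMap<>();

    void run(int[] code) {
        int pc = 0;
        while (pc < code.length) {
            IntUnaryOperator fragment = compiled.get(pc);
            if (fragment != null) {          // "Try to enter compiled code."
                pc = fragment.applyAsInt(pc);
                continue;                    // "goto Start."
            }
            switch (code[pc]) {              // "Interpret the next bytecode."
                case INC: acc += 1; break;
                case DBL: acc *= 2; break;
                default: throw new IllegalStateException("bad opcode " + code[pc]);
            }
            interpreted++;
            pc++;
        }
    }

    public static void main(String[] args) {
        MixedModeLoop vm = new MixedModeLoop();
        int[] code = { INC, INC, DBL, INC };
        // Pretend the JIT compiled the two bytecodes starting at pc 1 (INC, DBL).
        vm.compiled.put(1, pc -> { vm.acc = (vm.acc + 1) * 2; return pc + 2; });
        vm.run(code);
        System.out.println(vm.acc + " " + vm.interpreted);
    }
}
```

Because the lookup happens before every bytecode, compiled code can be entered in the middle of a method as well as at its start, which is the property the slide asks for.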
13. JIT Design (Contd.) :
Compilation may not be possible for one of the following reasons.
1. A native function was called.
2. The method has more than a certain number of parameters or local variables, or is unusually
large.
3. There is no available memory for more compiled code.
4. An object could not be created without running the garbage collector.
5. An operation was attempted that required a class to be initialized.
6. The start of an exception handler was reached.
7. An exception or error occurred. The interpreter always processes these.
8. A part of a method was reached for which no corresponding machine code could be
generated.
9. A function was called for which there was no compiled code.
10. A method return was executed but there was no compiled code to return to because the
code buffer had been flushed.
14. JIT Design (Contd.) :
1. The JVM interprets a method until its call count exceeds a JIT threshold.
2. After a method is compiled, its call count is reset to zero; subsequent calls to the
method continue to increment its count.
3. When the call count of a method reaches a JIT recompilation threshold, the JIT compiles it a
second time, this time applying a larger selection of optimizations than on the previous compilation
(because the method has proven to be a significant part of the whole program).
[Figure: V-table with entries Method - 1 to Method - 4 and their bytecode (Method 1 Bytecode to Method 4 Bytecode), alongside the Just-In-Time Compiler.]
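The three call-count steps on this slide can be sketched as a small state machine. The class name TieredCounter and both threshold values are assumptions chosen for illustration; real JVMs use their own, tunable values.

```java
import java.util.HashMap;
import java.util.Map;

class TieredCounter {
    enum Tier { INTERPRETED, COMPILED, RECOMPILED }

    // Both thresholds are made-up values for this sketch.
    static final int JIT_THRESHOLD = 10;
    static final int RECOMPILE_THRESHOLD = 100;

    private final Map<String, Integer> callCounts = new HashMap<>();
    private final Map<String, Tier> tiers = new HashMap<>();

    // Called on every invocation; returns the tier in which this call runs.
    Tier invoke(String method) {
        Tier tier = tiers.getOrDefault(method, Tier.INTERPRETED);
        int count = callCounts.merge(method, 1, Integer::sum);
        if (tier == Tier.INTERPRETED && count > JIT_THRESHOLD) {
            tier = Tier.COMPILED;                 // step 1: count exceeded the threshold
            tiers.put(method, tier);
            callCounts.put(method, 0);            // step 2: reset the count after compiling
        } else if (tier == Tier.COMPILED && count >= RECOMPILE_THRESHOLD) {
            tier = Tier.RECOMPILED;               // step 3: recompile with more optimizations
            tiers.put(method, tier);
        }
        return tier;
    }

    public static void main(String[] args) {
        TieredCounter jvm = new TieredCounter();
        Tier last = Tier.INTERPRETED;
        for (int call = 1; call <= 111; call++) {
            Tier now = jvm.invoke("hello");
            if (now != last) System.out.println("call " + call + " -> " + now);
            last = now;
        }
    }
}
```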
15. JIT Design (Contd.) :
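[Figure: With JIT=OFF, the JVM's interpreter executes the .class files. With JIT=ON and Threshold=10, a method is interpreted while times < 10; once times >= 10, the JIT compiles it to native code that runs on the operating system.]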
16. Dalvik JIT :
Dalvik Execution Environment:
1. Register-based architecture (register machine)
Stack-based machines (such as JVMs) must use instructions to load data onto the stack and
manipulate that data, and thus require more instructions than register machines.
2. Very compact representation
Java bytecode is converted into an alternate instruction set used by the Dalvik VM.
dx is a tool used to convert some (but not all) Java .class files into the .dex format.
3. Emphasis on code/data sharing to reduce memory usage
Multiple classes are included in a single .dex file.
4. Highly tuned, very fast Dalvik interpreter (about 2x faster than similar interpreters), good enough for most
applications.
For compute-intensive applications, the Native Development Kit was released to allow
Dalvik applications to call out to statically compiled (native) methods.
17. Dalvik JIT (Contd.):
The other part of the solution is the Dalvik JIT, which translates bytecode to
optimized native code at run time. There are two styles of JIT compiler:
1. Method Compiler
2. Trace Compiler
1. Method Compiler
- Most common model for server JITs
- Interprets with profiling to detect hot methods
- Compile & optimize method-sized chunks
- Strengths
• Larger optimization window
• Machine state sync with interpreter only at method call boundaries
- Weaknesses
• Cold code within hot methods gets compiled
• Much higher memory usage during compilation & optimization
• Longer delay between the point at which a method goes hot and the
point that a compiled and optimized method delivers benefits
18. Dalvik JIT (Contd.):
2.Trace Compiler
- Most common model for low-level code migration systems
- Interprets with profiling to identify hot execution paths
- Compiled fragments chained together in translation cache
- Strengths
• Only hottest of hot code is compiled, minimizing memory usage
• Tight integration with interpreter allows focus on common cases
• Very rapid return of performance boost once hotness detected
- Weaknesses
• Smaller optimization window limits peak gain
• More frequent state synchronization with interpreter
• Difficult to share translation cache across processes
19. Dalvik JIT (Contd.):
(Method Vs Trace):
Method JIT: best optimization window
Trace JIT: best speed/space tradeoff
Full program: 4,695,780 bytes
Hot methods: 396,230 bytes (8% of program)
Hot traces: roughly 103,000 bytes (26% of hot methods, 2% of program)
20. Dalvik JIT (Contd.):
The provisional decision was to start with a trace JIT for the following reasons:
• Minimizing memory usage critical for mobile devices
• Important to deliver performance boost quickly
- User might give up on new app if we wait too long to JIT
• Leave open the possibility of supplementing with method-based JIT
- The two styles can co-exist
- A mobile device looks more like a server when it’s plugged in
- Best of both worlds
• Trace JIT when running on battery
• Method JIT in background while charging
The Dalvik JIT can be considered as an extension of the Interpreter because it is the
Interpreter which profiles and triggers trace selection mode when a potential trace head
goes hot.
21. Dalvik JIT (Contd.):
Dalvik Trace JIT Flow:
[Figure: Dalvik trace JIT flow chart. Boxes: Start; Interpret until next potential trace head; Update profile count for this location; Interpret/build trace request; Submit compilation request; Compiler thread; Install new translation. Decisions (YES/NO): Translation threshold?; Xlation (translation) exists?. The translation cache holds translations, each with exits Exit 0 and Exit 1.]
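The trace JIT flow can be approximated by the sketch below. It is a simplification under stated assumptions, not Dalvik's implementation: the class name TraceProfiler, the threshold value, and the string stand-in for native code are hypothetical, the order of the two checks is a guess, and compilation is done synchronously instead of on a separate compiler thread.

```java
import java.util.HashMap;
import java.util.Map;

class TraceProfiler {
    static final int THRESHOLD = 3;   // hypothetical translation threshold

    final Map<Integer, Integer> profile = new HashMap<>();          // trace head pc -> count
    final Map<Integer, String> translationCache = new HashMap<>();  // trace head pc -> translation
    int compileRequests = 0;

    // Called by the interpreter each time it reaches a potential trace head.
    // Returns the translation to jump into, or null to keep interpreting.
    String atTraceHead(int pc) {
        String existing = translationCache.get(pc);
        if (existing != null) {
            return existing;                              // a translation exists: use it
        }
        int count = profile.merge(pc, 1, Integer::sum);   // update the profile count
        if (count < THRESHOLD) {
            return null;                                  // not hot yet: keep interpreting
        }
        // Hot: the interpreter builds a trace request while it executes this
        // visit, then the request is compiled and the translation installed.
        compileRequests++;
        translationCache.put(pc, "native-trace@" + pc);
        return null;                                      // this visit is still interpreted
    }

    public static void main(String[] args) {
        TraceProfiler jit = new TraceProfiler();
        for (int visit = 1; visit <= 5; visit++) {
            System.out.println("visit " + visit + ": " + jit.atTraceHead(8));
        }
    }
}
```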
22. Dalvik JIT (Contd.):
Features:
• Trace request is built during interpretation
- Allows access to actual run-time values
- Ensures that trace only includes byte codes that have successfully executed at
least once (useful for some optimizations)
• Trace requests handed off to compiler thread, which compiles and optimizes into native
code
• Compiled traces chained together in translation cache
• Per-process translation caches (sharing only within security sandboxes)
• Simple traces - generally 1 to 2 basic blocks long
• Local optimizations
- Register promotion
- Load/store elimination
- Redundant null-check elimination
- Heuristic scheduling
• Loop optimizations
- Simple loop detection
- Invariant code motion
- Induction variable optimization
23. JIT Compiler:
JIT Compiler Work Flow:
In order to execute bytecode, the JIT compiler goes through three stages.
1. Baseline: Generates code that is “obviously correct”.
The process involves generating an internal representation of the Java code that is
different from bytecode but at a higher level than the target processor's native
instructions (Intermediate Representation (IR)).
“IR allows more effective machine-specific optimizations”
2. Optimizing: Applies a set of optimizations to a class when it is loaded at run time.
3. Adaptive: Methods are first compiled with a non-optimizing compiler; “hot” methods are then
selected for recompilation based on run-time profiling information.
“A key part of the JIT design was to split the compilation process into two passes. The first pass
transforms the standard, stack-based bytecodes into a simple 3-address intermediate representation
in which all temporary statement results are placed into new local variables instead of entries on an
evaluation stack. The second pass converts this three-address form into native machine code.”
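The first pass described in the quotation can be sketched by simulating the operand stack with names instead of values. This is an illustrative sketch only: the class StackToThreeAddress, the textual instruction format, and the four supported opcodes are assumptions for the example and cover a tiny subset of real bytecode.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StackToThreeAddress {
    // Translates a tiny, hypothetical subset of stack bytecode into three-address text.
    static List<String> translate(String[] bytecode) {
        Deque<String> stack = new ArrayDeque<>();   // holds operand names, not values
        List<String> out = new ArrayList<>();
        int nextTemp = 1;
        for (String insn : bytecode) {
            String[] parts = insn.split(" ");
            switch (parts[0]) {
                case "iload":                        // push a variable name
                case "iconst":                       // push a literal
                    stack.push(parts[1]);
                    break;
                case "iadd":
                case "imul": {
                    String right = stack.pop();
                    String left = stack.pop();
                    String temp = "t" + nextTemp++;  // the result goes into a new local
                    String op = parts[0].equals("iadd") ? "+" : "*";
                    out.add(temp + " := " + left + " " + op + " " + right);
                    stack.push(temp);
                    break;
                }
                case "istore":
                    out.add(parts[1] + " := " + stack.pop());
                    break;
                default:
                    throw new IllegalArgumentException("unsupported: " + insn);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] code = { "iload x", "iconst 5", "iadd", "istore y" };
        for (String line : translate(code)) {
            System.out.println(line);
        }
    }
}
```

Pushes emit no code at all; only operations and stores produce three-address instructions, which is why the stack disappears from the output.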
24. Intermediate Representation:
An IR instruction is an N-tuple (an ordered list of N elements), consisting of an
operator and some number of operands.
“The Intermediate Representation is a machine- and language-independent
version of the original source code”
An Operator is the instruction to perform
Operands are used to represent Symbolic Register, Physical Registers,
Memory Locations, Constants, Branch targets, Method Signatures, Types etc
An IR code must be convenient to translate into real assembly code for all
desired target machines
25. Intermediate Representation (contd.):
Three Address Code (TAC or 3AC):
1. Three-address code is a form of intermediate representation (IR) used
by compilers to aid in the implementation of code-improving transformations.
2.Each instruction in three-address code can be described as a 4-tuple: (operator, operand1,
operand2, result) as shown.
result := operand1 operator operand2
such as
x := y + z
3.Expressions containing more than one fundamental operation, such as:
p=x+y*z
are not representable in three-address code as a single instruction.
Instead, they are decomposed into an equivalent series of instructions,
such as
t1 := y * z
p := x + t1
“The key features of three-address code are that every instruction implements exactly
one fundamental operation, and that the source and destination may refer to any
available register”
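The decomposition of p = x + y*z shown above can be reproduced with a short recursive lowering over an expression tree. The class TacLowering and its tiny Expr type are hypothetical helpers written for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TacLowering {
    // Minimal expression tree: either a leaf name or a binary operation.
    static final class Expr {
        final String name;
        final char op;
        final Expr left, right;

        Expr(String name) {
            this.name = name; this.op = 0; this.left = null; this.right = null;
        }

        Expr(Expr left, char op, Expr right) {
            this.name = null; this.op = op; this.left = left; this.right = right;
        }
    }

    final List<String> code = new ArrayList<>();
    private int nextTemp = 1;

    // Returns the name that holds the value of e, emitting instructions as needed.
    String lower(Expr e) {
        if (e.name != null) return e.name;
        String l = lower(e.left);
        String r = lower(e.right);
        String t = "t" + nextTemp++;
        code.add(t + " := " + l + " " + e.op + " " + r);
        return t;
    }

    // Emits target := e, using the target itself for the outermost operation.
    void assign(String target, Expr e) {
        if (e.name != null) {
            code.add(target + " := " + e.name);
            return;
        }
        String l = lower(e.left);
        String r = lower(e.right);
        code.add(target + " := " + l + " " + e.op + " " + r);
    }

    public static void main(String[] args) {
        TacLowering tac = new TacLowering();
        Expr yz = new Expr(new Expr("y"), '*', new Expr("z"));
        tac.assign("p", new Expr(new Expr("x"), '+', yz));
        for (String line : tac.code) {
            System.out.println(line);
        }
    }
}
```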
26. Intermediate Representation (contd.):
Static Single Assignment form (SSA):
1. A refinement of three-address code: a property of an intermediate
representation (IR) which requires that each variable is assigned exactly once.
2. Existing variables in the original IR are split into versions (new variables, typically written
in textbooks as the original name with a subscript), so that every definition gets its own
version.
Benefits (by example):
TAC:
    y := 1
    y := 2
    x := y
1. Humans can see that the first assignment is not necessary.
2. The value of y used in the third line comes from the second assignment of y.
A program would have to perform “reaching definition analysis” to discover these facts.
SSA:
    y1 := 1
    y2 := 2
    x := y2
With SSA, both 1 and 2 are immediate: y1 is never used, so omitting it will not affect
any other part of the code, and x is plainly defined from y2.
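For straight-line code, SSA renaming and the resulting "never used" check can be sketched as below. The sketch assumes copy-only instructions of the form target := source and has no control flow, so it needs no phi functions; the class SsaSketch and the liveOut parameter (names still needed after the fragment) are assumptions for illustration. Unlike the slide, it also gives x a version number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SsaSketch {
    // Each instruction is {target, source}; a source is a variable name or a literal.
    static List<String[]> toSsa(List<String[]> code) {
        Map<String, Integer> version = new HashMap<>();
        List<String[]> out = new ArrayList<>();
        for (String[] insn : code) {
            String src = insn[1];
            if (version.containsKey(src)) {
                src = src + version.get(src);             // use the latest version
            }
            int v = version.merge(insn[0], 1, Integer::sum);  // fresh version per definition
            out.add(new String[] { insn[0] + v, src });
        }
        return out;
    }

    // In SSA, a definition is dead if its name is never a source and is not live-out.
    static List<String> unusedDefinitions(List<String[]> ssa, Set<String> liveOut) {
        Set<String> used = new HashSet<>(liveOut);
        for (String[] insn : ssa) {
            used.add(insn[1]);
        }
        List<String> dead = new ArrayList<>();
        for (String[] insn : ssa) {
            if (!used.contains(insn[0])) dead.add(insn[0]);
        }
        return dead;
    }

    public static void main(String[] args) {
        List<String[]> code = Arrays.asList(
                new String[] { "y", "1" }, new String[] { "y", "2" }, new String[] { "x", "y" });
        List<String[]> ssa = toSsa(code);
        for (String[] insn : ssa) {
            System.out.println(insn[0] + " := " + insn[1]);
        }
        System.out.println("dead: " + unusedDefinitions(ssa, Collections.singleton("x1")));
    }
}
```

No reaching-definition analysis is needed: a single pass over the sources is enough to see that y1 has no use.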
27. Intermediate Representation (contd.):
3 levels of IR:
[Figure: bytecode -> HIR -> MIR -> LIR -> machine]
1. IRs that are close to a high-level language are called high-level IRs, and IRs that are close to
assembly are called low-level IRs.
2. A high-level IR might preserve things like array subscripts or field accesses whereas a low-level IR
converts those into explicit addresses and offsets.
Example (declaration float a[10][20], access a[i][j+2]):
Original:
    float a[10][20]
    a[i][j+2]
HIR:
    t1 = a[i, j+2]
MIR:
    t1 = j+2
    t2 = i*20
    t3 = t1+t2
    t4 = 4*t3
    t5 = addr a
    t6 = t5+t4
    t7 = *t6
LIR:
    r1 = [fp-4]
    r2 = r1+2
    r3 = [fp-8]
    r4 = r3*20
    r5 = r4+r2
    r6 = 4*r5
    r7 = fp-216
    f1 = [r7+r6]
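The MIR steps for a[i][j+2] can be mirrored in Java over a flattened array. This is a rough analogy only: Java has no raw addresses, so the sketch works in element units, and the byte scaling (4*t3), the base address (addr a), and the final load (*t6) are all folded into the array access. The class name FlatIndexing is made up for the example.

```java
class FlatIndexing {
    static final int ROWS = 10, COLS = 20;

    // Mirrors the MIR steps for a[i][j+2], in element units instead of bytes.
    static float load(float[] flat, int i, int j) {
        int t1 = j + 2;        // column index
        int t2 = i * COLS;     // offset of the start of row i
        int t3 = t1 + t2;      // element offset from the base of a
        return flat[t3];       // scaling, base address, and load happen here
    }

    public static void main(String[] args) {
        float[] flat = new float[ROWS * COLS];
        for (int k = 0; k < flat.length; k++) {
            flat[k] = k;       // each element stores its own offset
        }
        System.out.println(load(flat, 3, 5));
    }
}
```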
28. Intermediate Representation (contd.):
1. HIR (High-Level IR)
a) An IR that is close to the high-level language (operators similar to Java bytecode)
b) Usually preserves information such as loop structure and if-then-else
statements
c) Operates on symbolic registers instead of an implicit stack
HIR Generation:
Java Code (.java):
    class AdditionMethodTest {
        public static void main(String args[]) {
            int a = 3;
            int b = 4;
            int c = a + b;
            int d = getValue(c);
            return;
        } // End method main
        public static int getValue(int var) {
            return var * var;
        } // End method getValue
    }
Bytecode (.class):
    Method void main(java.lang.String[])
     0 iconst_3
     1 istore_1
     2 iconst_4
     3 istore_2
     4 iload_1
     5 iload_2
     6 iadd
     7 istore_3
     8 iload_3
     9 invokestatic #2 <Method int getValue(int)>
    12 istore 4
    14 return
    Method int getValue(int)
     0 iload_0
     1 iload_0
     2 imul
     3 ireturn
29. Intermediate Representation (contd.):
Conversion from Java bytecode to HIR:
The compiler that performs this conversion contains 2 parts.
1. The BC2IR algorithm, which translates bytecode to HIR and performs on-the-fly
optimizations during translation.
2. Additional optimizations performed on the HIR after translation.
BC2IR Translation:
1. Discovers extended basic blocks
2. Constructs an exception table for the method
3. Creates HIR instructions for bytecodes
4. Performs on-the-fly optimizations
a) Copy propagation
b) Constant propagation
c) Register renaming for local variables
d) Dead-code elimination
e) Inlining of short final or static methods
Note: Even though these optimizations are also performed in later phases, doing them here
reduces the size of the generated HIR and thus compile time.
30. Intermediate Representation (contd.):
Example of on-the-fly optimization, for the statement y = x + 5:
Java bytecode:
    iload x
    iconst 5
    iadd
    istore y
Generated IR (optimization off):
    INT_ADD tint, xint, 5
    INT_MOVE yint, tint
Generated IR (optimization on):
    INT_ADD yint, xint, 5
The effect of copy propagation can be seen here: the temporary tint and the move are eliminated.
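The folding of INT_ADD followed by INT_MOVE into a single instruction can be sketched as a small peephole pass. This is not the BC2IR algorithm itself: the class MoveFolding, the array-based instruction format, and the rule that names starting with "t" are single-use temporaries are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MoveFolding {
    // Instructions are {opcode, dest, operand...}, e.g. {"INT_ADD", "t", "x", "5"}.
    static List<String[]> fold(List<String[]> ir) {
        List<String[]> out = new ArrayList<>();
        for (String[] insn : ir) {
            String[] prev = out.isEmpty() ? null : out.get(out.size() - 1);
            boolean movesPrevResult = insn[0].equals("INT_MOVE") && prev != null
                    && prev[1].equals(insn[2])
                    && prev[1].startsWith("t");   // assumption: temporaries are used only once
            if (movesPrevResult) {
                String[] merged = prev.clone();
                merged[1] = insn[1];              // write straight into the move's destination
                out.set(out.size() - 1, merged);
            } else {
                out.add(insn);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> ir = Arrays.asList(
                new String[] { "INT_ADD", "t", "x", "5" },
                new String[] { "INT_MOVE", "y", "t" });
        for (String[] insn : fold(ir)) {
            System.out.println(String.join(" ", insn));
        }
    }
}
```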
31. Intermediate Representation (contd.):
The HIR generated for AdditionMethodTest.java:
********* START OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V
-13  LABEL0  Frequency: 0.0
-2   EG ir_prologue  l0i([Ljava/lang/String;,d) =
1    int_move  l1i(B) = 3
3    int_move  l2i(B) = 4
7    int_move  l3i(B) = 7
9    EG call  l5i(I) AF CF OF PF SF ZF = 66668, static"AdditionMethodTest.getValue (I)I", <unused>, 7
-3   return  <unused>
-1   bbend  BB0 (ENTRY)
********* END OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V
********* START OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I
-13  LABEL0  Frequency: 0.0
-2   EG ir_prologue  l0i(I,d) =
2    int_mul  t2i(I) = l0i(I,d), l0i(I,d)
3    int_move  t1i(I) = t2i(I)
-3   return  t1i(I)
-1   bbend  BB0 (ENTRY)
********* END OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I
32. Intermediate Representation (contd.):
Optimizations for HIR:
The following optimizers are provided for basic optimization.
1. CF  // Constant Folding
2. CPF // Constant Propagation and Folding (folding triggered by the propagation)
3. CSE // Common Sub-expression Elimination (within basic blocks)
4. DCE // Dead Code Elimination
5. GT  // Global Variable Temporalization (within basic block)
The optimizers CF and GT do not require data flow analysis; CPF, CSE,
and DCE require some results of data flow analysis.
A complete description is available at
http://www.coins-project.org/international/COINSdoc.en/hiropt/hiropt.html
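As an illustration of constant propagation and folding (CPF), the sketch below handles straight-line code only, where a single forward pass stands in for data flow analysis. It is not the COINS implementation; the class ConstantPropagationSketch and the array-based instruction format are assumptions for the example. Applied to a = 3, b = 4, c = a + b it yields c = 7, the same folding visible in the HIR dump for AdditionMethodTest.main.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConstantPropagationSketch {
    // Instructions: {dest, a} for a copy, or {dest, a, op, b} for a binary operation.
    static List<String> run(List<String[]> code) {
        Map<String, Integer> known = new HashMap<>();   // variables with a known constant value
        List<String> out = new ArrayList<>();
        for (String[] insn : code) {
            String[] rhs = Arrays.copyOfRange(insn, 1, insn.length);
            // Propagation: replace operands (even positions) whose value is known.
            for (int i = 0; i < rhs.length; i += 2) {
                Integer v = known.get(rhs[i]);
                if (v != null) rhs[i] = v.toString();
            }
            Integer result = fold(rhs);
            if (result != null) {
                known.put(insn[0], result);
                out.add(insn[0] + " := " + result);
            } else {
                known.remove(insn[0]);                  // dest no longer holds a known constant
                out.add(insn[0] + " := " + String.join(" ", rhs));
            }
        }
        return out;
    }

    // Folding: evaluate the right-hand side if every operand is a literal.
    static Integer fold(String[] rhs) {
        Integer a = literal(rhs[0]);
        if (rhs.length == 1) return a;
        Integer b = literal(rhs[2]);
        if (a == null || b == null) return null;
        switch (rhs[1]) {
            case "+": return a + b;
            case "*": return a * b;
            default:  return null;                      // unknown operator: leave unfolded
        }
    }

    static Integer literal(String s) {
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        List<String[]> code = Arrays.asList(
                new String[] { "a", "3" },
                new String[] { "b", "4" },
                new String[] { "c", "a", "+", "b" },
                new String[] { "d", "c", "*", "n" });
        for (String line : run(code)) {
            System.out.println(line);
        }
    }
}
```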
33. Intermediate Representation (contd.):
2. Medium-Level IRs (MIR)
a) Support a range of features in a set of source languages, but in a language-independent way.
b) A good basis for generating efficient machine code for one or more
architectures.
Example: register transfer languages
3. Low-Level IRs (LIR)
a) Almost one-to-one correspondence to target-machine instructions; quite
architecture-dependent.
<MIR & LIR to be added>
34. Optimization Techniques:
Why Optimization:
1. Programmers do not always write optimal code.
a) For example, opportunities to improve code are not always recognized
(e.g. moving loop-invariant code out of loops, avoiding re-computation of the same
expression).
2. A high-level language may not allow a programmer to avoid redundant
computation (or may make it inconvenient). For example, the address of a[i][j] is computed twice in
a[i][j] = a[i][j] + 1
3. The programmer should not be bothered with the target machine architecture.
Moreover, modern machine architectures assume optimization; it has become hard to
optimize by hand.
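Loop-invariant code motion, mentioned in point 1, can be shown with a before/after pair. The second method is a hand-written picture of what an optimizer could produce, not the output of any particular compiler; the class and method names are made up for the example.

```java
class HoistingSketch {
    // Straightforward version: scale * offset is recomputed on every iteration.
    static long naive(int[] data, int scale, int offset) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] + scale * offset;
        }
        return sum;
    }

    // After hoisting: the loop-invariant product is computed once, before the loop.
    static long hoisted(int[] data, int scale, int offset) {
        long sum = 0;
        int invariant = scale * offset;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] + invariant;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = { 1, 2, 3 };
        System.out.println(naive(data, 4, 5) + " " + hoisted(data, 4, 5));
    }
}
```

Both methods have the same input/output behavior, which is the condition the definition of optimization on this slide requires.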
Goal:
Let programmers write clean, high-level source code while still producing programs that approach
assembly-code performance.
Optimization: the transformation of a program P into a program P´ that has the same input/output
behavior, but is somehow “better”. Better might mean:
• faster, or
• smaller, or
• uses less power, or
• whatever you care about
P´ is not necessarily optimal, and may even be worse than P.