1
WWW.BELL-SW.COM
2018
Java on Arm: theory, applications and workloads
Aleksei Voitylov, Dmitry Chuyko
2
Who we are
Aleksei Voitylov
@AVoitylov
Ex-employers:
http://bell-sw.com
Liberica – supported OpenJDK binaries
3
Who we are
Dmitry Chuyko
@dchuyko
http://bell-sw.com
Liberica – supported OpenJDK binaries
Ex-employers
4
Committed to freedom
http://bell-sw.com
Liberica – supported OpenJDK binaries
[Bar chart: External contributions to OpenJDK jdk/jdk, Aug '17 - Aug '18; axis: patches, 0 to 200]
Contributors shown: Red Hat, SAP, Google, BellSoft, SUNY Oswego, IBM, NTT, Intel, ARM, Qualcomm, Linaro, Amazon, JetBrains, Loongson, Eldorado, Azul, Alibaba, AMD, Cavium, SuSE, Twitter
*Note: Oracle contributed ~3965 patches in the same period
5
A two-character play
• DevOps – in charge of IT procurement, big Raspberry Pi fan.
• Software engineer – submitted a request to procure Arm servers for a Java-based project.
6
What do we know about Arm?
• Arm = Advanced RISC Machine (originally Acorn RISC Machine)
• First ARM processor in 1985 (Arm Ltd. founded in 1990)
• UK, Cambridge
• ARM is a RISC architecture
• 30 billion processors shipped in 2013
• Plans to ship 100 billion processors by 2020
9
IoT Gateways
SuperMicro, Dell, Eurotech, Advantech
Liberica JDK
10
But Servers?
11
Arm: architecture, profile, implementation
[Diagram: Cortex cores plotted against timeline and performance & capabilities]
Cortex-M: M0, M0+, M1, M3, M4
Cortex-R: R4, R5, R7, R52
Cortex-A: A5, A7, A8, A9, A15, A53, A57
• ARM v7
• Architecture profiles
• v7-M (Embedded)
• v7-R (Real-Time)
• v7-A (Application)
• ARM v8
• Architecture profiles
• v8-M (Embedded)
• v8-R (Real-Time)
• v8-A (Application)
12
Arm: big.LITTLE
[Diagram: a "big" cluster of Cortex-A57 CPUs with its own L2 cache (performance on demand) and a "LITTLE" cluster of Cortex-A53 CPUs with its own L2 cache (always connected), sharing Interrupt Control and a Cache Coherent Interconnect]
13
DIY
14
OpenJDK Arm32 port
• Available since OpenJDK 9
• Minimal VM, Client VM, Server VM
• Works on the Raspberry Pi
• jlink + jdeps
• Lets you create a smaller runtime (as small as 16 MB)
• JavaFX Embedded
• Lets you build a fancy UI for the Raspberry Pi
• EGL/DFB acceleration
• Touch screen support
15
Minimal VM
• Optimized for footprint, rather than functionality
• Serial GC
• C1 JIT compiler
• No JDWP support
• No JMX support
• But… it is < 4 MB!
• Linux x86_64 Server VM: 23 MB
• jlink @since jdk9
• java.base with Minimal VM under 16 MB!
• Modules for jetty: under 32 MB
16
ARMv8-A Specification
ARMv8-A (Dec 2011)
- 64 & 32-bit
- 31 GPRs
- SIMD (NEON)
- AES, SHA
ARMv8.1-A (Jan 2014)
- New Atomics
- CRC32
ARMv8.2-A (Jan 2016)
- Optional SVE (128-2048 bits)
- Dot Product SIMD
- Half-precision FP
ARMv8.3-A (Oct 2016)
- Complex FP SIMD
- Nested virtualization
ARMv8.4-A (2018)
- SHA3, SHA512
- SM3, SM4
17
Arm architecture licensees
18
Ampere Computing (ex APM)
Up to 32 cores
Up to 32 threads
8 DDR Channels
32 MB L3
19
Cavium/Marvell ThunderX2
32 cores/128 threads
32 MB L3
8 DDR Channels/socket
Multi-socket
Up to 4 TB RAM
20
Cavium/Marvell ThunderX2
That thing is real!
21
Wait, how many threads?
22
Arm Software ecosystem
Check whether it works on Arm: https://worksonarm.com
23
OpenJDK ARM ports
• ARM (32 bit & 64 bit)
– Full Java SE Spec
– ARM v6/v7/v8
– C1 & C2
• AARCH64 (64 bit only)
– Full Java SE Spec
– C1 & C2
– G1 / Parallel GC / Shenandoah (and ZGC is coming)
– AppCDS, JFR, NMT, AOT
24
Intrinsics
Intrinsic:
“a function (subroutine) available for use in a given programming language whose implementation is handled specially by the compiler.”
25
What will C2 do with math Java code?
java.lang.Math:
/**
 * Returns as a {@code long} the most significant 64 bits of the
 * 128-bit product of two 64-bit factors.
 * @since 9
 */
public static long multiplyHigh(long x, long y) {
    if (x < 0 || y < 0) {
        long x1 = x >> 32;
        long x2 = x & 0xFFFFFFFFL;
        long y1 = y >> 32;
        long y2 = y & 0xFFFFFFFFL;
        long z2 = x2 * y2;
        long t = x1 * y2 + (z2 >>> 32);
        long z1 = t & 0xFFFFFFFFL;
        long z0 = t >> 32;
        z1 += x2 * y1;
        return x1 * y1 + z0 + (z1 >> 32);
    } else { …
26
What will C2 do with math Java code?
java.lang.Math:
/**
 * Returns as a {@code long} the most significant 64 bits of the 128-bit
 * product of two 64-bit factors.
 * @since 9
 */
public static long multiplyHigh(long x, long y) {
    // Use technique from section 8-2 of Henry S. Warren, Jr.,
    // Hacker's Delight (2nd ed.) (Addison Wesley, 2013), 173-174.
    ...
    // Use Karatsuba technique with two base 2^32 digits.
    ...
    return ...;
}
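One way to sanity-check the portable code on these slides is to compare it with a 128-bit reference. The sketch below is not from the talk: `MultiplyHighSketch`, `multiplyHighPortable` and `multiplyHighReference` are hypothetical names. The portable method copies the negative-operand branch shown on the previous slide, which follows the Hacker's Delight signed multiply-high routine and so should handle any signed inputs; the reference goes through `BigInteger`. It assumes Java 9 or later for `Math.multiplyHigh`.

```java
import java.math.BigInteger;
import java.util.Random;

final class MultiplyHighSketch {
    /** Portable signed multiply-high, copied from the branch shown on the slide. */
    static long multiplyHighPortable(long x, long y) {
        long x1 = x >> 32;           // signed high half of x
        long x2 = x & 0xFFFFFFFFL;   // unsigned low half of x
        long y1 = y >> 32;
        long y2 = y & 0xFFFFFFFFL;
        long z2 = x2 * y2;
        long t = x1 * y2 + (z2 >>> 32);
        long z1 = t & 0xFFFFFFFFL;
        long z0 = t >> 32;
        z1 += x2 * y1;
        return x1 * y1 + z0 + (z1 >> 32);
    }

    /** Reference: full 128-bit product, then keep the upper 64 bits. */
    static long multiplyHighReference(long x, long y) {
        return BigInteger.valueOf(x)
                .multiply(BigInteger.valueOf(y))
                .shiftRight(64)      // arithmetic shift, keeps the sign
                .longValue();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed, so runs are repeatable
        for (int i = 0; i < 100_000; i++) {
            long x = rnd.nextLong();
            long y = rnd.nextLong();
            long expected = multiplyHighReference(x, y);
            if (multiplyHighPortable(x, y) != expected
                    || Math.multiplyHigh(x, y) != expected) {
                throw new AssertionError("mismatch for " + x + " * " + y);
            }
        }
        System.out.println("portable, BigInteger and Math.multiplyHigh agree");
    }
}
```

A check like this says nothing about speed; it only confirms that a replacement such as an intrinsic has to return the same values as the Java code it substitutes.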
27
What does C2 make of this?
Math code in assembly
14 operations with latency 1
28
Can we make it faster?
• Rewrite as a C + JNI call
• Well, it will be slower
• Tune HotSpot to optimize IR for this code better*
• Even if this is possible, this might lead to regressions
• Tune HotSpot to detect this method and substitute optimal code instead
SMULH Xd, Xn, Xm (cost: 4)
“Signed multiply high”
29
C2 Intrinsic How-to
1) Add the SMULH instruction to ${arch}/assembler_${arch}.hpp
2) Describe a node with this instruction and its cost in ${arch}.ad
3) Mark this method as intrinsic in share/classfile/vmSymbols.hpp
4) Substitute the method with the node
bool LibraryCallKit::inline_math_multiplyHigh() {
  // A long occupies two argument slots, so y starts at slot 2.
  set_result(_gvn.transform(new MulHiLNode(argument(0), argument(2))));
  return true;
}
5) Annotate j.l.Math.multiplyHigh() with @HotSpotIntrinsicCandidate
6) Measure performance
30
Benchmarking (throughput)
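import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.OperationsPerInvocation;

public class MultiplyHighJMHBench {
    @Benchmark
    @OperationsPerInvocation(10000)
    public long bench() {
        // Seed from the clock so the JIT cannot constant-fold the operands.
        long op = System.currentTimeMillis();
        long accum = 0;
        for (int i = 0; i < 10000; i++) {
            accum += Math.multiplyHigh(op + i, op + i);
        }
        // Returning the sum keeps the loop from being removed as dead code.
        return accum;
    }
}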
Good for JDK 11!
SMULH cost: 4
31
Let’s do something useful for enterprise apps
• What does a JVM do when executing a typical enterprise program?
– Creates, copies objects, strings, arrays, frees memory
– Searches and compares objects, strings, arrays
– Checks that the right information is received
32
String s = new String("Can this work faster?");
• Compact Strings @since JDK 9
– Most strings do not require UTF-16 as inner representation
– Inner representation of strings:
• char[] -> byte[], coder
• Either ISO-8859-1/Latin-1
• Or UTF-16 if required
S t r i n g
С т р о к а
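To make the Latin-1 versus UTF-16 choice concrete, here is a minimal sketch, assuming the rule is simply "every char fits in one byte". It does not look at the real `coder` field of `java.lang.String`, which is private; `CompactStringsSketch`, `fitsLatin1` and `payloadBytes` are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

final class CompactStringsSketch {
    /** True if every char is at most 0xFF, so one byte per char is enough. */
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Payload size in bytes under the two encodings listed on the slide. */
    static int payloadBytes(String s) {
        return fitsLatin1(s)
                ? s.getBytes(StandardCharsets.ISO_8859_1).length  // 1 byte per char
                : s.getBytes(StandardCharsets.UTF_16LE).length;   // 2 bytes per char
    }

    public static void main(String[] args) {
        // The second literal is the Cyrillic word from the slide, written with escapes.
        String[] samples = {"String", "\u0421\u0442\u0440\u043e\u043a\u0430"};
        for (String s : samples) {
            System.out.println(s + ": latin1=" + fitsLatin1(s)
                    + ", bytes=" + payloadBytes(s));
        }
    }
}
```

Both sample words have six characters, but only the first one can use the one-byte representation.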
33
1001 Heap Dump
• Log-normal distribution
• < 0.3% of all strings are not Latin-1
• 18% of strings < 8 characters
• 66% of strings < 32 characters
• 95% of strings < 128 characters
Any change meant to improve the current state of things should not cause regressions on this dataset
[Chart: string length distribution; x-axis: string length, 0 to 120; y-axis: 0 to 0.06]
34
String s = new String("Can this work faster?");
new String(…)
StringDecoder.decode(), decodeASCII(), decodeLatin1(), decodeUTF8()
StringCoding.decode()
hasNegatives()
if (!hasNegatives()) {
    // ASCII fast path
}
35
StringCoding.hasNegatives()
@HotSpotIntrinsicCandidate
public static boolean hasNegatives(byte[] ba, int off, int len) {
    for (int i = off; i < off + len; i++) {
        if (ba[i] < 0) {
            return true;
        }
    }
    return false;
}
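The ASCII fast path from the previous slide can be sketched in plain Java. This is an illustration under assumptions, not the JDK's decoder: `AsciiFastPathSketch` and `decodeUtf8` are hypothetical names, and the input is assumed to be UTF-8. When no byte has its top bit set, the data is pure ASCII, so each byte maps to exactly one char and the one-byte charset can decode it.

```java
import java.nio.charset.StandardCharsets;

final class AsciiFastPathSketch {
    /** Same loop as on the slide: a set top bit makes the byte negative. */
    static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

    /** Hypothetical decoder for UTF-8 input with a cheap path for pure ASCII. */
    static String decodeUtf8(byte[] bytes) {
        if (!hasNegatives(bytes, 0, bytes.length)) {
            // Pure ASCII: one byte per char, no multi-byte sequences to parse.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
        // General path: multi-byte sequences are possible.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ascii = "Can this work faster?".getBytes(StandardCharsets.UTF_8);
        byte[] cyrillic = "\u0421\u0442\u0440\u043e\u043a\u0430".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasNegatives(ascii, 0, ascii.length) + " " + decodeUtf8(ascii));
        System.out.println(hasNegatives(cyrillic, 0, cyrillic.length) + " " + decodeUtf8(cyrillic));
    }
}
```

Since the check runs on every such decode, its cost shows up in string creation, which is why it is a candidate for an intrinsic.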
36
Some ARM assembly – memory reads
Instruction  Register  Width (bits)  Latency (cycles)
LDRB         GPR       8             4
LDRH         GPR       16            4
LDR          GPR       32 or 64      4
LDP          GPR       64+64         5
37
Learning to read (again)
LDP LDP
LDP LDR LDRH
LDP LDR LDR
LDRB
SEGFAULT
38
And compare 8 bytes at a time with 0 (one sign bit per byte)
const uint64_t UPPER_BIT_MASK = 0x8080808080808080;
...
__ tst(rscratch2, UPPER_BIT_MASK);
for (int i = off; i < off + len; i++) {
    if (ba[i] < 0) {
        return true;
    }
}
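The same mask trick can be tried in Java without writing assembly. The sketch below is an assumption-laden analogue, not the intrinsic itself: `SwarHasNegativesSketch` is a hypothetical name, and `ByteBuffer.getLong()` stands in for a 64-bit load. One AND against `UPPER_BIT_MASK` tests the sign bits of eight bytes at once; leftover bytes are checked one by one.

```java
import java.nio.ByteBuffer;

final class SwarHasNegativesSketch {
    /** The top bit of each of the eight bytes in a 64-bit word. */
    static final long UPPER_BIT_MASK = 0x8080808080808080L;

    static boolean hasNegatives(byte[] ba, int off, int len) {
        // The buffer's position starts at off and its limit is off + len.
        ByteBuffer buf = ByteBuffer.wrap(ba, off, len);
        // Main loop: eight bytes per iteration. Byte order does not matter,
        // because the mask treats every byte position the same way.
        while (buf.remaining() >= Long.BYTES) {
            if ((buf.getLong() & UPPER_BIT_MASK) != 0) {
                return true;
            }
        }
        // Tail: fewer than eight bytes left.
        while (buf.hasRemaining()) {
            if (buf.get() < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = new byte[20];
        System.out.println(hasNegatives(data, 0, data.length));
        data[13] = (byte) 0x80;
        System.out.println(hasNegatives(data, 0, data.length));
    }
}
```

Whether this beats the simple loop on a given JVM is something to measure rather than assume; the point is only to show what the `tst` against the mask computes.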
39
Aligned memory access
x86:
- in most cases modern processors do not have a penalty for unaligned memory access
ARM is a spec:
- some CPU manufacturers do not have a penalty
- others do (20%, 50%, 100%)
40
How to align memory access
LDR LDP
// pre-loop
__ ldr();
…
__ tst(…, UPPER_BIT_MASK);
// main loop
__ ldp(); //aligned
…
__ tst(…, UPPER_BIT_MASK);
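The arithmetic behind the pre-loop can be written down separately. This is a sketch under assumptions: the addresses are made-up numbers, a 16-byte alignment is assumed because LDP reads 64+64 bits, and `AlignmentSketch` and `bytesUntilAligned` are hypothetical names. The pre-loop handles the bytes before the next aligned address, after which every main-loop load is aligned.

```java
final class AlignmentSketch {
    /**
     * Number of bytes the pre-loop has to consume before addr is aligned.
     * The alignment must be a power of two.
     */
    static long bytesUntilAligned(long addr, int alignment) {
        long misalignment = addr & (alignment - 1);
        // The outer mask turns "alignment" into 0 for already-aligned addresses.
        return (alignment - misalignment) & (alignment - 1);
    }

    public static void main(String[] args) {
        long[] addresses = {0x1000L, 0x1003L, 0x100FL}; // hypothetical addresses
        for (long addr : addresses) {
            System.out.println("0x" + Long.toHexString(addr)
                    + " -> pre-loop bytes: " + bytesUntilAligned(addr, 16));
        }
    }
}
```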
41
The plan for hasNegatives() intrinsic
• Read as many bytes at a time as possible, without crossing page boundaries
• If the page boundary is close
• Read fewer bytes
• Shift to the left
• Compare as many bytes with 0 as possible at a time
• Align memory access
• Reality
• The code gets too big – 200 instructions
• This interferes with inlining: C2 inlines up to 1500 instructions
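The page-boundary rule in the plan can be illustrated with a little address arithmetic. The sketch assumes a 4096-byte page and uses made-up addresses; `PageBoundarySketch`, `crossesPage` and `widestSafeLoad` are hypothetical names. The candidate widths 16, 8, 4, 2 and 1 bytes correspond to the LDP, LDR, LDRH and LDRB loads from the earlier table.

```java
final class PageBoundarySketch {
    /** Assumed page size; real systems may use other sizes. */
    static final int PAGE_SIZE = 4096;

    /** True if a load of `width` bytes at addr would touch the next page. */
    static boolean crossesPage(long addr, int width) {
        long offsetInPage = addr & (PAGE_SIZE - 1);
        return offsetInPage + width > PAGE_SIZE;
    }

    /** Widest load (16, 8, 4, 2 or 1 bytes) that stays inside the current page. */
    static int widestSafeLoad(long addr) {
        for (int width = 16; width > 1; width >>= 1) {
            if (!crossesPage(addr, width)) {
                return width;
            }
        }
        return 1; // a single byte never crosses a page
    }

    public static void main(String[] args) {
        long pageEnd = 0x20000L; // hypothetical page-aligned address
        for (int back : new int[] {32, 8, 3, 1}) {
            long addr = pageEnd - back;
            System.out.println(back + " bytes before the page end: widest load "
                    + widestSafeLoad(addr));
        }
    }
}
```

Checks like these are part of why the full intrinsic grows large: each narrower load near the boundary needs its own code path.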
42
Code is too big – what do we do?
if (len > 32)
    return stubHasNegatives(ba, 0, len);
for (int i = 0; i < 32; i++) {
    if (ba[i] < 0) { // ldr, tst
        return true;
    }
}
return stubHasNegatives(ba, 32, len); // ldp, tst
• ARM ASM pseudo-code in Java that is short (27 instructions)
• Not optimal, unaligned, but short
• The rest of the code goes into a stub
43
What is a stub?
• A kind of inline assembly in HotSpot
• The closest analogy is a function
• Can be called from macroAssembler
• Code gets loaded once, during JVM startup
• Does not get inlined
• Several entry points are possible
• Calling a stub carries some performance penalty
44
What should we place in stub?
// align memory access
__ bind(LARGE_LOOP); // 64 bytes at a time
4x __ ldp(); //ary1, ary1+16, ary1+32, ary1+48
__ add(ary1, ary1, large_loop_size);
__ sub(len, len, large_loop_size);
7x __ orr(…);
__ tst(tmp2, UPPER_BIT_MASK);
__ br(Assembler::NE, RET_TRUE);
__ cmp(len, large_loop_size);
__ br(Assembler::GE, LARGE_LOOP);
OK, we helped C2. Can we help the hardware?
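The shape of the large loop can be mirrored in Java to see why it needs only one test per 64 bytes. This is a rough analogue under assumptions, not the stub: `LargeLoopSketch` is a hypothetical name, eight `getLong()` calls stand in for the four LDP loads, and the words are OR-ed into an accumulator (the stub uses seven `orr` instructions on registers) before a single mask test. Alignment and prefetching are left out.

```java
import java.nio.ByteBuffer;

final class LargeLoopSketch {
    static final long UPPER_BIT_MASK = 0x8080808080808080L;
    static final int LARGE_LOOP_SIZE = 64; // bytes handled per iteration

    static boolean hasNegatives(byte[] ba, int off, int len) {
        ByteBuffer buf = ByteBuffer.wrap(ba, off, len);
        while (buf.remaining() >= LARGE_LOOP_SIZE) {
            long acc = 0;
            // Eight 64-bit words; any set sign bit survives the OR.
            for (int i = 0; i < LARGE_LOOP_SIZE / Long.BYTES; i++) {
                acc |= buf.getLong();
            }
            // One test covers all 64 bytes of this iteration.
            if ((acc & UPPER_BIT_MASK) != 0) {
                return true;
            }
        }
        // Tail: fewer than 64 bytes left, checked byte by byte for simplicity.
        while (buf.hasRemaining()) {
            if (buf.get() < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = new byte[200];
        System.out.println(hasNegatives(data, 0, data.length));
        data[70] = (byte) -128;
        System.out.println(hasNegatives(data, 0, data.length));
    }
}
```

The trade-off is that a negative byte early in a block is only noticed after the whole block has been read, which is acceptable when most strings contain no such byte.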
45
Software Prefetching
Let’s give the processor a hint about where we are going to read from memory next:
__ prfm(Address(ary1, SoftwarePrefetchHintDistance));
// do register-only operations or operations on data already in cache
__ ldp();
• Can be a major performance gain if
• The processor has enough data to process between prfm and the memory load
• SoftwarePrefetchHintDistance is correctly defined: > d_cache_line_size
46
Benchmark for new String() – long strings
[Chart: speedup compared to C2 (times, axis 0 to 6) by number of symbols: 2, 8, 16, 32, 256, 1024, 16384]
Speedup up to 5x!
Longer strings gain more from the optimization due to
• Optimal ldp & tst use
• Prefetching
47
Benchmark for new String() – results
[Chart: improvement over C2 (times, axis 1 to 3) and length probability (axis 0 to 0.06) for string lengths 1 to 31]
48
Let’s have a JEP, darling!
49
Performance improvement
• Speedup up to 78x in microbenchmarks
[Bar chart: average performance improvement*, times (axis 1 to 3.5), for:]
java.lang.Math.log()
java.lang.Math.sin()
java.lang.Math.cos()
java.lang.String.new String()
java.lang.String.compareTo()
java.lang.StringUTF16.compress()
java.lang.StringLatin1.inflate()
java.lang.String.indexOf()
java.util.zip.CRC32.update()
java.util.Arrays.equals()
* mean improvement over different sizes, lengths, and encodings
50
JVM Benchmark #1 results
[Bar chart: SPECjbb2015 composite score (jOPS, axis 0 to 70000), Max-jOPS and Critical-jOPS, Xeon Gold 6140 vs. ThunderX2 CN9975]
ARMv8: -Xmx24G -Xms24G -Xmn16G -XX:+AlwaysPreTouch -XX:+UseParallelGC -XX:+UseTransparentHugePages -XX:-UseBiasedLocking
X86: -Xmx24G -Xms24G -Xmn16G -XX:+AlwaysPreTouch -XX:+UseParallelGC -XX:+UseTransparentHugePages -XX:+UseBiasedLocking
• Liberica JDK 11
• Average over 20 runs
• JEP 315 in JDK 11
• Cavium ThunderX2 outperforms Xeon 6140
– by 33% in Max-jOPS score
– by 16% in Critical-jOPS score
51
JVM Benchmark #2 results
• Liberica JDK 11
• Default JVM settings
• Average over 20 runs
• ThunderX2 outperforms Xeon 6140
– by 62% in Crypto
– by 42% in MpegAudio
– by 29% in XML
– by 12% in Compress
• Xeon 6140 outperforms ThunderX2
– by 29% in scimark.small
[Bar chart: SPECjvm2008 score (ops/m, axis 0 to 3500), Xeon Gold 6140 vs. ThunderX2 CN9975, for composite, compress, crypto, derby, mpegaudio, scimark.large, scimark.small, serial, sunflow, xml]
52
Where to try ARM servers?
Bare Metal VPS
53
DEMO
54
Conclusions
• Arm server vendors did a great job
• Cloud providers offer access to Arm servers right now
• Ubuntu, Red Hat, Oracle Linux, SuSE have ARMv8 support
• The software ecosystem just works as expected on ARMv8
• OpenJDK 11 is optimized for ARMv8
Download and install Liberica for ARMv8
55

Java on arm theory, applications, and workloads [dev5048]

  • 1. 1 WWW.BELL-SW.COM WWW.BELL-SW.COM 2018Java on Arm: theory, applications and workloads Aleksei Voitylov, Dmitry Chuyko
  • 2. 2 WWW.BELL-SW.COM Who we are Aleksei Voitylov @AVoitylov Ex-employers: http://bell-sw.com Liberica – supported OpenJDK binaries
  • 3. 3 WWW.BELL-SW.COM Who we are Dmitry Chuyko @dchuyko http://bell-sw.com Liberica– supported OpenJDK binaries Ex-employers
  • 4. 4 WWW.BELL-SW.COM Committed to freedom http://bell-sw.com Liberica – supported OpenJDK binaries 0 50 100 150 200 Red Hat SAP Google BellSoft SUNY Oswego IBM NTT Intel ARM Qualcomm Linaro Amazon JetBrains Longsoon Eldorado Azul Alibaba AMD Cavium SuSE Twitter External contributions to OpenJDK jdk/jdk Aug '17 - Aug '18 *Note: Oracle contributed ~3965 patches in the same period
  • 5. 5 WWW.BELL-SW.COM Two character play • DevOps – in charge of IT procurement, big Raspberry Pi fan. • Software engineer – submitted a request to procure Arm servers for a Java-based project.
  • 6. 6 WWW.BELL-SW.COM What do we know about Arm? • Arm = Advanced RISC Machine/Acorn RISC Machine • Founded in 1985 • UK, Cambridge • ARM is a RISC architecture • 30 billion processors shipped in 2013 • Plans to ship 100 billion processors by 2020
  • 11. 11 WWW.BELL-SW.COM Arm: architecture, profile, implementation Timeline Performance & capabilities Cortex-M3 Cortex-M1 Cortex-M0 Cortex-M0+ Cortex-M4 Cortex-R4 Cortex-R5 Cortex-R7 Cortex-A8 Cortex-A5 Cortex-A7 Cortex-A53 Cortex-A57 Cortex-A15 Cortex-A9 • ARM v7 • Architecture profiles • v7-M (Embedded) • V7-R (Real-Time) • V7-A (Application) • ARM v8 • Architecture profiles • v8-M (Embedded) • V8-R (Real-Time) • V8-A (Application) Cortex-R52
  • 12. 12 WWW.BELL-SW.COM Arm: big.LITTLE Cache Coherent Interconnect Interrupt Control CPU CPU L2 Cache Cortex-A57 CPU L2 Cache Cortex-A53 CPUBIG LITTLE Performance on-demand Always connected
  • 14. 14 WWW.BELL-SW.COM OpenJDK Arm32 port • Available since OpenJDK 9 • Minimal VM, Client VM, Server VM • Works on the Raspberry Pi • jlink + jdeps • Allows to create a smaller runtime (as small as 16 Mb) • Java FX Embedded • Allows to build fancy UI for the Raspberry Pi • EGL/DFB acceleration • Touch screen support
  • 15. 15 WWW.BELL-SW.COM Minimal VM • Optimized for footprint, rather than functionality • Serial GC • C1 JIT compiler • No JDWP support • No JMX support • But… it is < 4 Mb! • Linux x86_64 Server VM: 23 Mb • jlink @since jdk9 • java.base with Minimal VM under 16 Mb! • Modules for jetty: under 32 Mb
  • 16. 16 WWW.BELL-SW.COM ARMv8-A Specification ARMv8-A - 64 & 32-bit - 31 GPRs - SIMD (NEON) - AES, SHA ARMv8.1-A - New Atomics - CRC32 ARMv8.2-A - Optional SVE (128-2048 bits) - Dot Product SIMD - Half-precision FP ARMv8.3-A - Complex FP SIMD - Nested virtualization ARMv8.4-A - SHA3, 512 - SM3, 4 Dec 2011 Jan 2014 Jan 2016 Oct 2016 2018
  • 18. 18 WWW.BELL-SW.COM Ampere Computing (ex APM) Up to 32 cores Up to 32 threads 8 DDR Channels 32 Mb L3
  • 19. 19 WWW.BELL-SW.COM Cavium/Marvell ThunderX2 32 cores/128 threads 32 Mb L3 8 DDR Channels/socket Multi-socket Up to 4 TB RAM
  • 23. 23 WWW.BELL-SW.COM OpenJDK ARM ports • ARM (32 bit & 64 bit) – Full Java SE Spec – ARM v6/v7/v8 – C1 & C2 • AARCH64 (64 bit only) – Full Java SE Spec – C1 & C2 – G1 / Parallel GC / Shenandoah (and ZGC is coming) – AppCDS, JFR, NMT, AOT
  • 24. 24 WWW.BELL-SW.COM Intrinsics Intrinsic: “function (subroutine) available for use in a given programming language which implementation is handled specially by the compiler.”
  • 25. 25 WWW.BELL-SW.COM What will C2 do with math Java code? java.lang.Math: /** * Returns as a {@code long} the most significant 64 bits of the * 128-bit product of two 64-bit factors. * @since 9 */ public static long multiplyHigh(long x, long y) { if (x < 0 || y < 0) { long x1 = x >> 32; long x2 = x & 0xFFFFFFFFL; long y1 = y >> 32; long y2 = y & 0xFFFFFFFFL; long z2 = x2 * y2; long t = x1 * y2 + (z2 >>> 32); long z1 = t & 0xFFFFFFFFL; long z0 = t >> 32; z1 += x2 * y1; return x1 * y1 + z0 + (z1 >> 32); } else { …
  • 26. 26 WWW.BELL-SW.COM What will C2 do with math Java code? java.lang.Math: /** * Returns as a {@code long} the most significant 64 bits of the 128-bit * product of two 64-bit factors. * @since 9 */ public static long multiplyHigh(long x, long y) { // Use technique from section 8-2 of Henry S. Warren, Jr., // Hacker's Delight (2nd ed.) (Addison Wesley, 2013), 173-174. ... // Use Karatsuba technique with two base 2^32 digits. ... return ...; }
  • 27. 27 WWW.BELL-SW.COM Что из этого делает C2? Math code in assembly 14 operations with latency 1
  • 28. 28 WWW.BELL-SW.COM Can we make it faster? • Rewrite as a С + JNI call • Well, it will be slower • Tune HotSpot to optimize IR for this code better* • Even if this is possible, this might lead to regressions • Tune HotSpot to detect this method and substitute optimal code instead SMULH Xd, Xn, Xm (cost: 4) “Signed multiply high”
  • 29. 29 WWW.BELL-SW.COM C2 Intrinsic How-to 1) Add SMULH instruction into ${arch}/assembler_${arch}.hpp 2) Describe a node with this instruction and its cost in ${arch}.ad 3) Mark this method as intrinsic in share/classfile/vmSymbols.hpp 4) Substitute the method with the node bool LibraryCallKit::inline_math_multiplyHigh() { set_result(_gvn.transform(new MulHiLNode(arg (0), arg (2)))); return true; } 5) Annotate j.l.Math.multiplyHigh() @HotSpotIntrinsicCandidate 6) Measure performance
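  • 30. Benchmarking (throughput)
        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.OperationsPerInvocation;

        public class MultiplyHighJMHBench {
            @Benchmark
            @OperationsPerInvocation(10000)
            public long bench() {
                // Operands are derived from the clock so the JIT cannot constant-fold them.
                long op = System.currentTimeMillis();
                long accum = 0;
                for (int i = 0; i < 10000; i++) {
                    accum += Math.multiplyHigh(op + i, op + i);
                }
                // Returning the accumulator keeps the loop from being eliminated as dead code.
                return accum;
            }
        }
      Result: good for JDK 11! SMULH cost: 4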
  • 31. Let’s do something useful for enterprise apps
      • What does a JVM do when executing a typical enterprise program?
          – Creates and copies objects, strings, arrays; frees memory
          – Searches and compares objects, strings, arrays
          – Checks that the right information is received
  • 32. String s = new String("Can this work faster?");
      • Compact Strings, @since JDK 9
          – Most strings do not require UTF-16 as their inner representation
          – Inner representation of strings: char[] -> byte[] plus a coder
              • Either ISO-8859-1/Latin-1
              • Or UTF-16, if required
      • Example: “String” fits in Latin-1; “Строка” (Russian for “string”) requires UTF-16
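The Latin-1 versus UTF-16 choice can be approximated from outside the JDK. The sketch below is an assumption-laden illustration: the class `CompactStringDemo` and the helper `fitsLatin1` are hypothetical, and they only mimic the decision with public charset APIs rather than reading the JDK's internal `coder` field.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: mimics the compact-strings decision with public APIs only.
class CompactStringDemo {

    /** True if every character of s is representable in ISO-8859-1 (one byte per character). */
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    /** Payload size in bytes under the scheme described on the slide (approximation). */
    static int approximatePayloadBytes(String s) {
        return fitsLatin1(s)
                ? s.getBytes(StandardCharsets.ISO_8859_1).length   // 1 byte per char
                : s.getBytes(StandardCharsets.UTF_16LE).length;    // 2 bytes per char, no BOM
    }

    public static void main(String[] args) {
        for (String s : new String[] {"String", "Строка"}) {
            System.out.println(s + ": latin1=" + fitsLatin1(s)
                    + ", payload bytes=" + approximatePayloadBytes(s));
        }
    }
}
```

Both sample strings have six characters, so the sketch makes the one-byte versus two-byte cost per character directly visible.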
  • 33. 1001 Heap Dump
      • Log-normal distribution
      • < 0.3% of all strings are not Latin-1
      • 18% of strings are shorter than 8 symbols
      • 66% of strings are shorter than 32 symbols
      • 95% of strings are shorter than 128 symbols
      Any changes to improve the current state of things should not cause regressions on this dataset.
      [Chart: String length distribution; x-axis: string length, 0 to 120]
  • 34. String s = new String("Can this work faster?");
      [Call diagram: new String(…); StringDecoder.decode(); decodeASCII(), decodeLatin1(), decodeUTF8(); StringCoding.decode(); hasNegatives()]
        if (!hasNegatives()) {
            // ascii fastpath
        }
  • 35. StringCoding.hasNegatives()
        @HotSpotIntrinsicCandidate
        public static boolean hasNegatives(byte[] ba, int off, int len) {
            for (int i = off; i < off + len; i++) {
                if (ba[i] < 0) {
                    return true;
                }
            }
            return false;
        }
  • 36. Some ARM assembly: memory reads (instruction: register, width in bits, latency in cycles)
      • LDRB: GPR, 8 bits, 4 cycles
      • LDRH: GPR, 16 bits, 4 cycles
      • LDR: GPR, 32 or 64 bits, 4 cycles
      • LDP: GPR, 64+64 bits, 5 cycles
  • 37. Learning to read (again)
      [Diagram: load sequences LDP LDP LDP LDR LDRH LDP LDR LDR LDRB, ending in SEGFAULT]
  • 38. And compare 8 bytes at a time with 0
        const uint64_t UPPER_BIT_MASK=0x8080808080808080;
        ...
        __ tst(rscratch2, UPPER_BIT_MASK);
      This replaces the byte-by-byte loop:
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
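The masking trick can be tried in plain Java. The following is a sketch only, not the HotSpot intrinsic: the class name `HasNegativesSwar` is hypothetical, and it uses `MethodHandles.byteArrayViewVarHandle` to read eight bytes of the array as one `long`, then tests all eight sign bits with a single AND against the mask from the slide.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical sketch: tests the sign bit of 8 bytes with one mask operation.
class HasNegativesSwar {

    // The high bit of each of the 8 bytes packed in a long (same constant as on the slide).
    private static final long UPPER_BIT_MASK = 0x8080808080808080L;

    // Views a byte[] as longs; plain get() tolerates unaligned byte offsets.
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static boolean hasNegatives(byte[] ba, int off, int len) {
        int i = off;
        int end = off + len;
        // Main loop: one 8-byte read and one mask test per iteration.
        for (; i + 8 <= end; i += 8) {
            long word = (long) LONG_VIEW.get(ba, i);
            if ((word & UPPER_BIT_MASK) != 0) {
                return true;
            }
        }
        // Tail: fewer than 8 bytes left, checked one by one so we never read past the range.
        for (; i < end; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] ascii = "Can this work faster?".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        byte[] utf8 = "Строка".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(hasNegatives(ascii, 0, ascii.length));
        System.out.println(hasNegatives(utf8, 0, utf8.length));
    }
}
```

The byte order passed to the view does not matter here, because the mask treats every byte position identically.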
  • 39. Aligned memory access
      • x86: in most cases modern processors do not have a penalty for unaligned memory access
      • ARM is a spec:
          – some CPU manufacturers do not have a penalty
          – others do (20%, 50%, 100%)
  • 40. How to align memory access
      [Diagram: an LDR read followed by LDP reads]
        // pre-loop
        __ ldr();
        …
        __ tst(…, UPPER_BIT_MASK);
        // main loop
        __ ldp(); // aligned
        …
        __ tst(…, UPPER_BIT_MASK);
  • 41. The plan for the hasNegatives() intrinsic
      • Read as many bytes at a time as possible, without crossing page boundaries
      • If the page border is close
          – Read fewer bytes
          – Shift to the left
      • Compare as many bytes with 0 as possible at a time
      • Align memory access
      • Reality
          – The code gets too big: 200 instructions
          – This interferes with inlining: C2 inlines up to 1500 instructions
  • 42. Code is too big: what do we do?
        if (len > 32) return stubHasNegatives(ba, 0, len);
        for (int i = 0; i < 32; i++) {
            if (ba[i] < 0) { // ldr, tst
                return true;
            }
        }
        return stubHasNegatives(ba, 32, len); // ldp, tst
      • Java pseudo-code for the ARM assembly, which is short (27 instructions)
      • Not optimal, unaligned, but short
      • The rest of the code goes into a stub
  • 43. What is a stub?
      • A type of assembly inline in HotSpot
      • The closest analogy is a function
      • Can be called from macroAssembler
      • Code gets loaded once, during JVM startup
      • Does not get inlined
      • Several entry points are possible
      • Calling a stub carries some performance penalty
  • 44. What should we place in the stub?
        // align memory access
        __ bind(LARGE_LOOP); // 64 bytes at a time
        4x __ ldp(); // ary1, ary1+16, ary1+32, ary1+48
        __ add(ary1, ary1, large_loop_size);
        __ sub(len, len, large_loop_size);
        7x __ orr(…);
        __ tst(tmp2, UPPER_BIT_MASK);
        __ br(Assembler::NE, RET_TRUE);
        __ cmp(len, large_loop_size);
        __ br(Assembler::GE, LARGE_LOOP);
      OK, we helped C2. Can we help the hardware?
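The structure of the large loop (load 64 bytes, OR the words together, test the mask once) can be mirrored in Java. This is a hedged sketch under assumptions: the class `HasNegativesLargeLoop` and the constant `LARGE_LOOP_SIZE` are illustrative names, eight `long` reads stand in for the four `ldp` pairs, and nothing here reproduces the alignment or page-boundary handling of the real stub.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical sketch of the stub's large loop: 64 bytes per iteration, one mask test.
class HasNegativesLargeLoop {

    private static final long UPPER_BIT_MASK = 0x8080808080808080L;
    private static final int LARGE_LOOP_SIZE = 64; // bytes per iteration, as on the slide

    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static boolean hasNegatives(byte[] ba, int off, int len) {
        int pos = off;
        int remaining = len;
        while (remaining >= LARGE_LOOP_SIZE) {
            // OR eight 8-byte words together: any set sign bit survives in the accumulator.
            long acc = 0;
            for (int k = 0; k < LARGE_LOOP_SIZE; k += Long.BYTES) {
                acc |= (long) LONG_VIEW.get(ba, pos + k);
            }
            // A single test covers all 64 bytes, so the branch runs once per block.
            if ((acc & UPPER_BIT_MASK) != 0) {
                return true;
            }
            pos += LARGE_LOOP_SIZE;
            remaining -= LARGE_LOOP_SIZE;
        }
        // Remainder shorter than one block: plain byte-by-byte check.
        for (int i = pos; i < pos + remaining; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = new byte[200];
        System.out.println(hasNegatives(data, 0, data.length));
        data[130] = (byte) 0xC3; // first byte of a two-byte UTF-8 sequence
        System.out.println(hasNegatives(data, 0, data.length));
    }
}
```

Accumulating with OR trades early exit inside a block for fewer branches, which is the same trade the seven `orr` instructions make in the stub.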
  • 45. Software Prefetching
      Let’s give the processor a hint about where we are going to read from memory next:
        __ prfm(Address(ary1, SoftwarePrefetchHintDistance));
        // do local register or operations on data in cache
        __ ldp();
      • Can be a major performance gain if
          – The processor has enough data to process between prfm and the memory load
          – SoftwarePrefetchHintDistance is correctly defined: > d_cache_line_size
  • 46. Benchmark for new String(): long strings
      [Chart: speedup compared to C2 (times) by number of symbols: 2, 8, 16, 32, 256, 1024, 16384]
      Speedup up to 5x! Longer strings gain more from the optimization due to
      • Optimal ldp & tst use
      • Prefetching
  • 47. Benchmark for new String(): results
      [Chart: improvement over C2 (times) and probability, by string length from 1 to 31]
  • 49. Performance improvement
      • Speedup up to 78x in microbenchmarks
      [Chart: average performance improvement*, times, for java.lang.Math.log(), java.lang.Math.sin(), java.lang.Math.cos(), java.lang.String.new String(), java.lang.String.compareTo(), java.lang.StringUTF16.compress(), java.lang.StringLatin1.inflate(), java.lang.String.indexOf(), java.util.zip.CRC32.update(), java.util.Arrays.equals()]
      * Mean improvement over different sizes, lengths and encodings
  • 50. JVM Benchmark #1 results
      [Chart: SPECjbb2015 composite score (jOPS), Max-jOPS and Critical-jOPS, Xeon Gold 6140 vs. ThunderX2 CN9975]
      ARMv8: -Xmx24G -Xms24G -Xmn16G -XX:+AlwaysPreTouch -XX:+UseParallelGC -XX:+UseTransparentHugePages -XX:-UseBiasedLocking
      x86: -Xmx24G -Xms24G -Xmn16G -XX:+AlwaysPreTouch -XX:+UseParallelGC -XX:+UseTransparentHugePages -XX:+UseBiasedLocking
      • Liberica JDK 11
      • Average over 20 runs
      • JEP 315 in JDK 11
      • Cavium ThunderX2 outperforms Xeon 6140
          – by 33% in Max-jOPS score
          – by 16% in Critical-jOPS score
  • 51. JVM Benchmark #2 results
      • Liberica JDK 11
      • Default JVM settings
      • Average over 20 runs
      • ThunderX2 outperforms Xeon 6140
          – by 62% in Crypto
          – by 42% in MpegAudio
          – by 29% in XML
          – by 12% in Compress
      • Xeon 6140 outperforms ThunderX2
          – by 29% in scimark.small
      [Chart: SPECjvm2008 score (ops/m) for composite, compress, crypto, derby, mpegaudio, scimark.large, scimark.small, serial, sunflow, xml; Xeon Gold 6140 vs. ThunderX2 CN9975]
  • 52. Where to try Arm servers? Bare metal; VPS
  • 54. Conclusions
      • Arm server vendors did a great job
      • Cloud providers provide access to Arm servers right now
      • Ubuntu, Red Hat, Oracle Linux, SuSE have ARMv8 support
      • The software ecosystem just works as expected on ARMv8
      • OpenJDK 11 is optimized for ARMv8
      Download and install Liberica for ARMv8
