Performance van Java 8 en verder - Jeroen Borgers

Performance of Java 8
and beyond
Performance van Java 8 en verder
By Jeroen Borgers
1

Contents
1. Introduction
2. Lambda expressions
3. Stream API
4. Parallel execution & cores
5. Filter map reduce, parallel streams internals
6. Fork-join framework use
7. Lambda’s versus inner classes
8. Tiered compilation
9. PermGen removal
10.java.time performance
11.Accumulators en Adders
12.Map improvements
13.Java 9+ improvements
14.Utilization of GPU's
15.Value Types
16.Arrays 2.0
17.Summary and conclusions 2

Introduction to lambdas and streams
• Java 8 introduces lambda expressions for functional
programming
• With the Stream API iteration can be handled internally by a
library
• Tell don’t ask for applying a function on a collection
• or tell to do that in parallel, on multiple cores
• question is if this improves your response time
3

Lambda expressions and streams
• Example
4

Lambda expressions and streams
• Example with method references
5

Lambda expressions: short notation
• instance of anonymous inner class of functional interface
• functional interface has only one abstract method
• Runnable: void run()
• Executor: void execute(Runnable r)
• Iterable<T>: Iterator<T> iterator()
• new: java.util.function
• Consumer<T>: void accept(T t)
• Function<T, R>: R apply(T t)
• Predicate<T>: boolean test(T t)
• Annotation: @FunctionalInterface
6

Anonymous inner class instance example
7

Inner class has boiler plate code
8

Lambda expression is concise
9

Stream pipeline
Source Intermediate operations
lazy evaluation
10
Terminal operations
eager evaluation

Stream lazy evaluation optimizes with short-circuiting
- can be big win
12

Stream executed in parallel
13

Parallel execution & hardware threads
• Parallel != concurrent
• CPU Frequency at max
• #cores/hardware threads increase 64+
• Must be able to utilize those cores
• need to process data faster: BigData, IoT
• Runtime.getRuntime().availableProcessors()
• reports #hardware threads
• my Mac: 2 cores with 2 hyper threads = 4
• Can we get a speedup of ~4?
14

Parallel streams utilize ForkJoinPool
• Java 8 ForkJoinPool introduces a common pool for any ForkJoinTask
• one per JVM
• Used in Array.parallelSort, .parallelSetAll and parallelStream
• Size defaults to Runtime.getRuntime().availableProcessors() - 1
• Can be set with:
• -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
• Multiple JVM’s on a machine
• consider lowering the pool size
• Tasks waiting for I/O
• consider increasing the pool size
15

Fork-join framework: divide-and-conquer
• Divide task recursively in smaller tasks
• Divide array of 640 elements into 64
leaf tasks of 10 elements
• e.g. sum or sort on each level
• Many ForkJoinTasks processed by
limited threads, e.g. ForEachTask
• like ThreadPoolExecutor
• worse: overhead of creating tasks
• better: work stealing from queue
of other threads
• great for unbalanced tasks!
16

Performance of Lambda’s versus inner classes
• Lambdas seem syntactic sugar around creating anonymous class
• in fact, it is not
• Inner class
• Actual class loaded by class loader
• New object created, allocation, initialization, gc
• Lambda
• creates a static method called through helper class
• Performance is similar
• Only first time loading inner class in class loader is slower
17

When to use parallel streams?
• source.parallelStream().operation(F)
• F independent
• computation on element does not rely on or impact other
• stateless, non-interfering
• source is efficiently splittable
• Collections, Arrays, SplittableRandom
• not I/O based: designed for sequential use
• computationally expensive
• ROT: sequential version > 100 μs
18

Parallel when computationally expensive
• source.parallelStream().operation(F)
• ROT: sequential version > 100 μs
• N * Q > 10 000
• N = #elements
• Q = cost per element of F: #operations
• small function like x -> x * x: N > 10 000 elements
• moderately large function Q = 100: N > 100 elements
19

Overhead of parallel execution
• Startup of power-controlled cores
• Sequential part of setting up parallel calculation
• Splittability = ease of partitioning
• efficient if random access or efficient search:
• ArrayLists, [Concurrent]HashMaps, arrays
• inefficient: LinkedLists, BlockingQueues, IO-based
• Stream BufferedReader.lines() currently for sequential
use
• might by improved in future JDK, for highly efficient
bulk processing of buffered IO
20

Creating the micro benchmark
Tiny calculation per element
21

Creating the micro benchmark 2
22

Medium sized calculation benchmark
• 1000 elements
• Speedup by using serial lambda's = 0.95884454
• Speedup of parallel over serial lambda's= 1.2968781
• Speedup of parallel over oldSchool = 1.2435045
• 100_000 elements
24

Utilization of cores
Medium calculation, 1000 and 100_000 elements
Parallel part 25

Tiny calculation benchmark
• 1000 elements
• 100_000 elements
28

Utilization of cores
Tiny calculation, 1000 and 100_000 elements
29

Micro benchmark conclusions
(for this benchmark, on this computer)
• For high performance and small functions: use old school loops
• lambda’s infrastructure takes more overhead than function
• For high performance and large functions
• serial
• if N * Q > 100 000 then parallel
• I need more cores!
30

Tiered compilation
• JIT-compiler came in 2 flavors, now 3
• -client (C1)
• quick startup time
• -server (C2)
• best performance in long run
• -XX:+TieredCompilation
• first C1, then C2
• only Java 8: TieredCompilation default
• Java 7: often need to increase code cache
• -XX:ReservedCodeCacheSize=96M (7) 240M (8)
31

Permgen removal
• Upto Java 7: Permgen; Java 8: Metaspace
• Permgen (wrong name)
• data not related to classes: String pool
• Metaspace
• only class meta data
• Class objects itself on heap
• String pool on heap
• -XX:[Max]MetaspaceSize=N
• Default max ‘unlimited’ (1 GB)
• OutOfMemoryError: Metaspace instead of PermGen space
32

java.time performance
• Finally a proper library for Date and Time that replaces the
• Crappy stuff:
• java.util.Date
• mutable - defensive copies needed
• java.util.Calendar
• 540 bytes to store timestamp, Locale, TZ - heap/gc
• java.text.SimpleDateFormat
• not thread safe - so have to re-create
• Stephen Colebourne spec lead, from Joda time
33

java.util.concurrent.atomic
Accumulators and Adders
34

Map improvements
• HashMap, LinkedHashMap and
ConcurrentHashMap
• collisions on keys: keys end up in same bucket
• access time O(1) -> O(n)
• follow LinkedList until key.equals() returns
true
• Balanced tree instead of linked list
• if size > TREEIFY_THRESHOLD (8)
• worst case access time O(n) -> O(log(n))
• keys should implement Comparable
• branches on hashCode, then compareTo
35

Java 9+ performance improvements
36

Sumatra: Utilization of GPU's
• GPU’s have 100_000’s of stream cores
• SIMD - single instruction multiple data
• work offloaded to GPU
• implemented off-loadable version of parallel().forEach()
• Use parallel streams and lambdas
37

Value Types (JEP 169)
The next big thing!
• Currently:
• limited set of primitives, by value: no identity
• others by reference: identity
• footprint:
• heap allocated
• object headers
• 1+ pointers pointing to it
• burden for small objects
• object identity only serves mutability
• JVM attempts to figure out if identity is needed
• escape analysis and object elision can unwrap in cases
• fragile
• Object might be used as lock, then needs identity
38

Integer overhead
39
mark word class pointer value
object pointer
value

Point - class versus value type
• Point object layout
• @Value Point layout
41
mark word class pointer x
object pointer
x
y padding
y

Arrays 2.0 Improvements
• array[(long)i] = 5;
• array[i, j, k] = 7;
• Arrays.chop(T[] a, int newLen);
• prevents copying in StringBuilder.toString()
• arrays become real Java objects
• indexes of other types than int, long
• like Map
• thread-safe access for array slices
• final/volatile
42

• Summary and conclusions
• Lamdas and streams offer possible performance improvement
• lazy evaluation
• tiny calcs or small #elements & medium size calc
• don’t use parallel()
• consider old school iterations if performance important
• Many performance improvements in Java 8
• Use it if you can and get better performance
• Several performance improvements planned for Java 9+ (10?)
• Better support for Big Data & number crunching
43

Want to know more?
• www.jpinpoint.com / www.profactive.com
• references, presentations
• Accelerating Java Applications
• 3 days technical training
• 24-25-26 November 2014
• nl-jug members 10% discount
• hand-in business card today: 20% discount
44

Performance van Java 8 en verder - Jeroen Borgers

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Performance van Java 8 en verder - Jeroen Borgers

Similaire à Performance van Java 8 en verder - Jeroen Borgers (20)

Plus de NLJUG

Plus de NLJUG (20)

Dernier

Dernier (20)

Performance van Java 8 en verder - Jeroen Borgers