We all know that the biggest improvement Java 8 brings is its support for lambda expressions, which introduces functional programming to Java. The addition of the Stream API makes this improvement even bigger: iteration can now be handled internally by a library, so you can apply the "Tell, don't ask" principle to collections. You can simply tell it to apply a function to your collection, or tell it to do so in parallel, across multiple cores. But what does this mean for the performance of our Java applications? Can we now immediately utilize all our CPU cores to get better response times? How do filter/map/reduce and parallel streams work internally? How is the fork-join framework used here? Are lambdas faster than inner classes? All of these questions are answered in this session. In addition, Java 8 introduces more performance improvements: tiered compilation, PermGen removal, java.time, accumulators, adders and Map improvements. Finally, we take a behind-the-scenes look at the performance improvements planned for Java 9: utilization of GPUs, value types and Arrays 2.0.
Performance of Java 8 and beyond - Jeroen Borgers
1. Performance of Java 8 and beyond
By Jeroen Borgers
2. Contents
1. Introduction
2. Lambda expressions
3. Stream API
4. Parallel execution & cores
5. Filter map reduce, parallel streams internals
6. Fork-join framework use
7. Lambdas versus inner classes
8. Tiered compilation
9. PermGen removal
10. java.time performance
11. Accumulators and Adders
12. Map improvements
13. Java 9+ improvements
14. Utilization of GPUs
15. Value Types
16. Arrays 2.0
17. Summary and conclusions
3. Introduction to lambdas and streams
• Java 8 introduces lambda expressions for functional
programming
• With the Stream API iteration can be handled internally by a
library
• "Tell, don’t ask": apply a function to a collection
• or tell it to do that in parallel, on multiple cores
• the question is whether this improves your response time
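A minimal sketch of this "tell, don’t ask" style (class and method names are illustrative; the Stream API calls are standard Java 8):

```java
import java.util.Arrays;
import java.util.List;

public class TellDontAsk {
    // Internal iteration: we tell the library what to do with each element,
    // instead of writing the loop ourselves.
    static int sumOfSquaresOfEvens(List<Integer> numbers) {
        return numbers.stream()
                .filter(n -> n % 2 == 0)  // keep even numbers
                .mapToInt(n -> n * n)     // square each
                .sum();                   // reduce to a single value
    }

    // Going parallel, on multiple cores, is a one-word change:
    static int sumOfSquaresOfEvensParallel(List<Integer> numbers) {
        return numbers.parallelStream()
                .filter(n -> n % 2 == 0)
                .mapToInt(n -> n * n)
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> nums = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(sumOfSquaresOfEvens(nums));         // 4 + 16 + 36 = 56
        System.out.println(sumOfSquaresOfEvensParallel(nums)); // same result
    }
}
```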
14. Parallel execution & hardware threads
• Parallel != concurrent
• CPU clock frequency has hit its ceiling
• #cores/hardware threads increasing to 64+
• Must be able to utilize those cores
• need to process data faster: BigData, IoT
• Runtime.getRuntime().availableProcessors()
• reports #hardware threads
• my Mac: 2 cores with 2 hyper threads = 4
• Can we get a speedup of ~4?
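A quick way to see how many hardware threads the JVM reports on your machine (the printed number of course varies per machine):

```java
public class CoreCount {
    public static void main(String[] args) {
        // Reports the number of hardware threads (logical CPUs), not
        // physical cores: e.g. 2 cores x 2 hyper-threads = 4.
        int hwThreads = Runtime.getRuntime().availableProcessors();
        System.out.println("Hardware threads: " + hwThreads);
    }
}
```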
15. Parallel streams utilize ForkJoinPool
• Java 8 ForkJoinPool introduces a common pool for any ForkJoinTask
• one per JVM
• Used in Arrays.parallelSort, Arrays.parallelSetAll and parallelStream
• Size defaults to Runtime.getRuntime().availableProcessors() - 1
• Can be set with:
• -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
• Multiple JVMs on a machine
• consider lowering the pool size
• Tasks waiting for I/O
• consider increasing the pool size
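The common pool's size can be inspected, and overridden before it is first used; a small sketch using the system property named on the slide (normally you would pass it on the command line rather than set it in code):

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolSize {
    public static void main(String[] args) {
        // Must be set before the common pool is first touched; normally
        // passed on the command line as
        //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "2");

        // Defaults to availableProcessors() - 1 when the property is absent.
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
    }
}
```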
16. Fork-join framework: divide-and-conquer
• Divides a task recursively into smaller tasks
• e.g. an array of 640 elements into 64 leaf tasks of 10 elements
• e.g. sum or sort on each level
• Many ForkJoinTasks processed by a limited number of threads, e.g. ForEachTask
• like ThreadPoolExecutor
• worse: overhead of creating tasks
• better: work stealing from the queues of other threads
• great for unbalanced tasks!
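The divide-and-conquer pattern above can be sketched with a RecursiveTask that sums an array; the class is illustrative, and the threshold of 10 mirrors the 10-element leaf tasks from the slide:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10; // leaf task size
    private final long[] array;
    private final int from, to; // half-open range [from, to)

    ForkJoinSum(long[] array, int from, int to) {
        this.array = array;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {          // small enough: sum directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += array[i];
            return sum;
        }
        int mid = (from + to) / 2;             // otherwise: split in two
        ForkJoinSum left = new ForkJoinSum(array, from, mid);
        ForkJoinSum right = new ForkJoinSum(array, mid, to);
        left.fork();                           // left half: queued, may be stolen
        return right.compute() + left.join();  // right half: computed here
    }

    public static long sum(long[] array) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(array, 0, array.length));
    }

    public static void main(String[] args) {
        long[] data = new long[640];           // 640 elements -> 64 leaf tasks
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(ForkJoinSum.sum(data)); // 1 + 2 + ... + 640 = 205120
    }
}
```

The fork/join pair is where work stealing kicks in: the forked left half sits in this thread's deque until an idle worker steals it.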
17. Performance of lambdas versus inner classes
• Lambdas look like syntactic sugar for anonymous inner classes
• in fact, they are not
• Inner class
• an actual class, loaded by the class loader
• a new object created: allocation, initialization, gc
• Lambda
• desugared to a (static) method, invoked through a generated helper class
• Performance is similar
• only the first use differs: loading the inner class through the class loader is slower
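The two forms compared above look like this (illustrative class and field names); they behave identically, but the anonymous class is compiled to a separate .class file loaded on first use, while the lambda's body becomes a method that is linked lazily at runtime:

```java
import java.util.function.IntUnaryOperator;

public class LambdaVsInnerClass {
    // Anonymous inner class: a real class (LambdaVsInnerClass$1.class) is
    // generated at compile time and loaded by the class loader on first use.
    static final IntUnaryOperator innerClassSquare = new IntUnaryOperator() {
        @Override
        public int applyAsInt(int x) {
            return x * x;
        }
    };

    // Lambda: the body is desugared to a method in this class; the call site
    // is linked at runtime via invokedynamic and a generated helper class.
    static final IntUnaryOperator lambdaSquare = x -> x * x;

    public static void main(String[] args) {
        System.out.println(innerClassSquare.applyAsInt(7)); // 49
        System.out.println(lambdaSquare.applyAsInt(7));     // 49
    }
}
```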
18. When to use parallel streams?
• source.parallelStream().operation(F)
• F independent
• computation on one element does not rely on or affect other elements
• stateless, non-interfering
• source is efficiently splittable
• Collections, Arrays, SplittableRandom
• not I/O based: those are designed for sequential use
• computationally expensive
• rule of thumb (ROT): sequential version takes > 100 μs
19. Parallel when computationally expensive
• source.parallelStream().operation(F)
• ROT: sequential version > 100 μs
• N * Q > 10 000
• N = #elements
• Q = cost per element of F: #operations
• small function like x -> x * x: N > 10 000 elements
• moderately large function Q = 100: N > 100 elements
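The rule of thumb above can be written down as a tiny decision helper (illustrative code; the 10 000 threshold is the one from the slide):

```java
public class ParallelHeuristic {
    // ROT from the slide: go parallel when N * Q > 10_000, where N is the
    // number of elements and Q the per-element cost of F in "operations".
    static boolean worthParallelizing(long n, long q) {
        return n * q > 10_000;
    }

    public static void main(String[] args) {
        // small function like x -> x * x (Q ~ 1): need > 10_000 elements
        System.out.println(worthParallelizing(10_000, 1));  // false
        System.out.println(worthParallelizing(20_000, 1));  // true
        // moderately large function (Q = 100): > 100 elements suffice
        System.out.println(worthParallelizing(101, 100));   // true
    }
}
```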
20. Overhead of parallel execution
• Startup of power-controlled cores
• Sequential part of setting up parallel calculation
• Splittability = ease of partitioning
• efficient if random access or efficient search:
• ArrayLists, [Concurrent]HashMaps, arrays
• inefficient: LinkedLists, BlockingQueues, IO-based
• the Stream from BufferedReader.lines() is currently designed for sequential use
• might be improved in a future JDK for highly efficient bulk processing of buffered I/O
24. Medium-sized calculation benchmark
• 1000 elements
• Speedup by using serial lambdas = 0.95884454
• Speedup of parallel over serial lambdas = 1.2968781
• Speedup of parallel over oldSchool = 1.2435045
• 100_000 elements
• Speedup by using serial lambdas = 0.9760258
• Speedup of parallel over serial lambdas = 2.1337924
• Speedup of parallel over oldSchool = 2.0826366
25. Utilization of cores
[chart: core utilization during the parallel part; medium calculation, 1000 and 100_000 elements]
28. Tiny calculation benchmark
• 1000 elements
• Speedup by using serial lambdas = 0.12944984
• Speedup of parallel over serial lambdas = 0.46804
• Speedup of parallel over oldSchool = 0.0605877
• 100_000 elements
• Speedup by using serial lambdas = 0.10920245
• Speedup of parallel over serial lambdas = 5.905797
• Speedup of parallel over oldSchool = 0.64492756
30. Micro benchmark conclusions
(for this benchmark, on this computer)
• For high performance and small functions: use old-school loops
• the lambda/stream infrastructure costs more than the function itself
• For high performance and large functions
• serial by default
• if N * Q > 100 000, go parallel
• I need more cores!
31. Tiered compilation
• JIT-compiler came in 2 flavors, now 3
• -client (C1)
• quick startup time
• -server (C2)
• best performance in long run
• -XX:+TieredCompilation
• first C1, then C2
• only in Java 8 is TieredCompilation the default
• Java 7: often need to increase the code cache
• -XX:ReservedCodeCacheSize default: 96M (Java 7), 240M (Java 8)
32. PermGen removal
• Up to Java 7: PermGen; Java 8: Metaspace
• PermGen (a misnomer)
• also held data not related to classes: the String pool
• Metaspace
• holds only class metadata
• Class objects themselves are on the heap
• the String pool is on the heap
• -XX:[Max]MetaspaceSize=N
• default max is ‘unlimited’ (1 GB)
• OutOfMemoryError: Metaspace instead of PermGen space
33. java.time performance
• Finally a proper date and time library, replacing the crufty old classes:
• java.util.Date
• mutable, so defensive copies are needed
• java.util.Calendar
• 540 bytes to store a timestamp, Locale and TZ: heap/gc pressure
• java.text.SimpleDateFormat
• not thread-safe, so it has to be re-created per use
• Spec lead: Stephen Colebourne, of Joda-Time
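The contrast on this slide in code (class name illustrative): DateTimeFormatter is immutable and thread-safe, so one shared instance suffices where SimpleDateFormat had to be re-created or thread-confined:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class JavaTimeExample {
    // Immutable and thread-safe: can be a shared constant, unlike
    // java.text.SimpleDateFormat, which must not be shared across threads.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2014, 11, 24);   // immutable value
        LocalDate nextDay = date.plusDays(1);          // returns a new instance
        System.out.println(FORMAT.format(date));       // 2014-11-24
        System.out.println(FORMAT.format(nextDay));    // 2014-11-25
    }
}
```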
35. Map improvements
• HashMap, LinkedHashMap and ConcurrentHashMap
• collisions on keys: keys end up in the same bucket
• access time degrades from O(1) to O(n)
• follow the linked list until key.equals() returns true
• Balanced tree instead of linked list
• if bucket size > TREEIFY_THRESHOLD (8)
• worst-case access time improves from O(n) to O(log(n))
• keys should implement Comparable
• branches on hashCode, then compareTo
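The Comparable advice can be illustrated with a deliberately bad key (illustrative class names): a constant hashCode forces every entry into one bucket, yet because the key implements Comparable, Java 8's HashMap can organize that bucket as a balanced tree and branch on compareTo instead of scanning linearly:

```java
import java.util.HashMap;
import java.util.Map;

public class TreeifiedBuckets {
    // Pathological key: every instance collides, so all entries land in the
    // same bucket. Implementing Comparable lets the Java 8 HashMap treeify
    // that bucket (once it exceeds TREEIFY_THRESHOLD = 8) and use compareTo.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; } // everyone collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) {
            return Integer.compare(id, other.id);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), i); // all 10_000 keys share one bucket
        }
        // Lookup stays fast thanks to the treeified bucket:
        System.out.println(map.get(new BadKey(9_999))); // 9999
    }
}
```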
37. Sumatra: Utilization of GPUs
• GPUs have 100_000s of stream cores
• SIMD - single instruction multiple data
• work offloaded to GPU
• implemented off-loadable version of parallel().forEach()
• Use parallel streams and lambdas
38. Value Types (JEP 169)
The next big thing!
• Currently:
• limited set of primitives, by value: no identity
• others by reference: identity
• footprint:
• heap allocated
• object headers
• 1+ pointers pointing to it
• burden for small objects
• object identity only serves mutability
• JVM attempts to figure out if identity is needed
• escape analysis and object elision can unwrap in cases
• fragile
• Object might be used as lock, then needs identity
41. Point: class versus value type
• Point object layout: object pointer → [mark word | class pointer | x | y | padding]
• @Value Point layout: [x | y]
42. Arrays 2.0 Improvements
• array[(long)i] = 5;
• array[i, j, k] = 7;
• Arrays.chop(T[] a, int newLen);
• prevents copying in StringBuilder.toString()
• arrays become real Java objects
• indexes of types other than int (e.g. long), making an array usable like a Map
• thread-safe access to array slices
• final/volatile elements
43. Summary and conclusions
• Lambdas and streams offer a possible performance improvement
• lazy evaluation
• for tiny calculations, or few elements with a medium-sized calculation
• don’t use parallel()
• consider old-school iteration if performance is important
• Many performance improvements in Java 8
• Use it if you can and get better performance
• Several performance improvements planned for Java 9+ (10?)
• Better support for Big Data & number crunching
44. Want to know more?
• www.jpinpoint.com / www.profactive.com
• references, presentations
• Accelerating Java Applications
• 3 days technical training
• 24-25-26 November 2014
• nl-jug members 10% discount
• hand-in business card today: 20% discount