During the ongoing mORMot refactoring, some core parts of the framework were rewritten. In this session, we follow the refactoring of a single loop. It takes us from a naïve but working approach to a 10 times faster Pascal rewrite, and then shows how SSE2 and AVX2 assembly can boost the process even further, to reach more than 30 times improvement! No previous knowledge of assembly is needed: we will introduce how modern CPUs work, and have some fun with algorithms and SIMD parallelism.
10. The Hash-Table Mystery
One core component
is TDynArrayHasher
= a hasher for a dynamic array
<> a hashed list
(it does not own the data)
Used e.g. by the TDynArray wrapper
the TSynDictionary class
the in-memory ORM engine
12. The Hash-Table Mystery
How does a Hash-Table work?
bucketindex := hash(key) mod bucketscount
for O(1) retrieval instead of O(n) manual lookup
19. The Hash-Table Mystery
How does a Hash-Table work?
It is easy to insert a new item
if we properly handle hash collisions
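A minimal sketch of the rule bucketindex := hash(key) mod bucketscount, written in C. It assumes a fixed bucket count, FNV-1a as a stand-in hash function and linear probing for collisions. None of these choices is taken from TDynArrayHasher, and every name below is hypothetical. As in the slides, the table stores only indexes into an array that it does not own.

```c
#include <stdint.h>
#include <string.h>

#define BUCKETS 8      /* hypothetical fixed bucket count */
#define EMPTY   (-1)   /* marks a free slot */

/* FNV-1a, used here only as a stand-in hash function */
uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

void table_init(int32_t table[BUCKETS]) {
    for (int i = 0; i < BUCKETS; i++)
        table[i] = EMPTY;
}

/* Returns the slot holding key, or the free slot where it would be stored.
   Assumes the table always keeps at least one free slot. */
int probe(const int32_t table[BUCKETS], const char *keys[], const char *key) {
    uint32_t b = fnv1a(key) % BUCKETS;   /* hash(key) mod bucketscount */
    while (table[b] != EMPTY && strcmp(keys[table[b]], key) != 0)
        b = (b + 1) % BUCKETS;           /* collision: try the next slot */
    return (int)b;
}

/* Stores the array index of keys[index], not the key itself */
void table_add(int32_t table[BUCKETS], const char *keys[], int32_t index) {
    table[probe(table, keys, keys[index])] = index;
}

/* Returns the array index of key, or EMPTY if it is absent */
int32_t table_find(const int32_t table[BUCKETS], const char *keys[], const char *key) {
    return table[probe(table, keys, key)];
}
```

With fewer items than buckets, probe() always ends either on the key or on a free slot; a real table would also grow before filling up.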
20. The Hash-Table Mystery
How does a Hash-Table work?
The hard thing is deletion:
you cannot just reset the slot,
since the array indexes stored in the other slots have changed
22. The Hash-Table Mystery
In case of deletion, we may:
1. Re-compute the whole hash table
What mORMot did for years. Not too bad in practice.
2. Adjust the indexes
Brute force O(n) algorithm.
3. Use another algorithm
More complex, and usually stores the data.
26. The Hash-Table Mystery
On Deletion, Adjust the Indexes
Brute force O(n) algorithm
Seems simple, lean and efficient.
Let’s try deleting 1/128th of 200,000 items!
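The brute-force adjustment can be sketched as follows, in C and with hypothetical names. Once item `deleted` is removed from the array, all following items move down by one position, so every index stored in the table that pointed past it has to be decremented. This is a sketch of the idea, not the mORMot source.

```c
#include <stddef.h>
#include <stdint.h>

/* p[] holds the array indexes stored in the hash table slots.
   Every stored index above the deleted one shifts down by one. */
void adjust_naive(int32_t *p, size_t count, int32_t deleted) {
    for (size_t i = 0; i < count; i++) {
        if (p[i] > deleted)   /* data-dependent branch, one per slot */
            p[i]--;
    }
}
```

Note that an optimizing C compiler is free to compile this loop without a branch, so the sketch shows the logic rather than reproducing the timings of the talk.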
27. The Hash-Table Mystery
On Deletion, Adjust the Indexes
Brute force O(n) algorithm
But not really fast on huge count.
23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms
Why????
28. • The Hash-Table Mystery
• Branches Are Evil
• SIMD Assembly: SSE2
• SIMD Assembly: AVX2
• Conclusion
33. Branches Are Evil
Processors Learn to Predict Branches
Present since the original Pentium; critical since the deep pipeline of the Pentium 4
In case of misprediction,
execution pipelines need to be flushed
… just as if you needed to rewind a tape
… just as if JS needed to garbage collect
35. Branches Are Evil
Processors Learn to Predict Branches
Each CPU vendor and architecture
changes the execution plan,
and some have even introduced Artificial Intelligence (neural-network branch predictors)
i.e. a CPU is a very complex beast:
don’t trust the code, nor the asm!
37. Branches Are Evil
[Figure: code listing with its three branches labelled 1, 2 and 3]
Branch 2 is always taken, branch 3 is taken every time but the last,
and branch 1 is “randomly” taken, so it is not predictable.
38. Branches Are Evil
Processors Learn to Predict Branches
Source:
https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/
39. Branches Are Evil
Processors Learn to Predict Branches
Pseudo code:
while (howmany != 0) {
    val = random();
    if (val is an odd integer) {
        out[index] = val;
        index += 1;
    }
    howmany--;
}
40. Branches Are Evil
Processors Learn to Predict Branches
The more trials, the better prediction…
the CPU somehow learns from its mistakes!
43. Branches Are Evil
Processors Learn to Predict Branches
… but prediction has a depth
From Lemire:
“This perfect prediction on the AMD Rome
falls apart if you grow the problem
from 2000 to 10,000 values: the best
prediction goes from a 0.1% error rate
to a 33% error rate.”
45. Branches Are Evil
Processors Learn to Predict Branches
… but prediction has a depth
From Lemire:
“You should probably avoid benchmarking
branchy code over small problems.”
That’s why I hate microbenchmarks!
And in the Delphi world, I have seen so many!
46. Branches Are Evil
Branch Misprediction Hurts
if … then …
the dec(P[i]) branch is taken or not taken about equally often,
in an unpredictable manner
(as random as the hash function itself)
Note: unrolling doesn’t help, by definition
48. Branches Are Evil
What about Going Parallel?
We could divide P[] into sections, and use threads
- it should scale with the number of CPU cores
- but we are in a low-level library, so threads are unavailable
- there should be a better way
50. Branches Are Evil
Introducing a Branch-Less Loop
ord(P[count] > delete)
boolean-to-integer expression returns
either 0 (false) or 1 (true)
and has no branch
51. Branches Are Evil
Introducing a Branch-Less Loop
FACT: it is actually faster to execute
dec(P[count], 0);
than to handle a mispredicted branch…
(i.e. execute nothing)
52. Branches Are Evil
Introducing a Branch-Less Loop
while count > 0 is very likely to loop
therefore easy to predict
(by branch predictor convention,
a backward jump is estimated as most probable)
53. Branches Are Evil
Introducing a Branch-Less Loop
ord(P[count] > delete)
compiles to very efficient asm
(branchless setl opcode)
54. Branches Are Evil
Introducing a Branch-Less Loop
Here, a little unrolling (slightly) helps…
since it avoids an unlikely count <= 0 condition/branch
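Put together, the branch-less loop might look like the sketch below, in C and with hypothetical names. The C comparison `p[i] > deleted` evaluates to 0 or 1, which is the role ord() plays in the Pascal expression, so each slot is decremented by 0 or by 1 without any data-dependent jump. The unrolling factor of four is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

void adjust_branchless(int32_t *p, size_t count, int32_t deleted) {
    /* unrolled by four: fewer loop-condition checks per slot */
    while (count >= 4) {
        p[0] -= (p[0] > deleted);   /* subtracts 0 or 1 */
        p[1] -= (p[1] > deleted);
        p[2] -= (p[2] > deleted);
        p[3] -= (p[3] > deleted);
        p += 4;
        count -= 4;
    }
    /* remaining 0..3 slots */
    while (count > 0) {
        *p -= (*p > deleted);
        p++;
        count--;
    }
}
```

The only branches left are the loop conditions, which are almost always taken and therefore easy to predict.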
55. Branches Are Evil
Numbers Are Talking
naïve if     adjust=4.27s      548.6MB/s
branchless   adjust=520.85ms   4.3GB/s
We have almost 10X better performance (about 8X as measured here),
in pure Pascal code!
58. SIMD Assembly: SSE2
Can SIMD Improve It Further?
• Data Alignment Restrictions
• Gathering/Scattering is Tricky
• Architecture Specific
• Not native to Delphi or FPC compilers
• Sometimes needs setup/clear
59. SIMD Assembly: SSE2
SSE2 SIMD Instructions
• Introduced by Intel in 2000 (Pentium 4)
• XMM0 to XMM7 Registers
in 32-bit mode
• XMM0 to XMM15
in x86_64 mode
60. SIMD Assembly: SSE2
SSE2 SIMD Instructions
• Each 128-bit XMM Register can handle
Two 64-bit Doubles or Integers
Four 32-bit Integers
Eight 16-bit or Sixteen 8-bit Integers
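To illustrate the four 32-bit lanes, here is a sketch of the index adjustment using C compiler intrinsics rather than the hand-written assembly of the talk. The function name is hypothetical. The packed comparison yields -1 (all bits set) in each lane where the stored index is greater than `deleted`, and 0 elsewhere, so adding that mask decrements exactly the right lanes. Unaligned loads and stores are used so that no alignment is assumed, and a scalar loop handles the tail and any target without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

void adjust_sse2(int32_t *p, size_t count, int32_t deleted) {
#ifdef HAVE_SSE2
    const __m128i d = _mm_set1_epi32(deleted);      /* deleted in all 4 lanes */
    while (count >= 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)p);
        __m128i gt = _mm_cmpgt_epi32(v, d);         /* -1 where v > deleted */
        _mm_storeu_si128((__m128i *)p, _mm_add_epi32(v, gt));
        p += 4;
        count -= 4;
    }
#endif
    /* scalar tail, also the fallback when SSE2 is not available */
    while (count > 0) {
        *p -= (*p > deleted);
        p++;
        count--;
    }
}
```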
67. SIMD Assembly: SSE2
Numbers Are Talking
naïve if     adjust=4.27s      548.6MB/s
branchless   adjust=520.85ms   4.3GB/s
sse2         adjust=201.53ms   11.3GB/s
We expected 4X
but we got about 2.6X
(pretty good, to be fair)
68. SIMD Assembly: SSE2
Help Needed?
https://www.agner.org/optimize/
The “Optimization Bible” (also per-CPU timing)
https://gcc.godbolt.org/
Check what the best compilers do
https://www.felixcloutier.com/x86/
OpCode Reference Documentation
71. SIMD Assembly: AVX2
AVX2 SIMD Instructions
• Each 256-bit YMM Register can handle
Four 64-bit Doubles or Integers
Eight 32-bit Integers
Sixteen 16-bit or Thirty-two 8-bit Integers
72. SIMD Assembly: AVX2
AVX2 SIMD Instructions
• Before using them:
Check the CPUID flag
Ensure the OS is AVX2-Aware
• AVX2 is Supported in FPC asm
• AVX2 is Not Supported in Delphi asm
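The runtime check and the eight-lane loop can be sketched in C as below. This is not the FPC assembly of the talk: it assumes GCC or Clang on x86, where `__builtin_cpu_supports("avx2")` performs the detection and the `target("avx2")` attribute allows AVX2 intrinsics in one function without enabling them for the whole program. Whether the builtin also covers the OS side of the check is best confirmed in the compiler documentation. Function names are hypothetical, and other compilers or CPUs simply take the scalar path.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define CAN_TRY_AVX2 1

/* Processes groups of 8 slots; returns how many slots were handled */
__attribute__((target("avx2")))
static size_t adjust_avx2_body(int32_t *p, size_t count, int32_t deleted) {
    const __m256i d = _mm256_set1_epi32(deleted);   /* deleted in all 8 lanes */
    size_t done = 0;
    while (count - done >= 8) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(p + done));
        __m256i gt = _mm256_cmpgt_epi32(v, d);      /* -1 where v > deleted */
        _mm256_storeu_si256((__m256i *)(p + done), _mm256_add_epi32(v, gt));
        done += 8;
    }
    return done;
}
#endif

void adjust_avx2(int32_t *p, size_t count, int32_t deleted) {
    size_t done = 0;
#ifdef CAN_TRY_AVX2
    __builtin_cpu_init();                           /* safe to call repeatedly */
    if (__builtin_cpu_supports("avx2"))
        done = adjust_avx2_body(p, count, deleted);
#endif
    /* scalar tail, and full fallback when AVX2 is not usable */
    for (; done < count; done++)
        p[done] -= (p[done] > deleted);
}
```

A production version would run the detection once at startup and keep the result, instead of testing on every call.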
75. SIMD Assembly: AVX2
Numbers Are Talking
naïve if     adjust=4.27s      548.6MB/s
branchless   adjust=520.85ms   4.3GB/s
sse2         adjust=201.53ms   11.3GB/s
avx2         adjust=161.73ms   14.1GB/s
We got only about 25% better numbers than SSE2
We saturated the CPU bandwidth
77. Conclusion
• On Deletion, TDynArrayHasher
is not a bottleneck any more
• The TDynArray.Delete data move
takes most of the time now
• We have a nice pure-pascal version
78. Conclusion
• Branches are Evil
• Never Trust Micro Benchmarks
• Unrolling is no magic
• Branchless is magic: 10 X faster
• SIMD is worth it if really needed
for another 3 X boost
79. From Delphi to AVX2
Questions?
No Marmots Were Harmed in the Making of This Session