Arnaud Bouchez - Synopse
Rewrite for Performance
From Delphi to AVX2
Welcome to
a fun/wakeup session
about performance
hashes
and assembly mystery
Arnaud Bouchez
• Open Source Founder
mORMot
SynPDF
• Delphi and FPC expert
DDD, SOA, ORM, MVC
Performance, SOLID
• Synopse Consulting
https://synopse.info
Menu du jour
• The Hash-Table Mystery
• Branches Are Evil
• SIMD Assembly: SSE2
• SIMD Assembly: AVX2
• Conclusion
The Hash-Table Mystery
mORMot is Fast
and always tries to be faster
so works hard for it
One core component
is TDynArrayHasher
= a hasher for a dynamic array
<> a hashed list
(it does not own the data)
Used e.g. by the TDynArray wrapper
the TSynDictionary class
the in-memory ORM engine
How does a Hash-Table work?
bucketindex := hash(key) mod bucketscount
for O(1) retrieval instead of O(n) manual lookup
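The bucket formula above can be sketched in C as follows. This is only an illustration: `toy_hash` is a hypothetical FNV-1a stand-in (the slides use crc32c or xxhash32), and `bucket_index` is an assumed name, not a mORMot function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in hash (32-bit FNV-1a), not the crc32c of the slides. */
static uint32_t toy_hash(const char *key)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    while (*key != '\0') {
        h ^= (uint8_t)*key++;
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* bucketindex := hash(key) mod bucketscount */
static size_t bucket_index(const char *key, size_t buckets_count)
{
    assert(buckets_count > 0);
    return toy_hash(key) % buckets_count;
}
```

The same key always lands in the same bucket, which is what gives O(1) retrieval as long as collisions stay rare.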
crc32c()
(hardware accelerated SSE4.2)
xxhash32()
(on non-Intel or old CPUs)
mORMot prefers indexes for efficiency
(and doesn’t store the hash code, since crc32c is fast)
mORMot stores keys with values
within a (dynamic) array
mORMot can hash several keys
in the same (dynamic) array
It is easy to insert a new item
if we properly handle hash collisions
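A minimal sketch of such an insertion, in C. The table stores indexes into the array that owns the keys, as described above. Linear probing is an assumption made for this example (the slides do not name the collision strategy), and `toy_table`, `toy_add`, `toy_find` and `toy_hash` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 8          /* hypothetical fixed table size */
#define EMPTY_SLOT (-1)

typedef struct {
    const char *keys[SLOT_COUNT];   /* the array owning the data */
    int         count;
    int32_t     slots[SLOT_COUNT];  /* hash table: indexes into keys[] */
} toy_table;

/* Hypothetical stand-in hash (32-bit FNV-1a). */
static uint32_t toy_hash(const char *key)
{
    uint32_t h = 2166136261u;
    while (*key != '\0') {
        h ^= (uint8_t)*key++;
        h *= 16777619u;
    }
    return h;
}

static void toy_init(toy_table *t)
{
    t->count = 0;
    for (int i = 0; i < SLOT_COUNT; i++)
        t->slots[i] = EMPTY_SLOT;
}

/* Returns the index of key in keys[], or -1 if absent. */
static int toy_find(const toy_table *t, const char *key)
{
    uint32_t s = toy_hash(key) % SLOT_COUNT;
    while (t->slots[s] != EMPTY_SLOT) {
        if (strcmp(t->keys[t->slots[s]], key) == 0)
            return t->slots[s];
        s = (s + 1) % SLOT_COUNT;       /* collision: probe the next slot */
    }
    return -1;
}

/* Appends key to keys[] and records its index in the first free slot.
   No duplicate check is done in this sketch. */
static int toy_add(toy_table *t, const char *key)
{
    assert(t->count < SLOT_COUNT - 1);  /* always keep one empty slot */
    uint32_t s = toy_hash(key) % SLOT_COUNT;
    while (t->slots[s] != EMPTY_SLOT)
        s = (s + 1) % SLOT_COUNT;
    t->keys[t->count] = key;
    t->slots[s] = t->count;
    return t->count++;
}
```

Because the slots hold array indexes rather than the data itself, removing an item from `keys[]` shifts every later index by one.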
The hard thing is deletion:
you cannot just reset the slot
since the indexes have changed
In case of deletion, we may:
1. Re-compute the whole hash table
What mORMot did for years. Not too bad in practice.
2. Adjust the indexes
Brute force O(n) algorithm.
3. Use another algorithm
More complex, and usually stores the data.
On Deletion, Adjust the Indexes
Brute force O(n) algorithm
Seems simple, lean and efficient.
Let’s try deleting 1/128th of 200,000 items!
But not really fast on a huge count.
23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms
Why????
Branches Are Evil
Alt-F2: The Obvious Pascal → asm → CPU Flow
Zilog Z80
nostalgic sight:
“Why would I need more than
16KB RAM on my ZX81?”
Processors Learn to Predict Branches
Since Pentium 4
In case of misprediction,
execution pipelines need to be flushed
… just as if you needed to rewind a tape
… just as if JS needed to garbage collect
Each CPU Vendor and Architecture
changes the execution plan
and even introduced Artificial Intelligence
i.e. a CPU is a very complex beast
→ don’t trust the code, nor the asm!
Be your own CPU: Let’s Predict!
Branch 2 is always taken, branch 3 is taken except the last time,
and branch 1 is “randomly” taken… so not predictable.
(1, 2 and 3 label the branches in the code shown on the slide.)
Processors Learn to Predict Branches
Source:
https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/
Pseudo code:
while (howmany != 0) {
    val = random();
    if (val is an odd integer) {
        out[index] = val;
        index += 1;
    }
    howmany--;
}
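A runnable C rendering of this pseudo code might look as follows. The xorshift32 generator `next_random` is a hypothetical deterministic stand-in for `random()`, and `keep_odd` is an assumed name; this is a sketch, not Lemire's benchmark code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical deterministic generator (xorshift32) standing in for random(). */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Copies the odd random values into out[]; returns how many were stored. */
static size_t keep_odd(uint32_t *out, size_t howmany, uint32_t seed)
{
    assert(seed != 0);                 /* xorshift must not start at zero */
    uint32_t state = seed;
    size_t index = 0;
    while (howmany != 0) {
        uint32_t val = next_random(&state);
        if (val & 1) {                 /* the hard-to-predict branch */
            out[index] = val;
            index += 1;
        }
        howmany--;
    }
    return index;
}
```

With a fixed seed the sequence repeats exactly on every run, which is the situation where a predictor can learn the outcomes of a small problem.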
The more trials, the better prediction…
the CPU somehow learns from its mistakes!
Perfect prediction!
… but prediction has a depth
From Lemire:
“This perfect prediction on the AMD Rome
falls apart if you grow the problem
from 2000 to 10,000 values: the best
prediction goes from a 0.1% error rate
to a 33% error rate.”
From Lemire:
“You should probably avoid benchmarking
branchy code over small problems.”
That’s why I hate microbenchmarks!
And in the Delphi world, I have seen so many!
Branch Misprediction Hurts
if … then …
dec(P[i]) branch is taken or not taken evenly
in an unpredictable manner
(as random as the hash function itself)
Note: unrolling doesn’t help, by definition
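The branchy adjustment can be sketched in C like this, assuming P[] holds the 32-bit indexes stored in the hash table and `deleted` is the index of the removed item. The name `adjust_naive` is hypothetical, and the C form illustrates the logic only, not the mORMot source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* After removing item `deleted` from the array, every stored index
   above it must move down by one. */
static void adjust_naive(int32_t *P, size_t count, int32_t deleted)
{
    assert(P != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (P[i] > deleted)   /* outcome is as random as the hash itself */
            P[i]--;
    }
}
```

An optimizing C compiler may remove this branch on its own, so the sketch shows what the loop computes rather than how it will be timed.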
What about Going Parallel?
We could divide P[] into sections, and use threads
- it should scale up to how many CPU cores we have
- but we are in a low-level library, so threads are unavailable
- there should be a better way
Introducing a Branch-Less Loop
ord(P[count] > delete)
boolean-to-integer expression returns
either 0 (false) or 1 (true)
and has no branch
FACT: it is actually faster to execute
dec(P[count], 0);
than to handle a mispredicted branch…
(i.e. execute nothing)
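A C sketch of the same idea: in C a comparison already evaluates to 0 or 1, so it plays the role of ord(P[count] > delete). The function name `adjust_branchless` is an assumption for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch-less index adjustment: always subtract, either 1 or 0. */
static void adjust_branchless(int32_t *P, size_t count, int32_t deleted)
{
    assert(P != NULL || count == 0);
    while (count > 0) {                       /* well-predicted loop branch */
        count--;
        P[count] -= (P[count] > deleted);     /* comparison yields 1 or 0 */
    }
}
```

Every item is written back, even when nothing changes, which is the "dec(P[count], 0)" case from the slide.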
while count > 0 is very likely to loop
therefore easy to predict
(by static branch-prediction convention,
a backward jump is estimated as most probable)
ord(P[count] > delete)
compiles to very efficient asm
(branchless setl opcode)
Here, a little unrolling (slightly) helps…
since it avoids an unlikely count <= 0 condition/branch
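A possible unrolled variant, sketched in C. The unrolling factor of 4 and the name `adjust_unrolled` are arbitrary choices for this example, not necessarily what the slide used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch-less adjustment, four items per loop test. */
static void adjust_unrolled(int32_t *P, size_t count, int32_t deleted)
{
    assert(P != NULL || count == 0);
    while (count >= 4) {
        count -= 4;
        P[count]     -= (P[count]     > deleted);
        P[count + 1] -= (P[count + 1] > deleted);
        P[count + 2] -= (P[count + 2] > deleted);
        P[count + 3] -= (P[count + 3] > deleted);
    }
    while (count > 0) {                /* at most three remaining items */
        count--;
        P[count] -= (P[count] > deleted);
    }
}
```

The loop condition is now tested once per four items instead of once per item.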
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
We have almost 10X better performance,
in pure Pascal code!
SIMD Assembly: SSE2
Can SIMD Improve It Further?
SIMD = Single Instruction,
Multiple Data
• Data Alignment Restrictions
• Gathering/Scattering is Tricky
• Architecture Specific
• Not native to Delphi or FPC compilers
• Sometimes needs setup/clear
SSE2 SIMD Instructions
• Introduced by Intel in 2000 (Pentium 4)
• XMM0 to XMM7 Registers
in 32-bit mode
• XMM0 to XMM15
in x86_64 mode
• Each 128-bit XMM Register can handle
Two 64-bit Doubles or Integers
Four 32-bit Integers
Eight 16-bit or Sixteen 8-bit Integers
We need to SIMD the following code:
We can identify two blocks of 4 integers = 128 bits each
1. Prepare and Align the Input
Parameters: rcx=P edx=deleted r8=count
2. Processing Loop
3. Trailing Bytes
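The processing loop and the trailing part can be sketched with C intrinsics instead of hand-written asm. This sketch departs from step 1: it uses unaligned loads and stores rather than aligning the input first. The name `adjust_sse2` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* SSE2 index adjustment: four 32-bit integers per iteration. */
static void adjust_sse2(int32_t *P, size_t count, int32_t deleted)
{
    assert(P != NULL || count == 0);
    const __m128i limit = _mm_set1_epi32(deleted);  /* deleted in all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(P + i));
        __m128i mask = _mm_cmpgt_epi32(v, limit);   /* -1 where P[i] > deleted, else 0 */
        v = _mm_add_epi32(v, mask);                 /* adding -1 decrements the lane */
        _mm_storeu_si128((__m128i *)(P + i), v);
    }
    for (; i < count; i++)                          /* trailing items, branch-less */
        P[i] -= (P[i] > deleted);
}
```

The signed compare produces an all-ones mask (the value -1) in each lane that must be decremented, so one add replaces four conditional decrements.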
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
sse2 adjust=201.53ms 11.3GB/s
We expected X4
but we got a little less than X3
(pretty good, to be fair)
Help Needed?
https://www.agner.org/optimize/
The “Optimization Bible” (also per-CPU timing)
https://gcc.godbolt.org/
Check what best compilers do
https://www.felixcloutier.com/x86/
OpCode Reference Documentation
SIMD Assembly: AVX2
AVX2 SIMD Instructions
• AVX introduced in Sandy Bridge 2011
YMM 256-bit registers (floating point)
New VEX coding scheme
• AVX2 introduced in Haswell 2013
256-bit integer instructions on YMM registers
Fused Multiply-Add (FMA) ops (separate FMA3 flag, same generation)
• Each 256-bit YMM Register can handle
Four 64-bit Doubles or Integers
Eight 32-bit Integers
Sixteen 16-bit or Thirty-two 8-bit Integers
• Before using them:
Check the CPUID flag
Ensure the OS is AVX2-Aware
• AVX2 is Supported in FPC asm
• AVX2 is Not Supported in Delphi asm
SSE2 Processing Loop
New AVX2 Processing Loop
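An AVX2 loop with a runtime check, sketched in C for GCC or Clang on x86. `__builtin_cpu_supports` and the `target("avx2")` attribute are compiler-specific features, the function names are hypothetical, and the scalar fallback is an assumption made so the example runs on any x86 CPU.

```c
#include <assert.h>
#include <immintrin.h>   /* AVX2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* AVX2 index adjustment: eight 32-bit integers per iteration. */
__attribute__((target("avx2")))
static void adjust_avx2(int32_t *P, size_t count, int32_t deleted)
{
    const __m256i limit = _mm256_set1_epi32(deleted);
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256i v    = _mm256_loadu_si256((const __m256i *)(P + i));
        __m256i mask = _mm256_cmpgt_epi32(v, limit);   /* -1 where P[i] > deleted */
        v = _mm256_add_epi32(v, mask);                 /* adding -1 decrements */
        _mm256_storeu_si256((__m256i *)(P + i), v);
    }
    for (; i < count; i++)                             /* trailing items */
        P[i] -= (P[i] > deleted);
}

/* Branch-less scalar fallback for CPUs without AVX2. */
static void adjust_scalar(int32_t *P, size_t count, int32_t deleted)
{
    for (size_t i = 0; i < count; i++)
        P[i] -= (P[i] > deleted);
}

/* Check the CPU at runtime before touching any AVX2 instruction. */
static void adjust_dispatch(int32_t *P, size_t count, int32_t deleted)
{
    assert(P != NULL || count == 0);
    if (__builtin_cpu_supports("avx2"))
        adjust_avx2(P, count, deleted);
    else
        adjust_scalar(P, count, deleted);
}
```

The body is the SSE2 idea with registers twice as wide; only the dispatch step is new.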
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
sse2 adjust=201.53ms 11.3GB/s
avx2 adjust=161.73ms 14.1GB/s
We got only 30% better numbers
→ We saturated the CPU bandwidth
Conclusion
• On Deletion, TDynArrayHasher
is not a bottleneck any more
• The TDynArray.Delete data move
takes most time now
• We have a nice pure-pascal version
• Branches are Evil
• Never Trust Micro Benchmarks
• Unrolling is no magic
• Branchless is magic: 10 X faster
• SIMD is worth it if really needed
for another 3 X boost
From Delphi to AVX2
Questions?
No Marmots Were Harmed in the Making of This Session

HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 

Dernier (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Ekon24 from Delphi to AVX2

  • 1. Arnaud Bouchez - Synopse Rewrite for Performance From Delphi to AVX2
  • 2. Welcome to a fun/wakeup session about performance, hashes, and assembly mystery
  • 3. Arnaud Bouchez • Open Source Founder: mORMot, SynPDF • Delphi and FPC expert: DDD, SOA, ORM, MVC, Performance, SOLID • Synopse Consulting https://synopse.info
  • 4. Menu du jour • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 5. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 7. The Hash-Table Mystery mORMot is Fast and always tries to be faster
  • 8. The Hash-Table Mystery mORMot is Fast and always tries to be faster, so it works hard for it
  • 9. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array
  • 10. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array <> a hashed list (it does not own the data)
  • 11. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array Used e.g. by the TDynArray wrapper the TSynDictionary class the in-memory ORM engine
  • 12. The Hash-Table Mystery How does a Hash-Table work? bucketindex := hash(key) mod bucketscount for O(1) retrieval instead of O(n) manual lookup
  • 13. The Hash-Table Mystery How does a Hash-Table work? crc32c() (hardware accelerated SSE4.2)
  • 14. The Hash-Table Mystery How does a Hash-Table work? xxhash32() (on non-Intel or old CPUs)
  • 15. The Hash-Table Mystery How does a Hash-Table work? mORMot prefers indexes for efficiency (and doesn’t store the hashcode, since crc32c is fast)
  • 16. The Hash-Table Mystery How does a Hash-Table work? mORMot stores keys with values within a (dynamic) array
  • 17. The Hash-Table Mystery How does a Hash-Table work? mORMot can hash several keys in the same (dynamic) array
  • 18. The Hash-Table Mystery How does a Hash-Table work? It is easy to insert a new item
  • 19. The Hash-Table Mystery How does a Hash-Table work? It is easy to insert a new item if we properly handle hash collisions
  • 20. The Hash-Table Mystery How does a Hash-Table work? The Hard Thing is Deletion: you cannot just reset the slot, since the indexes changed
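As a rough illustration of slides 12 to 20, here is a minimal C sketch of a hash table that stores only indexes into an external array, which is how the slides describe TDynArrayHasher. Everything named here is an assumption made for the example and not mORMot's actual code: `index_table`, `table_add`, `table_find`, the FNV-1a `hash_key` standing in for crc32c(), and linear probing as the collision strategy. C is used instead of the session's Pascal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_EMPTY (-1)

/* Hypothetical stand-in for crc32c()/xxhash32(): 32-bit FNV-1a. */
static uint32_t hash_key(const char *key) {
    uint32_t h = 2166136261u;
    while (*key) {
        h ^= (unsigned char)*key++;
        h *= 16777619u;
    }
    return h;
}

/* The table does not own the data: it only stores indexes into keys[]. */
typedef struct {
    const char **keys;  /* external array of items (keys only, for brevity) */
    int32_t *slots;     /* slots[b] = index into keys[], or SLOT_EMPTY */
    uint32_t nslots;    /* number of buckets */
} index_table;

/* Returns the index in keys[] of `key`, or -1 if it is absent. */
static int32_t table_find(const index_table *t, const char *key) {
    assert(t->nslots > 0);
    uint32_t b = hash_key(key) % t->nslots;  /* hash(key) mod bucketscount */
    while (t->slots[b] != SLOT_EMPTY) {
        if (strcmp(t->keys[t->slots[b]], key) == 0)
            return t->slots[b];
        b = (b + 1) % t->nslots;             /* collision: probe the next slot */
    }
    return -1;
}

/* Registers keys[index]; the caller must keep at least one slot empty. */
static void table_add(index_table *t, int32_t index) {
    assert(index >= 0);
    uint32_t b = hash_key(t->keys[index]) % t->nslots;
    while (t->slots[b] != SLOT_EMPTY)
        b = (b + 1) % t->nslots;
    t->slots[b] = index;
}
```

Because slots[] holds positions in keys[], removing keys[i] from the array shifts every later item down by one, so every stored index above i becomes stale.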
  • 21. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table 2. Adjust the indexes 3. Use another algorithm
  • 22. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table: what mORMot did for years, not too bad in practice. 2. Adjust the indexes: brute force O(n) algorithm. 3. Use another algorithm: more complex, and usually stores the data.
  • 25. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm Seems simple, lean and efficient.
  • 26. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm Seems simple, lean and efficient. Let’s try deleting 1/128th of 200,000 items!
  • 27. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm But not really fast on a huge count: 23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms Why????
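The brute-force adjustment discussed above might look like the following C sketch. The name `adjust_naive` and the parameter layout are assumptions for illustration (the original slide showed Pascal code that did not survive as text): P is the array of stored indexes, and `deleted` is the position of the item just removed from the data array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* After item `deleted` is removed from the data array, every stored index
   above it must shift down by one. Empty slots (-1 here) are never above
   `deleted`, so they are left alone. */
static void adjust_naive(int32_t *P, ptrdiff_t count, int32_t deleted) {
    assert(count >= 0);
    for (ptrdiff_t i = 0; i < count; i++) {
        if (P[i] > deleted)  /* taken or not depending on hashed data */
            P[i]--;
    }
}
```

An optimizing C compiler may well rewrite this loop without the branch, so the timings quoted in the slides should not be expected from this sketch as-is.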
  • 28. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 29. Branches Are Evil Alt-F2: The Obvious Pascal → asm → CPU Flow
  • 31. Branches Are Evil Zilog Z80 nostalgic sight: “Why would I need more than 16KB RAM on my ZX81?”
  • 33. Branches Are Evil Processors Learn to Predict Branches Since Pentium 4. In case of misprediction, execution pipelines need to be flushed… just as if you needed to rewind a tape
  • 34. Branches Are Evil Processors Learn to Predict Branches Since Pentium 4. In case of misprediction, execution pipelines need to be flushed… just as if JS needed to garbage collect
  • 35. Branches Are Evil Processors Learn to Predict Branches Each CPU Vendor and Architecture changes the execution plan, and even introduced Artificial Intelligence, i.e. a CPU is a very complex beast → don’t trust the code, nor the asm!
  • 36. Branches Are Evil Be your own CPU: Let’s Predict!
  • 37. Branches Are Evil Branch 2 is always taken, branch 3 is taken every time but the last, and branch 1 is “randomly” taken… so not predictable (branches are labelled 1, 2 and 3 on the slide’s code)
  • 38. Branches Are Evil Processors Learn to Predict Branches Source: https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/
  • 39. Branches Are Evil Processors Learn to Predict Branches Pseudo code: while (howmany != 0) { val = random(); if( val is an odd integer ) { out[index] = val; index += 1; } howmany--; }
  • 40. Branches Are Evil Processors Learn to Predict Branches The more trials, the better prediction… the CPU somehow learns from its mistakes!
  • 41. Branches Are Evil Processors Learn to Predict Branches
  • 42. Branches Are Evil Processors Learn to Predict Branches Perfect prediction!
  • 43. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth. From Lemire: “This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.”
  • 44. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth. From Lemire: “You should probably avoid benchmarking branchy code over small problems.”
  • 45. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth. From Lemire: “You should probably avoid benchmarking branchy code over small problems.” That’s why I hate microbenchmarks! And in the Delphi world, I have seen so many!
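For readers who want to try the experiment quoted from Lemire, the pseudo code of slide 39 could be turned into compilable C roughly as follows. The xorshift32 generator `next_random` and the function name `keep_odd` are assumptions standing in for the unspecified random() and for the surrounding benchmark harness, which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small deterministic generator (xorshift32), a stand-in for random(). */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Draws `howmany` values, copies the odd ones to out[] and returns how
   many were kept. out[] must have room for `howmany` values. */
static size_t keep_odd(uint32_t *out, size_t howmany, uint32_t seed) {
    assert(out != NULL);
    uint32_t state = seed ? seed : 1u;  /* xorshift must not start at 0 */
    size_t index = 0;
    while (howmany != 0) {
        uint32_t val = next_random(&state);
        if (val & 1u) {                  /* the hard-to-predict branch */
            out[index] = val;
            index += 1;
        }
        howmany--;
    }
    return index;
}
```

Replaying the same seed over a small `howmany` is what lets the predictor memorize the sequence, which is the effect the quoted error rates describe.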
  • 46. Branches Are Evil Branch Misprediction Hurts The if … then dec(P[i]) branch is taken or not taken evenly, in an unpredictable manner (as random as the hash function itself)
  • 47. Branches Are Evil Branch Misprediction Hurts The if … then dec(P[i]) branch is taken or not taken evenly, in an unpredictable manner. Note: unrolling doesn’t help, by definition
  • 48. Branches Are Evil What about Going Parallel? We could divide P[] into sections, and use threads - it should scale up to how many CPU cores we have - but we are in a low-level library, so threads are unavailable - there should be a better way
  • 49. Branches Are Evil Introducing a Branch-Less Loop
  • 50. Branches Are Evil Introducing a Branch-Less Loop ord(P[count] > delete) boolean-to-integer expression returns either 0 (false) or 1 (true) and has no branch
  • 51. Branches Are Evil Introducing a Branch-Less Loop FACT: it is actually faster to execute dec(P[count], 0); than to handle a mispredicted branch… (i.e. execute nothing)
  • 52. Branches Are Evil Introducing a Branch-Less Loop while count > 0 is very likely to loop, therefore easy to predict (by CPU convention, a backward, i.e. “upper”, jump is estimated as most probable)
  • 53. Branches Are Evil Introducing a Branch-Less Loop ord(P[count] > delete) compiles to very efficient asm (branchless setl opcode)
  • 54. Branches Are Evil Introducing a Branch-Less Loop Here, a little unrolling (slightly) helps… since it avoids an unlikely count <= 0 condition/branch
  • 55. Branches Are Evil Numbers Are Talking: naïve if adjust=4.27s 548.6MB/s; branchless adjust=520.85ms 4.3GB/s. We have almost 10X better performance, in pure pascal code!
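The branch-less idea of slides 49 to 54 can be sketched in C, where a comparison already evaluates to 0 or 1, just like ord(P[count] > delete) in Pascal. The function name `adjust_branchless` and the unroll factor of 4 are assumptions for illustration; the session's own Pascal loop was shown as an image and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same effect as the if-based loop, but the decrement is unconditional:
   each item is reduced by 0 or 1 according to the comparison result. */
static void adjust_branchless(int32_t *P, ptrdiff_t count, int32_t deleted) {
    assert(count >= 0);
    ptrdiff_t i = 0;
    /* A little unrolling: fewer loop-condition checks per item. */
    for (; i + 4 <= count; i += 4) {
        P[i]     -= (P[i]     > deleted);
        P[i + 1] -= (P[i + 1] > deleted);
        P[i + 2] -= (P[i + 2] > deleted);
        P[i + 3] -= (P[i + 3] > deleted);
    }
    for (; i < count; i++)  /* remaining 0 to 3 items */
        P[i] -= (P[i] > deleted);
}
```

The only branches left are the loop conditions, which are almost always taken and therefore cheap to predict.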
  • 56. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 57. SIMD Assembly: SSE2 Can SIMD Improve It Further? SIMD = Single Instruction, Multiple Data
  • 58. SIMD Assembly: SSE2 Can SIMD Improve It Further? • Data Alignment Restrictions • Gathering/Scattering is Tricky • Architecture Specific • Not native to Delphi or FPC compilers • Sometimes needs setup/clear
  • 59. SIMD Assembly: SSE2 SSE2 SIMD Instructions • Introduced by Intel in 2000 (Pentium 4) • XMM0 to XMM7 Registers in 32-bit mode • XMM0 to XMM15 in x86_64 mode
  • 60. SIMD Assembly: SSE2 SSE2 SIMD Instructions • Each 128-bit XMM Register can handle Two 64-bit Doubles or Integers Four 32-bit Integers Eight 16-bit or Sixteen 8-bit Integers
  • 61. SIMD Assembly: SSE2 SSE2 SIMD Instructions
  • 62. SIMD Assembly: SSE2 We need to SIMD the following code:
  • 63. SIMD Assembly: SSE2 We need to SIMD the following code: We can identify two 4-integers = 128-bit blocks
  • 64. SIMD Assembly: SSE2 1. Prepare and Align the Input Parameters: rcx=P edx=deleted r8=count
  • 65. SIMD Assembly: SSE2 2. Processing Loop
  • 66. SIMD Assembly: SSE2 3. Trailing Bytes
  • 67. SIMD Assembly: SSE2 Numbers Are Talking: naïve if adjust=4.27s 548.6MB/s; branchless adjust=520.85ms 4.3GB/s; sse2 adjust=201.53ms 11.3GB/s. We expected X4 but we got a little less than X3 (pretty good, to be fair)
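The asm listings of slides 64 to 66 did not survive as text, so here is a hedged C approximation using SSE2 intrinsics instead of hand-written assembly. It follows the same outline (broadcast `deleted`, process four integers per iteration, then finish the trailing items), with two simplifying assumptions: unaligned loads and stores replace the alignment step, and the name `adjust_sse2` is made up for the example. A packed signed compare yields -1 in each lane where P[i] > deleted, so adding that mask performs the decrement without any branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

static void adjust_sse2(int32_t *P, ptrdiff_t count, int32_t deleted) {
    assert(count >= 0);
    ptrdiff_t i = 0;
#ifdef HAVE_SSE2
    /* 1. Prepare: copy `deleted` into the four 32-bit lanes of a register. */
    const __m128i del = _mm_set1_epi32(deleted);
    /* 2. Processing loop: four integers per iteration. */
    for (; i + 4 <= count; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(P + i));
        __m128i gt = _mm_cmpgt_epi32(v, del);  /* -1 where P[i] > deleted */
        v = _mm_add_epi32(v, gt);              /* adding -1 decrements */
        _mm_storeu_si128((__m128i *)(P + i), v);
    }
#endif
    /* 3. Trailing items (or the whole array when SSE2 is unavailable). */
    for (; i < count; i++)
        P[i] -= (P[i] > deleted);
}
```

Inspecting what a compiler generates for this function is one way to compare against a hand-written loop.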
  • 68. SIMD Assembly: SSE2 Help Needed? https://www.agner.org/optimize/ : the “Optimization Bible” (also per-CPU timing); https://gcc.godbolt.org/ : check what the best compilers do; https://www.felixcloutier.com/x86/ : OpCode reference documentation
  • 69. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 70. SIMD Assembly: AVX2 AVX2 SIMD Instructions • AVX introduced in Sandy Bridge (2011): new VEX coding scheme, also for 128-bit instructions, and 256-bit YMM registers for floating point • AVX2 introduced in Haswell (2013): 256-bit integer operations on YMM registers, shipped alongside Fused Multiply-Add (FMA) ops
  • 71. SIMD Assembly: AVX2 AVX2 SIMD Instructions • Each 256-bit YMM Register can handle Four 64-bit Doubles or Integers Eight 32-bit Integers Sixteen 16-bit or Thirty-two 8-bit Integers
  • 72. SIMD Assembly: AVX2 AVX2 SIMD Instructions • Before using them: Check the CPUID flag Ensure the OS is AVX2-Aware • AVX2 is Supported in FPC asm • AVX2 is Not Supported in Delphi asm
  • 73. SIMD Assembly: AVX2 SSE2 Processing Loop
  • 74. SIMD Assembly: AVX2 New AVX2 Processing Loop
  • 75. SIMD Assembly: AVX2 Numbers Are Talking: naïve if adjust=4.27s 548.6MB/s; branchless adjust=520.85ms 4.3GB/s; sse2 adjust=201.53ms 11.3GB/s; avx2 adjust=161.73ms 14.1GB/s. We got only 30% better numbers. We saturated the CPU bandwidth.
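The AVX2 step, including the "check before use" advice of slide 72, might be sketched in C as below. This is an assumption-laden illustration, not the session's asm: it relies on the GCC/Clang extensions `__attribute__((target("avx2")))` and `__builtin_cpu_supports("avx2")` for the runtime check, uses unaligned loads, and the names `adjust_avx2` and `adjust_avx2_blocks` are hypothetical. On other compilers or CPUs it falls back to the scalar branch-less loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* Compiled for AVX2 regardless of global flags; only call it after the
   runtime check below. Returns how many items were processed. */
__attribute__((target("avx2")))
static ptrdiff_t adjust_avx2_blocks(int32_t *P, ptrdiff_t count,
                                    int32_t deleted) {
    const __m256i del = _mm256_set1_epi32(deleted);  /* 8 lanes */
    ptrdiff_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(P + i));
        __m256i gt = _mm256_cmpgt_epi32(v, del);     /* -1 where greater */
        _mm256_storeu_si256((__m256i *)(P + i), _mm256_add_epi32(v, gt));
    }
    return i;
}
#endif

static void adjust_avx2(int32_t *P, ptrdiff_t count, int32_t deleted) {
    assert(count >= 0);
    ptrdiff_t i = 0;
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2"))  /* runtime AVX2 availability check */
        i = adjust_avx2_blocks(P, count, deleted);
#endif
    for (; i < count; i++)               /* trailing items or fallback */
        P[i] -= (P[i] > deleted);
}
```

The loop body is the SSE2 one with registers twice as wide, which fits the modest gain reported above once memory throughput becomes the limit.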
  • 76. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 77. Conclusion • On Deletion, TDynArrayHasher is not a bottleneck any more • The TDynArray.Delete data move takes most time now • We have a nice pure-pascal version
  • 78. Conclusion • Branches are Evil • Never Trust Micro Benchmarks • Unrolling is no magic • Branchless is magic: 10 X faster • SIMD is worth it if really needed for another 3 X boost
  • 79. From Delphi to AVX2 Questions? No Marmots Were Harmed in the Making of This Session