Using Graphics Cards to Break Passwords
Andrey Belenko
a.belenko@elcomsoft.com

Why use GPUs?
Core i7 die layout
[Figure: Core i7 die photo. Labeled regions: six cores, two L3 cache blocks, queue, memory controller, IO & QPI. Transistor count: 1.17B]
[Figure: close-up of a single core. Labeled units: branch prediction, fetch & L1, paging, L2, decode & μ-code, memory L1, scheduler, execution]
• CPU dedicates 1/10 of resources to calculations (10% vs. 90% for everything else)
GTX 480 die layout
[Figure: GTX 480 die photo. Transistor count: 3B]
• GPU dedicates 1/3 of resources to calculations (30% vs. 70% for everything else)
• 2.5x more transistors than CPU
• 7x more computing power overall (roughly 2.5x the transistors times 3x the share spent on calculations)
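PBKDF2-SHA1 with 2000 iterations
[Bar chart, horizontal axis 0K to 200K; the unit is not stated on the slide, presumably passwords per second]

Device     Rate
i7-970     15.5K
GTX 480    60K
GTX 580    68K
HD 5970    195K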
How to use GPUs?
Basics
• GPUs are SIMD and excel at data-parallel tasks
• A program for the GPU is called a ‘kernel’
• A kernel runs in instances called threads
• Hardware takes care of thread scheduling
• A typical GPU has hundreds of processors
• Thousands of threads are needed to fully utilize a GPU
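Example
                  C = A + B

Kernel:
// Each thread adds one pair of elements.
void sum (int c[], int a[], int b[]) {
  int Index = getThreadId();   // pseudocode: threadIdx.x in CUDA, get_global_id(0) in OpenCL
  c[Index] = a[Index] + b[Index];
}

Adding vectors:
int A[10], B[10], C[10];
sum<<10>> (C, A, B);           // pseudocode launch: 10 threads, one element each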
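Example
                       MD5
 Kernel:
 // Each thread hashes its own input block into its own output slot.
 void md5 (uint8 *dataIn, uint8 *dataOut) {
   int Index = getThreadId();
   uint8 *in = dataIn + MD5_BLOCK_SIZE * Index;    // this thread's input block
   uint8 *out = dataOut + MD5_HASH_SIZE * Index;   // this thread's output hash
   MD5( out, in, MD5_BLOCK_SIZE );
 }

Computing hashes:
uint8 Src[10 * MD5_BLOCK_SIZE];
uint8 Dst[10 * MD5_HASH_SIZE];
md5<<10>> (Src, Dst);   // pseudocode launch: 10 threads, one hash each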
GPU Computing Stack

High-level Language
   ↓ translation, no optimizations
Intermediate Language
   ↓ optimization goes here
ISA
   ↓
GPU Hardware
GPU Computing Stack
GPU world is bipolar

       NVIDIA               ATI
HLL    CUDA C, OpenCL       OpenCL
IL     PTX                  IL
ISA    Not documented       Documented for RV700 (48xx)
HW     G80 (8xxx) and up    RV670 (38xx) and up
Breaking passwords
the CPU way

Generate password → H(p) → Verify hash

Computing H(p) takes the most time, so offload it to the GPU
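
The three stages on this slide can be sketched in plain C as a single loop. This is only an illustration: `toy_hash` (32-bit FNV-1a) stands in for a real H(p) such as MD5 or SHA-1, and the alphabet, the fixed length of 3 and the function names are assumptions made for the example, not part of the talk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PW_LEN 3                       /* assumed fixed candidate length */
#define ALPHABET_SIZE 26
static const char ALPHABET[] = "abcdefghijklmnopqrstuvwxyz";

/* Stand-in for H(p): 32-bit FNV-1a. A real cracker would use MD5, SHA-1, etc. */
static uint32_t toy_hash(const char *p, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)p[i];
        h *= 16777619u;
    }
    return h;
}

/* Generate -> H(p) -> verify, one candidate at a time.
 * On a match, copies the candidate into out and returns 1; otherwise returns 0. */
static int crack_cpu(uint32_t target, char out[PW_LEN + 1]) {
    char p[PW_LEN + 1] = {0};
    for (int a = 0; a < ALPHABET_SIZE; a++)
        for (int b = 0; b < ALPHABET_SIZE; b++)
            for (int c = 0; c < ALPHABET_SIZE; c++) {
                p[0] = ALPHABET[a];                    /* generate */
                p[1] = ALPHABET[b];
                p[2] = ALPHABET[c];
                if (toy_hash(p, PW_LEN) == target) {   /* H(p), then verify */
                    memcpy(out, p, PW_LEN + 1);
                    return 1;
                }
            }
    return 0;
}
```

Every iteration is independent of the others, which is what makes the H(p) step a natural fit for thousands of GPU threads.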
Breaking passwords
the GPU way

CPU: Generate passwords → GPU: H(p), H(p), ..., H(p) in parallel → CPU: Verify hashes
Breaking passwords
the GPU way

CPU: Generate passwords → GPU: H(p) → CPU: Verify hashes

• If H(p) is fast, PCIe data transfers are the bottleneck
  • E.g. if H(p) is SHA-1, theoretical peak is ~200M p/s
Solution is to offload everything to GPU
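
The slides do not say how the ~200M p/s figure was derived. A back-of-the-envelope bound shows one set of numbers that lands there. Both inputs are assumptions, not taken from the talk: about 8 GB/s of bus bandwidth (the nominal per-direction rate of a PCIe 2.0 x16 link) and about 40 bytes transferred per candidate. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on candidates per second when every candidate must cross the bus.
 * bytes_per_second:    usable bus bandwidth (assumption supplied by the caller)
 * bytes_per_candidate: data moved per password (assumption supplied by the caller) */
uint64_t pcie_bound(uint64_t bytes_per_second, uint64_t bytes_per_candidate) {
    assert(bytes_per_candidate > 0);
    return bytes_per_second / bytes_per_candidate;
}
```

With these inputs, `pcie_bound(8000000000ULL, 40)` gives 200,000,000 candidates per second, no matter how fast the GPU computes H(p).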
Breaking passwords
the GPU way

GPU: Generate passwords → GPU: H(p) → GPU: Verify hashes
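
One common way to generate passwords on the GPU is to let each thread derive its own candidate from its index, so no candidate has to be sent from the CPU. Below is a minimal sketch of that mapping, written in plain C so it can be tried without a GPU. The alphabet, the function name `index_to_password` and the least-significant-character-first ordering are assumptions for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CHARSET_SIZE 26
static const char CHARSET[] = "abcdefghijklmnopqrstuvwxyz";

/* Map a candidate index to a fixed-length password by writing the index
 * in base CHARSET_SIZE, least significant character first.
 * out must have room for len + 1 bytes.
 * On a GPU, index would be derived from the thread id. */
void index_to_password(uint64_t index, char *out, size_t len) {
    for (size_t i = 0; i < len; i++) {
        out[i] = CHARSET[index % CHARSET_SIZE];
        index /= CHARSET_SIZE;
    }
    out[len] = '\0';
}
```

For length 3, index 0 maps to "aaa", index 1 to "baa" and index 17575 (26^3 - 1) to "zzz", so consecutive thread ids cover the keyspace without gaps or overlap.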
How to use GPUs?
  Implementation considerations
Choosing language
CUDA C vs. PTX

• C code translates into PTX without optimizations
• Optimization is done when compiling PTX
• Intrinsics give access to device-specific instructions
No real reason for developing in PTX
Choosing language
OpenCL

• Portability requires compilation at runtime
   • May take significant time and resources
   • Compiler is part of driver ➯ testing hell
   • Requires source code in HLL ➯ IP issues
• Implementations are not complete and vary across vendors
Not mature enough
Choosing language
ATI IL

• The only viable option if you love your users
   • Access to device-specific instructions
   • Best performance
• Not an option if you love your developers
   • Poor documentation, poor samples
   • Meaningless compiler errors, no debugger
Achieving performance
• Minimize data transfers
• Minimize memory accesses
  • Or at least plan them carefully
• Minimize number of registers used
  • Fewer registers per thread means more threads can run simultaneously
• Schedule enough threads to keep GPU processors busy
• Avoid thread divergence
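
The register point can be made concrete with a rough calculation. The sketch below assumes Fermi-class limits of 32,768 registers and at most 1,536 resident threads per multiprocessor. It ignores allocation granularity, shared memory and block-size limits, so real occupancy may be lower. The function name is hypothetical.

```c
#include <assert.h>

#define REGS_PER_SM        32768  /* assumed: Fermi-class register file size */
#define MAX_THREADS_PER_SM 1536   /* assumed: Fermi-class resident thread limit */

/* Rough upper bound on threads resident on one multiprocessor
 * when each thread needs regs_per_thread registers. */
int max_resident_threads(int regs_per_thread) {
    assert(regs_per_thread > 0);
    int by_registers = REGS_PER_SM / regs_per_thread;
    return by_registers < MAX_THREADS_PER_SM ? by_registers : MAX_THREADS_PER_SM;
}
```

Under these assumptions a kernel using 20 registers per thread is capped by the thread limit (1,536), one using 32 registers fits 1,024 threads, and one using 63 registers fits only 520.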
Porting crypto to GPU
• Usually pretty straightforward
 • MD5, SHA-1 and the like require little to no changes
• Can be tricky sometimes
 • RC4 requires many memory accesses, so careful layout is needed
 • DES requires table lookups, which are very expensive
Porting crypto to GPU
DES

• Table lookups (S-boxes) are the bottleneck
• Avoid them by using bitslicing
 • S-boxes replaced with logic functions
 • 32 encryptions in parallel
 • Requires many registers
 • Performance depends on compiler heuristics
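
To illustrate the idea without reproducing DES itself, the sketch below bitslices a toy 3-bit-to-1-bit lookup table (the majority function). It is not a DES S-box, and the names are made up for the example. Each `uint32_t` holds the same input bit of 32 independent instances, so three AND and two OR operations evaluate all 32 lookups at once.

```c
#include <stdint.h>

/* Toy table: index is (a << 2) | (b << 1) | c, value is majority(a, b, c). */
static const uint8_t TOY_SBOX[8] = {0, 0, 0, 1, 0, 1, 1, 1};

/* Table-lookup version: one instance per call. */
uint8_t toy_sbox_lookup(uint8_t a, uint8_t b, uint8_t c) {
    return TOY_SBOX[(a << 2) | (b << 1) | c];
}

/* Bitsliced version: bit i of a, b and c belongs to instance i,
 * so one call evaluates 32 instances using only logic operations. */
uint32_t toy_sbox_bitsliced(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}
```

A real DES S-box maps 6 bits to 4 bits and needs a much larger gate network, which is where the register pressure mentioned above comes from.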
How to use GPUs?
    Real-world problems
Scalability
Not all GPUs are created equal

1. Program should scale nicely with the number of processors on the GPU
 • Query processor count from the driver
 • Partition task accordingly
    numThreads = F(numProcessors)
 • Also helps to avoid triggering the watchdog and freezing the screen
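
One possible shape for F, as a sketch only: scale the thread count with the processor count, and split the whole job into launches small enough to hand control back to the display driver regularly. The multiplier below is a hypothetical tuning value, not a figure from the talk, and the function names are made up. On CUDA the processor count could come from `cudaGetDeviceProperties` (field `multiProcessorCount`).

```c
#include <assert.h>
#include <stdint.h>

#define THREADS_PER_PROCESSOR 1024   /* hypothetical tuning value */

/* numThreads = F(numProcessors): here simply proportional. */
uint64_t threads_for(uint32_t num_processors) {
    return (uint64_t)num_processors * THREADS_PER_PROCESSOR;
}

/* Number of kernel launches needed to cover total_candidates when each
 * thread tests candidates_per_thread passwords per launch. A smaller
 * candidates_per_thread means shorter launches and more of them. */
uint64_t launches_needed(uint64_t total_candidates, uint32_t num_processors,
                         uint32_t candidates_per_thread) {
    assert(num_processors > 0 && candidates_per_thread > 0);
    uint64_t per_launch = threads_for(num_processors) * candidates_per_thread;
    return (total_candidates + per_launch - 1) / per_launch;  /* round up */
}
```

With 15 processors and 16 candidates per thread, one launch covers 245,760 candidates, so a job of 1,000,000 candidates takes 5 launches.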
Scalability
8 GPUs in a system is not uncommon

2. Program should scale nicely with the number of GPUs
 • Query device count from the driver
 • Spawn CPU threads to control each device
 • Partition task accordingly
Speedup should be linear unless you hit PCIe limits
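
Partitioning across devices can be as simple as giving each GPU a contiguous slice of the candidate index range. The sketch below shows only that arithmetic. The struct and function names are hypothetical, and a real program would obtain the device count from the driver (for example `cudaGetDeviceCount` on CUDA) and run one CPU thread per slice. It splits evenly, which suits identical GPUs; a mixed set of GPUs would need weighted slices.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t first;  /* first candidate index for this device */
    uint64_t count;  /* number of candidates for this device */
} Slice;

/* Split [0, total) into num_devices contiguous slices whose sizes differ
 * by at most one; the first (total % num_devices) devices get the extra. */
Slice slice_for_device(uint64_t total, uint32_t num_devices, uint32_t device) {
    assert(num_devices > 0 && device < num_devices);
    uint64_t base = total / num_devices;
    uint64_t extra = total % num_devices;
    Slice s;
    s.count = base + (device < extra ? 1 : 0);
    s.first = device * base + (device < extra ? device : extra);
    return s;
}
```

For 10 candidates on 3 devices this yields slices starting at 0, 4 and 7 with 4, 3 and 3 candidates respectively.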
Compatibility
Not everyone’s got Fermi. Yet.

• New hardware offers great new features
   • Cache on Fermi
   • bitalign instruction on RV770
• May require different optimization strategy
• May require separate codebase
• Support for legacy hardware shouldn’t be dropped
Be prepared to handle this sort of complexity
Including GPU code
Option 1: include PTX/IL code in your program

Pros                       Cons
• Recommended way          • Compilation at runtime
• Forward compatibility    • Can’t test all hardware
• No hardware required     • IP issues
Including GPU code
Option 2: include pre-compiled GPU binaries

Pros                                 Cons
• No dependency on users’ driver     • May not work with future devices
• No compilation at runtime          • Need to precompile for every supported GPU
• Better IP protection               • No precompiled binary for GPU = no support
Questions?
Thank you