Auto Tuning
Hemanth and Siddharth
     UT Austin
Basics
What is Auto Tuning?
● Several Definitions
   ○ First result on Wikipedia - "Auto-Tune is an audio
     processor created by Antares Audio Technologies
     "


● A Definition
  ○ Autotuning is an automatic process for selecting one
      out of several possible solutions to a computational
      problem.


● Techniques used by:
   ○ Library generators, Compilers and Runtime systems
Possible Versions of a Solution
● The solutions may differ in the
  ○ algorithm (quicksort vs selection sort)
  ○ implementation (loop unroll).

● The versions may result from
  ○ transformations (unroll, tile, interchange)

● The versions could be generated by
  ○ programmer manually (coding or directives)
   ○ compiler automatically
Motivation
■ Increasing diversity of computing platforms
■ New influences on the execution of parallel
  applications
  ○ Hierarchical structure
  ○ Heterogeneity of the processors
■ Design efficient software that takes full
  advantage of such systems
■ Solving a target problem by using a single
  algorithm is not always efficient everywhere
First Ideas
● Poly-Algorithms
    ○   (1969) John Rice (Purdue): "A polyalgorithm for the automatic
        solution of nonlinear equations"


●   Profiling and feedback assisted compilation
    ○   (1982) S. Graham et al.: gprof
    ○   (1991) P. Chang et al.: "Using profile information to assist classic
        code optimizations"


●   Code generation
    ○   (1989) J. Johnson et al.: “A methodology for designing, modifying,
        and implementing Fourier Transform algorithms on various
        architectures.”
    ○   (1992) M. Covell et al.: “Computer-aided algorithm design and
        arrangement”
Context: High Performance Libraries
● Linear Algebra
   ○ BLAS, LAPACK, ScaLAPACK
● Signal/Image Processing
  ○ Vector Signal Image Processing Library (VSIPL)
● Distributed/Parallel Systems
  ○ Message Passing Interface (MPI)
● Can we implement libraries:
  ○ Automatically and Portably
  ○ Incorporating platform-specific features
  ○ matching performance of hand-tuned
     implementations leveraging compiler technology
   ○ using domain-specific knowledge
AutoTuning
● Two-phase scheme for producing automatically
  tuned code

● Given: Program; inputs; machine

● Step 1: Identify and generate a space of
  candidate implementations

● Step 2: Select the fastest one using empirical
  modeling and/or automated experiments
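
A minimal sketch of this two-phase scheme in C (the parameter space, the generate_candidate and benchmark helpers, and the dummy cost are all illustrative, not taken from any of the projects discussed below):

/* Minimal sketch of the two-phase scheme (hypothetical, not code from
 * any real autotuner): Step 1 enumerates a space of candidate
 * implementations, Step 2 runs each one and keeps the fastest.        */
#include <stdio.h>
#include <float.h>

typedef struct { int unroll; int block; } Params;

/* Stand-ins for "emit/compile one version" and "run it and time it";
 * a real tuner generates C code, compiles it, and measures wall time. */
static void generate_candidate(Params p) { (void)p; }
static double benchmark(Params p) {
    return 1.0 / (p.unroll * p.block);   /* dummy cost so the sketch runs */
}

int main(void) {
    Params best = {1, 16};
    double best_time = DBL_MAX;
    for (int unroll = 1; unroll <= 8; unroll *= 2)       /* Step 1 */
        for (int block = 16; block <= 128; block *= 2) {
            Params p = { unroll, block };
            generate_candidate(p);
            double t = benchmark(p);                     /* Step 2 */
            if (t < best_time) { best_time = t; best = p; }
        }
    printf("best: unroll=%d, block=%d\n", best.unroll, best.block);
    return 0;
}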
Why not let the compiler worry?
● General Purpose
  ○ whereas Library generators can focus on specific
    problems


● Engineering
  ○ Hard to modify a production compiler and its effects
    are global


● Analysis
  ○ Limited access to relevant run-time information
  ○ Over-specified dependencies
  ○ Correctness Criteria
Compiler Vs AutoTuner
● Input
  ○ Compiler: General-purpose source code
  ○ AutoTuner: Specification including problem size, machine
    parameters and problem-specific transformations
● Output
  ○ Compiler: Low-level machine code
  ○ AutoTuner: Mostly high-level source (eg: C code)
● Time to Generate
  ○ Compiler: Short (unless feedback/profiling enabled)
  ○ AutoTuner: Usually long (depends on search space)
● Select Implementation
  ○ Compiler: Mostly static analysis (rarely feedback tuning)
  ○ AutoTuner: Automated empirical models and experiments
Some AutoTuning Projects

● Linear Algebra
  ○ Portable High-Performance ANSI C
     ■ PHiPAC
  ○ Automatically Tuned Linear Algebra Software
    ■ ATLAS


● Signal and Image Processing
  ○ Fast Fourier Transformations of the West
    ■ FFTW
  ○ SPIRAL
PHiPAC
Traditional Approach
Hand Tuned Libraries
PHiPAC (1997)
● Developing Portable High-Performance
  matrix vector libraries in ANSI C
● Parametrized C-code Generator
  ○ produces code according to certain
     guidelines
● Auto Tune the code
● Exhaustive search over all parameters
● Claim: achieves over 90% of peak performance
PHiPAC Approach
Generate Optimized C Code
PHiPAC Approach
Parameters are Architecture Specific
Efficient Code Generation
● Studied several ANSI C Compilers and
  determined that it is best to

● Rely on Compilers for:
  ○ Register allocation
  ○ Instruction selection and Scheduling


● Manually perform:
  ○ register/cache blocking
  ○ loop unrolling
  ○ software pipelining, etc.
Local Variables to explicitly remove false
dependencies
● Before:
      a[i] = b[i] + c;
      a[i+1] = b[i+1] * d;

● After:
      float f1, f2;
      f1 = b[i]; f2 = b[i+1];
      a[i] = f1 + c;
      a[i+1] = f2 * d;


The compiler may not assume &a[i] != &b[i+1],
and so is forced to first store a[i] before
loading b[i+1] (pointer aliasing).
False Dependencies
(figure) After Removing Dependency
Exploit Multiple Registers

● Explicitly keep values in local variables
  ○ Reduces memory bandwidth
   ○ compiler would reload fil values for every
     iteration (potential aliasing with res)

● Before:
      while(...) {
        *res++ = fil[0] * sig[0]
               + fil[1] * sig[1];
        signal++;
      }

● After:
      float f0 = fil[0];
      float f1 = fil[1];
      while(...) {
        *res++ = f0 * sig[0]
               + f1 * sig[1];
        signal++;
      }
Minimize pointer updates by striding with
constant offsets

● Before:
      f0 = *r8; r8 += 4;
      f1 = *r8; r8 += 4;
      f2 = *r8; r8 += 4;

● After:
      f0 = r8[0];
      f1 = r8[4];
      f2 = r8[8];
      r8 += 12;




Compilers can fold constant index into
(register + offset) addressing mode.
Minimize branches, avoid magnitude
compares
● Branches are costly
  ○ Unroll loops
  ○ Use do{} while(); loops to avoid loop
     head branches
● Using == and != is cheaper
● Before:
      for(i = 0, a = start_ptr;
          i < ARRAY_SIZE;
          i++, a++) {
        ....
      }

● After:
      end_ptr = &a[ARRAY_SIZE];
      do {
        ...
        a++;
      } while (a != end_ptr);
Explicitly unroll loops

● Instruction level parallelism
● Before:
      while(...) {
        *res++ = fil[0] * sig[0]
               + fil[1] * sig[1];
        signal++;
      }

● After:
      float f0, f1, s0, s1, s2;
      f0 = fil[0]; f1 = fil[1];
      s0 = sig[0]; s1 = sig[1];
      *res++ = (f0*s0) + (f1*s1);
      do {
        signal += 2;
        s2 = sig[0];
        res[0] = f0*s1 + f1*s2;
        s1 = sig[1];
        res[1] = f0*s2 + f1*s1;
        res += 2;
      } while(...);
Other Guidelines
● Balance Instruction Mix
  ○ Interleave 1 FPM, 1 FPA and 1-2 FP loads or
     stores
● Increase Locality
  ○ Arrange code to have unit-stride memory
     accesses and try to reuse data in cache
● Convert Integer multiplies to adds
  ○ * and / are slower than +
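
A small hypothetical illustration of the last guideline (not from the PHiPAC paper): the index computation i*stride is strength-reduced to a running offset, so the per-iteration integer multiply becomes an add.

/* Hypothetical example of the "convert integer multiplies to adds"
 * guideline: i*stride is replaced by an offset that is only added to. */
static double sum_strided(const double *a, int n, int stride) {
    double sum = 0.0;
    int off = 0;                          /* replaces i * stride */
    for (int i = 0; i < n; i++, off += stride)
        sum += a[off];
    return sum;
}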
Matrix Multiply Generators
● Produce C code with PHiPAC guidelines
● C = αop(A)op(B) + βC
  ○ MxK, KxN and MxN matrices
  ○ op(X) is either X or transpose(X)

● mm_cgen and mm_lgen
    ○ Core (register blocking)
    ○ Level (higher level cache blocking)


●   mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
Blocked MMM
for (i=0; i<M; i+=M0)
 for (j=0; j<N; j+=N0)
  for (l=0; l<K; l+=K0)

   for (r=i; r<i+M0; r++)
    for (s=j; s<j+N0; s++)
     for (t=l; t<l+K0; t++)
      c[r][s] += a[r][t] * b[t][s];
Code Generator
 $ mm_gen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]




  (M0 K0 N0, M1 K1 N1)  -->  mm_gen  -->  Optimized C
Usage and Options
Usage: mm_cgen [OPTIONS]
● Semantics options:
    ○ -op[ABC] [N|T] : [ABC] matrix op. Normal|Transpose
    ○ -no_fringes : don’t generate an M,K, or N reg block
      fringes


●   Optimization options:
    ○ -l0/l1 M0/M1 K0/K1 N0/N1 : register (L0)/Cache (L1)
      blocking parameters
    ○ -sp [1|2lm|2ma|3] : software pipelining options
Contd.
● Precision options:
   ○ prec/sprec/aprec/dprec [single|double|ldouble] :
     Precision (source, accumulator, destination)


● Misc. options:
  ○ file name : Write to file ’name’
   ○ routine_name name : Name of routines
Optimal Block Sizes
Use the search.pl script
Optimal Block Sizes
● Naive brute force search

● For Register Parameters
   ○ NR/4 <= M0·N0 <= NR ; NR is the max number of registers
   ○ 1 <= K0 <= K0max ; K0max = 20 (tunable)


● Benchmark all squares M = K = N = D
  ○ D runs over 2x, 3x, 10x and all primes
  ○ 3D² fits in the L1 cache
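
A sketch of enumerating the register-parameter candidates described above (the NR = 32 and K0max = 20 values are illustrative; this is not PHiPAC's search script):

/* Sketch of enumerating register-blocking candidates:
 * keep (M0, N0, K0) with NR/4 <= M0*N0 <= NR and 1 <= K0 <= K0max. */
#include <stdio.h>

int main(void) {
    const int NR = 32, K0max = 20;         /* illustrative values */
    int count = 0;
    for (int M0 = 1; M0 <= NR; M0++)
        for (int N0 = 1; N0 <= NR; N0++) {
            int regs = M0 * N0;
            if (regs < NR / 4 || regs > NR) continue;  /* NR/4 <= M0*N0 <= NR */
            for (int K0 = 1; K0 <= K0max; K0++)
                count++;    /* each (M0, N0, K0) is one version to benchmark */
        }
    printf("%d register-blocking candidates\n", count);
    return 0;
}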
Contd.
● For L1 blocking Parameters
● The square case ( D x D)
● Search the neighborhood centered at 3D² = L1
● Set the values of M1, K1, N1 to ϕ·D/M0
   ○ Where, ϕ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 }
   ○ D = sqrt(L1/3)
   ○ 125 Combinations
Naive Brute Force ?
● Search takes too long

● Generates very lengthy code

● Very slow under full optimization

● Need a better search strategy
Smarter Search
● Majority of the computation is performed in
  register blocked code
● Benchmark only in multiples of register block
  size
● Search space of M0, N0, K0 is not reduced
  ○ Prioritize neighborhood of the best ones found
  ○ {M0-1, M0, M0+1} etc.
● Terminate after reaching acceptable
  efficiency
Evaluation
(figure) Single Precision MMM (100 MHz SGI Indigo R4k)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

(figure) Double Precision MMM (HP 712/80i)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
There is no Golden Hammer
Strengths:
● Automatic search for optimal parameters
● Produces portable ANSI C code

Weaknesses:
● Focus on uniprocessor machines
● No support for vector-based CPUs
● No control over instruction scheduling
Further Information
● http://www.icsi.berkeley.edu/~bilmes/phipac/

● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
ATLAS
Siddharth Subramanian
ATLAS
● Automatically Tuned Linear Algebra
  Software
● Generates optimized BLAS library
● C and Fortran77
● Provides implementation for BLAS levels 1,2
  and 3.
● We will focus on Matrix-Matrix-Multiply
  (MMM)
Naive MMM
● C = A * B using 3 for-loops
● Dimensions of A, B and C are NxK, KxM and
  NxM respectively.
Optimization for L1 cache
● Matrix divided into NB x NB blocks
● Each block is called mini-MMM
● Optimization parameter NB is chosen such
  that each mini-MMM fits in cache
Optimization for L1 cache
Optimization for register file
● Mini-MMMs are further represented as micro-
  MMMs
● Multiplies MU x 1 sub-matrix of A by 1 x NU sub-
  matrix of B and accumulates the result into MU x
  NU sub-matrix of C
● Here MU and NU are the optimization parameters
● Necessary condition : MU + NU + MU*NU <= NR
● where NR = no. of floating point registers
Mini and Micro- MMM
Code
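
The code from the original slide is not reproduced here; below is a hand-written sketch of a micro-MMM for MU = NU = 2, assuming the A tile is packed in MU-wide column panels and the B tile in NU-wide row panels (illustrative values and layout, not ATLAS-generated code):

/* Hand-written micro-MMM sketch with MU = NU = 2: a 2x1 column of A
 * times a 1x2 row of B is accumulated into a 2x2 register tile of C
 * over the K dimension. Packing assumptions: A[m + k*MU], B[n + k*NU],
 * C row-major with leading dimension ldc.                              */
void micro_mmm_2x2(int K, const double *A, const double *B,
                   double *C, int ldc) {
    double c00 = 0, c01 = 0, c10 = 0, c11 = 0;          /* MU*NU accumulators */
    for (int k = 0; k < K; k++) {
        double a0 = A[0 + k * 2], a1 = A[1 + k * 2];    /* MU loads of A */
        double b0 = B[0 + k * 2], b1 = B[1 + k * 2];    /* NU loads of B */
        c00 += a0 * b0;  c01 += a0 * b1;                /* MU*NU mul-adds */
        c10 += a1 * b0;  c11 += a1 * b1;
    }
    C[0 * ldc + 0] += c00;  C[0 * ldc + 1] += c01;
    C[1 * ldc + 0] += c10;  C[1 * ldc + 1] += c11;
}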
Pipeline scheduling
The 2 innermost loops (i'' and j'') are unrolled,
to create interleaved multiply and add
statements
Exploits instruction-level parallelism
● If there is fused multiply-add, then these 2
  operations can be executed together
● The optimization parameter FMA indicates to the
  code generator whether this facility is available
Pipeline scheduling
● MU + NU loads and stores
● MU * NU additions and multiplications
● Latency of operations might stall the pipeline
● Solution : Interleave the operations such that
  dependent operations are separated by a
  particular distance (What would that be?)
● This is governed by another optimization
  parameter - LS
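
A hand-written illustration of this idea for LS = 2 (not ATLAS's scheduler output): each addition consumes a product issued two statements earlier, so independent multiplies fill the multiply latency.

/* Skewed scheduling with LS = 2 (hand-written sketch): adds into c
 * use products issued two statements earlier, leaving independent
 * work between a multiply and the add that depends on it.           */
void skewed_muladd(const double a[4], const double b[4], double c[4]) {
    double t0, t1, t2, t3;
    t0 = a[0] * b[0];                  /* issue LS = 2 multiplies first      */
    t1 = a[1] * b[1];
    c[0] += t0;  t2 = a[2] * b[2];     /* add uses t0; next multiply issued  */
    c[1] += t1;  t3 = a[3] * b[3];
    c[2] += t2;                        /* drain the remaining adds           */
    c[3] += t3;
}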
Pipeline scheduling

● Inject MU + NU loads of A and B
● Loads divided into:
  ○ Initial fetch (IF)
  ○ Blocks of other load operations (NF)
Loop Unrolling
● KU is the optimization parameter that
  controls loop unrolling
● Constrained by the capacity of instruction
  cache
● Should not be too small (wasting the cache) or
  too big (overflowing the instruction cache)
Other Optimizations


● Copying tiles of A is done in the beginning of
  outermost loop. These tiles are fully reused
  in each iteration of j loop
● Copying jth vertical panel of B -- done before
  beginning of i loop.
● Copying tile (i,j) of C just before the "k" loop
  starts
Other optimizations
● Choosing loop order:
  ○ if N < M then JIK loop order (so that A completely fits into L2 cache)
  ○ else if M < N then IJK loop order
Other optimizations
● Copying A, B, C for smaller matrices might
  be an overhead
● Non-copying versions are generated with
  optimization parameter NCNB
● This version used if:
  ○ M * N * K is less than a threshold
  ○ at least 1 dimension of 1 of the matrices is
     smaller than 3 * NCNB
Estimating parameters
● Orthogonal search is used for optimizing
  parameters.
● It is a heuristic, and finds approximate
  solutions
● No guarantee of optimized solution
● It needs these details:
  ○ Optimized in what order?
  ○ Possible solution range for parameters
  ○ reference value to use for parameter k while
     parameters 1 to k-1 are being optimized
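
A sketch of orthogonal line search over three abstract parameters (the ranges, reference values and dummy cost function are illustrative, not ATLAS's installer):

/* Orthogonal line search sketch: parameters are optimized one at a
 * time in a fixed order; already-optimized parameters keep their best
 * values, not-yet-optimized ones keep reference values.               */
#include <stdio.h>
#define NPARAMS 3

static double benchmark(const int p[NPARAMS]) {
    double cost = 0;                        /* dummy cost; lower is better */
    for (int i = 0; i < NPARAMS; i++) cost += (p[i] - 3) * (p[i] - 3);
    return cost;
}

int main(void) {
    int lo[NPARAMS] = {1, 1, 1}, hi[NPARAMS] = {8, 8, 8};
    int p[NPARAMS]  = {4, 4, 4};            /* reference values            */
    for (int i = 0; i < NPARAMS; i++) {     /* fixed optimization order    */
        int best = p[i];
        double best_cost = benchmark(p);
        for (int v = lo[i]; v <= hi[i]; v++) {
            p[i] = v;
            double c = benchmark(p);
            if (c < best_cost) { best_cost = c; best = v; }
        }
        p[i] = best;                        /* freeze parameter i          */
    }
    printf("selected parameters: %d %d %d\n", p[0], p[1], p[2]);
    return 0;
}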
Summary of Parameters
Estimating Machine Parameters

Machine parameters are measured:
● C1 - Size of L1 data cache
● NR - Number of floating point registers
● FMA - Availability of fused multiply-add
● LS - Amount of separation between
  dependent multiply and add instructions
Estimating parameters

Optimization sequence
● NB
● MU and NU
● KU
● LS
● IF, NF
● NCNB
Finding NB

● Generates values in range :

  16 <= NB <= min(80, √C1)


  where C1 = size of L1 data cache
Finding MU and NU

● All combinations that satisfy:

   ○ MU * NU + MU + NU + LS <= NR


● NB was obtained earlier
Finding LS and IF, NF

LS
● Tries values in interval [1, 6]
● Boundary value fixed based on experiments
● Divides MU * NU * KU (instruction scheduling)

● IF: Searches for IF in the interval [2, MU + NU]
● NF in the interval [1, MU + NU - IF]
Finding NCNB


● Searches in the range [NB : -4 : 4]

● Terminates search when performance drops
  by 20% of the best found solution
Is Search Really Necessary?
Finding KU


● Constrained by instruction cache
● Values between 4 and NB/2 are tried

● Special values 1 and NB are also considered
Empirical Optimization
● Estimation of optimal values is the key
    ○ Compilers use Analytical models
    ○ Library Generators (eg: ATLAS) use search
● Empirical Search:
    ○ Get a version of program for each combination of
      parameters
    ○ Execute it on the target machine and measure
      performance
    ○ Select the one that performs best
    ○ Increased installation time!!
●   How is the search space bounded?
    ○ The hardware parameters
Yotov et al.
● Realised that most optimizations used in the
  ATLAS code generator are already known to
  compilers
  ○ cache tiling, register tiling, etc.
● Replaced the search module with a
  parameter estimator based on standard
  analytical models
● Code generator is not modified
  ○ Any performance change is solely based on
    differently chosen parameters
ATLAS Architecture
Analysis
● Results indicated that a simple and intuitive
  model is able to estimate near-optimal
  values for the parameters

● Focus on the ATLAS generated code

● Notations:
   ○ ATLAS CGw/S - Code Generator with Search
   ○ ATLAS Model - Modified Atlas (No search)
   ○ Atlas Unleashed - Hand written code may be used
     along with predefined architecture defaults for the
     parameter values to produce the library.
Model-Based Optimization

● Requires more machine parameters than
  original ATLAS
  ○ No Search!!
● Empirical optimizers:
  ○ Approximate values of machine params are okay
  ○ Only used to bound the search space
● Model-based Optimizers:
  ○ Need accurate values
  ○ Developed a tool called X-RAY to accurately
    measure them
Hardware Parameters
● C1,B1: the capacity and the line size of the
  L1 data cache
● CI : The capacity of the L1 instruction cache
● Lx: hardware latency of the floating-point
  multiply instruction
● |ALUFP |: number of floating-point functional
  units
● NR: the number of floating-point registers
● FMA: the availability of a fused multiply-add
  instruction
Estimating NB
● Consider L1 cache - Fully Associative,
  Optimal replacement, Unit line size

● Working set of the mini-MMM loop has 3 blocks
  of NB x NB, so
                3·NB² <= C1
● In the innermost loop, an element of C, once
  computed, is not used again, and only 1 column
  of B is needed in cache, so the bound relaxes to
              NB² + NB + 1 <= C1
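
As a worked example, for a hypothetical 32 KB L1 data cache holding C1 = 4096 double-precision elements, 3·NB² <= 4096 gives NB <= 36, while the relaxed bound NB² + NB + 1 <= 4096 allows NB up to 63.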
Refined Estimate of NB


● Correcting for non-unit line size

        ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
Further Refinement
● Estimated NB may not be multiple of MU and
  NU
● This might cause fractional register tiles and
  extra clean up
● Avoid this by choosing proper NB
● ATLAS needs NB to be an even integer
● So, we have: NB =
Estimating MU and NU

● View register file as a software cache
  ○ that is fully associative
  ○ unit line size
  ○ capacity = # registers, NR


● ATLAS performs outer products of (MU x 1)
  and (1 x NU) vectors for register tiling
Contd.
● ATLAS allocates MU elements for A, NU
  elements for B, and MU*NU elements for C
● Also need LS registers to store temp values
  of multiplications to make use of pipelining
● So we have:
      (MU x NU) + NU + MU + LS <= NR
LS calculation will be shown later, NR is known.
Only unknowns are MU and NU.
Estimation Scheme
● Let MU = NU = u. Solve prev inequality for u

● Let MU = max (u, 1). Solve for NU

● Let NU = max (NU, 1)

● <MU, NU> = <max(MU, NU), min(MU, NU)>
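
A sketch of this estimation scheme in C (illustrative, not the ATLAS model code; the NR = 32 and LS = 4 values in the example call are hypothetical):

/* Solve MU*NU + MU + NU + LS <= NR with MU = NU = u, then re-solve
 * for NU with MU fixed, and order the pair so that MU >= NU.        */
#include <math.h>
#include <stdio.h>

static void estimate_mu_nu(int NR, int LS, int *MU, int *NU) {
    /* u*u + 2u + LS <= NR  is  (u+1)^2 <= NR - LS + 1 */
    int u  = (int)floor(sqrt((double)(NR - LS + 1))) - 1;
    int mu = u > 1 ? u : 1;
    /* mu*nu + mu + nu + LS <= NR  gives  nu <= (NR - mu - LS)/(mu + 1) */
    int nu = (NR - mu - LS) / (mu + 1);
    if (nu < 1) nu = 1;
    *MU = mu > nu ? mu : nu;               /* keep the larger one as MU */
    *NU = mu > nu ? nu : mu;
}

int main(void) {
    int MU, NU;
    estimate_mu_nu(32, 4, &MU, &NU);       /* hypothetical NR = 32, LS = 4 */
    printf("MU = %d, NU = %d\n", MU, NU);
    return 0;
}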
Estimating KU

● Not limited by the size of the register file
● Limited by the size of I-Cache
● Unroll the innermost loop within the size
  constraints of instruction cache
● Avoid micro-MMM code cleanup
   ○ Trim KU so that it divides NB

   ○ Usually, KU = NB in most machines
Estimating LS

● Skew factor that ATLAS code generator
  uses to schedule dependent multiplication
  and addition operations for CPU Pipeline
● LS independent multiplications and LS-1
  independent additions between muli and
  corresponding addi should at least hide the
  latency of multiplication.
Estimating LS

● LX = latency of multiplication
● 2 * LS - 1 independent instructions hides this
  latency
● So, 2 * LS - 1 >= LX
● There may be multiple floating point units
        (2·LS − 1) / |ALUFP| >= LX
● Solution for LS:
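  Rearranging the inequality above and taking the smallest integer
  value gives LS = ⌈(LX · |ALUFP| + 1) / 2⌉.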
Summary
1.   Estimate FMA
2.   Estimate LS :


3. Estimate MU and NU
MU*NU + NU + MU + LS <= NR
Set MU = NU = u. Solve for u
MU = max(1, u). Solve for NU
NU = max(NU, 1). If MU < NU swap MU and NU
4. Estimate NB
              ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
     ○   Trim NB to be multiple of 2, MU and NU
5. Estimate KU
     ○   Constrained by I-cache.
     ○   Make KU divide NB
6. Estimate NF, IF
     ○   IF = 2, NF = 2
Experimental Results
Conclusions
● In all machines (other than Itanium), the
  codes performed almost as well as global
  search based codes
● Models to find parameters are much faster
● Might be difficult to implement analytical
  methods in compilers
  ○ This model is focused on only 1 application
