PyData Paris 2015 - Track 3.3 Antoine Pitrou

1. Numba, a JIT compiler for fast numerical code
2. Speaker presentation
   - Antoine Pitrou < >
   - Numba developer at Continuum since 2014
   - Core Python developer since 2007
   - Not a scientist
3. What is Numba?
   - A just-in-time compiler based on LLVM
   - Runs on CPython 2.6 to 3.4
   - Opt-in
   - Specialized in numerical computation
   - BSD-licensed, cross-platform (Linux, OS X, Windows)
4. Why a just-in-time compiler?
   - Pure Python is slow at number crunching
   - Numpy has C accelerations, but they only apply to well-behaved problems
     - array operations are memory-heavy and can thrash CPU caches
     - many algorithms have irregular data access, per-element branching, etc.
   - Want C-like speed, but without writing C (or Fortran!)
   - Fit for interactive use
5. Why a just-in-time compiler?
   Mandelbrot (20 iterations):

     CPython                         1x
     Numpy array-wide operations    13x
     Numba (CPU)                   120x
     Numba (Nvidia Tesla K20c)    2100x
6. LLVM
   - A mature library and toolkit for writing compilers (clang)
   - Multi-platform
   - Supported by the industry
   - Has a wide range of integrated optimizations
   - Allows us to focus on Python
7. LLVM optimizations
   - inlining
   - loop unrolling
   - SIMD vectorization
   - etc.
8. LLVM crazy optimizations
   Constant-time arithmetic series: the timings below are independent of x,
   because LLVM recognizes the accumulation loop and replaces it with a
   closed-form formula (see the assembler on the next slide).

     In [2]: @numba.jit
             def f(x):
                 res = 0
                 for i in range(x):
                     res += i
                 return res

     In [10]: %timeit -c f(10)
     1000000 loops, best of 3: 211 ns per loop

     In [15]: %timeit -c f(100000)
     1000000 loops, best of 3: 231 ns per loop
9. LLVM crazy optimizations
   Assembler output:

     In [16]: print(f.inspect_asm((numba.int64,)))
             .text
             .file   "<string>"
             .globl  __main__.f$1.int64
             .align  16, 0x90
             .type   __main__.f$1.int64,@function
     __main__.f$1.int64:
             xorl    %edx, %edx
             testq   %rcx, %rcx
             jle     .LBB0_2
             movq    %rcx, %rax
             negq    %rax
             cmpq    $-2, %rax
             movq    $-1, %rdx
             cmovgq  %rax, %rdx
             leaq    (%rdx,%rcx), %rsi
             leaq    -1(%rdx,%rcx), %rax
             mulq    %rsi
             shldq   $63, %rax, %rdx
             addq    %rsi, %rdx
     .LBB0_2:
             movq    %rdx, (%rdi)
             xorl    %eax, %eax
             retq
10. Runs on CPython
    - 2.6, 2.7, 3.3, 3.4, 3.5
    - Can run side by side with regular Python code
    - Can run side by side with all third-party C extensions and libraries
      - all the numpy / scipy / etc. ecosystem
11. Opt-in
    - Only accelerates selected functions, decorated by you
    - Allows us to relax semantics in exchange for speed
    - High-level code surrounding Numba-compiled functions can be arbitrarily complex
12. Specialized
    - Tailored for number crunching
    - Tailored for Numpy arrays
    - And a bunch of other things...
13. Multiple targets
    - Main target is the CPU
      - officially supported: x86, x86-64
    - CUDA target for NVidia GPUs, with a limited feature set
    - Potential support for:
      - HSA (GPU+CPU on AMD APUs)
      - ARM processors
14. Numba architecture
    - Straightforward function-based JIT
    - Compilation pipeline from Python bytecode to LLVM IR
    - Low-level optimizations and codegen delegated to LLVM
    - Python-facing wrappers
15. Compilation pipeline

    Python bytecode
      → bytecode analysis (control flow graph, data flow graph)
      → bytecode interpretation
      → Numba IR
      → type inference
      → typed Numba IR
      → lowering
      → LLVM IR
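    These stages can be watched from an interactive session; a minimal sketch
    using the dispatcher's inspection methods (inspect_types and inspect_llvm,
    the same family as the inspect_asm call on slide 9; the add function is a
    made-up example):

      import numba

      @numba.jit
      def add(x, y):
          return x + y

      add(1, 2)  # first call triggers compilation for (int64, int64)

      add.inspect_types()   # source annotated with the inferred Numba types
      for sig, ir in add.inspect_llvm().items():
          print(sig)        # compiled signature
          print(ir)         # LLVM IR generated for that signature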
16. Numba specializations
    - The "lowering" pass generates LLVM code for specific types and operations
      - built-in types and operators
      - specific libraries (math, cmath, random...)
    - Opens opportunities for inlining and other optimizations
17. Supported Python syntax
    - Supported constructs:
      - if / else / for / while / break / continue
      - raising exceptions
      - calling other compiled functions
      - generators! (sketch below)
      - etc.
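    For instance, a compiled generator can be iterated from regular Python code
    just like an ordinary one; a minimal sketch with a hypothetical squares
    function:

      import numba

      @numba.jit(nopython=True)
      def squares(n):
          for i in range(n):
              yield i * i

      total = 0
      for v in squares(10):  # consumed from plain Python code
          total += v
      print(total)           # 285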
18. Unsupported Python syntax
    - Unsupported constructs:
      - try / except / finally
      - with
      - (list, set, dict) comprehensions
      - yield from
19. Supported Python features
    - Types:
      - int, bool, float, complex
      - tuple, None
      - bytes, bytearray, memoryview (and other buffer-like objects)
    - Built-in functions:
      - abs, enumerate, len, min, max, print, range, round, zip
20. Supported Python modules
    - Standard library:
      - cmath, math, random, ctypes...
21. Supported Numpy features
    - All kinds of arrays:
      - scalar
      - structured
      - except when containing Python objects
    - Iterating, indexing, slicing
    - Reductions: argmax(), max(), prod(), etc.
    - Scalar types and values (including datetime64 and timedelta64)
22. Limitations
    - Recursion not supported
    - Can't compile classes
    - Can't allocate array data
    - Type inference must be able to determine all types
23. Semantic changes
    - Fixed-sized integers
    - Global and outer variables frozen (sketch below)
    - No frame introspection inside JIT functions:
      - tracebacks
      - debugging
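    A small sketch of the frozen-globals change (the FACTOR constant is a
    made-up example): globals are captured as constants when the function is
    first compiled, so later rebinding is not seen.

      import numba

      FACTOR = 2

      @numba.jit
      def scale(x):
          return x * FACTOR

      print(scale(10))  # 20: FACTOR is captured at compile time

      FACTOR = 3
      print(scale(10))  # still 20: the compiled code kept the frozen value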
24. Using Numba: @jit
    - @jit-decorate a function to designate it for JIT compilation
    - Automatic lazy compilation (recommended; sketch below):

        @numba.jit
        def my_function(x, y, z):
            ...

    - Manual specialization:

        @numba.jit("(int32, float64, float64)")
        def my_function(x, y, z):
            ...
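    A complete lazy-compilation sketch (clipped_sum is a made-up example, not
    from the talk): the first call compiles a specialization for the argument
    types it sees.

      import numba
      import numpy as np

      @numba.jit
      def clipped_sum(arr, lo, hi):
          total = 0.0
          for v in arr:                    # explicit loops are fine under @jit
              total += min(max(v, lo), hi)
          return total

      x = np.random.rand(1000000)
      clipped_sum(x, 0.1, 0.9)  # compiles for (float64[:], float64, float64)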
25. GIL removal with @jit(nogil=True)
    - N-core scalability by releasing the Global Interpreter Lock:

        @numba.jit(nogil=True)
        def my_function(x, y, z):
            ...

    - No protection from race conditions!
    - Tip: use concurrent.futures.ThreadPoolExecutor on Python 3
      (see the sketch below)
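    A sketch of what that multi-threaded use could look like (chunk_sum, the
    chunk count and worker count are all made-up for illustration):

      import numpy as np
      import numba
      from concurrent.futures import ThreadPoolExecutor

      @numba.jit(nogil=True, nopython=True)
      def chunk_sum(arr):
          total = 0.0
          for v in arr:
              total += v
          return total

      x = np.random.rand(4000000)
      chunks = np.array_split(x, 4)  # one chunk per thread: no shared writes

      # Threads actually run in parallel because chunk_sum releases the GIL
      with ThreadPoolExecutor(max_workers=4) as pool:
          total = sum(pool.map(chunk_sum, chunks))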
26. Using Numba: @vectorize
    - Compiles a scalar function into a Numpy universal function
    - What is a universal function?
      - Examples: np.add, np.multiply, np.sqrt...
      - Applies an element-wise operation to entire arrays
      - Automatic broadcasting
      - Reduction methods: np.add.reduce(), np.add.accumulate()...
      - Traditionally requires coding in C
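    A minimal sketch, reusing the relative_difference function from slide 28:
    the decorated result is a true Numpy ufunc, so broadcasting and the
    reduction methods come along for free.

      import numba
      import numpy as np

      @numba.vectorize(['float64(float64, float64)'])
      def relative_difference(a, b):
          return abs(a - b) / (abs(a) + abs(b))

      x = np.linspace(1.0, 2.0, 5)
      relative_difference(x, 1.5)        # element-wise, with broadcasting
      relative_difference.reduce(x)      # ufunc reduction method
      relative_difference.accumulate(x)  # ufunc accumulation method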
27. Using Numba: @guvectorize
    - Compiles an element-wise or subarray-wise function into a generalized
      universal function
    - What is a generalized universal function?
      - like a universal function, but it can peek at other elements
        - e.g. a moving window average (sketch below)
      - automatic broadcasting, but no automatic reduction methods
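    A sketch of the moving-window average mentioned above (move_mean is a
    made-up name; the layout signature '(n),()->(n)' maps a length-n array and
    a scalar window size to a length-n output):

      import numba
      import numpy as np

      @numba.guvectorize(['void(float64[:], int64, float64[:])'],
                         '(n),()->(n)')
      def move_mean(a, window, out):
          acc = 0.0
          for i in range(a.shape[0]):
              acc += a[i]
              if i >= window:
                  acc -= a[i - window]  # drop the element leaving the window
              out[i] = acc / min(i + 1, window)

      move_mean(np.arange(10, dtype=np.float64), 3)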
28. @vectorize performance
    Vectorizing optimizes the memory cost on large arrays: the compiled ufunc
    makes a single pass instead of allocating several temporaries.

      In [2]: @numba.vectorize(['float64(float64, float64)'])
              def relative_difference(a, b):
                  return abs(a - b) / (abs(a) + abs(b))

      In [3]: x = np.arange(1e8, dtype=np.float64)
              y = x + 1.1

      In [4]: %timeit abs(x - y) / (abs(x) + abs(y))
      1 loops, best of 3: 2.2 s per loop

      In [5]: %timeit relative_difference(x, y)
      1 loops, best of 3: 741 ms per loop
29. @jit example: Ising model
    [Figure: Ising model simulation snapshot]
30. Ising model: code

      kT = 2 / math.log(1 + math.sqrt(2), math.e)

      @numba.jit(nopython=True)
      def update_one_element(x, i, j):
          n, m = x.shape
          assert n > 0
          assert m > 0
          dE = 2 * x[i, j] * (x[(i-1)%n, (j-1)%m] + x[(i-1)%n, j] + x[(i-1)%n, (j+1)%m]
                              + x[i, (j-1)%m]                     + x[i, (j+1)%m]
                              + x[(i+1)%n, (j-1)%m] + x[(i+1)%n, j] + x[(i+1)%n, (j+1)%m])
          if dE <= 0 or math.exp(-dE / kT) > np.random.random():
              x[i, j] = -x[i, j]

      @numba.jit(nopython=True)
      def update_one_frame(x):
          n, m = x.shape
          for i in range(n):
              for j in range(0, m, 2):   # Even columns first to avoid overlap
                  update_one_element(x, j, i)
          for i in range(n):
              for j in range(1, m, 2):   # Odd columns second to avoid overlap
                  update_one_element(x, j, i)
31. Ising model: performance
    [Bar chart comparing CPython, Numba (CPU), and Fortran]
32. CUDA support
    - Numba provides a @cuda.jit decorator
    - Exposes the CUDA programming model
    - Parallel operation:
      - threads
      - blocks of threads
      - grid of blocks
    - Distinguishing between (see the sketch after this list):
      - kernel functions (called from the CPU)
      - device functions (called from the GPU)
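    A small sketch of the kernel/device distinction (polar_to_x and
    polar_kernel are made-up names; the launch configuration is arbitrary):

      import math
      import numpy as np
      from numba import cuda

      @cuda.jit(device=True)          # device function: callable from GPU code only
      def polar_to_x(r, theta):
          return r * math.cos(theta)

      @cuda.jit                       # kernel: launched from the CPU
      def polar_kernel(r, theta, out):
          i = cuda.grid(1)            # absolute thread index in the 1-D grid
          if i < out.shape[0]:
              out[i] = polar_to_x(r[i], theta[i])

      r = np.random.rand(100000).astype(np.float32)
      theta = np.random.rand(100000).astype(np.float32)
      out = np.zeros_like(r)
      polar_kernel[len(r) // 256 + 1, 256](r, theta, out)  # [blocks, threads]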
33. CUDA support
    - Limited set of features available
      - features requiring C helper code are unavailable
    - Programmer needs to make use of CUDA knowledge
    - Programmer needs to take hardware capabilities into account
34. CUDA example

      In [2]: @cuda.jit
              def gpu_cos(a, out):
                  i = cuda.grid(1)
                  if i < a.shape[0]:
                      out[i] = math.cos(a[i])

      In [3]: x = np.linspace(0, 2 * math.pi, int(1e7), dtype=np.float32)
              cpu_out = np.zeros_like(x)
              gpu_out = np.zeros_like(x)
              thread_config = (len(x) // 512 + 1), 512

      In [4]: %timeit np.cos(x, cpu_out)
      10 loops, best of 3: 149 ms per loop

      In [7]: %timeit gpu_cos[thread_config](x, gpu_out)
      10 loops, best of 3: 27.8 ms per loop

      In [8]: np.allclose(cpu_out, gpu_out)
      Out[8]: True

    The CPU is a Core i7-4820K (3.7 GHz), the GPU is a Tesla K20c.
35. Installing Numba
    - Recommended: precompiled binaries with Anaconda or Miniconda:

        conda install numba

    - Otherwise: install LLVM 3.5.x, compile llvmlite, install numba from source
36. Contact
    - Code and issue tracker at [ ]
    - Numba-users mailing list
    - Numba is commercially supported ( ):
      - consulting
      - enhancements
      - support for new architectures
      - NumbaPro
