On the Necessity and Inapplicability of Python
1. Yung-Yu Chen (@yungyuc)
Help us develop numerical software
2. Who I am
• I am a mechanical engineer by training, focusing on
applications of continuum mechanics. A computational
scientist / engineer rather than a computer scientist.
• In my day job, I write high-performance code for
semiconductor applications of computational geometry
and lithography.
• In my spare time, I teach a course, ‘numerical
software development’, in the dept. of computer science
at NCTU.
You can contact me through twitter: https://twitter.com/yungyuc
or linkedin: https://www.linkedin.com/in/yungyuc/.
3. PyHUG
• Python Hsinchu User Group (established in late
2011)
• The first group of staff of PyCon Taiwan (2012)
• Weekly meetups at a pub for 3 years, not
stopped by COVID-19
• 7+ active user groups in Taiwan
• I attended PyConJP in 2012, 2013 (APAC),
2015, and 2019
• Last year I led a group visit to PyConJP (thank
you, Terada-san, for sharing the know-how!)
• I hope we can do more
4. PyCon Taiwan
5-6 Sep, 2020, Tainan, Taiwan
• It is planned to be an on-site conference
(unless something incredibly bad
happens again)
• Speakers may choose to speak online
• We still need to wear a face mask
• Thanks to the citizens and government of
Taiwan, who work hard to
counter COVID-19
• https://g0v.hackmd.io/@kiang/mask-info
• We hope to see you again in Taiwan!
https://tw.pycon.org/2020/
5. Numerical software
• Numerical software: Computer programs to solve scientific or
mathematical problems.
• Other names: Mathematical software, scientific software, technical
software.
• Python is a popular language for application experts to describe the
problems and solutions, because it is easy to use.
• Most of the computing systems (the numerical software) are designed in
a hybrid architecture.
• The computing kernel uses C++.
• Python is chosen for the user-level API.
6. Example: OPC
[Figure: Photolithography in semiconductor fabrication. Labels: light source, photomask, photoresist (PR), silicon substrate. The wavelength is only hundreds of nm.]
[Figure: Optical proximity correction (OPC). Labels: the image I want to project on the PR (smaller than the wavelength); the shape I need on the mask; write code to make it happen.]
7. Example: PDEs
Numerical simulations of
conservation laws:
$$\frac{\partial u}{\partial t} + \sum_{k=1}^{3} \frac{\partial F^{(k)}(u)}{\partial x_k} = 0$$
Use case: stress waves in
anisotropic solids
Use case: compressible flows
8. Example: What others do
• Machine learning
• Examples: TensorFlow, PyTorch
• Also:
• Computer aided design and engineering (CAD/CAE)
• Computer graphics and visualization
• Hybrid architecture provides both speed and flexibility
• C++ makes it possible to carry out huge amounts of calculation, e.g.,
distributed computing across thousands of computers
• Python helps describe complex problems in mathematics or science
9. Crunch real numbers
• Simple example: solve the Laplace equation
• $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0 < x < 1;\ 0 < y < 1)$$
• $$u(0,y) = 0,\ u(1,y) = \sin(\pi y) \quad (0 \le y \le 1)$$
• $$u(x,0) = 0,\ u(x,1) = 0 \quad (0 \le x \le 1)$$
• Use a two-dimensional array as the spatial grid
• Point-Jacobi method: 3-level nested loop
def solve_python_loop():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    # Outer loop.
    while not converged:
        step += 1
        # Inner loops. One for x and the other for y.
        for it in range(1, nx-1):
            for jt in range(1, nx-1):
                un[it,jt] = (u[it+1,jt] + u[it-1,jt]
                             + u[it,jt+1] + u[it,jt-1]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = norm < 1.e-5
    return u, step, norm
Non-trivial boundary condition
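The solver above relies on two names that the slide does not define, `nx` and `uoriginal`. A minimal sketch of how they could be set up follows; the helper names `make_initial_grid` and `jacobi_sweep` and the grid size `nx = 11` are assumptions for illustration, not taken from the talk.

```python
import numpy as np

def make_initial_grid(nx):
    """Return an (nx, nx) grid with the slide's boundary conditions applied.

    Hypothetical helper: the first index is x and the second is y.
    """
    y = np.linspace(0.0, 1.0, nx)
    u = np.zeros((nx, nx), dtype="float64")
    u[-1, :] = np.sin(np.pi * y)  # u(1, y) = sin(pi y); other edges stay 0
    return u

def jacobi_sweep(u):
    """One point-Jacobi sweep over the interior points (explicit loops)."""
    un = u.copy()
    for it in range(1, u.shape[0] - 1):
        for jt in range(1, u.shape[1] - 1):
            un[it, jt] = (u[it+1, jt] + u[it-1, jt]
                          + u[it, jt+1] + u[it, jt-1]) / 4
    return un

nx = 11  # assumed small grid; the slide does not state its value of nx
uoriginal = make_initial_grid(nx)
u1 = jacobi_sweep(uoriginal)
print(u1[nx - 2, nx // 2])  # interior point next to the sin(pi y) edge
```

Only the edge at x = 1 carries non-zero values, so the first sweep changes just the interior points adjacent to that edge.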
10. Power of Numpy C++
def solve_numpy_array():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    while not converged:
        step += 1
        un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
                             u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = norm < 1.e-5
    return u, step, norm
def solve_python_loop():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    # Outer loop.
    while not converged:
        step += 1
        # Inner loops. One for x and the other for y.
        for it in range(1, nx-1):
            for jt in range(1, nx-1):
                un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = norm < 1.e-5
    return u, step, norm
solve_numpy_array:
CPU times: user 62.1 ms, sys: 1.6 ms, total: 63.7 ms
Wall time: 63.1 ms (pretty good!)
solve_python_loop:
CPU times: user 5.24 s, sys: 22.5 ms, total: 5.26 s
Wall time: 5280 ms (poor speed)
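The two Python versions differ only in where the inner loops run. The sketch below checks that a single sweep written with slices gives the same result as the explicit loops; the function names and the random 8x8 test array are assumptions made for this illustration.

```python
import numpy as np

def sweep_loop(u):
    """One point-Jacobi sweep with explicit Python loops."""
    nx, ny = u.shape
    un = u.copy()
    for it in range(1, nx - 1):
        for jt in range(1, ny - 1):
            un[it, jt] = (u[it+1, jt] + u[it-1, jt]
                          + u[it, jt+1] + u[it, jt-1]) / 4
    return un

def sweep_slices(u):
    """The same sweep with array slices; the loops run inside numpy."""
    un = u.copy()
    un[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1]
                      + u[1:-1, 2:] + u[1:-1, :-2]) / 4
    return un

rng = np.random.default_rng(0)
u = rng.random((8, 8))  # arbitrary test data, not the Laplace problem
same = np.allclose(sweep_loop(u), sweep_slices(u))
print(same)
```

Each slice is the interior block shifted by one index in one direction, which is why the four shifted views line up with the four neighbours in the loop version.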
std::tuple<xt::xarray<double>, size_t, double>
solve_cpp(xt::xarray<double> u)
{
    const size_t nx = u.shape(0);
    xt::xarray<double> un = u;
    bool converged = false;
    size_t step = 0;
    double norm;
    while (!converged)
    {
        ++step;
        for (size_t it=1; it<nx-1; ++it)
        {
            for (size_t jt=1; jt<nx-1; ++jt)
            {
                un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
            }
        }
        norm = xt::amax(xt::abs(un-u))();
        if (norm < 1.e-5) { converged = true; }
        u = un;
    }
    return std::make_tuple(u, step, norm);
}
solve_cpp:
CPU times: user 29.7 ms, sys: 506 µs, total: 30.2 ms
Wall time: 29.9 ms (definitely good!)
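Implementation    Wall time
Pure Python       5280 ms
Numpy             63.1 ms
C++               29.9 ms

Speedup: Numpy is 83.7x faster than pure Python; C++ is 2.1x faster than Numpy and 176.6x faster than pure Python.
The speed is the reason: 1000 computers → 5.67. Save a lot of $.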
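The ratios on this slide follow from the three wall times. A short worked check is sketched below; the variable names are made up for the illustration. With the rounded wall times the machine count comes out near 5.66, slightly below the 5.67 on the slide, presumably because the slide used unrounded timings.

```python
# Wall times in milliseconds, as reported on the slide.
wall_ms = {"pure_python": 5280.0, "numpy": 63.1, "cpp": 29.9}

numpy_vs_python = wall_ms["pure_python"] / wall_ms["numpy"]
cpp_vs_numpy = wall_ms["numpy"] / wall_ms["cpp"]
cpp_vs_python = wall_ms["pure_python"] / wall_ms["cpp"]

# If 1000 machines are needed with pure Python, the same work at C++ speed
# needs roughly 1000 / speedup machines.
machines = 1000 / cpp_vs_python

print(f"numpy vs pure Python: {numpy_vs_python:.1f}x")
print(f"C++ vs numpy:         {cpp_vs_numpy:.1f}x")
print(f"C++ vs pure Python:   {cpp_vs_python:.1f}x")
print(f"machines needed:      {machines:.2f}")
```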
11. Recap: Why Python?
• Python is slow, but numpy may be reasonably fast.
• Coding in C++ is time-consuming.
• C++ is only needed in the computing kernel.
• Most code is supportive code, but it must not slow down the
computing kernel.
• Python makes it easier to organize and structure the code.
This is why high-performance systems usually use a hybrid
architecture (C++ with Python or another scripting language).
12. Let’s go hybrid, but …
• A dilemma:
• Engineers (domain experts) know the problems but
don’t know C++ and software engineering.
• Computer scientists (programmers) know about C++
and software engineering but not the problems.
• Either side takes years of practice and study.
• Not a lot of people want to play both roles.
13. NSD: attempt to improve
• Numerical software development: a graduate-level
course
• Train computer scientists in the hybrid architecture
for numerical software
• https://github.com/yungyuc/nsd
• Runnable Jupyter notebooks
• Part 1: Start with Python
• Lecture 1: Introduction
• Lecture 2: Fundamental engineering practices
• Lecture 3: Python and numpy
• Part 2: Computer architecture for performance
• Lecture 4: C++ and computer architecture
• Lecture 5: Matrix operations
• Lecture 6: Cache optimization
• Lecture 7: SIMD
• Part 3: Resource management
• Lecture 8: Memory management
• Lecture 9: Ownership and smart pointers
• Part 4: How to write C++ for Python
• Lecture 10: Modern C++
• Lecture 11: C++ and C for Python
• Lecture 12: Array code in C++
• Lecture 13: Array-oriented design
• Part 5: Conclude with Python
• Lecture 14: Advanced Python
• Term project presentation
14. Memory hierarchy
• We go to C++ to make it easier to access hardware
• Modern computers have CPUs that are much faster than memory
• High performance comes from hiding the memory-access latency
registers (0 cycle)
L1 cache (4 cycles)
L2 cache (10 cycles)
L3 cache (50 cycles)
Main memory (200 cycles)
Disk (storage) (100,000 cycles)
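The effect of the hierarchy can be glimpsed even from Python. The sketch below sums the same number of elements twice: once from neighbouring memory and once with a stride that puts every element in a different cache line (assuming 64-byte lines). The sizes, the stride, and the helper `best_time` are assumptions for illustration, and the measured times depend on the machine.

```python
import time
import numpy as np

def best_time(func, repeat=5):
    """Return (result, best wall time in seconds) over a few repeats."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        t0 = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - t0)
    return result, best

n = 200_000
stride = 16  # 16 doubles = 128 bytes between consecutive strided elements
data = np.ones(n * stride, dtype="float64")

contiguous = data[:n]      # n neighbouring elements
strided = data[::stride]   # n elements spread over the whole buffer

total_contiguous, t_contiguous = best_time(contiguous.sum)
total_strided, t_strided = best_time(strided.sum)
print(f"contiguous: {t_contiguous * 1e6:.0f} us, strided: {t_strided * 1e6:.0f} us")
```

Both sums do the same arithmetic; any difference in time comes from how the data is fetched, which is the latency the slide refers to.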
15. Data object
• Numerical software processes
huge amounts of data. Copying
it is expensive.
• Use a pipeline to process the
same block of data
• Use an object to manage the
data: data object
• Data objects may not always be a
good idea in other fields.
• Here we do what it takes for
uncompromised
performance.
[Figure: one data object accessed at all phases of the pipeline: field initialization, interior time-marching, boundary condition, parallel data sync, finalization.]
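A minimal sketch of the data-object idea in Python, assuming numpy as the buffer holder: one object owns the array and each phase works on that same buffer instead of passing copies around. The class `Field`, its method names, and the grid size are hypothetical; parallel data sync and finalization are left out.

```python
import numpy as np

class Field:
    """Hypothetical data object: owns one buffer that every phase works on."""

    def __init__(self, nx):
        self.u = np.zeros((nx, nx), dtype="float64")

    def initialize(self):
        # Field initialization: u(1, y) = sin(pi y); the first index is x.
        y = np.linspace(0.0, 1.0, self.u.shape[1])
        self.u[-1, :] = np.sin(np.pi * y)

    def march_interior(self):
        # Interior update: the right-hand side is evaluated into a temporary
        # first, then written back into the same buffer (one Jacobi sweep).
        u = self.u
        u[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1]
                         + u[1:-1, 2:] + u[1:-1, :-2]) / 4

    def apply_boundary(self):
        # Boundary condition: re-impose the three zero-valued edges.
        self.u[0, :] = 0.0
        self.u[:, 0] = 0.0
        self.u[:, -1] = 0.0

def buffer_address(a):
    """Address of the first byte of the array's memory buffer."""
    return a.__array_interface__["data"][0]

field = Field(11)
address_before = buffer_address(field.u)
field.initialize()
for _ in range(10):
    field.march_interior()
    field.apply_boundary()
address_after = buffer_address(field.u)
print(address_before == address_after)
```

The arithmetic still creates a temporary for the new interior values; what the object guarantees here is only that the managed buffer itself is never reallocated or copied between phases.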
16. Zero-copy: do it where it fits
[Figure: a Python app (Ndarray) and a C++ app (C++ container) work on one memory buffer shared across languages (a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn). Two arrangements are shown, each marking which object manages the buffer and which accesses it: top (Python) - down (C++), and bottom (C++) - up (Python).]
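Zero-copy sharing can be tried without leaving Python. In the sketch below a standard-library `array.array` stands in for the C++ container that owns the memory (an assumption for illustration; the talk's setup uses real C++ containers), and `np.frombuffer` wraps the same memory as an ndarray without copying.

```python
import array
import numpy as np

# Stand-in for a buffer owned by a C++ container (assumption for this sketch).
owner = array.array("d", [0.0] * 6)

# Wrap the same memory as a 2x3 ndarray; no data is copied.
view = np.frombuffer(owner, dtype=np.float64).reshape(2, 3)

view[1, 2] = 42.0   # write through the ndarray ...
owner[0] = 7.0      # ... and through the owner

print(owner[5], view[0, 0])
```

Because both names refer to one buffer, a write through either side is visible from the other. The owner has to stay alive and keep its size for as long as the ndarray is in use.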
17. More detail …
Notes about moving from Python to C++
• Python frame object
• Building Python extensions using pybind11
and cmake
• Inspecting assembly code
• x86 intrinsics
• PyObject, CPython API and pybind11 API
• Shared pointer, unique pointer, raw pointer,
and ownership
• Template generic programming
https://tw.pycon.org/2020/en-us/events/talk/1164539411870777736/
18. How to learn
• Work on a real project.
• Keep in mind that pure Python can be 100x slower than C/C++ (176.6x in the Laplace example).
• Always profile (time).
• Don’t treat Python as simply Python.
• View Python as an interpreter library written in C.
• Use tools to call C/C++: Cython, pybind11, etc.
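A minimal sketch of "always profile (time)", using the standard-library `timeit` module. The two summation functions and the array size are made up for the illustration, and the measured ratio varies by machine.

```python
import timeit
import numpy as np

data = np.arange(100_000, dtype="float64")

def sum_loop(a):
    """Sum with an explicit Python loop; every iteration runs in the interpreter."""
    total = 0.0
    for x in a:
        total += x
    return total

def sum_numpy(a):
    """Sum inside numpy's compiled code."""
    return float(a.sum())

# Take the best of a few runs to reduce noise from other processes.
t_loop = min(timeit.repeat(lambda: sum_loop(data), number=1, repeat=3))
t_numpy = min(timeit.repeat(lambda: sum_numpy(data), number=1, repeat=3))
print(f"loop: {t_loop * 1e3:.2f} ms, numpy: {t_numpy * 1e3:.3f} ms, "
      f"ratio: {t_loop / t_numpy:.0f}x")
```

Measuring first shows which part of the code is worth moving to C/C++ and which part can stay in Python.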
19. What we want
[Figure: workflow diagram with the stages: See problems, Formulate the problems, Get something working, Automate, Prototype, Reusable software. Two transitions are marked with "?". Note: one-time programs may happen.]