4. Python's Speed Among Most Popular Languages
[Chart: speed comparison of C, C++, Java, Lisp, C#, Pascal, Python, Ruby, PHP, Perl]
Data sources (as of Oct. 17, 2011):
• http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
• http://shootout.alioth.debian.org/u64q/which-programming-languages-are-fastest.php
5. 7 Steps to Gain Speed
1) Find performance bottlenecks
2) Use better algorithms
3) Use faster tools
4) Write optimized code
5) Hire optimizers
6) Write your own extension modules
7) Parallelize the computation
8. Find Performance Bottlenecks
• Profile, don't guess
- profile
• a pure Python module
- cProfile
• written in C, new in Python 2.5
• same interface as profile, but lower overhead
- hotshot
• written in C, new in Python 2.2
• not maintained and might be removed
9. cProfile Usage
• cProfile.run('foo()')
• cProfile.run('foo()', 'profile.result')
• python -m cProfile -o profile.result myscript.py
• p = pstats.Stats('profile.result')
• p.sort_stats('cumulative').print_stats()
• sort by 'cumulative' to find what algorithms are taking time
• sort by 'time' to find what functions are taking time
• RunSnakeRun for GUI guys
• RTFM, please
• for IPython, type %prun?
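The calls above can be combined in a single script. This is a minimal sketch, assuming Python 3; `slow_sum` and `main` are hypothetical functions made up for illustration, and the report is written to an in-memory buffer instead of a `profile.result` file.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical hot spot: a plain Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    # Hypothetical entry point that calls the hot spot repeatedly.
    return [slow_sum(20000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Send the report to a string buffer so it can be inspected or logged.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats('cumulative').print_stats(5)  # top 5 entries by cumulative time
report = buffer.getvalue()
print(report)
```

Swapping 'cumulative' for 'time' in `sort_stats()` ranks functions by their own time instead.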
10. Line Profile
• line_profiler and kernprof
@profile
def slow_function():
    ...
$ kernprof.py -l -v script_to_profile.py
...
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def slow_function():
     3         1            3      3.0      0.2      s = 0
     4      1001          934      0.9     48.6      for i in xrange(1000):
     5      1000          984      1.0     51.2          s += i
     6         1            1      1.0      0.1      return s
12. How To Know Which is Better?
• timeit!
• python -m timeit -s "setup" "statement"
• e.g. which is faster, "d.has_key(k)" or "k in d"?
$ python -m timeit -s "d=dict(zip(range(1000), range(1000)))" "d.has_key(500)"
1000000 loops, best of 3: 0.223 usec per loop
$ python -m timeit -s "d=dict(zip(range(1000), range(1000)))" "500 in d"
10000000 loops, best of 3: 0.115 usec per loop
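The same kind of comparison can be run from inside a script with the timeit module. This is a sketch assuming Python 3, where dict.has_key() no longer exists, so the second candidate (`500 in d.keys()`) is a substitute chosen for illustration. The numbers depend on the machine, so none are shown.

```python
import timeit

SETUP = "d = dict(zip(range(1000), range(1000)))"
NUMBER = 100000

# Two ways to test membership (the second one is an illustrative substitute
# for the Python 2 only d.has_key(500)).
candidates = ["500 in d", "500 in d.keys()"]

results = {}
for stmt in candidates:
    # repeat() returns one total time per run; the minimum is the least noisy.
    best = min(timeit.repeat(stmt, setup=SETUP, repeat=3, number=NUMBER))
    results[stmt] = best / NUMBER * 1e6  # microseconds per loop

for stmt, usec in results.items():
    print("%-18s %.3f usec per loop" % (stmt, usec))
```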
17. Use Better Algorithms
• How to calculate sum([1, 2, ..., 100])?
s = 0
for i in range(101):        # 8.3 usec
    s += i
s = sum(range(101))         # 2.8 usec
s = sum(xrange(101))        # 2.03 usec
s = (1 + 100) * 100 / 2     # 0.109 usec
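The last line is the closed form n * (n + 1) / 2 with n = 100. A small sketch, assuming Python 3 (hence `//` to keep an integer result), checks that the formula and the loop agree; `sum_loop` and `sum_formula` are names invented for this illustration.

```python
def sum_loop(n):
    # O(n): add the numbers one by one.
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_formula(n):
    # O(1): closed form n * (n + 1) / 2; integer division keeps the result an int.
    return n * (n + 1) // 2


print(sum_loop(100), sum(range(101)), sum_formula(100))
```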
18. Advanced Data Types
• membership testing:
• set & dict: O(1) vs. tuple & list: O(n)
• return iterator instead of a large list
• array, collections.deque, heapq, bisect
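A rough illustration of two of the points above, assuming Python 3. The data (`names_list`, `scores`) is made up for the example.

```python
import bisect

names_list = ["user%d" % i for i in range(10000)]
names_set = set(names_list)

# Same answer either way; the list is scanned element by element (O(n)),
# while the set does a hash lookup (O(1) on average).
found_in_list = "user9999" in names_list
found_in_set = "user9999" in names_set

# bisect uses binary search (O(log n)) to find a position in a sorted list;
# insort() also inserts there, which still shifts the later elements.
scores = [10, 20, 40, 50]
bisect.insort(scores, 30)
position = bisect.bisect_left(scores, 40)

print(found_in_list, found_in_set, scores, position)
```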
19. Examples
lst = []
for i in xrange(10000):
    lst.insert(0, i)
lst = collections.deque()
for i in xrange(10000):
    lst.appendleft(i)              # 25317% faster
sorted(lst, reverse=True)[:10]
heapq.nlargest(10, lst)            # 613% faster
20. Do Less Computation
• Pre-computation
• Lazy computation
• Cache
• Approximation Algorithms
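The first two ideas can be sketched in a few lines, assuming Python 3. `squares` and `SQUARE_TABLE` are hypothetical names used only for illustration.

```python
import itertools


def squares():
    # Lazy computation: each value is produced only when the consumer asks for it.
    n = 0
    while True:
        yield n * n
        n += 1


# Pre-computation: build a lookup table once, then reuse it on every call.
SQUARE_TABLE = [n * n for n in range(256)]

first_five = list(itertools.islice(squares(), 5))
print(first_five, SQUARE_TABLE[12])
```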
21. Example
def fib(n):
    if n <= 1:
        return 1
    return fib(n-2) + fib(n-1)
fib(25): 59.8ms
22. Example
def cache(func):
    c = {}
    def _(n):
        r = c.get(n)
        if r is None:
            r = c[n] = func(n)
        return r
    return _
@cache
def fib(n):
    if n <= 1:
        return 1
    return fib(n-2) + fib(n-1)
fib(25): 59.8ms
with @cache: fib(25): 0.524us
112000 times faster!
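In Python 3.2 and later, the standard library offers functools.lru_cache, which can stand in for the hand-written decorator. This is a swapped-in alternative, not the method used on the slide, and no timing is claimed for it here.

```python
import functools


@functools.lru_cache(maxsize=None)  # maxsize=None: unbounded cache
def fib(n):
    if n <= 1:
        return 1
    return fib(n - 2) + fib(n - 1)


print(fib(25))
print(fib.cache_info())  # hit and miss counters maintained by lru_cache
```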
24. Use Faster Tools
• use iterator form
• range() -> xrange()
• map() -> itertools.imap()
• list comprehension -> generator expression
• dict.items() -> dict.iteritems()
• for i in range(len(seq)): ->
• for item in seq:
• for i, item in enumerate(seq):
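A short sketch of the loop and generator forms, assuming Python 3 (where range(), map() and dict.items() are already lazy, so the xrange/imap/iteritems renames do not apply). The sample data is made up.

```python
seq = ["a", "b", "c"]

# Index-based loop, shown only for contrast.
by_index = []
for i in range(len(seq)):
    by_index.append((i, seq[i]))

# enumerate() yields (index, item) pairs without manual indexing.
by_enumerate = [(i, item) for i, item in enumerate(seq)]

# A generator expression feeds sum() one value at a time,
# so no intermediate list is built.
total = sum(x * x for x in range(1000))

print(by_index, by_enumerate, total)
```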
25. Use Faster Tools
• SAX is faster and more memory-efficient than DOM
• use C version of modules
• profile -> cProfile
• StringIO -> cStringIO
• pickle -> cPickle
• ElementTree -> cElementTree / lxml
• select has lower overhead than poll (and epoll at low number of connections)
• numpy is essential for high-volume numeric work
26. numpy Example
from itertools import izip
a=range(1000)
b=range(1000)
c = [ai+bi for ai, bi in izip(a, b)]
import numpy
a=numpy.arange(1000)
b=numpy.arange(1000)
c = a + b
28. Write Optimized Code
• Fewer temporary objects
• e.g. accumulator vs. sum
• however, string concatenation has been optimized after Python 2.5
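One common instance of the temporary-object concern is building a string piece by piece. The sketch below (Python 3 assumed, sample data made up) only shows that both forms give the same result; which one is faster depends on the interpreter version, as the last bullet notes.

```python
parts = [str(i) for i in range(1000)]

# Repeated concatenation may create a new string object on each step.
joined_by_concat = ""
for p in parts:
    joined_by_concat += p

# str.join() builds the result in one pass over the parts.
joined_by_join = "".join(parts)

print(len(joined_by_join))
```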
29. Write Optimized Code
• use key= instead of cmp= when sorting
lst = open('/Users/hongqn/projects/shire/luzong/group.py').read().split()
lst.sort(cmp=lambda x, y: cmp(x.lower(), y.lower()))
lst.sort(key=str.lower)    # 377% faster
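Python 3 removed the cmp= argument, so a comparison function has to be wrapped with functools.cmp_to_key. The sketch below uses a made-up word list instead of the file on the slide; `compare_lower` is a hypothetical helper.

```python
import functools

words = ["banana", "Apple", "cherry", "apple", "Banana"]


def compare_lower(x, y):
    # Old-style comparison function: negative, zero, or positive.
    a, b = x.lower(), y.lower()
    return (a > b) - (a < b)


# cmp-style: compare_lower runs for every comparison the sort performs.
by_cmp = sorted(words, key=functools.cmp_to_key(compare_lower))

# key-style: str.lower runs once per element.
by_key = sorted(words, key=str.lower)

print(by_key)
```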
30. Write Optimized Code
• local variables are faster than global variables
def f():
    for i in xrange(10000):
        r = abs(i)
def f():
    _abs = abs
    for i in xrange(10000):
        r = _abs(i)
# 28% faster
• you can eliminate dots, too
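"Eliminating dots" means binding an attribute or method to a local name before the loop, so the lookup happens once. A minimal sketch, assuming Python 3; the function names are invented, and no speedup figure is claimed since it varies by interpreter version.

```python
import math


def with_dots(values):
    result = []
    for v in values:
        # result.append and math.sqrt are each looked up on every iteration.
        result.append(math.sqrt(v))
    return result


def without_dots(values):
    result = []
    append = result.append  # bind the bound method once
    sqrt = math.sqrt        # bind the function to a local name
    for v in values:
        append(sqrt(v))
    return result


print(without_dots([1, 4, 9]))
```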
31. Write Optimized Code
• inline functions inside time-critical loops
def f(x):
    return x + 1
for i in xrange(10000):
    r = f(i)
for i in xrange(10000):
    r = i + 1    # 187% faster
32. Write Optimized Code
• do not import modules in loops
for i in xrange(10000):
    import string
    r = string.lower('Python')
import string
for i in xrange(10000):
    r = string.lower('Python')
# 178% faster
33. Write Optimized Code
• list comprehensions are faster than for loops
lst = []
for i in xrange(10000):
    lst.append(i)
lst = [i for i in xrange(10000)]    # 213% faster
34. Write Optimized Code
• use "while 1" for time-critical loops (readability lost!)
a = 0
while True:
    a += 1
    if a > 10000:
        break
a = 0
while 1:
    a += 1
    if a > 10000:
        break
# 78% faster
35. Write Optimized Code
• "not not x" is faster than "bool(x)" (not recommended!)
bool([])
not not []    # 196% faster
37. Hire Optimizers
• sys.setcheckinterval()
• Python checks for thread switches and signal handling periodically (by default, every 100 Python virtual instructions)
• set it to a larger value for better performance at the cost of responsiveness
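Python 3 replaced the instruction-count setting with a time-based one: sys.setswitchinterval() takes seconds, and sys.setcheckinterval() was removed in Python 3.9. A sketch of the Python 3 equivalent follows; the 50 ms value is an arbitrary assumption for illustration, not a recommendation.

```python
import sys

# Remember the current setting so it can be restored afterwards.
original = sys.getswitchinterval()

# Assumption: 0.05 s (50 ms) is an arbitrary illustrative value. A larger
# interval means fewer thread switches but less responsive threads.
sys.setswitchinterval(0.05)
changed = sys.getswitchinterval()

sys.setswitchinterval(original)  # restore the previous setting
print(original, changed)
```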
41. Write Your Own Extension Modules
• Python/C API
• Official API
• ctypes
• Call dynamic link libraries from Python
• SWIG
• Automatically generate interface code
• Pyrex / Cython
• write extensions in a Python-like language
• Boost.Python
• C++ API
• Weave
• Inline C code
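Of the options above, ctypes needs no compiler, so it is the easiest to try. The sketch below assumes CPython 3 and calls a C function through ctypes.pythonapi, which exposes the running interpreter's own C API as a loaded library. Py_GetVersion is chosen only because it is available on every platform.

```python
import ctypes
import sys

# ctypes.pythonapi gives access to the C API of the running interpreter.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.restype = ctypes.c_char_p  # the C function returns const char *
get_version.argtypes = []              # and takes no arguments

version = get_version().decode()
print(version)
```

Declaring restype matters: without it ctypes assumes the C function returns an int.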
45. multiprocessing Example
sum(xrange(1, 10000001))    # 172ms
from multiprocessing import Pool
pool = Pool()
sum(pool.map(sum, (xrange(i, i+1000000)
                   for i in xrange(1, 10000000, 1000000))))
# 104ms on dual-core
# 49ms on 8-core
81. Exception Handling
• errno-like mechanism
• PyErr_SetString() / PyErr_SetObject() to set error
• PyErr_Occurred() to check error
• Most functions return an error indicator, e.g. NULL or -1
82. Global Interpreter Lock
• Release the GIL before running blocking C code
Py_BEGIN_ALLOW_THREADS
...Do some blocking I/O operation...
Py_END_ALLOW_THREADS
• Reacquire the GIL before calling into Python functions
PyGILState_STATE gstate;
gstate = PyGILState_Ensure();
/* Perform Python actions here. */
result = CallSomeFunction();
/* evaluate result */
/* Release the thread. No Python API allowed beyond this point. */
PyGILState_Release(gstate);