7. Me
• Mathematician at heart
• 3 years in the quant finance industry
• Last 2: statistics + freelance + open source
• My new company: Lambda Foundry
• Building analytics and tools for finance
and other domains
8. Me
• Blog: http://blog.wesmckinney.com
• GitHub: http://github.com/wesm
• Twitter: @wesmckinn
• Working on “Python for Data Analysis” for
O’Reilly Media
• Giving PyCon tutorial on pandas (!)
9. pandas?
• http://pandas.sf.net
• Swiss-army knife of (in-memory) data
manipulation in Python
• Like R’s data.frame on steroids
• Excellent performance
• Easy-to-use, highly consistent API
• A foundation for data analysis in Python
10. pandas
• In heavy production use in the financial
industry
• Generally much better performance than
other open source alternatives (e.g. R)
• Hope: basis for the “next generation” data
analytical environment in Python
11. Simplifying data wrangling
• Data munging / preparation / cleaning /
integration is slow, error prone, and time
consuming
• Everyone already <3’s Python for data
wrangling: pandas takes it to the next level
12. Explosive pandas growth
• Last 6 months: 240 files changed
49428 insertions(+), 15358 deletions(-)
Cython-generated C removed
13. Rigorous unit testing
• Need to be able to trust your $1e3/e6/e9s
to pandas
• > 98% line coverage as measured by
coverage.py
• v0.3.0 (2/19/2011): 533 test functions
• v0.7.0 (1/09/2012): 1272 test functions
14. Some development asides
• I get a lot of questions about my dev env
• Emacs + IPython FTW
• Indispensable development tools
• pdb (and IPython-enhanced pdb)
• pylint / pyflakes (integrated with Emacs)
• nose
• coverage.py
• grin, for searching code. >> ack/grep IMHO
15. IPython
• Matthew Goodman: “If you are not using
this tool, you are doing it wrong!”
• Tab completion, introspection, interactive
debugger, command history
• Designed to enhance your productivity in
every way. I can’t live without it
• IPython HTML notebook is a game changer
16. Profiling and optimization
• %time, %timeit in IPython
• %prun, to profile a statement with cProfile
• %run -p to profile whole programs
• line_profiler module, for line-by-line timing
• Optimization: find right algorithm first.
Cython-ize the bottlenecks (if need be)
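Outside IPython, roughly the same workflow can be sketched with the standard library's timeit and cProfile modules. This is only an approximation of what %timeit and %prun do, and slow_sum is a hypothetical workload standing in for the code under study.

```python
import cProfile
import io
import pstats
import timeit


def slow_sum(n):
    """Hypothetical workload: sum of squares with a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Rough analogue of %timeit: best of 3 repeats of 10 calls each.
best = min(timeit.repeat(lambda: slow_sum(10000), number=10, repeat=3))
per_call_ms = best / 10 * 1000

# Rough analogue of %prun: profile one call with cProfile and collect
# the report, sorted by cumulative time, into a string.
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(10000)
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

The report string has the same cumtime / filename:lineno(function) columns that %prun prints.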
17. Other things that matter
• Follow PEP8 religiously
• Naming conventions, other code style
• 80 characters per line hard limit
• Test more than you think you need to, aim
for 100% line coverage
• Avoid long functions (> 50 lines), refactor
aggressively
23. pandas DataFrame
• Jack-of-all-trades tabular data structure
In [10]: tips[:10]
Out[10]:
    total_bill  tip   sex     smoker  day  time    size
1   16.99       1.01  Female  No      Sun  Dinner  2
2   10.34       1.66  Male    No      Sun  Dinner  3
3   21.01       3.50  Male    No      Sun  Dinner  3
4   23.68       3.31  Male    No      Sun  Dinner  2
5   24.59       3.61  Female  No      Sun  Dinner  4
6   25.29       4.71  Male    No      Sun  Dinner  4
7   8.770       2.00  Male    No      Sun  Dinner  2
8   26.88       3.12  Male    No      Sun  Dinner  4
9   15.04       1.96  Male    No      Sun  Dinner  2
10  14.78       3.23  Male    No      Sun  Dinner  2
24. DataFrame
• Heterogeneous columns
• Data alignment and axis indexing
• No-copy data selection (!)
• Agile reshaping
• Fast joining, merging, concatenation
25. DataFrame
• Axis indexing enables rich data alignment,
joins / merges, reshaping, selection, etc.
               day  Fri    Sat    Sun    Thur
sex     smoker
Female  No          3.125  2.725  3.329  2.460
        Yes         2.683  2.869  3.500  2.990
Male    No          2.500  3.257  3.115  2.942
        Yes         2.741  2.879  3.521  3.058
26. Let’s have a little fun
To the IPython Notebook, Batman
http://ashleyw.co.uk/project/food-nutrient-database
27. Axis indexing, the special
pandas-flavored sauce
• Enables “alignment-free” programming
• Prevents major source of data munging
frustration and errors
• Fast (O(1) or O(log n)) data selection
• Powerful way of describing reshape / join /
merge / pivot-table operations
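A minimal sketch of what "alignment-free" means in practice, assuming two hypothetical Series whose labels only partly overlap and sit in different orders. The data is made up for illustration.

```python
import pandas as pd

# Two hypothetical series: labels overlap only partially and are
# stored in different orders.
s1 = pd.Series([1.0, 2.0, 3.0], index=["d", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

# Addition matches values by label, not by position; labels present
# on only one side come out as NaN.
total = s1 + s2
```

No manual reordering or lookup is needed before the arithmetic; the indexes carry that work.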
28. Data alignment, join ops
• The brains live in the axis index
• Indexes know how to do set logic
• Join/align ops: produce “indexers”
• Mapping between source/output
• Indexer passed to fast “take” function
29. Index join example
left    right   --JOIN-->   joined   lidx   ridx
d       a                   a        -1      0
b       b                   b         1      1
c       c                   c         2      2
e                           d         0     -1
                            e         3     -1
left_values.take(lidx, axis) -> reindexed data
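A small sketch of the "take" step, using the lidx from this slide. The numeric left_values and the take_with_fill helper are assumptions for illustration; plain ndarray.take treats -1 as "last element", so the missing slots have to be overwritten afterwards.

```python
import numpy as np

# Hypothetical values aligned with the left index d, b, c, e.
left_values = np.array([10.0, 20.0, 30.0, 40.0])
lidx = np.array([-1, 1, 2, 0, 3])  # indexer from the slide


def take_with_fill(values, indexer, fill_value=np.nan):
    """Reindex values by indexer, writing fill_value where indexer == -1."""
    out = values.take(indexer)       # -1 temporarily picks the last element
    out[indexer == -1] = fill_value  # overwrite those slots with the fill
    return out


reindexed = take_with_fill(left_values, lidx)
```

The result is ordered like the joined index a, b, c, d, e, with NaN for the label a that the left side lacks.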
30. Implementing index joins
• Completely irregular case: use hash tables
• Monotonic / increasing values
• Faster specialized left/right/inner/outer
join routines, especially for native types
(int32/64, datetime64)
• Lookup hash table is persisted inside the
Index object!
31. Um, hash table?
left    map (hash table)    joined   indexer
d       { d: 0,             a        -1
b         b: 1,             b         1
c         c: 2,             c         2
e         e: 3 }            d         0
                            e         3
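The lookup on this slide can be sketched with a plain Python dict standing in for the hash table. The real implementation is not shown in the talk; this only illustrates the idea.

```python
left = ["d", "b", "c", "e"]
joined = ["a", "b", "c", "d", "e"]

# Map each left label to its position (the hash table on the slide).
table = {label: i for i, label in enumerate(left)}

# Look up every joined label; labels missing from left get -1.
indexer = [table.get(label, -1) for label in joined]
```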
32. Hash tables
• Form the core of many critical pandas
algorithms
• unique (for set intersection / union)
• “factor”ize
• groupby
• join / merge / align
33. GroupBy, a brief
algorithmic exploration
• Simple problem: compute group sums for a
vector given group identifications
labels  values          unique labels  group sums
b       -1              a              2
b        3              b              4
a        2
a        3
b        2
a       -4
a        1
34. GroupBy: Algo #1
unique_labels = np.unique(labels)
results = np.empty(len(unique_labels))
for i, label in enumerate(unique_labels):
    results[i] = values[labels == label].sum()
For all these examples, assume N data
points and K unique groups
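For a concrete run, here is Algo #1 applied to the seven-point labels/values example from the problem statement (N = 7, K = 2), as best it can be read from the slide.

```python
import numpy as np

labels = np.array(["b", "b", "a", "a", "b", "a", "a"])
values = np.array([-1, 3, 2, 3, 2, -4, 1])

unique_labels = np.unique(labels)  # sorted unique labels: a, b
results = np.empty(len(unique_labels))
for i, label in enumerate(unique_labels):
    # One full comparison pass over labels per group.
    results[i] = values[labels == label].sum()
```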
35. GroupBy: Algo #1, don’t do this
unique_labels = np.unique(labels)
results = np.empty(len(unique_labels))
for i, label in enumerate(unique_labels):
    results[i] = values[labels == label].sum()
Some obvious problems
• O(N * K) comparisons. Slow for large K
• K passes through values
• numpy.unique is pretty slow (more on this later)
36. GroupBy: Algo #2
Make this dict in O(N) (pseudocode)
g_inds = {label : [i where labels[i] == label]}
Now
for i, label in enumerate(unique_labels):
    indices = g_inds[label]
    label_values = values.take(indices)
    result[i] = label_values.sum()
Pros: one pass through values. ~O(N) for N >> K
Cons: g_inds can be built in O(N), but too many
list/dict API calls, even using Cython
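One way to make the g_inds pseudocode concrete in pure Python, assuming a defaultdict of lists and the same small example data. The talk's version is in Cython; this sketch only shows the shape of the algorithm.

```python
from collections import defaultdict

import numpy as np

labels = np.array(["b", "b", "a", "a", "b", "a", "a"])
values = np.array([-1, 3, 2, 3, 2, -4, 1])

# Build {label: [positions]} in a single O(N) pass over the labels.
g_inds = defaultdict(list)
for i, label in enumerate(labels.tolist()):
    g_inds[label].append(i)

unique_labels = sorted(g_inds)
result = np.empty(len(unique_labels))
for i, label in enumerate(unique_labels):
    # Gather this group's values with one take, then reduce.
    result[i] = values.take(g_inds[label]).sum()
```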
37. GroupBy: Algo #3, much faster
• “Factorize” labels
• Produce vector of integers from 0, ..., K-1
corresponding to the unique observed
values (use a hash table)
result = np.zeros(k)
for i, j in enumerate(factorized_labels):
    result[j] += values[i]
Pros: avoid expensive dict-of-lists creation. Avoid
numpy.unique and have the option not to sort the
unique labels, skipping O(K lg K) work
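A pure-Python sketch of the factorize step, assuming a dict as the hash table. The factorize helper below is hypothetical and far slower than a Cython version; it only shows how the integer ids arise in first-seen (unsorted) order.

```python
import numpy as np


def factorize(labels):
    """Map labels to integer ids 0..K-1 in order of first appearance."""
    table = {}
    ids = np.empty(len(labels), dtype=np.int64)
    for i, label in enumerate(labels):
        # setdefault assigns the next free id the first time a label is seen.
        ids[i] = table.setdefault(label, len(table))
    uniques = list(table)  # dict insertion order (Python 3.7+) matches the ids
    return ids, uniques


labels = ["b", "b", "a", "a", "b", "a", "a"]
values = np.array([-1, 3, 2, 3, 2, -4, 1])

factorized_labels, uniques = factorize(labels)
result = np.zeros(len(uniques))
for i, j in enumerate(factorized_labels):
    result[j] += values[i]
```

In NumPy alone, np.bincount(factorized_labels, weights=values) computes the same sums without the Python loop.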
38. Speed comparisons
• Test case: 100,000 data points, 5,000 groups
• Algo 3, don’t sort groups: 5.46 ms
• Algo 3, sort groups: 10.6 ms
• Algo 2: 155 ms (14.6x slower)
• Algo 1: 10.49 seconds (990x slower)
• Algos 2/3 implemented in Cython
39. GroupBy
• Situation is significantly more complicated
in the multi-key case.
• More on this later
40. Algo 3, profiled
In [32]: %prun for _ in xrange(100): algo3_nosort()
cumtime filename:lineno(function)
0.592 <string>:1(<module>)
0.584 groupby_ex.py:37(algo3_nosort)
0.535 {method 'factorize' of 'DictFactorizer' objects}
0.047 {pandas._tseries.group_add}
0.002 numeric.py:65(zeros_like)
0.001 {method 'fill' of 'numpy.ndarray' objects}
0.000 {numpy.core.multiarray.empty_like}
0.000 {numpy.core.multiarray.empty}
Curious
41. Slaves to algorithms
• Turns out that numpy.unique works by
sorting, not a hash table. Thus O(N log N)
versus O(N)
• Takes > 70% of the runtime of Algo #2
• Factorize is the new bottleneck, possible to
go faster?!
42. Unique-ing faster
Basic algorithm using a dict, do this in Cython
table = {}
uniques = []
for value in values:
    if value not in table:
        table[value] = None # dummy
        uniques.append(value)
if sort:
    uniques.sort()
Performance may depend on the number of
unique groups (due to dict resizing)
43. Unique-ing faster
No Sort: at best ~70x faster, worst 6.5x faster
Sort: at best ~70x faster, worst 1.7x faster
45. Can we go faster?
• Python dict is renowned as one of the best
hash table implementations anywhere
• But:
• No ability to preallocate, subject to
arbitrary resizings
• We don’t care about reference counting,
throw away table once done
• Hm, what to do, what to do?
50. Gloves come off
with int64
PyObject* boxing / PyRichCompare obvious culprit
51. Some NumPy-fu
• Think about the sorted factorize algorithm
• Want to compute sorted unique labels
• Also compute integer ids relative to the
unique values, without making 2 passes
through a hash table!
sorter = uniques.argsort()
reverse_indexer = np.empty(len(sorter), dtype=np.int64)
reverse_indexer.put(sorter, np.arange(len(sorter)))
labels = reverse_indexer.take(labels)
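A worked instance of the reverse-indexer trick on made-up data, assuming uniques and labels come from an unsorted factorize. The integer dtype on reverse_indexer is spelled out so the relabelled ids stay integers.

```python
import numpy as np

# Hypothetical output of an unsorted factorize: ids point into
# uniques, which is in first-seen order.
uniques = np.array(["d", "b", "c"])
labels = np.array([0, 1, 0, 2, 1])  # d, b, d, c, b

sorter = uniques.argsort()  # positions that sort uniques
reverse_indexer = np.empty(len(sorter), dtype=np.int64)
reverse_indexer.put(sorter, np.arange(len(sorter)))  # old id -> sorted id

sorted_uniques = uniques.take(sorter)
sorted_labels = reverse_indexer.take(labels)  # ids relative to sorted_uniques
```

Only the K uniques are sorted; the N labels are relabelled with a single take.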
52. Aside, for the R community
• R’s factor function is suboptimal
• Makes two hash table passes
• unique: uniquify and sort
• match: ids relative to unique labels
• This is highly fixable
• R’s integer unique is about 40% slower than
my khash_int64 unique
53. Multi-key GroupBy
• Significantly more complicated because the
number of possible key combinations may
be very large
• Example, group by two sets of labels
• 1000 unique values in each
• “Key space”: 1,000,000, even though
observed key pairs may be small
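A sketch of the key-space problem and one way to compress it, assuming two already-factorized key columns with made-up ids. np.unique is used here only for brevity; it sorts, whereas a hash-table factorize would avoid that.

```python
import numpy as np

# Two factorized key columns (hypothetical ids), each with 1000
# possible values, but only four observed rows.
K1, K2 = 1000, 1000
key1 = np.array([3, 998, 3, 500])
key2 = np.array([7, 2, 7, 999])

# Combine both keys into one group index in the full key space.
group_index = key1 * K2 + key2  # values in [0, K1 * K2)

# Compress: relabel the observed group indexes as 0..n_observed-1.
observed, compressed = np.unique(group_index, return_inverse=True)
```

Downstream work can then be sized by the three observed groups instead of the 1,000,000 possible ones.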
55. Multi-GroupBy
• Pathological, but realistic example
• 50,000 values, 1e4 unique keys x 2, key
space 1e8
• Compress key space: 9.2 ms
• Don’t compress: 1.2s (!)
• I actually discovered this problem while
writing this talk (!!)
56. Speaking of performance
• Testing the correctness of code is easy:
write unit tests
• How to systematically test performance?
• Need to catch performance regressions
• Being mildly performance obsessed, I got
very tired of playing performance whack-a-
mole with pandas
57. vbench project
• http://github.com/wesm/vbench
• Run benchmarks for each version of your
codebase
• vbench checks out each revision of your
codebase, builds it, and runs all the
benchmarks you define
• Results stored in a SQLite database
• Only works with git right now
60. Fast database joins
• Problem: SQL-compatible left, right, inner,
outer joins
• Row duplication
• Join on index and / or join on columns
• Sorting vs. not sorting
• Algorithmically closely related to groupby
etc.
61. Row duplication
left            right            outer join
key   lvalue    key   rvalue     key   lvalue   rvalue
foo   1         foo   5          foo   1        5
foo   2         foo   6          foo   1        6
bar   3         bar   7          foo   2        5
baz   4         qux   8          foo   2        6
                                 bar   3        7
                                 baz   4        NA
                                 qux   NA       8
63. Join indexers
left            right            outer join
key   lvalue    key   rvalue     key   lidx   ridx
foo   1         foo   5          foo    0      0
foo   2         foo   6          foo    0      1
bar   3         bar   7          foo    1      0
baz   4         qux   8          foo    1      1
                                 bar    2      2
                                 baz    3     -1
                                 qux   -1      3
Problem: factorized keys need to be sorted!
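The lidx / ridx columns above can be reproduced with a simple dict-of-lists sketch. This is not the factorize-and-sort approach the talk builds toward; outer_join_indexers is a hypothetical helper that only illustrates row duplication and the -1 convention.

```python
left_keys = ["foo", "foo", "bar", "baz"]
right_keys = ["foo", "foo", "bar", "qux"]


def outer_join_indexers(left, right):
    """Return (lidx, ridx) for an outer join; -1 marks a missing row."""
    # Hash table: right key -> list of row positions holding that key.
    right_rows = {}
    for j, key in enumerate(right):
        right_rows.setdefault(key, []).append(j)

    lidx, ridx = [], []
    for i, key in enumerate(left):
        # Each left row pairs with every matching right row (row
        # duplication), or with -1 if there is no match.
        for j in right_rows.get(key, [-1]):
            lidx.append(i)
            ridx.append(j)

    # Right rows whose key never appears on the left.
    left_set = set(left)
    for j, key in enumerate(right):
        if key not in left_set:
            lidx.append(-1)
            ridx.append(j)
    return lidx, ridx


lidx, ridx = outer_join_indexers(left_keys, right_keys)
```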
64. An algorithmic observation
• If N values are known to be from the range
0 through K - 1, can be sorted in O(N)
• Variant of counting sort
• For our purposes, only compute the
sorting indexer (argsort)
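A minimal counting-sort argsort, assuming integer ids already known to lie in 0..K-1. The counting_argsort name and the Python loops are illustrative only; a production version would be written in Cython.

```python
import numpy as np


def counting_argsort(ids, k):
    """Stable argsort of integer ids in 0..k-1, in O(N + K) time."""
    counts = np.zeros(k, dtype=np.int64)
    for g in ids:
        counts[g] += 1

    # where[g] is the output slot for the next element of group g.
    where = np.zeros(k, dtype=np.int64)
    where[1:] = np.cumsum(counts)[:-1]

    indexer = np.empty(len(ids), dtype=np.int64)
    for i, g in enumerate(ids):
        indexer[where[g]] = i
        where[g] += 1
    return indexer


ids = np.array([2, 0, 1, 0, 2, 1, 0])
indexer = counting_argsort(ids, 3)
```

Taking ids with this indexer yields the ids in sorted order, with ties kept in their original relative order.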
65. Winning join algorithm
• Factorize key columns: O(K log K) if
sorting keys, O(N) if not sorting keys
• Compute / compress group indexes:
O(N) (refactorize)
• "Sort" by group indexes: O(N)
(counting sort)
• Compute left / right join indexers for
join method: O(N_output)
• Remap indexers relative to original row
ordering: O(N_output)
• Move data efficiently into output
DataFrame: O(N_output) (this step is
actually fairly nontrivial)
67. Join test case
• Left: 80k rows, 2 key columns, 8k unique
key pairs
• Right: 8k rows, 2 key columns, 8k unique
key pairs
• 6k matching key pairs between the tables,
many-to-one join
• One column of numerical values in each
68. Join test case
• Many-to-many case: stack right DataFrame
on top of itself to yield 16k rows, 2 rows
for each key pair
• Aside: sorting the unique keys dominates
the runtime (that pesky O(K log K)), not
included in these benchmarks
71. Results vs SQLite3
Absolute timings
* outer is LEFT OUTER in SQLite3
Note: In SQLite3 doing something like
72. DataFrame sort by columns
• Applied same ideas / tools to “sort by
multiple columns op” yesterday
73. The bottom line
• Just a flavor: pretty much all of pandas has
seen the same level of design effort and
performance scrutiny
• Make sure whoever implemented your data
structures and algorithms cares about
performance. A lot.
• Python has amazingly powerful and
productive tools for implementation work
74. Thanks!
• Follow me on Twitter: @wesmckinn
• Blog: http://blog.wesmckinney.com
• Exciting Python things ahead in 2012