What's new in pandas and the SciPy stack for financial users

Wes McKinney

Me
• AQR: August 2007 - July 2010

• Duke Statistics: 2010 - present (now on leave)

• My plans

• Improving Python libs for statistics and finance

• Building a financial software + consulting business
based on said tools

Core Python stack for finance
• NumPy, SciPy (heavy lifting)
• pandas (data handling / computation)
• IPython (dev and research env)
• Cython (perf optimization)
• matplotlib (visualization)
• statsmodels (statistics / econometrics)

General sentiments
• Scientific Python growing solidly in finance and
in many other fields

• Though good sci-pythonistas are still scarce

• Important work happening in many of the core
projects

• Growing consensus: a new computational
model is needed to better cope with “big data”

NumPy
• Significantly refactored C internals
• Great progress on native datetime64 type
• Will significantly improve date-handling
performance and usability
• Extensible business day / holiday logic
planned / in progress
• Addition of low-level missing data (NA)
support in the works
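
As a rough illustration of what native dates and business day / holiday logic look like in use, here is a minimal sketch. It assumes a NumPy release (1.7 or later) in which datetime64 and the busday functions are available; the one-day holiday calendar is invented for the example.

```python
import numpy as np

# Native dates: subtraction yields a timedelta64 value.
start = np.datetime64("2011-07-01")   # a Friday
end = np.datetime64("2011-07-08")
calendar_days = (end - start) / np.timedelta64(1, "D")

# Business days in the half-open range [start, end), skipping weekends only.
weekdays_only = np.busday_count(start, end)

# Hypothetical holiday calendar for the example (one US holiday).
cal = np.busdaycalendar(holidays=["2011-07-04"])
with_holiday = np.busday_count(start, end, busdaycal=cal)

# Roll a Saturday forward to the next valid business day under that calendar.
next_open = np.busday_offset("2011-07-02", 0, roll="forward", busdaycal=cal)
```

Passing a calendar object rather than a global setting keeps the holiday rules explicit at each call site.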

IPython

• One of Python’s killer apps gets even better
• Rich Qt GUI console with inline plotting
• New and improved architecture for high perf
parallel / distributed computing
• See Fernando Pérez’s SciPy 2011 talk / video

Cython
• Still the first tool you should reach for to get
better performance
• New: OpenMP integration (for multi-core)
from cython.parallel import prange

with nogil:
    for i in prange(n):
        # do something in parallel

• Supports (almost) all of standard Python now
(some things, like closures, used to not work)

statsmodels
• Statistics and econometrics in Python

• Major work in time series models over last year+

• VAR, SVAR models, eventually (V)ECM models
for cointegrated time series

• AR/ARMA, Kalman Filter, various macro filters
(e.g. Hodrick-Prescott) implemented

• Soon: Bayesian state space models (DLMs),
ARCH/GARCH models, etc.
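
To make the "macro filters" bullet concrete, here is a minimal NumPy sketch of the Hodrick-Prescott idea: the trend minimizes squared deviations from the data plus a penalty on its second differences. This is not the statsmodels implementation; the function name hp_filter is made up for the example, and the dense solve is only sensible for short series.

```python
import numpy as np

def hp_filter(y, lamb=1600.0):
    """Split y into (cycle, trend) using a dense Hodrick-Prescott solve."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    # Second-difference operator D with shape (n - 2, n).
    D = np.diff(np.eye(n), n=2, axis=0)
    # The trend t minimizes sum((y - t)**2) + lamb * sum((D @ t)**2),
    # whose first-order condition is (I + lamb * D'D) t = y.
    trend = np.linalg.solve(np.eye(n) + lamb * (D.T @ D), y)
    return y - trend, trend

# A purely linear series has zero second differences, so it is all trend.
x = np.arange(40, dtype=float)
cycle, trend = hp_filter(2.0 + 0.5 * x)
```

The default of 1600 is the conventional smoothing value for quarterly data; larger values give a smoother trend.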

statsmodels
• Major criticism: weak user interface
• No R-style formula framework
• pandas not integrated (need to pass raw
NumPy arrays)
• I have begun work on pandas integration;
formulas have been implemented and will
hopefully arrive within the next few months

pandas
• Still the Python data hacker’s best friend?
• Most recent release: 0.3.0 on 2/20/2011
• However, last 4 months have been the most
active development period in the library’s
history
• ~375 commits since 0.3.0 release (more than
the entire prior open source history)

Ambitious big picture

• I want to make pandas the cornerstone of the
“next generation” statistical computing
environment
• Ease-of-use, performance, flexibility all equally
important

Ambitious big picture

• Taking the best features of other languages (R
and friends) and making them better and
easier to use
• See my recent blog article “A Roadmap for
Rich Scientific Data Structures in Python”

pandas: under the hood
• Complete redesign of DataFrame internals
• Now a single class for 2D data retaining
optimal performance of old DataFrame and
DataMatrix classes
• Significantly improved mixed-type and missing
data handling
• Plan to use internal data structure to
implement “NDFrame” for n-dimensional data
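
A small sketch of what mixed-type and missing data handling means for the user, assuming a recent pandas release. The tickers and numbers are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "GOOG"],   # strings
    "price": [350.0, np.nan, 520.5],      # floats with one missing value
    "shares": [100, 200, 300],            # integers
})

# Each column keeps its own dtype inside the single DataFrame class.
dtypes = df.dtypes

# Missing values are skipped by default in reductions.
mean_price = df["price"].mean()
n_missing = int(df["price"].isnull().sum())
```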

Fancy indexing
• Index a Series / DataFrame in a matrix-like
way via special .ix attribute, use:
• Slices with integers or labels
• Lists of integers, labels, or boolean vecs
• Integer or label locations
df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']

df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
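
For readers on current pandas, note that the .ix attribute was later removed; the same selections are split between .loc (labels and boolean masks) and .iloc (integer positions). The sketch below assumes a recent pandas release and uses invented data.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2011-01-03", periods=6, freq="D")
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4) - 10.0,
                  index=dates, columns=list("ABCD"))

first_row = df.iloc[0]                 # integer location
window = df.loc[dates[1]:dates[3]]     # label slice, inclusive on both ends
block = df.iloc[:5].loc[:, "A":"C"]    # integer rows, then label columns

# Boolean row mask plus a list of column labels, with assignment.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Label slices include both endpoints, unlike integer slices, which is the main thing to watch when mixing the two styles.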

Misc new features
• “Sparse” (mostly NA) versions of Series,
DataFrame, WidePanel
• Many new functions on Series/DataFrame
• describe, quantile, select, drop, dropna,
corrwith, ...
• New moving window methods: rolling_quantile
and rolling_apply
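
A quick tour of several of the functions named above, as a sketch against a recent pandas release. In current pandas the moving window functions are spelled as methods on .rolling(...) rather than as rolling_quantile and rolling_apply; the data is invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "B": [2.0, 4.0, 6.0, 8.0, 10.0]})

summary = df.describe()             # count, mean, std, quartiles per column
median_b = df["B"].quantile(0.5)
complete = df.dropna()              # drop rows containing any NA
without_a = df.drop("A", axis=1)    # drop a column by label
corr = df.corrwith(df["B"])         # correlation of each column with one Series

# Moving-window quantile and an arbitrary function over a 3-row window.
roll_med = df["B"].rolling(3).quantile(0.5)
roll_range = df["B"].rolling(3).apply(lambda w: w.max() - w.min(), raw=True)
```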

Improved IO
• read_csv, read_table functions more
flexible and robust, better type inferencing

df = read_table('foo.txt', skiprows=[0,1],
                na_values=['#N/A'])

• ExcelFile class for reading multiple sheets
out of .xls files
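
The read_table call above can be tried without a file on disk. This sketch assumes a recent pandas release and feeds an in-memory string in place of foo.txt; the file contents are invented.

```python
import io
import pandas as pd

# Stand-in for a tab-delimited file: two junk lines, then header and data.
raw = (
    "exported by some vendor tool\n"
    "run date: 2011-07-01\n"
    "date\tprice\tvolume\n"
    "2011-06-29\t101.5\t1200\n"
    "2011-06-30\t#N/A\t900\n"
)

df = pd.read_table(io.StringIO(raw), skiprows=[0, 1],
                   na_values=["#N/A"], index_col=0, parse_dates=True)
```

The price column comes back as floats with one NaN, and the first column becomes a date index.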

Improved IO
• HDFStore class provides a complete, tested
dict-like PyTables storage container
store = HDFStore('mydata.h5')
store['x'] = x
store['y'] = y
y = store['y']

• Experimental: store as Table and query
store.put('df', df, table=True)
piece = store.select('df',
    [{'field' : 'index', 'op' : '>=',
      'value' : date}])

Group by enhancements
• Can group by multiple columns or key
functions, SQL-like but more general
• Syntactic sugar to invoke aggregation
functions on groups
• Automatic exclusion of “nuisance”
columns of DataFrames
• Various other usability enhancements
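
The group by features listed above might look like this in practice. This is a sketch against a recent pandas release, where the exclusion of non-numeric columns is requested explicitly with numeric_only=True; the trades table is invented.

```python
import pandas as pd

trades = pd.DataFrame({
    "desk": ["rates", "rates", "fx", "fx", "fx"],
    "side": ["buy", "sell", "buy", "buy", "sell"],
    "trader": ["ann", "bob", "cat", "dan", "eve"],   # non-numeric "nuisance" column
    "qty": [10, 20, 30, 40, 50],
})

# Group by several columns at once, then aggregate one column.
by_desk_side = trades.groupby(["desk", "side"])["qty"].sum()

# Sugar: common aggregations are methods on the grouped object.
mean_by_desk = trades.groupby("desk").mean(numeric_only=True)

# Group by a key function applied to each index label (here: even / odd row).
by_parity = trades["qty"].groupby(lambda i: i % 2).sum()
```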

Very soon: hierarchical indexing

• Enable axis ticks to be identified by multiple
labels instead of a single label
• Easily select subsets of data by “level”
• Create Excel-style pivot tables /
cross-tabulations in a sensible way
• Will integrate naturally with groupby

Other misc things

• Flexible binary operators
• a.add(b, fill_value=0.)
• Some timezone support in DateRange
• Numerous performance optimizations
• See the (long) release notes =)
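
To show what the flexible binary operators buy you over the plain + operator, here is a minimal sketch assuming a recent pandas release, with invented data.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "w"])

# Plain addition aligns on labels; a label missing on either side gives NaN.
plain = a + b

# With fill_value, the missing side is treated as 0 before adding.
filled = a.add(b, fill_value=0.0)
```

Only the label "y" is present in both Series, so it is the only non-NaN entry in the plain sum.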

Planned work

• Fast time series up/downsampling
• Improved support and perf for HF/tick data
• Even more sophisticated group by tools
• Better documentation, online screencast
tutorials / examples

Thanks

• Email: wesmckinn@gmail.com
• Twitter: @wesmckinn
• Blog: http://blog.wesmckinney.com
• pandas: http://github.com/wesm/pandas
• statsmodels: http://statsmodels.sourceforge.net
