Getting started with pandas

2. Pandas
• Python data analysis library
• Built on top of NumPy
• Panel Data System
• Open sourced by AQR Capital Management, LLC in late 2009
• 30,000 lines of tested Python/Cython code
• Used in production in many companies
Friday, May 18, 2012

3. The ideal tool for data scientists
• Munging data
• Cleaning data
• Analyzing data
• Modeling data
• Organizing the results of the analysis into a form suitable for plotting or tabular display

4. Installation
• Install Python 2.6.8 or later
• Current versions (May 2012):
  • NumPy 1.6.1 and pandas 0.7.3
• Recommendation: install with pip
    pip install numpy
    pip install pandas
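After installing, it can help to confirm which versions Python actually picks up. The snippet below is a minimal sketch; the `versions` dictionary is only an illustrative name, and the numbers printed will depend on your environment rather than matching the versions listed on the slide.

```python
import numpy as np
import pandas as pd

# Collect the installed versions; both packages expose a __version__ string.
versions = {"numpy": np.__version__, "pandas": pd.__version__}
print(versions)
```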

5. Axis Indexing
• Every axis has an index
• Highly optimized data structure
• Hierarchical indexing
• Group by and join-type operations
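To make hierarchical indexing a little more concrete, here is a small sketch that assumes a recent pandas release (the `pd.MultiIndex.from_tuples` constructor and `.loc` accessor may not match the 0.7.3 API from the talk). The city/year labels and the numbers are invented purely for illustration.

```python
import pandas as pd

# Hypothetical two-level index: (city, year). The values are made up.
index = pd.MultiIndex.from_tuples(
    [("Berlin", 2011), ("Berlin", 2012), ("Madrid", 2011), ("Madrid", 2012)],
    names=["city", "year"],
)
sales = pd.Series([10.0, 12.0, 7.0, 9.0], index=index)

# Selecting on the outer level returns a Series indexed by the inner level.
berlin = sales.loc["Berlin"]

# A group-by on one index level sums the values that share that label.
totals = sales.groupby(level="city").sum()
```

Here `berlin` is indexed by year only, and `totals` holds one sum per city.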

6. Series data structure
• 1-dimensional
>>> import numpy as np
>>> randn = np.random.randn
>>> from pandas import *
>>> s = Series(randn(3), index=['a','b','c'])
>>> s
a   -0.889880
b    1.102135
c   -2.187296
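The slide uses `from pandas import *`; a common alternative is to import pandas under a namespace so that it is clear where `Series` comes from. The sketch below assumes a recent NumPy (for `np.random.default_rng`) and a recent pandas (for `.iloc`); the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs produce the same three values.
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(3), index=["a", "b", "c"])

value_b = s["b"]    # label-based access
first = s.iloc[0]   # position-based access (same element as s["a"])
```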

7. Series to/from dict
>>> d = dict(s)
>>> d
{'a': -0.88988001423312313, 'c': -2.1872960440695666, 'b': 1.1021347373670938}
>>> Series(d)
a   -0.889880
b    1.102135
c   -2.187296
• Index comes from sorted dictionary keys
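The sorted-keys behaviour describes pandas at the time of the talk. As far as I know, newer pandas releases keep the dictionary's insertion order instead, so a sketch that sorts explicitly avoids depending on either behaviour. The names `unordered` and `restored` are illustrative only.

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])

# Series -> dict, keyed by index label.
d = s.to_dict()

# Build a dict whose keys are deliberately out of order.
unordered = {"c": d["c"], "a": d["a"], "b": d["b"]}

# dict -> Series, then sort the index explicitly instead of relying on
# whatever ordering the constructor applies.
restored = pd.Series(unordered).sort_index()
```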

8. Reindexing labels
>>> s
a   -0.496848
b    0.607173
c   -1.570596
>>> s.reindex(['c','b','a'])
c   -1.570596
b    0.607173
a   -0.496848
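Reindexing does more than reorder: a label that is absent from the original index shows up as a missing value unless a fill value is supplied. A minimal sketch, reusing the values from the slide and adding a hypothetical label `'d'`:

```python
import pandas as pd

s = pd.Series([-0.496848, 0.607173, -1.570596], index=["a", "b", "c"])

# 'd' does not exist in s, so its value becomes NaN after reindexing.
r = s.reindex(["c", "b", "a", "d"])

# fill_value replaces the NaN that would otherwise appear for new labels.
filled = s.reindex(["c", "b", "a", "d"], fill_value=0.0)
```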

9. Vectorization
>>> s + s
a   -1.779760
b    2.204269
c   -4.374592
>>> np.exp(s)
a    0.410705
b    3.010586
c    0.112220
• Series work with NumPy
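One detail worth seeing alongside vectorized arithmetic is that pandas matches elements by label, not by position. The sketch below uses made-up values and illustrative variable names to show this alignment, and that NumPy functions return a Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
t = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "a"])

# Addition aligns on labels: a -> 1 + 30, b -> 2 + 20, c -> 3 + 10.
total = s + t

# Labels present on only one side produce NaN in the result.
partial = s + pd.Series([100.0], index=["a"])

# NumPy ufuncs operate element-wise and keep the index.
expd = np.exp(s)
```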

10. Structured Data
• Data that can be represented as tables
• Rows and columns
• Each row is a different object
• Columns represent attributes of the object

11. Structured data
• Like a SQL table or an Excel sheet
• Heterogeneous columns, but each column homogeneously typed
• Row- and column-oriented operations
• Axis metadata
• Seamless integration with Python data structures and NumPy

12. DataFrame data structure
• Like data.frame in R
• 2-dimensional tabular data structure
• Data manipulation with integrated indexing
• Supports heterogeneous columns
• Each column is homogeneously typed
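The point about heterogeneous columns can be checked by inspecting `dtypes`. This is a small sketch with invented column names and values; the exact dtype reported for the text column may vary between pandas versions.

```python
import pandas as pd

# Hypothetical table: one text column, one integer column, one float column.
df = pd.DataFrame(
    {
        "name": ["x", "y", "z"],
        "count": [1, 2, 3],
        "score": [0.5, 1.5, 2.5],
    }
)

# Each column carries its own single dtype.
dtypes = df.dtypes
print(dtypes)
```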

13. DataFrame
>>> d = {'one': s*s, 'two': s+s}
>>> df = DataFrame(d)
>>> df
        one       two
a  0.791886 -1.779760
b  1.214701  2.204269
c  4.784264 -4.374592

14. DataFrame add column
>>> s
a   -0.889880
b    1.102135
c   -2.187296
>>> df['three'] = s * 3
>>> df
        one       two     three
a  0.791886 -1.779760 -2.669640
b  1.214701  2.204269  3.306404
c  4.784264 -4.374592 -6.561888

15. Select row by label
>>> row = df.xs('a')
>>> row
one      0.791886
two     -1.779760
three   -2.669640
Name: a
>>> type(row)
<class 'pandas.core.series.Series'>
>>> df.dtypes
one      float64
two      float64
three    float64
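For readers on a recent pandas release, the same row selection is usually written with the `.loc` accessor, which I believe was added after the 0.7.3 version covered here. The sketch below rebuilds the frame from the slides and checks that both spellings agree.

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])

# Rebuild the frame from the earlier slides, then add the third column.
df = pd.DataFrame({"one": s * s, "two": s + s})
df["three"] = s * 3

# Label-based row selection; the result is a Series named after the row label.
row = df.loc["a"]
same_row = df.xs("a")
```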

16. Descriptive statistics
>>> df.mean()
one      2.263617
two     -1.316694
three   -1.975041
• Also: count, sum, median, min, max, abs, prod, std, var, skew, kurt, quantile, cumsum, cumprod, cummax, cummin
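By default these statistics are computed per column. The following sketch, using the rounded values shown on the slides, illustrates the `axis` argument for per-row results and `describe()` for a quick summary; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "one": [0.791886, 1.214701, 4.784264],
        "two": [-1.779760, 2.204269, -4.374592],
    },
    index=["a", "b", "c"],
)

col_means = df.mean()         # one value per column (the default)
row_means = df.mean(axis=1)   # one value per row
summary = df.describe()       # count, mean, std, min, quartiles, max
```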

17. Computational Tools
• Covariance
>>> s1 = Series(randn(1000))
>>> s2 = Series(randn(1000))
>>> s1.cov(s2)
0.013973709323221539
• Also: correlation (pearson, kendall, spearman)
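Covariance and Pearson correlation are related by the two standard deviations, which the sketch below verifies numerically. It assumes a recent NumPy and pandas; the seed is arbitrary. Spearman correlation is computed here as the Pearson correlation of the ranks, because, as far as I know, `corr(method="spearman")` and `corr(method="kendall")` on a Series may require SciPy to be installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s1 = pd.Series(rng.standard_normal(1000))
s2 = pd.Series(rng.standard_normal(1000))

cov = s1.cov(s2)
pearson = s1.corr(s2)  # method="pearson" is the default

# Spearman correlation is the Pearson correlation of the rank-transformed data.
spearman = s1.rank().corr(s2.rank())

# Sanity check: cov(x, y) = corr(x, y) * std(x) * std(y)
reconstructed = pearson * s1.std() * s2.std()
```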

18. This and much more...
• Group by: split-apply-combine
• Merge, join and aggregate
• Reshaping and Pivot Tables
• Time Series / Date functionality
• Plotting with matplotlib
• IO Tools (Text, CSV, HDF5, ...)
• Sparse data structures
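As a taste of the first three items, here is a compact sketch of group-by, merge and pivot table on two small invented tables. The table names, column names and values are all hypothetical, and the calls assume a recent pandas release.

```python
import pandas as pd

# Hypothetical order and customer tables.
orders = pd.DataFrame(
    {
        "customer": ["ann", "bob", "ann", "cy"],
        "product": ["tea", "tea", "jam", "jam"],
        "amount": [3.0, 5.0, 2.0, 4.0],
    }
)
customers = pd.DataFrame(
    {"customer": ["ann", "bob", "cy"], "city": ["Oslo", "Rome", "Oslo"]}
)

# Split-apply-combine: total amount per customer.
per_customer = orders.groupby("customer")["amount"].sum()

# Join-type operation: attach each customer's city to their orders.
joined = orders.merge(customers, on="customer", how="left")

# Pivot table: cities as rows, products as columns, summed amounts as cells.
pivot = joined.pivot_table(
    index="city", columns="product", values="amount", aggfunc="sum"
)
```

Combinations with no orders, such as jam in Rome, appear as missing values in `pivot`.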

19. Resources
• http://pypi.python.org/pypi/pandas
• http://code.google.com/p/pandas

20. Book coming soon...
