pandas: a Foundational Python library for Data Analysis
                    and Statistics

                                   Wes McKinney


                            PyHPC 2011, 18 November 2011




Wes McKinney (@wesmckinn)          Data analysis with pandas   PyHPC 2011   1 / 25
An alternate title




       High Performance Structured Data
            Manipulation in Python




My background



     Former quant hacker at AQR Capital, now entrepreneur
     Background: math, statistics, computer science, quant finance.
     Shaken, not stirred
     Active in scientific Python community
     My blog: http://blog.wesmckinney.com
     Twitter: @wesmckinn
     Book! “Python for Data Analysis”, to hit the shelves later next year
     from O’Reilly




Structured data



      cname              year   agefrom      ageto             ls     lsc    pop   ccode
0     Australia          1950   15           19                64.3   15.4   558   AUS
1     Australia          1950   20           24                48.4   26.4   645   AUS
2     Australia          1950   25           29                47.9   26.2   681   AUS
3     Australia          1950   30           34                44     23.8   614   AUS
4     Australia          1950   35           39                42.1   21.9   625   AUS
5     Australia          1950   40           44                38.9   20.1   555   AUS
6     Australia          1950   45           49                34     16.9   491   AUS
7     Australia          1950   50           54                29.6   14.6   439   AUS
8     Australia          1950   55           59                28     12.9   408   AUS
9     Australia          1950   60           64                26.3   12.1   356   AUS




Structured data



     A familiar data model
           Heterogeneous columns or hyperslabs
           Each column/hyperslab is homogeneously typed
           Relational databases (SQL, etc.) are just a special case
     Need good performance in row- and column-oriented operations
     Support for axis metadata
     Data alignment is critical
     Seamless integration with Python data structures and NumPy




Structured data challenges



     Table modification: column insertion/deletion
     Axis indexing and data alignment
     Aggregation and transformation by group (“group by”)
     Missing data handling
     Pivoting and reshaping
     Merging and joining
     Time series-specific manipulations
     Fast IO: flat files, databases, HDF5, ...




Not all fun and games




     We care nearly equally about
           Performance
           Ease-of-use (syntax / API fits your mental model)
           Expressiveness
     Clean, consistent API design is hard and underappreciated




The big picture



     Build a foundation for data analysis and statistical computing
     Craft the most expressive / flexible in-memory data manipulation tool
     in any language
           Preferably one of the fastest, too
     Vastly simplify the data preparation, munging, and integration process
     Comfortable abstractions: master data-fu without needing to be a
     computer scientist
     Later: extend API with distributed computing backend for
     larger-than-memory datasets




pandas: a brief history




     Started building in April 2008 back at AQR
     Open-sourced (BSD license) mid-2009
     29075 lines of Python/Cython code as of yesterday, and growing fast
     Heavily tested, being used by many companies (inc. lots of financial
     firms) in production




Cython: getting good performance



     My tool of choice for writing performant code
     High level access to NumPy C API internals
     Buffer syntax/protocol abstracts away striding details of
     non-contiguous arrays, very low overhead vs. working with raw C
     pointers
     Reduce/remove interpreter overhead associated with working with
     Python data structures
     Interface directly with C/C++ code when necessary




Axis indexing




     Key pandas feature
     The axis index is a data structure itself, which can be customized to
     support things like:
           1-1 O(1) indexing with hashable Python objects
           Datetime indexing for time series data
           Hierarchical (multi-level) indexing
     Use Python dict to support O(1) lookups and O(n) realignment ops.
     Can specialize to get better performance and memory usage
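
The dict-backed index described above can be sketched in a few lines. This is a toy illustration, not the pandas implementation: ToyIndex, indexer_for and realign are hypothetical names invented here, and -1 is assumed as the marker for a missing label.

```python
import numpy as np


class ToyIndex:
    """Hypothetical axis index: a dict maps each label to its position."""

    def __init__(self, labels):
        self.labels = list(labels)
        # Building the mapping is O(n); each later lookup is O(1) on average.
        self.positions = {label: i for i, label in enumerate(self.labels)}

    def get_loc(self, label):
        return self.positions[label]

    def indexer_for(self, target):
        # One dict lookup per target label, so realignment is O(n).
        return np.array([self.positions.get(label, -1) for label in target],
                        dtype=np.intp)


def realign(values, index, target, fill=np.nan):
    """Reorder `values` to follow `target`, filling labels that are absent."""
    indexer = index.indexer_for(target)
    out = np.full(len(indexer), fill, dtype=float)
    found = indexer >= 0
    out[found] = np.asarray(values, dtype=float)[indexer[found]]
    return out


idx = ToyIndex(["a", "b", "c"])
aligned = realign([1.0, 2.0, 3.0], idx, ["c", "x", "a"])
```

Here "x" is not in the index, so its slot in aligned holds the fill value.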




Axis indexing



     Every axis has an index
     Automatic alignment between differently-indexed objects: makes it
     nearly impossible to accidentally combine misaligned data
     Hierarchical indexing provides an intuitive way of structuring and
     working with higher-dimensional data
     Natural way of expressing “group by” and join-type operations
     As good as, or in many cases much more integrated/flexible than,
     commercial or open-source alternatives to pandas/Python
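
A small example of automatic alignment, written against the current pandas API (which may differ in detail from the 2011 release). The labels and values are made up for illustration.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Addition matches values by label, not by position. Labels present in
# only one operand ("x" and "w") produce missing values in the result.
total = a + b
```

Only "y" and "z" appear in both inputs, so only those two entries of total are non-missing.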




The trouble with Python dicts...




     Python dict memory footprint can be quite large
           1MM key-value pairs: something like 70 MB on a 64-bit system
           Even though sizeof(PyObject*) == 8
     Python dict is great, but should use a faster, threadsafe hash table for
     primitive C types (like 64-bit integer)
     BUT: using a hash table only necessary in the general case. With
     monotonic indexes you don’t need one for realignment ops
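
To illustrate why monotonic indexes need no hash table, here is a sketch of an outer join done as a two-pointer merge. It assumes both inputs are sorted and free of duplicates; outer_join_sorted is a hypothetical helper written in plain Python for readability, whereas the talk's timings refer to compiled Cython code.

```python
import numpy as np


def outer_join_sorted(left, right):
    """Union of two sorted, duplicate-free arrays plus indexers.

    Each indexer gives, for every output label, its position in the
    corresponding input, or -1 if the label is absent from that input.
    """
    i = j = 0
    out, lidx, ridx = [], [], []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i]); lidx.append(i); ridx.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            out.append(left[i]); lidx.append(i); ridx.append(-1)
            i += 1
        else:
            out.append(right[j]); lidx.append(-1); ridx.append(j)
            j += 1
    # Drain whichever input still has labels left.
    while i < len(left):
        out.append(left[i]); lidx.append(i); ridx.append(-1)
        i += 1
    while j < len(right):
        out.append(right[j]); lidx.append(-1); ridx.append(j)
        j += 1
    return (np.array(out, dtype=np.int64),
            np.array(lidx, dtype=np.int64),
            np.array(ridx, dtype=np.int64))


union, lidx, ridx = outer_join_sorted(np.array([1, 3, 5]), np.array([2, 3, 6]))
```

The merge visits each input element once, so the cost is linear in the combined length and involves no hashing.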




Some alignment numbers



     Hardware: MacBook Pro Core i7 laptop, Python 2.7.2
     Outer-join 500k-length indexes chosen from 1MM elements
           Dict-based with random strings: 2.2 seconds
           Sorted strings: 400ms (5.5x faster)
           Sorted int64: 19ms (115x faster)
     Fortunately, time series data falls into this last category
     Alignment ops with C primitives could be fairly easily parallelized with
     OpenMP in Cython




DataFrame, the pandas workhorse




     A 2D tabular data structure with row and column indexes
     Hierarchical indexing is one way to support higher-dimensional data in a
     lower-dimensional structure
     Simplified NumPy type system: float, int, boolean, object
     Rich indexing operations, SQL-like join/merges, etc.
     Support heterogeneous columns WITHOUT sacrificing performance in
     the homogeneous (e.g. floating point only) case
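
A minimal example of a DataFrame with heterogeneous columns, using the first three rows of the table from the "Structured data" slide. It is written against the current pandas API, so the dtypes reported may not match what the 2011 release would have shown.

```python
import pandas as pd

df = pd.DataFrame({
    "cname": ["Australia", "Australia", "Australia"],
    "year": [1950, 1950, 1950],
    "agefrom": [15, 20, 25],
    "ls": [64.3, 48.4, 47.9],
    "pop": [558, 645, 681],
})

# Each column keeps its own type: strings, integers and floats side by side.
column_types = df.dtypes

# Selecting only the floating point columns yields a homogeneous frame.
floats_only = df.select_dtypes(include="float")
```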




DataFrame, under the hood




Supporting size mutability



     In order to have good row-oriented performance, need to store
     like-typed columns in a single ndarray
     “Column” insertion: accumulate 1 × N × ... homogeneous columns,
     later consolidate with other like-typed into a single block
     I.e. avoid reallocate-copy or array concatenation steps as long as
     possible
     Column deletions can be no-copy events (since ndarrays support
     views)
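
The deferred-consolidation idea can be sketched with plain NumPy. BlockTable is a hypothetical toy class, not the pandas internals: it assumes each inserted column becomes its own 1 × N block, and that consolidation stacks blocks sharing a dtype into one 2D array.

```python
import numpy as np


class BlockTable:
    """Hypothetical toy container: a list of (column_names, 2D ndarray) blocks."""

    def __init__(self):
        self.blocks = []

    def insert(self, name, values):
        # Cheap insertion: append a new 1 x N block, copy nothing else.
        arr = np.asarray(values)
        self.blocks.append(([name], arr.reshape(1, -1)))

    def consolidate(self):
        # Pay the copy cost once: stack all like-typed blocks together.
        merged = {}
        for names, arr in self.blocks:
            slot = merged.setdefault(arr.dtype, ([], []))
            slot[0].extend(names)
            slot[1].append(arr)
        self.blocks = [(names, np.vstack(arrs))
                       for names, arrs in merged.values()]

    def column(self, name):
        for names, arr in self.blocks:
            if name in names:
                # Row slicing returns a view into the block, not a copy.
                return arr[names.index(name)]
        raise KeyError(name)


table = BlockTable()
table.insert("a", [1.0, 2.0])
table.insert("b", [3.0, 4.0])
table.insert("n", [1, 2])
blocks_before = len(table.blocks)
table.consolidate()
blocks_after = len(table.blocks)
```

After consolidation the two float columns share one block and the integer column sits in another, so row-oriented access over the floats touches a single ndarray.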




Hierarchical indexing




     New this year, but really should have done long ago
     Natural result of multi-key groupby
     An intuitive way to work with higher-dimensional data
     Much less ad hoc way of expressing reshaping operations
     Once you have it, things like Excel-style pivot tables just “fall out”




Reshaping

In [5]: df.unstack('agefrom').stack('year')




Reshaping implementation nuances




     Must deal with unbalanced group sizes / missing data
     Play vectorization tricks with the NumPy C-contiguous memory
     layout: no Python for loops allowed
     Care must be taken to handle heterogeneous and homogeneous data
     cases




GroupBy




     High level process
           split data set into groups
           apply function to each group (an aggregation or a transformation)
           combine results intelligently into a result data structure
     Can be used to emulate SQL GROUP BY operations
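
The split / apply / combine steps can be spelled out next to the one-line form. This uses the current pandas API with toy data; the "NZL" rows and all values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ccode": ["AUS", "AUS", "NZL", "NZL"],
    "pop": [1, 2, 3, 4],  # toy values
})

# One-liner, roughly: SELECT ccode, SUM(pop) FROM df GROUP BY ccode
totals = df.groupby("ccode")["pop"].sum()

# The same thing step by step.
pieces = {}
for key, group in df.groupby("ccode"):   # split
    pieces[key] = group["pop"].sum()     # apply
manual = pd.Series(pieces)               # combine
```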




GroupBy



     Grouping closely related to indexing
     Create correspondence between axis labels and group labels using one
     of:
           Array of group labels (like a DataFrame column)
           Python function to be applied to each axis tick
     Can group by multiple keys
     For a hierarchically indexed axis, can select a level and group by that
     (or some transformation thereof)
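
The three ways of mapping axis labels to group labels listed above look like this in the current pandas API. Labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=["apple", "avocado", "banana", "blueberry"])

# 1. An array of group labels, one per row.
by_array = s.groupby(np.array(["x", "y", "x", "y"])).sum()

# 2. A Python function applied to each axis label (here: its first letter).
by_func = s.groupby(lambda label: label[0]).sum()

# 3. One level of a hierarchically indexed axis.
levels = pd.MultiIndex.from_product([[1950, 1955], [15, 20]],
                                    names=["year", "agefrom"])
s2 = pd.Series([1, 2, 3, 4], index=levels)
by_level = s2.groupby(level="year").sum()
```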




GroupBy implementation challenges


     Computing the group labels from arbitrary Python objects is very
     expensive
           77ms for 1MM strings with 1K groups
           107ms for 1MM strings with 10K groups
           350ms for 1MM strings with 100K groups
     To sort or not to sort (for iteration)?
           Once you have the labels, can reorder the data set in O(n) (with a
           much smaller constant than computing the labels)
           Roughly 35ms to reorder 1MM float64 data points given the labels
     (By contrast, computing the mean of 1MM elements takes 1.4ms)
     Python function call overhead is significant in cases with lots of small
     groups; much better (orders of magnitude speedup) to write
     specialized Cython routines
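
The two stages discussed above, computing integer group labels and then reordering the data in O(n) given those labels, can be sketched as follows. pd.factorize is the current pandas function for label computation; the counting-sort loop is an illustrative stand-in written in Python for clarity, where the talk's timings refer to compiled routines.

```python
import numpy as np
import pandas as pd

keys = np.array(["b", "a", "b", "c", "a"], dtype=object)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Stage 1 (the expensive part): hash arbitrary objects into integer labels.
labels, uniques = pd.factorize(keys)

# Stage 2: counting sort on the labels, O(n) with a small constant.
counts = np.bincount(labels, minlength=len(uniques))
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
order = np.empty(len(labels), dtype=np.intp)
next_slot = starts.copy()
for i, lab in enumerate(labels):
    order[next_slot[lab]] = i
    next_slot[lab] += 1

# With each group contiguous, a single pass aggregates every group.
sorted_values = values[order]
group_sums = np.add.reduceat(sorted_values, starts)
```

Labels are assigned in order of first appearance, so the groups here are "b", "a", "c" rather than alphabetical.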


Demo, time permitting





pandas: a Foundational Python Library for Data Analysis and Statistics

  • 1. pandas: a Foundational Python library for Data Analysis and Statistics Wes McKinney PyHPC 2011, 18 November 2011 Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 1 / 25
  • 2. An alternate title High Performance Structured Data Manipulation in Python Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 2 / 25
  • 3. My background Former quant hacker at AQR Capital, now entrepreneur Background: math, statistics, computer science, quant finance. Shaken, not stirred Active in scientific Python community My blog: http://blog.wesmckinney.com Twitter: @wesmckinn Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 3 / 25
  • 4. Structured data cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 4 / 25
  • 5. Structured data A familiar data model Heterogeneous columns or hyperslabs Each column/hyperslab is homogeneously typed Relational databases (SQL, etc.) are just a special case Need good performance in row- and column-oriented operations Support for axis metadata Data alignment is critical Seamless integration with Python data structures and NumPy Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 5 / 25
  • 6. Structured data challenges Table modification: column insertion/deletion Axis indexing and data alignment Aggregation and transformation by group (“group by”) Missing data handling Pivoting and reshaping Merging and joining Time series-specific manipulations Fast IO: flat files, databases, HDF5, ... Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 6 / 25
  • 7. Not all fun and games We care nearly equally about Performance Ease-of-use (syntax / API fits your mental model) Expressiveness Clean, consistent API design is hard and underappreciated Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 7 / 25
  • 8. The big picture Build a foundation for data analysis and statistical computing Craft the most expressive / flexible in-memory data manipulation tool in any language Preferably also one of the fastest, too Vastly simplify the data preparation, munging, and integration process Comfortable abstractions: master data-fu without needing to be a computer scientist Later: extend API with distributed computing backend for larger-than-memory datasets Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 8 / 25
  • 9. pandas: a brief history Starting building April 2008 back at AQR Open-sourced (BSD license) mid-2009 29075 lines of Python/Cython code as of yesterday, and growing fast Heavily tested, being used by many companies (inc. lots of financial firms) in production Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 9 / 25
  • 10. Cython: getting good performance My choice tool for writing performant code High level access to NumPy C API internals Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers Reduce/remove interpreter overhead associated with working with Python data structures Interface directly with C/C++ code when necessary Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 10 / 25
  • 11. Axis indexing Key pandas feature The axis index is a data structure itself, which can be customized to support things like: 1-1 O(1) indexing with hashable Python objects Datetime indexing for time series data Hierarchical (multi-level) indexing Use Python dict to support O(1) lookups and O(n) realignment ops. Can specialize to get better performance and memory usage Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 11 / 25
  • 12. Axis indexing Every axis has an index Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data Natural way of expressing “group by” and join-type operations As good or in many cases much more integrated/flexible than commercial or open-source alternatives to pandas/Python Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 12 / 25
  • 13. The trouble with Python dicts...
      • Python dict memory footprint can be quite large
          • 1MM key-value pairs: something like 70 MB on a 64-bit system
          • Even though sizeof(PyObject*) == 8
      • Python dict is great, but should use a faster, thread-safe hash table for primitive C types (like 64-bit integers)
      • BUT: a hash table is only necessary in the general case. With monotonic indexes you don’t need one for realignment ops
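The last point can be sketched as a two-pointer merge: when both indexes are sorted, an outer join needs no hash table at all. This is an illustrative plain-Python version, assuming sorted, duplicate-free inputs; the function name `outer_join_sorted` and the convention of -1 for "missing" are choices made for this example, not taken from the slides.

```python
def outer_join_sorted(left, right):
    """Outer-join two sorted, duplicate-free label lists without hashing.

    Returns (joined, left_indexer, right_indexer). An indexer entry of -1
    marks a joined label that is missing from that side.
    """
    joined, lidx, ridx = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            joined.append(left[i]); lidx.append(i); ridx.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            joined.append(left[i]); lidx.append(i); ridx.append(-1)
            i += 1
        else:
            joined.append(right[j]); lidx.append(-1); ridx.append(j)
            j += 1
    # At most one of the two inputs still has labels left over.
    while i < len(left):
        joined.append(left[i]); lidx.append(i); ridx.append(-1)
        i += 1
    while j < len(right):
        joined.append(right[j]); lidx.append(-1); ridx.append(j)
        j += 1
    return joined, lidx, ridx


joined, lidx, ridx = outer_join_sorted([1, 3, 5, 7], [3, 4, 7, 9])
print(joined, lidx, ridx)
```

Each label is visited once and only compared with its neighbor in the other list, so the work is linear and the memory access pattern is sequential.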
  • 14. Some alignment numbers
      • Hardware: MacBook Pro Core i7 laptop, Python 2.7.2
      • Outer-join 500k-length indexes chosen from 1MM elements
          • Dict-based with random strings: 2.2 seconds
          • Sorted strings: 400 ms (5.5x faster)
          • Sorted int64: 19 ms (115x faster)
      • Fortunately, time series data falls into this last category
      • Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython
  • 15. DataFrame, the pandas workhorse
      • A 2D tabular data structure with row and column indexes
      • Hierarchical indexing is one way to support higher-dimensional data in a lower-dimensional structure
      • Simplified NumPy type system: float, int, boolean, object
      • Rich indexing operations, SQL-like join/merges, etc.
      • Supports heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case
  • 16. DataFrame, under the hood
  • 17. Supporting size mutability
      • To get good row-oriented performance, like-typed columns need to be stored in a single ndarray
      • “Column” insertion: accumulate 1 × N × ... homogeneous columns, then later consolidate them with other like-typed columns into a single block
      • I.e. avoid reallocate-copy or array concatenation steps for as long as possible
      • Column deletions can be no-copy events (since ndarrays support views)
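A rough sketch of the insert-now, consolidate-later idea, using plain Python lists in place of ndarrays. The class `BlockFrame` and its layout are hypothetical and much simpler than what pandas does internally; the example also assumes every column is non-empty and single-typed, and it does not model no-copy deletion.

```python
class BlockFrame:
    """Toy column store in which like-typed columns are grouped into blocks."""

    def __init__(self):
        # Each block is (kind, names, columns). `columns` is a list of
        # equal-length lists, standing in for the rows of one 2D ndarray.
        self.blocks = []

    def insert(self, name, column):
        # Cheap path: append a new one-column block and leave the
        # existing blocks untouched (no reallocation, no concatenation).
        kind = type(column[0])
        self.blocks.append((kind, [name], [list(column)]))

    def consolidate(self):
        # Deferred path: merge all blocks of the same kind into one block.
        # With real ndarrays this is the step that would pay for the copy.
        merged = {}
        for kind, names, columns in self.blocks:
            m_names, m_columns = merged.setdefault(kind, ([], []))
            m_names.extend(names)
            m_columns.extend(columns)
        self.blocks = [(kind, n, c) for kind, (n, c) in merged.items()]


frame = BlockFrame()
frame.insert("a", [1.0, 2.0])
frame.insert("b", [1, 2])
frame.insert("c", [3.0, 4.0])
blocks_before = len(frame.blocks)
frame.consolidate()
blocks_after = len(frame.blocks)
print(blocks_before, blocks_after)
```

Three inserts leave three single-column blocks; consolidation folds the two float columns into one block, so several insertions share the cost of a single merge.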
  • 18. Hierarchical indexing
      • New this year, but really should have been done long ago
      • Natural result of multi-key groupby
      • An intuitive way to work with higher-dimensional data
      • Much less ad hoc way of expressing reshaping operations
      • Once you have it, things like Excel-style pivot tables just “fall out”
  • 19. Reshaping
  • 20. Reshaping
      In [5]: df.unstack('agefrom').stack('year')
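As a rough illustration of what unstacking and stacking do to a hierarchically indexed object, here is a plain-Python sketch in which the index is a tuple key. The functions `unstack` and `stack` below are simplified stand-ins written for this example, not the pandas methods, and they take a level position rather than a level name. The three values come from the Australia 1950 rows of the structured-data slide, keyed by (year, agefrom).

```python
def unstack(series, level):
    """Move one level of the tuple keys into the columns.

    `series` maps key tuples to values. Returns {row_key: {col_key: value}},
    where col_key is the key element at position `level` and row_key is the
    rest of the tuple. Missing combinations are simply absent.
    """
    table = {}
    for key, value in series.items():
        row_key = key[:level] + key[level + 1:]
        table.setdefault(row_key, {})[key[level]] = value
    return table


def stack(table):
    """Fold the column labels back in as the innermost key level."""
    return {row_key + (col,): value
            for row_key, row in table.items()
            for col, value in row.items()}


ls = {(1950, 15): 64.3, (1950, 20): 48.4, (1950, 25): 47.9}
wide = unstack(ls, level=1)   # rows keyed by year, one column per agefrom
tall = stack(wide)            # back to (year, agefrom) keys
print(wide)
print(tall)
```

With a hierarchical index, reshaping reduces to moving key levels between the row and column axes, which is why pivot-table style operations come almost for free.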
  • 21. Reshaping implementation nuances
      • Must deal with unbalanced group sizes / missing data
      • Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed
      • Care must be taken to handle both the heterogeneous and homogeneous data cases
  • 22. GroupBy
      • High-level process:
          • split the data set into groups
          • apply a function to each group (an aggregation or a transformation)
          • combine the results intelligently into a result data structure
      • Can be used to emulate SQL GROUP BY operations
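The split-apply-combine process can be sketched in a few lines of plain Python. This is only an illustration of the three steps; the function `group_apply` and the toy keys and values are assumptions made for the example, and pandas does considerably more work in the combine step.

```python
from collections import defaultdict


def group_apply(keys, values, func):
    """Split `values` by `keys`, apply `func` per group, combine into a dict."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):      # split
        groups[key].append(value)
    return {key: func(group)                  # apply, then combine
            for key, group in groups.items()}


keys = ["a", "b", "a", "b", "a"]
values = [1, 2, 3, 4, 5]
sums = group_apply(keys, values, sum)
sizes = group_apply(keys, values, len)
print(sums, sizes)
```

Passing `sum` mirrors a SQL `SELECT key, SUM(value) ... GROUP BY key`; any function that maps a list to a result can be swapped in.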
  • 23. GroupBy
      • Grouping is closely related to indexing
      • Create a correspondence between axis labels and group labels using one of:
          • Array of group labels (like a DataFrame column)
          • Python function to be applied to each axis tick
      • Can group by multiple keys
      • For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
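A small sketch of the two ways of mapping axis labels to group labels described above, in plain Python. The function `group_positions` and the toy axis are hypothetical; selecting a level of a hierarchical axis is shown here as a function that picks one element of each tuple label.

```python
from collections import defaultdict


def group_positions(axis_labels, by):
    """Map each group key to the positions of the axis labels in that group.

    `by` is either a sequence of group labels (one per axis label) or a
    function that is applied to each axis label.
    """
    if callable(by):
        keys = [by(label) for label in axis_labels]
    else:
        keys = list(by)
        if len(keys) != len(axis_labels):
            raise ValueError("need exactly one group label per axis label")
    groups = defaultdict(list)
    for position, key in enumerate(keys):
        groups[key].append(position)
    return dict(groups)


# A toy hierarchical axis with (year, agefrom) labels.
axis = [(1950, 15), (1950, 20), (1955, 15), (1955, 20)]
by_array = group_positions(axis, ["x", "y", "x", "y"])        # array of labels
by_level = group_positions(axis, lambda label: label[0])      # select a level
by_func = group_positions(axis, lambda label: label[1] < 20)  # transformation
print(by_array, by_level, by_func)
```

Grouping by multiple keys fits the same pattern: the group key simply becomes a tuple of the individual keys.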
  • 24. GroupBy implementation challenges
      • Computing the group labels from arbitrary Python objects is very expensive
          • 77 ms for 1MM strings with 1K groups
          • 107 ms for 1MM strings with 10K groups
          • 350 ms for 1MM strings with 100K groups
      • To sort or not to sort (for iteration)?
          • Once you have the labels, you can reorder the data set in O(n) (with a much smaller constant than computing the labels)
          • Roughly 35 ms to reorder 1MM float64 data points given the labels
          • (By contrast, computing the mean of 1MM elements takes 1.4 ms)
      • Python function call overhead is significant in cases with lots of small groups; it is much better (orders of magnitude speedup) to write specialized Cython routines
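The two steps discussed on this slide, computing integer group labels and then reordering the data in O(n) given those labels, can be sketched as follows. This is an illustrative plain-Python version that uses a dict for labeling and a counting sort for reordering; the function names `factorize` and `group_order` are chosen for the example, and the slides do not say which algorithm pandas actually uses for the reorder.

```python
def factorize(keys):
    """Return (labels, uniques): an integer group label for every key.

    This is the expensive step, because each key has to be hashed.
    """
    table = {}
    labels = []
    for key in keys:
        # A new key gets the next unused integer; a seen key keeps its own.
        labels.append(table.setdefault(key, len(table)))
    return labels, list(table)


def group_order(labels, ngroups):
    """Counting sort: positions that make each group contiguous, in O(n)."""
    counts = [0] * ngroups
    for label in labels:
        counts[label] += 1
    starts = [0] * ngroups
    for group in range(1, ngroups):
        starts[group] = starts[group - 1] + counts[group - 1]
    order = [0] * len(labels)
    for position, label in enumerate(labels):
        order[starts[label]] = position
        starts[label] += 1
    return order


keys = ["b", "a", "b", "c", "a"]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
labels, uniques = factorize(keys)
order = group_order(labels, len(uniques))
reordered = [values[i] for i in order]
print(labels, uniques, order, reordered)
```

Once the labels exist, the reorder only touches small integers, which is consistent with the slide's observation that it has a much smaller constant than computing the labels.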
  • 25. Demo, time permitting