SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
Financial data analysis in Python with pandas

                  Wes McKinney
                   @wesmckinn


                    10/17/2011




@wesmckinn ()     Data analysis with pandas   10/17/2011   1 / 22
My background




   3 years as a quant hacker at AQR, now consultant / entrepreneur
   Math and statistics background with the zest of computer science
   Active in scientific Python community
   My blog: http://blog.wesmckinney.com
   Twitter: @wesmckinn




    @wesmckinn ()          Data analysis with pandas      10/17/2011   2 / 22
Bare essentials for financial research



    Fast time series functionality
         Easy data alignment
         Date/time handling
         Moving window statistics
         Resamping / frequency conversion
    Fast data access (SQL databases, flat files, etc.)
    Data visualization (plotting)
    Statistical models
         Linear regression
         Time series models: ARMA, VAR, ...




     @wesmckinn ()            Data analysis with pandas   10/17/2011   3 / 22
Would be nice to have




   Portfolio and risk analytics, backtesting
        Easy enough to write yourself, though most people do a bad job of it
   Portfolio optimization
        Most financial firms use a 3rd party library anyway
   Derivative pricing
        Can use QuantLib in most languages




    @wesmckinn ()            Data analysis with pandas          10/17/2011   4 / 22
What are financial firms using?



   HFT: a C++ and hardware arms race, a different topic
   Research
        Mainstream: R, MATLAB, Python, ...
        Econometrics: Stata, eViews, RATS, etc.
        Non-programmatic environments: ClariFI, Palantir, ...
   Production
        Popular: Java, C#, C++
        Less popular, but growing: Python
        Fringe: Functional languages (Ocaml, Haskell, F#)




    @wesmckinn ()            Data analysis with pandas          10/17/2011   5 / 22
What are financial firms using?



   Many hybrid languages environments (e.g. Java/R, C++/R,
   C++/MATLAB, Python/C++)
        Which is the main implementation language?
        If main language is Java/C++, result is lower productivity and higher
        cost to prototyping new functionality
   Trends
        Banks and hedge funds are realizing that Java-based production
        systems can be replaced with 20% as much Python code (or less)
        MATLAB is being increasingly ditched in favor of Python. R and
        Python use for research generally growing




    @wesmckinn ()            Data analysis with pandas          10/17/2011   6 / 22
Python language



   Simple, expressive syntax
   Designed for readability, like “runnable pseudocode”
   Easy-to-use, powerful built-in types and data structures:
        Lists and tuples (fixed-size, immutable lists)
        Dicts (hash maps / associative arrays) and sets
   Everything’s an object, including functions
   “There should be one, and preferably only one way to do it”
   “Batteries included”: great general purpose standard library




    @wesmckinn ()            Data analysis with pandas         10/17/2011   7 / 22
A simple example: quicksort


Pseudocode from Wikipedia:

function qsort(array)
    if length(array) < 2
        return array
    var list less, greater
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot then append x to less
        else append x to greater
    return concat(qsort(less), pivot, qsort(greater))




     @wesmckinn ()           Data analysis with pandas   10/17/2011   8 / 22
A simple example: quicksort

First try Python implementation:
def qsort ( array ):
    if len ( array ) < 2:
        return array

    less , greater = [] , []

    pivot , rest = array [0] , array [1:]

    for x in rest :
        if x < pivot :
             less . append ( x )
        else :
             greater . append ( x )

    return qsort ( less ) + [ pivot ] + qsort ( greater )



      @wesmckinn ()         Data analysis with pandas       10/17/2011   9 / 22
A simple example: quicksort



Use list comprehensions:
def qsort ( array ):
    if len ( array ) < 2:
        return array

     pivot , rest = array [0] , array [1:]
     less = [ x for x in rest if x < pivot ]
     greater = [ x for x in rest if x >= pivot ]

     return qsort ( less ) + [ pivot ] + qsort ( greater )




      @wesmckinn ()         Data analysis with pandas        10/17/2011   10 / 22
A simple example: quicksort




Heck, fit it onto one line!
qs = lambda r : ( r if len ( r ) < 2
                  else ( qs ([ x for x in r [1:] if x < r [0]])
                         + [ r [0]]
                         + qs ([ x for x in r [1:] if x >= r [0]])))

Though that’s starting to look like Lisp code...




      @wesmckinn ()           Data analysis with pandas   10/17/2011   11 / 22
A simple example: quicksort


A quicksort using NumPy arrays
def qsort ( array ):
    if len ( array ) < 2:
        return array
    pivot , rest = array [0] , array [1:]
    less = rest [ rest < pivot ]
    greater = rest [ rest >= pivot ]
    return np . r_ [ qsort ( less ) , [ pivot ] , qsort ( greater )]

Of course no need for this when you can just do:

sorted_array = np.sort(array)




      @wesmckinn ()           Data analysis with pandas       10/17/2011   12 / 22
Python: drunk with power
This comic has way too much airtime but:




     @wesmckinn ()          Data analysis with pandas   10/17/2011   13 / 22
Staples of Python for science: MINS




   (M) matplotlib: plotting and data visualization
   (I) IPython: rich interactive computing and development environment
   (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random
   number generation, etc.
   (S) SciPy: optimization, probability distributions, signal processing,
   ODEs, sparse matrices, ...




     @wesmckinn ()          Data analysis with pandas        10/17/2011   14 / 22
Why did Python become popular in science?



   NumPy traces its roots to 1995
   Extremely easy to integrate C/C++/Fortran code
   Access fast low level algorithms in a high level, interpreted language
   The language itself
        “It fits in your head”
        “It [Python] doesn’t get in my way” - Robert Kern
   Python is good at all the things other scientific programming
   languages are not good at (e.g. networking, string processing, OOP)
   Liberal BSD license: can use Python for commercial applications




    @wesmckinn ()            Data analysis with pandas       10/17/2011   15 / 22
Some exciting stuff in the last few years


    Cython
         “Augmented” Python language with type declarations, for generating
         compiled extensions
         C-like speedups with Python-like development time
    IPython: enhanced interactive Python interpreter
         The best research and software development env for Python
         An integrated parallel / distributed computing backend
         GUI console with inline plotting and a rich HTML notebook (more on
         this later)
    PyCUDA / PyOpenCL: GPU computing in Python
         Transformed Python overnight into one of the best languages for doing
         GPU computing




     @wesmckinn ()            Data analysis with pandas        10/17/2011   16 / 22
Where has Python historically been weak?



   Rich data structures for data analysis and statistics
         NumPy arrays, while powerful, feel distinctly “lower level” if you’re
         used to R’s data.frame
         pandas has filled this gap over the last 2 years
   Statistics libraries
         Nowhere near the depth of R’s CRAN repository
         statsmodels provides tested implementations a lot of standard
         regression and time series models
         Turns out that most financial data analysis requires only fairly
         elementary statistical models




     @wesmckinn ()             Data analysis with pandas           10/17/2011    17 / 22
pandas library



    Began building at AQR in 2008, open-sourced late 2009
    Why
         R / MATLAB, while good for research / data analysis, are not suitable
         implementation languages for large-scale production systems
                (I personally don’t care for them for data analysis)
         Existing data structures for time series in R / MATLAB were too
         limited / not flexible enough my needs
    Core idea: indexed data structures capable of storing heterogeneous
    data
    Etymology: panel data structures




     @wesmckinn ()                Data analysis with pandas            10/17/2011   18 / 22
pandas in a nutshell



    A clean axis indexing design to support fast data alignment, lookups,
    hierarchical indexing, and more
    High-performance data structures
         Series/TimeSeries: 1D labeled vector
         DataFrame: 2D spreadsheet-like structure
         Panel: 3D labeled array, collection of DataFrames
    SQL-like functionality: GroupBy, joining/merging, etc.
    Missing data handling
    Time series functionality




     @wesmckinn ()              Data analysis with pandas    10/17/2011   19 / 22
pandas design philosophy



   “Think outside the matrix”: stop thinking about shape and start
   thinking about indexes
   Indexing and data alignment are essential
   Fault-tolerance: save you from common blunders caused by coding
   errors (specifically misaligned data)
   Lift the best features of other data analysis environments (R,
   MATLAB, Stata, etc.) and make them better, faster
   Performance and usability equally important




     @wesmckinn ()          Data analysis with pandas       10/17/2011   20 / 22
The pandas killer feature: indexing



    Each axis has an index
    Automatic alignment between differently-indexed objects: makes it
    nearly impossible to accidentally combine misaligned data
    Hierarchical indexing provides an intuitive way of structuring and
    working with higher-dimensional data
    Natural way of expressing “group by” and join-type operations
    Better integrated and more flexible indexing than anything available
    in R or MATLAB




     @wesmckinn ()           Data analysis with pandas       10/17/2011   21 / 22
Tutorial time




                     To the IPython console!




     @wesmckinn ()          Data analysis with pandas   10/17/2011   22 / 22

Contenu connexe

Tendances

Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandasPiyush rai
 
DESIGN AND ANALYSIS OF ALGORITHMS
DESIGN AND ANALYSIS OF ALGORITHMSDESIGN AND ANALYSIS OF ALGORITHMS
DESIGN AND ANALYSIS OF ALGORITHMSGayathri Gaayu
 
Lecture Note-1: Algorithm and Its Properties
Lecture Note-1: Algorithm and Its PropertiesLecture Note-1: Algorithm and Its Properties
Lecture Note-1: Algorithm and Its PropertiesRajesh K Shukla
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python PandasNeeru Mittal
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-iDr. Awase Khirni Syed
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquerKrish_ver2
 
Association rule Mining
Association rule MiningAssociation rule Mining
Association rule Miningafsana40
 
Asymptotic notations
Asymptotic notationsAsymptotic notations
Asymptotic notationsNikhil Sharma
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesAndrew Ferlitsch
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm AnalysisMary Margarat
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithmsRajendran
 
Algorithms Lecture 3: Analysis of Algorithms II
Algorithms Lecture 3: Analysis of Algorithms IIAlgorithms Lecture 3: Analysis of Algorithms II
Algorithms Lecture 3: Analysis of Algorithms IIMohamed Loey
 

Tendances (20)

Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandas
 
DESIGN AND ANALYSIS OF ALGORITHMS
DESIGN AND ANALYSIS OF ALGORITHMSDESIGN AND ANALYSIS OF ALGORITHMS
DESIGN AND ANALYSIS OF ALGORITHMS
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Lecture Note-1: Algorithm and Its Properties
Lecture Note-1: Algorithm and Its PropertiesLecture Note-1: Algorithm and Its Properties
Lecture Note-1: Algorithm and Its Properties
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python Pandas
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-i
 
Divide and conquer
Divide and conquerDivide and conquer
Divide and conquer
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
Association rule Mining
Association rule MiningAssociation rule Mining
Association rule Mining
 
Asymptotic notations
Asymptotic notationsAsymptotic notations
Asymptotic notations
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm Analysis
 
Time complexity
Time complexityTime complexity
Time complexity
 
Back propagation
Back propagation Back propagation
Back propagation
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithms
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
Algorithms Lecture 3: Analysis of Algorithms II
Algorithms Lecture 3: Analysis of Algorithms IIAlgorithms Lecture 3: Analysis of Algorithms II
Algorithms Lecture 3: Analysis of Algorithms II
 
Analysis of algorithm
Analysis of algorithmAnalysis of algorithm
Analysis of algorithm
 

Similaire à Python for Financial Data Analysis with pandas

SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionDataWorks Summit
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchainJie-Han Chen
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageMajid Abdollahi
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"Jihyun Ahn
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...Gezim Sejdiu
 
Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Fru Louis
 

Similaire à Python for Financial Data Analysis with pandas (20)

SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchain
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R Language
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
 
Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?
 

Plus de Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 

Plus de Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

Dernier

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Dernier (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Python for Financial Data Analysis with pandas

  • 1. Financial data analysis in Python with pandas Wes McKinney @wesmckinn 10/17/2011 @wesmckinn () Data analysis with pandas 10/17/2011 1 / 22
  • 2. My background 3 years as a quant hacker at AQR, now consultant / entrepreneur Math and statistics background with the zest of computer science Active in scientific Python community My blog: http://blog.wesmckinney.com Twitter: @wesmckinn @wesmckinn () Data analysis with pandas 10/17/2011 2 / 22
  • 3. Bare essentials for financial research Fast time series functionality Easy data alignment Date/time handling Moving window statistics Resamping / frequency conversion Fast data access (SQL databases, flat files, etc.) Data visualization (plotting) Statistical models Linear regression Time series models: ARMA, VAR, ... @wesmckinn () Data analysis with pandas 10/17/2011 3 / 22
  • 4. Would be nice to have Portfolio and risk analytics, backtesting Easy enough to write yourself, though most people do a bad job of it Portfolio optimization Most financial firms use a 3rd party library anyway Derivative pricing Can use QuantLib in most languages @wesmckinn () Data analysis with pandas 10/17/2011 4 / 22
  • 5. What are financial firms using? HFT: a C++ and hardware arms race, a different topic Research Mainstream: R, MATLAB, Python, ... Econometrics: Stata, eViews, RATS, etc. Non-programmatic environments: ClariFI, Palantir, ... Production Popular: Java, C#, C++ Less popular, but growing: Python Fringe: Functional languages (Ocaml, Haskell, F#) @wesmckinn () Data analysis with pandas 10/17/2011 5 / 22
  • 6. What are financial firms using? Many hybrid languages environments (e.g. Java/R, C++/R, C++/MATLAB, Python/C++) Which is the main implementation language? If main language is Java/C++, result is lower productivity and higher cost to prototyping new functionality Trends Banks and hedge funds are realizing that Java-based production systems can be replaced with 20% as much Python code (or less) MATLAB is being increasingly ditched in favor of Python. R and Python use for research generally growing @wesmckinn () Data analysis with pandas 10/17/2011 6 / 22
  • 7. Python language Simple, expressive syntax Designed for readability, like “runnable pseudocode” Easy-to-use, powerful built-in types and data structures: Lists and tuples (fixed-size, immutable lists) Dicts (hash maps / associative arrays) and sets Everything’s an object, including functions “There should be one, and preferably only one way to do it” “Batteries included”: great general purpose standard library @wesmckinn () Data analysis with pandas 10/17/2011 7 / 22
  • 8. A simple example: quicksort Pseudocode from Wikipedia: function qsort(array) if length(array) < 2 return array var list less, greater select and remove a pivot value pivot from array for each x in array if x < pivot then append x to less else append x to greater return concat(qsort(less), pivot, qsort(greater)) @wesmckinn () Data analysis with pandas 10/17/2011 8 / 22
  • 9. A simple example: quicksort First try Python implementation: def qsort ( array ): if len ( array ) < 2: return array less , greater = [] , [] pivot , rest = array [0] , array [1:] for x in rest : if x < pivot : less . append ( x ) else : greater . append ( x ) return qsort ( less ) + [ pivot ] + qsort ( greater ) @wesmckinn () Data analysis with pandas 10/17/2011 9 / 22
  • 10. A simple example: quicksort Use list comprehensions: def qsort ( array ): if len ( array ) < 2: return array pivot , rest = array [0] , array [1:] less = [ x for x in rest if x < pivot ] greater = [ x for x in rest if x >= pivot ] return qsort ( less ) + [ pivot ] + qsort ( greater ) @wesmckinn () Data analysis with pandas 10/17/2011 10 / 22
  • 11. A simple example: quicksort Heck, fit it onto one line! qs = lambda r : ( r if len ( r ) < 2 else ( qs ([ x for x in r [1:] if x < r [0]]) + [ r [0]] + qs ([ x for x in r [1:] if x >= r [0]]))) Though that’s starting to look like Lisp code... @wesmckinn () Data analysis with pandas 10/17/2011 11 / 22
  • 12. A simple example: quicksort A quicksort using NumPy arrays def qsort ( array ): if len ( array ) < 2: return array pivot , rest = array [0] , array [1:] less = rest [ rest < pivot ] greater = rest [ rest >= pivot ] return np . r_ [ qsort ( less ) , [ pivot ] , qsort ( greater )] Of course no need for this when you can just do: sorted_array = np.sort(array) @wesmckinn () Data analysis with pandas 10/17/2011 12 / 22
  • 13. Python: drunk with power This comic has way too much airtime but: @wesmckinn () Data analysis with pandas 10/17/2011 13 / 22
  • 14. Staples of Python for science: MINS (M) matplotlib: plotting and data visualization (I) IPython: rich interactive computing and development environment (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random number generation, etc. (S) SciPy: optimization, probability distributions, signal processing, ODEs, sparse matrices, ... @wesmckinn () Data analysis with pandas 10/17/2011 14 / 22
  • 15. Why did Python become popular in science? NumPy traces its roots to 1995 Extremely easy to integrate C/C++/Fortran code Access fast low level algorithms in a high level, interpreted language The language itself “It fits in your head” “It [Python] doesn’t get in my way” - Robert Kern Python is good at all the things other scientific programming languages are not good at (e.g. networking, string processing, OOP) Liberal BSD license: can use Python for commercial applications @wesmckinn () Data analysis with pandas 10/17/2011 15 / 22
  • 16. Some exciting stuff in the last few years Cython “Augmented” Python language with type declarations, for generating compiled extensions C-like speedups with Python-like development time IPython: enhanced interactive Python interpreter The best research and software development env for Python An integrated parallel / distributed computing backend GUI console with inline plotting and a rich HTML notebook (more on this later) PyCUDA / PyOpenCL: GPU computing in Python Transformed Python overnight into one of the best languages for doing GPU computing @wesmckinn () Data analysis with pandas 10/17/2011 16 / 22
  • 17. Where has Python historically been weak? Rich data structures for data analysis and statistics NumPy arrays, while powerful, feel distinctly “lower level” if you’re used to R’s data.frame pandas has filled this gap over the last 2 years Statistics libraries Nowhere near the depth of R’s CRAN repository statsmodels provides tested implementations a lot of standard regression and time series models Turns out that most financial data analysis requires only fairly elementary statistical models @wesmckinn () Data analysis with pandas 10/17/2011 17 / 22
  • 18. pandas library Began building at AQR in 2008, open-sourced late 2009 Why R / MATLAB, while good for research / data analysis, are not suitable implementation languages for large-scale production systems (I personally don’t care for them for data analysis) Existing data structures for time series in R / MATLAB were too limited / not flexible enough my needs Core idea: indexed data structures capable of storing heterogeneous data Etymology: panel data structures @wesmckinn () Data analysis with pandas 10/17/2011 18 / 22
  • 19. pandas in a nutshell A clean axis indexing design to support fast data alignment, lookups, hierarchical indexing, and more High-performance data structures Series/TimeSeries: 1D labeled vector DataFrame: 2D spreadsheet-like structure Panel: 3D labeled array, collection of DataFrames SQL-like functionality: GroupBy, joining/merging, etc. Missing data handling Time series functionality @wesmckinn () Data analysis with pandas 10/17/2011 19 / 22
  • 20. pandas design philosophy “Think outside the matrix”: stop thinking about shape and start thinking about indexes Indexing and data alignment are essential Fault-tolerance: save you from common blunders caused by coding errors (specifically misaligned data) Lift the best features of other data analysis environments (R, MATLAB, Stata, etc.) and make them better, faster Performance and usability equally important @wesmckinn () Data analysis with pandas 10/17/2011 20 / 22
  • 21. The pandas killer feature: indexing Each axis has an index Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data Natural way of expressing “group by” and join-type operations Better integrated and more flexible indexing than anything available in R or MATLAB @wesmckinn () Data analysis with pandas 10/17/2011 21 / 22
  • 22. Tutorial time To the IPython console! @wesmckinn () Data analysis with pandas 10/17/2011 22 / 22