Getting started with
                             Pandas

                                    Maik Röder
                          Barcelona Python Meetup Group
                                    17.05.2012




Friday, May 18, 2012
Pandas
                       • Python data analysis library
                       • Built on top of NumPy
                       • Panel Data System
                       • Open-sourced by AQR Capital
                         Management, LLC in late 2009
                       • 30,000 lines of tested Python/Cython code
                       • Used in production in many companies

The ideal tool for data scientists
                       • Munging data
                       • Cleaning data
                       • Analyzing data
                       • Modeling data
                       • Organizing the results of the analysis into a
                         form suitable for plotting or tabular display


Installation
                       • Install Python 2.6.8 or later
                       • Current versions (May 2012):
                        • NumPy 1.6.1 and pandas 0.7.3
                       • Recommendation: Install with pip
                         pip install numpy
                         pip install pandas
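A quick way to check that both packages installed correctly is to import them and print their versions. This is a minimal sketch; the `versions` dictionary is only an illustrative name, and the version numbers you see will depend on your environment.

```python
# Confirm that NumPy and pandas import cleanly and report their versions.
import numpy as np
import pandas as pd

# Collect the version strings exposed by each package.
versions = {"numpy": np.__version__, "pandas": pd.__version__}

for name, version in sorted(versions.items()):
    print(name, version)
```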



Axis Indexing

                       • Every axis has an index
                       • Highly optimized data structure
                       • Hierarchical indexing
                       • group by and join-type operations
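The points above can be illustrated with a small sketch of hierarchical indexing and a group-by on one index level. The data (sales per city and year) and all variable names are made up for illustration, and the code assumes a reasonably recent pandas release rather than the 0.7.3 version mentioned in the slides.

```python
import pandas as pd

# Hypothetical data: one value per (city, year) pair.
index = pd.MultiIndex.from_tuples(
    [("Barcelona", 2011), ("Barcelona", 2012),
     ("Madrid", 2011), ("Madrid", 2012)],
    names=["city", "year"],
)
sales = pd.Series([10, 12, 7, 9], index=index)

# Selecting on the outer level leaves the inner level (year) as the index.
barcelona = sales.loc["Barcelona"]

# Grouping on an index level aggregates across the other level.
per_city = sales.groupby(level="city").sum()

print(barcelona)
print(per_city)
```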


Series data structure
              • 1-dimensional labeled array
                       >>> import numpy as np
                       >>> from pandas import Series, DataFrame
                       >>> randn = np.random.randn
                       >>> s = Series(randn(3),
                       ...            index=['a','b','c'])
                       >>> s
                       a   -0.889880
                       b    1.102135
                       c   -2.187296


Series to/from dict
                       >>> d = dict(s)
                       >>> d
                       {'a': -0.88988001423312313,
                        'c': -2.1872960440695666,
                        'b': 1.1021347373670938}
                       >>> Series(d)
                       a   -0.889880
                       b    1.102135
                       c   -2.187296
                 • Index comes from sorted dictionary keys
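The round trip can also be written with fixed values so the result is reproducible. This is a sketch with invented numbers. Note that the sorted-keys behavior describes the pandas version used in the talk; newer pandas releases on current Python versions keep the dictionary's insertion order instead, so the sketch calls `sort_index()` to get a sorted index in either case.

```python
import pandas as pd

# Hypothetical values; keys deliberately not in alphabetical order.
d = {"c": 2.0, "a": 1.5, "b": -0.5}

from_dict = pd.Series(d)

# Sort explicitly rather than relying on version-specific key ordering.
ordered = from_dict.sort_index()

# Series.to_dict() converts back to a plain dictionary.
back = ordered.to_dict()

print(ordered)
```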
Reindexing labels
                       >>> s
                       a   -0.496848
                       b    0.607173
                       c   -1.570596
                       >>> s.reindex(['c','b','a'])
                       c   -1.570596
                       b    0.607173
                       a   -0.496848
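Reindexing is not limited to reordering: labels that are missing from the original index appear as NaN unless a fill value is given. The following sketch uses invented values and a hypothetical extra label `'d'`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# 'd' is not in the original index, so its value becomes NaN.
reordered = s.reindex(["c", "b", "a", "d"])

# fill_value replaces the NaN for labels that were missing.
filled = s.reindex(["c", "b", "a", "d"], fill_value=0.0)

print(reordered)
print(filled)
```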


Vectorization
                       >>> s + s
                       a   -1.779760
                       b    2.204269
                       c   -4.374592
                       >>> np.exp(s)
                       a    0.410705
                       b    3.010586
                       c    0.112220
                 • Series work with NumPy functions
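A related point worth sketching is that arithmetic between two Series aligns on index labels, not on position. The values and the second Series `t` below are invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0], index=["a", "b", "c"])

# Element-wise arithmetic, no explicit loop.
doubled = s + s

# NumPy ufuncs accept a Series and return a Series with the same index.
exponentials = np.exp(s)

# Operands are aligned by label; 'a' has no partner in t, so it becomes NaN.
t = pd.Series([10.0, 20.0], index=["b", "c"])
aligned = s + t

print(aligned)
```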
Structured Data
          • Data that can be represented as tables
           • rows and columns
          • Each row is a different object
          • Columns represent attributes of the object




Structured data
                       • Like SQL Table or Excel Sheet
                       • Heterogeneous columns, but each column
                         homogeneously typed
                       • Row and column-oriented operations
                       • Axis metadata
                       • Seamless integration with Python data
                         structures and NumPy


DataFrame data structure

                       • Like data.frame in R
                       • 2-dimensional tabular data structure
                       • Data manipulation with integrated indexing
                       • Supports heterogeneous columns
                       • Each column is homogeneously typed
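To make the heterogeneous-columns point concrete, here is a small sketch of a DataFrame whose columns have different types while each column keeps a single type. The table contents and the names `records`, `label`, `count`, and `score` are hypothetical.

```python
import pandas as pd

# Hypothetical table: one row per object, one column per attribute.
records = pd.DataFrame(
    {
        "label": ["x", "y", "z"],   # text column
        "count": [3, 5, 8],         # integer column
        "score": [0.5, 1.5, 2.5],   # floating-point column
    },
    index=["r1", "r2", "r3"],
)

# Each column carries its own dtype.
column_types = records.dtypes
print(column_types)
```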

DataFrame

                       >>> d = {'one': s*s,
                       ...      'two': s+s}
                       >>> df = DataFrame(d)
                       >>> df
                               one       two
                       a  0.791886 -1.779760
                       b  1.214701  2.204269
                       c  4.784264 -4.374592



DataFrame: add a column
                       >>> s
                       a   -0.889880
                       b    1.102135
                       c   -2.187296
                       >>> df['three'] = s * 3
                       >>> df
                               one       two     three
                       a  0.791886 -1.779760 -2.669640
                       b  1.214701  2.204269  3.306404
                       c  4.784264 -4.374592 -6.561888
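Column assignment works like setting a key in a dictionary, and columns can be derived from existing ones or deleted the same way. The sketch below uses fixed, invented values and a hypothetical boolean column named `big`.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [2.0, 4.0, 6.0]},
    index=["a", "b", "c"],
)

# New columns are appended on the right.
df["three"] = df["one"] * 3

# A column can hold the result of a comparison (boolean dtype).
df["big"] = df["one"] > 1.5

# Columns are removed with del, as with a dict.
del df["two"]

print(df)
```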
Select row by label
                 >>> row = df.xs('a')
                 >>> row
                 one      0.791886
                 two     -1.779760
                 three   -2.669640
                 Name: a
                 >>> type(row)
                 <class 'pandas.core.series.Series'>
                 >>> df.dtypes
                 one      float64
                 two      float64
                 three    float64
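In newer pandas releases, the `.loc` (label-based) and `.iloc` (position-based) indexers are the more common way to select a row; `xs` still works. The sketch below assumes such a release and uses invented values.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

by_label = df.loc["a"]       # select the row labeled 'a'
by_position = df.iloc[0]     # select the first row by position
cross_section = df.xs("a")   # the method used in the slide

# All three return the same Series, named after the row label.
print(by_label)
```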
Descriptive statistics
                       >>> df.mean()
                       one      2.263617
                       two     -1.316694
                       three   -1.975041
                 • Also: count, sum, median, min, max, abs, prod,
                       std, var, skew, kurt, quantile, cumsum,
                       cumprod, cummax, cummin
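By default these statistics are computed per column; passing `axis=1` computes them per row. The following sketch uses invented values and also shows `describe()`, which bundles several of the statistics listed above into one table.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

column_means = df.mean()        # one value per column
row_means = df.mean(axis=1)     # one value per row
running_totals = df.cumsum()    # cumulative sum down each column
summary = df.describe()         # count, mean, std, min, quartiles, max

print(summary)
```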


Computational Tools

                 • Covariance
                       >>> s1 = Series(randn(1000))
                       >>> s2 = Series(randn(1000))
                       >>> s1.cov(s2)
                       0.013973709323221539
                 • Also: correlation with s1.corr(s2), using the pearson,
                       kendall or spearman method
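Covariance and correlation can be checked on a small hand-computable example. The values below are invented. To avoid depending on any optional package, the Spearman coefficient is computed here as the Pearson correlation of the ranks, which is its definition, instead of passing `method='spearman'`.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 4.0, 6.0, 9.0])

# Sample covariance (divides by n - 1).
covariance = s1.cov(s2)

# Pearson correlation is the default method of Series.corr.
pearson = s1.corr(s2)

# Spearman correlation: Pearson correlation of the rank-transformed data.
spearman = s1.rank().corr(s2.rank())

print(covariance, pearson, spearman)
```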


This and much more...
                       • Group by: split-apply-combine
                       • Merge, join and aggregate
                       • Reshaping and Pivot Tables
                       • Time Series / Date functionality
                       • Plotting with matplotlib
                       • IO Tools (Text, CSV, HDF5, ...)
                       • Sparse data structures
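A few of these features can be sketched briefly: group by, merge, and pivot. The two tables (`orders`, `customers`) and their contents are hypothetical, and the code assumes a reasonably recent pandas release.

```python
import pandas as pd

# Hypothetical order lines and a customer lookup table.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "bob"],
    "product": ["tea", "tea", "coffee", "coffee"],
    "amount": [3.0, 4.0, 5.0, 6.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob"],
    "city": ["Barcelona", "Madrid"],
})

# Split-apply-combine: split by customer, sum amounts, combine the results.
totals = orders.groupby("customer")["amount"].sum()

# Join-type operation: attach the city to every order line.
merged = pd.merge(orders, customers, on="customer")

# Reshape: one row per customer, one column per product.
table = orders.pivot(index="customer", columns="product", values="amount")

print(totals)
print(table)
```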
Resources


                       • http://pypi.python.org/pypi/pandas
                       • http://code.google.com/p/pandas


Book coming soon...




