Data visualization tools in Python
Roman Merkulov
Data Scientist at InData Labs
r_merkulov@indatalabs.com
merkylovecom@mail.ru
Content
- why dataviz is important
- dataviz libraries in python
- facets tool
- interactive maps
- Apache Superset
data visualization
- EDA & understanding the data
- fix data
- show insights
- model validation
- analytics & reporting
Plots vs descriptive statistics
Anscombe's quartet
*https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Property | Value | Accuracy
Mean of x | 9 | exact
Sample variance of x | 11 | exact
Mean of y | 7.5 | 2 decimal places
Sample variance of y | 4.125 | ±0.003
Correlation coefficient | 0.816 | 3 decimal places
Linear regression | y = 3.00 + 0.5x | 2 decimal places
Coefficient of determination | 0.67 | 2 decimal places
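The summary values in the table can be checked with a few lines of standard-library Python. This is a minimal sketch rather than code from the talk: the four datasets are the published Anscombe values, while the `summarize` helper and the dictionary layout are names chosen here for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet; datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the descriptive statistics listed in the table for one dataset."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    # Sample covariance, computed by hand to stay on the standard library.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = sxy / vx
    r = sxy / (vx * vy) ** 0.5
    return {
        "mean_x": mx, "var_x": vx, "mean_y": my, "var_y": vy,
        "r": r, "slope": slope, "intercept": my - slope * mx, "r2": r * r,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four datasets print nearly identical summaries, which is exactly why the plots are needed to tell them apart.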
*http://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html
matplotlib
*https://matplotlib.org/gallery.html
Pros:
- very powerful
- large community, long history
Doesn’t look simple enough...
Cons:
- imperative API
- poor support for interactivity
Just to add a popup...
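To make the "imperative API" point concrete, here is a rough sketch of the task from the speaker notes (population vs area, coloured by region) in plain matplotlib. The records are made-up toy values, not data from the talk, and the `Agg` backend is chosen only so the script runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive "hardcopy" backend: renders to files, no window
import matplotlib.pyplot as plt

# Hypothetical toy records: (region, area, population). Values are illustrative only.
records = [
    ("Europe", 301, 60), ("Europe", 357, 83), ("Europe", 505, 47),
    ("Asia", 378, 126), ("Asia", 100, 51), ("Asia", 513, 69),
    ("Africa", 923, 190), ("Africa", 1001, 95), ("Africa", 580, 50),
]

fig, ax = plt.subplots()

# Imperative style: we loop over the groups ourselves and issue one
# scatter call per region so that each group gets its own colour and label.
for region in sorted({rec[0] for rec in records}):
    areas = [rec[1] for rec in records if rec[0] == region]
    populations = [rec[2] for rec in records if rec[0] == region]
    ax.scatter(areas, populations, label=region)

ax.set_xlabel("area")
ax.set_ylabel("population")
ax.legend(title="Region")

# Write the figure as PNG into memory; a file path would work the same way.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

The grouping, the loop, the labels and the legend are all spelled out by hand; that is the verbosity the slide refers to.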
matplotlib-based solutions
*https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017
http://yhat.github.io/ggpy/
http://scitools.org.uk/cartopy/docs/latest/gallery.html
https://seaborn.pydata.org/examples/index.html
https://networkx.github.io/documentation/networkx-1.9.1/examples/drawing/random_geometric_graph.html
JavaScript-based solutions
*https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017
folium
bqplot
plotly
*https://plot.ly/python/
Pros:
- interactivity
- lots of visualization types
- both declarative and imperative capabilities
Cons:
- paid features
bokeh
Pros:
- interactivity
- lots of visualization types
- both declarative and imperative capabilities
Cons:
- limited vector graphics export
Datashader
for when you have millions or even billions of points
NYC Taxi
US Census 2010
*https://datashader.readthedocs.io/en/latest/
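The idea behind rasterizing very large point sets can be sketched without Datashader itself. The code below is a pure-Python illustration of the projection, aggregation and transformation steps described in the speaker notes; it does not use the Datashader API, and the function names `aggregate` and `shade` as well as the random test data are assumptions made for this example.

```python
import math
import random


def aggregate(points, x_range, y_range, width, height):
    """Projection + aggregation: count the points that land in each grid cell."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # outside the plotted area
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade(grid):
    """Transformation: map counts to 0-255 intensities on a log scale."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(count) * scale) for count in row] for row in grid]


# Hypothetical data: 100,000 points from a 2D Gaussian blob.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

counts = aggregate(points, (-3, 3), (-3, 3), width=60, height=40)
image = shade(counts)  # 40 x 60 grid of pixel intensities
```

However many points go in, the result is a fixed-size grid, so the cost of drawing no longer depends on the size of the dataset.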
Altair
(based on Vega-Lite)
Fully declarative paradigm
*https://altair-viz.github.io/#
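As the speaker notes put it, Altair contains no rendering code and instead emits JSON following the Vega-Lite specification. The sketch below writes such a specification by hand using only the standard library, to show what "fully declarative" means in practice. The toy rows are invented, and the schema URL assumes Vega-Lite v5; in Altair itself the same chart would be expressed roughly as `alt.Chart(df).mark_point().encode(x="area", y="population", color="region")`.

```python
import json

# Hypothetical tidy data: one row per observation, one column per variable.
rows = [
    {"region": "Europe", "area": 301, "population": 60},
    {"region": "Europe", "area": 357, "population": 83},
    {"region": "Asia", "area": 378, "population": 126},
    {"region": "Africa", "area": 923, "population": 190},
]

# Declarative description: which column maps to which encoding channel.
# Nothing here says how to draw; a Vega-Lite renderer decides that.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "point",
    "encoding": {
        "x": {"field": "area", "type": "quantitative"},
        "y": {"field": "population", "type": "quantitative"},
        "color": {"field": "region", "type": "nominal"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Compared with the imperative approach, there is no loop over regions and no legend call: colouring by region is a single entry in the `encoding` mapping.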
Facets
Overview
Dive
Quick Draw Dataset https://pair-code.github.io/facets/quickdraw.html
*https://pair-code.github.io/facets/
https://github.com/PAIR-code/facets
https://research.googleblog.com/2017/07/facets-open-source-visualization-tool.html
Folium
*https://github.com/python-visualization/folium
https://indatalabs.com/discover-hong-kong-through-the-lense-of-instagram/
https://indatalabs.com/brands-on-london-instagram
Visualization of the week according to InsideBigData
https://insidebigdata.com/2016/02/03/visualization-of-the-week-hong-kong-social-media-data-map/
Apache Superset
*https://superset.incubator.apache.org/
Works with whatever database you have, if a SQLAlchemy dialect is available for it
*https://github.com/apache/incubator-superset
Who uses it:
Airbnb, Amino, Brilliant.org, Clark.de, Digit Game Studios, Douban, Endress+Hauser, FBK - ICT center, Faasos, GfK Data Lab, InData Labs, Maieutical Labs, Qunar, Shopkick, Tails.com, Tobii, Tooploox, Udemy, Yahoo!, Zalando
Project names over time: Panoramix → Caravel → Superset
Article on Superset benefits and limitations
https://indatalabs.com/blog/data-strategy/open-source-data-visualization-tool-superset
Roaring Elephant podcast, Episode 41
https://roaringelephant.org/2017/04/25/episode-41-news-news-and-some-more-news/
Thanks for your attention!
Some of the examples shown are available here:
https://github.com/merkylove/data_visualisations_for_datathon_2017
Contacts:
r_merkulov@indatalabs.com
merkylovecom@mail.ru
https://www.linkedin.com/in/roman-merkulov-a61804a4/

Editor's notes

  1. The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two variables correlated and following the assumption of normality. The second graph (top right) is not distributed normally; while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate. In the third graph (bottom left), the distribution is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth graph (bottom right) shows an example where one outlier is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables. The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets. The x values are the same for the first three datasets.
  2. https://ru.wikipedia.org/wiki/%D0%9A%D0%BE%D1%8D%D1%84%D1%84%D0%B8%D1%86%D0%B8%D0%B5%D0%BD%D1%82_%D0%B4%D0%B5%D1%82%D0%B5%D1%80%D0%BC%D0%B8%D0%BD%D0%B0%D1%86%D0%B8%D0%B8
  3. It's possible to generate bivariate data with a given mean, median, and correlation in any shape you like, even a dinosaur. The paper linked in the cited blog post describes a method of perturbing the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to the fixed target value. The shapes include a star, a cross, and the "DataSaurus".
  4. matplotlib pros:
     - designed like MATLAB
     - many output formats. A lot of documentation on the website and in the mailing lists refers to the “backend” and many new users are confused by this term. matplotlib targets many different use cases and output formats. Some people use matplotlib interactively from the python shell and have plotting windows pop up when they type commands. Some people embed matplotlib into graphical user interfaces like wxpython or pygtk to build rich applications. Others use matplotlib in batch scripts to generate postscript images from some numerical simulations, and still others in web application servers to dynamically serve up graphs. To support all of these use cases, matplotlib can target different outputs, and each of these capabilities is called a backend; the “frontend” is the user-facing code, i.e., the plotting code, whereas the “backend” does all the hard work behind the scenes to make the figure. There are two types of backends: user interface backends (for use in pygtk, wxpython, tkinter, qt4, or macosx; also referred to as “interactive backends”) and hardcopy backends to make image files (PNG, SVG, PDF, PS; also referred to as “non-interactive backends”).
     - can reproduce any plot
     - well-tested, 14 years as a standard tool
  5. I want population vs area coloured by Region
  6. matplotlib cons:
     - imperative and overly verbose API
     - sometimes poor style defaults
     - poor support for web views and interactions
     - often slow for large and complicated data
  7. Keep matplotlib as a backend and provide domain-specific APIs:
     - pandas: DataFrame object with plotting methods.
     - seaborn: focus on statistical visualization. Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. (more than 5 years)
     - ggplot: a Python implementation of the grammar of graphics. It is not intended to be a feature-for-feature port of ggplot2 for R, though there is much greatness in ggplot2 and the Python world could stand to benefit from it. So there will be feature overlap, but not necessarily mimicry (after all, R is a little weird).
     - cartopy: some of the key features are object-oriented projection definitions; point, line, polygon and image transformations between projections; integration to expose advanced mapping in matplotlib with a simple and intuitive interface; powerful vector data handling by integrating shapefile reading with Shapely capabilities. See http://proj4.org/ and http://trac.osgeo.org/geos/
     - networkx: NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. Features: data structures for graphs, digraphs, and multigraphs; many standard graph algorithms; network structure and analysis measures; generators for classic graphs, random graphs, and synthetic networks; nodes can be "anything" (e.g., text, images, XML records); edges can hold arbitrary data (e.g., weights, time-series); open source 3-clause BSD license; well tested with over 90% code coverage; additional benefits from Python include fast prototyping, easy to teach, and multi-platform.
     - scikit-plot: Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought. Gaining insights is simply a lot easier when you're looking at a colored heatmap of a confusion matrix complete with class labels rather than a single-line dump of numbers enclosed in brackets. Besides, if you ever need to present your results to someone (virtually any time anybody hires you to do data science), you show them visualizations, not a bunch of numbers in Excel. That said, there are a number of visualizations that frequently pop up in machine learning. Scikit-plot is a humble attempt to provide aesthetically-challenged programmers (such as myself) the opportunity to generate quick and beautiful graphs and plots with as little boilerplate as possible.
  8. Build an API that serializes the plot (usually to JSON) so that it can be displayed in the browser:
     - Bokeh: a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
     - Plotly: Plotly's Python graphing library makes interactive, publication-quality graphs online. Examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
     - toyplot: Plot types: bar plots, filled region plots, graph visualizations, image visualizations, line plots, matrix plots, numberline plots, scatter plots, tabular plots, text plots. Styling: standard CSS, rich text with HTML markup. Integrates with Jupyter without any need for plugins, magics, etc. Interaction types: display interactive mouse coordinates, export figure data to CSV. Interactive output formats: embeddable, self-contained HTML. Static output formats: SVG, PDF, PNG, MP4, WEBM. Portability: single code base for Python 2.7 / Python 3.6. Testing: greater-than-95% regression test coverage. Main feature: easy animations.
     - Cufflinks: this library binds the power of plotly with the flexibility of pandas for easy plotting.
     - ipyvolume: 3D plotting for Python in the Jupyter notebook based on IPython widgets using WebGL. Ipyvolume currently can: do volume rendering; create scatter plots (up to ~1 million glyphs); create quiver plots (like scatter, but with an arrow pointing in a particular direction); render in the Jupyter notebook, or create a standalone HTML page (or snippet to embed in your page); render in stereo, for virtual reality with Google Cardboard; animate in d3 style, for instance if the x coordinates or color of a scatter plot change; handle animations / sequences (all scatter/quiver plot properties can be a list of arrays, which can represent time snapshots). It is stylable (although still basic). It integrates with ipywidgets for adding GUI controls (sliders, buttons, etc.; see an example at the documentation homepage), with bokeh by linking the selection, and with bqplot by linking the selection. Ipyvolume will probably, but does not yet: render labels in LaTeX; do isosurface rendering; do selections using mouse or touch; show a custom popup on hovering over a glyph.
  9. Plotly:
     - Python, R, MATLAB, JS
     - charts, dashboards, slides
     - every chart that matplotlib or MATLAB graphics can do
     - interactive charts and maps out of the box
     - get started working offline
     - optional hosted sharing platform through Plotly On-Premises or Plotly Cloud
     - built on top of d3.js
     - streaming API (paid)
     - community, chat, email, phone support (depends on plan)
     - public/private charts, dashboards, slides (depends on plan)
     - PNG, JPEG, PDF, SVG, EPS, HTML export (depends on plan)
     - connect to 7-18 sources (depends on plan)
  10. Python, R, Scala, Julia. Bokeh, a Python interactive visualization library, enables beautiful and meaningful visual presentation of data in modern web browsers. With Bokeh, you can quickly and easily create interactive plots, dashboards, and data applications. Bokeh helps provide elegant, concise construction of novel graphics in the style of D3.js, while also delivering high-performance interactivity over very large or streaming datasets.
  11. Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data. Datashader breaks the creation of images of data into 3 main steps:
      - Projection: each record is projected into zero or more bins of a nominal plotting grid shape, based on a specified glyph.
      - Aggregation: reductions are computed for each bin, compressing the potentially large dataset into a much smaller aggregate array.
      - Transformation: these aggregates are then further processed, eventually creating an image.

      Using this very general pipeline, many interesting data visualizations can be created in a performant and scalable way. Datashader contains tools for easily creating these pipelines in a composable manner, using only a few lines of code. Datashader can be used on its own, but it is also designed to work as a pre-processing stage in a plotting library, allowing that library to work with much larger datasets than it would otherwise.

      Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly. Datashader breaks the creation of images into a series of explicit steps that allow computations to be done on intermediate representations. This approach allows accurate and effective visualizations to be produced automatically, and also makes it simple for data scientists to focus on particular data and relationships of interest in a principled way. Using highly optimized rendering routines written in Python but compiled to machine code using Numba, datashader makes it practical to work with extremely large datasets even on standard hardware. https://datashader.readthedocs.io/en/latest/
  12. Altair is a declarative statistical visualization library for Python, based on Vega-Lite. With Altair, you can spend more time understanding your data and its meaning. Altair’s API is simple, friendly and consistent and built on top of the powerful Vega-Lite visualization grammar. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code. Note: Altair and the underlying Vega-Lite library are under active development; new plot types and streamlined plotting interfaces will be added in future releases. Please stay tuned for developments in the coming months! – October 2016 The key idea is that you are declaring links between data columns to encoding channels, such as the x-axis, y-axis, color, etc. and the rest of the plot details are handled automatically. Building on this declarative plotting idea, a surprising number of useful plots and visualizations can be created. One of the unique design philosophies of Altair is that it leverages the Vega-Lite specification to create “beautiful and effective visualizations with minimal amount of code.” What does this mean? The Altair site explains it well: Altair provides a Python API for building statistical visualizations in a declarative manner. By statistical visualization we mean: The data source is a DataFrame that consists of columns of different data types (quantitative, ordinal, nominal and date/time). The DataFrame is in a tidy format where the rows correspond to samples and the columns correspond the observed variables. The data is mapped to the visual properties (position, color, size, shape, faceting, etc.) using the group-by operation of Pandas and SQL. The Altair API contains no actual visualization rendering code but instead emits JSON data structures following the Vega-Lite specification. For convenience, Altair can optionally use ipyvega to display client-side renderings seamlessly in the Jupyter notebook. 
Where Altair differentiates itself from some of the other tools is that it attempts to interpret the data passed to it and make some reasonable assumptions about how to display it. By making reasonable assumptions, the user can spend more time exploring the data than trying to figure out a complex API for displaying it. To illustrate this point, here is one very small example of where Altair differs from matplotlib when charting values. In Altair, if I plot a value like 10,000,000, it will display it as 10M, whereas default matplotlib plots it in scientific notation (1.0 × 1e7). Obviously it is possible to change the format, but trying to figure that out takes away from interpreting the data. You will see more of this behavior in the examples below. The Altair documentation is an excellent series of notebooks and I encourage folks interested in learning more to check it out. Before going any further, I wanted to highlight one other unique aspect of Altair related to the data format it expects. As described above, Altair expects all of the data to be in tidy format. The general idea is that you wrangle your data into the appropriate format, then use the Altair API to perform various grouping or other data summary techniques for your specific situation. For new users, this may take some time getting used to. However, I think in the long run it is a good skill to have, and the investment in the data wrangling (if needed) will pay off in the end by enforcing a consistent process for visualizing data. If you would like to learn more, I found this article to be a good primer for using pandas to get data into the tidy format. Vega is a visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs. With Vega, you can describe the visual appearance and interactive behavior of a visualization in a JSON format, and generate web-based views using Canvas or SVG.
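As a small illustration of what "tidy format" means, the sketch below reshapes a wide table (one column per year) into one observation per row. It is written in plain Python with a hypothetical `melt` helper and made-up numbers; in practice pandas offers a `melt` function for the same kind of reshaping.

```python
def melt(rows, id_key, var_name, value_name):
    """Reshape wide rows into tidy rows: one observation per output row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy

# Made-up wide data: each year is its own column.
wide = [
    {"city": "A", "2016": 10, "2017": 12},
    {"city": "B", "2016": 7, "2017": 9},
]
tidy = melt(wide, id_key="city", var_name="year", value_name="sales")
```

After reshaping, "year" is an ordinary column, so it can be mapped to an encoding channel such as the x-axis or color like any other variable.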
Vega (version 3.0.5) provides basic building blocks for a wide variety of visualization designs: data loading and transformation, scales, map projections, axes, legends, and graphical marks such as rectangles, lines, plotting symbols, etc. Interaction techniques can be specified using reactive signals that dynamically modify a visualization in response to input event streams. A Vega specification defines an interactive visualization in a JSON format. Specifications are parsed by Vega's JavaScript runtime to generate either static images or interactive web-based views. Vega provides a convenient representation for computational generation of visualizations, and can serve as a foundation for new APIs and visual analysis tools. Plotting data in the Python ecosystem is a good news/bad news story. The good news is that there are a lot of options. The bad news is that there are a lot of options. Which ones work for you will depend on what you are trying to accomplish. To some degree, you need to play with the tools to figure out if they will work for you. I don't see one clear winner or clear loser. Here are a few of my closing thoughts: Pandas is handy for simple plots but you need to be willing to learn matplotlib to customize. Seaborn can support some more complex visualization approaches but still requires matplotlib knowledge to tweak. The color schemes are a nice bonus. ggplot has a lot of promise but is still going through growing pains. bokeh is a robust tool if you want to set up your own visualization server but may be overkill for the simple scenarios. pygal stands alone by being able to generate interactive SVG graphs and PNG files. It is not as flexible as the matplotlib based solutions. Plotly generates the most interactive graphs. You can save them offline and create very rich web-based visualizations. As it stands now, I'll continue to watch progress on the ggplot landscape and use pygal and plotly where interactivity is needed.
  13. The power of machine learning comes from its ability to learn patterns from large amounts of data. Understanding your data is critical to building a powerful machine learning system. Facets contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive. Explore Facets Overview and Facets Dive on the UCI Census Income dataset, used for predicting whether an individual's income exceeds $50K/yr based on their census data. The census data contains features such as age, education level and occupation for each individual. Overview takes input feature data from any number of datasets, analyzes them feature by feature and visualizes the analysis. Overview gives users a quick understanding of the distribution of values across the features of their dataset(s). It helps uncover several common and uncommon issues, such as unexpected feature values, missing feature values for a large number of observations, training/serving skew and train/test/validation set skew. Facets Overview summarizes statistics for each feature and compares the training and test datasets. It becomes easy to learn the distribution of values across the 6 numeric and 9 categorical features for both datasets. Use the "Sort by" dropdown to sort features by "Distribution distance". This sort order brings the features that differ most between the two datasets to the top of the tables. "Target" becomes the first feature in the table of categorical features. The chart for this feature shows that the training and test datasets actually use slightly different labels (">50K" for the training data and ">50K." for test data; notice the trailing period). This helps us uncover an unexpected difference between the training data and the test data.
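The label mismatch described above is the kind of issue a simple per-feature comparison can surface. The following is a rough sketch of that idea in plain Python, not the Facets API: `compare_feature` is a hypothetical helper and the label lists are small made-up samples that mimic the trailing-period difference.

```python
from collections import Counter

def compare_feature(train_values, test_values):
    """Report values seen in only one dataset, plus missing-value counts."""
    train, test = Counter(train_values), Counter(test_values)
    seen_train = {v for v in train if v is not None}
    seen_test = {v for v in test if v is not None}
    return {
        "only_in_train": sorted(seen_train - seen_test),
        "only_in_test": sorted(seen_test - seen_train),
        "missing_train": train[None],  # None stands for a missing value
        "missing_test": test[None],
    }

# Made-up samples of the "Target" feature; note the trailing periods in test.
train_target = ["<=50K", ">50K", "<=50K", None]
test_target = ["<=50K.", ">50K.", "<=50K."]
report = compare_feature(train_target, test_target)
```

Here every label ends up in one of the "only in" lists, which signals that the two datasets do not share a label vocabulary and need cleaning before training and evaluation.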
Dive is a tool for interactively exploring large numbers of data points at once. Dive provides an interactive interface for exploring the relationship between data points across all of the different features of a dataset. Each individual item in the visualization represents a data point. Position items by "faceting" or bucketing them in multiple dimensions by their feature values. Success stories of Dive include detecting classifier failures, identifying systematic errors, evaluating ground truth and finding potential new signals for ranking. The Dive visualization shows each individual item in the training dataset. Clicking on an individual item reveals key/value pairs that represent the features of that record; values may be strings or numbers. Using the menus on the left, you can change how the data is organized in order to gain insight into the dataset. Use the "Faceting" menu to do "Row-based faceting" by "Education-num". Use the "Color" menu to color by "Target". This will show how higher levels of education are related to whether or not an individual earns more than $50K/yr. Overview gives a high-level view of one or more datasets. It produces a visual feature-by-feature statistical analysis, and can also be used to compare statistics across two or more datasets. The tool can process both numeric and string features, including multiple instances of a number or string per feature. Overview can help uncover issues with datasets, including the following: unexpected feature values; missing feature values for a large number of examples; training/serving skew; training/test/validation set skew. Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets. Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red.
Features can be sorted by values of interest such as the number of missing values or the skew between the different datasets. Dive is a tool for interactively exploring up to tens of thousands of multidimensional data points, allowing users to seamlessly switch between a high-level overview and low-level details. Each example is represented as a single item in the visualization, and the points can be positioned by faceting/bucketing in multiple dimensions by their feature values. Combining smooth animation and zooming with faceting and filtering, Dive makes it easy to spot patterns and outliers in complex datasets. The Facets visualizations currently work only in Chrome (Issue 9). Disclaimer: this is not an official Google product. Note: when visualizing a large amount of data, as is done in the Dive demo Jupyter notebook, you will need to start the notebook server with an increased IOPub data rate. This can be done with the command jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000.
  14. Fun Fact: In large datasets, such as the CIFAR-10 dataset[2], a small human labelling error can easily go unnoticed. We inspected the CIFAR-10 dataset with Dive and were able to catch a frog-cat – an image of a frog that had been incorrectly labelled as a cat! Exploration of the CIFAR-10 dataset using Facets Dive. Here we facet the ground truth labels by row and the predicted labels by column. This produces a confusion matrix view, allowing us to drill into particular kinds of misclassifications. In this particular case, the ML model incorrectly labels some small percentage of true cats as frogs. The interesting thing we find by putting the real images in the confusion matrix is that one of these "true cats" that the model predicted was a frog is actually a frog from visual inspection. With Facets Dive, we can determine that this one misclassification wasn't a true misclassification of the model, but instead incorrectly labeled data in the dataset. We’ve gotten great value out of Facets inside of Google and are excited to share the visualizations with the world. We hope they can help you discover new and interesting things about your data that lead you to create more powerful and accurate machine learning models. And since they are open source, you can customize the visualizations for your specific needs or contribute to the project to help us all better understand our data. If you have feedback about your experience with Facets, please let us know what you think.
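Faceting ground truth by row and prediction by column amounts to bucketing items by a pair of feature values. The sketch below shows that bucketing in plain Python; it is not the Facets Dive API, and the `facet` helper, the field names and the four items are made up for illustration.

```python
from collections import defaultdict

def facet(items, row_key, col_key):
    """Bucket item ids into cells keyed by (row value, column value)."""
    cells = defaultdict(list)
    for item in items:
        cells[(item[row_key], item[col_key])].append(item["id"])
    return dict(cells)

# Made-up examples: ground-truth label vs. model prediction.
items = [
    {"id": 0, "label": "cat", "predicted": "cat"},
    {"id": 1, "label": "cat", "predicted": "frog"},
    {"id": 2, "label": "frog", "predicted": "frog"},
    {"id": 3, "label": "cat", "predicted": "cat"},
]
cells = facet(items, row_key="label", col_key="predicted")

# Off-diagonal cells hold the disagreements worth inspecting by eye.
suspects = cells[("cat", "frog")]
```

Keeping the individual item ids in each cell, rather than only a count, is what makes it possible to open the actual images and notice that a "misclassified" example is really a labelling error.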
  15. folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Manipulate your data in Python, then visualize it on a Leaflet map via folium.
  16. More than 1.5 million Instagram posts were gathered to create this interactive infographic. All of the posts are geo-tagged, which made it possible to map them. The colors on the map show the density and sentiment of Instagram posts across Hong Kong.
  17. Apache Superset is a data exploration and visualization web application. Superset provides: an intuitive interface to explore and visualize datasets and create interactive dashboards; a wide array of beautiful visualizations to showcase your data; easy, code-free user flows to drill down and slice and dice the data underlying exposed dashboards (the dashboards and charts act as a starting point for deeper analysis); a state-of-the-art SQL editor/IDE exposing a rich metadata browser, and an easy workflow to create visualizations out of any result set; an extensible, high-granularity security model allowing intricate rules on who can access which product features and datasets; integration with major authentication backends (database, OpenID, LDAP, OAuth, REMOTE_USER, ...); a lightweight semantic layer that lets you control how data sources are exposed to the user by defining dimensions and metrics; out-of-the-box support for most SQL-speaking databases; deep integration with Druid, which allows Superset to stay blazing fast while slicing and dicing large, realtime datasets; and fast-loading dashboards with configurable caching. On top of being able to query your relational databases, Superset ships with deep integration with Druid (a real-time distributed column store). When querying Druid, Superset can query humongous amounts of data on top of real-time datasets. Note that Superset does not require Druid in any way to function; it is simply another database backend that it can query.
  18. MySQL, Postgres, Vertica, Oracle, Microsoft SQL Server, SQLite, Greenplum, Firebird, MariaDB, Sybase, IBM DB2, Exasol, MonetDB, Snowflake, Redshift, and more. Look for the availability of a SQLAlchemy dialect for your database to find out whether it will work with Superset.
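One quick way to check that a SQLAlchemy dialect is available for a database is to create an engine from a connection URI and run a trivial query. The sketch below assumes SQLAlchemy is installed and uses an in-memory SQLite database as a stand-in; for another database you would substitute its own URI and install the matching driver.

```python
from sqlalchemy import create_engine, text

# "sqlite://" is an in-memory SQLite database; its dialect ships with SQLAlchemy.
engine = create_engine("sqlite://")

# If the dialect and driver load correctly, a trivial query succeeds.
with engine.connect() as conn:
    answer = conn.execute(text("SELECT 1")).scalar()

dialect_name = engine.dialect.name
```

If `create_engine` cannot find a dialect or driver for the URI, it raises an error, which is a sign that an extra package is needed before that database can be used.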