A walk through the maze of understanding Data Visualization using several tools such as Python, R, KNIME and Google Data Studio.
This workshop is hands-on; this set of slides serves as the agenda for the workshop.
2. "In God We Trust… All Others Bring Data."
W. Edwards Deming
3. AGENDA
1. Exploratory Data Analysis
2. Fundamentals of Effective Data Visualization
3. Tools for Data Visualization
4. Demo using Python, R and KNIME to create visualizations
5. Creating insightful reports with visual tools
6. Q & A
4. What is EDA?
• Exploratory data analysis (EDA) is an approach to analyzing data
that reveals the important characteristics of a dataset,
mainly through visualization.
• Get to know your data!
• Distributions (symmetric, normal, skewed)
• Data quality problems
• Outliers
• Correlations and inter-relationships
• Functional relationships
• Derived attributes and keys (primary, foreign)
• Static attributes, dynamic attributes, etc.
5. Get a good look at, and feel for, the data.
• Always check your datasets
• Means
• Medians
• Quantiles
• Histograms
• Boxplots
• Scatter Diagrams
Consider looking at every attribute - you will understand
what it represents!
6. Visualization before Analysis
(Anscombe’s Quartet)
    I               II              III             IV
    x      y        x      y        x      y        x      y
 10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
  8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
 13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
  9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
 11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
 14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
  6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
  4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
 12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
  7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
  5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89
7. For all four datasets
Property                                                Value               Accuracy
Mean of x                                               9                   exact
Sample variance of x                                    11                  exact
Mean of y                                               7.50                to 2 decimal places
Sample variance of y                                    4.125               plus/minus 0.003
Correlation between x and y                             0.816               to 3 decimal places
Linear regression line                                  y = 3.00 + 0.500x   to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression   0.67                to 2 decimal places
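The shared summary statistics can be checked directly from the quartet's values. The following is a minimal sketch using only Python's standard library; the `summarize` helper and the variable names are illustrative and not part of the workshop material.

```python
from statistics import mean, variance

# Anscombe's quartet, transcribed from slide 6.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed on the slide for one dataset."""
    mx, my = mean(x), mean(y)
    # Sample covariance (n - 1 in the denominator, matching statistics.variance).
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    var_x, var_y = variance(x), variance(y)
    slope = cov / var_x                  # least-squares slope
    r = cov / (var_x * var_y) ** 0.5     # Pearson correlation
    return {
        "mean_x": mx, "var_x": var_x, "mean_y": my, "var_y": var_y,
        "r": r, "slope": slope, "intercept": my - slope * mx, "r2": r * r,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four rows of the printout agree to within the accuracy stated on the slide, which is exactly why the numbers alone cannot tell the datasets apart.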
8. • The first scatter plot (top left) appears to be a simple linear relationship, corresponding to
two correlated variables that follow the assumption of normality.
• The second graph (top right) is not distributed normally; while a relationship between the two
variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A
more general regression and the corresponding coefficient of determination would be more
appropriate.
9. • In the third graph (bottom left), the relationship is linear, but it should have a different regression
line (a robust regression would have been called for). The calculated regression is offset by the
one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
• Finally, the fourth graph (bottom right) shows an example when one outlier is enough to produce
a high correlation coefficient, even though the other data points do not indicate any relationship
between the variables.
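The effect of the single outlier in the third dataset can be reproduced numerically. This is a small sketch using only the standard library; `pearson_r` is an illustrative helper, and the data are dataset III as transcribed from slide 6.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Dataset III of Anscombe's quartet.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

r_all = pearson_r(x3, y3)

# Drop the one outlier, the point at x = 13 (y = 12.74).
kept = [(a, b) for a, b in zip(x3, y3) if a != 13]
r_without = pearson_r([a for a, _ in kept], [b for _, b in kept])

print(f"with outlier:    r = {r_all:.3f}")
print(f"without outlier: r = {r_without:.5f}")
```

The same experiment cannot be run on dataset IV with this helper: without the point at x = 19 every x equals 8, so the denominator is zero and the correlation is undefined.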
10. Get a general sense of the data
• Make sure your first visualization is data-driven (model-free)
• Think interactive and visual
• Humans are the best pattern recognizers
• Use as many dimensions as your data will permit (2, 3, or more)
• x, y, z, space, color, time, …
• Visualization is useful in early stages of data mining
• detect outliers (e.g. assess data quality)
• test assumptions (e.g. normal distributions or skewed?)
• identify useful raw data & transforms (e.g. log(x))
Take Away: it is always well worth looking at your data!
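As one way to make "detect outliers" concrete, the sketch below applies the 1.5 × IQR rule that many boxplot implementations use for their whiskers. The slides do not name this rule, so treat it as an assumed, illustrative choice; the data are the y values of Anscombe's dataset III from slide 6.

```python
from statistics import quantiles

# y values of Anscombe's dataset III (slide 6).
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Quartiles via the standard library (default "exclusive" method).
q1, _median, q3 = quantiles(y3, n=4)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; values outside them are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [value for value in y3 if value < lower or value > upper]

print(f"Q1={q1}, Q3={q3}, fences=({lower:.3f}, {upper:.3f})")
print("flagged:", outliers)
```

A flagged value is only a candidate for inspection; whether it is a data quality problem or a genuine observation still has to be judged by looking at the data.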
18. Introduction to Information Visualization - Fall 2013
*Adapted from The ParaView
Tutorial, Moreland
Visualization: converting raw data into graphics that
people can understand
22. HEATMAP VISUALIZATION
• A heatmap is a two-dimensional
graphical representation of data
where the individual values that
are contained in a matrix are
represented as colors.
• The Seaborn Python package
can create annotated heatmaps,
which can be tweaked with
Matplotlib tools to suit the
creator's requirements.
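A minimal sketch of an annotated heatmap, assuming Seaborn, Matplotlib and NumPy are installed. The matrix values and the labels `a`, `b`, `c` are made up for illustration only.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical 3 x 3 matrix (for example, pairwise correlations).
labels = ["a", "b", "c"]
matrix = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, -0.3],
    [0.1, -0.3, 1.0],
])

fig, ax = plt.subplots(figsize=(4, 3))
# annot=True writes each cell's value into the cell; fmt controls its formatting.
sns.heatmap(matrix, annot=True, fmt=".2f", cmap="viridis",
            xticklabels=labels, yticklabels=labels, ax=ax)
# The returned object is a Matplotlib Axes, so Matplotlib calls still apply.
ax.set_title("Annotated heatmap (illustrative data)")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Because `sns.heatmap` draws onto a regular Matplotlib `Axes`, titles, tick labels and figure size are adjusted with ordinary Matplotlib calls.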
25. KNIME DATA VISUALIZATION TOOLS
• KNIME Analytics Platform provides many nodes for data visualization,
including scatter plots, pie charts, box plots, histograms as well as tag
clouds and visualizations of networks.
Data Visualization Nodes
• KNIME has a number of native nodes dedicated to visualization.
• Hiliting
• Geo-location
• R Choropleths
26. KNIME FEATURES
KNIME uses a modular workflow approach, which documents and
stores the analysis process in the exact order in which it was conceived
and implemented. All results in the workflow are instantly available
for review by the user, which aids debugging at every stage of the
workflow.
Core KNIME features include:
• Scalability through sophisticated data handling
(intelligent automatic caching of data in the background
while maximizing throughput performance)
• Highly and easily extensible via a well-defined API for
plugin extensions
• Intuitive user interface
• Import/export of workflows (for exchanging with other
KNIME users)
• Parallel execution on multi-core systems
• Command line version for "headless" batch executions
27. KNIME FUNCTIONALITIES
Available KNIME modules cover a vast range of functionality,
such as:
• I/O: retrieves data from files or databases
• Data Manipulation: pre-processes your input data with
filtering, group-by, pivoting, binning, normalization,
aggregation, joining, sampling, partitioning, etc.
• Views: inspects the data and results with several
interactive views, supporting interactive data exploration
• Hiliting: ensures that data points hilited in one view are also
immediately hilited in all other views
• Mining: uses state-of-the-art data mining algorithms like
clustering, rule induction, decision trees, association rules,
naïve Bayes, neural networks, support vector machines,
etc. to better understand your data
29. MATPLOTLIB – 2D Graphics
Simple and powerful visualizations can be generated
using the Matplotlib Python Library.
It is the most widely-used library for plotting in the
Python community.
The plotting functions of libraries like pandas are “wrappers” over Matplotlib,
allowing access to a number of Matplotlib’s methods
with less code.
The versatility of Matplotlib can be used to make many
visualization types:
• Scatter plots
• Bar charts and histograms
• Line plots
• Pie charts
• Stem plots
• Contour plots, etc.
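Two of the chart types listed above, a scatter plot and a histogram, can be sketched in a few lines. This assumes Matplotlib is installed; the data are Anscombe's dataset I from slide 6, and the bin count of 5 is an arbitrary illustrative choice.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Anscombe's dataset I (slide 6).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# One figure with two side-by-side axes.
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_scatter.scatter(x, y)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

ax_hist.hist(y, bins=5)
ax_hist.set_xlabel("y")
ax_hist.set_title("Histogram")

fig.tight_layout()

# Render to an in-memory PNG; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```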
30. SEABORN
• Seaborn is a popular data
visualization library that is built
on top of Matplotlib.
• Seaborn’s default styles and
color palettes are much more
sophisticated than Matplotlib’s.
• Seaborn is a higher-level library,
meaning it’s easier to generate
certain kinds of plots, including
heat maps, time series, and
violin plots.
31. ggplot
• ggplot is a Python visualization library
based on R’s ggplot2 and the Grammar of
Graphics.
• ggplot operates differently from
Matplotlib: it lets users layer components
to create a full plot.
• The Grammar of Graphics has been hailed
as an “intuitive” method for plotting,
though seasoned Matplotlib users might
need time to adjust to this new mindset.
32. Bokeh
https://bokeh.pydata.org/en/latest/docs/gallery.html#gallery
• Bokeh is native to Python, not ported over from R, unlike ggplot. Bokeh, like
ggplot, is also based on The Grammar of Graphics.
• It also supports streaming and real-time data. Its unique selling proposition
is its ability to create interactive, web-ready plots, which can easily be output as
JSON objects, HTML documents, or interactive web applications.
• Bokeh has three interfaces with varying degrees of control to accommodate
different types of users.
• The topmost level is for creating charts quickly. It includes methods for creating
common charts such as bar plots, box plots, and histograms.
• The middle level allows the user to control the basic building blocks of each chart (for
example, the dots in a scatter plot) and has the same specificity as Matplotlib.
• The bottom level is geared toward developers and software engineers. It has no pre-
set defaults and requires the user to define every element of the chart.
https://demo.bokehplots.com/apps/crossfilter
https://realpython.com/python-data-visualization-bokeh/
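The web-ready output described on this slide can be sketched with the `bokeh.plotting` interface, assuming a reasonably recent Bokeh installation. The data are Anscombe's dataset II from slide 6; the titles are illustrative.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Anscombe's dataset II (slide 6).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Build the plot from basic glyphs (here: scatter markers).
plot = figure(title="Anscombe dataset II", x_axis_label="x", y_axis_label="y")
plot.scatter(x, y, size=8)

# Produce a standalone HTML document that loads BokehJS from the CDN.
html = file_html(plot, resources=CDN, title="Bokeh demo")
print(len(html), "characters of HTML")
```

The resulting string can be written to a `.html` file and opened in a browser, where the default pan and zoom tools are available.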
33. PLOTLY
• Plotly is widely known as an online platform for
data visualization.
• It can be accessed from a Python notebook.
• Like Bokeh, Plotly’s strength lies in making
interactive plots, and it offers some charts not
found in most libraries, like contour plots.
• It can also be used by people with no technical
background to create interactive plots by
uploading the data and using the Plotly GUI.
• Plotly is compatible with ggplot plots in R and Python.
• It lets you embed interactive plots in projects or
websites using iframes or HTML.
https://plot.ly/python/line-and-scatter/
https://plot.ly/feed/?q=plottype:choropleth
34. PYGAL
• Pygal offers interactive plots that can be
embedded in a web browser. Its prime
differentiator is the ability to output
charts as SVGs. For work involving smaller
datasets, SVGs will do just fine. However,
charts with hundreds of thousands of
data points become sluggish and
have trouble rendering.
• It’s easy to create a nice-looking chart
with just a few lines of code, since each
chart type is packaged into a method and
the built-in styles are pretty.
35. ALTAIR
Altair is a declarative statistical
visualization Python library based
on Vega-Lite.
Declarative means you only need to
specify the links between data columns
and encoding channels, such as x-axis,
y-axis, and color; the rest of the
plotting details are handled automatically.
Being declarative makes Altair simple,
friendly and consistent. It is easy to
design effective and beautiful
visualizations with a minimal amount of
code using Altair.
36. Geoplotlib
• It is a toolbox used for plotting
geographical data and map creation.
• It can be used to create a variety of map-
types, like choropleths, heatmaps, and dot
density maps.
• It provides a set of built-in tools for the
most common tasks, such as density
visualization, spatial graphs, and
shapefiles.
• Simply put, Geoplotlib is a Python library
dedicated to visualizing maps.
37. Major R Visual Libraries
• Plotly - Plotly's R graphing library makes interactive, publication-quality
graphs online. Can be used to make line plots, scatter plots, area
charts, bar charts, error bars, box plots, histograms, heatmaps,
subplots, multiple-axes, and 3D (WebGL based) charts.
• ggplot2 - The ggplot2 package lets you make beautiful and
customizable plots of your data. It implements the grammar of
graphics, an easy-to-use system for building plots.
• Shiny - Shiny is an R package that makes it easy to build interactive web
apps straight from R. You can host standalone apps on a webpage or
embed them in R Markdown documents or build dashboards. You can
also extend your Shiny apps with CSS themes, htmlwidgets, and
JavaScript actions.
https://shiny.rstudio.com/gallery/genome-browser.html
https://rdrr.io/snippets/
http://gallery.htmlwidgets.org/
docs.ggplot2.or
38. GOOGLE DATA STUDIO
• Currently in beta, Google Data Studio allows you
to create branded reports
with data visualizations to share with your
clients. ... Google Data Studio is part of
the Google Analytics 360 Suite, the high-end
(i.e., pricey) Google Analytics Enterprise package.
• Data Studio is Google's reporting solution for
power users who want to go beyond
the data and dashboards of Google Analytics.
The data widgets in Data Studio are notable for
their variety, customization options,
live data and interactive controls (such as
column sorting and table pagination).
• Previously, you could create only up to five custom reports
for free; now you can create as many as required.
https://datastudio.google.com/reporting/1Rg5y6r0640X8uo2xo2XY48sG9IyMiYEN/page/wcCU
39. D3.JS
• D3.js is a JavaScript library for
manipulating documents based on data.
• D3 helps you bring data to life using
HTML, SVG, and CSS. D3’s emphasis on
web standards gives you the full
capabilities of modern browsers without
tying yourself to a proprietary framework,
combining powerful visualization
components and a data-driven approach
to DOM manipulation.
42. Reporting and Analysis
• Reporting is “the process of
organizing data into informational
summaries in order to monitor
how different areas of a business
are performing.”
• Analytics is “the process of
exploring data and reports in order
to extract meaningful insights,
which can be used to better
understand and improve business
performance.”
43. What Is an Analytical Report?
An analytical report is a business report that:
• Uses qualitative and quantitative data
to analyze and evaluate a business
strategy or process.
• Empowers decision makers to make data-
driven decisions based on evidence and
analytics.
45. Collecting Metrics is easy – Generating Insights is what nails it!
Generate actionable insights
• Actionable Insights is a term in data analytics and big data for information that can be
acted upon or information that gives enough insight into the future that the actions
that should be taken become clear for decision makers.
• Analytics (mathematical ways of synthesizing metrics) must illuminate business
conditions, sentiment and directional changes over time.
• Insights are what humans make from analytics - once you have data and perform the
analysis, you have the knowledge to form insights and change your actions or
responses.
47. This session is for educational purposes. The material used in this presentation has been compiled from various free
and readily available resources; a full acknowledgement list can be furnished on request.