[Unclear] words are denoted in brackets
Webinar: Data Visualisation Part 2 – Tools and
Techniques
12 April 2018
Video & slides available from ANDS website
START OF TRANSCRIPT
Gerry Ryder: Good afternoon everyone. My name is Gerry Ryder and it's my
pleasure to host this webinar today. Now, on to our speaker today - for
those that weren't with us for the previous webinar in this series, our
speaker is Martin Schweitzer, who is a data technologist with ANDS in
our Melbourne office.
Martin has a background in computer science and a particular interest
in visualisation, data science and user interface design. He has a varied
professional background, which includes photography, working on
large IT systems, lecturing, as well as running workshops and training
courses. Martin is currently seconded to ANDS from the Bureau of
Meteorology, where he is largely responsible for the climate record of
Australia.
Today Martin is presenting the second in the series of two webinars
on data visualisation and today's focus will be on tools and
techniques. So without any further ado, I'll hand over to you, Martin.
Martin Schweitzer: Thanks Gerry, and thanks Susannah, who's behind the controls. I
hope everybody can see my screen.
Today we're going to look at creating visualisations and pretty much
everything you see is going to be live. I'm using a tool called Jupyter
Notebook. You don't have to be familiar with this tool to follow along.
Also, I'll be using Python for my examples, but if you don't know
Python, once again, that shouldn’t be a problem, because most of the
tools and techniques that I will be showing you will be available in
other languages, for example, R and languages like that.
I'll be going through a number of libraries, showing generally the
strengths of each library, where they can be used, how they can be
used, and as we progress, we'll move from more static to more web-
based type environments.
Jupyter Notebook runs in a web browser, and so what you see here is
my web browser, and I'll just maximise it now so that you know it's a
web browser. What it allows us to do is to type in Python code and
then execute it immediately. This is great for anybody doing research,
because a lot of the work is in experimenting. You try something, you
adjust a few parameters, and so on.
The first two lines I've got are just to set up our environment. Often
we get error messages showing, so this will just hide them.
Some of the libraries we'll be exploring today - the first one is
Matplotlib. We'll be looking at Pandas, one called Seaborn, two web-
based, called Bokeh and Plotly, and the last one is one that's used for
mapping, called Basemap. As I go into each one, we'll talk about them
in detail.
Now, if anybody missed the first talk, that's not a problem, because I'll
explain things as I go along, but a number of these examples are
showing how we created the plots that you may have seen in the first
talk. Also, one of the things I said in the first talk was, when doing any
sort of visualisation, it's important to have some sort of story or some
sort of reason, or something we're trying to say with that visualisation.
The first visualisation I'm going to share is actually based on a
problem that I came across just while browsing the web. I'll explain the
problem. Basically, we have a room - there are 50 people in the room,
and each person starts with $100. Each time the clock ticks, each
person takes a random number - picks a card between one and 50,
and let's say their card says 26 - they'll give $1 to person 26, if they've
got some money in their hand. If they've got no money at that point in
time, they don't give anything.
The question was, after a few thousand ticks, let's say a thousand,
how much money - how will the money be distributed? Will it be fairly
evenly distributed among the people? Will some people have a lot of
money and some people a little - and so on.
I found this quite interesting. I wrote a small simulation. The first part
of this code is the simulation, so all that is simulating what happens.
I'm now using the library called Matplotlib. Matplotlib is - as much as
you can say there's a standard plotting library in Python - the standard
plotting library. In Matplotlib, to plot my results that I got requires one
line. So to run this, I just press - I'm going to press control return, and
if everything works, as I hope it will, we get a plot.
So this one line of code gave me a plot. It realised that there were 50
elements and that the values inside those elements were between
zero and 350. However, this plot doesn’t really give us a picture of
what's happened. So what I'm going to do is sort it so that the people
with the least money come up first and the people with the most
money come last. As I said, this is an interactive environment, so all I
have to do is press - make that change, press shift enter again. It runs
it a thousand times. Now I get a very different plot. Here we can see
that a lot of people have below $50, most of them have below $100,
and then very few people have between $250 and $350.
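A minimal sketch of the simulation and plot described above. The notebook's own code is not reproduced in this transcript, so the details here are assumptions: the function names `simulate` and `plot_wealth`, the use of Python's `random` module, and the choice to let people give one after another within a tick. It only illustrates the stated rules (50 people, $100 each, give $1 to a randomly chosen person if you have any money) and the sorted plot.

```python
import random


def simulate(n_people=50, start=100, ticks=1000, seed=None):
    """Return each person's money after `ticks` rounds of random $1 gifts."""
    rng = random.Random(seed)
    wealth = [start] * n_people
    for _ in range(ticks):
        for giver in range(n_people):
            if wealth[giver] > 0:  # people with no money give nothing
                receiver = rng.randrange(n_people)  # the "card" drawn this tick
                wealth[giver] -= 1
                wealth[receiver] += 1
    return wealth


def plot_wealth(wealth, path=None):
    """Plot the sorted distribution; optionally save it (e.g. 'wealth.svg')."""
    # Imported here so that simulate() works without Matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(sorted(wealth), "o")  # poorest first, richest last
    ax.set_xlabel("person (sorted by money)")
    ax.set_ylabel("dollars")
    if path is not None:
        fig.savefig(path)  # output format is taken from the file extension
    return fig


wealth = simulate(seed=1)
```

Calling `plot_wealth(wealth)` draws the sorted distribution. Passing a path ending in `.svg` or `.pdf` also saves it as vector output, because Matplotlib picks the format from the file extension.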
One of the things you will note - I haven't done this yet. I'll just change
this once again, and now what I'm going to do is to save the plot as an
SVG - SVG is scalable vector graphics - and I'll run the plot again. It
will save it, and I can now open it in a web page. So we'll just open
that web page here, and what we see is we've got a nice plot. The
other thing about it that's special - about scalable vector graphics - is
that it's scalable. If I make it bigger - because it's a vector, we don't
get any artefacts. It just scales. It gets bigger, it gets smaller. We don't
lose any quality.
The next thing is we can look at this and say, well, this sort of looks
exponential. How well does it fit an exponential curve? Once again,
Matplotlib allows us to do this. We're not going to rerun the simulation.
We'll just use the data from the last simulation and when we run it,
what we see is we get a nice little - so what I've done is I said, added
a line - a polynomial of third degree that fits those points.
I've just noticed that I haven't reset something from the last time I ran
it, so I'm quickly going to restart this. I'm going to have to rerun the
simulation, unfortunately. When we do this again - okay, which is what
I expected, an orange line.
Initially what we see is we've got this thin orange line, which nicely fits
the points. By changing the plot we can set things like the line width -
equals five - run it again, and we get a thick orange line, which is
much easier to see, but it's crossing over the points. So the next thing
we can do is just say, alpha - which is the transparency - so equals
0.5, which basically says, make it 50 per cent transparent. Now we get
a nice, thick, transparent line running through those points.
So once again Matplotlib gives us the ability to customise this line. If
we wanted Xs instead of circles, we can change the plot of the points
to Xs - and I'll run again, and we get Xs.
So one of the characteristics that we often look for when looking for a
library is this idea that simple things should be simple, complex things
should be possible. In other words, we don't want a very long learning
curve, or have to do a lot of work to get a simple graph, but if we do
want something special, we want to be able to do it. We don't want a
tool that's really simple to use but as soon as we want something a
little bit out of the ordinary, suddenly the tool stops working.
We'll look at a few more examples of Matplotlib. This one actually
comes from the documentation. The easiest thing is to show it. It plots
a polar graph. It's using different colours, and we've just - in this case -
generated some random numbers and made them the size of the
circle and generated a random number for where, around the circle,
we've plotted it, and then depending on the angle we're plotting the
numbers in a different colour. Once again, the code that's doing the
plotting are those two lines. Once we've generated the data, all we
need are two lines of code and it gives us a really nice polar plot.
Here's another example, once again, also from the documentation.
What we're doing with this one is we're going to display it interactively
in Jupyter Notebook, but we're also going to save it. This time we're
going to save it as a PDF. So let's run this, and there we get four
histograms. It's the same data each time, but what it's demonstrating
to us is different ways in which we can create histograms, so we can
have stacked, we can have unfilled, we can have bars with legends, et
cetera.
If we have a look at the PDF that we've generated, there we go - and
once again, because it's a PDF, it's scalable, and as we scale we
don't lose any quality. It just - it's all done with vectors and gives us
really nice output. Unfortunately, I seem to have closed my - oh, there
we go. Where that's useful is if we're doing any kind of publication, it's
really nice to be able to save our output either as SVG or PDF and
include that in a publication.
This is one more, showing the range of plots we can do. This one is
called a hexbin plot. So it's plotting with hexagons, and what we're
doing is it’s a cross between a scatter plot, which plots X values and Y
values, on a graph, where we've got - like we may have the X values
may be how many figs somebody has eaten, and the Y values may be
something like the weight, and we want to plot those two against each
other, but what this is also doing is plotting it against how frequently
those values occur. One of the nice things is it's really easy to do log
scales. What we see here is a nice graph showing us that that white
area in the centre are values that occurred very often, and as we
move out values, occurring less and less often.
The following one is we've - actually, what we're going to do is quickly
look at a comic. Some people may be familiar with the web comic
called XKCD. The person Randall Munroe is a very funny person, but
with quite a scientific bent, and also quite strong computer skills. This
one is called Stove Ownership, and it shows his health before he
realised he could cook bacon whenever he wanted, and afterwards.
The thing about this graph is that it's hand-drawn. While sometimes
we want graphs that look very polished, very professional, there's
often a perception when people see a graph like that that the figures
are very accurate, and this isn't always the case. So what people did
was to create a style, using Matplotlib, that would recreate the look and
feel of XKCD. This is quite a lot because there's quite a lot. In fact,
they've taken two of his comics and I'll quickly plot that, using
Matplotlib, and what we see is a computer-generated reproduction of
Randall Munroe's graph. That's one of his - this is a histogram, done
in very much the same style, which copies another one of his comics.
So, taking the style, I will replot my simulation - you'll remember the
results from my simulation. So if we run that again, we see we get this
thing, and which once again - so now that it doesn’t look slick and
professional, we see really this is very much a simulation that these
figures aren't accurate, and so on. What this does do for us is it does
give us an idea of the flexibility of Matplotlib.
I'll quickly restart the kernel before going into our next library. The next
library is called Pandas. Pandas is a very useful library for anybody
who's working with spreadsheets, who's working with CSV files, who's
working with data that's coming from an API across the web, and it
also has its own plotting routines built in.
In this code here, the first line, which actually goes over three lines,
I'm reading a file which is dam storage levels. It's a CSV file. Anybody
who watched the last presentation would be familiar that I showed
some examples. You'll see the same examples again today.
The first line reads the file, the second line plots it, using Pandas
plotting, and the third line just adds a legend to it, or sets the label on
the file at [send full]. So we'll run this code, and there we did get -
these are Melbourne dams, and this is showing that the Thomson is
about 68 per cent full, and things like Tarago are 95 per cent full. So
what we've done is in one line we've read that CSV file, we've told it
what we want to call the columns. One of the columns is called name
and one of the columns is called Pfull, for percentage full. When we
plot it, because Pandas knows about this thing called DS, all we have
to say, I want to plot the name against the percentage full and I want a
bar chart. I've also said, I want to plot from the value 60 to 100. If I
leave out those values, but the same graph, it will plot from zero to
100 by default.
On this one - part of the thing was to show that even though we got
the same figures in the same graph, it looks different when we start
our scale from zero. Once again, the Thomson is about 68 per cent
full and Tarago is almost 100 per cent full.
The point which is the take-home point here is that to create that plot
took two lines of code.
The third thing that we showed last time was what's really interesting
though, is the gap in volume of these different dams. That gives us a
much better picture of what's happening. So when we run that what
we see is because the Thomson dam is a really big dam, it's got over
200,000 megalitres of water deficit. So even though these dams on the
right are almost full, altogether they don't even make up that deficit in
the dam.
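A small sketch of why the deficit in volume tells a different story from the percentage full. The two percentages are the ones quoted in the talk. The capacities are round placeholder numbers chosen only for illustration, not figures from the CSV file used in the notebook, and the names `deficit_ml` and `plot_deficits` are hypothetical.

```python
# (name, per cent full as quoted in the talk, capacity in megalitres)
# The capacities are illustrative placeholders, not real data.
dams = [
    ("Thomson", 68.0, 1_000_000),
    ("Tarago", 95.0, 40_000),
]


def deficit_ml(pct_full, capacity_ml):
    """Volume still missing from a dam, in megalitres."""
    return capacity_ml * (1 - pct_full / 100)


deficits = {name: deficit_ml(pct, cap) for name, pct, cap in dams}


def plot_deficits(deficits):
    """Bar chart of the missing volume per dam."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.bar(list(deficits), list(deficits.values()))
    ax.set_ylabel("deficit (ML)")
    return fig
```

With these placeholder capacities the large dam is missing far more water than the small one could hold in total, even though the small one looks healthier on a percentage chart.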
So that's Matplotlib and its strength is that it pretty much comes
standard with Python. It's flexible, and so on. However, its simplicity
often comes at the cost that it's not the best publication-ready
graphing tool. You can get very nice publication-ready graphs by
doing a bit of work, but what some people have done is to do that
work to make it easier for people to create better graphs.
One of those libraries is called Seaborn. Seaborn basically sits on top
of Matplotlib and simply adds some nice styles. We'll replot that same
plot, this time using Seaborn. All that we're doing here is importing it
and just saying - and initialising it. So we've just added those two
lines. Everything else is exactly as the last example. We run this, and
we see a totally - well, similar graph, but different styling. What
Seaborn has done is to make it quite easy to change the styling.
I'll say set the style to white, run it again, and we'll see we'll get a nice
clean graph, and for example, in the next example what we will do is
we will set the style, but we want a white grid and a muted palette. We
will run that one and we get that white grid with a muted palette.
The next one is just one of the - well, one thing that Seaborn does,
which a few packages are starting to do, is that it actually includes its
own datasets when you install the package, which is really great for
when you're learning, because one of the worst things is you pick up a
package, you try and learn it, but the first thing you've got to do is find
some data to plot and so on. One of the data sets that Seaborn
comes with is this one called Flights and I really enjoy heatmaps, so
this is just an example of a really simple heatmap, using Seaborn and
some of its inbuilt data - or some of the data that's provided.
What we see over here is these going down the bottom are years.
Across the y-axis are months. So round about 1960, July, there were
lots of flights, and in the earlier years I guess there were fewer flights.
Also during winter there are fewer flights than in summer.
Once again, done with Seaborn and done really with two, three - two
lines of text - two lines of code, which are those lines.
Another dataset that comes with Seaborn is called Tips. It's basically
how much people will tip at restaurants. So the first thing we'll do is to
load the dataset and have a look at the first 10 rows of this dataset.
What we see is we've got a few columns. The first one is the amount
of the total bill - how much tip - what tip was left, the sex of the person
serving, whether or not they were a smoker, what day of the week it
was, whether it was lunch or dinner, and the size of the party. We're
going to use that dataset and have a look at a few Seaborn graphs.
The first one we're going to look at is a box plot. What we've done is
we've said we want the hue to be whether they're a smoker or not, so
the purplish colour means they were smokers, the greenish colour
means they weren't. On the left-hand side we've got the size of the
total bill and across the bottom is days. So it does seem that on
Sundays maybe people tip more, and it would look like on Sundays
maybe for some reason, whatever, smokers tip more than non-
smokers.
Another plot that is often used in similar ways to the box plot but
carries a bit more information - encodes a bit more information - is
what's known as a violin plot, and these once again are quite easy in
Seaborn. In this case what we've done is we've used a different hue
for male and female. Basically you read this pretty much the same as
the box plot. There's the median. There's one - the top quartile, the
bottom quartile, et cetera. Some of the information is very much the
same. On Sundays people seem to tip the most, and we can see
they've been split this time into male and female.
Those people who were at the last one will remember I demonstrated
something called Anscombe's Quartet. It's four datasets, each with the
same means and linear regression lines, but each dataset looks very
different. Here's a very simple example of it being done in Seaborn.
We'll just have a quick look at that. We see it was quite easy. In this
case, we're sharing the y-axis. Across the bottom we're sharing the x-
axis of the two plots, and all of this was done in a very, very compact
way, using Seaborn.
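The claim about Anscombe's Quartet can be checked without any plotting library. The sketch below uses the published values of the four datasets and a plain least-squares fit. The helper name `fit_line` and the `summary` dictionary are illustrative choices, not part of the notebook shown in the talk.

```python
from statistics import mean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# For each dataset: (mean of y, slope, intercept)
summary = {name: (mean(ys), *fit_line(xs, ys)) for name, (xs, ys) in ANSCOMBE.items()}

for name, (mean_y, slope, intercept) in summary.items():
    print(f"{name}: mean y = {mean_y:.2f}, y = {slope:.2f}x + {intercept:.2f}")
```

The four rows of output agree to two decimal places, which is why the datasets only reveal their differences once they are plotted.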
The next thing we're going to look at is plotting data on maps. This
goes back to a lot of what I do in my substantive job at the Bureau.
The library that we use for a lot of our mapping is - once again, it's a
standard with Python. It's called Basemap. The first one, we're not
actually plotting anything, we're just simply drawing a map, so what
we should see now. It takes a little bit the first time we run it, but we've
plotted a map of the world in a few lines of code. That's pretty much
from there to there.
The story - what we're really interested in at this stage is Australia,
and this projection isn't as useful as what we're going to look at now,
which is a sort of Mercator, so we'll just change some of the
parameters and this should give us a map of Australia, which is great.
It looks a bit like the ones I draw by hand.
What I'm going to demonstrate now is some more visualisation, but it
goes back to a problem I was given, oh, about a year or two ago. We
have about 112 reference stations around Australia. These are
stations with very high-quality data that have a long record - about 50
years or longer. These are very important in - as reference stations, to
see what's happening with the climate of Australia. One of the
outcomes of this - the reference station set is called Acorn, and we do
a publication where we publish the names of each station. One of the
things we also publish is for each station which are the closest three
stations to that station.
I wrote some code that worked out what the closest three stations
were, to each station. This was the file I was given - once again, I'm
using Pandas to read it. So we've got, for example, Halls Creek, we've
got the latitude, longitude, the altitude and the date it was opened. As
you can see, these all have a very long record.
The first thing is I plot these, so using Matplotlib - the first parts we've
seen. That draws the map. This line, after having read the file, plots
the data on a map, so we'll just quickly plot those stations. The black
dots of course are the stations, and there are 112 of them around
Australia.
The question I was asked is - after saying, okay, here's a list - for each
station these are the closest three stations to that particular station.
Being scientists, they always ask interesting questions. They said, by
going to one of the closest three stations from each station, is it
possible to get from any station to any other station?
Now, it may seem that the obvious answer is yes, but the thing is
because - if I'm sitting here, these may be my three closest stations,
but that does not mean that where I'm sitting - which is around
Meekatharra, that it's going to be one of the closest three stations to
this station, because this station's three closest neighbours are maybe
these three stations.
The first thing I did - because I'm very visual - was to try and visualise
it. What I did was to go back to a very old package which is about 30
years old. I first used it probably more than 20 years ago, called
Graphviz. Python includes bindings for Graphviz. We can think of this
as each station is a node and we've got lines connecting it to the three
closest stations. What I've done is to do something that will visualise
that. So we'll just run this code and it creates a PDF.
What we see in this PDF is that - I've simply used the station numbers
to save space - we can see the layout of all the stations and - move
across here - one of the things that we see, for example, is that station
7045, even though it's got three stations that are closest to it, there's
no station for which 7045 is the closest station. We can see it in a few
other places as well. I think over here, we've only got one line going
from 85096 to 91293. If anybody wants to guess, this part is in fact the
stations that are in Tasmania. If we go back to our graph - our map -
we can see how these are all close together - that station is close to
that one, but these are all closer to each other than the main one. So
basically, that graph helped us visualise, and yes, it turns out that after
writing some code, that there is no single path.
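The question above is a graph problem: each station points to its three nearest neighbours, and we ask whether every station can be reached from every other by following those links. The sketch below is one way to check that, not the code written at the Bureau. The station names and coordinates are made up (a "mainland" cluster and an "island" cluster), and the function names are hypothetical.

```python
import math
from collections import deque


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_neighbours(stations, k=3):
    """Map each station to the names of its k closest other stations."""
    graph = {}
    for name, pos in stations.items():
        others = sorted((n for n in stations if n != name),
                        key=lambda n: haversine_km(pos, stations[n]))
        graph[name] = others[:k]
    return graph


def reachable_from(graph, start):
    """All stations reachable from `start` by following neighbour links."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def fully_connected(graph):
    """True if every station can reach every other station."""
    return all(reachable_from(graph, s) == set(graph) for s in graph)


# Made-up coordinates: two tight clusters that are far apart.
stations = {
    "m1": (-30.0, 140.0), "m2": (-30.5, 140.0), "m3": (-30.0, 140.5), "m4": (-30.5, 140.5),
    "i1": (-42.0, 146.0), "i2": (-42.5, 146.0), "i3": (-42.0, 146.5), "i4": (-42.5, 146.5),
}
graph = nearest_neighbours(stations, k=3)
```

In this toy layout every station's three nearest neighbours lie inside its own cluster, so `fully_connected(graph)` is false. That mirrors the situation described for the Tasmanian stations.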
The next question - once again, these people being scientists - is,
where would we have to add stations so that the closest - so that
there's always a way that we can get to another station by visiting one
of the closest three?
I came up with a new visualisation, and it's called a Voronoi plot. I'll
run this code. What a Voronoi plot does, is it's not easy to show here,
so I'll show it in a web page that I did. On this page you see the ACORN-SAT
stations and you see all these polygons. What these polygons are
- every point inside this polygon, for example, is closer to this station
than to any other station that's not inside the polygon. So any point
inside this polygon - this point, for example, is closer to there than it is
to any of the surrounding stations. So basically, it divides the territory
up into areas. In a way it's saying, okay, well, the temperature there,
we could argue, is mostly influenced by this station, so if we've got a
temperature here, and want to check it for accuracy, or whatever,
we're more likely to look here, than one of these other stations.
What does this have to do with where do we build a site? Well, if we
consider this line, any point along this line is the point that's the
furthest point between this station and this station, and any point on
this line is the point that's the furthest one between that station and
that station.
Therefore, if we were going to - ah - so any point on one of these
edges here, these where these lines meet, is the point that's the
furthest from all the adjoining stations. So this point is furthest from
that one, that one and that one - and obviously further than any other.
So what it comes down to is if we're going to build a new station, we
want it on one of these points. On one of these vertices. So it's just
another example of how we can use visualisations to solve some real
problems.
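A sketch of the geometry behind this argument, under simplifying assumptions. Stations are treated as points on a flat plane (real latitudes and longitudes would need a map projection first), the coordinates are made up, and the function names are hypothetical. A point belongs to the Voronoi cell of its nearest station, and a vertex where three cells meet is the circumcentre of those three stations, so it is equally far from all three.

```python
import math


def nearest_station(point, stations):
    """Name of the station whose Voronoi cell contains `point`."""
    return min(stations, key=lambda name: math.dist(point, stations[name]))


def circumcentre(a, b, c):
    """Point equidistant from three non-collinear points a, b and c."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy


def is_voronoi_vertex(centre, triple, stations):
    """True if no station outside `triple` is closer to `centre` than the triple is."""
    radius = math.dist(centre, stations[triple[0]])
    return all(math.dist(centre, pos) >= radius
               for name, pos in stations.items() if name not in triple)


# Made-up planar coordinates for four stations.
stations = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (0.0, 2.0), "D": (5.0, 5.0)}
centre = circumcentre(stations["A"], stations["B"], stations["C"])
```

Here `centre` is a candidate site in the sense used above: it sits at the same distance from A, B and C, and no other station is nearer. With SciPy installed, `scipy.spatial.Voronoi` computes all such vertices for a full set of points.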
For the moment, that's all we're going to do with maps, and we may
return to it soon.
The next library we're going to quickly have a look at is called Bokeh.
Bokeh is the first library - it works with Python, but its output is
targeted generally at web pages. Once again, you'll remember
Anscombe's Quartet from a previous slide. We'll do it in Bokeh. It's
given us a really nice graph of Anscombe's Quartet. If one sees some
of the original drawings of it, for example, in Tufte's book, this is very
close to the original, so it was very easy to - well, it required some
work to make it similar, but we could - it was flexible enough that we
could.
I'll quickly show another one, which is another famous machine
learning data set, which is Irises. This one is plotting. So what we're
plotting is the petal width of different species against the petal length.
We see that some species are down here, some species - the green
ones - are up here, and some over here.
The thing about Bokeh is it allows us some interactivity, so we can do
things like zoom, you can also pan, and if we put the output on a web
page, the web page can have these same tools. There's a wheel
zoom. We can go back to what it looked like initially. So that's Bokeh.
Here's one that also came from that last one, called Joyplots. The
thing about this is we're plotting a whole lot of variables against a
common set of axes.
I'll just for the moment skip over Plotly, because I want to look at a few
tools that are useful in web development - so we're leaving Python for
a moment.
The first one is one that I wrote a few years ago. This is using Google
Maps and I'm putting some data on it. These are the ACORN-SAT
stations once again. When we click on one, we get a graph of the
climatology, the average monthly temperature, so let's go to
Melbourne. We're now in April, so the average maximum temperature
for Melbourne is normally 21 degrees. This is the average rainfall for
Melbourne - around 50 millimetres. We can also get a time series and
we can zoom in on the time series.
This graph and the time series were done using a tool called
Highcharts. Highcharts is available free for non-commercial use, but it
does require a licence for any kind of commercial use, and government
use is also considered to be commercial. Having said that, if you are
doing web pages and you are looking for a plotting package, it's worth
considering Highcharts.
The next example is another mapping library. This one is Leaflet. In
this case - this is something I did for work. What we're plotting here is
- this data is coming from NetCDF files. Some people will be familiar
and have used NetCDF - and the data's coming straight out of these
NetCDF files.
The main purpose of this slide though is to show this library Leaflet,
which basically allows us to put data on top of maps. In this case it's
gridded data, but we can also put - here we've got some GeoJSON.
We could also be putting shape files and other things. There's things
like utility boundaries, which you can overlay on the maps. So it
basically allows us to overlay data on top of maps.
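As a rough illustration of the kind of data Leaflet can overlay, the sketch below builds a GeoJSON FeatureCollection of point locations in Python. The two place names appear in the talk, but the coordinates are approximate values added here for illustration, and `to_geojson` is a hypothetical helper. Note that GeoJSON stores coordinates as longitude first, then latitude.

```python
import json

# (name, latitude, longitude); approximate, illustrative coordinates
stations = [
    ("Halls Creek", -18.23, 127.66),
    ("Melbourne", -37.81, 144.97),
]


def to_geojson(stations):
    """Build a GeoJSON FeatureCollection with one Point feature per station."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"name": name},
            }
            for name, lat, lon in stations
        ],
    }


geojson_text = json.dumps(to_geojson(stations), indent=2)
```

On the web page, Leaflet can add such an object to a map with `L.geoJSON(data).addTo(map)`.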
The third example I'll show is one called OpenLayers, and this was
one of the more complex visualisations I did. Basically what this one is
demonstrating is east coast lows, off the eastern seaboard of
Australia, and all of this was overlaid on this map using - the map was
done with OpenLayers. I think that's all I'll say about maps.
I think finally what we'll do is look at one more library and one more
example. The library we're going to look at is called Vega. Once
again, it's another simulation. I came across this thing called
Parrondo's paradox. For me it was quite mind-blowing, so I just had to
do a visualisation to make sure that I understood it and that it worked.
Basically - I'll try and explain it quickly - you've got three games you
can play. Each of them involves a coin being spun. In game one the
coin is more likely to land on tails. So each time in game one you bet
on heads - in other words, it's a losing strategy. So that's game one.
In game two, we occasionally choose coin one - oh, sorry - we've got
coin two, which most often lands on heads, but we don't choose coin
two all the time. We just - sorry, we don't choose heads all the time.
Sometimes we choose heads, sometimes we choose tails. Most of the
times we choose heads, but two out of three times we choose tails,
and it can be shown, once again, that that's a losing strategy.
In game three what we do is we randomly decide to either play game
one or game two. So if game one, we definitely lose and game two we
definitely lose, we would think that choosing game one and game two
we should also lose, if we just choose randomly between whether to
play game one or game two.
In this one I've used this library called Vega, and I think the first thing
I'll do is just run - so I play this game 10,000 times. I play game A and
plot the results. I play game B 10,000 times, plot the result. Then I do
P3, which is where I randomly choose between game one and game
two and plot the results. We run the simulation. P1 is when I play
game one and we can see I started off with zero dollars - end up with
minus $100. When I played game two, which was also a losing
strategy, I did actually quite badly. I ended up with minus $250, but
when I alternated randomly between the two games, I landed up in the
black with plus $150.
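The games are only described briefly in the talk, so the sketch below assumes the commonly cited formulation of Parrondo's paradox rather than the notebook's exact code. Game A uses a coin that wins slightly less than half the time. Game B uses a bad coin when the current capital is a multiple of three and a good coin otherwise. The bias `EPS`, the function names and the number of plays are all assumptions. Many more plays are used than the 10,000 in the talk so that the trend outweighs the run-to-run noise.

```python
import random

EPS = 0.005  # small bias against the player (assumed value)


def play_a(capital, rng):
    """Game A: a single coin that wins slightly less than half the time."""
    return capital + (1 if rng.random() < 0.5 - EPS else -1)


def play_b(capital, rng):
    """Game B: the coin used depends on whether capital is a multiple of 3."""
    p_win = (0.10 - EPS) if capital % 3 == 0 else (0.75 - EPS)
    return capital + (1 if rng.random() < p_win else -1)


def play_random(capital, rng):
    """Combined game: pick game A or game B with equal probability each turn."""
    game = play_a if rng.random() < 0.5 else play_b
    return game(capital, rng)


def run(game, plays, seed):
    """Play `game` repeatedly, starting from zero dollars; return final capital."""
    rng = random.Random(seed)
    capital = 0
    for _ in range(plays):
        capital = game(capital, rng)
    return capital


results = {
    name: run(game, plays=300_000, seed=42)
    for name, game in [("A", play_a), ("B", play_b), ("random", play_random)]
}
print(results)
```

Under these assumptions both individual games drift downwards on average while the random mixture drifts upwards, which is the effect described above.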
This site or this Python notebook will be included after the talk. You're
welcome to have a look at this and find the mathematical explanation
why it works, or you can also just Google Parrondo's Paradox.
So, what have we found out? Well, I guess one of the questions is if I
want to do visualisations, what's a good tool?
In brief, Matplotlib is a good one to start with. Easy things are easy.
Flexible things are possible. It can do dozens of different
visualisations. It's very good for static plots - in other words, if you're
going to publish your results in a book, or whatever, and it also
integrates well with Python's maths and science toolkits. If you're
familiar with Python, it understands things like NumPy and SciPy, and
they're all tightly integrated.
Seaborn makes it easier to do, let's say publication-ready plots with
Matplotlib.
Bokeh has very nice output. It targets web pages. It's got a slightly
easier learning curve than Matplotlib, and it looks good out of the box.
Plotly, one of the things is it's based on a commercial package and
there's both commercial and non-commercial versions of it available. It
leverages D3 for graphics - D3 is a fantastic JavaScript graphics
library that unfortunately this talk didn’t give us time for - and because
of that, the interaction is more extensive than Bokeh, and also the
range of things.
One thing I didn’t talk about with PDVega - or Vega - is that it's got an
interesting way of working in that it defines a language for defining a
graph and it displays it, but when you create a graph with Vega, that
graph includes all the data that was used to create the graph, so if
you're interested in making your publications and your data available -
so it's one thing to get - see a graph in a paper and say, okay, well,
how do I reproduce this graph? It's another thing to say, okay, this is
the graph, and this is all the data that created this graph. So it's really
worth considering if it's important to you to publish the data with the
graphs.
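To illustrate how a chart specification can carry its own data, here is a minimal sketch that builds a specification as a Python dictionary and serialises it to JSON. It assumes the Vega-Lite grammar (the higher-level language in the Vega family) and reuses the two dam percentages quoted earlier in the talk. The field names `name` and `pct_full` are illustrative.

```python
import json

rows = [
    {"name": "Thomson", "pct_full": 68},
    {"name": "Tarago", "pct_full": 95},
]

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},  # the data travels inside the chart definition
    "mark": "bar",
    "encoding": {
        "x": {"field": "name", "type": "nominal"},
        "y": {
            "field": "pct_full",
            "type": "quantitative",
            "scale": {"domain": [0, 100]},
        },
    },
}

spec_json = json.dumps(spec, indent=2)
```

Anyone who receives `spec_json` has both the description of the chart and the values behind it, so the figure can be redrawn or the data reused.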
Basemap is based on Matplotlib. It sits on top of Matplotlib. It can be a
bit clunky, but it does the job.
Cartopy is still, I don't think, 100 per cent production-ready, but it
improves on Basemap - makes it easier to use and has some great
features.
Then I'll quickly go through, Leaflet - its advantages were lightweight,
it's quick to learn and use, and supports many formats - most
particularly WMS and GeoJSON.
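For readers who have not met GeoJSON, a minimal sketch of producing it from Python might look like this. The function name `points_to_geojson` and the sample points are hypothetical; the only format detail relied on is that GeoJSON stores coordinates in longitude, latitude order.

```python
import json


def points_to_geojson(points):
    """Convert (lon, lat, properties) tuples into a GeoJSON FeatureCollection."""
    features = []
    for lon, lat, props in points:
        features.append({
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": dict(props),
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample points, not real observations.
sample_points = [
    (144.96, -37.81, {"label": "site A"}),
    (151.21, -33.87, {"label": "site B"}),
]
collection = points_to_geojson(sample_points)
geojson_text = json.dumps(collection)
```

The resulting text is the kind of input a web mapping library can load as a layer.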
OpenLayers is more fully featured than Leaflet. It used to have a steeper
learning curve than Leaflet, but modern versions are actually much
easier - or they've improved the - they've made the learning curve less
steep.
I didn’t get the chance to demonstrate Cesium, but it can utilise built-in
3D capabilities of browsers, and it works just out of the box. You can
install it and immediately you've got a map up and running.
I installed it recently, just to try it out and about an hour later decided
to download some earthquake data from the United States
Geological Survey, and within about 15 minutes I was displaying
that data on my map. So it makes it really easy.
What are my recommendations? If you work with Python and you're
not interested in learning a lot of programming and getting deeply into
it, but you do need to work with data and you're doing research, I
recommend you learn Pandas - use Pandas for plotting with static plots
and use Vega for the web.
Thanks very much. That…
Gerry Ryder: Well, thank you so much Martin, for such an informative and practical
presentation and bravely, with so many live demos, which we rarely
see. So thank you for that.
Now, we do have time for questions, so if people have questions or
comments, please put them into the question pod. Now's your chance
with Martin online to ask any specific questions about packages or just
some of the things that you've seen today. So please do ask away.
We have got time for a few questions.
Martin, we do have a question from Marlon. What's your opinion on
tools like Tableau - or Tableau - T-A-B-L-E-A-U? Two people have
asked about that one.
Martin Schweitzer: Tableau. So Tableau is what's known as a BI tool, or business
intelligence tool. It's used in the Bureau. It's a commercial tool. I think
I'm correct in saying that it's only commercial. There may be demo
versions available.
From everything I understand, it does what it's designed to do
extremely well. It's very good at building dashboards. I think it often
assumes the idea that there's going to be a data warehouse available
- or at least a data mart. I know previous versions where it was used,
there were some issues with creating websites that were being
presented to the public. This was because it wasn't WCAG compliant -
WCAG is the Web Content Accessibility Guidelines, and for government work
websites need to be WCAG compliant.
It had some mapping features, but the maps only allowed single
layers, which would have made something like what I demonstrated
with the rainfall maps very tricky, because we had sometimes up to
five or six different layers on those maps.
So I guess, I want to neither recommend nor dismiss any packages, but
I think from everything I understand, and I'm not a regular Tableau
user, but it works well for its design purpose and one of the areas
where I know people have really enjoyed using it is where they've
wanted management type dashboards on their desktop to be able to
monitor whatever it was that they were monitoring.
Gerry Ryder: Thanks, Martin. John has popped into the question pod that there is a
free public version of Tableau, Tableau - if I can get my mouth around
that. So if people are interested they could go and check that out for
themselves.
Colin's asked, Martin, why Python, and not Ruby? He also has asked
if MATLAB or R make the grade?
Martin Schweitzer: Okay, the reason Python and not Ruby is because I know Python and
I don't know Ruby particularly well. When Ruby came out, I started
learning it and then other things got in the way. I don't think there's
any good reason why not Ruby, but I can't talk with authority on how
many - I think one of the things is with data science, Python and R
really seem to have taken a lot of that mind share. Between Python
and R, I wouldn't - it's six of one and half a dozen of the other. There
are a lot of people using R. There are a lot of domains where people
really love R. Bioinformatics, I know is one where it's very common.
Every - and as I said in the beginning, most of these visualisations
and that are available in almost any language that people look at or
any popular language.
When people come up with a library like Plotly, they - other people will
create bindings for different languages.
Gerry Ryder: Thanks, Martin. Another question - do you use other mapping tools
like ArcMap? That's another question from Marlon.
Martin Schweitzer: Well, at the Bureau Esri products are very popular. I personally don't
use ArcMap, and probably just because of the nature of the work that
I'm doing, and probably because of the current tool chain that
we've got. I do use an open source product called QGIS occasionally,
but even that I don't use often. Most of my work is done in the - well,
of this type - is done using things like JavaScript, and so I just use the
JavaScript libraries that are available.
Gerry Ryder: Okay, and a question from Susan, who's interested in an online tutorial
for beginners in data visualisation. So apart from recordings of your
own webinars, Martin, are there any - anything that you could
recommend to Susan? That might be one to take on notice.
Martin Schweitzer: I think it is, and I'll definitely have a look, but there's a lot of MOOCs,
so you might go to places like Udacity or edX, and lately I've
been noticing, particularly with the current flavour of the month being
data science, a lot of these places are offering courses, but yeah,
certainly I'll have a look, and maybe we'll put a - in one of our snippets or
something, a beginner's guide to visualisation.
Gerry Ryder: Okay. Thank you, Martin. That's probably a nice segue to plug our
updated web page. Now, we - Martin's kindly spent some time
updating the content of our web page, on the ANDS website. I'm just
showing you the link here. So a lot of the tools and the libraries that
Martin's spoken about in the webinars are available and described
there, so please go and have a look at that. Also, of course, these
webinar recordings will be made available.
We have one last question, Martin, from Sophie, do you recommend
Codecademy?
Martin Schweitzer: I haven't used Codecademy. I've got an account, I know, because I
keep getting emails from them, but I think it's pretty much - there's a
lot of good stuff available, so I think it's pretty much try and find
something that suits you.
Gerry Ryder: So that's great timing for the end of our webinar today. Thank you all
for coming along, and a big thanks to Martin for two fantastic webinars
and presentations and making all the materials available through the
presentations and through our updated web page.
We look forward to seeing you at one of our future webinars, and in
the meantime, have a great afternoon. Thank you very much.
END OF TRANSCRIPT

Transcript - Data Visualisation - Tools and Techniques

  • 1. [Unclear] words are denoted in brackets Webinar: Data Visualisation Part 2 – Tools and Techniques 12 April 2018 Video & slides available from ANDS website START OF TRANSCRIPT Gerry Ryder: Good afternoon everyone. My name is Gerry Ryder and it's my pleasure to host this webinar today. Now, on to our speaker today - for those that weren't with us for the previous webinar in this series, our speaker is Martin Schweitzer, who is a data technologist with ANDS in our Melbourne office. Martin has a background in computer science and a particular interest in visualisation, data science and user interface design. He has a very professional background, which includes photography, working on large IT systems, lecturing, as well as running workshops and training courses. Martin is currently seconded to ANDS from the Bureau of Meteorology, where he is largely responsible for the climate record of Australia. Today Martin is presenting the second in the series of two webinars on data visualisation and today's focus will be on tools and techniques. So without any further ado, I'll hand over to you, Martin. Martin Schweitzer: Thanks Gerry, and thanks Susannah, who's behind the controls. I hope everybody can see my screen.
  • 2. Page 2 of 20 Today we're going to look at creating visualisations and pretty much everything you see is going to be live. I'm using a tool called Jupyter Notebook. You don't have to be familiar with this tool to follow along. Also, I'll be using Python for my examples, but if you don't know Python, once again, that shouldn’t be a problem, because most of the tools and techniques that I will be showing you will be available in other languages, for example, R and the languages like that. I'll be going through a number of libraries, showing generally the strengths of each library, where they can be used, how they can be used, and as we progress, we'll move from more static to more web- based type environments. Jupyter Notebook runs in a web browser, and so what you see here is my web browser, and I'll just maximise it now - that you know it's a web browser. What it allows us to do is to type in Python code and then execute it immediately. This is great for anybody doing research, because a lot of the work is in experimenting. You try something, you adjust a few parameters, and so on. The first two lines I've got is just to set up our environment and it's - often we get error messages showing. So this will just hide them. Some of the libraries we'll be exploring today - the first one is Matplotlib. We'll be looking at Pandas, one called Seaborn, two web- based, called Bokeh and Plotly, and the last one is one that's used for mapping, called Basemap. As I go into each one, we'll talk about them in detail. Now, if anybody missed the first talk, that's not a problem, because I'll explain things as I go along, but a number of these examples are showing how we created the plots that you may have seen in the first talk. Also, one of the things I said in the first talk was, when during any sort of visualisation, it's important to have some sort of story or some sort of reason, or something we're trying to say with that visualisation. 
The first visualisation I'm going to share is actually based on a problem that I came across just while browsing the web. I'll explain the
  • 3. Page 3 of 20 problem. Basically, we have a room - there are 50 people in the room, and each person starts with $100. Each time the clock ticks, each person takes a random number - picks a card between one and 50, and let's say their card says 26 - they'll give $1 to person 26, if they've got some money in their hand. If they've got no money at that point in time, they don't give anything. The question was, after a few thousand ticks, let's say a thousand, how much money - how will the money be distributed? Will it be fairly evenly distributed among the people? Will some people have a lot of money and some people a little - and so on. I found this quite interesting. I wrote a small simulation. The first part of this code is the simulation, so all that is simulating what happens. I'm now using the library called Matplotlib. Matplotlib is - as much as you can say there's a standard plotting library in Python - the standard plotting library. In Matplotlib, to plot my results that I got requires one line. So to run this, I just press - I'm going to press control return, and if everything works, as I hope it will, we get a plot. So this one line of code gave me a plot. It realised that there were 50 elements and that the values inside those elements were between zero and 350. However, this plot doesn’t really give us a picture of what's happened. So what I'm going to do is sort it so that the people with the least money come up first and the people with the most money come last. As I said, this is an interactive environment, so all I have to do is press - make that change, press shift enter again. It runs it a thousand times. Now I get a very different plot. Here we can see the a lot of people have below $50, most of them have below $100, and then very few people have between $250 and $350. One of the things you will note - I haven't done this yet. 
I'll just change this once again, and now what I'm going to do is to save the plot as an SVG - SVG is scalable vector graphics - and I'll run the plot again. It will save it, and I can now open it in a web page. So we'll just open that web page here, and what we see is we've got a nice plot. The
  • 4. Page 4 of 20 other thing about it that's special - about scalable vector graphics - is that it's scalable. If I make it bigger - because it's a vector, we don't get any artefacts. It just scales. It gets bigger, it gets smaller. We don't lose any quality. The next thing is we can look at this and say, well, this sort of looks exponential. How well does it fit an exponential curve? Once again, Matplotlib allows us to do this. We're not going to rerun the simulation. We'll just use the data from the last simulation and when we run it, what we see is we get a nice little - so what I've done is I said, added a line - a polynomial of third degree that fits those points. I've just noticed that I haven't reset something from the last time I ran it, so I'm quickly going to restart this. I'm going to have to rerun the simulation, unfortunately. When we do this again - okay, which is what I expected, an orange line. Initially what we see is we've got this thin orange line, which nicely fits the points. By changing the plot we can set things like the line width - equals five - run it again, and we get a thick orange line, which is much easier to see, but it's crossing over the points. So the next thing we can do is just say, alpha - which is the transparency - so equals 0.5, which basically says, make it 50 per cent transparent. Now we get a nice, thick, transparent line running through those points. So once again Matplotlib gives us the ability to customise this line. If we wanted Xs instead of circles, we can change the plot of the points to Xs - and I'll run again, and we get Xs. So one of the characteristics that we often look for when looking for a library is this idea that simple things should be simple, complex things should be possible. In other words, we don't want a very long learning curve, or have to do a lot of work to get a simple graph, but if we do want something special, we want to be able to do it. 
We don't want a tool that's really simple to use but as soon as we want something a little bit out of the ordinary, suddenly the tool stops working.
  • 5. Page 5 of 20 We'll look at a few more examples of Matplotlib. This one actually comes from the documentation. The easiest thing is to show it. It plots a polar graph. It's using different colours, and we've just - in this case - generated some random numbers and made them the size of the circle and generated a random number for where, around the circle, we've plotted it, and then depending on the angle we're plotting the numbers in a different colour. Once again, the code that's doing the plotting are those two lines. Once we've generated the data, all we need are two lines of code and it gives us a really nice polar plot. Here's another example, once again, also from the documentation. What we're doing with this one is we're going to display it interactively in Jupyter Notebook, but we're also going to save it. This time we're going to save it as a PDF. So let's run this, and there we get four histograms. It's the same data each time, but what it's demonstrating to us is different ways in which we can create histograms, so we can have stacked, we can have unfilled, we can have bars with legends, et cetera. If we have a look at the PDF that we've generated, there we go - and once again, because it's a PDF, it's scalable, and as we scale we don't lose any quality. It just - it's all done with vectors and gives us really nice output. Unfortunately, I seem to have closed my - oh, there we go. Where that's useful is if we're doing any kind of publication, it's really nice to be able to save our output either as SVG or PDF and include that in a publication. This is one more, showing the range of plots we can do. This one is called a hexbin plot. 
So it's plotting with hexagons, and what we're doing is it’s a cross between a scatter plot, which plots X values and Y values, on a graph, where we've got - like we may have the X values may be how many figs somebody has eaten, and the Y values may be something like the weight, and we want to plot those two against each other, but what this is also doing is plotting it against how frequently those values occur. One of the nice things is it's really easy to do log scales. What we see here is a nice graph showing us that that white
  • 6. Page 6 of 20 area in the centre are values that occurred very often, and as we move out values, occurring less and less often. The following one is we've - actually, what we're going to do is quickly look at a comic. Some people may be familiar with the web comic called XKCD. The person Randall Munroe is a very funny person, but with quite a scientific bent, and also quite strong computer skills. This one is called Stove Ownership, and it shows his health before he realised he could cook bacon whenever he wanted, and afterwards. The thing about this graph is that it's hand-drawn. While sometimes we want graphs that look very polished, very professional, there's often a perception when people see a graph like that that the figures are very accurate, and this isn't always the case. So what people did was to create a style, using Matplotlib, that would recreate the look at feel of the XKCD. This is quite a lot because there's quite a lot. In fact, they've taken two of his comics and I'll quickly plot that, using Matplotlib, and what we see is a compute-generated reproduction of Randall Monroe's graph. That's one of his - this is a histogram, done in very much the same style, which copies another one of his comics. So, taking the style, I will replot my simulation - you'll remember the results from my simulation. So if we run that again, we see we get this thing, and which once again - so now that it doesn’t look slick and professional, we see really this is very much a simulation that these figures aren't accurate, and so on. What this does do for us is it does give us an idea of the flexibility of Matplotlib. I'll quickly restart the kernel before going into our next library. The next library is called Pandas. Pandas is a very useful library for anybody who's working with spreadsheets, who's working with CSV files, who's working with data that's coming from an API across the web, and it also has its own plotting routines built in. 
In this code here, the first line, which actually goes over three lines, I'm reading a file which is dam storage levels. It's a CSV file. Anybody
  • 7. Page 7 of 20 who watched the last presentation would be familiar that I showed some examples. You'll see the same examples again today. The first line reads the file, the second line plots it, using Pandas plotting, and the third line just adds a legend to it, or sets the label on the file at [send full]. So we're run this code, and there we did get - these are Melbourne dams, and this is showing that the Thomson is about 68 per cent full, and things like Tarago are 95 per cent full. So what we've done is in one line we've run that CSV file, we've told it what we want to call the columns. One of the columns is called name and one of the columns is called Pfull, for percentage full. When we plot it, because Pandas knows about this thing called DS, all we have to say, I want to plot the name against the percentage full and I want a bar chart. I've also said, I want to plot from the value 60 to 100. If I leave out those values, but the same graph, it will plot from zero to 100 by default. On this one - part of the thing was to show that even though we got the same figures in the same graph, it looks different when we start our scale from zero. Once again, the Thomson is about 68 per cent full and Tarago is almost 100 per cent full. The point which is the take-home point here is that to create that plot took two lines of code. The third thing that we showed last time was what's really interesting though, is the gap in volume of these different dams. That gives us a much better picture of what's happening. So when we run that what we see is because the Thomson dam is a really big dam, it's got over 200,000 gigalitres of water deficit. So even though these dams on the right are almost full, altogether they don't even make up that deficit in the dam. So that's Matplotlib and its strength is that it pretty much comes standard with Python. It's flexible, and so on. However, its simplicity often comes at the cost that it's not the best publication-ready graphing tool. 
You can get very nice publication-ready graphs by doing a bit of work, but what some people have done is to do that work to make it easier for people to create better graphs. One of those libraries is called Seaborn. Seaborn basically sits on top of Matplotlib and simply adds some nice styles. We'll replot that same plot, this time using Seaborn. All that we're doing here is importing it and initialising it. So we've just added those two lines. Everything else is exactly as in the last example. We run this, and we see a similar graph, but with different styling.

What Seaborn has done is to make it quite easy to change the styling. I'll say set the style to white, run it again, and we'll see we get a nice clean graph. In the next example what we will do is set the style again, but this time we want a white grid and a muted palette. We run that one and we get that white grid with a muted palette.

One thing that Seaborn does, which a few packages are starting to do, is that it actually includes its own datasets when you install the package, which is really great for when you're learning, because one of the worst things is you pick up a package, you try and learn it, but the first thing you've got to do is find some data to plot and so on. One of the datasets that Seaborn comes with is this one called Flights, and I really enjoy heatmaps, so this is just an example of a really simple heatmap, using Seaborn and some of the data that's provided. What we see over here is that along the bottom are years, and along the y-axis are months. So round about 1960, July, there were lots of flights, and in the earlier years I guess there were fewer flights. Also during winter there are fewer flights than in summer. Once again, done with Seaborn and done really with two lines of code, which are those lines.

Another dataset that comes with Seaborn is called Tips.
It's basically how much people will tip at restaurants. So the first thing we'll do is to load the dataset and have a look at the first 10 rows of this dataset. What we see is we've got a few columns. The first one is the amount of the total bill, then what tip was left, the sex of the person serving, whether or not they were a smoker, what day of the week it was, whether it was lunch or dinner, and the size of the party. We're going to use that dataset and have a look at a few Seaborn graphs.

The first one we're going to look at is a box plot. What we've done is we've said we want the hue to be whether they're a smoker or not, so the purplish colour means they were smokers, the greenish colour means they weren't. On the left-hand side we've got the size of the total bill and across the bottom is days. So it does seem that on Sundays maybe people tip more, and it would look like on Sundays maybe for some reason, whatever, smokers tip more than non-smokers.

Another plot that is often used in similar ways to the box plot, but encodes a bit more information, is what's known as a violin plot, and these once again are quite easy in Seaborn. In this case what we've done is we've used a different hue for male and female. Basically you read this pretty much the same as the box plot. There's the median. There's the top quartile, the bottom quartile, et cetera. Some of the information is very much the same. On Sundays people seem to tip the most, and we can see they've been split this time into male and female.

Those people who were at the last one will remember I demonstrated something called Anscombe's Quartet. It's four datasets, each with the same means and linear regression lines, but each dataset looks very different. Here's a very simple example of it being done in Seaborn. We'll just have a quick look at that. We see it was quite easy. In this case, we're sharing the y-axis. Across the bottom we're sharing the x-axis of the two plots, and all of this was done in a very, very compact way, using Seaborn.

The next thing we're going to look at is plotting data on maps.
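
The claim about Anscombe's Quartet, that the four datasets share the same means and regression lines, can be checked without any plotting library. The sketch below uses the values from Anscombe's published quartet and a plain least-squares fit; the helper name `fit_line` is made up for this illustration and is not part of Seaborn.

```python
from statistics import mean

# Anscombe's quartet: datasets I to III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def fit_line(xs, ys):
    """Return the least-squares (slope, intercept) of y regressed on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


summary = {}
for label, (xs, ys) in QUARTET.items():
    slope, intercept = fit_line(xs, ys)
    summary[label] = (mean(ys), slope, intercept)
    print(f"{label}: mean y = {mean(ys):.2f}, line = {intercept:.2f} + {slope:.3f} x")
```

The summary numbers agree to about two decimal places across all four datasets, which is exactly why the plots are needed to tell them apart.
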
This goes back to a lot of what I do in my substantive job at the Bureau. The library that we use for a lot of our mapping is, once again, a standard with Python. It's called Basemap. In the first one, we're not actually plotting anything, we're just simply drawing a map, so that's what we should see now. It takes a little while the first time we run it, but we've plotted a map of the world in a few lines of code. That's pretty much from there to there. What we're really interested in at this stage is Australia, and this projection isn't as useful as what we're going to look at now, which is a sort of Mercator, so we'll just change some of the parameters and this should give us a map of Australia, which is great. It looks a bit like the ones I draw by hand.

What I'm going to demonstrate now is some more visualisation, but it goes back to a problem I was given, oh, about a year or two ago. We have about 112 reference stations around Australia. These are stations with very high-quality data that have a long record - about 50 years or longer. These are very important as reference stations, to see what's happening with the climate of Australia. The reference station set is called ACORN, and we do a publication where we publish the names of each station. One of the things we also publish is, for each station, which are the closest three stations to that station. I wrote some code that worked out what the closest three stations were to each station.

This was the file I was given - once again, I'm using Pandas to read it. So we've got, for example, Halls Creek, we've got the latitude, longitude, the altitude and the date it was opened. As you can see, these all have a very long record. The first thing is I plot these, using Matplotlib - the first parts we've seen. That draws the map. This line, after having read the file, plots the data on a map, so we'll just quickly plot those stations. The black dots of course are the stations, and there are 112 of them around Australia.
The question I was asked came after saying, okay, here's a list - for each station these are the closest three stations to that particular station. Being scientists, they always ask interesting questions. They said, by going to one of the closest three stations from each station, is it possible to get from any station to any other station? Now, it may seem that the obvious answer is yes, but the thing is, if I'm sitting here, these may be my three closest stations, but that does not mean that where I'm sitting - which is around Meekatharra - is going to be one of the closest three stations to this station, because this station's three closest neighbours are maybe these three stations.

The first thing I did - because I'm very visual - was to try and visualise it. What I did was to go back to a very old package called Graphviz, which is about 30 years old. I first used it probably more than 20 years ago. There are Python bindings for Graphviz. We can think of this as each station being a node, and we've got lines connecting it to the three closest stations. What I've done is to do something that will visualise that. So we'll just run this code and it creates a PDF.

What we see in this PDF is that - I've simply used the station numbers to save space - we can see the layout of all the stations and - move across here - one of the things that we see, for example, is that station 7045, even though it's got three stations that are closest to it, there's no station for which 7045 is one of the closest stations. We can see it in a few other places as well. I think over here, we've only got one line going from 85096 to 91293. If anybody wants to guess, this part is in fact the stations that are in Tasmania. If we go back to our map, we can see how these are all close together - that station is close to that one, but these are all closer to each other than the main one. So basically, that graph helped us visualise, and yes, it turns out, after writing some code, that there is no single path.
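
The "closest three stations" question is a reachability question on a directed graph, and it can be sketched with the standard library alone. This is not the Bureau's code: the station names and coordinates below are invented (a mainland-like cluster and a distant island-like cluster), and the great-circle distance uses the haversine formula with a mean Earth radius of 6371 km. The last function writes Graphviz DOT text by hand rather than through any Python binding.

```python
import math
from collections import deque

# Hypothetical stations as (latitude, longitude); not real ACORN-SAT sites.
STATIONS = {
    "M1": (-20.0, 120.0), "M2": (-21.0, 121.0), "M3": (-22.0, 120.0),
    "M4": (-21.0, 119.0), "M5": (-20.5, 120.5),
    "T1": (-42.0, 146.0), "T2": (-42.5, 147.0),
    "T3": (-41.5, 147.0), "T4": (-43.0, 146.5),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_neighbours(stations, k=3):
    """Map each station to its k closest stations (a directed graph)."""
    graph = {}
    for name, pos in stations.items():
        others = [n for n in stations if n != name]
        others.sort(key=lambda n: haversine_km(pos, stations[n]))
        graph[name] = others[:k]
    return graph


def reachable(graph, start):
    """All stations that can be reached from start by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def fully_connected(graph):
    """True if every station can reach every other station."""
    return all(reachable(graph, s) == set(graph) for s in graph)


def to_dot(graph):
    """Render the graph as Graphviz DOT text."""
    lines = ["digraph stations {"]
    for src, dsts in graph.items():
        lines.extend(f'    "{src}" -> "{dst}";' for dst in dsts)
    lines.append("}")
    return "\n".join(lines)


graph = nearest_neighbours(STATIONS)
print("Every station reachable from every other:", fully_connected(graph))
print(to_dot(graph))
```

In this made-up layout the island cluster only ever points at itself, which mirrors the Tasmanian stations in the talk. The DOT text can be saved to a file and rendered with the Graphviz `dot` command if it is installed.
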
The next question - once again, these people being scientists - is, where would we have to add stations so that there's always a way that we can get to another station by visiting one of the closest three?

I came up with a new visualisation, and it's called a Voronoi plot. I'll run this code. What a Voronoi plot does is not easy to show here, so I'll show it in a web page that I did. On this page you see the ACORN-SAT stations and you see all these polygons. What these polygons are - every point inside this polygon, for example, is closer to this station than to any other station that's not inside the polygon. So any point inside this polygon - this point, for example - is closer to there than it is to any of the surrounding stations. So basically, it divides the territory up into areas. In a way it's saying, okay, well, the temperature there, we could argue, is mostly influenced by this station, so if we've got a temperature here, and want to check it for accuracy, or whatever, we're more likely to look here than at one of these other stations.

What does this have to do with where we build a site? Well, if we consider this line, any point along this line is the point that's the furthest point between this station and this station, and any point on this line is the point that's the furthest one between that station and that station. So any of these points where these lines meet is the point that's the furthest from all the adjoining stations. So this point is furthest from that one, that one and that one - and obviously further than any other. So what it comes down to is, if we're going to build a new station, we want it on one of these points. On one of these vertices. So it's just another example of how we can use visualisations to solve some real problems.

For the moment, that's all we're going to do with maps, and we may return to it soon. The next library we're going to quickly have a look at is called Bokeh. Bokeh is the first library - it works with Python, but its output is targeted generally at web pages.
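
Returning briefly to the Voronoi argument: a vertex where three polygon edges meet is equidistant from the three adjoining stations, which makes it the circumcentre of the triangle they form. The sketch below checks that for three made-up stations. It treats coordinates as points on a flat plane, which is a simplifying assumption, and it does not build the full diagram or check that no fourth station lies closer.

```python
import math


def circumcentre(a, b, c):
    """Point equidistant from three (x, y) points on a plane."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        raise ValueError("points are collinear, so no equidistant point exists")
    a2, b2, c2 = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy


# Hypothetical station positions in arbitrary planar units.
stations = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
vertex = circumcentre(*stations)
distances = [math.hypot(vertex[0] - x, vertex[1] - y) for x, y in stations]

print("Candidate site:", vertex)
print("Distance to each adjoining station:", distances)
```

A library such as SciPy can compute the whole diagram for many stations at once; this hand-rolled version is only meant to show why the vertices are the natural candidate sites.
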
Once again, you'll remember Anscombe's Quartet from a previous slide. We'll do it in Bokeh. It's given us a really nice graph of Anscombe's Quartet. If one sees some of the original drawings of it, for example, in Tufte's book, this is very close to the original. It required some work to make it similar, but it was flexible enough that we could.

I'll quickly show another one, which is another famous machine learning dataset, which is Irises. What we're plotting is the petal width of different species against the petal length. We see that some species are down here, some species - the green ones - are up here, and some over here. The thing about Bokeh is it allows us some interactivity, so we can do things like zoom, you can also pan, and if we put the output on a web page, the web page can have these same tools. There's a wheel zoom. We can go back to what it looked like initially. So that's Bokeh.

Here's one that also came from that last one, called Joyplots. The thing about this is we're plotting a whole lot of variables against a common set of axes.

I'll just for the moment skip over Plotly, because I want to look at a few tools that are useful in web development - so we're leaving Python for a moment. The first one is one that I wrote a few years ago. This is using Google Maps and I'm putting some data on it. These are the ACORN-SAT stations once again. When we click on one, we get a graph of the climatology, the average monthly temperature, so let's go to Melbourne. We're now in April, so the average maximum temperature for Melbourne is normally 21 degrees. This is the average rainfall for Melbourne - around 50 millimetres. We can also get a time series and we can zoom in on the time series. This graph and the time series were done using a tool called Highcharts. Highcharts is available free for non-commercial use, but it does require a licence for any kind of commercial use, and government use is also considered to be commercial. Having said that, if you are doing web pages and you are looking for a plotting package, it's worth considering Highcharts.

The next example is another mapping library. This one is Leaflet. In this case - this is something I did for work - the data is coming straight out of NetCDF files. Some people will be familiar with NetCDF and will have used it. The main purpose of this slide though is to show this library, Leaflet, which basically allows us to put data on top of maps. In this case it's gridded data, but here we've also got some GeoJSON. We could also be putting shapefiles and other things. There's things like utility boundaries, which you can overlay on the maps. So it basically allows us to overlay data on top of maps.

The third example I'll show is one called OpenLayers, and this was one of the more complex visualisations I did. Basically what this one is demonstrating is east coast lows, off the eastern seaboard of Australia, and all of this was overlaid on this map - the map was done with OpenLayers. I think that's all I'll say about maps.

I think finally what we'll do is look at one more library and one more example. The library we're going to look at is called Vega. Once again, it's another simulation. I came across this thing called Parrondo's paradox. For me it was quite mind-blowing, so I just had to do a visualisation to make sure that I understood it and that it worked. Basically - I'll try and explain it quickly - you've got three games you can play. Each of them involves a coin being spun. In game one the coin is more likely to land on tails. So each time in game one you bet on heads - in other words, it's a losing strategy. So that's game one. In game two, we occasionally choose coin one - oh, sorry - we've got coin two, which most often lands on heads, but we don't choose coin two all the time. We just - sorry, we don't choose heads all the time.
Sometimes we choose heads, sometimes we choose tails. Most of the times we choose heads, but two out of three times we choose tails, and it can be shown, once again, that that's a losing strategy. In game three what we do is we randomly decide to either play game one or game two. So if in game one we definitely lose and in game two we definitely lose, we would think that choosing between game one and game two we should also lose, if we just choose randomly whether to play game one or game two.

In this one I've used this library called Vega, and I think the first thing I'll do is just run it - so I play this game 10,000 times. I play game A and plot the results. I play game B 10,000 times and plot the result. Then I do P3, which is where I randomly choose between game one and game two and plot the results. We run the simulation. P1 is when I play game one and we can see I started off with zero dollars and end up with minus $100. When I played game two, which was also a losing strategy, I did actually quite badly. I ended up with minus $250, but when I alternated randomly between the two games, I landed up in the black with plus $150. This Python notebook will be included after the talk. You're welcome to have a look at this and find the mathematical explanation of why it works, or you can also just Google Parrondo's paradox.

So, what have we found out? Well, I guess one of the questions is, if I want to do visualisations, what's a good tool?

In brief, Matplotlib is a good one to start with. Easy things are easy. Flexible things are possible. It can do dozens of different visualisations. It's very good for static plots - in other words, if you're going to publish your results in a book, or whatever - and it also integrates well with Python's maths and science toolkits. If you're familiar with Python, it understands things like NumPy and SciPy, and they're all tightly integrated.

Seaborn makes it easier to do, let's say, publication-ready plots with Matplotlib.
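
The Parrondo's paradox simulation described above can be sketched in plain Python. The spoken description of the games is loose, so this sketch assumes the commonly cited formulation rather than the notebook's actual code: game A uses one slightly unfair coin, game B picks a bad or a good coin depending on whether the current capital is a multiple of three, and the mixed game chooses A or B at random on every play. The bias `EPSILON = 0.005` and all function names are assumptions made for this illustration.

```python
import random

EPSILON = 0.005  # assumed small bias that makes each game lose on its own


def play_a(capital, rng):
    """Game A: a single coin that wins slightly less than half the time."""
    return capital + (1 if rng.random() < 0.5 - EPSILON else -1)


def play_b(capital, rng):
    """Game B: a bad coin when capital is a multiple of 3, else a good coin."""
    p_win = 0.10 - EPSILON if capital % 3 == 0 else 0.75 - EPSILON
    return capital + (1 if rng.random() < p_win else -1)


def play_mixed(capital, rng):
    """Mixed game: choose game A or game B at random for each play."""
    if rng.random() < 0.5:
        return play_a(capital, rng)
    return play_b(capital, rng)


def simulate(game, plays, seed):
    """Return the capital after each play, starting from zero."""
    rng = random.Random(seed)
    capital = 0
    history = [capital]
    for _ in range(plays):
        capital = game(capital, rng)
        history.append(capital)
    return history


for label, game in (("A", play_a), ("B", play_b), ("mixed", play_mixed)):
    final = simulate(game, 10_000, seed=1)[-1]
    print(f"Game {label}: capital after 10,000 plays = {final}")
```

A single run of 10,000 plays is noisy, so individual results vary from seed to seed; the downward drift of A and B and the upward drift of the mixed game show up more reliably over longer runs or when several runs are averaged. The `history` list is what would be handed to a plotting library.
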
Bokeh has very nice output. It targets web pages. It's got a slightly easier learning curve than Matplotlib, and it looks good out of the box.

With Plotly, one of the things is it's based on a commercial package, and there are both commercial and non-commercial versions of it available. It leverages D3 for graphics - D3 is a fantastic JavaScript graphics library that unfortunately this talk didn’t give us time for - and because of that, the interaction is more extensive than Bokeh, and also the range of things.

One thing I didn’t talk about with PDVega - or Vega - is that it's got an interesting way of working, in that it defines a language for defining a graph and it displays it, but when you create a graph with Vega, that graph includes all the data that was used to create the graph. It's one thing to see a graph in a paper and say, okay, well, how do I reproduce this graph? It's another thing to say, okay, this is the graph, and this is all the data that created this graph. So it's really worth considering if it's important to you to publish the data with the graphs.

Basemap is based on Matplotlib. It sits on top of Matplotlib. It can be a bit clunky, but it does the job. Cartopy is still, I don't think, 100 per cent production-ready, but it improves on Basemap - it makes it easier to use and has some great features.

Then I'll quickly go through Leaflet - its advantages are that it's lightweight, it's quick to learn and use, and it supports many formats - most particularly WMS and GeoJSON.

OpenLayers is more fully featured than Leaflet. It used to have a steeper learning curve than Leaflet, but modern versions are actually much easier - they've made the learning curve less steep.
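
The point about a Vega graph carrying its own data can be made concrete with a chart specification written as JSON. The sketch below builds a small specification in the Vega-Lite style, the higher-level grammar in the Vega family, using only Python's `json` module. The field names and values are invented for illustration, and nothing here renders the chart.

```python
import json

# Hypothetical values for illustration; any small table would do.
records = [
    {"name": "Dam A", "pfull": 68},
    {"name": "Dam B", "pfull": 95},
]

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "description": "Percentage full by dam (illustrative data)",
    "data": {"values": records},  # the data travels inside the chart definition
    "mark": "bar",
    "encoding": {
        "x": {"field": "name", "type": "nominal"},
        "y": {"field": "pfull", "type": "quantitative"},
    },
}

spec_json = json.dumps(spec, indent=2)

# Anyone holding the published specification can read the data back out.
recovered = json.loads(spec_json)["data"]["values"]
print(spec_json)
```

Because the specification is plain JSON, publishing it alongside a paper publishes the data as well, which is the reproducibility argument made above.
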
I didn’t get the chance to demonstrate Cesium, but it can utilise the built-in 3D capabilities of browsers, and it works just out of the box. You can install it and immediately you've got a map up and running. I installed it recently, just to try it out, and about an hour later decided to download some earthquake data from the United States Geological Survey, and within about 15 minutes I was displaying that data on my map. So it makes it really easy.

What are my recommendations? If you work with Python and you're not interested in learning a lot of programming and getting deeply into it, but you do need to work with data and you're doing research, I recommend you learn Pandas - use Pandas for plotting with static plots and use Vega for the web. Thanks very much. That…

Gerry Ryder: Well, thank you so much Martin, for such an informative and practical presentation and, bravely, with so many live demos, which we rarely see. So thank you for that. Now, we do have time for questions, so if people have questions or comments, please put them into the question pod. Now's your chance with Martin online to ask any specific questions about packages or just some of the things that you've seen today. So please do ask away. We have got time for a few questions. Martin, we do have a question from Marlon. What's your opinion on tools like Tableau - or Tableau - T-A-B-L-E-A-U? Two people have asked about that one.

Martin Schweitzer: Tableau. So Tableau is what's known as a BI tool, or business intelligence tool. It's used in the Bureau. It's a commercial tool. I think I'm correct in saying that it's only commercial. There may be demo versions available. From everything I understand, it does what it's designed to do extremely well. It's very good at building dashboards. I think it often assumes the idea that there's going to be a data warehouse available
- or at least a data mart. I know with previous versions, where it was used, there were some issues with creating websites that were being presented to the public. This was because it wasn't WCAG compliant - WCAG is the Web Content Accessibility Guidelines, and for government work websites need to be WCAG compliant. It had some mapping features, but the maps only allowed single layers, which would have made something like what I demonstrated with the rainfall maps very tricky, because we had sometimes up to five or six different layers on those maps. So I guess, I neither want to recommend nor dismiss any packages, but from everything I understand, and I'm not a regular Tableau user, it works well for its design purpose, and one of the areas where I know people have really enjoyed using it is where they've wanted management-type dashboards on their desktop to be able to monitor whatever it was that they were monitoring.

Gerry Ryder: Thanks, Martin. John has popped into the question pod that there is a free public version of Tableau, Tableau - if I can get my mouth around that. So if people are interested they could go and check that out for themselves. Colin's asked, Martin, why Python, and not Ruby? He also has asked if MATLAB or R make the grade?

Martin Schweitzer: Okay, the reason for Python and not Ruby is because I know Python and I don't know Ruby particularly well. When Ruby came out, I started learning it and then other things got in the way. I don't think there's any good reason why not Ruby, but I can't talk with authority on how many - I think one of the things is with data science, Python and R really seem to have taken a lot of that mind share. Between Python and R, it's six of one and half a dozen of the other. There are a lot of people using R. There are a lot of domains where people really love R. Bioinformatics, I know, is one where it's very common. Every - and as I said in the beginning, most of these visualisations
and that are available in almost any language that people look at, or any popular language. When people come up with a library like Plotly, other people will create bindings for different languages.

Gerry Ryder: Thanks, Martin. Another question - do you use other mapping tools like ArcMap? That's another question from Marlon.

Martin Schweitzer: Well, at the Bureau Esri products are very popular. I personally don't use ArcMap, probably just because of the nature of the work that I'm doing, and probably because of the current tool chain that we've got. I do use an open source product called QGIS occasionally, but even that I don't use often. Most of my work of this type is done using things like JavaScript, and so I just use the JavaScript libraries that are available.

Gerry Ryder: Okay, and a question from Susan, who's interested in an online tutorial for beginners in data visualisation. So apart from recordings of your own webinars, Martin, is there anything that you could recommend to Susan? That might be one to take on notice.

Martin Schweitzer: I think it is, and I'll definitely have a look, but there's a lot of MOOCs, so you might go to places like Udacity or edX, and lately I've been noticing, particularly with the current flavour of the month being data science, a lot of these places are offering courses, but yeah, certainly I'll have a look, and maybe we'll put, in one of our snippets or something, a beginner's guide to visualisation.

Gerry Ryder: Okay. Thank you, Martin. That's probably a nice segue to plug our updated web page. Martin's kindly spent some time updating the content of our web page, on the ANDS website. I'm just showing you the link here. So a lot of the tools and the libraries that Martin's spoken about in the webinars are available and described there, so please go and have a look at that. Also, of course, these webinar recordings will be made available.
We have one last question, Martin, from Sophie: do you recommend Codecademy?

Martin Schweitzer: I haven't used Codecademy. I've got an account, I know, because I keep getting emails from them, but I think it's pretty much - there's a lot of good stuff available, so I think it's pretty much try and find something that suits you.

Gerry Ryder: So that's great timing for the end of our webinar today. Thank you all for coming along, and a big thanks to Martin for two fantastic webinars and presentations and for making all the materials available through the presentations and through our updated web page. We look forward to seeing you at one of our future webinars, and in the meantime, have a great afternoon. Thank you very much.

END OF TRANSCRIPT