14. Literate Programming
“A literate programmer is
an essayist who writes
programs for humans to
understand.”
Knuth, Donald E. "Literate programming." CSLI Lecture Notes, Stanford, CA:
Center for the Study of Language and Information (CSLI), 1992.
15. Reproducible Research
“[R]esearch papers with
accompanying software tools that
allow the reader to directly
reproduce the results and employ
the methods that are presented in
the research paper.”
Gentleman, Robert and Temple Lang, Duncan, "Statistical Analyses and Reproducible Research"
(May 2004). Bioconductor Project Working Papers. Working Paper 2.
http://biostats.bepress.com/bioconductor/paper2
From storymaps to notebooks - do your computing one bit at a time.
In this presentation I will review various ways in which we can engage with linear narratives for both explanatory and exploratory/investigative purposes.
In the first case, storymaps can be used to visualise a linear explanation of the connections and relations between a set of geotemporally distributed events.
In the second case, interactive computational notebooks provide a powerful way of constructing and interacting with digital resources in a process that might be described as having “a conversation with data”.
We are wired to listen to stories. Narratives serialise and contextualise a series of events.
We listen to stories in linear time. We write our stories in linear time. Stories may relate a linear sequence of events or they may relate a series of out-of-sequence events. In the latter case, narrative devices are used to join one event to another so that the serialised telling of the tale is coherent and makes sense.
I want to consider two sorts of serialised narrative.
Story maps are interactive maps that can be constructed or animated such that the serialisation of the telling is sequenced using “locations”. Locations may be places, or more generally, scenes.
The presentation of a geographical story need not force a unique serialised reading of it. We need to learn how to read such texts.
This map by Charles Minard, famous in visualisation circles and popularised by Edward Tufte, who described it as “[p]robably the best statistical graphic ever drawn”, tells the story – if you know how to read it – of Napoleon’s 1812-13 Russian campaign. (There’s at least one good reading of it on YouTube: a search for /numberphile greatest ever infographic/ should turn it up. Another great chart storytelling video on YouTube is Kurt Vonnegut’s “Shapes of Stories”, but that’s part of a slightly different story.)
The forward motion in the reading of the chart is to read the brown line from left to right as a progression across space – the coordinate system is a geographical one – through time, followed by the black line from right to left, again, across space and over time. The line thicknesses are also meaningful.
Another way of telling stories through maps is to animate a story through a sequence of scenes that take place in different geographical locations.
Timemapper is an open source online application that takes data hosted in a Google spreadsheet and generates a time map from it.
Each scene is comprised of a location, a date, a title and a description (which may include an image).
The calendar shows the events along a timeline, and on a map. Highlighting an event in the timeline also highlights it on the map.
Events on the timeline are actually ranges, rather than point events. (As an aside, geographical representations can in general – though not in Timemapper – take three forms: point locations, paths (points connected by lines), or regions/shapes (that is, areas bounded by a closed line – one that starts where it ends).)
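As a rough sketch of what one of these scenes might look like as data, assuming one spreadsheet row per scene, here is a small Python record with a date range and a point location. The class name, field names and sample values are all made up for illustration; they are not the column names Timemapper itself expects.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Scene:
    """One row of a hypothetical time map spreadsheet."""
    title: str
    description: str
    start: date              # events are ranges, so each has a start...
    end: date                # ...and an end
    lat: float               # point location only (no paths or regions)
    lon: float
    image_url: Optional[str] = None

    def covers(self, day: date) -> bool:
        """True if the given day falls inside this scene's date range."""
        return self.start <= day <= self.end


# Made-up sample data, purely for illustration
scenes = [
    Scene("Setting out", "The journey begins.",
          date(2014, 1, 1), date(2014, 1, 3), 51.5, -0.12),
    Scene("Arrival", "The journey ends.",
          date(2014, 1, 5), date(2014, 1, 6), 48.85, 2.35),
]


def scenes_on(day: date, scenes: List[Scene]) -> List[str]:
    """Titles of the scenes that are 'live' on a given day."""
    return [s.title for s in scenes if s.covers(day)]


print(scenes_on(date(2014, 1, 2), scenes))
```

Highlighting an event on the timeline and on the map at the same time then amounts to looking up the same record from two different views.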
Other variants of this theme include the Simile Timemap. There, the map display shows locations relating to only those events that are visible in the calendar.
Storymap.js – developed under the auspices of the Knight Foundation – provides a similar mechanic to Timemapper, although this time lines connecting locations in the serialisation of the story are also displayed.
The world of “data journalism”, in which the Knight Foundation is a key mover, is currently one of the driving areas for the development of data driven storytelling devices (where “device” is meant in the most general sense).
Another application – currently under development, but one to watch out for, I think – is Odyssey.js, from online mapping providers CartoDB.
The “slides or scroll” mechanic is something worth bearing in mind when looking for a way of stepping between scenes in a serialised narrative.
A few weeks ago, I got a tweet from @fantasticlife – BBC R&D hacker Michael Smethurst – asking if I knew how to generate “narrative charts” as popularised by a particular cartoon on the XKCD web comic. Michael actually provided the answer in his request – in the form of this example from Canada – but hadn’t read the source code properly.
The narrative chart – and there are easily discovered examples on the web (Star Wars, Lord of the Rings, you get the picture) – sequences a form of time along the horizontal x-axis and a nominal scale on the vertical y-axis representing different characters. Vertical bars represent scenes. Scenes take place at a certain time (in the storyworld) and location and incorporate particular characters.
Michael was interested in the way this sort of representation might be able to support continuity checking in the development of a new radio drama (I think?), and it’s something I think could be worth exploring in more detail, not just for the representation of dramatic texts, but also in support of investigations, for example, criminal/police investigations, or investigative journalism.
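To make the continuity checking idea a little more concrete, here is a minimal sketch in Python. It assumes each scene is recorded as a story time, a location and a set of characters, and it flags any character who appears in two places at the same story time. The scene data and the function name are hypothetical, and a real check would need a richer notion of time than a single number.

```python
from collections import defaultdict

# Hypothetical scenes: (story_time, location, characters present)
scenes = [
    (1, "Kitchen", {"Ann", "Bob"}),
    (1, "Garden", {"Cat"}),
    (2, "Garden", {"Ann", "Cat"}),
    (2, "Kitchen", {"Ann"}),  # Ann is now in two places at time 2
]


def continuity_clashes(scenes):
    """Return (time, character, locations) for characters in two places at once."""
    seen = defaultdict(set)  # (time, character) -> set of locations
    for when, where, who in scenes:
        for character in who:
            seen[(when, character)].add(where)
    return sorted(
        (when, character, sorted(places))
        for (when, character), places in seen.items()
        if len(places) > 1
    )


clashes = continuity_clashes(scenes)
print(clashes)
```

The same scene records are what a narrative chart would draw: time on the x-axis, one line per character, and a vertical bar wherever characters share a scene.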
Issues we might want to dig into further include how to represent different time scales: for example, telling-time (where a scene happens x minutes through episode 1, or Act 3) or ‘story-time’ (twenty years into the future in scene 1, flashback 100 years in scene 2, and so on). I’m not much of a narratologist – if you can tell me what the “proper” way of describing these forms of time is, I’d much appreciate it!
“Sentence Drawing” is a beautiful little technique – if you like that sort of thing – (originated by data artist Stefanie Posavec?) for serialising the turns taken by speakers in a dramatic text.
[There’s an implementation in R at http://trinkerrstuff.wordpress.com/2013/12/08/sentence-drawing-function-vs-art/ ]
In this case, the colours represent the family origins of the speakers in Romeo and Juliet. Turns in the line represent new sentences (though I would like to see them represent changes in speaker, with line length relative to line(s) length…). As it is, the length of the line indicates the number of words in each sentence.
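The mechanics described here can be sketched in a few lines of Python: one straight segment per sentence, with a length equal to the number of words and a turn at each new sentence. This is only a sketch under stated assumptions, not the R implementation linked above: the turn is assumed to be 90 degrees to the right, sentences are split naively on full stops, exclamation marks and question marks, and the function name is made up.

```python
import re


def sentence_path(text):
    """Return the (x, y) points visited when 'drawing' the text, starting at the origin."""
    # Naive sentence splitting (an assumption); real texts need more careful handling
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Headings in order east, south, west, north: each step is a right turn
    headings = [(1, 0), (0, -1), (-1, 0), (0, 1)]
    x, y = 0, 0
    points = [(x, y)]
    for i, sentence in enumerate(sentences):
        dx, dy = headings[i % 4]
        length = len(sentence.split())  # segment length = number of words
        x, y = x + dx * length, y + dy * length
        points.append((x, y))
    return points


path = sentence_path(
    "But soft! What light through yonder window breaks? "
    "It is the east, and Juliet is the sun."
)
print(path)
```

Colouring each segment by speaker, or by the speaker's family, would then be a matter of carrying one extra label alongside each point.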
Sentence drawing represents a macroscopic view over a text.
Whereas microscopes allow you to look at the very small, macroscopes allow you to look at the whole in a single, glanceable view.
Notebook computing is my great hope for the future. Notebook computing, like spreadsheet computing, democratises both access to and the process of practically based, task-oriented computing.
Spreadsheets help you get stuff done, even if you don’t consider yourself to be a programmer. My hope is that the notebook metaphor – and it’s actually quite an old one – can similarly encourage people who don’t consider themselves programmers to do and to use programmy things.
Notebook computing buys us in to two ways of thinking that I think are useful from a pedagogical perspective – that is, pedagogy not just as a way of teaching but also as a way of learning in the sense of learning about something through investigating it.
Here, I’m thinking of an investigation as a form of problem based learning – I’m not up enough on educational or learning theory to know whether there is a body of theory, or even just a school of thought, about “investigative learning”.
These two ways of thinking are literate programming and reproducible research.
In case you haven’t already realised it, code is an expressive medium. Code has its poets, and artists, as well as its architects, engineers and technicians. One of the grand masters of code is Don – Donald – Knuth.
Don Knuth said “A literate programmer is an essayist who writes programs for humans to understand” as part of a longer quote. Here’s that longer quote:
“Literate programming is a programming methodology that combines a programming language with a documentation language, making programs more robust, more portable, and more easily maintained than programs written only in a high-level language.
“Computer programmers already know both kinds of languages; they need only learn a few conventions about alternating between languages to create programs that are works of literature. A literate programmer is an essayist who writes programs for humans to understand, instead of primarily writing instructions for machines to follow. When programs are written in the recommended style they can be transformed into documents by a document compiler and into efficient code by an algebraic compiler.”
Notebooks are environments that encourage the writing of literate code. Notebooks encourage you to write prose and illustrate it with code – and the outputs associated with executing that code.
In many cases, the code may already exist. The programming is then more a case of applying an existing bit of code to a new bit of data.
That is what you do in a spreadsheet. Oftentimes the code is hidden – or automatically generated – by a menu option selected via a graphical user interface. But there is no magic going on (at least, no more magic than is associated with the ability to take electronic representations of text and do something to them that makes them responsible for what appears on a screen, keeps planes flying, and seemingly creates and destroys money on the fly in the world’s financial systems).
Code is an incantation – and when you select a menu option in your spreadsheet you are asking the computer to perform that incantation and execute some code. You can also copy and paste code and then run it and it will have the same effect as selecting that operation from a menu. That’s how it works.
In literate programming, you can see a human description of what you want to achieve by executing the code, then the code, then the result of executing the code, then an interpretation of the result. Introduction. Method. Results. Conclusion. You know this four part structure, particularly if you’ve ever taught – or been taught – how to write a formal practical report.
But you can apply it at an atomic level too. At the level of a particular event. Like a particular scene in a narrative chart, or a particular geotemporal location in a time map.
The other idea that the notebooks buy us into is reproducible research. I love this idea and think you should too. It lets archiving make sense.
Do I really have to say any more than just show that quote?
Now you may say that that’s all very well for, I don’t know, physics or biology, or science, or economics. Or social science in general, where they do all sorts of inexplicable things with statistics and probably should try to keep track of what they are doing.
But not the humanities.
But that’s not quite right, because in the digital humanities there are computational tools that you can use. Particularly in the areas of text analysis and visualisation. Such as some of the visualisations we saw in the first part of this presentation.
But you need a tool that democratises access to this technology. You need an environment like the one the social scientists found in the form of a spreadsheet.
But better.
One that helps you keep track of what you did and that produces a serialisation that can be read back in a linear way that makes sense.
Even if you don’t create it in a linear way.
Even if you did that bit before this bit, but the way you tell it is as this bit before that bit.
Which is one reason why postgrads get the fear that their experiment is going wrong. (Don’t panic! Those published papers you read? The work as described never took place the way it was described. The write-up is a post hoc rationalisation of the bits that worked, retold in such a way that it makes it look as if it was planned that way all along.)
And here’s another dirty secret – most of the published reports you read that write up one experiment or another are not replicable from that report.
(I also like to think of notebooks as a place where I can have a conversation with data.)
So how do notebooks help?
The tool I want to describe is – are – called IPython Notebooks.
IPython Notebooks let you execute code written in the Python programming language in an interactive way. But they also work with other languages – Javascript, Ruby, R, and so on, as well as other applications. I use a notebook for drawing diagrams using Graphviz, for example.
They also include words – of introduction, of analysis, of conclusion, of reflection.
And they also include the things the code wants to tell us, or that the data wants to tell us via the code. The code outputs.
(Or more correctly, the code+data outputs.)
The first thing notebooks let you do is write text for the non-coding reader. Words. In English. (Or Spanish. Or French. I would say Chinese, but I haven’t checked what character sets are supported, so I can’t say that for definite until I check!)
“Literate programming is a programming methodology that combines a programming language with a documentation language”. That’s what Knuth said. But we can take it further. Past code. Past documentation. To write up. To story.
The medium in which we can write our human words is a simple text markup language called markdown.
If you’ve ever written HTML, it’s not that hard.
If you’ve ever written an email and wrapped asterisks around a word or phrase to emphasise it, or written a list of items down by putting each new item onto a new line and preceding it with a dash, it’s that easy.
Here’s a notebook, and here’s some text.
There’s also some code.
But note the text – we have a header, and then some “human text”.
You might also notice some up and down arrows in the notebook toolbar. These allow us to rearrange the order of the cells in the notebook in a straightforward way.
In a sense, we are encouraged to rearrange the sequence of cells into an order that makes more sense as a narrative for the reader of the document, or in the execution of an investigation.
The upside of this is that we can author a document in a ‘non-linear’ way and then linearise it for final distribution simply by reordering the order in which the cells are presented.
There are constraints though – if a cell computationally depends on the result of, or state change resulting from, the execution of a prior cell, their relative ordering cannot be changed.
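A small sketch may make the constraint clearer. It assumes two hypothetical cells, held here as strings and run with Python's built-in exec in a fresh namespace, which is only a stand-in for what a notebook does when it runs cells against a shared session. The second cell uses a name that the first cell defines, so swapping the order breaks it.

```python
# Two hypothetical notebook cells, stored as source strings for illustration
cells = {
    "load": "words = ['story', 'map', 'notebook']",
    "measure": "lengths = [len(w) for w in words]",  # depends on 'load'
}


def run(order):
    """Run the named cells in the given order against one shared, initially empty namespace."""
    namespace = {}
    try:
        for name in order:
            exec(cells[name], namespace)
    except NameError as err:
        return f"failed: {err}"
    return namespace["lengths"]


in_order = run(["load", "measure"])      # works: 'words' exists when it is needed
out_of_order = run(["measure", "load"])  # fails: 'words' is used before it is defined
print(in_order)
print(out_of_order)
```

Cells that do not share state in this way, such as two independent text cells, can be reordered freely.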
As well as human readable text cells – markdown cells or header cells at a variety of levels – there are also code cells.
Code cells allow you to write (or copy and paste in) code and then run it.
Applications give you menu options that in the background copy, paste and execute the code you want to run, or apply to some particular set of data, or text.
Code cells work the same way, but they’re naked. They show you the code.
At this point it’s important to remember that code can call code.
Thousands of lines of code that do really clever and difficult things can be called from a single line of code. Often code with a sensible function name just like a sensible menu item label. A self-describing name that calls the masses of really clever code that someone else has written behind the scenes.
But you know which code because you just called it. Explicitly.
Let’s see an example – not a brilliant example, but an example nonetheless.
Here’s some code.
It’s actually two code cells – in one, I define a function. In the second, I call it.
(Already this is revisionist. I developed the function by not wrapping it in a function. It was just a series of lines of code that I wrote to perform a particular task.
But it was a useful task. So I wrapped the lines of code in a function, and now I can call those lines of code just by calling the function name.
I can also hide the function in another file, outside of the notebook, then just include it in any notebook I want to…
…or within a notebook, I could just copy a set of lines of code and repeatedly paste them into the notebook, applying them to a different set of data each time… but that just gets messy, and that’s what being able to call a bunch of lines of code wrapped up in a function call avoids.)
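The progression from loose lines of code to a named function might look something like the following. This is an illustrative sketch only, not the code shown on the slide: the word counting task, the sample text and the function name are all assumptions.

```python
# First pass: a few lines written directly into a cell to do one task
text = "the quick brown fox jumps over the lazy dog the end"
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1


# Second pass: the same lines wrapped in a function so they can be reused
def word_counts(text):
    """Count how often each whitespace-separated word appears in text."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


# A separate cell: apply the function to a different bit of data
result = word_counts("to be or not to be")
print(result)
```

Once it is wrapped up like this, the function definition could equally live in a file of its own and be imported into whichever notebook needs it.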
As far as reproducible research goes, the ability of a notebook to execute a code element and display the output from executing that code means that there is a one-to-one binding between a code fragment and the data on which it operates and the output obtained from executing just that code on just that data.
The output of the code is not a human copied and pasted artefact.
The output of the code – in this case, the result of executing a particular function – is only and exactly the output from executing that function on a specified dataset.
The output of a code cell is not limited to the arcane outputs of a computational function.
We can display data table results as data tables.
We can also generate rich HTML outputs – in this case an interactive map overlaid with markers corresponding to locations specified in a dataset, and with lines connecting markers as defined by connections described in the original dataset.
We can also delete the outputs of all the code cells, and then rerun the code, one step – one cell – after the other. Reproducing results becomes simply a matter of rerunning the code in the notebook against the data loaded in by the notebook – and then comparing the code cell outputs to the code cell outputs of the original document.
Tools are also under development that help spot differences between those outputs, at least in cases where the outputs are text based.
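For text outputs, the comparison step can be sketched with nothing more than Python's standard difflib module. This is not one of the tools referred to above, just a minimal illustration of the idea, and the two output strings are made-up values.

```python
import difflib


def diff_outputs(original, rerun):
    """Return unified diff lines between two text outputs (an empty list if identical)."""
    return list(difflib.unified_diff(
        original.splitlines(),
        rerun.splitlines(),
        fromfile="original",
        tofile="rerun",
        lineterm="",
    ))


# Made-up cell outputs: one saved in the original notebook, one from a rerun
original_output = "rows: 120\nmean: 4.2"
rerun_output = "rows: 120\nmean: 4.3"

changes = diff_outputs(original_output, rerun_output)
for line in changes:
    print(line)
```

An empty list means the rerun reproduced the original text output exactly; anything else points at the lines that changed.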
To summarise, technologies such as story maps and computational notebooks encourage you to create a story – or analysis – one frame at a time, one cell at a time.
But that is not to say that the result of that construction need necessarily be presented in the same linear order.
Story maps powered by data construct timelines based on timestamps, and may generate connecting lines between locations based on data that either explicitly maps from one location to another (from and to column cells in the same row of a dataset) or that implies a step from one location to another (such as moving from a location in one row to the location specified in the next row).
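The two ways of deriving connecting lines can be sketched as follows, assuming each row of the dataset has been read in as a Python dictionary. The column names "from", "to" and "location" and the function names are hypothetical, chosen only for the example.

```python
def explicit_edges(rows):
    """Each row names both ends of a connection in its own 'from' and 'to' cells."""
    return [(row["from"], row["to"]) for row in rows]


def implied_edges(rows):
    """Each row names one location; a step is implied from each row to the next."""
    places = [row["location"] for row in rows]
    return list(zip(places, places[1:]))


# Made-up rows for illustration
explicit = explicit_edges([
    {"from": "London", "to": "Paris"},
    {"from": "Paris", "to": "Rome"},
])
implied = implied_edges([
    {"location": "London"},
    {"location": "Paris"},
    {"location": "Rome"},
])
print(explicit)
print(implied)
```

Both forms give the same list of connections here; the difference is whether the story order is carried in the data itself or only in the order of the rows.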
As with all networks constructed from a set of independently stated connections, sometimes the gross level structure and patterns only become evident when you look at everything all at the same time.
As well as being constructed one step at a time, can stories also be read one step at a time?
And if so, how is that sequencing managed? Is the reader led down a single path?
Are there decision points where they can change the direction of the story?
Is it obvious even where the starting point of the story reading is, and when the end has been reached?
If your notebook – or story – was constructed in a conversation-like way, does it read back well as one?
To learn more about working with data, as well as finding and telling stories in data, visit the School of Data website at SchoolOfData.org.
The website includes a regularly updated blog featuring news, events and stories from the world of data, as well as a growing body of openly licensed free courses and tutorials on working with data.
The School of Data also runs an active fellowship programme for practitioners who regularly work with open data. Visit SchoolOfData.org to learn more.