How we put the BBC World Service radio archive online using machines and crowdsourcing. A talk given to the UK Museums on the Web conference, November 2013.
One of the major challenges of a big digitisation project is that you simply swap out an under-used physical archive for its digital equivalent. Without easy ways to navigate the data there's no way for your users to get to the bits they want. We recently worked with the BBC World Service to generate metadata for their radio archive: 50,000 programmes from over 45 years. We first used algorithms to generate "good enough" topics to put the archive online, and then used crowdsourcing to improve the data.
Throughout 2013 we have been running an experiment to crowdsource improvements to the metadata that we automatically created. At http://worldservice.prototyping.bbc.co.uk people can search and browse for programmes, listen to them, correct topics and add new ones.
This talk describes how we went about this and what we've learnt with this massive online multimedia archive - about understanding audio, automatically generating topics and crowdsourcing improvements to the data.
Archives, algorithms and people
1. Archives, algorithms
and people
or
How we put the BBC World Service radio archive online
using machines and crowdsourcing
Tristan Ferne / @tristanf
Executive Producer
BBC Research & Development
21. How good is the data?
Tags are a large and sparse space
When is a tag correct?
When is a programme tagged completely?
How do you measure crowd-sourced data?
22. Who does the work?
10% of people = 98% of edits
10 people = 70% of edits
1 person = 30% of edits
I'm Tristan, from BBC Research & Development. I'm not sure I should be here - I'm not from a museum.
We do R&D for the BBC and the media industry, which amongst other things includes some work with the BBC archive.
I'm going to talk about an archive, algorithms and people. Our team had a challenge - a big radio archive, sparsely described, to put online. We're an R&D department, we like challenges - we even had some solutions looking for problems! The BBC typically puts archives online by editorially curating them, so only a subset is exposed. We don't dump huge collections online. But we thought we would. Our aim was to put *all of it* online as efficiently as possible. So we built a prototype and over the past year we've run an experiment.
We had an opportunity to work with the archive of the World Service English language radio service
They'd digitised their archive as they were having to move out of their historic home at Bush House, into the new Broadcasting House in central London
The archive contained about 70k radio programmes from over 45 years. Not everything: there are no live news bulletins (they weren't recorded), and it's just the English language service.
The graph shows how the archive is distributed in time. The spike starts in the 90s, when we started to use digital technologies to record things - and stopped recording over old tapes.
The digitisation process created very high quality digital audio of all the programmes. But artefacts of that process (and indeed of earlier archiving) meant that the metadata describing the programmes was sparse - often missing fields or containing incorrect data.
And if there's no data describing the programmes then no-one will be able to find them. It's a danger with a big digitisation project - that you simply swap out an under-used physical archive for its digital equivalent.
Without easy ways to navigate the archive there's no way for your users to get to the bits they want. And to navigate the archive you need data
So that was our challenge
We wanted to demonstrate how to create the data needed to put a massive media archive online using algorithms, linked data and crowdsourcing. And this is how we did it
We needed to generate data primarily from what we did have - the digital audio from the radio programmes
We used CMU Sphinx, an open-source speech recognition toolkit, to listen to every programme and convert it to text
Speech recognition can be very good, particularly when trained on a single speaker. But on these radio programmes - with varying recording quality, many speakers, and accents from around the world - it really struggled for accuracy, and we ended up with lots of pretty noisy transcripts. But we didn't need accurate transcripts, just some good metadata.
Our team developed algorithms that could reliably extract tags or keywords from these noisy transcripts
We use Linked Data to provide unique tags (e.g. to disambiguate Paris, France from Paris, Texas), to help this topic extraction, to relate tags to one another, and ultimately to link to elsewhere on the web
We actually use DBpedia, a structured data version of Wikipedia, as our reference. So every tag in our archive is linked to a Wikipedia page
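To make the pipeline concrete, here is a much-simplified sketch of the idea: pull candidate keywords out of a noisy transcript, then map each one to a DBpedia resource URI. The real topic-extraction algorithms were far more sophisticated (and did proper disambiguation); the stopword list, the naive title-casing, and the example transcript below are all illustrative assumptions.

```python
# Illustrative sketch only - a much simpler keyword extractor than the
# production system, run over a (hypothetical) noisy transcript.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "on", "for", "was", "with", "this", "we"}

def extract_tags(transcript, top_n=5):
    """Return the most frequent non-stopword terms as candidate tags."""
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [word for word, _ in counts.most_common(top_n)]

def to_dbpedia_uri(tag):
    """Map a tag to a DBpedia resource URI. Naive title-casing here;
    the real system also disambiguates (Paris, France vs Paris, Texas)."""
    return "http://dbpedia.org/resource/" + tag.title().replace(" ", "_")

transcript = ("iceland iceland volcano eruption ash cloud over europe "
              "volcano ash grounded flights across europe")
tags = extract_tags(transcript, top_n=3)
uris = [to_dbpedia_uri(t) for t in tags]
```

Even with a transcript this rough, the repeated content words surface as usable tags - which is the point made above: the transcripts didn't need to be accurate, just good enough to extract metadata from.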
For every programme, even if there was no metadata to start with, we generated 10-20 tags
We had a lot of data to process - about 26k hours of audio. Doing this audio transcription and topic extraction would have taken 36k hours on a single computer. But using the cloud we could do it all in parallel, and we processed everything in 2 weeks at a cost of around $3k
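A quick back-of-envelope check on those figures (the exact instance counts and hourly prices aren't given in the talk, so this is just the implied arithmetic):

```python
# Back-of-envelope check on the processing figures quoted above.
cpu_hours = 36_000          # total transcription + topic-extraction work
wall_clock_hours = 14 * 24  # "about 2 weeks" of elapsed time
total_cost = 3_000          # approximate cloud bill in dollars

workers_needed = cpu_hours / wall_clock_hours   # machines running in parallel
cost_per_cpu_hour = total_cost / cpu_hours      # implied price per CPU-hour

print(round(workers_needed))        # ≈ 107 machines
print(round(cost_per_cpu_hour, 3))  # ≈ $0.083 per CPU-hour
```

So the quoted numbers imply roughly a hundred machines running in parallel at under a dime per CPU-hour - which is why the cloud made a job of this size feasible for an R&D team.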
The automatically created tags weren't always correct and we couldn't go through them all to check
Our hypothesis was that they were good enough to bootstrap an online archive for people to use and listen
And then we could ask those people to help correct and add to these tags - to crowdsource the problem
VIDEO DEMO
This is the prototype we built featuring the archive. You can search and browse for programmes, listen to them, correct the topics and tag them with new ones
The homepage shows some featured programmes, or you can search - here, for "Iceland"
Filter by decade
Listen to the programme (extract plays)
Worth noting that the original metadata wasn’t created for public consumption so it can be pretty perfunctory
These are the automatic topic tags
you can vote them up if they're correct, or down if they're wrong
And you can add a new tag, corresponding to a wikipedia page
It's registration only, but easy to sign up at the URL shown
Homepage
A note about images: we didn’t have any to start with. But people expect images on the web, so we use the tags to find images from Ookaboo, a repository of CC images, and users can choose alternate ones
Users can also directly edit the programme title and synopsis to correct spelling mistakes. It uses a wiki model to track changes, with a small admin interface for us to "clear" them
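A minimal sketch of that wiki-style mechanism - an edit takes effect immediately ("last edit wins") but stays in an admin queue until cleared. The class and field names here are hypothetical, not the production schema:

```python
# Minimal sketch of wiki-style edit tracking with admin moderation.
# Names are illustrative, not the prototype's actual data model.
class EditableField:
    def __init__(self, original):
        # Full revision history: (text, author, approved) entries.
        self.history = [(original, "archive", True)]

    def edit(self, new_text, author):
        """Record a user edit; it shows immediately ('last edit wins')
        but is flagged for the admin queue until cleared."""
        self.history.append((new_text, author, False))

    def current(self):
        return self.history[-1][0]

    def pending(self):
        """Edits awaiting the admin 'clear' action."""
        return [(text, author) for text, author, ok in self.history if not ok]

synopsis = EditableField("Programe about Iceland")   # typo in original metadata
synopsis.edit("Programme about Iceland", "user42")   # a listener fixes it
```

Keeping the whole history means a bad edit can always be rolled back, which is what makes the trusting "last edit wins" approach safe to run with a small admin team.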
Here you can see the list of tags for a programme, some automatically generated, some with user votes, and some added by users
Rather than the wiki model, we use a voting model here - more like reddit
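The voting model can be sketched like this - net up/down scores per tag, with heavily downvoted machine tags surfacing as candidates for removal. The threshold and class names are made up for illustration:

```python
# Sketch of the reddit-style voting model used for tags
# (names and thresholds are illustrative, not the production values).
class TagVotes:
    def __init__(self):
        self.votes = {}  # tag -> net score (up = +1, down = -1)

    def vote(self, tag, up):
        self.votes[tag] = self.votes.get(tag, 0) + (1 if up else -1)

    def suspect_tags(self, threshold=-2):
        """Machine tags voted well below zero are candidates for removal
        (and for feeding back to retrain the extraction algorithms)."""
        return [t for t, score in self.votes.items() if score <= threshold]

prog = TagVotes()
prog.vote("Iceland", up=True)
prog.vote("Iceland", up=True)
prog.vote("Ireland", up=False)   # a mis-recognised tag gets voted down
prog.vote("Ireland", up=False)
```

Voting suits tags better than "last edit wins" because several opinions can accumulate on the same tag instead of overwriting each other.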
As mentioned, every tag has to be a Wikipedia concept
One last bit of the prototype - this is very cool - we also do automatic speaker identification and segmentation on some programmes.
We can recognise distinct voices within, and across, programmes. The only thing we can't do is identify who is speaking.
This shows a magazine programme (From Our Own Correspondent) that the algorithm has divided into its different contributors; users can then label those voices.
And these names then propagate across the archive to wherever else that voice was heard.
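The propagation step can be sketched as follows. We assume the diarisation stage has already grouped segments by voice; the voice IDs, programme IDs and speaker name below are all made up for illustration:

```python
# Sketch of label propagation across speaker clusters. We assume the
# diarisation step has already grouped segments by voice; identifiers
# here are hypothetical.
segments = [
    {"programme": "FOOC-1984-03-01", "voice": "v17", "speaker": None},
    {"programme": "FOOC-1984-03-01", "voice": "v23", "speaker": None},
    {"programme": "FOOC-1991-06-12", "voice": "v17", "speaker": None},
]

def label_voice(segments, voice_id, name):
    """A user names one voice; the name propagates to every segment
    with that voice, in any programme in the archive."""
    for seg in segments:
        if seg["voice"] == voice_id:
            seg["speaker"] = name

# One label on one programme also names the same voice seven years later.
label_voice(segments, "v17", "A. Correspondent")
```

This is what makes the feature powerful: one human label is multiplied by the machine clustering across the whole archive.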
To recap
Started with a media archive with little metadata
Process it with machines in the cloud
Use that data to create an online experience
Get people to use it and improve the data
...and we want to feed back these improvements to the machines to help them learn
For example, we can look for tags that are often voted down and then look for patterns in them.
It's been running for about a year, fairly low-key until recently. It's an experiment, not part of the BBC's "main" site, so we don't get massive traffic
70,000 programmes in the archive: 36k listenable, 34k unlistenable (either because of rights issues or because the actual audio is missing but a record was created for the programme)
1 million machine generated tags
Currently 3000 registered users
71,000 edits (some kind of action from a user - either votes on tags, speaker ID, synopsis edits, image votes)
70,000 tag edits (57k tag votes, 13k new tags)
1000 synopsis edits
21% of listenable programmes have an edit, at 9 edits/prog.
36% of listenable progs have been listened to at least once (30k total listens)
listeners even sent in 4 "lost" programmes that they had recorded off-air
Machine-generated tag quality looks OK
Human-edited tags are good, I'd almost never disagree with them
But this is hard to answer objectively. When is a tag correct? In whose opinion? Even harder: when does a programme have a "complete" set of tags?
It's a large and sparse space of data
We're currently doing some analysis; there doesn't seem to be much prior work on the quality of crowdsourced tags - please shout if you know of any
Also more work to do to analyse what kind of tags are added
We're also interested in whether people listen to the programme before tagging - a bit different from looking at a painting or photograph.
Surprising amount of synopsis editing - spelling, adding comments, adding presenters, one person particularly likes adding episode numbers!
As it turns out, only a few!
1 person (the king of the radio drama community) has astonishingly done 30% of the edits
10 people have done 70% of the edits
Other crowdsourcing studies have shown that typically 10% of users do the majority of the work
10% of our users have done 98% of the work
The internet has a 1% rule - 1% create, 9% modify, 90% just view
About half of our users have done at least one edit
But that doesn't really tell you how much they've done, or how long they stick around.
"Active users" - a term borrowed from Wikipedia - someone who has done some edit action in the last 30 days
Active users currently around 2%
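The metric itself is simple to compute; a sketch under assumed data (the function name, user IDs and dates are hypothetical):

```python
# Sketch of the "active users" metric: anyone with an edit action
# in the last 30 days, as a share of all registered users.
from datetime import datetime, timedelta

def active_user_share(last_edit_by_user, registered_count, now):
    cutoff = now - timedelta(days=30)
    active = sum(1 for ts in last_edit_by_user.values() if ts >= cutoff)
    return active / registered_count

now = datetime(2013, 11, 1)
last_edits = {
    "user1": datetime(2013, 10, 25),  # edited recently -> active
    "user2": datetime(2013, 6, 2),    # lapsed
    "user3": datetime(2013, 10, 30),  # edited recently -> active
}
share = active_user_share(last_edits, registered_count=100, now=now)  # 0.02
```

Tracking only the last edit per user is enough for this metric, which is why it's cheap to report continuously.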
We've noticed particular groups of people using the prototype. It started with a large community of radio drama enthusiasts who were cataloguing all the drama and plays, and more recently Frank Zappa fans found some interviews.
So do you only need 100 people to do this? Without the archive of programmes and the prototype drawing people in we wouldn't have found the "right" 100 who care enough to help
Some pretty pictures we drew using the data, giving an idea of some of the archive contents and activity
User tags, clustered by the programme they're attached to
Users, clustered by what they listened to - "The Lobster"
As we’ve got Wikipedia-mapped tags we can look for programmes about places...
Links from current news events being talked about on the BBC News channel back to programmes in the archive
My favourite of the things we found: from 1957, the last broadcast from the BBC Danish Service
“Entirely in Danish”
Quality of original metadata was mixed
We've significantly improved it with algorithms and crowdsourcing, adding semantic topics to the programmes
Couldn’t have created a decent online archive otherwise, we just didn’t have the data
It was also efficient: the initial research & development cost was less than our estimated cost of having professionals tag everything
And this technology cost is a one-off, and re-usable - it becomes cheaper the more times we use it, and it can be applied to any media with people speaking
Crowdsourcing: we're a bit stuck in the middle of different crowdsourcing approaches
If you know Galaxy Zoo and its sister projects - those are generally designed to be task-focused, with particular targets. This wasn't designed like that
It was more of a browsable archive with crowdsourcing features (maybe closer to wikipedia)
We don't know what's right, but we've managed to create quite a lot of data, would be interesting to compare approaches
And we've tried both wikipedia "last edit wins" approach and reddit voting approach
Some things we still don't know: How good are the tags? As I said, it's difficult to measure objectively
How much volunteer effort do you need? It depends. How big is the archive? How much data do you need per item? How good does the data need to be?
Ultimately, when is your data good enough?
Register for the prototype
Read more on our website and blog
A number of components of the system have been open-sourced on github
In this project we did a lot of work to manage the processing of the audio. We found this so useful that we're turning it into a generic platform, called Comma, for anyone to analyse media and for any computer scientists to run their analysis algorithms.