How we put the BBC World Service radio archive online using machines and crowdsourcing. A talk given to the UK Museums on the Web conference, November 2013.
One of the major challenges of a big digitisation project is that you simply swap out an under-used physical archive for its digital equivalent. Without easy ways to navigate the data there's no way for your users to get to the bits they want. We recently worked with the BBC World Service to generate metadata for their radio archive: 50,000 programmes from over 45 years. We first used algorithms to generate "good enough" topics to put the archive online, and then used crowdsourcing to improve the data.
Throughout 2013 we have been running an experiment to crowdsource improvements to the metadata that we automatically created. At http://worldservice.prototyping.bbc.co.uk people can search and browse for programmes, listen to them, correct topics and add new ones.
This talk describes how we went about this and what we've learnt with this massive online multimedia archive - about understanding audio, automatically generating topics and crowdsourcing improvements to the data.
Archives, algorithms and people
1. Archives, algorithms
and people
or
How we put the BBC World Service radio archive online
using machines and crowdsourcing
Tristan Ferne / @tristanf
Executive Producer
BBC Research & Development
21. How good is the data?
Tags are a large and sparse space
When is a tag correct?
When is a programme tagged completely?
How do you measure crowd-sourced data?
22. Who does the work?
10% of people = 98% of edits
10 people = 70% of edits
1 person = 30% of edits
I'm Tristan, from BBC Research & Development. I'm not sure I should be here - I'm not from a museum.
We do R&D for the BBC and the media industry, which amongst other things includes some work with the BBC archive.
I'm going to talk about an archive, algorithms and people. Our team had a challenge - a big radio archive, sparsely described, to put online. We're an R&D department, we like challenges - we even had some solutions looking for problems! The BBC typically puts archives online by editorially curating them, so only a subset is exposed. We don't dump huge collections online. But we thought we would. Our aim was to put *all of it* online as efficiently as possible. So we built a prototype and over the past year we've run an experiment.
We had an opportunity to work with the archive of the World Service English language radio service
They'd digitised their archive as they were having to move out of their historic home at Bush House, into the new Broadcasting House in central London
The archive contained about 70k radio programmes from over 45 years. Not everything: there are no live news bulletins (they weren't recorded), and it's just the English language service.
The graph shows how the archive is distributed in time. The spike starts in the 90s, when we started to use digital technologies to record things - and stopped recording over old tapes.
The digitisation process created very high quality digital audio of all the programmes. But artefacts of that process (and indeed of earlier archiving) meant that the metadata describing the programmes was sparse - often missing fields or containing incorrect data.
And if there's no data describing the programmes then no-one will be able to find them. It's a danger with a big digitisation project - that you simply swap out an under-used physical archive for its digital equivalent.
Without easy ways to navigate the archive there's no way for your users to get to the bits they want. And to navigate the archive you need data
So that was our challenge
We wanted to demonstrate how to create the data needed to put a massive media archive online using algorithms, linked data and crowdsourcing. And this is how we did it
We needed to generate data primarily from what we did have - the digital audio from the radio programmes
We used CMU Sphinx, an open-source speech recognition toolkit, to listen to every programme and convert it to text
Speech recognition can be very good, particularly when trained on a single speaker. But on these radio programmes - with varying recording quality, many speakers, and accents from around the world - it really struggled for accuracy, and we ended up with lots of pretty noisy transcripts. But we didn't need accurate transcripts, just some good metadata.
Our team developed algorithms that could reliably extract tags or keywords from these noisy transcripts
We use Linked Data to provide unique tags (e.g. to disambiguate Paris, France from Paris, Texas), to help this topic extraction, to relate tags to one another, and ultimately to link to elsewhere on the web
We actually use DBpedia, a structured data version of Wikipedia, as our reference. So every tag in our archive is linked to a Wikipedia page
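To make the pipeline concrete, here is a much-simplified sketch of the idea: pull candidate keywords out of a noisy transcript, then map each one to a DBpedia resource URI. The real topic-extraction algorithms were far more sophisticated (and did proper disambiguation); the stopword list, the naive title-casing, and the example transcript below are all illustrative assumptions.

```python
# Illustrative sketch only - a much simpler keyword extractor than the
# production system, run over a (hypothetical) noisy transcript.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "on", "for", "was", "with", "this", "we"}

def extract_tags(transcript, top_n=5):
    """Return the most frequent non-stopword terms as candidate tags."""
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [word for word, _ in counts.most_common(top_n)]

def to_dbpedia_uri(tag):
    """Map a tag to a DBpedia resource URI. Naive title-casing here;
    the real system also disambiguates (Paris, France vs Paris, Texas)."""
    return "http://dbpedia.org/resource/" + tag.title().replace(" ", "_")

transcript = ("iceland iceland volcano eruption ash cloud over europe "
              "volcano ash grounded flights across europe")
tags = extract_tags(transcript, top_n=3)
uris = [to_dbpedia_uri(t) for t in tags]
```

Even with a transcript this rough, the repeated content words surface as usable tags - which is the point made above: the transcripts didn't need to be accurate, just good enough to extract metadata from.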
For every programme, even if there was no metadata to start with, we generated 10-20 tags
We had a lot of data to process - about 26k hours of audio. Doing this audio transcription and topic extraction would have taken 36k hours on a single computer. But using the cloud we could do it all in parallel, and we processed everything in 2 weeks at a cost of around $3k
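A quick back-of-envelope check on those figures (the exact instance counts and hourly prices aren't given in the talk, so this is just the implied arithmetic):

```python
# Back-of-envelope check on the processing figures quoted above.
cpu_hours = 36_000          # total transcription + topic-extraction work
wall_clock_hours = 14 * 24  # "about 2 weeks" of elapsed time
total_cost = 3_000          # approximate cloud bill in dollars

workers_needed = cpu_hours / wall_clock_hours   # machines running in parallel
cost_per_cpu_hour = total_cost / cpu_hours      # implied price per CPU-hour

print(round(workers_needed))        # ≈ 107 machines
print(round(cost_per_cpu_hour, 3))  # ≈ $0.083 per CPU-hour
```

So the quoted numbers imply roughly a hundred machines running in parallel at under a dime per CPU-hour - which is why the cloud made a job of this size feasible for an R&D team.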
The automatically created tags weren't always correct and we couldn't go through them all to check
Our hypothesis was that they were good enough to bootstrap an online archive for people to use and listen
And then we could ask those people to help correct and add to these tags - to crowdsource the problem
VIDEO DEMO
This is the prototype we built featuring the archive. You can search and browse for programmes, listen to them, correct the topics and tag them with new ones
The homepage shows some featured programmes, or you can search - here, for "Iceland"
Filter by decade
Listen to the programme (extract plays)
Worth noting that the original metadata wasn’t created for public consumption so it can be pretty perfunctory
These are the automatic topic tags
you can vote them up if they're correct, or down if they're wrong
And you can add a new tag, corresponding to a wikipedia page
It's registration only, but easy to sign up at the URL shown
Homepage
A note about images: we didn’t have any to start with. But people expect images on the web, so we use the tags to find images from Ookaboo, a repository of CC images, and users can choose alternate ones
Users can also directly edit the programme title and synopsis to correct spelling mistakes. It uses a wiki model to track changes, with a small admin interface for us to "clear" them
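A minimal sketch of that wiki-style mechanism - an edit takes effect immediately ("last edit wins") but stays in an admin queue until cleared. The class and field names here are hypothetical, not the production schema:

```python
# Minimal sketch of wiki-style edit tracking with admin moderation.
# Names are illustrative, not the prototype's actual data model.
class EditableField:
    def __init__(self, original):
        # Full revision history: (text, author, approved) entries.
        self.history = [(original, "archive", True)]

    def edit(self, new_text, author):
        """Record a user edit; it shows immediately ('last edit wins')
        but is flagged for the admin queue until cleared."""
        self.history.append((new_text, author, False))

    def current(self):
        return self.history[-1][0]

    def pending(self):
        """Edits awaiting the admin 'clear' action."""
        return [(text, author) for text, author, ok in self.history if not ok]

synopsis = EditableField("Programe about Iceland")   # typo in original metadata
synopsis.edit("Programme about Iceland", "user42")   # a listener fixes it
```

Keeping the whole history means a bad edit can always be rolled back, which is what makes the trusting "last edit wins" approach safe to run with a small admin team.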
Here you can see the list of tags for a programme, some automatically generated, some with user votes, and some added by users
Rather than the wiki model, we use a voting model here - more like reddit
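The voting model can be sketched like this - net up/down scores per tag, with heavily downvoted machine tags surfacing as candidates for removal. The threshold and class names are made up for illustration:

```python
# Sketch of the reddit-style voting model used for tags
# (names and thresholds are illustrative, not the production values).
class TagVotes:
    def __init__(self):
        self.votes = {}  # tag -> net score (up = +1, down = -1)

    def vote(self, tag, up):
        self.votes[tag] = self.votes.get(tag, 0) + (1 if up else -1)

    def suspect_tags(self, threshold=-2):
        """Machine tags voted well below zero are candidates for removal
        (and for feeding back to retrain the extraction algorithms)."""
        return [t for t, score in self.votes.items() if score <= threshold]

prog = TagVotes()
prog.vote("Iceland", up=True)
prog.vote("Iceland", up=True)
prog.vote("Ireland", up=False)   # a mis-recognised tag gets voted down
prog.vote("Ireland", up=False)
```

Voting suits tags better than "last edit wins" because several opinions can accumulate on the same tag instead of overwriting each other.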
As mentioned, every tag has to be a Wikipedia concept
One last bit of the prototype - this is very cool - we also do automatic speaker identification and segmentation on some programmes.
We can recognise distinct voices within, and across, programmes. The only thing we can't do is identify who is speaking.
This shows a magazine programme (From Our Own Correspondent) that the algorithm has divided into its different contributors; users can then label those voices.
And these names then propagate across the archive to wherever else that voice was heard.
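The propagation step can be sketched as follows. We assume the diarisation stage has already grouped segments by voice; the voice IDs, programme IDs and speaker name below are all made up for illustration:

```python
# Sketch of label propagation across speaker clusters. We assume the
# diarisation step has already grouped segments by voice; identifiers
# here are hypothetical.
segments = [
    {"programme": "FOOC-1984-03-01", "voice": "v17", "speaker": None},
    {"programme": "FOOC-1984-03-01", "voice": "v23", "speaker": None},
    {"programme": "FOOC-1991-06-12", "voice": "v17", "speaker": None},
]

def label_voice(segments, voice_id, name):
    """A user names one voice; the name propagates to every segment
    with that voice, in any programme in the archive."""
    for seg in segments:
        if seg["voice"] == voice_id:
            seg["speaker"] = name

# One label on one programme also names the same voice seven years later.
label_voice(segments, "v17", "A. Correspondent")
```

This is what makes the feature powerful: one human label is multiplied by the machine clustering across the whole archive.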
To recap
Started with a media archive with little metadata
Process it with machines in the cloud
Use that data to create an online experience
Get people to use it and improve the data
...and we want to feed back these improvements to the machines to help them learn
For example, we can look for tags that are often voted down and then look for patterns in them.
It's been running for about a year, fairly low-key until recently. It's an experiment, not part of the BBC's "main" site, so we don't get massive traffic
70,000 programmes in the archive: 36k listenable, 34k unlistenable (either because of rights issues or because the actual audio is missing but a record was created for the programme)
1 million machine generated tags
Currently 3000 registered users
71,000 edits (some kind of action from a user - either votes on tags, speaker ID, synopsis edits, image votes)
70,000 tag edits (57k tag votes, 13k new tags)
1000 synopsis edits
21% of listenable programmes have an edit, at 9 edits/prog.
36% of listenable progs have been listened to at least once (30k total listens)
listeners even sent in 4 "lost" programmes that they had recorded off-air
Machine-generated tag quality looks OK
Human-edited tags are good, I'd almost never disagree with them
But this is hard to answer objectively. When is a tag correct? In whose opinion? Even harder: when does a programme have a "complete" set of tags?
It's a large and sparse space of data
We're currently doing some analysis; there doesn't seem to be much prior work on the quality of crowdsourced tags - please shout if you know of any
Also more work to do to analyse what kind of tags are added
We're also interested in whether people listen to the programme before tagging - a bit different from looking at a painting or photograph.
Surprising amount of synopsis editing - spelling, adding comments, adding presenters, one person particularly likes adding episode numbers!
As it turns out, only a few!
1 person (the king of the radio drama community) has astonishingly done 30% of the edits
10 people have done 70% of the edits
Other crowdsourcing studies have shown that typically 10% of users do the majority of the work
10% of our users have done 98% of the work
The internet has a 1% rule - 1% create, 9% modify, 90% just view
About half of our users have done at least one edit
But that doesn't really tell you how much they've done, or how long they stick around.
"Active users" - a term borrowed from Wikipedia - someone who has done some edit action in the last 30 days
Active users currently around 2%
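The metric itself is simple to compute; a sketch under assumed data (the function name, user IDs and dates are hypothetical):

```python
# Sketch of the "active users" metric: anyone with an edit action
# in the last 30 days, as a share of all registered users.
from datetime import datetime, timedelta

def active_user_share(last_edit_by_user, registered_count, now):
    cutoff = now - timedelta(days=30)
    active = sum(1 for ts in last_edit_by_user.values() if ts >= cutoff)
    return active / registered_count

now = datetime(2013, 11, 1)
last_edits = {
    "user1": datetime(2013, 10, 25),  # edited recently -> active
    "user2": datetime(2013, 6, 2),    # lapsed
    "user3": datetime(2013, 10, 30),  # edited recently -> active
}
share = active_user_share(last_edits, registered_count=100, now=now)  # 0.02
```

Tracking only the last edit per user is enough for this metric, which is why it's cheap to report continuously.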
We've noticed particular groups of people using the prototype. It started with a large community of radio drama enthusiasts who were cataloguing all the drama and plays, and more recently Frank Zappa fans found some interviews.
So do you only need 100 people to do this? Without the archive of programmes and the prototype drawing people in we wouldn't have found the "right" 100 who care enough to help
Some pretty pictures we drew using the data, giving an idea of some of the archive contents and activity
User tags, clustered by the programme they're attached to
Users, clustered by what they listened to - "The Lobster"
As we’ve got Wikipedia-mapped tags we can look for programmes about places...
Links from current news events being talked about on the BBC News channel back to programmes in the archive
My favourite of the things we found: from 1957, the last broadcast from the BBC Danish Service
“Entirely in Danish”
Quality of original metadata was mixed
We've significantly improved it with algorithms and crowdsourcing, adding semantic topics to the programmes
Couldn’t have created a decent online archive otherwise, we just didn’t have the data
It was also efficient: the initial research & development cost was less than our estimated cost of having professionals tag everything
And this technology cost is a one-off, and re-usable - it becomes cheaper the more times we use it, and it can be applied to any media with people speaking
Crowdsourcing: we're a bit stuck in the middle of different crowdsourcing approaches
If you know Galaxy Zoo and its sister projects - those are generally designed to be task-focused, with particular targets. This wasn't designed like that
It was more of a browsable archive with crowdsourcing features (maybe closer to wikipedia)
We don't know what's right, but we've managed to create quite a lot of data, would be interesting to compare approaches
And we've tried both wikipedia "last edit wins" approach and reddit voting approach
Some things we still don't know: How good are the tags? As I said, it's difficult to measure objectively
How much volunteer effort do you need? It depends. How big is the archive? How much data do you need per item? How good does the data need to be?
Ultimately, when is your data good enough?
Register for the prototype
Read more on our website and blog
A number of components of the system have been open-sourced on github
In this project we did a lot of work to manage the processing of the audio. We found this so useful that we're turning it into a generic platform, called Comma, for anyone to analyse media and for any computer scientists to run their analysis algorithms.