Archives, algorithms and people
or
How we put the BBC World Service radio archive online using machines and crowdsourcing
Tristan Ferne / @tristanf
Executive Producer
BBC Research & Development
The BBC World Service archive
1947-2012
The missing metadata
Missing data

Spelling mistake

Sometimes incorrect data
No semantic data
How it works
Listening machines
Noisy transcripts
Algorithms
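
The speaker notes for this step quote roughly 26k hours of audio, about 36k hours of processing on a single computer, a two-week run in the cloud and a cost of around $3k. A rough sketch of what those figures imply follows. It assumes the work was spread evenly over the two weeks and that each cloud machine was about as fast as the single reference computer. The notes do not state either point, so the derived numbers are estimates only.

```python
# Figures quoted in the speaker notes (all approximate).
AUDIO_HOURS = 26_000            # total audio in the archive
SINGLE_MACHINE_HOURS = 36_000   # transcription + topic extraction on one computer
WALL_CLOCK_DAYS = 14            # "processed it all in 2 weeks"
TOTAL_COST_USD = 3_000          # "a cost of around $3k"


def implied_figures(audio_hours, machine_hours, days, cost_usd):
    """Derive back-of-envelope ratios from the quoted totals."""
    wall_clock_hours = days * 24
    return {
        # How many compute hours each hour of audio needed.
        "processing_hours_per_audio_hour": machine_hours / audio_hours,
        # Average number of machines busy at once (assumes an even spread).
        "average_machines_in_parallel": machine_hours / wall_clock_hours,
        # Unit costs implied by the totals.
        "cost_per_machine_hour_usd": cost_usd / machine_hours,
        "cost_per_audio_hour_usd": cost_usd / audio_hours,
    }


figures = implied_figures(
    AUDIO_HOURS, SINGLE_MACHINE_HOURS, WALL_CLOCK_DAYS, TOTAL_COST_USD
)
for name, value in figures.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the run averaged a little over a hundred machines in parallel, at roughly 12 cents per hour of audio.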
Algorithms and people
The prototype
worldservice.prototyping.bbc.co.uk
Show Synopsis editing
version
worldservice.prototyping.bbc.co.uk
Machine learning
Results
How much data?
70000 programmes
36000 listenable programmes
1m machine tags
3000 users
71000 edits
21% of programmes tagged
36% of programmes listened to
70000 tag edits
1000 synopsis edits
And four lost programmes
How good is the data?
Tags are a large and sparse space
When is a tag correct?
When is a programme tagged completely?
How do you measure crowd-sourced data?
Who does the work?
10% of people = 98% of edits

10 people = 70% of edits

1 person = 30% of edits
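
Figures like these can be computed from a log of who made each edit. The following is a minimal sketch, not the project's own analysis code: the function name `top_share` and the sample user names are hypothetical, and the tiny edit log is made up purely to show the calculation.

```python
from collections import Counter


def top_share(edits_per_user, top_n):
    """Return the fraction of all edits made by the top_n most active users."""
    counts = sorted(edits_per_user.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:top_n]) / total


# Hypothetical edit log: one entry per edit, naming the user who made it.
edit_log = ["alice"] * 6 + ["bob"] * 3 + ["carol"]
edits_per_user = Counter(edit_log)

share_top1 = top_share(edits_per_user, 1)  # share of the single most active user
share_top2 = top_share(edits_per_user, 2)  # share of the two most active users
print(f"top 1 user: {share_top1:.0%}, top 2 users: {share_top2:.0%}")
```

Sorting the per-user counts once and summing a prefix gives the "N people = X% of edits" style of statistic for any N.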
The shape of the archive
Places mentioned
Linking from the News
The Last Danish Christmas Broadcast

“Entirely in Danish”
What we’ve learnt
We can significantly improve the data
It’s cost-effective with re-usable technology
A crowdsourcing approach
Open questions
How good are the machine tags?
How much crowdsourcing do you need?
When is your data good enough?
worldservice.prototyping.bbc.co.uk
www.bbc.co.uk/rd
github.com/bbrd
tristan.ferne@bbc.co.uk
@tristanf


Editor's Notes

  1. I'm Tristan, from BBC Research & Development. I'm not sure I should be here, as I'm not from a museum. We do R&D for the BBC and the media industry which, amongst other things, includes some work with the BBC archive. I'm going to talk about an archive, algorithms and people. Our team had a challenge: a big radio archive, sparsely described, to put online. We're an R&D department and we like challenges; we even had some solutions looking for problems! The BBC typically puts archives online by editorially curating them, so only a subset is exposed. We don't dump huge collections online. But we thought we would. Our aim was to put *all of it* online as efficiently as possible. So we built a prototype and over the past year we've run an experiment.
  2. We had an opportunity to work with the archive of the World Service English-language radio service. They'd digitised their archive as they were having to move out of their historic home at Bush House into the new Broadcasting House in central London. The archive contained about 70k radio programmes from over 45 years. It isn't everything: there are no live news bulletins, as they weren't recorded, and it covers just the English-language service.
  3. The graph shows how the archive is distributed in time. The spike starts in the 90s, when we started to use digital technologies to record things and stopped recording over old tapes.
  4. The digitisation process created very high quality digital audio of all the programmes. But artefacts of that process (and indeed of earlier archiving) meant that the metadata describing the programmes was sparse, often with missing fields or incorrect data. And if there's no data describing the programmes then no-one will be able to find them. It's a danger with a big digitisation project: that you simply swap out an under-used physical archive for its digital equivalent. Without easy ways to navigate the archive there's no way for your users to get to the bits they want. And to navigate the archive you need data.
  5. So that was our challenge. We wanted to demonstrate how to create the data needed to put a massive media archive online using algorithms, linked data and crowdsourcing. And this is how we did it.
  6. We needed to generate data primarily from what we did have: the digital audio from the radio programmes. We used CMU Sphinx, an open-source speech recognition toolkit, to listen to every programme and convert it to text.
  7. Speech recognition can be very good, particularly when trained on a single speaker. But on these radio programmes, with varying recording qualities, many speakers and accents from around the world, it really struggled for accuracy, and we ended up with lots of pretty noisy transcripts. But we didn't need accurate transcripts, just some good metadata.
  8. Our team developed algorithms that could reliably extract tags or keywords from these noisy transcripts. We use Linked Data to provide unique tags (e.g. to disambiguate Paris, France from Paris, Texas), to help this topic extraction, to relate tags to one another, and ultimately to link to elsewhere on the web. We actually use DBpedia, a data version of Wikipedia, as our reference, so every tag in our archive is linked to a Wikipedia page. For every programme, even if there was no metadata to start with, we generated 10-20 tags. We had a lot of data to process: about 26k hours of audio. Doing this audio transcription and topic extraction would have taken 36k hours on a single computer. But using the cloud we could do it all in parallel, and we processed everything in 2 weeks at a cost of around $3k.
  9. The automatically created tags weren't always correct and we couldn't go through them all to check. Our hypothesis was that they were good enough to bootstrap an online archive for people to use and listen to. And then we could ask those people to help correct and add to these tags: to crowdsource the problem.
  10. VIDEO DEMO. This is the prototype we built featuring the archive. You can search and browse for programmes, listen to them, correct the topics and tag them with new topics. The homepage shows some featured programmes, or you can search (for example, Iceland). Filter by decade. Listen to the programme (extract plays). Worth noting that the original metadata wasn't created for public consumption, so it can be pretty perfunctory. These are the automatic topic tags: you can vote them up if they're correct, or down if they're wrong. And you can add a new tag, corresponding to a Wikipedia page. It's registration only, but easy to sign up at the URL shown.
  11. Homepage
  12. (Worth noting that the original metadata wasn't created for public consumption, so it can be pretty perfunctory.) A note about images: we didn't have any to start with. But people expect images on the web, so we use the tags to find images from Ookaboo, a repository of CC images, and users can choose alternate ones.
  13. Users can also directly edit the programme title and synopsis to correct spelling mistakes. It uses a wiki model to track changes and a small admin interface for us to "clear" them.
  14. Here you can see the list of tags for a programme: some automatically generated, some with user votes, and some added by users. Rather than the wiki model we use a voting model here, more like Reddit. As mentioned, every tag has to be a Wikipedia concept.
  15. One last bit of the prototype, and this is very cool: we are also doing speaker identification and segmentation automatically on some programmes. We can recognise distinct voices within, and across, programmes. The only thing we can't do is identify who it is that is speaking. This shows a magazine programme (From Our Own Correspondent) that the algorithm has divided into the different contributors; users can then label those voices. And these names then propagate across the archive to wherever else that voice was heard.
  16. To recap: we started with a media archive with little metadata. Process it with machines in the cloud. Use that data to create an online experience. Get people to use it and improve the data. ...and we want to feed these improvements back to the machines to help them learn. For example, we can look for tags that are often voted down and then look for patterns in them.
  17. It's been running for about a year, fairly low key until recently. It's an experiment, not part of the BBC's "main" site, so we don't get massive traffic.
  18. 70,000 programmes in the archive. 36k listenable programmes, 34k unlistenable (either because of rights issues or because the actual audio is missing but a record was created for the programme). 1 million machine-generated tags. Currently 3,000 registered users. 71,000 edits (some kind of action from a user: votes on tags, speaker ID, synopsis edits, image votes). 70,000 tag edits (57k tag votes, 13k new tags). 1,000 synopsis edits. 21% of listenable programmes have an edit, at 9 edits per programme. 36% of listenable programmes have been listened to at least once (30k total listens).
  19. Listeners even sent in 4 "lost" programmes that they had recorded off-air.
  20. Machine-generated tag quality looks OK. Human-edited tags are good; I'd almost never disagree with them. But this is hard to answer objectively. When is a tag correct? In whose opinion? Even harder: when does a programme have a “complete” set of tags? It's a large and sparse space of data. We're currently doing some analysis; there doesn't seem to be much prior work on the quality of crowdsourced tags, so please shout if you know of any. There's also more work to do to analyse what kind of tags are added. We're also interested in whether people listen to the programme before tagging. It's a bit different to looking at a painting or photograph. There's a surprising amount of synopsis editing: spelling, adding comments, adding presenters, and one person particularly likes adding episode numbers!
  21. As it turns out, only a few! 1 person (king of the radio drama community) has astonishingly done 30% of the edits. 10 people have done 70% of the edits. Other crowdsourcing studies have shown that typically 10% of users do the majority of the work. 10% of our users have done 98% of the work. The internet has a 1% rule: 1% create, 9% modify, 90% just view. About half of our users have done at least one edit. But that doesn't really tell you how much they've done, or how long they stick around. “Active users” is a term borrowed from Wikipedia: someone who has done some edit action in the last 30 days. Active users are currently around 2%. We've noticed particular groups of people using the prototype. It started with a large community of radio drama enthusiasts who were cataloguing all the drama and plays. And more recently Frank Zappa fans found some interviews. So do you only need 100 people to do this? Without the archive of programmes and the prototype drawing people in we wouldn't have found the "right" 100 who care enough to help.
  22. Some pretty pictures we drew using the data, giving an idea of some of the archive contents and activity
  23. User tags, clustered by the programme they're attached to
  24. Users, clustered by what they listened to - "The Lobster"
  25. As we’ve got Wikipedia-mapped tags we can look for programmes about places...
  26. Links from current news events being talked about on the BBC News channel back to programmes in the archive.
  27. My favourite of the things we found. From 1957: the last broadcast from the BBC Danish Service, “Entirely in Danish”.
  28. Quality of the original metadata was mixed. We've significantly improved it with algorithms and crowdsourcing, adding semantic topics to the programmes. We couldn't have created a decent online archive otherwise; we just didn't have the data. It's also efficient: the initial research & development cost was less than our estimated cost of professionals tagging everything. And this tech cost is a one-off, and re-usable; it obviously becomes cheaper the more times we use the tech, and it can be used for any media with people speaking. Crowdsourcing: we're a bit stuck in the middle of different crowdsourcing approaches. If you know Galaxy Zoo and its projects, these are generally designed to be task-focused with particular targets. This wasn't designed like that. It was more of a browsable archive with crowdsourcing features (maybe closer to Wikipedia). We don't know what's right, but we've managed to create quite a lot of data, and it would be interesting to compare approaches. And we've tried both the Wikipedia "last edit wins" approach and the Reddit voting approach.
  29. Some things we still don't know. How good are the tags? Like I said, it's difficult to measure objectively. How much volunteer effort do you need? It depends. How big is the archive? How much data do you need per item? How good does the data need to be? Ultimately, when is your data good enough?
  30. Register for the prototype. Read more on our website and blog. A number of components of the system have been open-sourced on GitHub. In this project we did a lot of work to manage the processing of the audio. We found this so useful that we're turning it into a generic platform, called Comma, for anyone to analyse media, and for any computer scientists to run their analysis algorithms.
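
Notes 14 and 16 describe a voting model for tags and the idea of looking for tags that are often voted down. The sketch below shows one way that could look. It is an illustration only, not the prototype's code: the function names, the `(programme, tag, vote)` record shape, the `min_programmes` threshold and the sample data are all assumptions.

```python
from collections import defaultdict


def tag_scores(votes):
    """Sum the +1/-1 votes for each (programme, tag) pair."""
    scores = defaultdict(int)
    for programme_id, tag, vote in votes:
        scores[(programme_id, tag)] += vote
    return dict(scores)


def often_rejected_tags(scores, min_programmes=2):
    """Return tags with a negative net score on at least min_programmes programmes."""
    rejected_on = defaultdict(int)
    for (programme_id, tag), score in scores.items():
        if score < 0:
            rejected_on[tag] += 1
    return sorted(tag for tag, n in rejected_on.items() if n >= min_programmes)


# Hypothetical votes: +1 means "this tag is correct", -1 means "this tag is wrong".
votes = [
    ("p1", "Paris", +1),
    ("p1", "Paris,_Texas", -1),
    ("p2", "Paris,_Texas", -1),
    ("p2", "Iceland", +1),
    ("p2", "Paris,_Texas", -1),
    ("p3", "Iceland", -1),
]

scores = tag_scores(votes)
suspect = often_rejected_tags(scores)
print(suspect)
```

Counting rejections per programme, not per vote, keeps one heavily voted programme from dominating the list. Tags flagged this way are candidates for the pattern analysis the notes mention.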