Fraunhofer IAIS Audio Mining:
Automatic meta data generation of audio streams
FIAT/IFTA Media Management Seminar, Lugano 2017
Dr. Joachim Köhler
Head of Department NetMedia
Fraunhofer Institute for Intelligent Analysis and
Information Systems
© Fraunhofer IAIS
Fraunhofer is the largest organization for applied research
in Europe
 More than 80 research institutions, including
69 Fraunhofer institutes
 More than 24,500 employees, the majority
educated in the natural sciences or engineering
 An annual research volume of 2.1 billion euros,
of which 1.9 billion euros is generated through
contract research
 2/3 of this research revenue derives from contracts with
industry and from publicly financed research projects.
 1/3 is contributed by the German federal government
and the Länder governments in the form of
institutional financing.
 International collaboration through representative
offices in Europe, the US, Asia and the Middle East
© Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
Fraunhofer Institute Centre Schloss Birlinghoven
International Research in Big Data and Cognitive Computing
600 interdisciplinary scientists – 3 Institutes
 Fraunhofer Institute for Applied
Information Technology FIT
 Fraunhofer Institute for Intelligent
Analysis and Information Systems IAIS
 Fraunhofer Institute for Algorithms
and Scientific Computing SCAI
One of the largest research locations for
applied computer science and
mathematics in Germany
Close cooperation with
regional universities
Fraunhofer & digital archiving and broadcasting
 Several Fraunhofer Institutes have contributed to
many seminars of the German VFM on automatic
metadata generation
 Fraunhofer IAIS generated a study on Future
Technologies for media archives & concept for an
innovative archive system: Media Data Hub
 Participation in many European research projects
(LIVE, AXES, CubRIK, LinkedTV, MiCO)
 Workshop with directors of broadcast archives 2012
 Technology portfolio
 Music & Video Analytics (IDMT)
 Audio Mining (IAIS)
 Media Data Hub (IAIS)
 Quality Control and fingerprinting (IDMT)
Activities, portfolio, networking
VFM Technology
workshop 2012
The Future of Media Archives: Strategic & Conceptual
Native crossmedia
• Crossmedia from data model to UI
• Using graph-based data models (e.g. Europeana)
Media Data Hub
• Linking and integration of data silos
• Bringing all metadata sources into one application (archive, legal, )
Massive automation of documentation
• Manual annotation will be reduced, process-optimized
• Future: up to 100% automatic annotation (like in press archives)
Near to production environment
• Search and access immediately after production process
• Interfacing to production systems (OpenMedia, Avid, etc.)
Mining Technologies for Media Archiving
(Report »Archive system of the future – Strategic innovation concept«,
Fraunhofer 2014); Technology readiness level (TRL)
 Text Mining
 Audio Mining
 Video Mining
 Object & face recognition
 Video OCR
 Image Similarity
 Audio- and video fingerprinting
 Recommendation technologies
 Interactive data visualization
 Personalization and contextualization
 Facetted Search
 Linking of information items
Application: Assisted dossier creation
Criterion: Rating
Maturity level: 4-5
Integration and operating effort: 3-5
Added value: 5
Results of the 2nd FIAT/IFTA MAM Survey
© Fraunhofer
Joachim.Koehler@iais.fraunhofer.de
Fraunhofer IAIS Audio Mining
 B2B Speech Recognition Solution for the Media Industry
 Key Facts
 Large Vocabulary Continuous Speech Recognition (1,000,000 words)
optimized for media content
 Automatic structuring of audio-visual content
 Applications along the Media Asset Chain
 Archive: Indexing and transcription of media archive content
 Online: Search functionalities for media portals (e.g. InClip-Search) and
content-based recommendation
 TV-Distribution: Subtitling for TV content
 SocialTV: Second Screen information enrichment
 Advertising/Marketing: Video Search Engine Optimization (VSEO) and
contextualized advertisement
SPEECH TECHNOLOGY AND
SOLUTION
Audio Mining Solution
Audiomining
powered by Fraunhofer IAIS
Feature Advantage for Customer
Automatic Speech Segmentation
Fast browsing through long videos
Finding relevant segments quickly
Speaker Clustering / Speaker Detection
Searching for segments with specific speaker
Searching for statements by person
Speech Recognition
Search for relevant videos
Search within videos for relevant section
Keyword Generation
Generate Tag Cloud
Get a rough summary of the video
Speaker Diarization
 Unstructured audio recording
 Homogeneous segments
Speech Speech Detection of speech Speech
Male Voice Male Voice Detection of gender Female Voice
Speaker 1 Speaker 1 Speaker recognition Speaker 2
 Jingle recognition
(e.g. programm)
Start of News Show
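The step from an unstructured recording to homogeneous segments can be illustrated with a minimal sketch. It assumes a classifier has already assigned one label to each fixed-length analysis window; the labels, the 1-second window and the function name `merge_windows` are hypothetical and are not taken from Audio Mining.

```python
from itertools import groupby

# Assumption: one label per analysis window of fixed length.
WINDOW_SECONDS = 1.0


def merge_windows(labels, window_seconds=WINDOW_SECONDS):
    """Merge runs of identical window labels into (start, end, label) segments."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append(
            (start * window_seconds, (start + length) * window_seconds, label)
        )
        start += length
    return segments


# Hypothetical window labels for a six-second excerpt.
labels = ["speaker 1", "speaker 1", "speaker 1", "jingle", "speaker 2", "speaker 2"]
segments = merge_windows(labels)
for begin, end, label in segments:
    print(f"{begin:5.1f}s - {end:5.1f}s  {label}")
```

Real diarization also has to decide which windows belong to the same speaker; the sketch only shows the final merging step.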
Automatic Speech Recognition
 Converts speech signal into written text
 Prerequisite for further steps (text mining)
 Based on statistical models to be trained
by large amount of data
 Three components:
 Acoustic model
(How do phonemes sound?)
 Lexicon
(How are words pronounced?)
 Language model
(Which words are probable?)
 Automatic speech recognition computes most
probable word sequence
[Figure: language model, lexicon and acoustic model feed the recognizer, which outputs the recognized text]
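How the recognizer picks the most probable word sequence can be sketched as follows. In the log domain the decision rule is to maximise the acoustic log-score plus the language-model log-score over all candidate word sequences. The candidates and all scores below are invented for illustration, and the lexicon step (mapping words to phoneme sequences) is left out.

```python
# Assumption: hypothetical log-scores for three candidate word sequences.
candidates = {
    "recognize speech": {"acoustic": -12.0, "language": -4.0},
    "wreck a nice beach": {"acoustic": -11.5, "language": -9.0},
    "recognise peach": {"acoustic": -13.0, "language": -8.5},
}


def best_hypothesis(scored, lm_weight=1.0):
    """Return the candidate with the highest combined log-score."""
    return max(
        scored,
        key=lambda words: scored[words]["acoustic"]
        + lm_weight * scored[words]["language"],
    )


print(best_hypothesis(candidates))                 # acoustic and language model combined
print(best_hypothesis(candidates, lm_weight=0.0))  # acoustic model only
```

With the language model switched off, the acoustically best but less plausible candidate wins, which is why both models are combined.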
Progress in Speech Recognition
 Massive Usage of Deep Learning
Technology:
 Improvement of acoustic
modelling (many speakers, many
speaking styles, etc. )
 Gaussian Mixtures (GMM) =>
Deep Neural Networks (DNN)
Microsoft Research
 Dahl, Yu, Deng, Acero (2012): Context-Dependent
Pre-Trained Deep Neural Networks for
Large-Vocabulary Speech Recognition
 Reduction of error rate from 23% to
13%
DNNs for Speech Recognition
Speech Recognition is currently one of the Top Technologies
DNN-based applications from Amazon, Microsoft, Google & co
Amazon Alexa Echo 2016 Apple: Siri 2015 Google Now: 2015
Microsoft: Cortana 2016
Deep Learning
 Speech recognition
 Image recognition
 Text understanding
 Machine translations
 Breast cancer diagnostics
 Game play
A game changer towards artificial intelligence
big data
+ machine learning
= progress in AI
Source: Y. Bengio, ML tutorial, KDD 2014
Source: S. Jones, nvidia blog, 2014
Source: Microsoft Research, 2014
Source: Ciresan et al., Proc MICCAI, 2013
Source: Mnih et al., Nature, 2015
Source: Xiong et al., Science 2015
Speech Recognition System Setup (German)
powered by Fraunhofer IAIS
 Acoustic Training Data: GER-TV 1000h (LREC 2014)
 Language Model Training Data: 71.8 M words (news domain)
 Competitive on the German market, English system in progress
 Using deep neural networks (DNNs) for acoustic modelling
(instead of Gaussian Mixture Models)
 stable, continuous improvement, integration of up-to-date research results
GMM: Gaussian Mixture Model, DNN: Deep Neural Network

Year     No.  Acoustic model  Language model  Training data [h]  WER [%] planned  WER [%] spontaneous
2012/13  1    GMM             3-gram, 200k    105                26.4             33.5
2013/14  2    GMM             3-gram, 200k    323                24.0             31.1
2014     3    DNN             3-gram, 200k    323                18.4             22.6
2015     4    DNN             5-gram, 510k    1005               13.3             16.5
2016     5    RNN             5-gram, 510k    1005               11.9             14.5
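The year-over-year gains are easier to compare as relative WER reductions. The sketch below uses the WER values from the table; the helper name `relative_reduction` is an illustrative choice.

```python
# WER values from the table: (year, WER planned speech, WER spontaneous speech)
wer_table = [
    ("2012/13", 26.4, 33.5),
    ("2013/14", 24.0, 31.1),
    ("2014", 18.4, 22.6),
    ("2015", 13.3, 16.5),
    ("2016", 11.9, 14.5),
]


def relative_reduction(old, new):
    """Relative reduction in percent when the error rate drops from old to new."""
    return 100.0 * (old - new) / old


for previous, current in zip(wer_table, wer_table[1:]):
    planned = relative_reduction(previous[1], current[1])
    spontaneous = relative_reduction(previous[2], current[2])
    print(
        f"{previous[0]} -> {current[0]}: "
        f"planned {planned:.1f}%, spontaneous {spontaneous:.1f}%"
    )
```

For planned speech, the step from the 2015 DNN system to the 2016 RNN system works out to a relative reduction of roughly 10.5%.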
Ongoing Research on RNN-CTC
 RNN-CTC: Connectionist Temporal Classification. What's new: solve speech
recognition as an end-to-end machine learning task, everything is a (deep)
recurrent neural network (RNN)
 1000 h speech corpus, ~2 weeks training time on a GPU cluster.
 Roughly 10% relative reduction in WER on average with RNN-CTC
Beyond HMM, HMM-DNN Approaches
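A CTC-trained network emits one label per audio frame, including a special blank label. The text is recovered by merging repeated labels and then dropping the blanks. The sketch below shows this collapse rule with greedy (best-path) decoding, the simplest decoder; it is not a description of the decoder used at Fraunhofer IAIS, and the alphabet and posteriors are invented.

```python
BLANK = "_"  # assumption: symbol chosen here to represent the CTC blank


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label path: merge repeats, then drop blanks."""
    output = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return "".join(output)


def greedy_decode(frame_posteriors, alphabet, blank=BLANK):
    """Pick the most probable label per frame, then apply the collapse rule."""
    best_path = [
        alphabet[max(range(len(frame)), key=frame.__getitem__)]
        for frame in frame_posteriors
    ]
    return ctc_collapse(best_path, blank)


# Hypothetical posteriors over the alphabet ["_", "a", "b"] for four frames.
alphabet = ["_", "a", "b"]
posteriors = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
]
print(greedy_decode(posteriors, alphabet))
print(ctc_collapse(list("__hh_e_ll_ll_oo")))
```

The blank between the two runs of "l" is what lets the path spell a double letter.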
Speaker Recognition using iVectors
[Figure: example iVectors rendered as grids of real-valued coefficients]
iVector Comparison
 Sebastian Kurz
 Confidence: 0.05
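One common way to compare two iVectors is cosine similarity. The slide does not state how its confidence value is computed, so the sketch below is an assumption about the general technique, and the short vectors are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical low-dimensional stand-ins for iVectors.
enrolled_speaker = [2.5, -3.9, -1.6, -2.8]
same_direction = [5.0, -7.8, -3.2, -5.6]
different_speaker = [-1.3, 1.6, 0.3, 7.3]

print(cosine_similarity(enrolled_speaker, same_direction))
print(cosine_similarity(enrolled_speaker, different_speaker))
```

A score near 1 suggests the same speaker and a score near 0 or below suggests different speakers; a deployed system would calibrate a decision threshold on held-out data.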
Fraunhofer IAIS Audio Mining: Technology
 Speaker diarization to structure
recordings automatically (e.g. speaker
information)
 ASR System based on KALDI open
source package
 Using Deep Neural Networks
 Completely speaker independent
 Real-time processing
 Trained on 1000 hours large-scale
German broadcast database
 Service-orientated architecture to
control and run the recognition engine
[Figure: architecture diagram with the labels web services, messaging, Audio Mining core, Audio Mining Monitor and AudioMining iFinder; iFinder is shown with several parallel instances of Structural Analysis and Automatic Speech Recognition]
USER INTERFACE
Audio Mining Solution
GUI: Media Search Interface
Search functionality:
Find audio and video files with
specific keywords, specific words
in the title or the transcript, or
with a specific series name.
GUI: Segmentation, Sub-Titles, Preview
Preview functionality:
Select a media file from the right-
hand side to watch it or listen to it.
Subtitles:
Audio Mining creates subtitles
based on the transcript and the
structural analysis results.
Segmentation/Speaker
clustering:
Audio Mining detects whenever
the speaker changes and divides
the media file into multiple
segments. Jump to a specific
segment by clicking the timeline.
GUI: Word Positioning, Snippets
Advanced search functionality:
You are also able to search for a
specific word inside the transcript.
Word occurrences:
Marks indicate the occurrences of
the search term. Click on a mark
to jump to the corresponding
position in the video.
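Jumping to word occurrences requires a transcript in which every word carries a time code. A minimal sketch, assuming a hypothetical list of (start time in seconds, word) pairs rather than the actual Audio Mining data format:

```python
# Assumption: a time-coded transcript as (start time in seconds, word) pairs.
transcript = [
    (0.0, "guten"),
    (0.4, "Abend"),
    (1.2, "die"),
    (1.4, "Nachrichten"),
    (5.0, "Abend"),
]


def find_occurrences(timed_words, term):
    """Return the start times of all words that match term, ignoring case."""
    needle = term.casefold()
    return [start for start, word in timed_words if word.casefold() == needle]


for position in find_occurrences(transcript, "abend"):
    print(f"mark at {position:.1f}s")
```

Each returned start time corresponds to one mark on the timeline.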
GUI: Keywords
Keywords:
Audio Mining generates keywords
for every media file, based on
particular relevant words in the
transcript.
Again, marks indicate the
occurrences of the keyword. Click
on a mark to jump to the
corresponding position in the
video.
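The slide does not say how relevance is measured. As an illustration only, the sketch below swaps in TF-IDF, a common weighting in which a word scores highly when it is frequent in one transcript and rare across the others. The transcripts and the function name `tfidf_keywords` are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, top_n=3):
    """Rank the words of one document by TF-IDF against the whole collection."""
    tokenised = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    tokens = tokenised[doc_index]
    counts = Counter(tokens)
    scores = {
        word: (count / len(tokens)) * math.log(len(documents) / doc_freq[word])
        for word, count in counts.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:top_n]


# Hypothetical transcripts of three media files.
transcripts = [
    "the election result the election debate",
    "the weather forecast the rain",
    "the football match the goal",
]
print(tfidf_keywords(transcripts, 0))
```

Words that occur in every transcript, such as "the", receive a score of zero and drop to the bottom of the ranking.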
GUI: Full transcript
Transcript:
Audio Mining provides a transcript
for every media file. Again, the
video or audio file is divided into
segments. Different colours
indicate different speakers.
You are able to export the
transcript to different file formats.
GUI: Recommendation
Recommendations:
You have just watched an exciting
video and are now looking for a
similar one? No problem! Audio
Mining recommends related media
files, based on the similarity of
their keywords.
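Keyword-based recommendation can be sketched with Jaccard similarity, the share of keywords two media files have in common. The slide does not name the similarity measure, so Jaccard is an assumption here, as are the asset identifiers and keyword sets.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(target_id, keywords_by_asset, top_n=2):
    """Return the ids of the assets whose keywords best overlap the target's."""
    target = keywords_by_asset[target_id]
    scored = [
        (jaccard(target, keywords), asset_id)
        for asset_id, keywords in keywords_by_asset.items()
        if asset_id != target_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [asset_id for score, asset_id in scored[:top_n] if score > 0]


# Hypothetical keyword sets per media file.
assets = {
    "news-1": {"election", "parliament", "vote"},
    "news-2": {"election", "vote", "coalition"},
    "news-3": {"parliament", "budget"},
    "sport-1": {"football", "goal"},
}
print(recommend("news-1", assets))
```

Assets without any shared keyword are never recommended, even if fewer than `top_n` candidates remain.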
Audio Mining: Status
Demo System: https://nm-demo.iais.fraunhofer.de/customer_demo/
 Fraunhofer IAIS provides web-based test accounts for interested customers
 https://nm-demo.iais.fraunhofer.de/$TV-station
 HR, SWR, BR, RBB, ZDF, …
 Easy to use, simple upload functionality
 Positive feedback
 Segmentation and speaker diarization very useful (improvement possible)
 ASR quality is good for many types of radio and TV programmes
 Keyword search and keyword access are rated very positively
 Full transcript is useful
 Keyword generation as interesting alternative for summary and fixed semantic
vocabulary
 Export in several formats possible
Audio Mining: Challenges and Research Issues
Feedback from media archive professionals of ARD
 Overlapping speech segments, voice over
 Short speaker turns are difficult to detect
 Overlapping speech segments reduce ASR quality (“talk show”)
 Voice over: Start in language 1, continue with language 2
 Hard to solve
 Background noise, noisy conditions
 Noise degrades ASR quality
 Solutions: data augmentation, speech enhancement
 Very open domains, unlimited vocabulary, out-of-vocabulary problem, names
 Regular update of the language models required (e.g. “Incirlik“, „James Comey“)
 Mixed/multiple languages
 Foreign names (ARD pronunciation dictionary)
 Dialects
 BR provides several dialects of the German language for research work
 Punctuation marks are required
SYSTEM ARCHITECTURE
Audio Mining Solution
System architecture
[Figure: architecture overview]
Components: clients (e.g. AREMA) and web interface; web services and messaging; Audio Mining core; Audio Mining Monitor; AudioMining iFinder
Message flows labelled in the diagram:
 Import, analysis, status and deletion requests (down); asset status and details (up)
 Analysis priorities (down); asset details, processing updates and deletion updates (up)
 Analysis requests (down); analysis results (up)
System architecture: Audio Mining core
[Figure: architecture overview with the Audio Mining core highlighted; its labelled parts are a database, a search index and a file system]
System architecture: AudioMining iFinder
[Figure: architecture overview with AudioMining iFinder highlighted; it is shown with several parallel instances of Structural Analysis and Automatic Speech Recognition]
System architecture: Audio Mining Monitor
[Figure: architecture overview with the Audio Mining Monitor highlighted; its labelled parts are a database, an HTTP server and a messaging server]
Infrastructure and Scalability
 Server (1): Scheduling and Media Repository
 VM, ≥ 2 cores (≥ 2 GHz, 64-bit), 30 GB RAM
 SLES, JRE 8, MySQL, Bash 4
 Server (2): Audio analysis
 Processing capacity per core (AMD Opteron 6234):
17 h of audio material per day
4 GB RAM
 For 20 h of audio data per day:
 ≥ 2 cores (≥ 2 GHz, 64-bit), 8 GB RAM
 SLES
 Audio processing is fully scalable
 Tested on 480 cores to process several thousand hours/day
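The sizing rule above can be checked with a short calculation. The sketch assumes that throughput scales linearly with the number of cores and that every core matches the quoted per-core figure of 17 h of audio per day; both function names are illustrative.

```python
import math

# Per-core figure quoted above for an AMD Opteron 6234.
HOURS_PER_CORE_PER_DAY = 17


def cores_needed(audio_hours_per_day, hours_per_core=HOURS_PER_CORE_PER_DAY):
    """Smallest number of cores that can process the daily audio volume."""
    return math.ceil(audio_hours_per_day / hours_per_core)


def daily_capacity(cores, hours_per_core=HOURS_PER_CORE_PER_DAY):
    """Hours of audio per day that the given number of cores can process."""
    return cores * hours_per_core


print(cores_needed(20))      # daily volume from the sizing example
print(daily_capacity(480))   # core count from the scalability test
```

Under these assumptions, 20 h of audio per day needs 2 cores, and 480 cores could handle 8,160 h per day, consistent with the "several thousand hours/day" reported for the scalability test.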
REFERENCE PROJECTS
Audio Mining Solution
Speech Recognition for Media Archiving
powered by Fraunhofer IAIS
Customer: WDR, German Broadcaster
(Archive Department)
Project facts:
 Integration of the Fraunhofer IAIS Audio Mining
system into the WDR IT
environment (ARCHIMEDES and IVZ)
 Content mining of large amounts of AV data,
immediately!
 Better navigation and segmentation of
radio and TV material
 Search in spoken utterances
 Full transcription and keyword generation
Technology provided by Fraunhofer:
 Broadcast speech recognition
 Automatic speech segmentation
Structured preparation
Speech Recognition
Structured Segmentation
Content Analytics for ARD Mediathek
Artificial Intelligence powered by Fraunhofer
 Content analytics of 200,000 media
assets
 Advanced search and retrieval
capabilities
 Full transcription of multimedia content
 Daily processing of 2000 new media
assets from radio and TV
 Core technology for recommendation
and personalization services
 Link: http://www.ardmediathek.de
Speech Recognition for the „ARD-Mediathek“
powered by Fraunhofer IAIS
Customer: SWR / ARD.de editorial team (Link: www.ardmediathek.de), 2014/15
Project facts:
 Processing of 200,000 media assets (average duration 15 minutes/asset)
 Service based (crawling, processing, metadata transfer)
 Daily amount: 2000 assets (update mechanism every 60 minutes)
Technology provided by Fraunhofer:
 Speaker diarization, speech recognition, keyword extraction
real-time analysis of heterogeneous news streams
News-Stream
Objectives
 Big data infrastructure for efficient and real-time analysis of
heterogeneous news streams
 Semantic analysis of multimodal and unstructured news data
 Piloting in real-life scenarios
Technologies and Applications
 Real-time speaker recognition
 Audio „citation“ search
 Heatmap & Social Media Monitoring, …
 Project duration: 09/2014 to 12/2017
http://newsstreamproject.org/
KA3: Cologne Centre for Analysis and Archiving of AV Data
Centre Project of the German BMBF eHumanities Program
 Project objectives
 Creation of a centre for e-Humanities
research in Germany with
a focus on AV data
 Contribution of Fraunhofer IAIS
 Development and provision of tools for
automatic analysis of speech and
audio recordings (oral history scenario,
interaction scenario)
 Use Case 1: Oral History
 Use Case 2: Interaction Scenario
 Duration: 10/2015 – 09/2018
 Partners : Univ. Köln, MPI for
Psycholinguistics, Fernuniversität in Hagen
KA3: Use Case Interaction Scenario
Challenges:
 Very fast dialogues, short
speaker turns
 Backchannel sounds
(„mmh“, „hmm“, „ja“, …)
 Overlapping speech
segments
Technologies:
 Improved speaker clustering
 Speech/non speech
segmentation with deep
learning
 Overlapping speech
segments with RNN
 Automatic segmentation of speech recordings
[Figure: segmentation results for an arbitrary number of speakers, for at most 2 speakers and for exactly 2 speakers, shown against the reference segmentation]
KA3: Use Case Oral History
Speech Recognition: Reference & ASR Output
Example: Kruse (clean recording)
Reference (excerpt 1):
zwischendrin hatte ich natürlich auch versucht noch
mit bei der Medizin zu landen das war aber damals
deswegen so schwierig weil das glaube ich ein Jahr
war bevor der Numerus clausus in der Medizin
eingeführt wurde und man musste so mit
sechshundert Anfängern ungefähr um sechs Uhr auf
der Treppe sitzen damit man um acht Uhr in die
Vorlesung kam und das war für mich
ASR output (excerpt 1):
zwischendrin hatte ich natürlich auch versucht sich
noch beim bei der Medizin zu landen das war aber
damals deswegen so schwierig weil das glaube ich
ein Jahr war bevor der Numerus clausus in der
Medizin eingeführt wurde und man musste somit
sechshundert Anfängern ungefähr um sechs Uhr auf
der Treppe sitzen damit man um acht Uhr in der
Vorlesung kam und das war für mich
Reference (excerpt 2):
dann habe ich dieses Studium abgeschlossen und
hatte mich kurz auch mal dafür interessiert in eine
Berufstätigkeit im Entwicklungsdienst deutscher
Entwicklungsdienst hieß das glaube ich einzusteigen
hatte aber auch gleichzeitig so einen Hiwi-Job am
Institut und so blieb ich dann hängen und hatte
eben einfach die Chance weil man dann auch
gefördert wird oder die Chance hat in einem
bestimmten Projekt zu arbeiten dass ich dann daran
gedacht habe zu promovieren
ASR output (excerpt 2):
dann habe ich dieses Studium abgeschlossen und
hatte mich kurz auch mal dafür interessiert in eine
Berufstätigkeit im indem ein Entwicklungsland
Deutscher Entwicklungsdienst hieß das glaube ich
einzusteigen hatte aber auch gleichzeitig so ein ein
Hiwi Job am Institut und so lieblich dann hängen
und hatte eben einfach die Chance weil man dann
auch gefördert wird oder die Chance hat _ einen
bestimmten Projekt zu arbeiten dass ich dann daran
gedacht habe zu promovieren
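Differences between a reference transcript and ASR output are usually summarised as word error rate (WER): the minimum number of word substitutions, deletions and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. A minimal sketch using word-level edit distance, applied to a short phrase taken from the first excerpt; the function name is illustrative and the reference is assumed to be non-empty:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j]: edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "man musste so mit sechshundert Anfängern"
hypothesis = "man musste somit sechshundert Anfängern"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Here "so mit" recognised as "somit" counts as one substitution plus one deletion, so two errors against six reference words.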
KA3/Newsstream: Forced Alignment & Editing of Transcripts
 If a complete and almost perfect transcript is available, the missing time
codes are generated by forced alignment
 Input: audio file, transcript
 Output: segmentation file (MPEG-7, ELAN)
 Part of iFinder 3.0
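The idea behind forced alignment can be shown with a toy dynamic program. It assumes a matrix of log-scores stating how well each audio frame matches each transcript token, and finds the best assignment in which tokens appear in order and each token covers at least one frame. Real systems align phoneme or HMM states from the acoustic model; the scores and the function name `forced_align` are hypothetical and do not describe iFinder.

```python
def forced_align(log_scores):
    """Return one (first_frame, last_frame) span per token.

    log_scores[t][n] is the log-score of frame t belonging to token n.
    """
    n_frames = len(log_scores)
    n_tokens = len(log_scores[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_tokens for _ in range(n_frames)]
    advanced = [[False] * n_tokens for _ in range(n_frames)]
    best[0][0] = log_scores[0][0]
    for t in range(1, n_frames):
        for n in range(n_tokens):
            stay = best[t - 1][n]
            advance = best[t - 1][n - 1] if n > 0 else neg_inf
            if advance > stay:
                best[t][n] = advance + log_scores[t][n]
                advanced[t][n] = True  # token n starts at frame t
            else:
                best[t][n] = stay + log_scores[t][n]
    # Trace back from the last frame of the last token.
    spans = [[0, 0] for _ in range(n_tokens)]
    n = n_tokens - 1
    spans[n][1] = n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][n]:
            spans[n][0] = t
            n -= 1
            spans[n][1] = t - 1
    return [tuple(span) for span in spans]


# Hypothetical scores: frames 0-1 match token 0, frames 2-3 match token 1.
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
print(forced_align(scores))
```

Multiplying the frame indices by the frame length gives the time codes for each token.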
Summary and Outlook
Summary
 Deep Learning and large corpora have led to massive progress for Speech2Text
 Speech2Text provides good transcription quality for broadcast speech (about 10% error),
but it is not perfect
 Audio Mining is more than S2T: speech segmentation, speaker recognition, citations, …
 Many advantages: lower annotation costs, immediate availability, more details and time codes
 Some disadvantages: challenging recording conditions, explosion of metadata
 Conclusion: Audio Mining/S2T is accepted by users!
 Test Account possible: https://nm-demo.iais.fraunhofer.de/customer_demo
Outlook
 Several research issues are still open (dialects, overlapping speech segments, …)
 Further improvement is expected (evaluation of Deep Learning, more data, engineering)
 Important issue: Integration into MAM workflows
Let's do more with your data!
Fraunhofer Institute for Intelligent Analysis and
Information Systems IAIS
www.iais.fraunhofer.de
Link: https://www.iais.fraunhofer.de/audiomining.html
Contact
Dr. Joachim Köhler
Head of Image Processing
+49 (0)2241 14-1900
joachim.koehler@iais.fraunhofer.de
Disclaimer
Copyright © by
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Hansastraße 27 c, 80686 Munich, Germany
All rights reserved.
Responsible contact is: Katrin Berkler | Silke Loh | Public Relations | pr@iais.fraunhofer.de
All copyrights for this presentation and their content are owned in full by the Fraunhofer-Gesellschaft, unless
expressly indicated otherwise.
Each presentation may be used for personal editorial purposes only. Modifications of images and text are not
permitted. Any download or printed copy of this presentation material shall not be distributed or used for
commercial purposes without prior consent of the Fraunhofer-Gesellschaft.
Notwithstanding the above mentioned, the presentation may only be used for reporting on Fraunhofer-
Gesellschaft and its institutes free of charge provided source references to Fraunhofer’s copyright shall be included
correctly and provided that two free copies of the publication shall be sent to the above mentioned address.
The Fraunhofer-Gesellschaft undertakes reasonable efforts to ensure that the contents of its presentations are
accurate, complete and kept up to date. Nevertheless, the possibility of errors cannot be entirely ruled out. The
Fraunhofer-Gesellschaft does not take any warranty in respect of the timeliness, accuracy or completeness of
material published in its presentations, and disclaims all liability for (material or non-material) loss or damage
arising from the use of content obtained from the presentations. The afore mentioned disclaimer includes damages
of third parties.
Registered trademarks, names, and copyrighted text and images are not generally indicated as such in the
presentations of the Fraunhofer-Gesellschaft. However, the absence of such indications in no way implies that
these names, images or text belong to the public domain and may be used unrestrictedly with regard to trademark
or copyright law.

NLP BASED INTERVIEW ASSESSMENT SYSTEM
 

Fraunhofer IAIS Audio Mining: Automatic meta data generation of audio streams

  • 1. Fraunhofer IAIS Audio Mining: Automatic meta data generation of audio streams
       FIAT/IFTA Media Management Seminar, Lugano 2017
       Dr. Joachim Köhler, Head of Department NetMedia
       Fraunhofer Institute for Intelligent Analysis and Information Systems
  • 2. Fraunhofer is the largest organization for applied research in Europe
       - More than 80 research institutions, including 69 Fraunhofer institutes
       - More than 24,500 employees, the majority educated in the natural sciences or engineering
       - An annual research volume of 2.1 billion euros, of which 1.9 billion euros is generated through contract research
         - 2/3 of this research revenue derives from contracts with industry and from publicly financed research projects
         - 1/3 is contributed by the German federal government and the Länder governments in the form of institutional financing
       - International collaboration through representative offices in Europe, the US, Asia and the Middle East
       © Fraunhofer IAIS
  • 3. Fraunhofer Institute Centre Schloss Birlinghoven
       International research in Big Data and Cognitive Computing
       600 interdisciplinary scientists, 3 institutes:
       - Fraunhofer Institute for Applied Information Technology FIT
       - Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
       - Fraunhofer Institute for Algorithms and Scientific Computing SCAI
       One of the largest research locations for applied computer science and mathematics in Germany
       Close cooperation with regional universities
       © Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
  • 4. Fraunhofer & digital archiving and broadcasting: activities, portfolio, networking
       - Several Fraunhofer Institutes have contributed to many seminars of the German VFM on automatic metadata generation
       - Fraunhofer IAIS generated a study on future technologies for media archives and a concept for an innovative archive system: Media Data Hub
       - Participation in many European research projects (LIVE, AXES, CubRIK, LinkedTV, MiCO)
       - Workshop with directors of broadcast archives, 2012
       - Technology portfolio
         - Music & Video Analytics (IDMT)
         - Audio Mining (IAIS)
         - Media Data Hub (IAIS)
         - Quality control and fingerprinting (IDMT)
       [Image caption: VFM Technology workshop 2012]
  • 5. The Future of Media Archives: Strategic & Conceptual
       Native crossmedia
       - Crossmedia from data model to UI
       - Using graph-based data models (e.g. Europeana)
       Media Data Hub
       - Linking and integration of data silos
       - Bringing all metadata sources into one application (archive, legal, )
       Massive automation of documentation
       - Manual annotation will be reduced and process-optimized
       - Future: up to 100% automatic annotation (like in press archives)
       Near to production environment
       - Search and access immediately after the production process
       - Interfacing to production systems (OpenMedia, Avid, etc.)
  • 6. Mining Technologies for Media Archiving
       (Report »Archive system of the future – Strategic innovation concept«, Fraunhofer 2014); rated by technology readiness level (TRL)
       - Text Mining
       - Audio Mining
       - Video Mining
       - Object & face recognition
       - Video OCR
       - Image similarity
       - Audio and video fingerprinting
       - Recommendation technologies
       - Interactive data visualization
       - Personalization and contextualization
       - Faceted search
       - Linking of information items
       Example rating from the report (translated from German):
       - Application: assisted dossier creation
       - Criteria: maturity level 4-5; integration and operating effort 3-5; added value 5
  • 7. Results of the 2nd FIAT/IFTA MAM Survey
  • 8. Fraunhofer IAIS Audio Mining
       - B2B speech recognition solution for the media industry
       - Key facts
         - Large Vocabulary Continuous Speech Recognition (1,000,000 words) optimized for media content
         - Automatic structuring of audio-visual content
       - Applications along the media asset chain
         - Archive: indexing and transcription of media archive content
         - Online: search functionalities for media portals (e.g. InClip-Search) and content-based recommendation
         - TV distribution: subtitling for TV content
         - Social TV: second-screen information enrichment
         - Advertising/Marketing: Video Search Engine Optimization (VSEO) and contextualized advertisement
       © Fraunhofer, Joachim.Koehler@iais.fraunhofer.de
  • 9. Speech Technology and Solution: Audio Mining Solution
  • 10. Audio Mining powered by Fraunhofer IAIS (feature: advantage for the customer)
        - Automatic speech segmentation: fast browsing through long videos; finding relevant segments quickly
        - Speaker clustering / speaker detection: searching for segments with a specific speaker; searching for statements by person
        - Speech recognition: search for relevant videos; search within videos for the relevant section
        - Keyword generation: generate a tag cloud; get a rough summary of the video
  • 11. Speaker Diarization
        - Input: unstructured audio recording; output: homogeneous segments
        - Detection of speech (segments labelled: Speech, Speech)
        - Detection of gender (segments labelled: Male Voice, Male Voice, Female Voice)
        - Speaker recognition (segments labelled: Speaker 1, Speaker 1, Speaker 2)
        - Jingle recognition (e.g. programme jingle): start of news show
  • 12. Automatic Speech Recognition
        - Converts the speech signal into written text
        - Prerequisite for further steps (text mining)
        - Based on statistical models that are trained on large amounts of data
        - Three components:
          - Acoustic model (How do phonemes sound?)
          - Lexicon (How are words pronounced?)
          - Language model (Which words are probable?)
        - Automatic speech recognition computes the most probable word sequence
        [Diagram labels: Language model, Lexicon, Acoustic model, recognized text]
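The statement that the recognizer "computes the most probable word sequence" is commonly written as the decision rule below. This is the standard textbook formulation rather than something shown on the slide; the symbols X (acoustic observations) and W (a candidate word sequence) are notation introduced here for illustration.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \; p(X \mid W)\, P(W)
```

In this reading, p(X | W) is provided by the acoustic model together with the lexicon, and P(W) by the language model.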
  • 13. Progress in Speech Recognition
        - Massive usage of deep learning technology:
          - Improvement of acoustic modelling (many speakers, many speaking styles, etc.)
          - Gaussian Mixtures (GMM) => Deep Neural Networks (DNN)
        - Microsoft Research: Dahl, Deng, Acero (2012): Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
          - Reduction of error rate from 23% to 13%
  • 14. DNNs for Speech Recognition
  • 15. Speech recognition is currently one of the top technologies
        DNN-based applications from Amazon, Microsoft, Google & Co.:
        - Amazon: Alexa, Echo (2016)
        - Apple: Siri (2015)
        - Google: Google Now (2015)
        - Microsoft: Cortana (2016)
  • 16. Deep Learning: a game changer towards artificial intelligence
        big data + machine learning = progress in AI
        - Speech recognition
        - Image recognition
        - Text understanding
        - Machine translation
        - Breast cancer diagnostics
        - Game play
        Sources: Y. Bengio, ML tutorial, KDD 2014; S. Jones, nvidia blog, 2014; Microsoft Research, 2014; Ciresan et al., Proc. MICCAI, 2013; Mnih et al., Nature, 2015; Xiong et al., Science, 2015
  • 17. Speech Recognition System Setup (German), powered by Fraunhofer IAIS
        - Acoustic training data: GER-TV 1000h (LREC 2014)
        - Language model training data: 71.8 M words (news domain)
        - Competitive on the German market, English system in progress
        - Using deep neural networks (DNNs) for acoustic modelling (instead of Gaussian mixture models)
        - Stable, continuous improvement; integration of up-to-date research results
        GMM: Gaussian Mixture Model, DNN: Deep Neural Network, WER: word error rate

        Year     Acoustic model  Language model  Training data [h]  WER [%] planned  WER [%] spontaneous
        2012/13  1. GMM          3gram, 200k      105               26.4             33.5
        2013/14  2. GMM          3gram, 200k      323               24.0             31.1
        2014     3. DNN          3gram, 200k      323               18.4             22.6
        2015     4. DNN          5gram, 510k     1005               13.3             16.5
        2016     5. RNN          5gram, 510k     1005               11.9             14.5
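The WER columns above report word error rate. As a rough illustration of how such a figure is usually obtained, the sketch below computes WER as the word-level edit distance divided by the number of reference words, plus the relative reduction between two systems. The function names and the example sentences are made up for this sketch; the slides do not describe the scoring tool that was actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Relative improvement when going from old_wer to new_wer."""
    return (old_wer - new_wer) / old_wer


if __name__ == "__main__":
    # One missing word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Planned-speech WER of system 4 (13.3) versus system 5 (11.9) from the table.
    print(relative_reduction(13.3, 11.9))
```

Applied to the planned-speech column, the step from system 4 (13.3%) to system 5 (11.9%) corresponds to a relative reduction of roughly 10.5%.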
  • 18. Ongoing Research on RNN-CTC: Beyond HMM and HMM-DNN Approaches
        - RNN-CTC: recurrent neural network (RNN) trained with Connectionist Temporal Classification (CTC). What's new: speech recognition is solved as an end-to-end machine learning task; everything is a (deep) recurrent neural network
        - 1000h speech corpus, ~2 weeks training time on a GPU cluster
        - About 10% relative reduction in WER on average with RNN-CTC
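One piece of CTC that is easy to show in isolation is the mapping from a frame-by-frame label sequence to the final output: consecutive repeats are merged first, then the blank symbol is dropped. The sketch below implements only this standard collapsing step with greedy (best-label-per-frame) input; it is not the Fraunhofer decoder, and the blank symbol `_` and the function name are assumptions made for the example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(frame_labels):
    """Apply the CTC collapsing rule: merge consecutive repeats, then remove blanks."""
    merged = [label for label, _group in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]


if __name__ == "__main__":
    # Ten frames; the blank between the two "l" runs keeps the double letter.
    frames = list("hh_e_ll_lo")
    print("".join(ctc_collapse(frames)))
```

The blank is what allows a genuinely repeated symbol to survive the merge step, which is why it is removed only after repeats have been collapsed.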
  • 19. Speaker Recognition using iVectors
        [Figure: two i-vectors shown as long columns of numeric components]
        iVector comparison result: Sebastian Kurz, confidence 0.05
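The slide does not say how two i-vectors are compared or how the reported confidence is derived. A common choice for i-vectors is cosine scoring, sketched below under that assumption; the function name and the toy vectors are illustrative only, and real i-vectors typically have several hundred dimensions.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    enrolled = [2.5, -3.9, -1.6, -2.8]  # toy stand-in for an enrolled speaker's i-vector
    test = [2.0, -3.5, -1.0, -3.0]      # toy stand-in for the i-vector of a new segment
    print(cosine_score(enrolled, test))
```

In such a setup, a segment would be attributed to the enrolled speaker whose vector scores highest, usually subject to a threshold tuned on held-out data.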
  • 20. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de Fraunhofer IAIS Audio Mining: Technology  Speaker diarization to structure recordings automatically (e.g. speaker information)  ASR system based on the KALDI open source package  Using Deep Neural Networks  Completely speaker independent  Real-time processing  Trained on a large-scale German broadcast database of 1000 hours  Service-oriented architecture to control and run the recognition engine [Diagram: Web services, Messaging, Audio Mining core, Audio Mining Monitor, AudioMining iFinder, Structural Analysis (×3), Automatic Speech Recognition (×3)]
  • 22. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de GUI: Media Search Interface Search functionality: Find audio and video files with specific keywords, specific words in the title or the transcript, or with a specific series name.
  • 23. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de GUI: Segmentation, Sub-Titles, Preview Preview functionality: Select a media file from the right-hand side to watch it or listen to it. Subtitles: Audio Mining creates subtitles based on the transcript and the structural analysis results. Segmentation/Speaker clustering: Audio Mining detects whenever the speaker changes and divides the media file into multiple segments. Jump to a specific segment by clicking the timeline.
  • 24. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de GUI: Word Positioning, Snippets Advanced search functionality: You are also able to search for a specific word inside the transcript. Word occurrences: Marks indicate the occurrences of the search term. Click on a mark to jump to the corresponding position in the video.
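The word-positioning feature relies on every recognized word carrying a time code. A minimal sketch of how such marks could be derived from a time-coded transcript follows; the `TimedWord` type, the `find_occurrences` function and the sample timings are hypothetical and not taken from the Audio Mining implementation.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the beginning of the media file


def find_occurrences(words, term):
    """Return the start time of every word equal to `term`, ignoring case."""
    needle = term.casefold()
    return [w.start for w in words if w.text.casefold() == needle]


# Hypothetical transcript excerpt with invented time codes.
transcript = [
    TimedWord("Numerus", 12.4),
    TimedWord("clausus", 12.9),
    TimedWord("in", 13.3),
    TimedWord("der", 13.5),
    TimedWord("Medizin", 13.7),
    TimedWord("medizin", 42.0),
]

marks = find_occurrences(transcript, "Medizin")
print(marks)  # each value is a position a player could jump to
```

Each returned time corresponds to one mark on the timeline; clicking a mark would seek the player to that position.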
  • 25. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de GUI: Keywords Keywords: Audio Mining generates keywords for every media file, based on particularly relevant words in the transcript. Again, marks indicate the occurrences of the keyword. Click on a mark to jump to the corresponding position in the video.
  • 26. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de GUI: Full transcript Transcript: Audio Mining provides a transcript for every media file. Again, the video or audio file is divided into segments. Different colours indicate different speakers. You are able to export the transcript to different file formats.
  • 27. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de GUI: Recommendation Recommendations: You have just watched an exciting video and are now looking for a similar one? No problem! Audio Mining recommends related media files, based on the similarity of their keywords.
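The slide states only that recommendations are based on the similarity of keywords, not which similarity measure is used. As one plausible illustration, the sketch below ranks assets by the Jaccard overlap of their keyword sets; the measure, the function names and the sample catalogue are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(current_id, keywords_by_asset, top_n=3):
    """Return up to `top_n` asset ids sharing keywords with `current_id`,
    ordered from most to least similar (ties broken by asset id)."""
    current = keywords_by_asset[current_id]
    scored = [
        (jaccard(current, kws), asset_id)
        for asset_id, kws in keywords_by_asset.items()
        if asset_id != current_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [asset_id for score, asset_id in scored[:top_n] if score > 0]


# Hypothetical catalogue of generated keywords per media asset.
catalog = {
    "news_01": {"election", "parliament", "berlin"},
    "news_02": {"election", "parliament", "coalition"},
    "sport_01": {"football", "bundesliga"},
    "talk_01": {"election", "talk show"},
}

print(recommend("news_01", catalog))
```

Assets without any shared keyword are left out, so an unrelated sports programme is not recommended after a news item.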
  • 28. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de Audio Mining: Status Demo System: https://nm-demo.iais.fraunhofer.de/customer_demo/  Fraunhofer IAIS provides a web-based test account for interested customers  https://nm-demo.iais.fraunhofer.de/$TV-station  HR, SWR, BR, RBB, ZDF, …  Easy to use, simple upload functionality  Positive feedback  Segmentation and speaker diarization are very useful (improvement possible)  ASR quality is good for many types of radio and TV programmes  Keyword search and keyword access are rated very positively  Full transcript is useful  Keyword generation is an interesting alternative to summaries and a fixed semantic vocabulary  Export in several formats is possible
  • 29. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de Audio Mining: Challenges and Research Issues Feedback from media archive professionals of ARD  Overlapping speech segments, voice-over  Short speaker turns are difficult to detect  Overlapping speech segments reduce ASR quality ("talk show")  Voice-over: starts in language 1, continues in language 2  Hard to solve  Background noise, noisy conditions  Noise degrades ASR quality  Solutions: data augmentation, speech enhancement  Very open domains, unlimited vocabulary, out-of-vocabulary problem, names  Regular updates of the language models required (e.g. "Incirlik", "James Comey")  Mixed/multiple languages  Foreign names (ARD pronunciation dictionary)  Dialects  BR provides several dialects of the German language for research work  Punctuation marks are required
  • 31. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de System architecture [Diagram: Audio Mining core, Audio Mining Monitor, iFinder, Web services, Messaging, Clients (e.g. AREMA), Web interface, AudioMining] Data flows: Analysis requests ↓ ↑ Analysis results; ← Analysis priorities; Asset details, processing updates, deletion updates →; Analysis priorities ↓ ↑ Asset details, processing updates, deletion updates; Import, analysis, status and deletion requests ↓ ↑ Asset status, details
  • 32. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de AudioMining System architecture Audio Mining Monitor Audio Mining core iFinder Web services Messaging Clients (e.g. AREMA) Web interface Audio Mining core Audio Mining Data base Search index File system
  • 33. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de System architecture Audio Mining core Audio Mining Monitor Web services Messaging Clients (e.g. AREMA) Web interface AudioMining iFinder Structural Analysis Structural Analysis Structural Analysis Automatic Speech Recognition Automatic Speech Recognition Automatic Speech Recognition
  • 34. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de Audio Mining Monitor System architecture Audio Mining core iFinder Web services Messaging Clients (e.g. AREMA) Web interface AudioMining Database Audio Mining Monitor HTTP Server Messaging Server
  • 35. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de Infrastructure and Scalability Server (1): Scheduling and Media Repository  VM, ≥ 2 cores (≥ 2 GHz, 64-bit), 30 GB RAM  SLES, JRE 8, MySQL, Bash 4  Server (2): Audio analysis  Processing capacity per core (AMD Opteron 6234): 17 h of audio material per day, 4 GB RAM  For 20 h of audio data per day:  ≥ 2 cores (≥ 2 GHz, 64-bit), 8 GB RAM  SLES  Audio processing is fully scalable  Tested on 480 cores to process several thousand hours/day
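The sizing figures on this slide can be checked with simple arithmetic. The sketch below assumes that throughput scales linearly with the number of cores, which is how the "fully scalable" claim is read here; the function names are illustrative.

```python
import math

# Figure quoted on the slide for one AMD Opteron 6234 core.
HOURS_PER_CORE_PER_DAY = 17


def cores_needed(audio_hours_per_day, hours_per_core=HOURS_PER_CORE_PER_DAY):
    """Smallest whole number of cores that keeps up with the daily audio volume."""
    if audio_hours_per_day < 0:
        raise ValueError("audio volume must not be negative")
    return math.ceil(audio_hours_per_day / hours_per_core)


def daily_capacity(cores, hours_per_core=HOURS_PER_CORE_PER_DAY):
    """Hours of audio that `cores` cores can process per day (linear scaling assumed)."""
    return cores * hours_per_core


print(cores_needed(20))      # 2, matching the "≥ 2 cores" recommendation for 20 h/day
print(daily_capacity(480))   # 8160, consistent with "several thousand hours/day"
```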
  • 37. © Fraunhofer Speech Recognition for Media Archiving powered by Fraunhofer IAIS Customer: WDR, German Broadcaster (Archive Department) Project facts:  Integration of the Fraunhofer IAIS Audio Mining system into the WDR IT environment (ARCHIMEDES and IVZ)  Content mining of large amounts of AV data, immediately!  Better navigation and segmentation of radio and TV material  Search in spoken utterances  Full transcription and keyword generation Technology provided by Fraunhofer:  Broadcast speech recognition  Automatic speech segmentation Structured Preparation Speech Recognition Structured Segmentation
  • 38. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de Content Analytics for ARD Mediathek Artificial Intelligence powered by Fraunhofer  Content analytics of 200,000 media assets  Advanced search and retrieval capabilities  Full transcription of multimedia content  Daily processing of 2000 new media assets from radio and TV  Core technology for recommendation and personalization services  Link: http://www.ardmediathek.de
  • 39. © Fraunhofer Speech Recognition for the "ARD-Mediathek" powered by Fraunhofer IAIS Customer: SWR/ Redaktion ARD.de (Link: www.ardmediathek.de), 2014/15 Project facts:  Processing of 200,000 media assets (average duration 15 minutes/asset)  Service based (crawling, processing, metadata transfer)  Daily amount: 2000 assets (update mechanism every 60 minutes) Technology provided by Fraunhofer:  Speaker diarization, speech recognition, keyword extraction
  • 40. © Fraunhofer News-Stream: real-time analysis of heterogeneous news streams Objectives  Big data infrastructure for efficient and real-time analysis of heterogeneous news streams  Semantic analysis of multimodal and unstructured news data  Piloting in real-life scenarios Technologies and Applications  Real-time speaker recognition  Audio "citation" search  Heatmap & Social Media Monitoring, …  Project duration: 09/2014 to 12/2017 http://newsstreamproject.org/
  • 41. © Fraunhofer KA3: Cologne Centre for Analysis and Archiving of AV Data Centre Project of the German BMBF eHumanities Program  Project objectives  Creation of a centre for e-Humanities research in Germany with a focus on AV data  Contribution of Fraunhofer IAIS  Development and provision of tools for the automatic analysis of speech and audio recordings (oral history scenario, interaction scenario)  Use Case 1: Oral History  Use Case 2: Interaction Scenario  Duration: 10/2015 – 09/2018  Partners: Univ. Köln, MPI for Psycholinguistics, Fernuniversität in Hagen
  • 42. © Fraunhofer KA3: Use Case Interaction Scenario Challenges:  Very fast dialogues, short speaker turns  Backchannel sounds ("mmh", "hmm", "ja", …)  Overlapping speech segments Technologies:  Improved speaker clustering  Speech/non-speech segmentation with deep learning  Overlapping speech segments with RNN  Automatic segmentation of speech recordings [Figure: segmentation timelines labelled "Arbitrary # of speakers", "max. 2 speakers", "2 speakers" and "references"]
  • 43. © Fraunhofer KA3: Use Case Oral History Speech Recognition: Reference & ASR Output Example: Kruse (clean recording) Passage 1, reference: zwischendrin hatte ich natürlich auch versucht noch mit bei der Medizin zu landen das war aber damals deswegen so schwierig weil das glaube ich ein Jahr war bevor der Numerus clausus in der Medizin eingeführt wurde und man musste so mit sechshundert Anfängern ungefähr um sechs Uhr auf der Treppe sitzen damit man um acht Uhr in die Vorlesung kam und das war für mich Passage 1, ASR output: zwischendrin hatte ich natürlich auch versucht sich noch beim bei der Medizin zu landen das war aber damals deswegen so schwierig weil das glaube ich ein Jahr war bevor der Numerus clausus in der Medizin eingeführt wurde und man musste somit sechshundert Anfängern ungefähr um sechs Uhr auf der Treppe sitzen damit man um acht Uhr in der Vorlesung kam und das war für mich Passage 2, reference: dann habe ich dieses Studium abgeschlossen und hatte mich kurz auch mal dafür interessiert in eine Berufstätigkeit im Entwicklungsdienst deutscher Entwicklungsdienst hieß das glaube ich einzusteigen hatte aber auch gleichzeitig so einen Hiwi-Job am Institut und so blieb ich dann hängen und hatte eben einfach die Chance weil man dann auch gefördert wird oder die Chance hat in einem bestimmten Projekt zu arbeiten dass ich dann daran gedacht habe zu promovieren Passage 2, ASR output: dann habe ich dieses Studium abgeschlossen und hatte mich kurz auch mal dafür interessiert in eine Berufstätigkeit im indem ein Entwicklungsland Deutscher Entwicklungsdienst hieß das glaube ich einzusteigen hatte aber auch gleichzeitig so ein ein Hiwi Job am Institut und so lieblich dann hängen und hatte eben einfach die Chance weil man dann auch gefördert wird oder die Chance hat _ einen bestimmten Projekt zu arbeiten dass ich dann daran gedacht habe zu promovieren
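Comparisons like the one on this slide are usually summarized as a word error rate (WER): the word-level edit distance between reference and ASR output, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation tool used in the project; the example pair is a short excerpt from the first passage.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "damit man um acht Uhr in die Vorlesung kam"
asr_output = "damit man um acht Uhr in der Vorlesung kam"
# One substitution ("die" -> "der") in nine reference words.
print(round(word_error_rate(reference, asr_output), 3))  # 0.111
```

Note that this simple version treats differences in capitalization and hyphenation (for example "Hiwi-Job" versus "Hiwi Job") as errors; evaluations often normalize the text first.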
  • 44. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de KA3/Newsstream: Forced Alignment & Editing of Transcripts  If a complete and almost perfect transcription text is available, the missing time codes are generated by forced alignment  Input: audio file, transcript  Output: segmentation file (MPEG-7, ELAN)  Part of iFinder 3.0
  • 45. © Fraunhofer Joachim.Koehler@iais.fraunhofer.de Summary and Outlook Summary  Deep Learning and large corpora have led to massive progress for Speech2Text  Speech2Text provides good, though not perfect, transcription quality for broadcast speech (about 10% error)  Audio Mining is more than S2T: speech segmentation, speaker recognition, citations, …  Many advantages: annotation costs, immediate availability, more details and time codes  Some disadvantages: challenging recording conditions, explosion of metadata  Conclusion: Acceptance for Audio Mining/S2T is given!  Test account possible: https://nm-demo.iais.fraunhofer.de/customer_demo Outlook  Several research issues are still open (dialects, overlapping speech segments, …)  Further improvement is expected (evaluation of Deep Learning, more data, engineering)  Important issue: Integration into MAM workflows
  • 46. © Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS Let's do more with your data! Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS www.iais.fraunhofer.de Link: https://www.iais.fraunhofer.de/audiomining.html Contact Dr. Joachim Köhler Head of Image Processing +49 (0)2241 14-1900 joachim.koehler@iais.fraunhofer.de
  • 47. © Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS Disclaimer Copyright © by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Hansastraße 27 c, 80686 Munich, Germany All rights reserved. Responsible contact is: Katrin Berkler | Silke Loh | Public Relations | pr@iais.fraunhofer.de All copyrights for this presentation and their content are owned in full by the Fraunhofer-Gesellschaft, unless expressly indicated otherwise. Each presentation may be used for personal editorial purposes only. Modifications of images and text are not permitted. Any download or printed copy of this presentation material shall not be distributed or used for commercial purposes without prior consent of the Fraunhofer-Gesellschaft. Notwithstanding the above mentioned, the presentation may only be used for reporting on Fraunhofer-Gesellschaft and its institutes free of charge provided source references to Fraunhofer’s copyright shall be included correctly and provided that two free copies of the publication shall be sent to the above mentioned address. The Fraunhofer-Gesellschaft undertakes reasonable efforts to ensure that the contents of its presentations are accurate, complete and kept up to date. Nevertheless, the possibility of errors cannot be entirely ruled out. The Fraunhofer-Gesellschaft does not take any warranty in respect of the timeliness, accuracy or completeness of material published in its presentations, and disclaims all liability for (material or non-material) loss or damage arising from the use of content obtained from the presentations. The aforementioned disclaimer includes damages of third parties. Registered trademarks, names, and copyrighted text and images are not generally indicated as such in the presentations of the Fraunhofer-Gesellschaft.
However, the absence of such indications in no way implies that these names, images or text belong to the public domain and may be used unrestrictedly with regard to trademark or copyright law.