Fraunhofer IAIS Audio Mining: Automatic meta data generation of audio streams

Fraunhofer IAIS Audio Mining:
Automatic meta data generation of audio streams
FIAT/IFTA Media Management Seminar, Lugano 2017
Dr. Joachim Köhler
Head of Department NetMedia
Fraunhofer Institute for Intelligent Analysis and
Information Systems

© Fraunhofer IAIS
Fraunhofer is the largest organization for applied research in Europe
• More than 80 research institutions, including 69 Fraunhofer institutes
• More than 24,500 employees, the majority educated in the natural sciences or engineering
• An annual research volume of 2.1 billion euros, of which 1.9 billion euros is generated through contract research
• 2/3 of this research revenue derives from contracts with industry and from publicly financed research projects.
• 1/3 is contributed by the German federal government and the Länder governments in the form of institutional financing.
• International collaboration through representative offices in Europe, the US, Asia and the Middle East

Fraunhofer Institute Centre Schloss Birlinghoven
International Research in Big Data and Cognitive Computing
600 interdisciplinary scientists – 3 Institutes
• Fraunhofer Institute for Applied Information Technology FIT
• Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
• Fraunhofer Institute for Algorithms and Scientific Computing SCAI
One of the largest research locations for applied computer science and mathematics in Germany
Close cooperation with regional universities

Fraunhofer & digital archiving and broadcasting
Activities, portfolio, networking
• Several Fraunhofer Institutes have contributed to many seminars of the German VFM on automatic metadata generation
• Fraunhofer IAIS produced a study on future technologies for media archives and a concept for an innovative archive system: Media Data Hub
• Participation in many European research projects (LIVE, AXES, CubRIK, LinkedTV, MiCO)
• Workshop with directors of broadcast archives 2012
• Technology portfolio
  - Music & Video Analytics (IDMT)
  - Audio Mining (IAIS)
  - Media Data Hub (IAIS)
  - Quality Control and fingerprinting (IDMT)
[Figure: VFM Technology workshop 2012]

The Future of Media Archives: Strategic & Conceptual
Native crossmedia
• Crossmedia from data model to UI
• Using graph-based data models (e.g. Europeana)
Media Data Hub
• Linking and integration of data silos
• Bringing all metadata sources into one application (archive, legal, )
Massive automation of documentation
• Manual annotation will be reduced, process-optimized
• Future: up to 100% automatic annotation (like in press archives)
Close to the production environment
• Search and access immediately after production process
• Interfacing to production systems (OpenMedia, Avid, etc.)

Mining Technologies for Media Archiving
(Report »Archive system of the future – Strategic innovation concept«, Fraunhofer 2014); Technology readiness level (TRL)
• Text Mining
• Audio Mining
• Video Mining
• Object & face recognition
• Video OCR
• Image Similarity
• Audio and video fingerprinting
• Recommendation technologies
• Interactive data visualization
• Personalization and contextualization
• Faceted Search
• Linking of information items
Application: Assisted dossier creation
Criteria:
• Maturity level: 4-5
• Integration and operating effort: 3-5
• Added value: 5

Results of the 2nd FIAT/IFTA MAM Survey

Joachim.Koehler@iais.fraunhofer.de
Fraunhofer IAIS Audio Mining
• B2B Speech Recognition Solution for the Media Industry
• Key Facts
  - Large Vocabulary Continuous Speech Recognition (1,000,000 words) optimized for media content
  - Automatic structuring of audio-visual content
• Applications along the Media Asset Chain
  - Archive: Indexing and transcription of media archive content
  - Online: Search functionalities for media portals (e.g. InClip-Search) and content-based recommendation
  - TV-Distribution: Subtitling for TV content
  - SocialTV: Second Screen information enrichment
  - Advertising/Marketing: Video Search Engine Optimization (VSEO) and contextualized advertisement

SPEECH TECHNOLOGY AND SOLUTION
Audio Mining Solution

Audiomining
powered by Fraunhofer IAIS
Feature: Advantage for Customer
• Automatic Speech Segmentation: fast browsing through long videos; finding relevant segments quickly
• Speaker Clustering / Speaker Detection: searching for segments with a specific speaker; searching for statements by person
• Speech Recognition: search for relevant videos; search within videos for the relevant section
• Keyword Generation: generate a tag cloud; get a rough summary of the video

Speaker Diarization
• Unstructured audio recording
• Homogeneous segments
[Figure: timeline of one recording labelled on three levels: detection of speech (Speech), detection of gender (Male Voice, Female Voice) and speaker recognition (Speaker 1, Speaker 2)]
• Jingle recognition (e.g. programme), here: Start of News Show
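The step from an unstructured recording to homogeneous segments can be illustrated with a small sketch. It assumes that some classifier has already produced one label per fixed-length frame; the label names, the frame length and the function name are hypothetical and not taken from the Audio Mining system. The sketch only merges runs of identical labels into segments, it does not perform any detection itself.

```python
from itertools import groupby


def frames_to_segments(labels, frame_seconds=1.0):
    """Merge runs of identical per-frame labels into (start, end, label) segments."""
    segments = []
    position = 0  # index of the first frame of the current run
    for label, run in groupby(labels):
        length = len(list(run))
        start = position * frame_seconds
        end = (position + length) * frame_seconds
        segments.append((start, end, label))
        position += length
    return segments


# Hypothetical frame labels, one per second of audio
frame_labels = ["jingle", "jingle", "speaker_1", "speaker_1", "speaker_1", "speaker_2"]
segments = frames_to_segments(frame_labels)
for start, end, label in segments:
    print(f"{start:5.1f} s - {end:5.1f} s  {label}")
```

Each resulting segment carries a start time, an end time and one label, which is the shape of data a timeline view or a per-speaker search needs.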

Automatic Speech Recognition
• Converts the speech signal into written text
• Prerequisite for further steps (text mining)
• Based on statistical models that are trained on large amounts of data
• Three components:
  - Acoustic model (How do phonemes sound?)
  - Lexicon (How are words pronounced?)
  - Language model (Which words are probable?)
• Automatic speech recognition computes the most probable word sequence
[Figure: acoustic model, lexicon and language model feed the recognizer, which outputs the recognized text]
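The "most probable word sequence" is commonly written as the Bayes decision rule of statistical speech recognition. The symbols are notation introduced here, not taken from the slides: X stands for the sequence of acoustic observations and W for a candidate word sequence.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \; \underbrace{p(X \mid W)}_{\text{acoustic model and lexicon}} \cdot \underbrace{P(W)}_{\text{language model}}
```

The second form follows from Bayes' rule, because the term p(X) does not depend on W and can be dropped from the maximization.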

Progress in Speech Recognition
• Massive use of deep learning technology:
  - Improvement of acoustic modelling (many speakers, many speaking styles, etc.)
  - Gaussian Mixture Models (GMM) => Deep Neural Networks (DNN)
Microsoft Research
• Dahl, Yu, Deng, Acero (2012): Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
• Reduction of error rate from 23% to 13%

DNNs for Speech Recognition

Speech Recognition is currently one of the Top Technologies
DNN-based applications from Amazon, Microsoft, Google & Co.
• Amazon Alexa Echo 2016
• Apple: Siri 2015
• Google Now: 2015
• Microsoft: Cortana 2016

Deep Learning
A game changer towards artificial intelligence
• Speech recognition
• Image recognition
• Text understanding
• Machine translation
• Breast cancer diagnostics
• Game play
big data + machine learning = progress in AI
Source: Y. Bengio, ML tutorial, KDD 2014
Source: S. Jones, NVIDIA blog, 2014
Source: Microsoft Research, 2014
Source: Ciresan et al., Proc MICCAI, 2013
Source: Mnih et al., Nature, 2015
Source: Xiong et al., Science 2015

Speech Recognition System Setup (German)
• Acoustic training data: GER-TV 1000h (LREC 2014)
• Language model training data: 71.8 M words (news domain)
• Competitive on the German market, English system in progress
• Using deep neural networks (DNNs) for acoustic modelling (instead of Gaussian Mixture Models)
• Stable, continuous improvement, integration of up-to-date research results
GMM: Gaussian Mixture Model, DNN: Deep Neural Network
Year     Acoustic model  Language model  Training data [h]  WER [%] planned  WER [%] spontaneous
2012/13  1. GMM          3-gram, 200k    105                26.4             33.5
2013/14  2. GMM          3-gram, 200k    323                24.0             31.1
2014     3. DNN          3-gram, 200k    323                18.4             22.6
2015     4. DNN          5-gram, 510k    1005               13.3             16.5
2016     5. RNN          5-gram, 510k    1005               11.9             14.5
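To make the year-over-year progress in the table easier to read, the following sketch computes the relative WER reduction between consecutive system generations. The numbers are copied from the table; the function and variable names are illustrative only.

```python
# (year, acoustic model, WER planned [%], WER spontaneous [%]) copied from the table
SYSTEMS = [
    ("2012/13", "GMM", 26.4, 33.5),
    ("2013/14", "GMM", 24.0, 31.1),
    ("2014", "DNN", 18.4, 22.6),
    ("2015", "DNN", 13.3, 16.5),
    ("2016", "RNN", 11.9, 14.5),
]


def relative_reduction(old_wer, new_wer):
    """Relative WER reduction in percent of the old error rate."""
    return 100.0 * (old_wer - new_wer) / old_wer


for previous, current in zip(SYSTEMS, SYSTEMS[1:]):
    planned = relative_reduction(previous[2], current[2])
    spontaneous = relative_reduction(previous[3], current[3])
    print(f"{previous[0]} -> {current[0]}: "
          f"planned {planned:.1f} %, spontaneous {spontaneous:.1f} %")
```

Relative reduction is the usual way to compare recognizers, because an absolute gain of one percentage point means more at 12% WER than at 30% WER.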

Ongoing Research on RNN-CTC
Beyond HMM, HMM-DNN Approaches
• RNN-CTC: recurrent neural network (RNN) trained with Connectionist Temporal Classification (CTC). What's new: speech recognition is solved as an end-to-end machine learning task; everything is a (deep) recurrent neural network
• 1000 h speech corpus, ~2 weeks training time on a GPU cluster
• About 10% relative reduction in WER on average with RNN-CTC
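One part of CTC that is easy to show in isolation is how a per-frame label sequence is turned into the final output: repeated labels are merged, then the special blank symbol is removed. The sketch below shows only this collapsing rule (as used in greedy, best-path decoding), not the training criterion and not the Fraunhofer implementation. The blank symbol and the example frame sequence are made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems use a reserved label index


def ctc_collapse(frame_labels):
    """Apply the CTC collapsing rule: merge repeats first, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# One label per frame, e.g. the most probable network output at each time step
decoded = ctc_collapse(list("__hh_aa_ll_l_oo__"))
print(decoded)
```

The blank between the two runs of "l" is what allows a genuinely doubled letter to survive the merging step.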

Speaker Recognition using iVectors
[Figure: i-vectors shown as grids of numeric components (values omitted)]
iVector Comparison
• Sebastian Kurz
• Confidence: 0.05
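The slide does not say how two i-vectors are compared or how the confidence value is derived. A common choice for i-vectors is cosine scoring, sketched below as an assumption; the vectors are short toy examples (real i-vectors typically have a few hundred dimensions) and the names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for an enrolled speaker model and a test segment
enrolled_speaker = [1.0, 2.0, -1.0, 0.5]
test_segment = [0.9, 2.1, -0.8, 0.4]
other_speaker = [-1.0, 0.3, 2.0, -1.5]

print(cosine_similarity(enrolled_speaker, test_segment))
print(cosine_similarity(enrolled_speaker, other_speaker))
```

In such a scheme a score close to 1 suggests the same speaker, while a low or negative score suggests different speakers; the decision threshold would have to be tuned on held-out data.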

Fraunhofer IAIS Audio Mining: Technology
• Speaker diarization to structure recordings automatically (e.g. speaker information)
• ASR system based on the Kaldi open-source package
• Using deep neural networks
• Completely speaker-independent
• Real-time processing
• Trained on a large-scale German broadcast database of 1000 hours
• Service-oriented architecture to control and run the recognition engine
[Figure: architecture diagram with the components Web services, Messaging, Audio Mining core, Audio Mining Monitor and iFinder, the latter with several Structural Analysis and Automatic Speech Recognition instances]

USER INTERFACE

GUI: Media Search Interface
Search functionality:
Find audio and video files with
specific keywords, specific words
in the title or the transcript, or
with a specific series name.

GUI: Segmentation, Sub-Titles, Preview
Preview functionality:
Select a media file from the right-
hand side to watch it or listen to it.
Subtitles:
Audio Mining creates subtitles
based on the transcript and the
structural analysis results.
Segmentation/Speaker
clustering:
Audio Mining detects whenever
the speaker changes and divides
the media file into multiple
segments. Jump to a specific
segment by clicking the timeline.
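How subtitles can be derived from a transcript plus segment boundaries is sketched below. The slides do not name the subtitle format Audio Mining produces; the SubRip (SRT) layout is used here purely as an assumption, and the segment data is invented.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm as used in SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn (start, end, text) segments into SRT text, one numbered block each."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical segments: start and end in seconds plus the recognized text
example_segments = [
    (0.0, 2.5, "Good evening and welcome."),
    (2.5, 6.0, "Here are today's headlines."),
]
print(segments_to_srt(example_segments))
```

The segment boundaries would come from the structural analysis and the text from the transcript, so each speaker change naturally starts a new subtitle block.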

GUI: Word Positioning, Snippets
Advanced search functionality:
You are also able to search for a
specific word inside the transcript.
Word occurrences:
Marks indicate the occurrences of
the search term. Click on a mark
to jump to the corresponding
position in the video.
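Jumping from a search term to positions in the video requires that every recognized word carries a time stamp. A minimal sketch under that assumption follows; the data layout of (word, start time in seconds) pairs and the function name are hypothetical, not the Audio Mining data model.

```python
def find_occurrences(timed_words, term):
    """Return the start times (in seconds) of all words matching the search term."""
    needle = term.casefold()
    return [start for word, start in timed_words if word.casefold() == needle]


# Hypothetical time-aligned transcript: (word, start time in seconds)
timed_transcript = [
    ("Die", 0.0), ("Wahl", 0.4), ("in", 0.9), ("Berlin", 1.1),
    ("nach", 12.3), ("der", 12.6), ("Wahl", 12.8),
]
marks = find_occurrences(timed_transcript, "wahl")
print(marks)
```

Each returned time can be drawn as a mark on the timeline and used as the seek position when the mark is clicked.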

GUI: Keywords
Keywords:
Audio Mining generates keywords
for every media file, based on
particularly relevant words in the
transcript.
Again, marks indicate the
occurrences of the keyword. Click
on a mark to jump to the
corresponding position in the
video.
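The slides do not describe how "particularly relevant words" are selected. One widely used baseline is TF-IDF weighting, shown here as an assumption: a word scores high if it is frequent in one transcript but rare across the collection. The transcripts and names are invented for illustration.

```python
import math
from collections import Counter


def top_keywords(transcripts, doc_id, k=3):
    """Rank the words of one transcript by TF-IDF against the whole collection."""
    tokenized = {name: text.lower().split() for name, text in transcripts.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(tokenized)
    counts = Counter(tokenized[doc_id])
    scores = {word: count * math.log(n_docs / doc_freq[word])
              for word, count in counts.items()}
    # Highest score first; ties are broken alphabetically for a stable result
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


# Hypothetical mini collection of transcripts
transcripts = {
    "news": "the election results the election night",
    "sport": "the football match the final",
    "weather": "the rain and the storm",
}
keywords = top_keywords(transcripts, "news", k=2)
print(keywords)
```

Words that occur in every transcript, such as "the", get a score of zero and therefore never show up as keywords.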

GUI: Full transcript
Transcript:
Audio Mining provides a transcript
for every media file. Again, the
video or audio file is divided into
segments. Different colours
indicate different speakers.
You are able to export the
transcript to different file formats.

GUI: Recommendation
Recommendations:
You have just watched an exciting
video and are now looking for a
similar one? No problem! Audio
Mining recommends related media
files, based on the similarity of
their keywords.
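Recommending media files by the similarity of their keywords can be sketched with a simple set overlap measure. The slides do not state which similarity measure is used; Jaccard similarity is an assumption chosen for brevity, and the catalogue below is invented.

```python
def jaccard(a, b):
    """Share of keywords two assets have in common (0.0 = none, 1.0 = identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(keywords_by_asset, watched, top_n=2):
    """Return up to top_n other assets, most similar keyword sets first."""
    scored = [(jaccard(keywords_by_asset[watched], keywords), name)
              for name, keywords in keywords_by_asset.items() if name != watched]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked[:top_n] if score > 0]


# Hypothetical catalogue: asset name -> generated keywords
catalogue = {
    "a": {"election", "berlin", "parliament"},
    "b": {"election", "parliament", "vote"},
    "c": {"football", "final"},
    "d": {"berlin", "concert"},
}
suggestions = recommend(catalogue, "a")
print(suggestions)
```

Assets without any shared keyword are filtered out, so the list can be shorter than top_n.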

Audio Mining: Status
Demo system: https://nm-demo.iais.fraunhofer.de/customer_demo/
• Fraunhofer IAIS provides web-based test accounts for interested customers
  - https://nm-demo.iais.fraunhofer.de/$TV-station
  - HR, SWR, BR, RBB, ZDF, …
  - Easy to use, simple upload functionality
• Positive feedback
  - Segmentation and speaker diarization are very useful (improvement possible)
  - ASR quality is good for many types of radio and TV programmes
  - Feedback on keyword search and keyword access is very positive
  - The full transcript is useful
  - Keyword generation is an interesting alternative to a summary and a fixed semantic vocabulary
  - Export in several formats is possible

Audio Mining: Challenges and Research Issues
Feedback from media archive professionals of ARD
• Overlapping speech segments, voice-over
  - Short speaker turns are difficult to detect
  - Overlapping speech segments reduce ASR quality ("talk show")
  - Voice-over: starts in language 1, continues in language 2
  - Hard to solve
• Background noise, noisy conditions
  - Noise degrades ASR quality
  - Solutions: data augmentation, speech enhancement
• Very open domains, unlimited vocabulary, out-of-vocabulary (OOV) problem, names
  - Regular updates of the language models required (e.g. "Incirlik", "James Comey")
• Mixed/multiple languages
  - Foreign names (ARD pronunciation dictionary)
• Dialects
  - BR provides several dialects of the German language for research work
• Punctuation marks are required
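The out-of-vocabulary problem can be made measurable with a simple check of how many words of a reference text are missing from the recognizer's lexicon. The sketch below is illustrative only; the tiny lexicon and the sentence are invented, and the function name is hypothetical.

```python
def oov_rate(transcript_words, lexicon):
    """Return the share of words missing from the lexicon and the missing words."""
    known = {word.casefold() for word in lexicon}
    words = [word.casefold() for word in transcript_words]
    if not words:
        return 0.0, []
    unknown = [word for word in words if word not in known]
    return len(unknown) / len(words), sorted(set(unknown))


# Hypothetical lexicon and reference sentence
lexicon = {"der", "flughafen", "in", "wurde", "geschlossen"}
sentence = "Der Flughafen in Incirlik wurde geschlossen".split()
rate, missing = oov_rate(sentence, lexicon)
print(f"OOV rate: {rate:.1%}, missing: {missing}")
```

A word that is not in the lexicon cannot be recognized at all, which is why newly prominent names call for regular lexicon and language model updates.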

SYSTEM ARCHITECTURE

System architecture
[Figure: architecture diagram. Components: Clients (e.g. AREMA), Web interface, Web services, Messaging, Audio Mining core, Audio Mining Monitor, iFinder. Labelled message flows: import, analysis, status and deletion requests in one direction, asset status and details in return; analysis requests and analysis results; analysis priorities in one direction, asset details, processing updates and deletion updates in return]

System architecture: Audio Mining core
[Figure: architecture diagram (Clients such as AREMA, Web interface, Web services, Messaging, Audio Mining core, Audio Mining Monitor, iFinder) with the Audio Mining core expanded into its parts: database, search index and file system]

System architecture: iFinder
[Figure: architecture diagram (Clients such as AREMA, Web interface, Web services, Messaging, Audio Mining core, Audio Mining Monitor, iFinder) with iFinder expanded into several Structural Analysis and Automatic Speech Recognition instances]

System architecture: Audio Mining Monitor
[Figure: architecture diagram (Clients such as AREMA, Web interface, Web services, Messaging, Audio Mining core, Audio Mining Monitor, iFinder) with the Audio Mining Monitor expanded into its parts: database, HTTP server and messaging server]

Infrastructure and Scalability
• Server (1): Scheduling and Media Repository
  - VM, ≥ 2 cores (≥ 2 GHz, 64-bit), 30 GB RAM
  - SLES, JRE 8, MySQL, Bash 4
• Server (2): Audio analyses
  - Processing capacity per core (AMD Opteron 6234): 17 h of audio material per day, 4 GB RAM
  - For 20 h of audio data per day: ≥ 2 cores (≥ 2 GHz, 64-bit), 8 GB RAM
  - SLES
• Audio processing is fully scalable
  - Tested on 480 cores to process several thousand hours per day
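The sizing figures on this slide can be turned into a small back-of-the-envelope calculation. The sketch assumes that throughput scales linearly with the number of cores and that every core matches the quoted 17 hours of audio per day; both are simplifying assumptions, and the function names are hypothetical.

```python
import math

HOURS_PER_CORE_PER_DAY = 17  # figure quoted for one AMD Opteron 6234 core


def cores_needed(audio_hours_per_day):
    """Smallest number of cores that can keep up with the daily audio volume."""
    return math.ceil(audio_hours_per_day / HOURS_PER_CORE_PER_DAY)


def daily_capacity_hours(cores):
    """Hours of audio that the given number of cores can process per day."""
    return cores * HOURS_PER_CORE_PER_DAY


print(cores_needed(20))           # the 20 h/day example from the slide
print(daily_capacity_hours(480))  # the 480-core test setup
```

Under these assumptions 20 hours per day need two cores, which matches the stated minimum, and 480 cores correspond to roughly 8,000 hours per day, in line with "several thousand hours".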

Speech Recognition for Media Archiving
Customer: WDR, German Broadcaster (Archive Department)
Project facts:
• Integration of the Fraunhofer IAIS Audio Mining system into the WDR IT environment (ARCHIMEDES and IVZ)
• Content mining of large amounts of AV data, immediately!
• Better navigation and segmentation of radio and TV material
• Search in spoken utterances
• Full transcription and keyword generation
Technology provided by Fraunhofer:
• Broadcast speech recognition
• Automatic speech segmentation
[Figure labels: Structured preparation, Speech Recognition, Structured Segmentation]

Content Analytics for ARD Mediathek
Artificial Intelligence powered by Fraunhofer
• Content analytics of 200,000 media assets
• Advanced search and retrieval capabilities
• Full transcription of multimedia content
• Daily processing of 2000 new media assets from radio and TV
• Core technology for recommendation and personalization services
• Link: http://www.ardmediathek.de

Speech Recognition for the "ARD-Mediathek"
Customer: SWR / ARD.de editorial team (Link: www.ardmediathek.de), 2014/15
Project facts:
• Processing of 200,000 media assets (average duration 15 minutes/asset)
• Service-based (crawling, processing, metadata transfer)
• Daily amount: 2000 assets (update mechanism every 60 minutes)
Technology provided by Fraunhofer:
• Speaker diarization, speech recognition, keyword extraction

News-Stream
Real-time analysis of heterogeneous news streams
Objectives
• Big data infrastructure for efficient and real-time analysis of heterogeneous news streams
• Semantic analysis of multimodal and unstructured news data
• Piloting in real-life scenarios
Technologies and Applications
• Real-time speaker recognition
• Audio "citation" search
• Heatmap & Social Media Monitoring, …
• Project duration: 09/2014 to 12/2017
http://newsstreamproject.org/

KA3: Cologne Centre for Analysis and Archiving of AV Data
Centre Project of the German BMBF eHumanities Program
• Project objectives
  - Creation of a centre for e-Humanities research in Germany with a focus on AV data
• Contribution of Fraunhofer IAIS
  - Developing and providing tools for automatic analysis of speech and audio recordings (oral history scenario, interaction scenario)
• Use Case 1: Oral History
• Use Case 2: Interaction Scenario
• Duration: 10/2015 – 09/2018
• Partners: Univ. Köln, MPI for Psycholinguistics, Fernuniversität in Hagen

KA3: Use Case Interaction Scenario
Challenges:
• Very fast dialogues, short speaker turns
• Backchannel sounds ("mmh", "hmm", "ja", …)
• Overlapping speech segments
Technologies:
• Improved speaker clustering
• Speech/non-speech segmentation with deep learning
• Overlapping speech segments with RNN
• Automatic segmentation of speech recordings
[Figure: segmentation results for an arbitrary number of speakers, for at most 2 speakers and for exactly 2 speakers, compared with the reference segmentation]

KA3: Use Case Oral History
Speech Recognition: Reference & ASR Output
Example: Kruse (clean recording)
Reference (excerpt 1):
zwischendrin hatte ich natürlich auch versucht noch
mit bei der Medizin zu landen das war aber damals
deswegen so schwierig weil das glaube ich ein Jahr
war bevor der Numerus clausus in der Medizin
eingeführt wurde und man musste so mit
sechshundert Anfängern ungefähr um sechs Uhr auf
der Treppe sitzen damit man um acht Uhr in die
Vorlesung kam und das war für mich
ASR output (excerpt 1):
zwischendrin hatte ich natürlich auch versucht sich
noch beim bei der Medizin zu landen das war aber
damals deswegen so schwierig weil das glaube ich
ein Jahr war bevor der Numerus clausus in der
Medizin eingeführt wurde und man musste somit
sechshundert Anfängern ungefähr um sechs Uhr auf
der Treppe sitzen damit man um acht Uhr in der
Vorlesung kam und das war für mich
Reference (excerpt 2):
dann habe ich dieses Studium abgeschlossen und
hatte mich kurz auch mal dafür interessiert in eine
Berufstätigkeit im Entwicklungsdienst deutscher
Entwicklungsdienst hieß das glaube ich einzusteigen
hatte aber auch gleichzeitig so einen Hiwi-Job am
Institut und so blieb ich dann hängen und hatte
eben einfach die Chance weil man dann auch
gefördert wird oder die Chance hat in einem
bestimmten Projekt zu arbeiten dass ich dann daran
gedacht habe zu promovieren
ASR output (excerpt 2):
dann habe ich dieses Studium abgeschlossen und
hatte mich kurz auch mal dafür interessiert in eine
Berufstätigkeit im indem ein Entwicklungsland
Deutscher Entwicklungsdienst hieß das glaube ich
einzusteigen hatte aber auch gleichzeitig so ein ein
Hiwi Job am Institut und so lieblich dann hängen
und hatte eben einfach die Chance weil man dann
auch gefördert wird oder die Chance hat _ einen
bestimmten Projekt zu arbeiten dass ich dann daran
gedacht habe zu promovieren
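Comparing a reference transcript with ASR output is usually quantified as word error rate (WER): the minimum number of word substitutions, insertions and deletions divided by the number of reference words. The following sketch implements this with a standard edit-distance computation and applies it to a short phrase taken from the first excerpt; it is a generic illustration, not the evaluation tool used in the project.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions and deletions."""
    previous = list(range(len(hypothesis) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(reference, start=1):
        current = [i]  # aligning i reference words with nothing costs i deletions
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference)


# Short phrase from the first excerpt: reference vs. ASR output
reference = "noch mit bei der Medizin zu landen".split()
hypothesis = "sich noch beim bei der Medizin zu landen".split()
errors = word_errors(reference, hypothesis)
print(errors, f"{word_error_rate(reference, hypothesis):.1%}")
```

In this phrase the recognizer inserted "sich" and replaced "mit" by "beim", which gives two errors on seven reference words.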

KA3/Newsstream: Forced Alignment & Editing of Transcripts
• If a complete and almost perfect transcript is available, the missing time codes are generated by forced alignment
• Input: audio file, transcript
• Output: segmentation file (MPEG-7, ELAN)
• Part of iFinder 3.0

Summary and Outlook
Summary
• Deep learning and large corpora have led to massive progress in Speech2Text (S2T)
• Speech2Text provides good, though not perfect, transcription quality for broadcast speech (about 10% error)
• Audio Mining is more than S2T: speech segmentation, speaker recognition, citations, …
• Many advantages: annotation costs, immediate availability, more details and time codes
• Some disadvantages: challenging recording conditions, explosion of metadata
• Conclusion: Audio Mining/S2T has gained acceptance!
• Test account possible: https://nm-demo.iais.fraunhofer.de/customer_demo
Outlook
• Several research issues are still open (dialects, overlapping speech segments, …)
• Further improvement is expected (evaluation of Deep Learning, more data, engineering)
• Important issue: integration into MAM workflows

Let's do more with your data!
Fraunhofer Institute for Intelligent Analysis and
Information Systems IAIS
www.iais.fraunhofer.de
Link: https://www.iais.fraunhofer.de/audiomining.html
Contact
Dr. Joachim Köhler
Head of Image Processing
+49 (0)2241 14-1900
joachim.koehler@iais.fraunhofer.de

Disclaimer
Copyright © by
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Hansastraße 27 c, 80686 Munich, Germany
All rights reserved.
Responsible contact is: Katrin Berkler | Silke Loh | Public Relations | pr@iais.fraunhofer.de
All copyrights for this presentation and their content are owned in full by the Fraunhofer-Gesellschaft, unless
expressly indicated otherwise.
Each presentation may be used for personal editorial purposes only. Modifications of images and text are not
permitted. Any download or printed copy of this presentation material shall not be distributed or used for
commercial purposes without prior consent of the Fraunhofer-Gesellschaft.
Notwithstanding the above mentioned, the presentation may only be used for reporting on Fraunhofer-
Gesellschaft and its institutes free of charge provided source references to Fraunhofer’s copyright shall be included
correctly and provided that two free copies of the publication shall be sent to the above mentioned address.
The Fraunhofer-Gesellschaft undertakes reasonable efforts to ensure that the contents of its presentations are
accurate, complete and kept up to date. Nevertheless, the possibility of errors cannot be entirely ruled out. The
Fraunhofer-Gesellschaft does not take any warranty in respect of the timeliness, accuracy or completeness of
material published in its presentations, and disclaims all liability for (material or non-material) loss or damage
arising from the use of content obtained from the presentations. The afore mentioned disclaimer includes damages
of third parties.
Registered trademarks, names, and copyrighted text and images are not generally indicated as such in the
presentations of the Fraunhofer-Gesellschaft. However, the absence of such indications in no way implies that
these names, images or text belong to the public domain and may be used unrestrictedly with regard to trademark
or copyright law.
