Czech Malach Cross-lingual
Speech Retrieval Test Collection
Petra Galuščáková
galuscakova@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics
Charles University in Prague
5. 3. 2016
2
USC Shoah Foundation's
Visual History Archive
● Established to collect and preserve
the testimonies of survivors and other witnesses of the
Holocaust
● Founded in 1994 by Steven Spielberg
● Interviews with Jewish survivors, Roma and Sinti survivors,
liberators, survivors of eugenics policies, political prisoners,
aid providers, homosexual survivors, war crimes trial
participants, ...
● Almost 52,000 videotaped testimonies in 56 countries and 32
languages collected between 1994 and 2000
● One of the largest available audio-visual archives
● http://sfi.usc.edu/
3
Malach Centre
for Visual History
● Provides local access to the
digital archives of the USC Shoah Foundation
● Need to retrieve relevant segments of interviews
● Provide a test collection for the retrieval system
created in the Malach project
● http://ufal.mff.cuni.cz/cvhm
4
Czech Malach Cross-lingual
Speech Retrieval Test Collection
● 353 audio recordings (592 hours of audio) randomly
selected from the set of Czech interviews
● Four automatic transcripts by different providers
● Manual topical annotations
● Manually entered metadata (PIQ, Thesaurus)
● Planned to be published in April 2016
● http://ufal.mff.cuni.cz/malach-test-collection
5
Audience
● Historians, teachers, students
● Information Retrieval (IR)
● Cross-lingual IR
● CLEF 2006, 2007 Cross-Language Speech Retrieval
Track
● Speech processing
● Sentiment analysis
● Machine translation
● Social studies
...
6
Collection
● Form of interviews
● Average length: 1 hour
and 41 minutes
● Recorded on tapes
(~ 30 minutes long),
which were digitized
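As a rough consistency check, the average interview length can be recomputed from the collection totals stated on the slides (353 recordings, 592 hours of audio). The sketch below is illustrative only: the function name `average_length` is hypothetical, and the tape count is an estimate that assumes each tape holds about 30 minutes.

```python
import math


def average_length(total_hours: float, recordings: int) -> tuple[int, int]:
    """Return the mean recording length as (hours, minutes), rounded to the minute."""
    total_minutes = round(total_hours * 60 / recordings)
    return divmod(total_minutes, 60)


hours, minutes = average_length(592, 353)
print(f"{hours} h {minutes} min")  # 1 h 41 min

# Estimated number of tapes for an average interview (assumption: ~30 min per tape).
tapes = math.ceil((hours * 60 + minutes) / 30)
print(tapes)  # 4
```

The result matches the stated average of 1 hour and 41 minutes, which suggests that the average is taken over the 353 selected recordings.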
7
Transcripts
● Provided by IBM (2003), The Johns Hopkins
University (2004, 2006) and
University of West Bohemia (2013)
● In 1-best, MLF and XML format
● Lattices available for 2013 transcripts
● XML transcripts are morphologically tagged
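The slides do not describe the exact layout of the MLF transcripts. Assuming they follow the HTK Master Label File convention (a `#!MLF!#` header, a quoted file pattern, lines of `start end token` with times in units of 100 ns, and a closing `.`), a minimal reader could be sketched as follows. The function name `parse_mlf` and the sample data are hypothetical and are not taken from the collection.

```python
from typing import Dict, List, Tuple

Word = Tuple[float, float, str]  # (start_seconds, end_seconds, token)


def parse_mlf(text: str) -> Dict[str, List[Word]]:
    """Parse HTK-style MLF text into {file_pattern: [(start, end, token), ...]}."""
    entries: Dict[str, List[Word]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the file header
        if line.startswith('"') and line.endswith('"'):
            current = line.strip('"')  # start of a new transcript entry
            entries[current] = []
        elif line == ".":
            current = None  # end of the current entry
        elif current is not None:
            fields = line.split()
            if len(fields) >= 3:
                start, end, token = int(fields[0]), int(fields[1]), fields[2]
                # HTK label times are expressed in units of 100 ns.
                entries[current].append((start / 1e7, end / 1e7, token))
    return entries


# Invented sample, for illustration only.
sample = '''#!MLF!#
"*/example.rec"
0 5000000 dobrý
5000000 12000000 den
.
'''
words = parse_mlf(sample)["*/example.rec"]
print(words)
```

Lines with fewer than three fields (for example, labels without timing) are skipped in this sketch; a real reader would need to be adapted to whatever the published files actually contain.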
8
Topics
● Annotators manually marked topically coherent segments
and assigned a single topic to each detected segment.
● The set of topics was originally created for the annotation of the VHA.
● Topics for the Czech collection were selected from this set.
● Some of the topics were adapted to better reflect Czech
realities.
● 5,375 annotations for 118 topics by 6 annotators (librarians
and historians)
● Divided into training, test and excluded sets
● All topics are in Czech and English
● Some topics are also in French, German and Spanish
9
Topic Examples I
Number: 1173
Name: Children's art in Terezin
Description: We are looking for the description of the art-related activities of children in Terezin such as music, plays, paintings, writings and poetry
Narrative: The relevant material should include discussions of such activities and how they influenced the survival and following life of the children. Any episodes where the interviewee demonstrates examples of such an art are highly relevant.

Number: 1286
Name: Music in the Holocaust
Description: Tell us if music helped (spiritually or otherwise) or hindered the prisoners interned in concentration camps
Narrative: Descriptions of what role music played in the life of the prisoners.
10
Topic Examples II
● Daily life in Terezin
● Jewish children in schools
● The liberation of Buchenwald and Dachau
● Jewish partisans in Italy
● Strengthening faith
● Hidden children and rescuers
● Bombing of Birkenau and Buchenwald
● Minsk ghetto underground
...
11
Annotations I
● Several topics were annotated by two annotators
● 2 topics annotated by all annotators
● Search Guided Relevance Assessments
● Set of possible relevant segments was automatically
restricted by an IR system, Thesaurus keywords, and PIQ
● Annotators entered queries and watched the retrieved
parts of recordings
● Each topic was processed in approximately 20 hours
● Highly-ranked Assessments
● Annotators manually evaluated runs submitted to the CLEF
campaign.
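The slides do not say how the submitted runs were turned into a set of segments for manual judgment. A common approach in IR evaluation campaigns is pooling: for each topic, the top-ranked results of every run are merged and the union is judged. The sketch below assumes that approach; the function name `build_pool`, the pool depth, and the segment IDs are all hypothetical.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Union of the top-`depth` segment IDs from every run, for a single topic."""
    pool: Set[str] = set()
    for ranked_segments in runs.values():
        pool.update(ranked_segments[:depth])
    return pool


# Invented ranked lists for one topic, best result first.
runs = {
    "run_a": ["seg3", "seg1", "seg7", "seg2"],
    "run_b": ["seg1", "seg9", "seg3", "seg5"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['seg1', 'seg3', 'seg9']
```

Segments outside the pool stay unjudged, which is one reason to combine this kind of assessment with the search-guided assessments listed above.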
12
Annotations II
● Average segment length is 167 seconds
● On average, 44 relevant segments were found
for each topic.
13
Thesaurus
● English Thesaurus with 60,000 keywords
● Terms are hierarchically organized
● Label, definition and scope
● Alternative labels (synonyms)
● Czech Thesaurus
● Labels were translated manually
● Part of the definitions (e.g. complete categories Culture,
Daily Life, Discrimination, Liberation) and scope
translated manually
● The rest of the Thesaurus was translated automatically
15
Conclusion
● Czech Malach Collection
● Cleared manual annotations of topics of segments
in recordings
● Translations of topics
● Partially manually translated Thesaurus
● Cross-Language Speech Retrieval
16
Thank you
http://ufal.mff.cuni.cz/malach-test-collection
