The document summarizes the Czech Malach Cross-lingual Speech Retrieval Test Collection, which contains 353 audio recordings selected from interviews in the USC Shoah Foundation's Visual History Archive. The collection includes automatic transcripts of the interviews in multiple formats, as well as manual topic annotations of segments and metadata. It is intended to help researchers in fields like information retrieval, machine translation, and social studies by providing a test bed for cross-lingual speech retrieval systems.
Czech Malach Cross-lingual Speech Collection
1. Czech Malach Cross-lingual
Speech Retrieval Test Collection
Petra Galuščáková
galuscakova@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics
Charles University in Prague
5. 3. 2016
2. 2
USC Shoah Foundation's
Visual History Archive
● Established to collect and preserve
the testimonies of survivors and other witnesses of the
Holocaust
● Founded in 1994 by Steven Spielberg
● Interviews with Jewish survivors, Roma and Sinti survivors,
liberators, survivors of the eugenics policies, political prisoners,
aid providers, homosexual survivors, war crimes trials
participants, ...
● Almost 52 000 videotaped testimonies in 56 countries and 32
languages collected between 1994 and 2000
● One of the largest available audio-visual archives
● http://sfi.usc.edu/
3. 3
Malach Centre
for Visual History
● Provides local access to the
digital archives of the USC Shoah Foundation
● Need to retrieve relevant segments of interviews
● Provide a test collection for the retrieval system
created in the Malach project
● http://ufal.mff.cuni.cz/cvhm
4. 4
Czech Malach Cross-lingual
Speech Retrieval Test Collection
● 353 audio recordings (592 hours of audio) randomly
selected from the set of Czech interviews
● Four automatic transcripts by different providers
● Manual topical annotations
● Manually entered metadata (PIQ, Thesaurus)
● Planned to be published in April 2016
● http://ufal.mff.cuni.cz/malach-test-collection
5. 5
Audience
● Historians, teachers, students
● Information Retrieval (IR)
● Cross-lingual IR
● CLEF 2006, 2007 Cross-Language Speech Retrieval
Track
● Speech processing
● Sentiment analysis
● Machine translation
● Social studies
...
6. 6
Collection
● Form of interviews
● Average length: 1 hour
and 41 minutes
● Recorded on tapes
(~ 30 minutes long),
which were digitized
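As a quick consistency check, the average interview length can be recomputed from the totals given for the collection (592 hours of audio across 353 recordings). This is only an arithmetic sketch; the function name is illustrative and not part of the collection.

```python
# Totals as stated for the collection: 592 hours of audio in 353 recordings.
TOTAL_HOURS = 592
RECORDINGS = 353


def average_length(total_hours, recordings):
    """Return the mean recording length as (hours, minutes), rounded to the minute."""
    total_minutes = round(total_hours * 60 / recordings)
    return divmod(total_minutes, 60)


hours, minutes = average_length(TOTAL_HOURS, RECORDINGS)
print(f"Average length: {hours} h {minutes} min")  # prints "Average length: 1 h 41 min"
```

The result agrees with the stated average of 1 hour and 41 minutes.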
7. 7
Transcripts
● Provided by IBM (2003), The Johns Hopkins
University (2004, 2006) and
University of West Bohemia (2013)
● In 1-best, MLF and XML format
● Lattices available for 2013 transcripts
● XML transcripts are morphologically tagged
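A minimal sketch of how an MLF transcript might be read, assuming the files follow the usual HTK Master Label File layout: a "#!MLF!#" header, a quoted label name per recording, one word per line with optional start and end times in 100 ns units, and a single "." closing each entry. The exact layout of the files in this collection is not described here, so the sample text, the entry name, and the names `Word` and `parse_mlf` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are expressed in 100 ns units


@dataclass
class Word:
    start: Optional[float]  # seconds, or None when the line carries no times
    end: Optional[float]
    text: str


def parse_mlf(text: str) -> Dict[str, List[Word]]:
    """Parse HTK-style MLF text into {entry name: [Word, ...]}."""
    entries: Dict[str, List[Word]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the file header
        if line.startswith('"') and line.endswith('"'):
            current = line.strip('"')  # a quoted name opens a new entry
            entries[current] = []
        elif line == ".":
            current = None  # a lone period closes the entry
        elif current is not None:
            fields = line.split()
            if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
                start = int(fields[0]) / HTK_UNITS_PER_SECOND
                end = int(fields[1]) / HTK_UNITS_PER_SECOND
                entries[current].append(Word(start, end, fields[2]))
            else:
                entries[current].append(Word(None, None, fields[0]))
    return entries


# Hypothetical sample, not taken from the collection.
SAMPLE = '''#!MLF!#
"*/sample.rec"
0 3000000 dobrý -120.5
3000000 7000000 den -98.0
.
'''

for name, words in parse_mlf(SAMPLE).items():
    print(name, [(w.text, w.start, w.end) for w in words])
```

Converting times to seconds at parse time keeps later alignment with the manually marked segments simple, since segment lengths are reported in seconds.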
8. 8
Topics
● Annotators manually marked topically coherent segments
and assigned a single topic to each detected segment.
● Topics were drawn from the set created for the annotation of the VHA.
● A subset of these topics was selected for the Czech collection.
● Some of the topics were adapted to better reflect Czech
realities.
● 5,375 annotations for 118 topics by 6 annotators (librarians
and historians)
● Divided into training, test and excluded sets
● All topics are in Czech and English
● Some topics are also in French, German and Spanish
9. 9
Topic Examples I
Number: 1173
Name: Children's art in Terezin
Description: We are looking for the description of the art-related activities of children in Terezin such as music, plays, paintings, writings and poetry.
Narrative: The relevant material should include discussions of such activities and how they influenced the survival and following life of the children. Any episodes where the interviewee demonstrates examples of such an art are highly relevant.

Number: 1286
Name: Music in the Holocaust
Description: Tell us if music helped (spiritually or otherwise) or hindered the prisoners interned in concentration camps.
Narrative: Descriptions of what role music played in the life of the prisoners.
10. 10
Topic Examples II
● Daily life in Terezin
● Jewish children in schools
● The liberation of Buchenwald and Dachau
● Jewish partisans in Italy
● Strengthening faith
● Hidden children and rescuers
● Bombing of Birkenau and Buchenwald
● Minsk ghetto underground
...
11. 11
Annotations I
● Several topics were annotated by two annotators
● 2 topics annotated by all annotators
● Search Guided Relevance Assessments
● Set of possible relevant segments was automatically
restricted by an IR system, Thesaurus keywords, and PIQ
● Annotators entered queries and watched the retrieved
parts of recordings
● Each topic was processed in approximately 20 hours
● Highly-ranked Assessments
● Annotators manually evaluated runs submitted to the CLEF
campaign.
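Because some topics were judged by more than one annotator, agreement between annotators can be measured on those topics. The slides do not say which agreement measure, if any, was used; the sketch below uses Cohen's kappa as one common choice, and the judgment lists are made-up illustrative data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical relevance judgments (1 = relevant, 0 = not relevant)
# for eight segments of a doubly annotated topic.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohen_kappa(annotator_1, annotator_2))
```

In this toy example the annotators agree on 6 of 8 segments (0.75 observed agreement) against 0.5 expected by chance, giving a kappa of 0.5.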
12. 12
Annotations II
● Average segment length is 167 seconds
● On average, 44 relevant segments were found
for each topic.
13. 13
Thesaurus
● English Thesaurus with 60,000 keywords
● Terms are hierarchically organized
● Label, definition and scope
● Alternative labels (synonyms)
● Czech Thesaurus
● Labels were translated manually
● Part of the definitions (e.g. complete categories Culture,
Daily Life, Discrimination, Liberation) and scope
translated manually
● The rest of the Thesaurus was translated automatically
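The Thesaurus structure described above (hierarchical terms, each with a label, definition, scope and alternative labels, plus Czech translations that are partly manual and partly automatic) could be modelled roughly as follows. This is a sketch under assumptions: the field names, the identifiers, the single-parent hierarchy, and the two sample entries are hypothetical and do not reflect the actual Thesaurus file format or contents.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    term_id: str                      # hypothetical identifier
    label_en: str
    label_cs: str                     # labels were translated manually
    definition_en: str = ""
    scope_en: str = ""
    alt_labels: List[str] = field(default_factory=list)  # synonyms
    parent_id: Optional[str] = None   # assumes one parent per term
    definition_auto_translated: bool = False  # True where the Czech text is machine-translated


def path_to_root(terms: Dict[str, Term], term_id: str) -> List[str]:
    """Return English labels from the given term up to its top-level ancestor."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = term_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at term {current}")
        seen.add(current)
        term = terms[current]
        path.append(term.label_en)
        current = term.parent_id
    return path


# Invented sample entries for illustration only.
terms = {
    "t1": Term("t1", "Culture", "Kultura"),
    "t2": Term("t2", "music", "hudba", alt_labels=["songs"], parent_id="t1"),
}
print(path_to_root(terms, "t2"))  # prints ['music', 'Culture']
```

Keeping a flag for automatically translated text makes it possible to restrict experiments to the manually translated categories when translation quality matters.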