Presented by Kai Chan | UCLA - See complete conference videos - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
The UCLA Communication Studies Archive hosts a collection of over 100,000 hours of digital television news, updated daily. Its search engine provides closed captioning search and online streaming of videos. The search engine allows researchers and students in various fields to study television news, images and language usage in ways that were not possible before. In this presentation, we will show the setup of our Lucene/Solr-powered search engine, as well as how it is being used. We will discuss our work on custom result formats, such as linking search result text to the video at particular timestamps, counting occurrences of words, phrases or patterns, grouping results by fields such as month or show, and creating interactive charts. We will also discuss our work on extending Lucene’s proximity searches and creating custom query types, such as segment-enclosed (two or more words, phrases or patterns occurring within a story-based text segment), time-enclosed (two or more words, phrases or patterns occurring within a certain time), and multi-word regular expression queries. Future goals will also be discussed, such as supporting multiple languages, multiple sources (speech-to-text alongside closed-captioning text), searching user-contributed and generated metadata (programs that identify story segments, objects in video, etc.), and syntactic tags (such as parts of speech).
Television News Search and Analysis with Lucene/Solr
1. Television News Search and Analysis with Lucene/Solr
Kai Chan <kai@ssc.ucla.edu>
Social Sciences Computing
University of California, Los Angeles
Lucene Revolution, May 10, 2012
2. Communication Studies Archive Background (1)
• Continuation of analog recording of TV news
  – Thousands of tapes since Watergate/1970s
  – Hard to look for a particular news program or topic
3. Communication Studies Archive Background (2)
• Digital recording since 2005
• Capture news programs on computers
  – Video: can be streamed over the Web
  – Closed captioning (“subtitle text”): indexed and searchable
  – Image snapshots
  – Search engine and analysis tools
4. Communication Studies Archive Background (3)
• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day
6. Why This is Important (1)
• Researchers
  – Large and unique collection of communication
  – Many modalities
    • Speech, facial expression, body gesture, etc.
  – Different conditions/settings
  – Different networks and communities
  – Allows study of TV news + communication in general in ways impossible before
7. Why This is Important (2)
• Non-researchers
  – TV news about presentation and persuasion
    • Which happen in daily life also
  – TV main source of news for many/most
  – Greatly affects the public’s decisions
  – Learn about what we watch
15. Application in Research
• Communication Studies
  – Amount of coverage for events over time
• Linguistic
  – Speech and language patterns
• Computer Science
  – Object identification
  – Identify news anchors, public figures
  – Story segmentation
16. Application in Teaching (1)
• Chicano Studies: Representations of Latinos on the Television News
  – May 1, 2007 immigration march
  – MacArthur Park, Los Angeles, CA
  – 2 days (May 1 & 2, 2007)
  – Framing, stereotyping, metaphor, silencing
  – Reports with screenshots and links to news stories
17. Application in Teaching (2)
• Communication Studies: Presidential Communication
  – 2008 presidential primary
  – 6 weeks (Dec 2007 to Feb 2008)
  – Coverage of sound bites
    • Amount of time given to candidate/party
    • Types of response (positive, neutral, negative)
  – Students created their own political ad.
18. Work flow (1): Capture/conversion machines
• 2 groups, 2 machines per group
  – Keep the best recording
  – 6 TV tuners per machine
• Capture video and CC to separate files in real-time
  – MPEG-TS (~7 GB/hr)
  – Timestamp every 2-3 seconds
• Generate image snapshots
• Convert videos
  – MP4/H.264 (VGA, ~240 MB/hr)
[Diagram: capture/conversion machines, backup storage server, storage/control server, image server, video streaming server, search server]
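The two bitrates on this slide can be put side by side with a little arithmetic. This is only a rough sketch: it assumes decimal units (1 GB = 1000 MB), takes the "over 100,000 hours" figure from the abstract, and assumes every hour is kept as a VGA MP4 at the quoted rate, which the slides do not state.

```python
# Approximate figures quoted on the slide and in the abstract.
MPEG_TS_GB_PER_HOUR = 7.0   # raw capture, "~7 GB/hr"
MP4_MB_PER_HOUR = 240.0     # converted video, "~240 MB/hr"
ARCHIVE_HOURS = 100_000     # "over 100,000 hours" (abstract)


def compression_ratio():
    """How many times smaller the MP4 is than the MPEG-TS capture."""
    return MPEG_TS_GB_PER_HOUR * 1000 / MP4_MB_PER_HOUR


def mp4_archive_terabytes(hours=ARCHIVE_HOURS):
    """Estimated size of the converted archive in TB (assumption: all hours kept as MP4)."""
    return hours * MP4_MB_PER_HOUR / 1_000_000


print(round(compression_ratio(), 1))   # about 29x smaller
print(mp4_archive_terabytes())         # 24.0 TB under the stated assumptions
```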
19. Work flow (2): Storage/static file servers
• Control server
  – Download TV schedules
  – Download web-streamed news programs
  – Collect and check recordings
  – Pushes files to places
• Video streaming server
• Backup storage server
• Image server
20. Work flow (3): Search server
• Lucene index updated daily
  – Main text field tokenized
  – Separate fields for date, network, show, etc.
  – Binary fields for segment and time data
• Hosts search engine
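The field layout described above might look roughly like the following. This is a minimal sketch, not the archive's actual schema: the field names (`text`, `date`, `network`, `show`, `segments`), the sample values, and the fixed-width big-endian encoding of the binary field are all assumptions made for illustration.

```python
import struct


def pack_pairs(pairs):
    """Encode (start, end) integer pairs as big-endian unsigned 32-bit values."""
    flat = [value for pair in pairs for value in pair]
    return struct.pack(f">{len(flat)}I", *flat)


def unpack_pairs(blob):
    """Decode bytes produced by pack_pairs back into (start, end) tuples."""
    count = len(blob) // 4
    flat = struct.unpack(f">{count}I", blob)
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical index document for one news program.
doc = {
    "text": "good evening here is our top story tonight",  # tokenized at index time
    "date": "2008-09-15",                                  # separate, untokenized fields
    "network": "CNN",
    "show": "Example Show",
    "segments": pack_pairs([(0, 2), (2, 8)]),              # token-position ranges as bytes
}

print(unpack_pairs(doc["segments"]))  # [(0, 2), (2, 8)]
```

Keeping the segment list as one opaque binary value per document means it is read back only for documents that already matched, which fits the filter-after-search approach described later in the talk.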
21. The search process
[Diagram: how the user interacts with the three servers]
• User
  – Watches videos: video server (video files, video streaming server (Wowza))
  – Retrieves thumbnails and montages: image server (web server (Apache), thumbnails & montages)
  – Performs searches: search server
• Search server
  – Web server, custom code (PHP) front end
  – PHP-Java Bridge or Solr bridge
  – Custom code (Java), Lucene back end
  – MySQL database, Lucene index
22. Custom query type: Segment-enclosed query (1)
• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z”
  – How to pick Y?
  – Hard to pick a fixed number
23. Custom query type: Segment-enclosed query (2)
• Problem 2: all matched search words might not be talking about same story
  – E.g. “Obama AND visit AND Afghanistan”
  – Might match a news program about Obama’s visit to Canada + violence in Afghanistan
24. Custom query type: Segment-enclosed query (3)
• A news program can contain several stories
  – E.g. local, national, world, weather, sports
25. Custom query type: Segment-enclosed query (4)
local story 1
local story 2
commercials
national story 1
national story 2
weather 1
commercials
world story 1
world story 2
weather 2
commercials
health
entertainment
sports
26. Custom query type: Segment-enclosed query (5)
• One solution: search for “X and Z within same story segment”
  – Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
  – E.g. politics
27. Custom query type: Segment-enclosed query (6)
• How to mark segments
  – Automated
    • Computer Science researchers working on them
    • Word frequency
    • Scene change
    • Black frame and silence
  – Manual segmentation
    • Watch the video
    • Decide where a story starts and ends
    • Mark positions in semi-automated system
28. Custom query type: Segment-enclosed query (7)
[Diagram: a timeline marked with seg. 1 begin, seg. 1 end, seg. 2 begin, seg. 2 end, seg. 3 begin and seg. 3 end, with spans 1 to 5 drawn against these boundaries]
29. Custom query type: Segment-enclosed query (8)
• Idea
  – Get spans from SpanNearQuery
  – Filter and keep those fully within segments
• In production: segment info in stored fields
  – As a list of <start position, end position>
  – Simple to implement
  – Reasonably fast searching
• Alternative: store segment info as terms
  – Possible to find segments by themselves
  – Appears to run much faster
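The filtering idea above can be sketched in plain Python. This is not the production Java code: `near_spans` is a simplified stand-in for the spans a SpanNearQuery would produce, and the token list, the segment boundaries and both function names are hypothetical. Segments are assumed to be sorted, non-overlapping <start position, end position> pairs with an exclusive end.

```python
from bisect import bisect_right


def near_spans(tokens, first, second, slop):
    """Return (start, end) spans where `first` is followed by `second`
    with at most `slop` other tokens in between (end is exclusive)."""
    starts = [i for i, tok in enumerate(tokens) if tok == first]
    ends = [i for i, tok in enumerate(tokens) if tok == second]
    return [(p, q + 1) for p in starts for q in ends if 0 < q - p <= slop + 1]


def enclosed_spans(spans, segments):
    """Keep only the spans that lie fully inside a single segment."""
    seg_starts = [start for start, _ in segments]
    kept = []
    for start, end in spans:
        i = bisect_right(seg_starts, start) - 1  # last segment starting at or before `start`
        if i >= 0 and end <= segments[i][1]:
            kept.append((start, end))
    return kept


tokens = ("obama visits canada today violence continues in afghanistan "
          "obama to visit afghanistan").split()
segments = [(0, 4), (4, 8), (8, 12)]  # three stories, as token-position ranges

spans = near_spans(tokens, "obama", "afghanistan", slop=10)
print(spans)                            # [(0, 8), (0, 12), (8, 12)]
print(enclosed_spans(spans, segments))  # [(8, 12)]
```

Only the span from the third story survives: the other two pair "obama" from the Canada story with "afghanistan" from a different story, which is exactly the false match described in Problem 2.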
30. Custom query type: Time-enclosed query
[Diagram: a timeline marked every 5 s from 20 s to 60 s, with five spans and their time extents]
• span 1: <= 20 s
• span 2: <= 15 s
• span 3: <= 10 s
• span 4: <= 35 s
• span 5: <= 25 s
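A time-enclosed filter can be sketched the same way as a position-based one, with a time limit in place of a word count. The sketch below assumes that every token carries the time (in seconds) of the most recent caption timestamp; the slides do not say how token positions are mapped to times, so `token_times` and the function name are hypothetical.

```python
def time_enclosed_spans(spans, token_times, max_seconds):
    """Keep (start, end) spans whose first and last tokens are at most
    `max_seconds` apart. `end` is exclusive."""
    kept = []
    for start, end in spans:
        duration = token_times[end - 1] - token_times[start]
        if duration <= max_seconds:
            kept.append((start, end))
    return kept


# One time value per token; captions are timestamped every few seconds,
# so neighbouring tokens often share a value.
token_times = [20, 20, 22, 25, 27, 30, 33, 35, 40, 45]
spans = [(0, 4), (2, 10), (5, 8)]

print(time_enclosed_spans(spans, token_times, max_seconds=10))  # [(0, 4), (5, 8)]
```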
31. Custom query type: Multi-term regular expression (1)
• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
  – Not just individual words
• Lucene core has regular expression query support
  – Good starting point
  – Not a complete solution for us
32. Custom query type: Multi-term regular expression (2)
• Problems
  – Some analyzers do not work with RegEx
  – Lucene’s RegEx query classes only apply RegEx to individual terms
    • Want to match a pattern against a phrase/sentence
    • Want placeholders for whole words (not just characters)
  – Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index
    • Very slow
    • Takes lots of memory
33. Custom query type: Multi-term regular expression (3)
• What we did
  – Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
    • E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
  – Leading and trailing placeholders
    • E.g. “_ _ is the _ _ _”
    • Preserve for correctness
    • Store word count for each document
    • Expand each span on both sides
    • Bounds checking
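The handling of placeholders can be illustrated without Lucene. The sketch below matches a multi-term pattern directly against a token list rather than building SpanNearQuery and RegexQuery objects, so it is a stand-in for the translation step, not the actual implementation. `parse` and `find_matches` are hypothetical names; `_` stands for exactly one token, and every other whitespace-separated part is treated as a regular expression for a single term.

```python
import re


def parse(pattern):
    """Split a multi-term pattern into per-token regexes; '_' becomes None."""
    return [None if part == "_" else re.compile(part) for part in pattern.split()]


def find_matches(tokens, pattern):
    """Return (start, end) spans of `tokens` matching `pattern` (end exclusive)."""
    elements = parse(pattern)
    # Count leading and trailing placeholders and strip them from the core.
    lead = 0
    while lead < len(elements) and elements[lead] is None:
        lead += 1
    trail = 0
    while trail < len(elements) - lead and elements[-1 - trail] is None:
        trail += 1
    core = elements[lead:len(elements) - trail]
    if not core:
        return []  # placeholders only: would match every position, so reject
    matches = []
    for start in range(len(tokens) - len(core) + 1):
        window = tokens[start:start + len(core)]
        if all(rx is None or rx.fullmatch(tok) for rx, tok in zip(core, window)):
            # Expand the core span on both sides, then bounds-check
            # against the document's word count.
            begin, end = start - lead, start + len(core) + trail
            if begin >= 0 and end <= len(tokens):
                matches.append((begin, end))
    return matches


tokens = "and here is john smith jr with the story tonight".split()
print(find_matches(tokens, "here is _ _ _ with the (news|story|details|report)"))  # [(1, 9)]
print(find_matches("this is the end".split(), "_ _ is the _ _ _"))                 # []
```

The second call returns nothing because the expanded span would start before the first word. That is the case the bounds check and the stored per-document word count guard against.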
34. Custom query type: Multi-term regular expression (4)
• Regular expression libraries differ in
  – Syntax (e.g. Perl 5-compatible)
  – Capabilities (e.g. back-references)
  – Speed
• Memory usage
  – Proportional to number of terms matched
  – Increasing available memory might help
35. Custom result format: Occurrence count
[Table: one row per date (..., 9/14/08, 9/15/08, 9/16/08, ...) and one column per word (crisis, crash, meltdown, tsunami); each cell holds “X docs, Y occurrences”]
• To fill a cell, go through every span generated by, e.g., SpanTermQuery(meltdown) filtered by date 9/15/08
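The date-by-word table can be sketched as a simple aggregation. This version counts tokens in plain strings instead of walking Lucene spans, and the sample documents and the `occurrence_table` name are made up for illustration.

```python
from collections import defaultdict


def occurrence_table(docs, words):
    """Return {(date, word): [doc_count, occurrence_count]}."""
    table = defaultdict(lambda: [0, 0])
    for doc in docs:
        tokens = doc["text"].lower().split()
        for word in words:
            hits = tokens.count(word)
            if hits:
                cell = table[(doc["date"], word)]
                cell[0] += 1     # one more document containing the word
                cell[1] += hits  # every occurrence inside that document
    return dict(table)


# Hypothetical caption text for three programs.
docs = [
    {"date": "9/15/08", "text": "markets in meltdown as the crisis deepens"},
    {"date": "9/15/08", "text": "meltdown fears grow meltdown talk spreads"},
    {"date": "9/16/08", "text": "the crisis continues"},
]

table = occurrence_table(docs, ["crisis", "meltdown"])
print(table[("9/15/08", "meltdown")])  # [2, 3] -> 2 docs, 3 occurrences
```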
36. Future work: Job queue (1)
• Research front moving towards analysis of whole database
  – Want full search result set
  – Queries are intensive and take a long time
• Solution will be beyond increasing timeout
  – Users might close their browsers
  – We might restart the search back-end
37. Future work: Job queue (2)
• Features
  – Query runs in background
  – Notification when finished/failed
  – Restart queries with recoverable errors
  – Check and cancel jobs
  – Downloadable result
  – Schedule recurring queries
  – Manage job priority and quota
38. Future work: Multiple sources and languages (1)
• Multilingual news programs
  – E.g. some have English + Spanish CC
• Multiple text and timestamp sources
  – E.g. CNN transcript available from website
  – Applying speech-to-text to videos
  – Manual correction of text and timestamps
• Multiple markets
  – E.g. capture TV programs in Denmark and Norway
39. Future work: Multiple sources and languages (2)
• Need language detection
  – Libraries exist
• Search for specific channel
  – Search by language more useful
  – But no fixed channel -> language mapping
• What will proximity search and occurrence counting mean when there are multiple channels/languages?
40. Future work: Metadata
• Types of metadata
  – Segment boundary, type and topic
  – Headline and description (from transcripts)
  – Website links
  – Syntactic tags (e.g. part of speech)
  – Generated annotation (e.g. object identification)
  – User annotation (e.g. scene description)
  – Screen text
• Eventually: want them to be searchable
41. Thank you for coming!
• Any questions?
• My e-mail: kai@ssc.ucla.edu
• Slides available: http://ucla.in/IDJq2u