Television News Search and Analysis with Lucene/Solr

Kai Chan <kai@ssc.ucla.edu>
Social Sciences Computing
University of California, Los Angeles
Lucene Revolution, May 10, 2012

Communication Studies Archive
Background (1)

•  Continuation of analog recording of TV news
    –  Thousands of tapes since Watergate/1970s
    –  Hard to look for a particular news program or topic

Communication Studies Archive
Background (2)

•  Digital recording since 2005
•  Capture news programs on computers
    –  Video: can be streamed over the Web
    –  Closed captioning (“subtitle text”): indexed and searchable
    –  Image snapshots
    –  Search engine and analysis tools

Communication Studies Archive
Background (3)

•  Also download transcripts and web-streamed news programs
•  100 news programs and 600,000 words added each day

Communication Studies Archive
Background (4)

•  January 2005 to present
    –  28 networks
    –  1,600 shows
    –  130,000 hours
    –  160,000 news programs
    –  50,000,000 images
    –  880,000,000 words

Why This is Important (1)

•  Researchers
    –  Large and unique collection of communication
    –  Many modalities
        •  Speech, facial expression, body gesture, etc.
    –  Different conditions/settings
    –  Different networks and communities
    –  Allows study of TV news + communication in general in ways impossible before

Why This is Important (2)

•  Non-researchers
    –  TV news about presentation and persuasion
        •  Which happen in daily life also
    –  TV main source of news for many/most
    –  Greatly affects the public’s decisions
    –  Learn about what we watch

Application in Research

•  Communication Studies
    –  Amount of coverage for events over time
•  Linguistics
    –  Speech and language patterns
•  Computer Science
    –  Object identification
    –  Identify news anchors, public figures
    –  Story segmentation

Application in Teaching (1)

•  Chicano Studies: Representations of Latinos on the Television News
    –  May 1, 2007 immigration march
    –  MacArthur Park, Los Angeles, CA
    –  2 days (May 1 & 2, 2007)
    –  Framing, stereotyping, metaphor, silencing
    –  Reports with screenshots and links to news stories

Application in Teaching (2)

•  Communication Studies: Presidential Communication
    –  2008 presidential primary
    –  6 weeks (Dec 2007 to Feb 2008)
    –  Coverage of sound bites
        •  Amount of time given to candidate/party
        •  Types of response (positive, neutral, negative)
    –  Students created their own political ad.

Work flow (1)

Capture/conversion machines

•  2 groups, 2 machines per group
    –  Keep the best recording
    –  6 TV tuners per machine
•  Capture video and CC to separate files in real-time
    –  MPEG-TS (~7 GB/hr)
    –  Timestamp every 2-3 seconds
•  Generate image snapshots
•  Convert videos
    –  MP4/H.264 (VGA, ~240 MB/hr)

[Diagram: capture/conversion machines, backup storage server, storage/control server, image server, video streaming server, search server]
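
As a rough check on the two bit rates on this slide, the arithmetic below compares them and estimates the MP4 storage needed for the 130,000 hours quoted in Background (4). It is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB) and takes the approximate "~" figures at face value, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope storage estimate from the figures on the slides.
# Assumptions: decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) and the
# approximate "~" rates taken at face value.
MPEG_TS_GB_PER_HOUR = 7.0      # raw capture rate
MP4_MB_PER_HOUR = 240.0        # converted H.264 rate
ARCHIVE_HOURS = 130_000        # archive size from Background (4)


def compression_ratio() -> float:
    """How many times smaller the MP4 is than the MPEG-TS capture."""
    return MPEG_TS_GB_PER_HOUR * 1000 / MP4_MB_PER_HOUR


def mp4_archive_terabytes(hours: float = ARCHIVE_HOURS) -> float:
    """Total MP4 storage, in TB, for the given number of hours."""
    return hours * MP4_MB_PER_HOUR / 1_000_000


print(f"compression: about {compression_ratio():.0f}x")
print(f"MP4 archive: about {mp4_archive_terabytes():.1f} TB")
```

Under these assumptions the converted video is roughly 29 times smaller than the capture, and the streaming copies of the whole archive come to about 31 TB.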

Work flow (2)

Storage/static file servers

•  Control server
    –  Download TV schedules
    –  Download web-streamed news programs
    –  Collect and check recordings
    –  Pushes files to places
•  Video streaming server
•  Backup storage server
•  Image server

[Diagram: capture/conversion machines, backup storage server, storage/control server, image server, video streaming server, search server]

Work flow (3)

Search server

•  Lucene index updated daily
    –  Main text field tokenized
    –  Separate fields for date, network, show, etc.
    –  Binary fields for segment and time data
•  Hosts search engine

[Diagram: capture/conversion machines, backup storage server, storage/control server, image server, video streaming server, search server]
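
One way to picture the binary fields is as compact byte strings stored next to the tokenized text. The sketch below uses plain Python rather than Lucene's Java API. The byte layout (big-endian unsigned 32-bit integers), the field names, the sample values, and the helpers `pack_pairs` and `unpack_pairs` are all assumptions for illustration; the slides do not describe the archive's actual format.

```python
import struct
from typing import List, Tuple

Pair = Tuple[int, int]


def pack_pairs(pairs: List[Pair]) -> bytes:
    """Pack (a, b) integer pairs as consecutive big-endian uint32 values."""
    flat = [value for pair in pairs for value in pair]
    return struct.pack(f">{len(flat)}I", *flat)


def unpack_pairs(blob: bytes) -> List[Pair]:
    """Inverse of pack_pairs."""
    count = len(blob) // 4
    flat = struct.unpack(f">{count}I", blob)
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical document: field names and values are illustrative only.
doc = {
    "text": "good evening here is the news ...",
    "date": "2008-09-15",
    "network": "CNN",
    "show": "Example Show",
    # Segments as (start token position, end token position).
    "segments": pack_pairs([(0, 120), (121, 480)]),
    # Timestamps as (token position, seconds since program start).
    "times": pack_pairs([(0, 0), (7, 3), (15, 5)]),
}

print(unpack_pairs(doc["segments"]))
```

Each pair costs 8 bytes in this layout, so even a program with dozens of segments and hundreds of timestamps adds only a few kilobytes per document.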

The search process

[Diagram: the user and the three servers]
•  User
    –  Perform searches (search server)
    –  Watch videos and montages (video server)
    –  Retrieve thumbnails (image server)
•  Video server
    –  Video files
    –  Video streaming server (Wowza)
•  Image server
    –  Web server (Apache)
    –  Thumbnails & montages
•  Search server
    –  Front end: web server, custom code (PHP)
    –  PHP-Java Bridge or Solr bridge
    –  Back end: custom code (Java), Lucene
    –  MySQL database, Lucene index

Custom query type
Segment-enclosed query (1)

•  Problem 1: search for “X near Z”
•  Lucene: search for “X within Y words of Z”
    –  How to pick Y?
    –  Hard to pick a fixed number
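
The difficulty of choosing Y shows up even in a toy example. The sketch below, in plain Python rather than Lucene, checks whether two words occur within Y token positions of each other. It uses a simple position difference, which is not exactly how Lucene defines slop, and the function name and sample sentence are made up for illustration.

```python
from typing import List


def within_n_words(tokens: List[str], x: str, z: str, y: int) -> bool:
    """True if some occurrence of x is at most y positions from some z."""
    x_positions = [i for i, tok in enumerate(tokens) if tok == x]
    z_positions = [i for i, tok in enumerate(tokens) if tok == z]
    return any(abs(i - j) <= y for i in x_positions for j in z_positions)


tokens = ("the president said on tuesday that after months of debate "
          "he would visit afghanistan").split()

# The two words are 12 positions apart in this sentence,
# so a window of 5 misses them and a window of 12 finds them.
print(within_n_words(tokens, "president", "afghanistan", 5))
print(within_n_words(tokens, "president", "afghanistan", 12))
```

A small window misses related words in a long sentence, while a large window starts pairing words from unrelated sentences, which is why no single fixed number works well.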

Custom query type
Segment-enclosed query (2)

•  Problem 2: the matched search words might not all be about the same story
    –  E.g. “Obama AND visit AND Afghanistan”
    –  Might match a news program about Obama’s visit to Canada + violence in Afghanistan

Custom query type
Segment-enclosed query (3)

•  A news program can contain several stories
    –  E.g. Local, national, world, weather, sports

Custom query type
Segment-enclosed query (4)

[Diagram: example sequence of segments in one news program]
local story 1
local story 2
commercials
national story 1
national story 2
weather 1
commercials
world story 1
world story 2
weather 2
commercials
health
entertainment
sports

Custom query type
Segment-enclosed query (5)

•  One solution: search for “X and Z within same story segment”
    –  Possible with Lucene + story segment info
•  Bonus: enables searching/filtering for a particular story type
    –  E.g. Politics
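
The difference between matching a whole program and matching one story can be sketched on a toy transcript. This is plain Python, not the Lucene implementation: segments are assumed to be inclusive (start, end) token positions, and the function name and the two sample stories are invented for illustration.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) token positions


def segments_with_all(tokens: Sequence[str], segments: List[Segment],
                      words: Sequence[str]) -> List[Segment]:
    """Return the segments whose tokens include every query word."""
    wanted = set(words)
    hits = []
    for start, end in segments:
        present = set(tokens[start:end + 1])
        if wanted <= present:
            hits.append((start, end))
    return hits


story_1 = "obama began a visit to canada today".split()
story_2 = "violence continued in afghanistan overnight".split()
tokens = story_1 + story_2
segments = [(0, len(story_1) - 1), (len(story_1), len(tokens) - 1)]

query = ["obama", "visit", "afghanistan"]
# A program-level AND matches, because all three words occur somewhere.
print(set(query) <= set(tokens))
# No single story contains all three, so the segment-level search is empty.
print(segments_with_all(tokens, segments, query))
# Two words that do share a story are found in the first segment.
print(segments_with_all(tokens, segments, ["obama", "canada"]))
```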

Custom query type
Segment-enclosed query (6)

•  How to mark segments
    –  Automated
        •  Computer Science researchers working on them
        •  Word frequency
        •  Scene change
        •  Black frame and silence
    –  Manual segmentation
        •  Watch the video
        •  Decide where a story starts and ends
        •  Mark positions in semi-automated system

Custom query type
Segment-enclosed query (7)

[Diagram: timeline with begin and end markers for segments 1, 2, and 3, and five spans (span 1 to span 5) positioned against them]

Custom query type
Segment-enclosed query (8)

•  Idea
    –  Get spans from SpanNearQuery
    –  Filter and keep those fully within segments
•  In production: segment info in stored fields
    –  As a list of <start position, end position>
    –  Simple to implement
    –  Reasonably fast searching
•  Alternative: store segment info as terms
    –  Possible to find segments by themselves
    –  Appears to run much faster
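
The filtering step in the idea above can be sketched as follows, in plain Python rather than against the Lucene API. The conventions are assumptions: spans are (start, end) token positions with the end exclusive, segments are a sorted, non-overlapping list of <start position, end position> with the end inclusive, and `spans_within_segments` is a hypothetical helper name.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]     # (start, end), end exclusive (assumed)
Segment = Tuple[int, int]  # (start, end), end inclusive (assumed)


def spans_within_segments(spans: List[Span],
                          segments: List[Segment]) -> List[Span]:
    """Keep spans that lie entirely inside one segment.

    segments must be sorted by start and must not overlap.
    """
    starts = [start for start, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Last segment that begins at or before the span's first token.
        index = bisect_right(starts, span_start) - 1
        if index < 0:
            continue  # span begins before the first segment
        _, seg_end = segments[index]
        if span_end - 1 <= seg_end:
            kept.append((span_start, span_end))
    return kept


segments = [(0, 99), (100, 249), (250, 400)]
spans = [(10, 20), (95, 105), (100, 110), (240, 250), (249, 251)]
print(spans_within_segments(spans, segments))
```

Because the segment list is sorted, each span needs only one binary search, so the cost of the filter grows with the number of spans rather than with spans times segments.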

Custom query type
Time-enclosed query

[Diagram: time axis from 20 s to 60 s in 5 s steps, with five spans and their time bounds]
span 1: <= 20 s
span 2: <= 15 s
span 3: <= 10 s
span 4: <= 35 s
span 5: <= 25 s
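
A time-enclosed filter could work along these lines: map each span to the sparse caption timestamps around it and keep the span only if the time it covers is within a limit. This is a sketch under assumptions, not the archive's code. Timestamps are taken to be a sorted list of (token position, seconds), the bound uses the nearest timestamp at or before the span and the nearest at or after it, and all names and sample values are invented.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Span = Tuple[int, int]          # (start, end) token positions, end exclusive
Timestamp = Tuple[int, float]   # (token position, seconds into the program)


def max_duration(span: Span, stamps: List[Timestamp]) -> float:
    """Upper bound on the time a span covers, from sparse timestamps.

    stamps must be non-empty and sorted by position.
    """
    positions = [pos for pos, _ in stamps]
    first, last = span[0], span[1] - 1
    # Latest timestamp at or before the first token (clamped to the first).
    lo = max(bisect_right(positions, first) - 1, 0)
    # Earliest timestamp at or after the last token (clamped to the last).
    hi = min(bisect_left(positions, last), len(stamps) - 1)
    return stamps[hi][1] - stamps[lo][1]


def spans_within_seconds(spans: List[Span], stamps: List[Timestamp],
                         limit: float) -> List[Span]:
    """Keep spans whose duration bound does not exceed limit seconds."""
    return [s for s in spans if max_duration(s, stamps) <= limit]


# One timestamp every 8 tokens, 2.5 seconds apart (made-up values).
stamps = [(i * 8, i * 2.5) for i in range(20)]
spans = [(0, 9), (4, 60), (16, 41)]
print([max_duration(s, stamps) for s in spans])
print(spans_within_seconds(spans, stamps, 10.0))
```

Since timestamps arrive only every few seconds, the result is an upper bound on the true duration, which would be consistent with the "<=" labels in the diagram.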

Custom query type
Multi-term regular expression (1)

•  “here is _ _ _ with the (news|story|details|report)”
•  Apply RegEx to a phrase or sentence
    –  Not just individual words
•  Lucene core has regular expression query support
    –  Good starting point
    –  Not a complete solution for us

Custom query type
Multi-term regular expression (2)

•  Problems
    –  Some analyzers do not work with RegEx
    –  Lucene’s RegEx query classes only apply RegEx to individual terms
        •  Want to match a pattern against a phrase/sentence
        •  Want placeholders for whole words (not just characters)
    –  Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index
        •  very slow
        •  takes lots of memory

Custom query type
Multi-term regular expression (3)

•  What we did
    –  Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
        •  E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
    –  Leading and trailing placeholders
        •  E.g. “_ _ is the _ _ _”
        •  Preserve for correctness
        •  Store word count for each document
        •  Expand each span on both sides
        •  Bounds checking
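
The matching semantics described above can be illustrated on a token list. The sketch below is plain Python and does not build Lucene queries: it splits the pattern on whitespace, treats "_" as a placeholder for exactly one whole word, applies every other part as a regular expression to a single term, and checks that the whole window, including leading and trailing placeholders, fits inside the document. The function names and the sample sentence are invented, and per-term patterns containing spaces are not supported.

```python
import re
from typing import List, Optional, Tuple


def compile_terms(pattern: str) -> List[Optional[re.Pattern]]:
    """Split a multi-term pattern; "_" becomes None (any one word)."""
    return [None if part == "_" else re.compile(part)
            for part in pattern.split()]


def find_matches(tokens: List[str], pattern: str) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, matching pattern."""
    terms = compile_terms(pattern)
    width = len(terms)
    matches = []
    # Bounds check: the window, including leading and trailing
    # placeholders, must fit inside the document's word count.
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(term is None or term.fullmatch(tok)
               for term, tok in zip(terms, window)):
            matches.append((start, start + width))
    return matches


tokens = "and here is our reporter jane with the details tonight".split()
print(find_matches(tokens, "here is _ _ _ with the (news|story|details|report)"))
print(find_matches(tokens, "_ _ is our _ _ _"))
print(find_matches(tokens, "_ _ _ here is"))
```

The last pattern finds nothing because only one word precedes "here" in the sample, so three leading placeholders cannot be satisfied. That is the case the word count and bounds checking guard against.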

Custom query type
Multi-term regular expression (4)

•  Regular expression libraries differ in
    –  Syntax (e.g. Perl 5-compatible)
    –  Capabilities (e.g. back-references)
    –  Speed
•  Memory usage
    –  Proportional to number of terms matched
    –  Increasing available memory might help

Custom result format
Occurrence count

date \ word   crisis   crash   meltdown                 tsunami
...
9/14/08
9/15/08                        X docs, Y occurrences
9/16/08
...

For each cell: go through every span generated by (SpanTermQuery(meltdown) filtered by date 9/15/08)
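
The table can be thought of as a mapping from (date, word) to a pair of counts. The sketch below builds that mapping from a toy in-memory collection, where each matching token position stands in for one span from a SpanTermQuery. The sample documents, the function name, and the data layout are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Doc = Tuple[str, List[str]]  # (air date, tokens)
Key = Tuple[str, str]        # (date, word)


def occurrence_table(docs: Iterable[Doc],
                     words: List[str]) -> Dict[Key, Tuple[int, int]]:
    """Map (date, word) to (number of docs, number of occurrences)."""
    doc_counts: Dict[Key, int] = defaultdict(int)
    hit_counts: Dict[Key, int] = defaultdict(int)
    for date, tokens in docs:
        for word in words:
            # Each matching position plays the role of one span.
            hits = sum(1 for tok in tokens if tok == word)
            if hits:
                doc_counts[(date, word)] += 1
                hit_counts[(date, word)] += hits
    return {key: (doc_counts[key], hit_counts[key]) for key in doc_counts}


docs = [
    ("9/15/08", "the meltdown on wall street deepened the crisis".split()),
    ("9/15/08", "analysts called it a meltdown a true meltdown".split()),
    ("9/16/08", "markets steadied after the crash".split()),
]
table = occurrence_table(docs, ["crisis", "crash", "meltdown", "tsunami"])
print(table[("9/15/08", "meltdown")])
```

In this toy data the cell for "meltdown" on 9/15/08 is 2 docs and 3 occurrences; cells with no matches are simply absent from the mapping.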

Future work
Job queue (1)

•  Research front moving towards analysis of whole database
    –  Want full search result set
    –  Queries are intensive and take a long time
•  Solution will be beyond increasing timeout
    –  Users might close their browsers
    –  We might restart the search back-end

Future work
Job queue (2)

•  Features
    –  Query runs in background
    –  Notification when finished/failed
    –  Restart queries with recoverable errors
    –  Check and cancel jobs
    –  Downloadable result
    –  Schedule recurring queries
    –  Manage job priority and quota

Future work
Multiple sources and languages (1)

•  Multilingual news programs
    –  E.g. some have English + Spanish CC
•  Multiple text and timestamp sources
    –  E.g. CNN transcript available from website
    –  Applying speech-to-text to videos
    –  Manual correction of text and timestamps
•  Multiple markets
    –  E.g. Capture TV programs in Denmark and Norway

Future work
Multiple sources and languages (2)

•  Need language detection
    –  Libraries exist
•  Search for specific channel
    –  Search by language more useful
    –  But no fixed channel -> language mapping
•  What will proximity search and occurrence counting mean when there are multiple channels/languages?

Future work
Metadata

•  Types of metadata
    –  Segment boundary, type and topic
    –  Headline and description (from transcripts)
    –  Website links
    –  Syntactic tags (e.g. part of speech)
    –  Generated annotation (e.g. object identification)
    –  User annotation (e.g. scene description)
    –  Screen text
•  Eventually: want them to be searchable

Thank you for coming!

•  Any questions?
•  My e-mail: kai@ssc.ucla.edu
•  Slides available: http://ucla.in/IDJq2u
