Incorporating site level knowledge to extract structured data from web forums - keynote
1. Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums
Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma
Web Search & Mining Group, Microsoft Research Asia
2009-04
Saturday, May 22, 2010
2. Web Forum Data
• An important information resource containing a lot of human knowledge.
• This information covers recreation, sports, games, computers, art, society, science, home, and health.
• 20% of the pages in search results come from forums.
4. Understanding
• Forum Crawling
– WWW’08: iRobot: An Intelligent Crawler for Web Forums
– SIGIR’08: Exploring Traversal Strategy
– KDD’09: Incremental Crawling
• Data Extraction and Quality Assessment
– WWW’09, SIGIR’09: Automation Data Extraction, Quality Assessment
17. Forum Sitemap
• A sitemap is a directed graph consisting of a set of vertices and the corresponding links.
• Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference.
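The sitemap definition above can be sketched as a small directed-graph structure. This is only an illustration: the class name `Sitemap`, its methods, and the vertex names (`board_list`, `thread_list`, `post_page`) are hypothetical and do not come from the slides or the iRobot paper.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices stand for groups of pages, links for hyperlinks between groups."""

    def __init__(self):
        self.vertices = set()
        self.links = defaultdict(set)  # source vertex -> set of target vertices

    def add_link(self, source, target):
        # Adding a link implicitly registers both endpoints as vertices.
        self.vertices.update((source, target))
        self.links[source].add(target)

    def successors(self, vertex):
        # Sorted so the result is deterministic.
        return sorted(self.links.get(vertex, ()))


# Hypothetical forum: boards list threads, threads paginate and lead to posts.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list")
sitemap.add_link("thread_list", "thread_list")  # pagination is a self-link
sitemap.add_link("thread_list", "post_page")
```

A self-link such as `thread_list -> thread_list` is one way to represent pagination within the same kind of page.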
18. Page Clustering
• Forum pages are generated from a database and templates.
• Layout is a robust way to describe a template.
– Layout can be characterized by the HTML elements in different DOM paths.
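One way to picture "layout characterized by DOM paths" is to collect the set of tag paths on each page and compare the sets. The sketch below is an assumption for illustration only: the slides do not say how layouts are compared, so the Jaccard similarity, the `PathCollector` class, and the helper names are hypothetical stand-ins rather than the authors' method.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def layout_paths(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def jaccard(a, b):
    # Similarity of two path sets; 1.0 means identical layout signatures.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


list_a = layout_paths("<html><body><table><tr><td><a href='t1'>A</a></td></tr></table></body></html>")
list_b = layout_paths("<html><body><table><tr><td><a href='t2'>B</a></td></tr>"
                      "<tr><td><a href='t3'>C</a></td></tr></table></body></html>")
post = layout_paths("<html><body><div><p>hello</p></div></body></html>")
```

In this toy input, the two list pages share the same path set even though they contain different numbers of rows, while the post page shares only the `html` and `html/body` paths with them.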
28. Inner-Page Features
• The inclusion relation. Data records usually have inclusion relations.
• The alignment relation. Since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page.
• Time order. Since post records are generated sequentially along a timeline, the post times should be sorted in ascending or descending order.
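The time-order feature lends itself to a simple check: a candidate "post time" field is plausible only if its values on one page are monotone. The following is a minimal sketch under that reading; the function name `is_time_ordered` and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def is_time_ordered(timestamps):
    """Return True if the sequence is entirely non-decreasing or non-increasing."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Hypothetical post times extracted from one page, in document order.
post_times = [
    datetime(2009, 4, 1, 10, 0),
    datetime(2009, 4, 1, 10, 5),
    datetime(2009, 4, 2, 9, 0),
]
shuffled = [post_times[1], post_times[0], post_times[2]]
```

A field whose values fail this check (like `shuffled`) is less likely to be the post time, which is how the feature can help disambiguate date-like strings such as join dates or edit dates.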
47. Markov Logic Networks
• An MLN can be viewed as a template for constructing Markov Random Fields.
• With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
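In the standard formulation of Markov logic (Richardson and Domingos), which this slide appears to follow, the probability takes the log-linear form below. The symbols are the usual ones and are an assumption here: w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in state x, and Z is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

A state that satisfies more groundings of highly weighted formulas is therefore exponentially more probable than one that violates them.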
48. Markov Logic Networks
• Divide DOM tree elements into three categories:
– Text element
– Hyperlink element
– Inner element
• Benefit
– Reduce the number of possible groundings in inference.
– Reduce the ambiguity and achieve better performance.
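The three-way split above could look roughly like the following. The slide does not define the categories precisely, so the rules used here are assumptions: an `a` tag is a hyperlink element, a childless element with non-empty text is a text element, and everything else is an inner element. The function names and the sample snippet are hypothetical.

```python
import xml.etree.ElementTree as ET


def categorize(element):
    """Assumed rules: <a> -> hyperlink; leaf with text -> text; otherwise inner."""
    if element.tag == "a":
        return "hyperlink"
    if len(element) == 0 and (element.text or "").strip():
        return "text"
    return "inner"


def categorize_tree(root):
    # Element.iter() walks the tree in document order, starting at the root.
    return [(el.tag, categorize(el)) for el in root.iter()]


# Well-formed toy snippet standing in for one post record.
snippet = "<div><span>Alice</span><a href='/t/1'>Re: hello</a><p>body text</p></div>"
categories = categorize_tree(ET.fromstring(snippet))
```

Typing elements up front means a predicate about, say, post authors only needs to be grounded over text or hyperlink elements rather than every DOM node, which is the reduction in groundings the slide refers to.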
49. Experiments
Panels: List Pages, Post Pages
57. Future Work
http://discussions.apple.com/
58. Conclusion
• A template-independent approach to extract structured data from web forum sites.
• We can leverage the power of site-level information, such as the mutual information among pages within a vertex of the sitemap or across vertices.
• http://research.microsoft.com/people/jmyang/