EuroVis DocuBurst Presentation 2009

DocuBurst: Visualizing Document Content Using Language Structure

EuroVis 2009 Christopher Collins, Sheelagh Carpendale, and Gerald Penn

Document Content Visualization

• Navigation in collections of digital text
• Content analysis (digital humanities)
• Plagiarism detection
• Authorship attribution

...Using Language Structure

• Traditional glyph techniques use unstructured word counts (e.g. tag clouds)
• DocuBurst structure is based on a carefully designed ontology called WordNet

WordNet Background

• Basic data unit is a set of synonyms called a synset:
  {lawyer, attorney}, {jump, hop, skip}
• Words can occur in multiple synsets:
  {bank, financial institution}
  {bank, slope, riverside}
• Free resource from Princeton University
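
The synset idea above can be sketched in a few lines of Python. This is only an illustration: the synsets are copied from the slide rather than loaded from WordNet, and the names `SYNSETS` and `senses_of` are assumptions made for the example.

```python
# Toy synsets copied from the slide; real WordNet holds many more entries.
SYNSETS = [
    frozenset({"lawyer", "attorney"}),
    frozenset({"jump", "hop", "skip"}),
    frozenset({"bank", "financial institution"}),
    frozenset({"bank", "slope", "riverside"}),
]


def senses_of(word):
    """Return every synset in the toy list that contains the given word."""
    return [synset for synset in SYNSETS if word in synset]


# "bank" appears in two synsets, "lawyer" in only one.
print(len(senses_of("bank")))
print(len(senses_of("lawyer")))
```

A word with more than one entry in the result is ambiguous in this toy data, which is the situation the later scoring slides address.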

Hyponymy Relation

• X is a Y or X is a kind of Y
• transitive, asymmetric relationship
• example
  - {robin, redbreast} IS-A {bird}
  - robin and redbreast are hyponyms of bird
• forms the basic structure of the noun network

{robin, redbreast} IS-A {bird} IS-A
{animal, animate_being} IS-A
{organism, life_form, living_thing} IS-A {entity}
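
A minimal sketch of how the IS-A chain above could be followed in code. The parent map is hand-written from the chain on the slide (one label per synset), and `HYPERNYM`, `hypernym_chain`, and `is_a` are hypothetical names, not part of DocuBurst or WordNet.

```python
# Hypothetical parent map: each label points to its direct hypernym.
HYPERNYM = {
    "robin": "bird",
    "bird": "animal",
    "animal": "organism",
    "organism": "entity",
}


def hypernym_chain(label):
    """Follow IS-A links upward until a label with no hypernym is reached."""
    chain = [label]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def is_a(x, y):
    """True if y is a proper ancestor of x, so the relation is transitive."""
    return y in hypernym_chain(x)[1:]


print(" IS-A ".join(hypernym_chain("robin")))
print(is_a("robin", "entity"), is_a("entity", "robin"))
```

The second print shows the asymmetry: a robin is an entity, but an entity is not a robin.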

Creating DocuBurst

games → game
taken → take

absolute,noun,10
chair,noun,2
moment,noun,11
game,noun,30
reality,noun,3
take,verb,13
represent,verb,17
...

game IS activity
chair IS furniture
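
The counting step shown above might look roughly like the sketch below. The slides do not say which tagger or lemmatizer was used, so the lemma table, the pre-tagged input, and the function name `count_lemmas` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical lemma table; a real pipeline would use a morphological analyser.
LEMMAS = {"games": "game", "taken": "take"}


def count_lemmas(tagged_tokens):
    """Count (lemma, part_of_speech) pairs from (token, pos) input."""
    counts = Counter()
    for token, pos in tagged_tokens:
        lemma = LEMMAS.get(token.lower(), token.lower())
        counts[(lemma, pos)] += 1
    return counts


# Assumed, already part-of-speech-tagged input.
tokens = [("games", "noun"), ("game", "noun"), ("taken", "verb"), ("chair", "noun")]
for (lemma, pos), n in sorted(count_lemmas(tokens).items()):
    print(f"{lemma},{pos},{n}")
```

Each printed row has the same `word,part-of-speech,count` shape as the list on the slide; the counts would then be attached to the matching WordNet synsets.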

Word Sense Ambiguity

• Man = {mankind, world}, {male human}, ...
• Water = {H2O}, {water supply}, {body of water}, ...
• Word senses are roughly ordered by frequency in WordNet

Alternative Scoring Models

• Count for all senses
  - undue prominence to ambiguous words
• Count first sense only
  - loses too much information
• Divide by sense count (same for all senses)
  - high penalty on polysemous words
• Divide by sense index
  - decreased prominence for uncommon senses
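
The four alternatives above can be compared with a small sketch. It assumes that "divide by sense index" means a word's count divided by the 1-based position of the sense in WordNet's ordering; the slides do not give exact formulas, and the function and model names are hypothetical.

```python
def sense_scores(count, n_senses, model):
    """Return the score given to each sense (position 0 = most frequent sense)."""
    if model == "all":        # full count credited to every sense
        return [count] * n_senses
    if model == "first":      # only the first sense receives the count
        return [count] + [0] * (n_senses - 1)
    if model == "by_count":   # equal share for every sense
        return [count / n_senses] * n_senses
    if model == "by_index":   # share shrinks as the sense index grows
        return [count / index for index in range(1, n_senses + 1)]
    raise ValueError(f"unknown model: {model}")


# A word seen 12 times that has 3 senses.
for model in ("all", "first", "by_count", "by_index"):
    print(model, sense_scores(12, 3, model))
```

Under these assumptions, the index-based model keeps the full count on the most common sense while still giving rarer senses some weight.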

Visual Encoding

• Node Size: # of leaves in subtree
  - Stability across documents
• Node Position: IS-A relation
  - Multi-level linguistic abstraction
  - Additive (2 ducks + 3 geese = 5 birds)
• Node Hue: sense index
  - Differentiates subtrees
• Node Saturation: word count
  - Ordering and approximate scale are perceived
• Node Label: First word in synset
  - Words are ordered by commonality in the language, which reveals well-known words
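
A small sketch of the two quantities behind node size and the additive counts, using the ducks-and-geese example from the slide. The tree and the names `CHILDREN`, `OWN_COUNT`, `cumulative_count`, and `leaf_count` are assumptions for illustration only.

```python
# Hypothetical IS-A subtree (child lists) and per-node word counts.
CHILDREN = {"bird": ["duck", "goose"], "duck": [], "goose": []}
OWN_COUNT = {"bird": 0, "duck": 2, "goose": 3}


def cumulative_count(node):
    """Own count plus the counts of every descendant (additive up the tree)."""
    children = CHILDREN.get(node, [])
    return OWN_COUNT.get(node, 0) + sum(cumulative_count(c) for c in children)


def leaf_count(node):
    """Number of leaves under a node; depends on the tree, not on the document."""
    children = CHILDREN.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(c) for c in children)


print(cumulative_count("bird"))  # 2 ducks + 3 geese
print(leaf_count("bird"))
```

Because `leaf_count` ignores the document's word counts, a size based on it stays the same from one document to the next, which is the stability the slide mentions.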

Node Colouring Alternatives

Cumulative Counts: Supports Visual Summaries
Single Node Counts: Supports Precision and Selection

Trace-to-Root

Cattle IS-A bovine IS-A bovid IS-A ... Mammal IS-A vertebrate IS-A chordate IS-A animal

Level of Detail Filter

• Nodes more than N levels away from the root are hidden
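
One way such a filter could work is a breadth-first walk that stops expanding at depth N. This is a sketch under assumptions: the tree, `CHILDREN`, and `visible_nodes` are illustrative names, and the slides do not describe the actual implementation.

```python
from collections import deque

# Hypothetical IS-A tree stored as child lists.
CHILDREN = {
    "animal": ["bird"],
    "bird": ["robin", "duck"],
    "robin": [],
    "duck": [],
}


def visible_nodes(root, max_depth):
    """Return nodes at most max_depth levels below root, in breadth-first order."""
    visible = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        visible.append(node)
        if depth < max_depth:  # deeper nodes are never enqueued, so they stay hidden
            for child in CHILDREN.get(node, []):
                queue.append((child, depth + 1))
    return visible


print(visible_nodes("animal", 1))
print(visible_nodes("animal", 2))
```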

Node Size Mapping

• Size by # leaves
  + consistent
  – visual artifacts (highly relevant words with few leaves are too small)

• Size by score
  + redundant encoding
  + important words more prominent
  – disrupts inter-document comparison

Font Size Mapping

• Size to fit cell
  + maximize legibility
  – short words have huge font

• Font size proportional to cell size
  + short words not more prominent
  – small maximum size to accommodate long words

Inclusion of Zero-count Words

+ provides context (what is not in document)
– more cluttered

2008 U.S. Presidential Debate

Unexpected Uses

• WordNet Visualization

Unexpected Uses

• Language Education
  - “invaluable potential for writing and vocabulary development at the secondary level”
  - “I'm very interested in using the program, I'm an English teacher”

Types of Document Visualization

Features of Document Visualization

• Semantic: indicate meaning
• Cluster: generalize into concepts
• Overview: provide quick gist
• Zoom: support varying level of detail
• Compare: multi-document comparisons
• Search: find specific words/phrases
• Read: drill-down to original text
• Pattern: reveal patterns of repetition
• Features: reveal extracted features such as emotion
• Suggest: automatically select interesting focus words
• Phrases: can show multi-word phrases
• All words: can show all parts of speech

Semantics & Clustering

• Provides word definitions and relations
• Clusters of related terms allow variable level of abstraction

Phrases & All Words

• Cannot visualize multi-word phrases that are not ‘words’ in WordNet
• Only English nouns and verbs

DocuBurst Comparative Views

• Embed small multiples in e-libraries
• Colour scale based on text difference
  - From each other
  - From corpus average

Simplification

• Root suggestion
  - How to know where to start exploring?
• Word sense disambiguation
  - Attempt to select a sense
• Use a less detailed ontology

Thanks for your Attention!

Acknowledgements:
Ravin Balakrishnan and helpful reviewers.
Contact: ccollins@cs.utoronto.ca
