4. Document Content Visualization
Navigation in collections of digital text
Content analysis (digital humanities)
Plagiarism detection
Authorship attribution
5. ...Using Language Structure
Traditional glyph techniques use unstructured word counts (e.g. tag clouds)
DocuBurst structure is based on a carefully designed ontology called WordNet
6. WordNet Background
Basic data unit is a set of synonyms called a synset:
{lawyer, attorney}, {jump, hop, skip}
Words can occur in multiple synsets:
{bank, financial institution}
{bank, slope, riverside}
Free resource from Princeton University
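A minimal sketch of the synset idea, using a hand-built toy table rather than the real WordNet database. The names SYNSETS and synsets_for are hypothetical and exist only for illustration; the entries copy the examples above.

```python
# Toy stand-in for WordNet: each synset is a frozenset of synonymous words.
# The entries mirror the slide's examples; the real resource is far larger.
SYNSETS = [
    frozenset({"lawyer", "attorney"}),
    frozenset({"jump", "hop", "skip"}),
    frozenset({"bank", "financial institution"}),
    frozenset({"bank", "slope", "riverside"}),
]


def synsets_for(word):
    """Return every synset that contains the word, in table order."""
    return [s for s in SYNSETS if word in s]
```

Looking up "bank" returns two synsets, which is the ambiguity that any count-based view of a document has to deal with.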
7. Hyponymy Relation
X is a Y or X is a kind of Y
transitive, asymmetric relationship
Example:
{robin, redbreast} IS-A {bird}
robin and redbreast are hyponyms of bird
forms the basic structure of the noun network
{robin, redbreast} IS-A {bird} IS-A
{animal, animate_being} IS-A
{organism, life_form, living_thing} IS-A {entity}
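The chain above can be walked programmatically. The sketch below assumes a toy table in which every synset has exactly one direct hypernym (real WordNet synsets can have more than one) and labels each synset by its first word. HYPERNYM, hypernym_path and is_a are hypothetical names.

```python
# Toy IS-A table: each synset label maps to its direct hypernym.
# Assumption: one hypernym per synset, following the chain on the slide.
HYPERNYM = {
    "robin": "bird",
    "bird": "animal",
    "animal": "organism",
    "organism": "entity",
}


def hypernym_path(synset):
    """Follow IS-A links from a synset up to the root."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def is_a(x, y):
    """True if x is a direct or indirect hyponym of y; never true for x == y."""
    return y in hypernym_path(x)[1:]
```

Because is_a follows the whole path, it is transitive (robin IS-A entity), and because links only point upward, it is asymmetric (entity is not a robin).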
8. Creating DocuBurst
games → game
taken → take
absolute,noun,10
chair,noun,2
moment,noun,11
game,noun,30
reality,noun,3
take,verb,13
represent,verb,17
...
game IS-A activity
chair IS-A furniture
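The steps on this slide (reduce words to base forms, count them by part of speech, attach them to the IS-A hierarchy) might be sketched as follows. The lookup tables LEMMAS and HYPERNYM are hypothetical stand-ins for a real lemmatiser and for WordNet, and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical lookup tables standing in for a lemmatiser and for WordNet.
LEMMAS = {
    "games": ("game", "noun"),
    "game": ("game", "noun"),
    "chairs": ("chair", "noun"),
    "taken": ("take", "verb"),
}
HYPERNYM = {"game": "activity", "chair": "furniture"}


def count_lemmas(tokens):
    """Map each known token to (lemma, part of speech) and count occurrences."""
    return Counter(LEMMAS[t] for t in tokens if t in LEMMAS)


def attach_to_hypernyms(counts):
    """Sum noun counts under their direct hypernym."""
    totals = Counter()
    for (lemma, pos), n in counts.items():
        if pos == "noun" and lemma in HYPERNYM:
            totals[HYPERNYM[lemma]] += n
    return totals
```

Tokens missing from the toy lemma table are simply skipped here; a real pipeline would need a policy for unknown words.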
10. Word Sense Ambiguity
Man = {mankind, world}, {male human}, ...
Water = {H2O}, {water supply}, {body of water}, ...
Word senses are roughly ordered by frequency in WordNet
11. Alternative Scoring Models
Count for all senses
undue prominence to ambiguous words
Count first sense only
loses too much information
Divide by sense count (same for all senses)
high penalty on polysemous words
Divide by sense index
decreased prominence for uncommon senses
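The four models can be compared side by side. This is a sketch under stated assumptions: senses are listed most frequent first, the sense index is 1-based, and the function name sense_scores and the model labels are hypothetical.

```python
def sense_scores(count, n_senses, model):
    """Distribute one word's occurrence count over its senses.

    Assumes n_senses >= 1 and senses ordered most frequent first.
    Returns a list with one score per sense.
    """
    if model == "all":        # full count credited to every sense
        return [float(count)] * n_senses
    if model == "first":      # only the most frequent sense is credited
        return [float(count)] + [0.0] * (n_senses - 1)
    if model == "by_count":   # count divided by number of senses
        return [count / n_senses] * n_senses
    if model == "by_index":   # count divided by the sense's index
        return [count / i for i in range(1, n_senses + 1)]
    raise ValueError(f"unknown model: {model}")
```

For a word seen 12 times with three senses, "by_count" gives each sense 4, while "by_index" gives 12, 6 and 4, so the common sense keeps its full weight and rarer senses fade.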
12. Visual Encoding
Node Size: # of leaves in subtree
Stability across documents
Node Position: IS-A relation
Multi-level linguistic abstraction
Additive
(2 ducks + 3 geese = 5 birds)
Node Hue: sense index
Differentiates subtrees
Node Saturation: word count
Ordering & approximate scale are perceived
Node Label: First word in synset
Words are ordered by commonality in the language, which surfaces well-known words
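Two of the quantities above, the number of leaves in a subtree and the additive word count, are simple recursions over the hierarchy. The sketch below uses a toy three-node tree built from the ducks-and-geese example; CHILDREN, OWN_COUNT and both function names are hypothetical.

```python
# Toy subtree: each node maps to its children; leaves map to an empty list.
CHILDREN = {"bird": ["duck", "goose"], "duck": [], "goose": []}
# Occurrences of each node's own words in the document.
OWN_COUNT = {"bird": 0, "duck": 2, "goose": 3}


def leaf_count(node):
    """Number of leaves under a node (a leaf counts as one)."""
    kids = CHILDREN[node]
    return 1 if not kids else sum(leaf_count(k) for k in kids)


def cumulative_count(node):
    """Own occurrences plus those of all descendants."""
    return OWN_COUNT[node] + sum(cumulative_count(k) for k in CHILDREN[node])
```

The leaf count depends only on the hierarchy, not on the document, which is why sizing by it stays stable across documents; the cumulative count is the document-dependent part.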
13. Node Colouring Alternatives
Cumulative Counts: Supports Visual Summaries
Single Node Counts: Supports Precision and Selection
24. Node Size Mapping
Size by # leaves
+ consistent
– visual artifacts (highly relevant words with few leaves are too small)
Size by score
+ redundant encoding
+ important words more prominent
– disrupts inter-document comparison
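The trade-off can be seen numerically. The sketch assumes a space-filling layout in which a parent's extent is divided among its children in proportion to some weight; angular_shares is a hypothetical helper, not the tool's actual layout code.

```python
def angular_shares(children, weight):
    """Split a parent's extent among its children in proportion to weight.

    children: list of node names; weight: dict name -> non-negative number.
    Returns dict name -> fraction of the parent's extent.
    """
    total = sum(weight[c] for c in children)
    if total == 0:
        # No information to go on: fall back to an equal split.
        return {c: 1 / len(children) for c in children}
    return {c: weight[c] / total for c in children}
```

With two siblings where "a" has 1 leaf and score 9 and "b" has 3 leaves and score 1, sizing by leaves gives "a" a quarter of the space, while sizing by score gives it nine tenths. The first result is the same for every document; the second changes with each document's counts.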
25. Font Size Mapping
Size to fit cell
+ maximize legibility
– short words have huge font
Font size proportional to cell size
+ short words not more prominent
– small maximum size to accommodate long words
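A rough sketch of the two font-sizing strategies, assuming rectangular cells and a fixed average glyph width. CHAR_ASPECT, fit_size and proportional_sizes are hypothetical names, and the 0.6 ratio is an illustrative guess rather than a measured value.

```python
CHAR_ASPECT = 0.6  # assumed average glyph width as a fraction of font size


def fit_size(label, cell_w, cell_h):
    """Largest font size at which the label fits inside its cell."""
    return min(cell_h, cell_w / (CHAR_ASPECT * len(label)))


def proportional_sizes(cells):
    """Use one shared ratio of font size to cell height, set by the tightest label.

    cells: list of (label, cell_w, cell_h). Returns a list of font sizes.
    """
    ratio = min(fit_size(label, w, h) / h for label, w, h in cells)
    return [ratio * h for _, _, h in cells]
```

In equal-sized cells, fit_size lets a short label such as "cat" grow much larger than a long one, whereas proportional_sizes gives both the same size, capped by what the longest label can fit.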
26. Inclusion of Zero-count Words
+ provides context (what is not in document)
– more cluttered
35. Unexpected Uses
Language Education
“invaluable potential for writing and vocabulary development at the secondary level”
“I'm very interested in using the program, I'm an English teacher”
38. Features of Document Visualization
Semantic: indicate meaning
Cluster: generalize into concepts
Overview: provide quick gist
Zoom: support varying level of detail
Compare: multi-document comparisons
Search: find specific words/phrases
Read: drill-down to original text
Pattern: reveal patterns of repetition
Features: reveal extracted features such as emotion
Suggest: automatically select interesting focus words
Phrases: can show multi-word phrases
All words: can show all parts of speech
45. DocuBurst Comparative Views
Embed small multiples in e-libraries
Colour scale based on text difference
From each other
From corpus average
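The slide does not say how text difference is measured. One simple possibility, shown here purely as an assumption, is the per-term difference in relative frequency, taken either against another document or against the mean over a corpus. All function names are hypothetical.

```python
def relative_freq(counts):
    """Convert raw counts to fractions of the document total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def corpus_average(freqs):
    """Mean relative frequency of each term across a list of documents."""
    keys = set().union(*freqs)
    return {k: sum(f.get(k, 0.0) for f in freqs) / len(freqs) for k in keys}


def difference(freq_a, freq_b):
    """Per-term difference in relative frequency (a minus b)."""
    keys = set(freq_a) | set(freq_b)
    return {k: freq_a.get(k, 0.0) - freq_b.get(k, 0.0) for k in keys}
```

A diverging colour scale could then map positive values (over-represented terms) and negative values (under-represented terms) to opposite hues.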
47. Simplification
Root suggestion
How to know where to start exploring?
Word sense disambiguation
Attempt to select a sense
Use a less detailed ontology
48. Thanks for your Attention!
Acknowledgements:
Ravin Balakrishnan and helpful reviewers.
Contact: ccollins@cs.utoronto.ca
EuroVis 2009 Christopher Collins, Sheelagh Carpendale, and Gerald Penn