Zoss High-Level Text Analysis and Techniques

Duke University Libraries, Digital Scholarship
Text > Data, October 25

HIGH-LEVEL TEXT ANALYSIS
AND TECHNIQUES
Angela Zoss
Data Visualization Coordinator
226 Perkins Library
angela.zoss@duke.edu

How I learned to love the
document.
B.A. courses: Linguistics, Communication

M.S. courses: Communication, Human-Computer
Interaction

Employment: arXiv.org Administrator
• Bibliometrics/Scientometrics
Ph.D. courses:
• Computer Mediated Discourse Analysis
• Latent Structure Analysis
• Natural Language Processing

Text analysis from…
• documents down to words (“low-level”)
• words up to documents (“high-level”)

Using documents to learn about
language
(or other social phenomena)
Analyzing documents as records/proxies of
language, social structures, events, etc.

Linguistic studies:
morphology, word counts, syntax, etc. …
over time (e.g., Google Ngram Viewer)
language across corpora (e.g., political
speeches)

Underwood, T. (2012). Where to start with text mining.
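
As a rough illustration of comparing language across corpora, the sketch below counts how often selected pronouns occur per 1,000 tokens in two small collections. The corpora, the tokenizer, and the function name `relative_frequencies` are all hypothetical stand-ins; a real study would load full transcripts and make more careful tokenization choices.

```python
import re
from collections import Counter

def relative_frequencies(texts, targets):
    """Return occurrences per 1,000 tokens for each target word."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on an empty corpus
    return {w: 1000 * counts[w] / total for w in targets}

# Hypothetical toy corpora standing in for two sets of speeches.
corpus_a = ["We will build this together.", "We believe in our future."]
corpus_b = ["I promise you results.", "I alone can fix it, and I will."]

pronouns = ["we", "i"]
freq_a = relative_frequencies(corpus_a, pronouns)
freq_b = relative_frequencies(corpus_b, pronouns)
for w in pronouns:
    print(f"{w}: corpus A {freq_a[w]:.1f}, corpus B {freq_b[w]:.1f} per 1,000 tokens")
```

Normalizing by corpus size is what makes the two collections comparable; raw counts would mostly reflect how much text each one contains.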

language
Historical culturomics of pronoun frequencies

language
Universal properties of mythological networks

Using language to learn about
documents
Analyzing documents as artifacts themselves, with
their own properties and dynamics

Literary, documentary studies:
Structural/rhetorical/stylistic analysis
Document categorization, classification
Detecting clusters of document features (topic
modeling)

Underwood, T. (2012). Where to start with text mining.

documents
Literary Empires, Mapping Temporal and
Spatial Settings in Swinburne

documents
Using Word Clouds for Topic Modeling Results

What are documents?
For this discussion,
digital versions of works of
spoken or written language
Examples:
books, articles, transcripts, emails, tweets…

Documents as context
Documents have:
• form(at)
• style
• provenance
• entities
• intentions

Why study documents?
• Describe a corpus
• Compare/organize documents
• Locate relevant information/filter out
irrelevant information

Describing a corpus
• Finding regularities/differences across
groups of documents
• Developing theories of structure, style, etc.
that can then be tested or applied
• May be manual (content analysis) or
computer-assisted (statistical)
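
One way the computer-assisted route might look in practice is to compute a few simple measures per document and average them by group. The sketch below is only an illustration: the groups, the sample texts, and the choice of measures (token count, type-token ratio, mean sentence length) are assumptions, not methods taken from the talk.

```python
import re
from statistics import mean

def describe(text):
    """Simple descriptive measures for one non-empty document."""
    # Naive sentence splitting on ., ! and ? (an assumption for illustration).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "tokens": len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }

# Hypothetical groups of documents.
groups = {
    "fiction": ["The sea was calm. The sea was grey."],
    "report": ["Sales rose sharply in the third quarter of the year."],
}

measures = ("tokens", "type_token_ratio", "mean_sentence_length")
summary = {
    name: {key: mean(describe(doc)[key] for doc in docs) for key in measures}
    for name, docs in groups.items()
}
for name, stats in summary.items():
    print(name, stats)
```

Differences in such measures across groups can suggest hypotheses about structure or style, which then need to be tested on more data.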

Example: Storylines

http://xkcd.com/657/

Differences of
format, genre, participants…
• Articles may have sections, but these will
vary by discipline and type of article
• Books may be fiction or non-fiction (or
both)
• Transcripts may refer to multiple speakers,
non-text content
• …ad infinitum

Example: Literature
Fingerprinting

Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE
Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122). doi:10.1109/VAST.2007.4389004

Organizing documents
Detect similarity between documents and a
known category (or simply among
themselves)

Supports browsing, sentiment
analysis, authorship detection
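
A common way to detect similarity among documents is to represent each one as a bag of word counts and compare the resulting vectors with cosine similarity. The sketch below assumes that representation; the sample documents and function names are hypothetical, and other representations (weighted terms, n-grams, entities) would follow the same pattern.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
docs = {
    "whale_1": "the whale swam under the ship",
    "whale_2": "the ship chased the whale",
    "garden": "roses bloom in her garden",
}
vectors = {name: bag_of_words(text) for name, text in docs.items()}
sim_related = cosine_similarity(vectors["whale_1"], vectors["whale_2"])
sim_unrelated = cosine_similarity(vectors["whale_1"], vectors["garden"])
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

Note that the frequent word "the" contributes heavily to the first score; removing or down-weighting very common words is a typical refinement.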

Example: Bohemian Bookshelf

Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book
Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, to appear.

Similarity based on…
• common document attributes (authorship, genre)
• common language patterns (topics, phrases)
• common entity references (characters, citations)

Example: Quantitative
Formalism

Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An
experiment. Pamphlets of the Stanford Literary Lab (vol. 1).

Example: Clinton’s DNC Speech

http://b.globe.com/TogUqq

Example: View DHQ

http://digitalliterature.net/viewDHQ/vis3.html

Classification
• assigning an object to a single class
• often supervised, using an existing
classification scheme and a tagged corpus
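
To make the supervised case concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, trained on a tiny tagged corpus and returning exactly one class per document. Naive Bayes is just one possible technique chosen for illustration; the labels, training texts, and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(tagged_docs):
    """tagged_docs: list of (text, label) pairs from an existing scheme."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for text, label in tagged_docs:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the single most probable label for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are skipped
                smoothed = (word_counts[label][w] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical tagged corpus.
tagged = [
    ("the senate passed the budget bill", "politics"),
    ("voters elected a new senator", "politics"),
    ("the team won the final match", "sports"),
    ("the striker scored a late goal", "sports"),
]
model = train(tagged)
label = classify("the senator proposed a bill", model)
print(label)
```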

Example: Relative signatures

Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level
of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012
(pp. 103-112).

Categorization
• assigning documents to one or more
categories
• suggestive of unsupervised clustering
techniques
• design choices made to fit particular tasks
or goals
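
By contrast with single-class assignment, a categorization scheme may place a document in several categories at once. The sketch below assumes a set of category term profiles (hypothetical here, though in practice they might be derived from a clustering step) and assigns a document to every category whose terms make up at least a chosen share of its tokens. The threshold is exactly the kind of design choice that depends on the task.

```python
import re

# Hypothetical category profiles; hand-written for illustration only.
profiles = {
    "biology": {"cell", "protein", "gene", "species"},
    "computing": {"algorithm", "network", "data", "model"},
    "medicine": {"patient", "clinical", "disease", "treatment"},
}

def categorize(text, profiles, threshold=0.1):
    """Return every category whose terms are at least `threshold` of the tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return []
    shares = {
        name: sum(t in terms for t in tokens) / len(tokens)
        for name, terms in profiles.items()
    }
    return sorted(name for name, share in shares.items() if share >= threshold)

doc = "a network model of gene and protein interaction data"
loose = categorize(doc, profiles, threshold=0.1)
strict = categorize(doc, profiles, threshold=0.3)
print(loose, strict)
```

A lower threshold favors recall (more categories per document), while a higher one yields fewer, more confident assignments.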

Example: UCSD Map of
Science

Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., &
Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS
ONE, 7(7), e39464.

Example: NIH Map Viewer

https://app.nihmaps.org/nih/browser/

Reference
systems, infrastructure
What do we gain by adding structure?

What do we lose?

Text is only one component of a document.

Research questions often push us to be
creative with how we operationalize
constructs.

The richness of language and documents is
best preserved by using
multiple, complementary approaches.

QUESTIONS?
angela.zoss@duke.edu
