This document discusses techniques for analyzing text and documents. It begins by introducing Angela Zoss and her background in linguistics, human-computer interaction, and natural language processing. It then discusses analyzing text at both the low level of individual words and the high level of full documents. Examples are given of using documents to learn about language patterns over time and across corpora, as well as using language to learn about the structure and topics of documents. The document focuses on describing, organizing, and classifying documents through both manual and automated methods. Examples described include studying pronoun frequencies, mythological networks, literary fingerprints, sentiment analysis, and classification schemes like the UCSD Map of Science and NIH Map Viewer.
1. Duke University Libraries, Digital Scholarship
Text > Data, October 25
HIGH-LEVEL TEXT ANALYSIS
AND TECHNIQUES
Angela Zoss
Data Visualization Coordinator
226 Perkins Library
angela.zoss@duke.edu
4. How I learned to love the
document.
B.A. courses: Linguistics, Communication
M.S. courses: Communication, Human-Computer
Interaction
Employment: arXiv.org Administrator
Ph.D. courses:
• Bibliometrics/Scientometrics
• Computer Mediated Discourse Analysis
• Latent Structure Analysis
• Natural Language Processing
6. Text analysis from…
• documents down to words (“low-level”)
• words up to documents (“high-level”)
7. Using documents to learn about
language
(or other social phenomena)
Analyzing documents as records/proxies of
language, social structures, events, etc.
Linguistic studies:
morphology, word counts, syntax, etc. …
over time (e.g., Google ngram viewer)
language across corpora (e.g., political
speeches)
Underwood, T. (2012). Where to start with text mining.
8. Using documents to learn about
language
Historical culturomics of pronoun frequencies
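A minimal sketch of this kind of study, assuming a toy corpus of (year, text) pairs invented here for illustration: count how often each pronoun occurs per year, normalized by the number of tokens in that year. The pronoun list, the tokenizer, and the function names are assumptions, not the method of the study shown on the slide.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs invented for illustration.
DOCS = [
    (1900, "He said that he would go. They agreed with him."),
    (1900, "She wrote to them and he replied."),
    (2000, "I think we should go. You and I can decide."),
]

# Assumed, deliberately short list of subject pronouns.
PRONOUNS = {"i", "we", "you", "he", "she", "they"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_rates_by_year(docs):
    """Return {year: {pronoun: share of that year's tokens}}."""
    totals = Counter()  # total tokens per year
    hits = {}           # pronoun counts per year
    for year, text in docs:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        counts = hits.setdefault(year, Counter())
        counts.update(t for t in tokens if t in PRONOUNS)
    return {
        year: {p: hits[year][p] / totals[year] for p in sorted(PRONOUNS)}
        for year in hits
    }


rates = pronoun_rates_by_year(DOCS)
for year in sorted(rates):
    print(year, rates[year])
```

Normalizing by total tokens matters because the amount of text available usually differs from year to year.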
9. Using documents to learn about
language
Universal properties of mythological networks
10. Using language to learn about
documents
Analyzing documents as artifacts themselves, with
their own properties and dynamics
Literary, documentary studies:
Structural/rhetorical/stylistic analysis
Document categorization, classification
Detecting clusters of document features (topic
modeling)
Underwood, T. (2012). Where to start with text mining.
11. Using language to learn about
documents
Literary Empires, Mapping Temporal and
Spatial Settings in Swinburne
12. Using language to learn about
documents
Using Word Clouds for Topic Modeling Results
13. What are documents?
For this discussion,
digital versions of works of
spoken or written language
Examples:
books, articles, transcripts, emails, tweets…
16. Why study documents?
• Describe a corpus
• Compare/organize documents
• Locate relevant information/filter out
irrelevant information
17. Describing a corpus
• Finding regularities/differences across
groups of documents
• Developing theories of structure, style, etc.
that can then be tested or applied
• May be manual (content analysis) or
computer-assisted (statistical)
19. Differences of
format, genre, participants…
• Articles may have sections, but these will
vary by discipline and type of article
• Books may be fiction or non-fiction (or
both)
• Transcripts may refer to multiple speakers,
non-text content
• …ad infinitum
20. Example: Literature
Fingerprinting
Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122). doi:10.1109/VAST.2007.4389004
21. Organizing documents
Detect similarity between documents and a
known category (or simply among
themselves)
Supports browsing, sentiment
analysis, authorship detection
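One common way to detect similarity among documents is cosine similarity over word-count vectors. The sketch below is illustrative only: the tokenizer and the example sentences are assumptions, and real projects often weight terms (e.g., tf-idf) before comparing.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as a word -> count mapping."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical example documents.
doc_a = bag_of_words("The ship sailed across the sea.")
doc_b = bag_of_words("A ship crossed the stormy sea.")
doc_c = bag_of_words("Quarterly revenue exceeded forecasts.")

print(cosine_similarity(doc_a, doc_b))  # shares "ship", "the", "sea"
print(cosine_similarity(doc_a, doc_c))  # no shared words
```

The same function compares a document with a known category if the category is represented by the combined counts of its example documents.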
22. Example: Bohemian Bookshelf
Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.
23. Similarity based on…
• common document attributes
authorship, genre
• common language patterns
topics, phrases
• common entity references
characters, citations
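The three bases for similarity listed above can be kept separate rather than merged into one score. A rough sketch, assuming each document has already been reduced to three sets of features (the facet names and the example values are hypothetical), compares documents facet by facet with the Jaccard index:

```python
def jaccard(a, b):
    """Share of features the two sets have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def facet_similarity(doc_a, doc_b, facets=("attributes", "topics", "entities")):
    """Compare two documents separately on each kind of feature."""
    return {f: jaccard(doc_a.get(f, ()), doc_b.get(f, ())) for f in facets}


# Hypothetical documents, already reduced to feature sets.
book_a = {
    "attributes": {"genre:novel", "author:A"},
    "topics": {"sea", "travel", "empire"},
    "entities": {"London", "Paris"},
}
book_b = {
    "attributes": {"genre:novel", "author:B"},
    "topics": {"sea", "war"},
    "entities": {"London"},
}

scores = facet_similarity(book_a, book_b)
print(scores)
```

Keeping the facets apart lets a browsing interface show why two documents are related, not only that they are.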
24. Example: Quantitative
Formalism
Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An
experiment. Pamphlets of the Stanford Literary Lab (vol. 1).
27. Classification
• assigning an object to a single class
• often supervised, using an existing
classification scheme and a tagged corpus
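As an illustration of supervised classification, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing on a tagged corpus and assigns each new document to exactly one class. The training sentences, the class labels, and the function names are invented for this example; naive Bayes is one common choice among many.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(tagged_docs):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in tagged_docs:
        tokens = tokenize(text)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the single most probable class for the text."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior plus smoothed log likelihood of each known word.
        score = math.log(class_counts[label] / n_docs)
        total = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tagged corpus following an existing two-class scheme.
TAGGED = [
    ("the cell divides and the protein folds", "biology"),
    ("genes encode a protein", "biology"),
    ("the court ruled on the appeal", "law"),
    ("the judge heard the appeal in court", "law"),
]

model = train(TAGGED)
print(classify("the protein in the cell", model))
print(classify("the judge ruled", model))
```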
28. Example: Relative signatures
Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level
of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012
(pp. 103-112).
29. Categorization
• assigning documents to one or more
categories
• suggestive of unsupervised clustering
techniques
• design choices made to fit particular tasks
or goals
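To make the contrast with classification concrete, here is a rough sketch of unsupervised, overlapping categorization: no categories are defined in advance, and a document joins every category whose first member it resembles closely enough. The greedy procedure, the Jaccard measure, and the threshold value are all assumptions chosen for brevity; the threshold is exactly the kind of design choice that depends on the task.

```python
import re


def terms(text):
    """Reduce a document to its set of distinct lowercase words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize(docs, threshold=0.2):
    """Group documents into possibly overlapping categories.

    Returns one list of document indices per category. A document that
    matches no existing category starts a new one.
    """
    seeds = []    # term set of the first document in each category
    members = []  # document indices per category
    for i, text in enumerate(docs):
        t = terms(text)
        matched = [k for k, seed in enumerate(seeds) if jaccard(t, seed) >= threshold]
        if not matched:
            seeds.append(t)
            members.append([i])
        else:
            for k in matched:
                members[k].append(i)
    return members


# Hypothetical documents; the third one straddles two subject areas.
DOCS = [
    "network graph nodes edges",
    "poem verse rhyme meter",
    "graph nodes poem verse",
]

categories = categorize(DOCS)
print(categories)
```

Raising the threshold yields more, narrower categories; lowering it yields fewer, broader ones with more overlap.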
30. Example: UCSD Map of
Science
Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., &
Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS
ONE, 7(7), e39464.
34. Text is only one component of a document.
Research questions often push us to be
creative with how we operationalize
constructs.
The richness of language and documents is
best preserved by using
multiple, complementary approaches.