A text extraction workshop delivered by Cameron Buckner on Friday, October 18th, 2012 as part of the University of Houston Digital Humanities Initiative.
DH TOOLS: Introduction to Text Analysis Using BEAGLE and MDS
1. DH TOOLS
Introduction to Text Analysis
Cameron Buckner
Visiting Assistant Professor
Department of Philosophy
cjbuckner@uh.edu
2. Our Initiative
• Promote, facilitate, interact
• Reading group
• Tools workshops
• Speaker series
• Grantwriting support
• Infrastructure advocacy
http://www.uh.edu/class/digitalhumanities/
3. Roadmap
Goal today: Analyze texts using cutting-edge methods
from computational psycholinguistics with an off-the-shelf
tool, word2word
1. What can you do with text analysis?
2. A little bit of theory: Semantic spaces
3. BEAGLE: The holographic lexicon
4. MDS: Visualizing multidimensional networks
5. Examples
6. Hands-on play
4. What is DH?
• Computation and interpretation
• The use of computational tools for the
production, exploration, analysis, and
dissemination of humanistic knowledge
• Common thread between new and old:
pattern recognition
• Includes
• Digitization and archiving, markup
• Analysis & visualization
• Search & dissemination
• Pedagogy
5. Methods of Text Analysis I
• Statistical analysis, information extraction, machine
learning
• Syntactic: word frequencies (Google n-grams), vocabulary
usage, stylometry (authorship and genre), PageRank
http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
6. Methods of Text Analysis II
• Semantic: tf-idf, latent semantic analysis, latent Dirichlet
allocation, entropy-based measures, ontologies
• Aim to model relevance, semantic similarity, taxonomic
relationships, object properties and relations
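As a concrete instance of the first semantic measure listed above, here is a minimal sketch of tf-idf. It assumes one common variant (term frequency times log(N / document frequency)) and whitespace tokenization; other weightings exist. The function name and the toy documents are invented for illustration and are not taken from the workshop materials.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy documents, invented for illustration only.
toy_docs = [
    "jobs taxes taxes growth",
    "jobs health care health",
    "jobs energy energy climate",
]
toy_scores = tf_idf(toy_docs)
# Highest-scoring term in each document.
top_terms = [max(s, key=s.get) for s in toy_scores]
```

A term that appears in every document ("jobs" here) scores zero, which is why tf-idf is used to model relevance rather than raw frequency.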
7. Reminders
• Be creative and have fun, but if you want to publish…
• Be principled:
• Junk in, junk out
• Always know assumptions required by a method
• Analyses should hold up under trivial transformations of data
representation
• Be prepared for pragmatic design decisions
• Go in with hypotheses and structured questions
• Confirm with careful humanistic interpretation
8. The Mental Lexicon
• A “mental dictionary”
• Contains information about:
• Word meaning, grammatical roles, taxonomic relations, typical
properties
• Behavioral indicators: recognition speed, synonymy and relevance
judgments, priming, frequency effects, categorization
9. BEAGLE
• A model that learns (unsupervised) a holographic mental
lexicon automatically from text
• History: Two approaches to semantic analysis
• Co-occurrence based measures (“bag of words”, LSA, tf-idf)
• Good at determining relevance, bad at determining roles and
relations
• Order-based measures (n-gram models, generative
grammars, hidden Markov models)
• Good at identifying grammatical and structural relations, bad at
identifying relevance and meaning
• Challenge: Can the two be combined?
10. Context + Role
• Assumption: People acquire an
idiosyncratic mental lexicon from
patterns of co-occurrence and syntactic
relationships they encounter in natural
language.
• “You shall know a word by both the
company it keeps and how it keeps it.”
• Goal: If we could build a representation
of a text’s context/role distributions, we
could predict the structure of a mental
lexicon that produced a corpus and/or
that would be produced by it
• Texts as “mental fingerprints”
12. Basic Vector Approach
1. Start with a multi-dimensional vector space
2. Each term's meaning is initially represented by a random,
constant environment vector and an empty memory
vector
3. Associations between terms can be represented by adding or
averaging their environment vectors into their memory
vectors
4. Each time terms co-occur, their memory vectors become
closer in multi-dimensional similarity space
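The four steps above can be sketched in a few lines of Python. This is an illustrative toy, not the word2word implementation: the dimensionality, the Gaussian initialization, the sentence-level window, and the toy corpus are all assumptions made here. Each word's memory vector accumulates the environment vectors of the other words in its sentence, so words that appear in similar contexts end up with similar memory vectors.

```python
import math
import random

DIMENSIONS = 512  # assumed size; the slides do not fix a value

def random_vector(rng, n=DIMENSIONS):
    # Gaussian elements with variance 1/n, so vectors have length near 1.
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def learn_context(sentences, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s.split()})
    environment = {w: random_vector(rng) for w in vocab}  # fixed after creation
    memory = {w: [0.0] * DIMENSIONS for w in vocab}       # starts empty
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    # Add the co-occurring word's environment vector.
                    memory[word] = [m + e for m, e in
                                    zip(memory[word], environment[other])]
    return memory

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    "cat chases mouse", "dog chases mouse",
    "cat eats food", "dog eats food", "cat purrs softly",
    "stock market falls", "bond market falls",
]
toy_memory = learn_context(toy_corpus)
sim_cat_dog = cosine(toy_memory["cat"], toy_memory["dog"])
sim_cat_stock = cosine(toy_memory["cat"], toy_memory["stock"])
```

In this toy corpus "cat" and "dog" share most of their contexts and come out far more similar than "cat" and "stock", which share none.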
13. Representing Order Info
• Convolution: compresses the outer-product matrix of two
term vectors into a single vector that retains recoverable
information about both
• Example: z = x * y
• Association vector z contains information about both x and
y
• Can (approximately) reconstruct source vector y by probing
z (deconvolution) with x (and vice versa)
• Combined BEAGLE memory vector: Context memory
comes from vector addition, and order information comes
from n-gram binding using convolution
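The binding and probing steps above can be illustrated with plain circular convolution and its approximate inverse, circular correlation. This is a sketch under assumptions: the vector size and Gaussian initialization are chosen here, and plain circular convolution is commutative, so it does not reproduce the non-commutative binding that BEAGLE uses for order information.

```python
import math
import random

def random_vector(rng, n):
    # Gaussian elements with variance 1/n, so vectors have length near 1.
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def circular_convolution(x, y):
    """Bind x and y: z[i] = sum over j of x[j] * y[(i - j) mod n]."""
    n = len(x)
    return [sum(x[j] * y[(i - j) % n] for j in range(n)) for i in range(n)]

def circular_correlation(x, z):
    """Probe z with x: returns a noisy reconstruction of y."""
    n = len(x)
    return [sum(x[j] * z[(i + j) % n] for j in range(n)) for i in range(n)]

rng = random.Random(0)
size = 512  # assumed dimensionality
x = random_vector(rng, size)
y = random_vector(rng, size)
unrelated = random_vector(rng, size)

z = circular_convolution(x, y)        # association vector z = x * y
y_probe = circular_correlation(x, z)  # approximate reconstruction of y

recovered = cosine(y_probe, y)
baseline = cosine(y_probe, unrelated)
```

The reconstruction is only approximate: `recovered` is well below 1 but clearly above `baseline`, which is enough to identify y by comparing the probe result against the known vectors.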
16. Combined Memory Vector
• m = memory vector
• e = initial random environment vector
• p = position in sentence
• lambda = constant chunking factor (size of n-gram window)
• bind i,j = non-commutative convolution of a constant order
vector with the other environment vectors in the n-gram
20. So, the BEAGLE method
1. Choose number of dimensions for vector space, size of
n-gram window for order info
2. Clean up source documents using standard NLP (stop
words, stemmers, etc.)
3. Learn context and order vectors from corpus, combine
4. Select words of interest
5. Visualize multi-dimensional space using favorite
method (e.g. MDS)
21. Limitations of BEAGLE
• Only considers 1-sentence windows
• Lexical ambiguity
• Valence (e.g. synonyms, antonyms)
22. MDS
• A way to view a multi-dimensional similarity space
• Collapses multi-dimensional space in a way that tries to
preserve the pairwise distances between vectors
• Collapsing dimensions often reveals the most significant
[higher-order] dimensions
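To make the idea of "collapsing while preserving distances" concrete, here is a minimal sketch of metric MDS by gradient descent on stress (the sum of squared differences between target distances and layout distances). It is one simple variant chosen for illustration, not necessarily the algorithm used in the workshop tool; the step size, iteration count, and toy points are assumptions.

```python
import math
import random

def stress(target, coords):
    """Sum of squared gaps between layout distances and target distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def random_layout(n, dims=2, seed=0):
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(n)]

def mds(target, start, steps=500, rate=0.5):
    """Move each point downhill on stress; returns a new list of coordinates."""
    coords = [list(p) for p in start]
    n, dims = len(coords), len(coords[0])
    for _ in range(steps):
        for i in range(n):
            grad = [0.0] * dims
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-12
                pull = (d - target[i][j]) / d  # > 0 too far apart, < 0 too close
                for k in range(dims):
                    grad[k] += pull * (coords[i][k] - coords[j][k])
            for k in range(dims):
                coords[i][k] -= (rate / n) * grad[k]
    return coords

# Hypothetical 3-D points, invented for illustration, flattened to 2-D.
toy_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
toy_target = [[math.dist(p, q) for q in toy_points] for p in toy_points]
toy_start = random_layout(len(toy_points))
toy_layout = mds(toy_target, toy_start)
```

Because five points in general position in three dimensions cannot be laid out perfectly in two, the final stress is lower than the starting stress but not zero; the leftover stress is a rough measure of how much the 2-D picture distorts the original space.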
23. Uses
• How do two academic reference works compare in their
coverage of a discipline?
• Biases? Overlap?
[Figure: InPhO-Semantics. Credit: Robert Rose]
25. Political rhetoric
• What can we learn from the “semantic space” derived
from a party or candidate’s rhetoric?
• Central issues?
• Key comparisons?
• Ideological focus/big tent?
• Location on ideological spectrum?
• Example: compare speeches from Republican and
Democratic political conventions
26. Heat Map: Terms most diagnostic of a speech’s being delivered by a Democrat
“Hotter” indicates more diagnostic in comparison. Hottest terms =
aarp, experience, affordable, abuelo, billionaires, afghanistan, beijing, biofuels, aliens
27. Character Analysis
• Moretti: “protagonist is the character that minimized the
sum of the distances to all other vertices”
• (But Moretti did it by hand!)
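The criterion in the Moretti quotation is easy to automate. The sketch below treats a character network as an unweighted, connected graph and uses breadth-first search to sum each character's shortest-path distances to all others. The network and its labels are invented for illustration and do not come from any actual play.

```python
from collections import deque

def distances_from(graph, start):
    """Shortest-path lengths (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def protagonist(graph):
    """Return the node with the smallest total distance, plus all totals."""
    totals = {node: sum(distances_from(graph, node).values()) for node in graph}
    return min(totals, key=totals.get), totals

# Hypothetical character network: an edge means two characters interact.
toy_network = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["D"],
}
lead, totals = protagonist(toy_network)
```

This assumes every character can be reached from every other; in a disconnected network the sums would silently ignore unreachable characters and the comparison would no longer be meaningful.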
You shall know a word by the company it keeps
You shall know a word by the company it keeps and how it keeps it
If you have a photographic medium that can record and reproduce not only the amount of light that strikes it but also its direction, then you can represent multiple dimensions of an object simultaneously and recall the desired dimensions by shining a reconstruction beam at different angles.
Snow/slow
SEP = Stanford Encyclopedia of Philosophy, IEP = Internet Encyclopedia of Philosophy
Analysis method = LSA
Analysis computed on composite transcripts from 2012 Democratic and Republican national conventions.