Mdst3703 2013-09-17-text-models
1. Text Models and Markup
Prof. Alvarado
MDST 3703
17 September 2013
2. Business
• Plan B: If Home Directory is not working for
you, please use the Hive
– Go to http://its.virginia.edu/hive/connected.html
– Install VMWare Client
– Use Notepad++
– Home Directory is linked on your Desktop (also available as the J drive)
• Tutorials
– If you feel lost about HTML let me know
3. Review 1: Textual Signals
• Each of the authors last week viewed the text
as a kind of signal
• A signal is a pattern that contains messages
• Messages can be grasped through parsing the
signal
• What were the messages? How were they
parsed?
4. Text can be viewed as a long signal consisting of characters selected from a common set of characters
5. A model of communication.
Messages get converted into signals and back into messages
by means of a shared code.
ENCODING DECODING
SHARED CODE
Person 1 Person 2
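The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `SHARED_CODE` and the functions `encode` and `decode` are hypothetical names, not part of the lecture.

```python
# A toy "shared code": both parties hold the same table (hypothetical example).
SHARED_CODE = {"yes": "1", "no": "0", "maybe": "?"}

def encode(message_words, code):
    """Person 1: convert a message (list of words) into a signal (a string)."""
    return "".join(code[word] for word in message_words)

def decode(signal, code):
    """Person 2: parse the signal back into a message using the same code."""
    reverse = {symbol: word for word, symbol in code.items()}
    return [reverse[symbol] for symbol in signal]

signal = encode(["yes", "no", "maybe"], SHARED_CODE)
message = decode(signal, SHARED_CODE)
```

Decoding only works because both functions receive the same table; with a different table the signal would still be a pattern, but the message would be lost.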
6. Author Parsed elements Decoded message
Levi-Strauss Relations and
bundles
Structural
oppositions
Colby Thesaurus words Thematic patterns
Ramsay Scenes Genres
7. Text is like this. This
is a map of DC
generated by
thousands of
individual Flickr and
Twitter events.
The picture is a kind
of signal—collective
and unconscious, yet
meaningful.
The patterns
discerned from the
signals are not
intentional, but they
are the products of
intentional activity.
http://anthonyflo.tumblr.com/post/7590868323/photographer-and-self-described-geek-of-maps
8. Review 2: Semantic HTML
• Also called POSH (“Plain Old Semantic HTML”)
• The use of HTML to describe a text, not to
format it (CSS is used to format)
• DIV, SPAN, CLASS, and ID are general purpose
tools to provide more flexible markup
• What kinds of things can POSH be used to
describe?
9. Segue
Semantic markup may be used to support the
analysis of each of our authors—including
Aristotle
Aristotle: Elements of drama, Elements of plot
<div class="plot-element" id="reversal-of-fortune"> ... </div>
Levi-Strauss: Relations and Bundles in myths
<span class="relation"> ... </span>
Colby: Theme words in folktales
<span class="antagonism">fight</span>
Ramsay: Scenes in plays
<div class="scene"> ... </div>
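Once a text carries markup like this, the marked elements can be pulled out by program. The following is a minimal sketch using Python's standard `html.parser` module; the class `ClassCollector` and the sample sentence are hypothetical, and the sketch assumes the matching elements contain no void tags such as `<br>`.

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collect the text found inside elements that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0    # greater than 0 while inside a matching element
        self.found = []   # text of each matching element, in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1           # a new match begins
            self.found.append("")

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.found[-1] += data

# Hypothetical folktale fragment marked up in the style of the Colby example.
sample = (
    '<p>The brothers began to <span class="antagonism">fight</span>, '
    'and the <span class="antagonism">quarrel</span> lasted all winter.</p>'
)

collector = ClassCollector("antagonism")
collector.feed(sample)
theme_words = collector.found
```

The same collector works for any of the class names above, for example `ClassCollector("scene")` or `ClassCollector("relation")`.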
10. Let’s step back and look more
closely at “text”
Let’s look at some material examples
21. Documents have three Levels:
Structure, Content, Style
Structure
The organization of content into units (elements)
and logical relationships (e.g. reading order)
Content
TEXT, images, video clips, etc.
Style
Screen and print layout
Fonts, colors, etc.
22. Descriptive markup languages allow
us to define structure of documents
for computational purposes
Theoretically, they do not specify
layout or content
26. Document Elements and Structures
Play
– Act +
• Scene +
– Line +
Book
– Chapter +
• Verse +
Letter
– Heading
• Return Address
• Date
• Recipient Info
– Name
– Title
– Address
– Content
• Salutation
• Paragraph +
• Closing
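A nested structure like Play > Act > Scene > Line can be built directly as a tree of elements. The sketch below uses Python's standard `xml.etree.ElementTree`; the lowercase element names and the placeholder line text are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Build Play > Act > Scene > Line with placeholder content.
play = ET.Element("play")
act = ET.SubElement(play, "act", n="1")
scene = ET.SubElement(act, "scene", n="1")
for text in ["First line.", "Second line."]:
    line = ET.SubElement(scene, "line")   # "Line +": one or more lines
    line.text = text

# Serialize the tree to markup and count the lines it contains.
xml_text = ET.tostring(play, encoding="unicode")
line_count = len(play.findall("./act/scene/line"))
```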
28. XML is a markup
language
It is a more powerful
system for semantic
markup than POSH
29. What is XML?
• Stands for eXtensible Markup Language
– Actually invented after the web
– A simplification of SGML, the language used to create
HTML
– It specifies a set of rules for creating specialized markup
languages such as HTML and TEI
• It is a simplified version of SGML
– Standard Generalized Markup Language
• SGML was invented in the early 1970s to wrest the
control of documents from computer people who were
taking over industries like law and accounting
31. XML looks like this
Notice how the element names reference units, not layout or style
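As an illustration, here is a small hypothetical XML document (not the one shown on the original slide) parsed with Python's standard `xml.etree.ElementTree`. The element names `poem`, `title`, `stanza`, and `line` are assumptions; they name units of the text and say nothing about fonts or layout.

```python
import xml.etree.ElementTree as ET

# A hypothetical poem; element names describe units, not appearance.
poem_xml = """
<poem>
  <title>An Example</title>
  <stanza n="1">
    <line>First line of the stanza</line>
    <line>Second line of the stanza</line>
  </stanza>
</poem>
"""

root = ET.fromstring(poem_xml)

# List every element name in document order.
element_names = [element.tag for element in root.iter()]
```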
33. XML Premises
1. All documents are composed of elements.
2. Elements contain content.
3. Elements have no layout.
4. Elements are hierarchically ordered.
5. Elements are to be indicated by “markup” –
tags that define the beginning and end of an
element
34. XML Markup Rules
• Tags signify structural elements
• Three kinds of tag
– Start and End, e.g. <p> and </p>
– Singleton, e.g. <br />
• Start and singleton tags can have attributes
– Simple key/value pairs
– <div class="stanza" style="color:red;">
• Basic rules
– All attributes must be quoted
– All tags must nest (no overlaps!)
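The two basic rules can be checked mechanically, since any XML parser rejects a document that breaks them. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the helper name `is_well_formed` and the three sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

nested = '<div class="stanza"><p>text</p></div>'        # follows both rules
overlapping = '<div class="stanza"><p>text</div></p>'   # tags overlap
unquoted = '<div class=stanza><p>text</p></div>'        # attribute not quoted

results = {
    "nested": is_well_formed(nested),
    "overlapping": is_well_formed(overlapping),
    "unquoted": is_well_formed(unquoted),
}
```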
36. XML also provides Document Types
• A Document Type Definition (DTD) defines a set of
tags and rules for using them
– Specifies elements, attributes, and possible
combinations
– E.g. in HTML, the ol and ul elements must contain li
elements
• A DTD is just one kind of schema system used by
XML
• Schemas express data models of/for texts
– TEI is a powerful way of describing primary source
materials for scholars
• Documents that use a schema properly are called
“valid”
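To make the idea of validity concrete, the sketch below checks one rule of the kind a DTD states: which child elements a parent may contain. It is a toy stand-in, not a DTD parser or a real validator; `CONTENT_MODEL` and `find_violations` are hypothetical names, and the check ignores other constraints a real schema would impose (for example, that a list contain at least one item).

```python
import xml.etree.ElementTree as ET

# Toy content model: parent element name -> allowed child element names.
CONTENT_MODEL = {"ol": {"li"}, "ul": {"li"}}

def find_violations(root, model):
    """Return (parent, child) pairs where a child is not allowed by the model."""
    violations = []
    for parent in root.iter():
        allowed = model.get(parent.tag)
        if allowed is None:
            continue  # the model says nothing about this element
        for child in parent:
            if child.tag not in allowed:
                violations.append((parent.tag, child.tag))
    return violations

valid_doc = ET.fromstring("<body><ul><li>one</li><li>two</li></ul></body>")
invalid_doc = ET.fromstring("<body><ol><p>stray paragraph</p></ol></body>")

valid_result = find_violations(valid_doc, CONTENT_MODEL)
invalid_result = find_violations(invalid_doc, CONTENT_MODEL)
```

Both sample documents are well-formed XML; only the first also satisfies the content model, which is the difference between "well-formed" and "valid".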
37. Originally, DTDs defined “genres”
like business letter or mortgage form
They were later used to define more
abstract models of textual content
38. XML is used everywhere
• HTML
– E.g. Embed codes
• TEI (Text Encoding Initiative)
• RSS
• Civilization IV
• Playlists (e.g. XSPF or “spiff”)
• Google Maps (KML)
39. The Text Encoding Initiative created
TEI to mark up scholarly documents
Mainly primary sources such as
books and manuscripts
40. TEI
• Written in XML (was SGML)
• The dominant language used to encode
scholarly text
• Scholars can select from a large set of
elements or define their own elements to match
what they are interested in
41. Examples
• The TEI Header
– http://tbe.kantl.be/TBE/examples/TBED02v00.htm
• TEI Prose
– http://tbe.kantl.be/TBE/examples/TBED03v00.htm
• Find others at the TEI By Example Project
– http://tbe.kantl.be/TBE/
42. XML and TEI both contain an
implicit theory of text
What is it?
43. OHCO
• XML (and therefore HTML and TEI) imply a
certain theory of text
– A text is an OHCO
• OHCO
– Ordered Hierarchy of Content Objects
• An OHCO is a kind of tree
– Elements follow each other in sequences
– Elements can contain other elements
45. OHCO allows for easy processing
• Every element has a precise address in the text
– E.g. HTML/body/p[1]
• Texts can be described in the language of kinship
– Ancestors, parents, siblings, children, etc.
• Texts can be restructured and manipulated by
known patterns and algorithms
– Traversing
– Pruning
– Cross-referencing
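These three properties (addresses, kinship, traversal) can be tried out with Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. The sample document and the helper `walk` are hypothetical; note that in ElementTree a path is written relative to the root element, so the address becomes `./body/p[1]`.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<p>First paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

# Address: find the first p inside body, relative to the root element.
first_p = doc.find("./body/p[1]")

# Kinship: ElementTree stores no parent links, so build a child -> parent map.
parent_of = {child: parent for parent in doc.iter() for child in parent}
siblings = list(parent_of[first_p])   # all children of first_p's parent

# Traversal: depth-first walk that records each element with its depth.
def walk(element, depth=0):
    yield depth, element.tag
    for child in element:
        yield from walk(child, depth + 1)

outline = list(walk(doc))
```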
49. <page n="2">
. . .
<p id="foo">His good looks and his rank had one fair
claim on his attachment, since to them he must have owed a
wife</p>
</page>
<page n="3">
<p id="bar" prev_id="foo"> a very superior character to
anything deserved by his own.</p>
. . .
</page>
Solution 1: Split Elements
50. <p>His good looks and his rank had one fair claim on
his attachment, since to them he must have owed a
wife <pb n="3" /> a very superior character to
anything deserved by his own.</p>
Solution 2: Use “Milestones”
One structure gets backgrounded
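The backgrounded structure is not lost: it can be recovered from the milestones. The sketch below splits the paragraph's text at each `<pb />` using Python's standard `xml.etree.ElementTree`. The function `text_by_page` is hypothetical, the starting page number "2" is taken from the split-element version of the same passage, and the sketch assumes the milestones are direct children of the paragraph with no other inline elements.

```python
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    '<p>His good looks and his rank had one fair claim on '
    'his attachment, since to them he must have owed a '
    'wife <pb n="3" /> a very superior character to '
    'anything deserved by his own.</p>'
)

def text_by_page(p, first_page):
    """Split a paragraph's text at <pb/> milestones into {page number: text}."""
    pages = {first_page: (p.text or "").strip()}
    for child in p:
        if child.tag == "pb":
            # Text that follows an empty element is stored in its .tail.
            pages[child.get("n")] = (child.tail or "").strip()
    return pages

pages = text_by_page(paragraph, first_page="2")
```

Here the paragraph is the foreground hierarchy and the pages are reconstructed on demand; with split elements the roles are reversed.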
57. A KR is a model that comprises
1. A set of categories (aka Ontology)
Names and relationships between names
2. A set of inference rules (aka Logic)
A method of traversing names and relations
3. A medium for computation
A medium for mechanically producing inferences
4. A language for expressing these things
Such as a programming or markup language
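A very small instance of these four parts, offered only as an illustration: the category names in `IS_A` and the function `ancestors` are hypothetical. The dictionary plays the role of the ontology, the function is the inference rule, the Python interpreter is the medium for computation, and Python itself is the language of expression.

```python
# 1. Ontology: names and "is-a" relations between them (hypothetical categories).
IS_A = {"sonnet": "poem", "poem": "text", "letter": "text"}

# 2. Inference rule: "is-a" is transitive, so follow the relation upward.
def ancestors(name, relations):
    """Return every category that `name` belongs to, nearest first."""
    found = []
    while name in relations:
        name = relations[name]
        found.append(name)
    return found

# 3. Mechanical inference: nobody stated that a sonnet is a text.
sonnet_kinds = ancestors("sonnet", IS_A)
```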
61. Tables are more rigid
Trees allow for indefinite depth
But tables are easier to manipulate
In any case, tables and trees are two
major kinds of data structure that
you will encounter …
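The trade-off can be seen by flattening a tree into a table. In this sketch, which uses Python's standard `xml.etree.ElementTree` and a hypothetical miniature play, every line becomes one row with the same three columns (act, scene, text).

```python
import xml.etree.ElementTree as ET

play = ET.fromstring(
    '<play>'
    '<act n="1"><scene n="1"><line>A</line><line>B</line></scene></act>'
    '<act n="2"><scene n="1"><line>C</line></scene></act>'
    '</play>'
)

# Flatten the tree into rows: one row per line, with fixed columns.
rows = [
    (act.get("n"), scene.get("n"), line.text)
    for act in play.findall("act")
    for scene in act.findall("scene")
    for line in scene.findall("line")
]

# Tables are easy to manipulate: filtering is a one-line comprehension.
act_one = [row for row in rows if row[0] == "1"]
```

The table fixes the depth at three columns. A play with an extra level, such as a prologue inside a scene, would fit the tree without changes but would require redesigning the table.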
63. A Proposed Model
• Texts are not documents
– Documents are media, Texts are messages
• Texts and documents are part of a system
comprised of “levels”
– They are effectively archaeology sites with
stratigraphic layers
– Erasures are like cities building on top of each other
• Each level of the system is described by an
appropriate set of tools
– Document structures → XML
– Textual structures, embedded ontologies → Tables
64. Basic Levels
• Document
– Physical objects (paper)
– Logical objects (defined by space, style, punctuation,
etc.)
– Style and layout (also defined by space, color, etc.)
– Can have superimposed versions
• Text
– Sequences of characters
– Grammatical features
– Figures and poetic features
– Etc.
Editor's Notes
Old French illuminated manuscript. What does the image mean?
T. S. Eliot, The Waste Land – note use of line breaks; what do they mean?