1. Marina Santini
Artificial Solutions,
KYH Agile Web Development
Stockholm
Uppsala University
Department of Linguistics and Philology, Seminar Series
Fri 4 March 2011
3. Outline
What is genre? What is web genre?
What is the difference between genre and web genre?
Why is (web) genre important?
Automatic web genre identification
The very beginning: Biber and Karlgren & Cutting
Sharoff
Kim & Ross
Santini
Stein et al.
Web genre identification by Humans
Karlgren
Rosso & Haas
Crowston et al.
Future directions
4. What is genre? The beginning…
Aristotle (4th cent. BC): drama, lyric, epic
Drama: tragedy, comedy, satyr play
Literary theory and literary genres
Library classification
Library classification is also used in online bookshops (e.g.
Amazon)
Music genres (jazz, rock, etc.), film genres (thriller,
drama, western etc.)
5. More recently…
Genre in academic contexts, in
workplace and professional
contexts, public contexts, in
pedagogy (teaching writing), etc.
(research articles, essays, emails,
memos, etc.)
7. Genre & Corpus Linguistics
Surprisingly, no explicit definition of what genre is…
Brown corpus (1961): 15 genres
Stockholm-Umeå Corpus (SUC) (1990s)
British National Corpus (1990s)
etc.
9. Why is genre important?
It is a context carrier: being based on recurrent
conventions and predictable expectations, genre
provides the communicative context and the
communicative purpose for which a text has been
produced.
Think of what happens in your mind when you come
across a specific genre, e.g. FAQs, reviews,
interviews, academic papers, reportages…
10. Benefits (I)
Being a context carrier…
Complexity reduction: a text receives identity
through belonging to a certain genre;
Predictivity: genre reduces information overload;
Findability: genre helps find web documents
”relevant” to our information needs.
11. Benefits (II)
Genre competence increases information
understanding:
genre competence increases self-protection against
digital crimes (phishing, hoaxes, cyberbullying) because it
can help us spot genre anomalies and, consequently,
malicious intentions;
Genre competence helps implement democracy:
some educational programs (e.g. in Australia) teach
genre from primary school onwards, because those who
leave school after the primary level without acquiring
genre competence become socially disadvantaged in the
structure of power.
12. What is webgenre?
All types of genres that are on the web…
Paper genres that have been uploaded in any format
+ genres that do not have any counterpart in the
paper world,
e.g. home page, About Us, FAQs, webzine,
personal blog, corporate weblogs …
13. How is webgenre different from paper
genre?
On the web, there are new communicative settings,
and new communicative contexts, so new genres are
spawned
On the web, the new communicative settings have
been spurred by a proliferation of new technologies
that ease, foster and model our communication, e.g.
chats, blogs, and social networks such as Facebook, Twitter,
LinkedIn…
14. Then, a written text is not only
topic…
There are many dimensions of variation: domain,
topic, register, sentiment, level of complexity or
difficulty or specialisation, trustworthiness and
credibility, etc.
… genre is a dimension of variation. Genre gives us
a topic packaged in a certain way. From the package,
we are able to identify the communicative purpose of
the text and the communicative context that has
spawned such a text.
15. A step back…
Biber (1988)
Genre
Text types
66 linguistically-motivated features
Multi-Dimensional Analysis
Ad-hoc corpus
Karlgren & Cutting (1994)
Genre
20 shallow features
Brown Corpus
16. Biberian
Text Types
Biber (1988)
Biber (1989)
Biber (1993)
Biber (1995)
Biber (2004a)
Biber (2004b)
Biber et al. (2005)
etc.
Genres/Registers
vs.
Text Types
External Features
vs.
Internal Features
“I have used the term ‘genre’ (or ‘register’) for text varieties that are readily
recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon,
conversation), while I have used the term ‘text type’ for varieties that are defined
linguistically (rather than perceptually)” (Biber, 1993).
18. From Biber’s text types to genres of electronic
corpora: Karlgren and Cutting (1994)
19. Karlgren and Cutting (1994):
Recognizing Text Genres with Simple Metrics
Using Discriminant Analysis
20 features
Discriminant analysis
Brown corpus
21. More than 15 years later…
Grieve, Biber et al.: ”We define a genre in a very similar
manner to how we define register – i.e. as a variety of
language defined by the external situation in which it is
produced. However, while a register is characterized by
pervasive linguistic features, a genre is characterized by
conventionalized linguistic features”
Karlgren: ”Genre is a vague but well-established
notion, and genres are explicitly identified and
discussed by language users even while they may be
difficult to encode and put into practical use”
GoWeb
22. The concept of genre is beneficial…
but difficult to pin down and to
agree upon
GoWeb
In the book, we do not
propose a single and
unified definition of
genre. Authors give
their different views on
genre.
23. Do we really need a definition?
After all….
… once we are convinced that genre is useful, we could just
say that: genre is a classificatory principle based on a
number of attributes.
The web is immense, we cannot think of classifying web
documents by genre manually, can we? Let’s just focus on
AUTOMATIC web GENRE CLASSIFICATION!
24. What do we need for Automatic
webGenre Identification (AGI)?
We need:
a genre taxonomy (palette) and a corpus
measurable attributes (features) that can be extracted
automatically
an automatic classifier, i.e. a computational model that
does the classification for us
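The three ingredients above can be sketched in a few lines of Python. This is a minimal illustration, not any of the systems discussed in this talk: the two-genre palette, the toy training texts, the three shallow features and the nearest-centroid classifier are all assumptions chosen for brevity.

```python
from collections import defaultdict
import math

# Hypothetical toy corpus: (text, genre label). A real palette would be larger.
TRAIN = [
    ("Q: How do I reset my password? A: Click the reset link.", "faq"),
    ("Q: Where is my order? A: Check the tracking page.", "faq"),
    ("Today I went hiking and I loved every minute of it.", "blog"),
    ("I finally finished my novel and I am so happy.", "blog"),
]

def features(text):
    """Three shallow, automatically extractable features (illustrative only)."""
    words = text.split()
    n = max(len(words), 1)  # avoid division by zero on empty input
    return [
        text.count("?") / n,                               # question marks per word
        sum(w.lower() in {"i", "my"} for w in words) / n,  # first-person rate
        sum(w.endswith(":") for w in words) / n,           # "Q:" / "A:" style labels
    ]

def train(examples):
    """Compute one centroid (mean feature vector) per genre."""
    grouped = defaultdict(list)
    for text, genre in examples:
        grouped[genre].append(features(text))
    return {
        genre: [sum(col) / len(col) for col in zip(*vectors)]
        for genre, vectors in grouped.items()
    }

def classify(text, centroids):
    """Assign the genre whose centroid is closest in Euclidean distance."""
    vec = features(text)
    return min(centroids, key=lambda g: math.dist(vec, centroids[g]))

centroids = train(TRAIN)
print(classify("Q: Can I change my email? A: Yes, in settings.", centroids))
```

Any of the three parts can be swapped independently: a different palette, richer features, or a stronger classifier such as an SVM.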
26. Models for AGI: Scenarios
Serge Sharoff
Kim & Ross
Santini
Stein et al.
Others…
GoWeb
27. Morphology & the Linguist
Aim: Find a genre palette allowing comparison among
corpora (Web As Corpus initiative) and across
languages
A functional genre palette inspired by J. Sinclair
Many corpora: English and Russian
Classifier: SVM
Features: POS trigrams (577 for Russian; 593 for
English)
Example of a POS trigram: ADV ADJ NOUN
Sharoff GoWeb
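As a rough illustration of what a POS trigram feature looks like in practice, the sketch below counts trigrams over an already tagged document. The tag sequence is hypothetical, and the use of relative frequencies as feature values is an assumption; the slide does not say how the trigram counts were weighted.

```python
from collections import Counter

def pos_trigram_features(tags):
    """Relative frequency of each POS trigram in one tagged document."""
    trigrams = [" ".join(tags[i:i + 3]) for i in range(len(tags) - 2)]
    counts = Counter(trigrams)
    total = len(trigrams)
    return {tri: c / total for tri, c in counts.items()} if total else {}

# Hypothetical tag sequence for "a very fast car overtook a very slow truck"
tags = ["DET", "ADV", "ADJ", "NOUN", "VERB", "DET", "ADV", "ADJ", "NOUN"]
feats = pos_trigram_features(tags)
print(feats["ADV ADJ NOUN"])  # share of trigrams that are ADV ADJ NOUN
```

The resulting dictionary can be turned into a fixed-length vector by fixing the inventory of trigrams in advance.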
30. KRYS I and Harmonic Descriptor
Representation (HDR)
Information studies, Digital Libraries:
semantic concept
Features: HDR = FP, LP or AP (between 1 and
T/(N x MP))
Number of features: 7431
Classifier: SVM
KRYS I + 7 webgenre collection (total: 24 +
7 genre classes, 3452 documents)
Kim & Ross GoWeb
36. Inferential model
It is a simple probabilistic model based on rules.
It allows some ”reasoning” through the use of weights
(closer to artificial intelligence than machine learning)
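To give a feel for what "rules with weights" can mean, here is a small sketch in Python. It is not the inferential model presented in the talk: the rules, the weights and the normalisation step are all hypothetical and only illustrate the general idea of combining weighted cues into genre scores.

```python
# Hypothetical rules: each maps an observable cue to a genre with a weight.
RULES = [
    (lambda t: "?" in t, "faq", 0.6),
    (lambda t: "q:" in t.lower(), "faq", 0.9),
    (lambda t: " i " in f" {t.lower()} ", "blog", 0.7),
    (lambda t: "posted by" in t.lower(), "blog", 0.8),
]

def infer(text, rules=RULES):
    """Sum the weights of the rules that fire, then normalise so scores sum to 1."""
    scores = {}
    for fires, genre, weight in rules:
        if fires(text):
            scores[genre] = scores.get(genre, 0.0) + weight
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()} if total else {}

result = infer("Posted by Anna: I went out. Why?")
print(max(result, key=result.get))
```

Because every rule is readable and its weight can be adjusted by hand, a model of this kind stays inspectable, unlike a trained classifier.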
40. Three experimental settings, three
different genre needs….
1. Genre comparison across corpora
2. Digital libraries, where documents can be more easily
monitored
3. The wild web, where everything is uncertain and
noisy
WEGA prototype:
a retrieval model for genre-enabled web search
41. Genre retrieval model
Genre collection and palette: KI-04 corpus: 8 webgenres
Firefox add-on
Model: ”lightweight GenreRich model” (linear discriminant
analysis)
Features: HTML, link features, character features,
vocabulary concentration features (< 100 features)
Stein, Meyer zu Eissen, Lipka GoWeb
44. Genre Classes & Human
Recognition
How can we decide on the most representative genre
classes? Let’s ask users… yes indeed, but how?
1) questionnaires (Karlgren)
2) card sorting (Rosso & Haas)
3) task-oriented studies (Crowston et al.)
4) others…
46. User Warrant
Collecting genre terminology in the users’ own words
(3 participants)
Make the users classify web pages and create piles
(rationale?)
Users choose the best of the collected genre
terminology (102 participants)
User validation of the genre palette (257 participants)
Usefulness of genres for web search (32 participants)
GoWeb: Rosso & Haas
48. Genres & Tasks
3 groups of respondents: teachers, journalists, engineers
Respondents were asked to carry out a web search for a
real task of their own choice
What is your search goal?
What type of web page would you call this?
What is it about the page that makes you call it that?
Was this page useful to you?
GoWeb: Crowston et al.
49. What type of web page would you call this?
522 unique terms about 300
50. Syracuse corpus & AGI
ACL 2010 (Uppsala):
FINE-GRAINED GENRE CLASSIFICATION USING
STRUCTURAL LEARNING ALGORITHMS
Zhili Wu, Katja Markert and Serge Sharoff
The whole corpus: 3027 annotated webpages divided
into 292 genres.
Focussing on genres containing 15 or more examples,
the corpus comes to about 2293 examples and 52 genres.
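Restricting a corpus to genres with a minimum number of examples is a simple frequency filter. The sketch below shows one way to do it in Python. The function name, the toy labels and the default threshold parameter are assumptions for illustration, not the procedure used in the cited paper.

```python
from collections import Counter

def keep_frequent_genres(labels, min_examples=15):
    """Return indices of documents whose genre has at least min_examples items."""
    counts = Counter(labels)
    kept = {g for g, c in counts.items() if c >= min_examples}
    indices = [i for i, g in enumerate(labels) if g in kept]
    return indices, kept

# Hypothetical label list: two well-populated genres and one sparse genre.
labels = ["blog"] * 20 + ["faq"] * 15 + ["recipe"] * 3
indices, kept = keep_frequent_genres(labels)
print(len(indices), sorted(kept))
```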
51. Conclusions (I) : Do we really need
a definition of genre?
1. Take a number of web pages belonging to different
web genres (e.g. blogs, home pages, news stories,
FAQs, etc.)
2. Identify and extract genre-revealing features
3. Feed an automatic classifier
Where is the problem?
52. Conclusions (II)
The problem with this approach is that without a
theoretical definition and characterization of the
concept of genre, it is not clear:
how to create a genre taxonomy whose classes both humans
and automatic classifiers can easily tell apart
how to select a representative corpus for the genre classes
in the taxonomy, since there is a lot of variation in users’
assessment …
how to identify the optimal genre-revealing features
53. Future Work
Genre is a high-level concept: we NEED a theoretical
definition of genre for computational and empirical
purposes.
Without a theoretical definition:
genres become lifeless texts, merely characterized by
formal attributes, and the communicative context, i.e.
the thing that makes genre important, is completely
stripped out
Although in some restricted experimental settings,
this ”formalistic” approach is quite rewarding (more
than 95% success rate), we can hardly generalize on it.
54. Future directions: AGI is a fertile land
for research and development…
Now that basic explorations have been carried out, we
should concentrate more on the correlation and
interrelation of the following variables:
Human agreement
Representation of genre classes
Number of genre classes
Nature of genre classes
Size of the whole corpus
Structured and unstructured noise
Genre-revealing features that account for the context that
genres carry with them
New computational models and algorithms…
55. Certainties….
Genre is a useful concept in many disciplines
Automatic genre classification is feasible, and there is ample
space for improvement
I am interested in your views on (web) genre:
send me your impressions, ideas, gut feelings and your genre
classes:
Facebook page: www.facebook.com/genresontheweb
Genre blog: www.forum.santini.se
Webrider’s Short proposal to EU: www.webrider.se
57. References (I)
Bateman, John (2008) Multimodality and Genre,
Palgrave Macmillan
Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre:
An Introduction to History, Theory, Research, and
Pedagogy (free book);
http://wac.colostate.edu/books/bawarshi_reiff/genre.pdf
Bruce, Ian (2008) Academic Writing and Genre,
Continuum
Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic
Variation and Genre, De Gruyter Mouton
58. References (II)
Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in
the Internet, John Benjamins Publishing Company
Heyd, Theresa (2008) Email Hoaxes: Form, function,
genre ecology, John Benjamins Publishing Company
Lee, David (2001) Genres, Registers, Text Types,
Domains, and Styles: Clarifying the Concepts and
Navigating a Path through the BNC Jungle, Language
Learning & Technology, September 2001, Vol. 5, Num. 3,
pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
59. References (III)
Luzón, María José, Ruiz-Madrid, María Noelia and
Villanueva, María Luisa (eds) (2010) Digital Genres,
New Literacies and Autonomy in Language
Learning, Cambridge Scholars Publishing
Martin, James and Rose, David (2008) Genre
Relations: Mapping Culture, Equinox
Puschmann, Cornelius (2010) The corporate blog as
an emerging genre of computer-mediated
communication: features, constraints, discourse
situation, Universitätsverlag Göttingen
WEGA prototype download, documentation and
references:
http://www.uni-weimar.de/cms/medien/webis/research/projects/wega.html