Marina Santini
Artificial Solutions,
KYH Agile Web Development
Stockholm
Uppsala University
Department of Linguistics and Philology, Seminar Series
Fri 4 March 2011
Genres on the Web (GoWeb)
Outline
 What is genre? What is web genre?
 What is the difference between genre and web genre?
 Why is (web) genre important?
 Automatic web genre identification
 The very beginning: Biber and Karlgren & Cutting
 Sharoff
 Kim & Ross
 Santini
 Stein et al.
 Web genre identification by Humans
 Karlgren
 Rosso & Haas
 Crowston et al.
 Future directions
What is genre? The beginning…
 Aristotle (4th century BC): drama, lyric, epic
 Drama: tragedy, comedy, satyr play
 Literary theory and literary genres
 Library classification
 Library classification is also used in online bookshops (e.g. Amazon)
 Music genres (jazz, rock, etc.), film genres (thriller,
drama, western etc.)
More recently…
 Genre in academic contexts, in workplace and professional contexts, in public contexts, in pedagogy (teaching writing), etc. (research articles, essays, emails, memos, etc.)
Recent Genre Definitions: 2008-2010
Genre & Corpus Linguistics
 Surprisingly, no explicit definition of what genre is…
 Brown corpus (1961): 15 genres
 Stockholm-Umeå Corpus (SUC) (1990s)
 British National Corpus (1990s)
 etc.
David Lee and the BNC Jungle
Why is genre important?
 It is a context carrier: being based on recurrent
conventions and predictable expectations, genre
provides the communicative context and the
communicative purpose for which a text has been
produced.
Think of what happens in your mind when you come across a specific genre, e.g. FAQs, reviews, interviews, academic papers, reportage…
Benefits (I)
Being a context carrier…
 Complexity reduction: a text receives its identity through belonging to a certain genre.
 Predictivity: genre reduces information overload.
 Findability: genre helps find web documents “relevant” to our information needs.
Benefits (II)
 Genre competence increases information understanding:
 genre competence increases self-protection against digital crimes (phishing, hoaxes, cyberbullying) because it can help us spot genre anomalies and, consequently, malicious intentions.
 Genre competence helps implement democracy:
 some educational programs (e.g. in Australia) teach genre from primary school onwards, because those who leave school after the primary level without genre competence become socially disadvantaged in the structure of power.
What is webgenre?
 All types of genres that are on the web…
 Paper genres that have been uploaded in any format + genres that do not have any counterpart in the paper world:
 e.g. home page, About Us, FAQs, webzine, personal blog, corporate weblogs…
How is webgenre different from paper
genre?
 On the web, there are new communicative settings,
and new communicative contexts, so new genres are
spawned
 On the web, the new communicative settings have been spurred by a proliferation of new technologies that ease, foster and model our communication, e.g. chats, blogs and social networks such as Facebook, Twitter, LinkedIn…
So, a written text is not only its topic…
 There are many dimensions of variation: domain,
topic, register, sentiment, level of complexity or
difficulty or specialisation, trustworthiness and
credibility, etc.
 … genre is a dimension of variation. Genre gives us a topic packaged in a certain way. From the package, we are able to identify the communicative purpose of the text and the communicative context that has spawned such a text.
A step back…
 Biber (1988)
 Genre
 Text types
 66 linguistically-motivated features
 Multi-Dimensional Analysis
 Ad-hoc corpus
 Karlgren & Cutting (1994)
 Genre
 20 shallow features
 Brown Corpus
Biberian Text Types
Biber (1988), Biber (1989), Biber (1993), Biber (1995), Biber (2004a), Biber (2004b), Biber et al. (2005), etc.
Genres/Registers vs. Text Types
External Features vs. Internal Features
“I have used the term ‘genre’ (or ‘register’) for text varieties that are readily
recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon,
conversation), while I have used the term ‘text type’ for varieties that are defined
linguistically (rather than perceptually)” (Biber, 1993).
Multi-Dimensional Analysis
Factor Analysis, Factors Scores (Biber, 1988)
Cluster Analysis (Biber, 1989)
Additional Statistical Tests (Biber, 2004a; 2004b, etc.)
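
As a rough illustration of how factor scores work in this kind of analysis, the sketch below standardizes feature frequencies across texts and then, for one dimension, adds the z-scores of positively loading features and subtracts those of negatively loading ones. The feature names, the frequencies and the split into positive and negative features are made-up assumptions for illustration, not values from Biber (1988), and the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw feature frequencies to z-scores across the corpus."""
    mu = mean(values)
    sigma = pstdev(values)
    # A feature with no variation carries no information; map it to zeros.
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def dimension_scores(freqs, positive, negative):
    """freqs maps a feature name to its frequency in each text.

    Returns one score per text: the sum of z-scores of the positively
    loading features minus the sum for the negatively loading ones.
    """
    z = {name: standardize(vals) for name, vals in freqs.items()}
    n_texts = len(next(iter(freqs.values())))
    scores = []
    for i in range(n_texts):
        pos = sum(z[f][i] for f in positive)
        neg = sum(z[f][i] for f in negative)
        scores.append(pos - neg)
    return scores

# Hypothetical frequencies (per 1,000 words) for three texts.
freqs = {
    "past_tense": [60.0, 20.0, 40.0],
    "third_person_pronouns": [50.0, 10.0, 30.0],
    "present_tense": [30.0, 90.0, 60.0],
}
scores = dimension_scores(
    freqs,
    positive=["past_tense", "third_person_pronouns"],
    negative=["present_tense"],
)
```

With these toy numbers the first text scores highest on the dimension, the second lowest, and the third sits at zero.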
1. intimate interpersonal interaction
2. informational interaction
3. scientific exposition
4. learned exposition
5. imaginative narrative
6. general narrative exposition
7. situated reportage
8. involved persuasion
Cluster Analysis - Biber (1989)
Factor 2 - Biber (1988)
Criticism: Lee (1999)
From Biber’s text types to genres of electronic
corpora: Karlgren and Cutting (1994)
Karlgren and Cutting (1994):
Recognizing Text Genres with Simple Metrics
Using Discriminant Analysis
 20 features
 Discriminant analysis
 Brown corpus
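
To give a feel for what "simple metrics" can look like, here is a minimal sketch that computes four surface-level measures from raw text without any parsing. The four measures and the pronoun list are illustrative assumptions; they are not the 20 features used by Karlgren and Cutting (1994), and the discriminant analysis step is not shown.

```python
import re

def shallow_features(text):
    """Compute a few surface-level metrics that need no parsing."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    lowered = [w.lower() for w in words]
    first_person = {"i", "me", "my", "we", "our"}  # hypothetical pronoun list
    return {
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "chars_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(lowered)) / n_words if n_words else 0.0,
        "first_person_count": sum(w in first_person for w in lowered),
    }

sample = "I like my blog. We post news every day!"
features = shallow_features(sample)
```

Each text then becomes a small numeric profile that a statistical classifier can work with.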
POSs & SUC
More than 15 years later…
 Grieve, Biber et al.: “We define a genre in a very similar manner to how we define register – i.e. as a variety of language defined by the external situation in which it is produced. However, while a register is characterized by pervasive linguistic features, a genre is characterized by conventionalized linguistic features”
 Karlgren: “Genre is a vague but well-established notion, and genres are explicitly identified and discussed by language users even while they may be difficult to encode and put into practical use”
GoWeb
The concept of genre is beneficial…
but difficult to pin down and to
agree upon
GoWeb
In the book, we do not
propose a single and
unified definition of
genre. Authors give
their different views on
genre.
Do we really need a definition?
 After all….
 … once we are convinced that genre is useful, we could just
say that: genre is a classificatory principle based on a
number of attributes.
 The web is immense, we cannot think of classifying web
documents by genre manually, can we? Let’s just focus on
AUTOMATIC web GENRE CLASSIFICATION!
What do we need for Automatic
webGenre Identification (AGI)?
 We need:
 a genre taxonomy (palette) and a corpus
 measurable attributes (features) that can be extracted
automatically
 an automatic classifier, i.e. a computational model that
does the classification for us
Vector representation & supervised
machine learning algorithms (esp.
SVM)
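
The three ingredients above (palette and corpus, features, classifier) can be sketched in a few lines. To stay within the standard library, this sketch swaps the SVM for a much simpler nearest-centroid classifier; the feature names, the two genre labels and the training values are all hypothetical.

```python
from collections import defaultdict
from math import dist

def to_vector(feature_dict, feature_names):
    """Map a {feature: value} dict onto a fixed-order numeric vector."""
    return [float(feature_dict.get(name, 0.0)) for name in feature_names]

def train_centroids(vectors, labels):
    """Average the training vectors of each genre class."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def predict(centroids, vector):
    """Assign the genre whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))

# Hypothetical features and a tiny hand-made training set.
FEATURES = ["question_marks", "links", "first_person"]
train = [
    ({"question_marks": 12, "links": 3, "first_person": 1}, "faq"),
    ({"question_marks": 10, "links": 5, "first_person": 0}, "faq"),
    ({"question_marks": 1, "links": 4, "first_person": 15}, "blog"),
    ({"question_marks": 0, "links": 6, "first_person": 12}, "blog"),
]
X = [to_vector(f, FEATURES) for f, _ in train]
y = [label for _, label in train]
centroids = train_centroids(X, y)
guess = predict(centroids, to_vector({"question_marks": 9, "links": 2}, FEATURES))
```

A real system would replace `train_centroids` and `predict` with a trained SVM; the vector representation step stays the same.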
Models for AGI: Scenarios
 Serge Sharoff
 Kim & Ross
 Santini
 Stein et al.
 Others…
GoWeb
Morphology & the Linguist
 Aim: Find a genre palette allowing comparison among
corpora (Web As Corpus initiative ) and across
languages
 A functional genre palette inspired by J. Sinclair
 Many corpora: English and Russian
 Classifier: SVM
 Features: POS trigrams (577 for Russian; 593 for
English)
Ex of POS trigrams: ADV ADJ NOUN
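
A minimal sketch of POS-trigram features, assuming the text has already been tagged: slide a window of three tags over the tag sequence, count each triple, and turn the counts into relative frequencies over a fixed trigram vocabulary. The tag sequence and the two-item vocabulary are toy assumptions; the tagset and the 577/593 trigrams used in the actual experiments are not reproduced here.

```python
from collections import Counter

def pos_trigrams(tags):
    """Return counts of consecutive POS-tag triples in one tagged text."""
    return Counter(zip(tags, tags[1:], tags[2:]))

def trigram_vector(tags, vocabulary):
    """Relative frequency of each trigram in a fixed vocabulary."""
    counts = pos_trigrams(tags)
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in vocabulary]

# Hypothetical tag sequence for a very short text.
tags = ["ADV", "ADJ", "NOUN", "VERB", "ADV", "ADJ", "NOUN"]
counts = pos_trigrams(tags)
vocab = [("ADV", "ADJ", "NOUN"), ("NOUN", "VERB", "ADV")]
vector = trigram_vector(tags, vocab)
```

The resulting vector is what would be handed to the classifier, one dimension per trigram in the vocabulary.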
Sharoff  GoWeb
The expert (the linguist) decides:
Results
KRYS I and Harmonic Descriptor
Representation (HDR)
 Information studies, Digital Libraries: semantic concept
 Features: HDR = FP, LP or AP (between 1 and T / (N x MP))
 Number of features: 7431
 Classifier: SVM
 KRYS I + 7-webgenre collection (total: 24 + 7 genre classes, 3452 documents)
Kim & Ross  GoWeb
2477 words
KRYS I & 7-webgenre collection: accuracies
What about morphology & syntax?
What about noise?
 Collection: 7-webgenre collection + others
 Features: 100 facets
 Genre palette: 7 webgenres + other
 Classifier: inferential model (subjective Bayesian method)
Santini  GoWeb
7-webgenre collection
 Balanced (200 web pages per genre
class)
 Genre palette
 Not annotated manually
 Built following 2 principles:
 Objective sources
 Consistent genre granularity
100 Facets
Inferential model
 It is a simple probabilistic model based on rules.
 It allows some “reasoning” through the use of weights (closer to artificial intelligence than machine learning)
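
One common way to do weighted, rule-based probabilistic reasoning is the odds form of Bayes' rule: start from prior odds for a genre hypothesis and multiply by a weight (a likelihood ratio) for each piece of evidence that is observed. The sketch below shows only that general idea, under the assumption that the weights play this role; the facet names, weights and prior are hypothetical and are not taken from the model presented here.

```python
def to_odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

def to_prob(odds):
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)

def update(prior, weights, observed):
    """Odds form of Bayes' rule: multiply the prior odds by the weight
    (likelihood ratio) of every facet that was observed in the page."""
    odds = to_odds(prior)
    for facet, weight in weights.items():
        if facet in observed:
            odds *= weight
    return to_prob(odds)

# Hypothetical weights for the hypothesis "this page is a FAQ".
# A weight above 1 supports the hypothesis, a weight below 1 counts against it.
faq_weights = {"many_questions": 6.0, "first_person": 0.5, "date_stamps": 0.8}
posterior = update(prior=0.125, weights=faq_weights, observed={"many_questions"})
```

Because the weights are set by rule rather than learned from data, the behaviour of such a model is easy to inspect and adjust by hand.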
Comparisons (I)
Different types of noise!
Results
Three experimental settings, three
different genre needs….
1. Genre comparison across corpora
2. Digital libraries, where documents can be more easily
monitored
3. The wild web, where everything is uncertain and
noisy
WEGA prototype:
a retrieval model for genre-enabled web search
Genre retrieval model
 Genre collection and palette: KI-04 corpus: 8 webgenres
 Firefox add-on
 Model: “lightweight GenreRich model” (linear discriminant analysis)
 Features: HTML, link features, character features,
vocabulary concentration features (< 100 features)
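
To illustrate what HTML, link and character features might look like, the sketch below parses a page with Python's standard `html.parser` and derives two ratios. Both feature definitions (`links_per_tag`, `digit_ratio`) are hypothetical examples, not the feature set used in WEGA, and vocabulary concentration features are not covered.

```python
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collect a few presentation-level counts from an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.links = 0
        self.text_chars = 0
        self.digit_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        # Count only anchors that actually point somewhere.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1

    def handle_data(self, data):
        stripped = data.strip()
        self.text_chars += len(stripped)
        self.digit_chars += sum(ch.isdigit() for ch in stripped)

def page_features(html):
    """Return a small dict of link and character features for one page."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    return {
        "links_per_tag": stats.links / stats.tags if stats.tags else 0.0,
        "digit_ratio": stats.digit_chars / stats.text_chars if stats.text_chars else 0.0,
    }

page = '<html><body><a href="/a">Shop</a><a href="/b">Cart 42</a><p>Hi</p></body></html>'
features = page_features(page)
```

Features of this kind are cheap to compute, which matters for a model meant to run inside a browser add-on.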
Stein, Meyer zu Eissen, Lipka GoWeb
WEGA (WEb Genre Analysis)
KI-04 genre collection: 8 webgenres
Genre Classes & Human
Recognition
 How can we decide on the most representative genre
classes? Let’s ask users… yes indeed, but how?
 1) questionnaires (Karlgren)
 2) card sorting (Rosso & Haas)
 3) task-oriented studies (Crowston et al.)
 4) others…
Questionnaires: “What genres are available on the internet?”
User Warrant
 Collecting genre terminology in the users’ own words
(3 participants)
 Make the users classify web pages and create piles
(rationale?)
 Users choose the best of the collected genre
terminology (102 participants)
 User validation of the genre palette (257 participants)
 Usefulness of genres for web search (32 participants)
GoWeb: Rosso & Haas
Final genre palette: 18 genres
Genres & Tasks
 3 groups of respondents: teachers, journalists, engineers
 Respondents were asked to carry out a web search for a
real task of their own choice
 What is your search goal?
 What type of web page would you call this?
 What is it about the page that makes you call it that?
 Was this page useful to you?
GoWeb: Crowston et al.
What type of web page would you call this?
 522 unique terms  about 300
Syracuse corpus & AGI
ACL 2010 (Uppsala):
FINE-GRAINED GENRE CLASSIFICATION USING
STRUCTURAL LEARNING ALGORITHMS
Zhili Wu, Katja Markert and Serge Sharoff
 The whole corpus: 3027 annotated webpages divided
into 292 genres.
 Focussing on genres containing 15 or more examples, the corpus comprises about 2293 examples and 52 genres.
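
The filtering step described above (keeping only genres with at least 15 examples) can be sketched as follows. The function name and the toy label list are assumptions for illustration; the Syracuse corpus itself is not loaded here.

```python
from collections import Counter

def filter_rare_genres(labels, min_examples=15):
    """Keep only the pages whose genre has at least min_examples pages.

    Returns the indices of the kept pages and the set of kept genres.
    """
    counts = Counter(labels)
    kept = {genre for genre, n in counts.items() if n >= min_examples}
    indices = [i for i, genre in enumerate(labels) if genre in kept]
    return indices, kept

# Hypothetical label list: one label per annotated web page.
labels = ["blog"] * 20 + ["faq"] * 15 + ["recipe"] * 3
indices, genres = filter_rare_genres(labels)
```

Raising or lowering `min_examples` trades off the number of genre classes against the amount of training data per class.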
Conclusions (I) : Do we really need
a definition of genre?
1. Take a number of web pages belonging to different
web genres (e.g. blogs, home pages, news stories,
FAQs, etc.)
2. Identify and extract genre-revealing features
3. Feed an automatic classifier
Where is the problem?
Conclusions (II)
 The problem with this approach is that without a theoretical definition and characterization of the concept of genre, it is not clear:
 how to create a genre taxonomy whose classes both humans and automatic classifiers can easily tell apart
 how to select a representative corpus for the genre classes in the taxonomy, since there is a lot of variation in users’ assessments…
 how to identify the optimal genre-revealing features
Future Work
Genre is a high-level concept: we NEED a theoretical
definition of genre for computational and empirical
purposes.
Without a theoretical definition:
 genres become lifeless texts, merely characterized by formal attributes, and the communicative context, i.e. the thing that makes genre important, is completely stripped out
 Although in some restricted experimental settings this “formalistic” approach is quite rewarding (more than 95% success rate), we can hardly generalize from it.
Future directions: AGI is a fertile land
for research and development…
Now that basic explorations have been carried out, we
should concentrate more on the correlation and
interrelation of the following variables:
 Human agreement
 Representation of genre classes
 Number of genre classes
 Nature of genre classes
 Size of the whole corpus
 Structured and unstructured noise
 Genre-revealing features that account for the context that
genres carry with them
 New computational models and algorithms…
Certainties….
 Genre is a useful concept in many disciplines
 Automatic genre classification is feasible, and there is ample
space for improvement
 I am interested in your views on (web) genre:
 send me your impressions, ideas, gut feelings and your genre
classes:
 Facebook page: www.facebook.com/genresontheweb
 Genre blog: www.forum.santini.se
 Webrider’s Short proposal to EU: www.webrider.se
Thank you for your attention!
References (I)
 Bateman, John (2008) Multimodality and Genre,
Palgrave Macmillan
 Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre:
An Introduction to History, Theory, Research, and
Pedagogy (free book);
http://wac.colostate.edu/books/bawarshi_reiff/genre.pdf
 Bruce, Ian (2008) Academic Writing and Genre,
Continuum
 Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic
Variation and Genre, De Gruyter Mouton
References (II)
 Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in the Internet, John Benjamins Publishing Company
 Heyd, Theresa (2008) Email Hoaxes: Form, function,
genre ecology, John Benjamins Publishing Company
 Lee, David (2001) Genres, Registers, Text Types, Domains, and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle, Language Learning & Technology, September 2001, Vol. 5, Num. 3, pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
References (III)
 Luzón, María José, Ruiz-Madrid, María Noelia and
Villanueva, María Luisa (eds) (2010) Digital Genres,
New Literacies and Autonomy in Language
Learning, Cambridge Scholars Publishing
 Martin, James and Rose, David (2008) Genre
Relations: Mapping Culture, Equinox
 Puschmann, Cornelius (2010) The corporate blog as
an emerging genre of computer-mediated
communication: features, constraints, discourse
situation, Universitätsverlag Göttingen
 WEGA prototype download, documentation and references: http://www.uni-weimar.de/cms/medien/webis/research/projects/wega.html

Lecture: Ontologies and the Semantic Web
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?
 

Uppsala uni 4march2011

  • 1. Marina Santini Artificial Solutions, KYH Agile Web Development Stockholm Uppsala University Department of Linguistics and Philology, Seminar Series Fri 4 March 2011
  • 3. Outline
      • What is genre? What is web genre?
      • What is the difference between genre and web genre?
      • Why is (web) genre important?
      • Automatic web genre identification
          • The very beginning: Biber and Karlgren & Cutting
          • Sharoff
          • Kim & Ross
          • Santini
          • Stein et al.
      • Web genre identification by humans
          • Karlgren
          • Rosso & Haas
          • Crowston et al.
      • Future directions
  • 4. What is genre? The beginning…
      • Aristotle (4th cent. BC): drama, lyrics, epics
      • Drama: tragedy, comedy, satyr
      • Literary theory and literary genres
      • Library classification
      • Library classification used also in online bookshops (e.g. Amazon)
      • Music genres (jazz, rock, etc.), film genres (thriller, drama, western, etc.)
  • 5. More recently…
      • Genre in academic contexts, in workplace and professional contexts, in public contexts, in pedagogy (teaching writing), etc. (research articles, essays, emails, memos, etc.)
  • 7. Genre & Corpus Linguistics
      • Surprisingly, no explicit definition of what genre is…
      • Brown corpus (1961): 15 genres
      • Stockholm-Umeå Corpus (SUC) (1990s)
      • British National Corpus (1990s)
      • etc.
  • 8. David Lee and the BNC Jungle
  • 9. Why is genre important?
      • It is a context carrier: being based on recurrent conventions and predictable expectations, genre provides the communicative context and the communicative purpose for which a text has been produced.
      • Think of what happens in your mind when you come across a specific genre, e.g. FAQs, reviews, interviews, academic papers, reportages…
  • 10. Benefits (I)
      Being a context carrier…
      • Complexity reduction: a text receives identity through belonging to a certain genre
      • Predictivity: genre reduces information overload
      • Findability: genre helps find web documents "relevant" to our information needs
  • 11. Benefits (II)
      • Genre competence increases information understanding:
          • genre competence increases self-protection against digital crimes (phishing, hoaxes, cyberbullying) because it can help us spot genre anomalies and, consequently, malicious intentions
      • Genre competence helps implement democracy:
          • some educational programs (e.g. in Australia) focus on teaching genre from primary school onwards, because those who lack genre competence, having dropped out of school after the primary level, become socially disadvantaged in the structure of power
  • 12. What is webgenre?
      • All types of genres that are on the web…
      • Paper genres that have been uploaded in any format + genres that do not have any counterpart in the paper world:
          • e.g. home page, About Us, FAQs, webzine, personal blog, corporate weblogs…
  • 13. How is webgenre different from paper genre?
      • On the web, there are new communicative settings and new communicative contexts, so new genres are spawned
      • On the web, the new communicative settings have been spurred by a proliferation of new technologies that ease, foster and model our communication, e.g. chats, blogs, and social networks such as Facebook, Twitter, LinkedIn…
  • 14. Then, a written text is not only topic…
      • There are many dimensions of variation: domain, topic, register, sentiment, level of complexity or difficulty or specialisation, trustworthiness and credibility, etc.
      • … genre is a dimension of variation. Genre gives us a topic packaged in a certain way. From the package, we are able to identify the communicative purpose of the text and the communicative context that has spawned such a text.
  • 15. A step back…
      • Biber (1988)
          • Genre
          • Text types
          • 66 linguistically-motivated features
          • Multi-Dimensional Analysis
          • Ad-hoc corpus
      • Karlgren & Cutting (1994)
          • Genre
          • 20 shallow features
          • Brown Corpus
  • 16. Biberian Text Types
      • Biber (1988), Biber (1989), Biber (1993), Biber (1995), Biber (2004a), Biber (2004b), Biber et al. (2005), etc.
      • Genres/Registers vs. Text Types
      • External Features vs. Internal Features
      • “I have used the term ‘genre’ (or ‘register’) for text varieties that are readily recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term ‘text type’ for varieties that are defined linguistically (rather than perceptually)” (Biber, 1993).
  • 17. Multi-Dimensional Analysis
      • Factor Analysis, Factor Scores (Biber, 1988)
      • Cluster Analysis (Biber, 1989)
      • Additional Statistical Tests (Biber, 2004a; 2004b, etc.)
      • Text types:
          1. intimate interpersonal interaction
          2. informational interaction
          3. scientific exposition
          4. learned exposition
          5. imaginative narrative
          6. general narrative exposition
          7. situated reportage
          8. involved persuasion
      • [Figure captions: Cluster Analysis - Biber (1989); Factor 2 - Biber (1988)]
      • Criticism: Lee (1999)
  • 18. From Biber’s text types to genres of electronic corpora: Karlgren and Cutting (1994)
  • 19. Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis
      • 20 features
      • Discriminant analysis
      • Brown corpus
  • 21. More than 15 years later… (GoWeb)
      • Grieve, Biber et al.: “We define a genre in a very similar manner to how we define register – i.e. as a variety of language defined by the external situation in which it is produced. However, while a register is characterized by pervasive linguistic features, a genre is characterized by conventionalized linguistic features”
      • Karlgren: “Genre is a vague but well-established notion, and genres are explicitly identified and discussed by language users even while they may be difficult to encode and put into practical use”
  • 22. The concept of genre is beneficial… but difficult to pin down and to agree upon
      • GoWeb: in the book, we do not propose a single and unified definition of genre. Authors give their different views on genre.
  • 23. Do we really need a definition?
      • After all…
      • … once we are convinced that genre is useful, we could just say that genre is a classificatory principle based on a number of attributes.
      • The web is immense; we cannot think of classifying web documents by genre manually, can we? Let’s just focus on AUTOMATIC web GENRE CLASSIFICATION!
  • 24. What do we need for Automatic webGenre Identification (AGI)?
      • We need:
          • a genre taxonomy (palette) and a corpus
          • measurable attributes (features) that can be extracted automatically
          • an automatic classifier, i.e. a computational model that does the classification for us
  • 25. Vector representation & supervised machine learning algorithms (esp. SVM)
  • 26. Models for AGI: Scenarios (GoWeb)
      • Serge Sharoff
      • Kim & Ross
      • Santini
      • Stein et al.
      • Others…
  • 27. Morphology & the Linguist (Sharoff, in GoWeb)
      • Aim: find a genre palette allowing comparison among corpora (Web As Corpus initiative) and across languages
      • A functional genre palette inspired by J. Sinclair
      • Many corpora: English and Russian
      • Classifier: SVM
      • Features: POS trigrams (577 for Russian; 593 for English). Example of a POS trigram: ADV ADJ NOUN
  • 28. The expert (the linguist) decides:
  • 30. KRYS I and Harmonic Descriptor Representation (HDR) (Kim & Ross, in GoWeb)
      • Information studies, digital libraries: semantic concept
      • Features: HDR = FP, LP or AP (between 1 and T / (N x MP))
      • Number of features: 7431
      • Classifier: SVM
      • KRYS I + 7-webgenre collection (total: 24 + 7 genre classes, 3452 documents)
      • 2477 words
  • 33. What about morphology & syntax? What about noise? (Santini, in GoWeb)
      • Collection: 7-webgenre collection + others
      • Features: 100 facets
      • Genre palette: 7 webgenres + other
      • Classifier: inferential model (subjective Bayesian method)
  • 34. 7-webgenre collection
      • Balanced (200 web pages per genre class)
      • Genre palette
      • Not annotated manually
      • Built following 2 principles:
          • Objective sources
          • Consistent genre granularity
  • 36. Inferential model
      • It is a simple probabilistic model based on rules.
      • It allows some "reasoning" through the use of weights (closer to artificial intelligence than machine learning)
  • 40. Three experimental settings, three different genre needs…
      1. Genre comparison across corpora
      2. Digital libraries, where documents can be more easily monitored
      3. The wild web, where everything is uncertain and noisy
      WEGA prototype: a retrieval model for genre-enabled web search
  • 41. Genre retrieval model (Stein, Meyer zu Eissen, Lipka, in GoWeb)
      • Genre collection and palette: KI-04 corpus, 8 webgenres
      • Firefox add-on
      • Model: "lightweight GenreRich model" (linear discriminant analysis)
      • Features: HTML, link features, character features, vocabulary concentration features (< 100 features)
  • 42. WEGA (WEb Genre Analysis)
  • 43. KI-04 genre collection: 8 webgenres
  • 44. Genre Classes & Human Recognition
      • How can we decide on the most representative genre classes? Let’s ask users… yes indeed, but how?
          1) questionnaires (Karlgren)
          2) card sorting (Rosso & Haas)
          3) task-oriented studies (Crowston et al.)
          4) others…
  • 45. Questionnaires: ”what genres are available on the internet?”
  • 46. User Warrant (GoWeb: Rosso & Haas)
      • Collecting genre terminology in the users’ own words (3 participants)
      • Users classify web pages and create piles (rationale?)
      • Users choose the best of the collected genre terminology (102 participants)
      • User validation of the genre palette (257 participants)
      • Usefulness of genres for web search (32 participants)
  • 48. Genres & Tasks (GoWeb: Crowston et al.)
      • 3 groups of respondents: teachers, journalists, engineers
      • Respondents were asked to carry out a web search for a real task of their own choice
          • What is your search goal?
          • What type of web page would you call this?
          • What is it about the page that makes you call it that?
          • Was this page useful to you?
  • 49. What type of web page would you call this?
      • 522 unique terms
      • about 300
  • 50. Syracuse corpus & AGI
      ACL 2010 (Uppsala): Fine-Grained Genre Classification Using Structural Learning Algorithms, by Zhili Wu, Katja Markert and Serge Sharoff
      • The whole corpus: 3027 annotated web pages divided into 292 genres.
      • Focussing on genres containing 15 or more examples, the corpus comprises about 2293 examples and 52 genres.
  • 51. Conclusions (I): Do we really need a definition of genre?
      1. Take a number of web pages belonging to different web genres (e.g. blogs, home pages, news stories, FAQs, etc.)
      2. Identify and extract genre-revealing features
      3. Feed an automatic classifier
      Where is the problem?
  • 52. Conclusions (II)
      • The problem with this approach is that without a theoretical definition and characterization of the concept of genre, it is not clear:
          • how to create a genre taxonomy whose classes both humans and automatic classifiers can easily tell apart
          • how to select a representative corpus for the genre classes in the taxonomy, since there is a lot of variation in users’ assessment…
          • how to identify the optimal genre-revealing features
  • 53. Future Work
      Genre is a high-level concept: we NEED a theoretical definition of genre for computational and empirical purposes. Without a theoretical definition:
      • genres become lifeless texts, merely characterized by formal attributes, and the communicative context, i.e. the thing that makes genre important, is completely stripped out
      • although in some restricted experimental settings this "formalistic" approach is quite rewarding (more than 95% success rate), we can hardly generalize from it
  • 54. Future directions: AGI is a fertile land for research and development…
      Now that basic explorations have been carried out, we should concentrate more on the correlation and interrelation of the following variables:
      • Human agreement
      • Representation of genre classes
      • Number of genre classes
      • Nature of genre classes
      • Size of the whole corpus
      • Structured and unstructured noise
      • Genre-revealing features that account for the context that genres carry with them
      • New computational models and algorithms…
  • 55. Certainties…
      • Genre is a useful concept in many disciplines
      • Automatic genre classification is feasible, and there is ample space for improvement
      • I am interested in your views on (web) genre:
          • send me your impressions, ideas, gut feelings and your genre classes:
              • Facebook page: www.facebook.com/genresontheweb
              • Genre blog: www.forum.santini.se
              • Webrider’s short proposal to the EU: www.webrider.se
  • 56. Thank you for your attention!
  • 57. References (I)
      • Bateman, John (2008) Multimodality and Genre, Palgrave Macmillan
      • Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre: An Introduction to History, Theory, Research, and Pedagogy (free book); http://wac.colostate.edu/books/bawarshi_reiff/genre.pdf
      • Bruce, Ian (2008) Academic Writing and Genre, Continuum
      • Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic Variation and Genre, De Gruyter Mouton
  • 58. References (II)
      • Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in the Internet, John Benjamins Publishing Company
      • Heyd, Theresa (2008) Email Hoaxes: Form, function, genre ecology, John Benjamins Publishing Company
      • Lee, David (2001) Genres, Registers, Text Types, Domains, and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle, Language Learning & Technology, September 2001, Vol. 5, Num. 3, pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
  • 59. References (III)
      • Luzón, María José, Ruiz-Madrid, María Noelia and Villanueva, María Luisa (eds) (2010) Digital Genres, New Literacies and Autonomy in Language Learning, Cambridge Scholars Publishing
      • Martin, James and Rose, David (2008) Genre Relations: Mapping Culture, Equinox
      • Puschmann, Cornelius (2010) The corporate blog as an emerging genre of computer-mediated communication: features, constraints, discourse situation, Universitätsverlag Göttingen
      • WEGA prototype download, documentation and references: http://www.uni-weimar.de/cms/medien/webis/research/projects/wega.html
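
Slides 24 to 27 list the ingredients of automatic genre identification: a genre palette with a corpus, automatically extractable features (for instance POS trigrams such as ADV ADJ NOUN), a vector representation, and a classifier. The following is a minimal sketch of that pipeline, not any of the systems presented in the talk. The two-genre palette, the tag sequences and all function names are invented for illustration, and a nearest-centroid rule stands in for the SVM mentioned on the slides so that the example runs with the Python standard library alone.

```python
from collections import Counter


def pos_trigrams(tags):
    """Count consecutive POS-tag triples, e.g. ('ADV', 'ADJ', 'NOUN')."""
    return Counter(zip(tags, tags[1:], tags[2:]))


def to_vector(counts, vocabulary):
    """Relative frequency of each vocabulary trigram in one document."""
    total = sum(counts.values()) or 1  # avoid division by zero on very short texts
    return [counts[trigram] / total for trigram in vocabulary]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(corpus):
    """Build the trigram vocabulary and one centroid per genre class."""
    counts = {genre: [pos_trigrams(doc) for doc in docs]
              for genre, docs in corpus.items()}
    vocabulary = sorted({trigram
                         for docs in counts.values()
                         for doc_counts in docs
                         for trigram in doc_counts})
    centroids = {genre: centroid([to_vector(c, vocabulary) for c in docs])
                 for genre, docs in counts.items()}
    return vocabulary, centroids


def classify(tags, vocabulary, centroids):
    """Assign the genre whose centroid is closest to the document vector."""
    vector = to_vector(pos_trigrams(tags), vocabulary)
    return min(centroids, key=lambda g: squared_distance(vector, centroids[g]))


# Hypothetical toy corpus: each document is already reduced to its POS tags.
toy_corpus = {
    "faq": [
        ["PRON", "VERB", "DET", "NOUN", "PUNCT"],
        ["ADV", "VERB", "PRON", "VERB", "PUNCT"],
    ],
    "news": [
        ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"],
        ["ADV", "ADJ", "NOUN", "VERB", "NOUN"],
    ],
}

vocabulary, centroids = train(toy_corpus)
print(classify(["DET", "ADJ", "NOUN", "VERB", "NOUN"], vocabulary, centroids))
```

In a realistic setting the tag sequences would come from a POS tagger run over web pages, and the centroid rule would be replaced by a trained classifier; the sketch only shows how the palette, features and vectors fit together.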

Editor's notes

  1. Kris I is pdf7-webgenre collection