The document discusses lexical quality in social media. It defines lexical quality as the degree of excellence of words in a text. The researchers analyzed spelling errors across different social media sites and classes, finding that social networks had the most errors, followed by multimedia sites and blogs. On average, the lexical quality of social media was found to be worse than that of the overall web. The researchers also examined geographical differences and error types. Future work could involve expanding the analysis to more sites and words.
1. How Bad Do You Spell?
The Lexical Quality of Social Media
Ricardo Baeza-Yates: Yahoo! Research & Web Research Group, Pompeu Fabra University, Barcelona, Spain
Luz Rello: Web Research and NLP Groups, Pompeu Fabra University, Barcelona, Spain
FoSW 2011, Barcelona
2. Outline
— What & Why: lexical quality
— How: word sample, criteria, measure
— Results: comparison of Social Media with the Web
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
3. What
Social Media: user-generated content (social networks, blogs, micro-blogs, multimedia & opinions).
Quality:
— content: accuracy, source reputation, objectivity, highly current, etc.
— representation: legibility, spelling errors, grammatical errors, etc. (Lexical Quality)
— community feedback, user interactions, click counts...
4. What
Lexical quality mainly refers to the degree of excellence of words in a text.
A lexical representation has high quality to the extent that it has a fully specified orthographic representation (a spelling) and redundant phonological representations (one from spoken language and one recoverable from orthographic-to-phonological mapping).
(Perfetti and Hart, 2002)
5. Why
Lexical quality is used for:
— Spam detection (Castillo, Donato, Gionis, Murdock and Silvestri 2007)
— Credibility determination (Fogg et al. 2001)
— Wikipedia vandalism detection (Potthast, Stein, and Gerling 2008)
Lexical quality is a useful metric:
— There is a strong correlation between spelling errors and web data content quality (Gelman and Barletta 2008)
6. Methodology
— Sample the Web using a measure for lexical quality (Baeza-Yates and Rello 2011)
— A set of words (errors) (Gelman and Barletta 2008)
— Detailed classification of spelling errors in the Web (Baeza-Yates and Rello 2011)
— Theoretical reason (Perfetti and Hart 2002)
— Practical reason (Ringlstetter, Schulz and Mihov 2006)
7. How Many Kinds of Errors in the Web?
Regular spelling errors: produced by non-impaired native English individuals, such as the transposition error, e.g. *recieve.
Regular typos: caused by the adjacency of letters in the keyboard, e.g. *teceive.
Non-native speakers errors: made by people who use English as a foreign language, e.g. *receibe.
Dyslexic errors: errors commonly made by dyslexics, e.g. unfinished words or letters, omitted words, inconsistent spaces between words and letters (Vellutino, 1979), such as *reiecve instead of receive.
OCR errors: due to letters of similar shape, such as *ieceive.
8. Sample of Words
Ten words extracted from a sample of 50 words + variants with errors = 1,345 words.
E.g., errors for comparison:
Dyslexic error: *comaprsion.
Spelling errors: *comparision, *conparison and *coparison.
Typos: *vomparison, *xomparison, *cimparison, *cpmparison, *conparison, *co,parison, *comoarison, *com[arison, *comprison, *compsrison, *compaeison, *compatison, *comparuson, *comparoson, *compariaon, *comparidon, *comparisin, *comparispn, *comparisob and *comparisom.
OCR errors: *compaiison and *comparisom.
Non-native speakers: *comparition and *comparizon.
9. How Did We Find/Generate the Errors?
Regular spelling errors: High frequency in query logs.
Regular typos: Generated by substituting each of the letters of the intended word with the letter situated immediately up, down, left and right from the intended letter.
Non-native speakers errors: Linguistic knowledge. *gobernment is a typical error made by Spanish learners of English.
— Graphemes <b> and <v> are pronounced as /b/.
— Phoneme /v/ does not exist in the standard Spanish phonemic system.
— In Spanish, the equivalent word is written with <b>.
Dyslexic errors: Texts written by dyslexic users and literature (Pedler 2007).
OCR errors: Substituting typical letters which are usually mistaken.
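The typo-generation rule above (substitute each letter with the key immediately left, right, up or down) can be sketched in Python. This is a minimal illustration, not the authors' tool: the names `KEY_ROWS`, `adjacent_keys` and `keyboard_typos` are hypothetical, and treating the QWERTY rows as a column-aligned grid (including a few punctuation keys) is an assumption, since real keyboards are staggered.

```python
# Simplified QWERTY rows. Treating keys as a column-aligned grid is an
# assumption (real keyboards are staggered), as is the choice of
# punctuation keys included at the end of each row.
KEY_ROWS = ["qwertyuiop[", "asdfghjkl;", "zxcvbnm,."]


def adjacent_keys(letter):
    """Return the keys immediately left, right, up and down of `letter`."""
    for r, row in enumerate(KEY_ROWS):
        c = row.find(letter)
        if c == -1:
            continue
        positions = [(r, c - 1), (r, c + 1), (r - 1, c), (r + 1, c)]
        return [
            KEY_ROWS[i][j]
            for i, j in positions
            if 0 <= i < len(KEY_ROWS) and 0 <= j < len(KEY_ROWS[i])
        ]
    return []  # key is not part of the simplified layout


def keyboard_typos(word):
    """Substitute each letter of `word` with each adjacent key, one at a time."""
    word = word.lower()
    typos = set()
    for pos, letter in enumerate(word):
        for neighbor in adjacent_keys(letter):
            typos.add(word[:pos] + neighbor + word[pos + 1:])
    typos.discard(word)
    return sorted(typos)
```

Under these assumptions, `keyboard_typos("comparison")` includes variants listed in the word sample such as *vomparison, *co,parison and *com[arison, and `keyboard_typos("receive")` includes *teceive.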
10. Selection Criteria
— Starting point was the selection of dyslexic errors.
— The errors related to each word are unique and unambiguous.
— To avoid the overlap of different types of errors:
— E.g.: We consider only words written by dyslexics containing multi-errors, that is, the dyslexic word differs from the intended correct word by more than one letter. For example, the dyslexic word *konwlegde from knowledge.
— To avoid the overlap of dyslexic errors and real words:
— Errors which coincide with other existing words in English are omitted, e.g. *trust when the intended word is truth.
— Errors which give as a result a proper name are also filtered, for instance the typo *wirries from worries is also a proper name.
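The selection criteria above can be sketched as a simple filter. This is an illustrative sketch under stated assumptions: "differs by more than one letter" is approximated here with Levenshtein edit distance, which the slides do not specify, and `lexicon` and `proper_names` are hypothetical word sets that the caller would have to supply.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def keep_dyslexic_error(error, intended, lexicon, proper_names):
    """Apply the three criteria: multi-error, not a real word, not a proper name."""
    if edit_distance(error, intended) <= 1:
        return False  # single-letter difference: may overlap with typos
    if error in lexicon:
        return False  # coincides with an existing English word
    if error in proper_names:
        return False  # coincides with a proper name
    return True
```

With a lexicon containing "trust" and "truth", `keep_dyslexic_error("konwlegde", "knowledge", ...)` returns `True`, while `keep_dyslexic_error("trust", "truth", ...)` returns `False`.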
11. Sample of Words
album *albun
always *alwasy
around *arround
because *becuase
enough *enoguh
everything *everyhting
having *haveing
problem *problen
remember *remenber
working *workig
— Relatively long (an average length of 8.2 letters).
12. Lexical Quality Measure
— Relative ratio of the misspellings to the correct spellings, averaged over our word sample W:
LQ = mean over w_i in W of ( df(misspelled w_i) / df(correct w_i) )
— A lower value of LQ implies better lexical quality, with 0 being perfect quality.
— We estimate df (document frequency) by searching each word only in the English pages of a major search engine.
— The relative order of the measure will hardly change as the size of the set grows.
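A minimal Python sketch of this measure, assuming the document frequencies have already been collected into a dictionary. The function name `lexical_quality`, the `doc_freq` mapping and the counts in the usage note are hypothetical placeholders, not data from the study.

```python
from statistics import mean


def lexical_quality(doc_freq, word_pairs):
    """Average ratio df(misspelling) / df(correct spelling) over a word sample.

    doc_freq:   mapping from a word to its estimated document frequency
                (hypothetical input, e.g. search-engine hit counts).
    word_pairs: iterable of (correct, misspelled) pairs.
    Lower is better; 0 means no misspellings were found.
    """
    ratios = []
    for correct, misspelled in word_pairs:
        df_correct = doc_freq.get(correct, 0)
        if df_correct == 0:
            continue  # ratio undefined when the correct word never occurs
        ratios.append(doc_freq.get(misspelled, 0) / df_correct)
    if not ratios:
        raise ValueError("no word pair has a non-zero correct-spelling count")
    return mean(ratios)
```

For example, with made-up counts `{"album": 1000, "albun": 10, "always": 2000, "alwasy": 10}` and the pairs `("album", "albun")` and `("always", "alwasy")`, the per-word ratios are 0.01 and 0.005, so the result is 0.0075.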
13. Social Media Sites Classes
Social Networks (S): Bebo, Facebook, Friendster, Hi5, LinkedIn, MySpace
Collaboration Sites (C): CiteULike, Digg, Wikia, Wikipedia, Wikispaces
Blogs & Micro-blogs (B): Blogger, Foursquare, LiveJournal, Tumblr, Twitter
Multimedia Sites (M): Flickr, Fotolog, Last.fm, Picasa, Youtube
Opinions & Community Question-Answering Systems (O): Epinions, Quora, Yelp, Y! Answers
Based on:
— Number of users
— Alexa ranking
14. The Lexical Quality of Social Media
— Relative size considering public content (# words according to a major search engine).
  — 55% social networks
  — 23% multimedia
  — 20% blogs
— No correlation of public content and LQ.
— Errors in the Social Media Web
  — 62% Social Networks (S)
  — 19% Multimedia (M)
  — 47% Facebook
  — Almost 80% Facebook + Fotolog + Blogger + MySpace + Hi5
— No clear order for the site classes: Collaborative sites are the best ones, followed by blogs, multimedia, social networks, and further away opinions.
[Table: Range and average lexical quality in percentages (values over the social media average are highlighted)]
15. The Lexical Quality of Social Media
— Just eight sites have lexical quality that is better than the average of the Web, but those account for less than 27% of the content.
— On average, social media classes have lexical quality worse than the Web itself.
— Compared to high quality sites, the quality of social media is one order of magnitude worse.
— The lower quality of social media impacts many more sites. For example, we found that the community section of the NY Times is the main contributor to the decrease of their lexical quality. A similar effect occurs in CNN or Microsoft.
[Table: Range of percentages and average for a sample of frequent misspellings in several sets of Web sites (values over the Web average are highlighted)]
16. Geographical Distribution of Web LQ
— Lower LQ and higher Internet usage in Ireland, United Kingdom, Australia, New Zealand and Canada.
— Higher impact of social media in countries where Internet penetration is higher.
17. On-going Work
Error Types in the Web
[Table: Range of percentages and average for the different error classes and their prevalence in the Web]
[Figure: Absolute error rate for different typing errors]
— These types of errors are quite word dependent, which explains the wide range of percentage values obtained.
— More left-right than up and down typos.
— The percentage of dyslexic errors is much lower than the corresponding number of dyslexic users (say 10%).
18. Conclusions
• Correlation between popularity and perceived semantic quality and our defined lexical quality.
• Lexical quality of social media can be used to estimate the semantic quality of social media.
• Nevertheless, these estimations should be taken with a grain of salt, as they will change with a different sample of sites and/or words.
• Still, we believe that the main results will be maintained, e.g. that the lexical quality of social media is worse than that of the overall Web.
• Our results contribute to the difficult and still open problem of measuring the quality of content in social media and the Web in general.
19. Future Work
1 — Define new ways to measure LQ and compare them with these results to check for consistency.
2 — Increase the sample of social media sites studied.
3 — Increase the sample of words to measure the LQ.
4 — A linguistic model for improving accessibility to Web-based textual material.
20. Outline
Thank you :-)
Any Questions?