How Bad Do You Spell?
The Lexical Quality of Social Media

  Ricardo Baeza-Yates                       Luz Rello

  Yahoo! Research &                         Web Research and
  Web Research Group,                       NLP Groups
  Pompeu Fabra University,                  Pompeu Fabra University,
  Barcelona, Spain                          Barcelona, Spain




                     FoSW 2011, Barcelona
Outline




                        — What & Why                 lexical quality



                        — How       word sample, criteria, measure



                        — Results         comparison of Social Media with the
                                          Web




Ricardo Baeza-Yates and Luz Rello   FoSW 2011, Barcelona       The Lexical Quality of Social Media
What
Social Media: user-generated content (social networks, blogs, micro-blogs, multimedia & opinions).

Quality:
    — content: accuracy, source reputation, objectivity, highly current, etc.
      (community feedback, user interactions, click counts...)
    — representation: legibility, spelling errors, grammatical errors, etc.
      (Lexical Quality)

What
Lexical Quality: lexical quality mainly refers to the degree of excellence of words in a text.

"A lexical representation has high quality to the extent that it has a fully specified orthographic representation (a spelling) and redundant phonological representations (one from spoken language and one recoverable from orthographic-to-phonological mapping)." (Perfetti and Hart, 2002)



Why
Lexical quality is used for:
    — Spam detection (Castillo, Donato, Gionis, Murdock and Silvestri 2007)
    — Credibility determination (Fogg et al. 2001)
    — Wikipedia vandalism detection (Potthast, Stein, and Gerling 2008)

Lexical quality is a useful metric:
    — There is a strong correlation between spelling errors and web data content quality (Gelman and Barletta 2008)



Methodology
    — Sample the Web using a measure for lexical quality (Baeza-Yates and Rello 2011)
    — A set of words (errors) (Gelman and Barletta 2008)
    — Detailed classification of spelling errors in the Web (Baeza-Yates and Rello 2011)
        — Theoretical reason (Perfetti and Hart 2002)
        — Practical reason (Ringlstetter, Schulz and Mihov 2006)

How Many Kinds of Errors in the Web?
Regular spelling errors:      produced by non-impaired native English speakers, such as the transposition error *recieve.

Regular typos:                caused by the adjacency of letters on the keyboard, e.g. *teceive.

Non-native speakers' errors:  made by people who use English as a foreign language, e.g. *receibe.

Dyslexic errors:              errors commonly made by dyslexics, e.g. unfinished words or letters, omitted words, and inconsistent spaces between words and letters (Vellutino, 1979); for example, *reiecve instead of receive.

OCR errors:                   due to letters of similar shape, such as *ieceive.
Sample of Words
   Ten words extracted from a sample of 50 words + variants with errors = 1,345 words.

   e.g.: errors for the word comparison

     Dyslexic error:                *comaprsion.

     Spelling errors:               *comparision, *conparison and *coparison.

     Typos:                         *vomparison, *xomparison, *cimparison, *cpmparison,
                                    *conparison, *co,parison, *comoarison, *com[arison,
                                    *comprison, *compsrison, *compaeison, *compatison,
                                    *comparuson, *comparoson, *compariaon, *comparidon,
                                    *comparisin, *comparispn, *comparisob and *comparisom.

     OCR errors:                    *compaiison and *comparisom.

     Non-native speakers:           *comparition and *comparizon.


How Did We Find/Generate the Errors?
     Regular spelling errors:      High frequency in query logs.

     Regular typos:                Generated by substituting each letter of the intended word with the letter
                                   situated immediately above, below, left and right of it on the keyboard.

     Non-native speakers' errors:  Linguistic knowledge.
                                   *gobernment is a typical error made by Spanish learners of English:
                                   — In Spanish, graphemes <b> and <v> are both pronounced as /b/.
                                   — Phoneme /v/ does not exist in the standard Spanish phonemic system.
                                   — In Spanish, the word is written with <b>.

     Dyslexic errors:              Texts written by dyslexic users and literature (Pedler 2007).

     OCR errors:                   Substituting letters which are typically mistaken for one another.
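
The typo-generation rule for regular typos can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the three-row QWERTY layout in `QWERTY_ROWS`, the reading of "up" and "down" as the key at the same column index in the adjacent row, and the names `neighbours` and `adjacency_typos` are all assumptions made for illustration.

```python
# Assumed simplified QWERTY layout (letter rows plus a few punctuation keys).
QWERTY_ROWS = ["qwertyuiop[", "asdfghjkl;", "zxcvbnm,"]


def neighbours(ch):
    """Return the keys left, right, above and below `ch` in the assumed layout."""
    for r, row in enumerate(QWERTY_ROWS):
        c = row.find(ch)
        if c == -1:
            continue
        found = []
        if c > 0:
            found.append(row[c - 1])          # left
        if c + 1 < len(row):
            found.append(row[c + 1])          # right
        for rr in (r - 1, r + 1):             # above, below (same column index)
            if 0 <= rr < len(QWERTY_ROWS) and c < len(QWERTY_ROWS[rr]):
                found.append(QWERTY_ROWS[rr][c])
        return found
    return []                                  # key not in the assumed layout


def adjacency_typos(word):
    """Substitute each letter of `word`, one at a time, with each adjacent key."""
    variants = []
    for i, ch in enumerate(word):
        for n in neighbours(ch):
            candidate = word[:i] + n + word[i + 1:]
            if candidate != word and candidate not in variants:
                variants.append(candidate)
    return variants


typos = adjacency_typos("comparison")
```

Under these assumptions, `typos` contains left-right variants shown on the sample slide, such as *vomparison and *com[arison, alongside up-down variants that depend on how the staggered keyboard rows are aligned.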
Selection Criteria
        — Starting point was the selection of dyslexic errors.

        — The errors related to a word are unique and not ambiguous.

        — To avoid the overlap of different types of errors:
                       — E.g.: We consider only words written by dyslexics containing
                       multi-errors, that is, the dyslexic word differs from the intended
                       correct word by more than one letter. For example, the dyslexic
                       word *konwlegde from knowledge.

        — To avoid the overlap of dyslexic errors and real words:
                       — Errors which coincide with other existing words in English are
                       omitted, e.g. *trust when the intended word is truth.
                       — Errors which result in a proper name are also filtered; for
                       instance, the typo *wirries from worries is also a proper name.
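
The criteria above amount to a filter over candidate errors, which might be sketched like this. It is an illustration under stated assumptions, not the authors' procedure: "differs by more than one letter" is approximated with plain Levenshtein edit distance, and the function names and the tiny `vocabulary` and `proper_names` sets are hypothetical stand-ins for a real dictionary and name list.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def keep_candidate(error, intended, vocabulary, proper_names):
    """Keep a candidate error only if it passes all three checks."""
    is_multi_error = edit_distance(error, intended) > 1
    is_real_word = error in vocabulary
    is_proper_name = error in proper_names
    return is_multi_error and not is_real_word and not is_proper_name


# Hypothetical word lists, for illustration only.
vocabulary = {"trust", "truth", "knowledge", "worries"}
proper_names = {"wirries"}

kept = keep_candidate("konwlegde", "knowledge", vocabulary, proper_names)
dropped_word = keep_candidate("trust", "truth", vocabulary, proper_names)
dropped_name = keep_candidate("wirries", "worries", vocabulary, proper_names)
```

With these inputs, *konwlegde is kept, while *trust (a real word) and *wirries (a proper name) are filtered out.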


Sample of Words

                        album                                 *albun
                        always                                *alwasy
                        around                                *arround
                        because                               *becuase
                        enough                                *enoguh
                        everything                            *everyhting
                        having                                *haveing
                        problem                               *problen
                        remember                              *remenber
                        working                               *workig


                      — Relatively long (an average length of 8.2 letters).


Lexical Quality Measure



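             — Relative ratio of the misspellings to the correct spellings, averaged
             over our word sample W:

                   LQ = \mathrm{mean}_{w_i \in W} \left( \frac{df_{\text{misspell}}(w_i)}{df_{\text{correct}}(w_i)} \right)

             — A lower value of LQ implies better lexical quality, with 0 being
             perfect quality.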
             — We estimate df by searching each word only in the English
             pages of a major search engine.
             — The relative order of the measure will hardly change as the size
             of the set grows.
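
Read literally, the measure above averages, over the word sample, the ratio of the document frequency of each misspelling to that of its correct spelling. A minimal Python sketch of that computation follows. The function name `lexical_quality`, the dictionary of counts, and the numbers in it are hypothetical placeholders, not figures from the study; in practice the counts would come from search-engine hit estimates.

```python
from statistics import mean


def lexical_quality(doc_freq, pairs):
    """Mean ratio df(misspelling) / df(correct spelling) over the word sample.

    doc_freq: mapping from a word form to its estimated document frequency.
    pairs:    iterable of (correct, misspelled) word forms.
    """
    ratios = []
    for correct, misspelled in pairs:
        if doc_freq[correct] == 0:
            raise ValueError(f"no documents found for correct form {correct!r}")
        ratios.append(doc_freq[misspelled] / doc_freq[correct])
    return mean(ratios)


# Hypothetical document frequencies, for illustration only.
doc_freq = {"album": 1000, "albun": 10, "always": 2000, "alwasy": 40}
pairs = [("album", "albun"), ("always", "alwasy")]

lq = lexical_quality(doc_freq, pairs)
```

With these made-up counts the two ratios are 0.01 and 0.02, so `lq` is about 0.015 (1.5%). A site with no misspellings at all would score 0.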


Social Media Sites Classes
    Social Networks (S):          Bebo, Facebook, Friendster, Hi5, LinkedIn, MySpace
    Collaboration Sites (C):      CiteULike, Digg, Wikia, Wikipedia, Wikispaces
    Blogs & Micro-blogs (B):      Blogger, Foursquare, LiveJournal, Tumblr, Twitter
    Multimedia Sites (M):         Flickr, Fotolog, Last.fm, Picasa, YouTube
    Opinions & Community
    Question-Answering Systems (O): Epinions, Quora, Yelp, Y! Answers

    Based on:
        — Number of users
        — Alexa ranking
The Lexical Quality of Social Media
[Table: Range and average lexical quality in percentages (values over the social media average are highlighted)]

    — Relative size considering public content (# words according to a major search engine):
        — 55% social networks
        — 23% multimedia
        — 20% blogs
    — No correlation of public content and LQ.
    — Errors in the Social Media Web:
        — 62% Social Networks (S)
        — 19% Multimedia (M)
        — 47% Facebook
        — Almost 80% Facebook + Fotolog + Blogger + MySpace + Hi5
    — No clear order for the site classes: collaborative sites are the best ones, followed by blogs, multimedia, social networks, and further away opinions.

The Lexical Quality of Social Media
[Table: Range of percentages and average for a sample of frequent misspellings in several sets of Web sites (values over the Web average are highlighted)]

    — Just eight sites have lexical quality that is better than the average of the Web, but those account for less than 27% of the content.

    — On average, social media classes have lexical quality worse than the Web itself.

    — Compared to high quality sites, the quality of social media is one order of magnitude worse.

    — The lower quality of social media impacts many more sites. For example, we found that the community section of the NY Times is the main contributor to the decrease of its lexical quality. A similar effect occurs at CNN or Microsoft.

Geographical Distribution of Web LQ




      — Lower LQ and higher Internet usage in Ireland, United Kingdom, Australia, New
      Zealand and Canada.
      —  Higher impact of social media in countries where Internet penetration is higher.

On-going Work
                                      Error Types in the Web

    [Figure: Range of percentages and average for the different error classes and their prevalence in the Web]
    [Figure: Absolute error rate for different typing errors]


           — These types of errors are quite word dependent, which explains the wide
           range of percentage values obtained.
           — More left-right than up and down typos.
           — The percentage of dyslexic errors is much lower than the corresponding
           number of dyslexic users (say 10%).

Conclusions


     • Popularity and perceived semantic quality correlate with lexical quality
     as we define it.

     • Lexical quality of social media can be used to estimate the semantic
     quality of social media.

     • These estimations should be taken with a grain of salt, as they will
     change with a different sample of sites and/or words.

     • Nevertheless, we believe that the main results will be maintained, e.g. that
     the lexical quality of social media is worse than that of the overall Web.

     • Our results contribute to the difficult and still open problem of
     measuring the quality of content in social media and the Web in general.

Future Work




              1 — Define new ways to measure LQ and compare them
              with these results to check for consistency.

              2 — Increase the sample of social media sites studied.

              3 — Increase the sample of words to measure the LQ.

              4 — Develop a linguistic model for improving accessibility to
              Web-based textual material.








                                    Thank you :-)

                                Any Questions?




 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Estimating Dyslexia in the Web

  • 1. How Bad Do You Spell? The Lexical Quality of Social Media
       Ricardo Baeza-Yates (Yahoo! Research & Web Research Group, Pompeu Fabra University, Barcelona, Spain)
       Luz Rello (Web Research and NLP Groups, Pompeu Fabra University, Barcelona, Spain)
       FoSW 2011, Barcelona
  • 2. Outline
       - What & Why: lexical quality
       - How: word sample, criteria, measure
       - Results: comparison of Social Media with the Web
  • 3. What
       Social Media: user-generated content (social networks, blogs, micro-blogs, multimedia & opinions).
       Quality:
       - content: accuracy, source reputation, objectivity, highly current, etc. (community feedback, user interactions, click counts...)
       - representation: legibility, spelling errors, grammatical errors, etc. (Lexical Quality)
  • 4. What: Lexical Quality
       Lexical quality mainly refers to the degree of excellence of words in a text.
       "A lexical representation has high quality to the extent that it has a fully specified orthographic representation (a spelling) and redundant phonological representations (one from spoken language and one recoverable from orthographic-to-phonological mapping)." (Perfetti and Hart, 2002)
  • 5. Why
       Lexical quality has been used for:
       - Spam detection (Castillo, Donato, Gionis, Murdock and Silvestri 2007)
       - Credibility determination (Fogg et al. 2001)
       - Wikipedia vandalism detection (Potthast, Stein, and Gerling 2008)
       Lexical quality is a useful metric:
       - There is a strong correlation between spelling errors and web data content quality (Gelman and Barletta 2008)
  • 6. Methodology
       - Sample the Web using a measure for lexical quality (Baeza-Yates and Rello 2011)
       - A set of words (errors) (Gelman and Barletta 2008)
          - Theoretical reason (Perfetti and Hart 2002)
          - Practical reason (Ringlstetter, Schulz and Mihov 2006)
       - Detailed classification of spelling errors in the Web (Baeza-Yates and Rello 2011)
  • 7. How Many Kinds of Errors in the Web?
       - Regular spelling errors: produced by non-impaired native English individuals, such as the transposition error, e.g. *recieve.
       - Regular typos: caused by the adjacency of letters in the keyboard, e.g. *teceive.
       - Non-native speakers errors: made by people who use English as a foreign language, e.g. *receibe.
       - Dyslexic errors: errors commonly made by dyslexics, e.g. unfinished words or letters, omitted words, inconsistent spaces between words and letters (Vellutino, 1979). For example, *reiecve instead of receive.
       - OCR errors: due to letters of similar shape, such as *ieceive.
  • 8. Sample of Words
       Ten words extracted from a sample: 50 words + variants with errors = 1,345 words.
       E.g., errors for "comparison":
       - Dyslexic error: *comaprsion.
       - Spelling errors: *comparision, *conparison and *coparison.
       - Typos: *vomparison, *xomparison, *cimparison, *cpmparison, *conparison, *co,parison, *comoarison, *com[arison, *comprison, *compsrison, *compaeison, *compatison, *comparuson, *comparoson, *compariaon, *comparidon, *comparisin, *comparispn, *comparisob and *comparisom.
       - OCR errors: *compaiison and *comparisom.
       - Non-native speakers: *comparition and *comparizon.
  • 9. How Did We Find/Generate the Errors?
       - Regular spelling errors: high frequency in query logs.
       - Regular typos: generated by substituting each of the letters of the intended word with the letter situated immediately up, down, left and right from the intended letter.
       - Non-native speakers errors: linguistic knowledge. *gobernment is a typical error made by Spanish learners of English.
          - In Spanish, graphemes <b> and <v> are both pronounced as /b/.
          - Phoneme /v/ does not exist in the standard Spanish phonemic system.
          - In Spanish, the word is written with <b>.
       - Dyslexic errors: texts written by dyslexic users and literature (Pedler 2007).
       - OCR errors: substituting letters that are typically mistaken for one another.
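The typo-generation rule on this slide (replace each letter with the key immediately left, right, up or down of it) can be sketched roughly as follows. This is only an illustration, not the authors' code: the `QWERTY_ROWS` table, the function names, and the way "up" and "down" are approximated (same column index in the adjacent row, ignoring the physical stagger of real keyboards) are all assumptions.

```python
# Rows of a US QWERTY layout (letters plus a few punctuation keys).
# Assumption: "up"/"down" neighbours are taken at the same column index
# in the adjacent row; a real keyboard is staggered, so this is approximate.
QWERTY_ROWS = [
    "qwertyuiop[",
    "asdfghjkl;",
    "zxcvbnm,.",
]


def neighbours(ch):
    """Return the keys immediately left, right, up and down of ch."""
    for r, row in enumerate(QWERTY_ROWS):
        c = row.find(ch)
        if c == -1:
            continue
        found = []
        if c > 0:
            found.append(row[c - 1])           # left neighbour
        if c + 1 < len(row):
            found.append(row[c + 1])           # right neighbour
        for rr in (r - 1, r + 1):              # row above, then row below
            if 0 <= rr < len(QWERTY_ROWS) and c < len(QWERTY_ROWS[rr]):
                found.append(QWERTY_ROWS[rr][c])
        return found
    return []                                  # key not in the table


def adjacent_key_typos(word):
    """Return all variants of word with exactly one letter replaced by a neighbouring key."""
    variants = set()
    for i, ch in enumerate(word):
        for n in neighbours(ch):
            variants.add(word[:i] + n + word[i + 1:])
    return sorted(variants)
```

With this table, `adjacent_key_typos("comparison")` includes left-right substitutions such as *vomparison and *co,parison, which also appear in the typo list for "comparison" in this deck.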
  • 10. Selection Criteria
       - Starting point was the selection of dyslexic errors.
       - The errors related to a word are unique and not ambiguous.
       - To avoid the overlap of different types of errors:
          - E.g.: we consider only words written by dyslexics containing multi-errors, that is, the dyslexic word differs from the intended correct word by more than one letter. For example, the dyslexic word *konwlegde from knowledge.
       - To avoid the overlap of dyslexic errors and real words:
          - Errors which coincide with other existing words in English are omitted, e.g. *trust being the intended word truth.
          - Errors which result in a proper name are also filtered, for instance the typo *wirries from worries is also a proper name.
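The selection criteria above could be applied mechanically along these lines. This is a minimal sketch under stated assumptions: "differs by more than one letter" is interpreted here as a Levenshtein edit distance greater than 1, and the `lexicon` and `proper_names` sets are hypothetical stand-ins for whatever word lists were actually used.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def keep_error(error, intended, lexicon, proper_names):
    """Return True if the candidate error passes the selection criteria.

    lexicon and proper_names are hypothetical sets of lowercase strings.
    """
    if error in lexicon:        # coincides with an existing English word
        return False
    if error in proper_names:   # coincides with a proper name
        return False
    # Assumption: "more than one letter" means edit distance > 1.
    return edit_distance(error, intended) > 1
```

For instance, with `lexicon = {"trust", "truth", "knowledge", "worries"}` and `proper_names = {"wirries"}`, the candidate *konwlegde is kept, while *trust and *wirries are filtered out.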
  • 11. Sample of Words
       album        *albun
       always       *alwasy
       around       *arround
       because      *becuase
       enough       *enoguh
       everything   *everyhting
       having       *haveing
       problem      *problen
       remember     *remenber
       working      *workig
       - Relatively long (an average length of 8.2 letters).
  • 12. Lexical Quality Measure
       - Relative ratio of the misspellings to the correct spellings, averaged over our word sample W:
            LQ = mean_{w_i in W} ( df_misspell(w_i) / df_correct(w_i) )
       - A lower value of LQ implies better lexical quality, with 0 being perfect quality.
       - We estimate df by searching each word only in the English pages of a major search engine.
       - The relative order of the measure will hardly change as the size of the set grows.
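The measure described on this slide (the ratio of misspelling frequency to correct-spelling frequency, averaged over the word sample) can be sketched as below. The function name and input shape are assumptions made for illustration, and the counts in the usage note are made-up numbers, not results from the study.

```python
def lexical_quality(doc_freqs):
    """Average ratio of misspelling df to correct-spelling df.

    doc_freqs maps each correct word to a pair
    (df_correct, df_misspelling), e.g. hit counts from a search engine.
    Lower is better; 0.0 means no misspellings were found.
    """
    ratios = [
        df_misspelling / df_correct
        for df_correct, df_misspelling in doc_freqs.values()
        if df_correct > 0            # skip words with no correct occurrences
    ]
    if not ratios:
        raise ValueError("no word with a non-zero correct-spelling count")
    return sum(ratios) / len(ratios)
```

As a usage example with invented counts, `lexical_quality({"album": (1000, 10), "always": (2000, 10)})` averages the ratios 0.01 and 0.005. Expressing the result as a percentage, as the tables in this deck do, would only require multiplying by 100.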
  • 13. Social Media Sites Classes
       - Social Networks (S): Bebo, Facebook, Friendster, Hi5, LinkedIn, MySpace
       - Collaboration Sites (C): CiteULike, Digg, Wikia, Wikipedia, Wikispaces
       - Blogs & Micro-blogs (B): Blogger, Foursquare, LiveJournal, Tumblr, Twitter
       - Multimedia Sites (M): Flickr, Fotolog, Last.fm, Picasa, Youtube
       - Opinions & Community Question-Answering Systems (O): Epinions, Quora, Yelp, Y! Answers
       Based on:
       - Number of users
       - Alexa ranking
  • 14. The Lexical Quality of Social Media
       - Relative size considering public content (# words according to a major search engine):
          - 55% social networks
          - 23% multimedia
          - 20% blogs
       - No correlation of public content and LQ.
       - Errors in the Social Media Web:
          - 62% Social Networks (S)
          - 19% Multimedia (M)
          - 47% Facebook
          - Almost 80% Facebook + Fotolog + Blogger + MySpace + Hi5
       - No clear order for the site classes: collaborative sites are the best ones, followed by blogs, multimedia, social networks, and further away opinions.
       [Caption: Range and average lexical quality in percentages (values over the social media average are highlighted)]
  • 15. The Lexical Quality of Social Media
       - Just eight sites have lexical quality that is better than the average of the Web, but those account for less than 27% of the content.
       - On average, social media classes have lexical quality worse than the Web itself.
       - Compared to high quality sites, the quality of social media is one order of magnitude worse.
       - The lower quality of social media impacts many more sites. For example, we found that the community section of the NY Times is the main contributor to the decrease of their lexical quality. A similar effect occurs in CNN or Microsoft.
       [Caption: Range of percentages and average for a sample of frequent misspellings in several sets of Web sites (values over the Web average are highlighted)]
  • 16. Geographical Distribution of Web LQ
       - Lower LQ and higher Internet usage in Ireland, United Kingdom, Australia, New Zealand and Canada.
       - Higher impact of social media in countries where Internet penetration is higher.
  • 17. On-going Work: Error Types in the Web
       [Caption: Range of percentages and average for the different error classes and their prevalence in the Web]
       [Caption: Absolute error rate for different typing errors]
       - These types of errors are quite word dependent, which explains the wide range of percentage values obtained.
       - More left-right than up and down typos.
       - The percentage of dyslexic errors is much lower than the corresponding number of dyslexic users (say 10%).
  • 18. Conclusions
       - Correlation between popularity and perceived semantic quality and our defined lexical quality.
       - Lexical quality of social media can be used to estimate the semantic quality of social media.
       - Nevertheless, these estimations should be taken with a grain of salt, as they will change with a different sample of sites and/or word sample.
       - Still, we believe that the main results will be maintained, e.g. that the lexical quality of social media is worse than that of the overall Web.
       - Our results contribute to the difficult and still open problem of measuring the quality of content in social media and the Web in general.
  • 19. Future Work
       1. Define new ways to measure LQ and compare them with these results to check for consistency.
       2. Increase the sample of social media sites studied.
       3. Increase the sample of words to measure the LQ.
       4. A linguistic model for improving accessibility to Web-based textual material.
  • 20. Thank you :-) Any Questions?