This paper was presented by Anders Johannsen, Dirk Hovy, and Anders Søgaard of the University of Copenhagen. It examines sociolinguistic variables such as age, gender, region, and review rating, and relates them to language use in reviews written in different languages on Trustpilot.
1. USER REVIEW SITES AS A RESOURCE FOR LARGE-SCALE SOCIOLINGUISTIC STUDIES
By Ashutosh Bhargave
Paper authors: Anders Johannsen, Dirk Hovy, Anders Søgaard
University of Copenhagen
3. Sociolinguistic studies
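Problems:
• Traditional approach: studies often lack the statistical power to draw valid conclusions.
• Social media data: large in scale, but mostly lacks extra-linguistic information.
Remedy:
• The paper aims to remedy both problems by exploring a large new data source: international review websites with user profiles.
Relation studied: language ↔ extra-linguistic variables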
4. DATA FORMAT:
• The Trustpilot Corpus consists of user reviews from the Trustpilot website.
• Users need to register with a username in order to leave a review.
• There are no mandatory profile fields other than the name.
• Unique identifiers are assigned to both users and companies and used to link up reviews.
• The main interest is in age, gender, and location in combination with the written reviews.
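The linking described above can be pictured with a small sketch. The class names, field names, and the `link_reviews` helper are hypothetical and only illustrate the idea of joining reviews to users through unique identifiers; they are not the corpus's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    user_id: int                    # unique identifier assigned to the user
    name: str                       # the only mandatory profile field
    age: Optional[int] = None       # optional, often missing
    gender: Optional[str] = None    # optional, often missing
    location: Optional[str] = None  # optional, often missing


@dataclass
class Review:
    user_id: int     # links the review to its author
    company_id: int  # links the review to the reviewed company
    rating: int
    text: str


def link_reviews(users, reviews):
    """Group reviews under the user who wrote them, keyed by user_id."""
    by_user = {u.user_id: [] for u in users}
    for r in reviews:
        by_user.setdefault(r.user_id, []).append(r)
    return by_user
```

With this layout, age, gender, and location live on the user record and can be combined with every review that user wrote.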
5. DATA AUGMENTATION
The retrieved data set was augmented in two ways:
1. gender information based on first names, and
2. geotagging information (latitude and longitude).
Problems:
1. Missing gender information in user profiles.
2. Resolving a location to a "canonical" town.
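One way to add gender information from first names is sketched below. It assumes the name-to-gender table is built from users who did state their gender, and it uses a 0.95 dominance threshold; both are illustrative assumptions, not details taken from the slides.

```python
from collections import Counter, defaultdict


def build_name_table(labelled):
    """labelled: iterable of (first_name, gender) pairs from users who stated a gender."""
    counts = defaultdict(Counter)
    for name, gender in labelled:
        counts[name.lower()][gender] += 1
    return counts


def guess_gender(first_name, table, threshold=0.95):
    """Return the majority gender for a name if it is dominant enough, else None.

    The threshold value is an assumption chosen for illustration.
    """
    counts = table.get(first_name.lower())
    if not counts:
        return None  # name never seen with a stated gender
    gender, n = counts.most_common(1)[0]
    return gender if n / sum(counts.values()) >= threshold else None
```

Ambiguous or unseen names are left unlabelled rather than guessed, which trades coverage for fewer wrong labels.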
6. REPRESENTATIVENESS
The data is restricted to the age range from 16 to 80.
The median age in the data is typically close to the country's median value.
There are more male than female users.
The average number of reviews per user is around 4.
8. Emoticons, age, and gender
Emoticon parts: eyes (: or ;), nose (- or none), mouth ( (, ), [, * etc.)
Women use emoticons almost twice as often as men do.
For all ages, the use of a nose is highly anti-correlated with age.
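The eyes/nose/mouth decomposition maps naturally onto a regular expression. The sketch below is a minimal illustration; the exact mouth character set and the `emoticon_stats` helper are assumptions, since the slide lists the mouth characters only partially.

```python
import re

# Eyes (: or ;), optional nose (-), then a mouth character.
# The mouth set is an illustrative assumption based on the slide's partial list.
EMOTICON = re.compile(r"(?<!\w)([:;])(-?)([()\[\]*])")


def emoticon_stats(text):
    """Return (number of emoticons, number of emoticons that have a nose)."""
    matches = EMOTICON.findall(text)
    with_nose = sum(1 for _eyes, nose, _mouth in matches if nose)
    return len(matches), with_nose
```

Aggregating these two counts per user, and then per age group or gender, gives the kind of emoticon and nose frequencies the slide refers to.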
9. Ratings, categories, gender, and age
Men tend to give slightly more negative ratings than women.
People in the younger group are more likely to give negative ratings than people in the older group.
10. DENMARK
Missing distinction between reflexive and non-reflexive possessive pronouns.
Record the frequency of sin/sit (his/her own) and the joint frequency of all possessive pronouns (his). Then compute the ratio of the former to all possessive pronouns.
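The ratio described above can be computed with a few lines of code. In this sketch the reflexive forms are the ones named on the slide (sin/sit), while the non-reflexive set (hans/hendes) and the simple word tokenizer are assumptions made for illustration.

```python
import re

REFLEXIVE = {"sin", "sit"}          # forms named on the slide
NON_REFLEXIVE = {"hans", "hendes"}  # assumed non-reflexive counterparts


def reflexive_ratio(texts):
    """Share of reflexive forms among all counted possessive pronouns, or None if none occur."""
    reflexive = total = 0
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            if token in REFLEXIVE:
                reflexive += 1
                total += 1
            elif token in NON_REFLEXIVE:
                total += 1
    return reflexive / total if total else None
```

Computing this ratio separately for the reviews from each location would then allow regions to be compared.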
11. Swear words across location, gender, and age
• As people grow older, they tend to use more conservative language.
• Women use the stronger swear words less than men do.
12. GERMAN
Replacement: ß with ss
Examples: dass/daß, "that", and the modal müssen, "must" (e.g. muss/muß)
Older speakers retain the traditional spelling they acquired in their youth to a much greater extent.
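A comparison of spelling variants across age groups could look like the sketch below. The word pairs, the age-group labels, and the `traditional_share_by_group` helper are illustrative assumptions rather than the setup used in the paper.

```python
import re
from collections import defaultdict

# Traditional spelling -> reformed spelling; the pairs are illustrative assumptions.
PAIRS = {"daß": "dass", "muß": "muss"}
REFORMED = set(PAIRS.values())


def traditional_share_by_group(reviews):
    """reviews: iterable of (age_group, text).

    Returns, per age group, the share of traditional (ß) spellings among all
    occurrences of the tracked words. Groups with no occurrences are omitted.
    """
    traditional = defaultdict(int)
    total = defaultdict(int)
    for group, text in reviews:
        for token in re.findall(r"\w+", text.lower()):
            if token in PAIRS:
                traditional[group] += 1
                total[group] += 1
            elif token in REFORMED:
                total[group] += 1
    return {g: traditional[g] / total[g] for g in total}
```

A higher share in the older group than in the younger group would be consistent with the observation on the slide.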
13. CONCLUSION
Traditional sociolinguistic studies often lack the statistical power to draw valid conclusions, and big-data approaches to language studies mostly lack the extra-linguistic information that would enable sociolinguistic studies.
The solution proposed for this dilemma is user review sites.