Tracking the Emergence of New Words across Time and Space
1. Tracking the Emergence of
New Words across Time and Space
Jack Grieve
Aston University
Research conducted with
Diansheng Guo & Alice Kasakoff, University of South Carolina
Andrea Nini, Aston University
Funded as part of the Digging into Data Challenge
2. Approaches to Historical Linguistics
There are several different approaches to the analysis of
language change:
Reconstruction through comparison of known languages
(comparative method)
Analysis of previous linguistic research (e.g. lexicographic
research)
Analysis of historical texts (corpus-based)
Apparent time studies with interview data (sociolinguistics)
Computer simulations
3. Lexical Change
Research in historical linguistics and etymology has
analysed how the usage of certain words has changed
over relatively long periods of time (primarily based on
historical corpora and lexicographic research), but overall
there are large gaps in our knowledge of lexical change,
including how newly emerging words enter a language
and spread across its speakers.
4. Words are Rare Events
The main problem with studying lexical variation and
change is that most words are incredibly rare, thus
requiring incredibly large corpora of natural language.
This is why most research on lexical variation and
change has focused on relatively high frequency words,
primarily function words (e.g. pronouns, prepositions,
auxiliary verbs).
7. Word Frequency Distribution (Zipf 1935, 1945)
The majority of the 67,000 most frequent words in our
corpus occur less than once per 25 million words.
8. New Words are Incredibly Rare Events
The analysis of new words requires even more data,
because emerging words are by definition especially
rare.
In addition, to analyse the temporal and spatial spread
of new words, large corpora must be compiled for a
large number of points in time and locations.
9. Big Data
Suitable data has recently become available with the
rise of social media and smartphones, which provide
massive amounts of time-stamped and geocoded
natural language data.
10. Goals of Today’s Talk
Identify emerging words from 2014 based on a
multi-billion word corpus of American tweets.
Chart their usage over time and identify common
temporal patterns of lexical spread.
Map their geographical diffusion and identify common
spatial patterns of lexical spread.
11. The Corpus
Since 2013, the team at USC have been compiling two
multi-billion word geocoded corpora for the US and the UK
using the Twitter API.
Twitter is a particularly rich source of geocoded data and
is also very popular, informal, and youthful, making it ideal
for tracking the emergence of new words.
Approximately 2% of tweets are geocoded.
12. The Corpus
The analysis today is based on an 8.9 billion word
corpus of American tweets from October 2013 to
November 2014, which totals approximately 980 million
tweets from 7 million users.
Every tweet is geocoded with the precise longitude and
latitude of the user when posting, which were then used
to identify the county where each tweet was produced.
21. Corpus Examples
username,fips,time,tweet
--,48439,Sun Jul 27 23:59:59 EDT 2014,don't follow the right ppl lol
--,42007,Sun Jul 27 23:59:59 EDT 2014,yesss moody judy
--,36005,Sun Jul 27 23:59:59 EDT 2014,Man i was just thinking shexx be lurking but won't hmu
--,25021,Sun Jul 27 23:59:59 EDT 2014,no seeing u on tv is reel but not seeing u on twitter is real for me...so pls visit us here everyday.
--,26163,Sun Jul 27 23:59:59 EDT 2014,Hate seeing my friends sad
--,12093,Sun Jul 27 23:59:59 EDT 2014,this is the shirt i won that i got to sign btw!!:)
26. Identifying Rising Words
To find newly emerging words, we first measured the
degree to which the usage of each word in the corpus
had been rising over the 13-month period.
To identify these rising words we extracted the 67,000
words that occur at least 1,000 times in the corpus and
compared word relative frequency per day to day of the
year using a Spearman’s rank correlation coefficient.
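The measure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names (average_ranks, spearman_rho, rising_score), the whitespace-tokenised daily_tokens input, and the choice to score a constant series as 0 are all assumptions made for the example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series counts as "no trend"
    return cov / (var_x * var_y) ** 0.5


def rising_score(word, daily_tokens):
    """Correlate a word's relative frequency per day with the day index.

    daily_tokens is a hypothetical input: one list of tokens per day,
    in chronological order.
    """
    days, rel_freqs = [], []
    for day_index, tokens in enumerate(daily_tokens):
        if tokens:  # skip days with no data
            days.append(day_index)
            rel_freqs.append(tokens.count(word) / len(tokens))
    return spearman_rho(days, rel_freqs)
```

A score near 1 means the word's relative frequency rose almost monotonically over the period, which is what the ρ values on the following slides report.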
30. The Top 10 Rising Words on Twitter 2014
Word        ρ      Definition
fuckboy     0.947  Asshole, Jerk, Poser, Tool, etc.
rn          0.938  Right Now (Top Riser 2013)
hbd         0.928  Happy Birthday
fw          0.927  Fuck with
unbothered  0.926  Unconcerned & Disengaged
ft          0.925  Face time
gmfu        0.924  Get me fucked up
sm          0.919  So Much
squad       0.919  Squad
asf         0.918  As fuck
36. Identifying Emerging Words
Although measuring correlations allows for rising words
to be identified, most are far too common by 2014 to
show patterns of regional spread.
To identify emerging words we cross-referenced the list
of rising words against a list of rare words, defined as
words with low overall frequencies in the fourth quarter
of 2013 (excluding proper nouns).
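The cross-referencing step might look something like the sketch below. The thresholds (min_rho, max_early_per_million) and the function and parameter names are hypothetical; the talk does not state the cut-offs that were actually used.

```python
def emerging_words(rho_by_word, early_counts, early_total_words,
                   min_rho=0.8, max_early_per_million=1.0,
                   proper_nouns=frozenset()):
    """Keep rising words that were still rare in the early period.

    rho_by_word:       {word: Spearman's rho over the whole period}
    early_counts:      {word: occurrences in the early period, e.g. Q4 2013}
    early_total_words: total number of words in the early period
    The two thresholds are illustrative assumptions, not the study's values.
    """
    emerging = []
    for word, rho in rho_by_word.items():
        early_rate = early_counts.get(word, 0) / early_total_words * 1_000_000
        if (rho >= min_rho
                and early_rate <= max_early_per_million
                and word not in proper_nouns):
            emerging.append((word, rho))
    # Strongest risers first.
    return sorted(emerging, key=lambda pair: pair[1], reverse=True)
```

Under this filter a word such as rn would drop out if it was already frequent in late 2013, while a word that was nearly absent then would be kept.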
47. Top 10 Emerging Words on Twitter 2014
Word        ρ      Definition
unbothered  0.926  Unconcerned & Disengaged
gmfu        0.924  Get Me Fucked Up
joggers     0.908  Jogging pants
fuckboys    0.902  Losers, wimps, posers, etc.
rekt        0.900  Wrecked
tfw         0.879  That feel when
xans        0.878  Benzodiazepine pills
baeless     0.875  To be without a bae
boolin      0.857  Hanging out, esp. young men
lordt       0.854  Lord, as exclamation
48. Top 11-20 Emerging Words on Twitter 2014
Word        ρ      Definition
celfie      0.852  selfie
slays       0.843  impresses, succeeds at, etc.
famo        0.840  family and friends
fuckboi     0.838  fuckboy
(on) fleek  0.838  on point, esp. eyebrows
faved       0.836  to favorite something
gainz       0.828  earnings
bruuh       0.817  bro
amirite     0.816  am I right
notifs      0.808  notifications, especially online
61. S-shaped Curves
In the time charts for many of the rising and emerging
words we see clear s-curves or what look like the start
of s-curves.
62. S-shaped Curves
Similar results have also been found repeatedly in
sociolinguistic apparent time studies (see Labov, 2001),
as well as in corpus-based research in historical
linguistics (e.g. Nevalainen & Raumolin-Brunberg, 2003).
Similar results have also been obtained in research on
the diffusion of innovations (see Rogers, 2003), where it
is referred to as an S-shaped Curve of Diffusion.
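One common way to describe an S-shaped curve is the logistic function. The sketch below is only an illustration of that shape: the talk does not say the curves were fitted with a logistic model, and the parameter values (ceiling, rate, midpoint) are invented for the example.

```python
import math


def logistic(t, ceiling, rate, midpoint):
    """Usage level at time t under a logistic (S-shaped) growth model."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


# Hypothetical parameters: usage levels off at 10 occurrences per million
# words, with the fastest growth in month 6 of a 13-point series.
months = range(0, 13)
curve = [logistic(m, ceiling=10.0, rate=1.0, midpoint=6.0) for m in months]

# Month-on-month increase: small at first, largest around the midpoint,
# then small again as the word approaches its ceiling.
increments = [b - a for a, b in zip(curve, curve[1:])]

for m, y in zip(months, curve):
    print(f"month {m:2d}: {y:5.2f} per million " + "#" * round(y * 3))
```

A word caught early in its rise would show only the lower-left part of this curve, which matches the "start of s-curves" seen in some of the time charts.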
65. Summary: Time Patterns
New words rise (and fall) very quickly in Modern
English, with numerous new words entering the
language and quickly rising in usage every year.
The usage of emerging words over time tends to follow
an s-shaped curve, echoing results found in
sociolinguistic apparent time studies and diffusion of
innovation research.
66. Goals of Today’s Talk
Identify emerging words from 2014 based on a
multi-billion word corpus of American tweets.
Chart their usage over time and identify common
temporal patterns of lexical spread.
Map their geographical diffusion and identify common
spatial patterns of lexical spread.
67. Mapping the Spread of New Words
An important technical problem is how to map the
spread of a new word across a region.
One approach is to map the relative frequency (e.g.
occurrences per million words) of the word across a
series of regional corpora (e.g. all the tweets from a
particular county) over a series of time points.
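As a rough sketch of this approach, the function below computes occurrences per million words for each county and time period. The input format (tuples of county FIPS code, period label and tweet text), the function name, and the naive whitespace tokenisation are assumptions made for illustration.

```python
from collections import defaultdict


def relative_frequency_by_county(rows, word, per=1_000_000):
    """Relative frequency of `word` for each (county, period) sub-corpus.

    rows: iterable of (fips, period, tweet_text) tuples, a hypothetical
    simplification of the corpus format shown earlier.
    Returns {(fips, period): occurrences of `word` per `per` words}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for fips, period, text in rows:
        tokens = text.lower().split()  # assumption: whitespace tokenisation
        key = (fips, period)
        totals[key] += len(tokens)
        hits[key] += tokens.count(word)
    return {key: hits[key] / totals[key] * per
            for key in totals if totals[key] > 0}
```

Each (county, period) value can then be drawn on one map per time point, giving the series of maps that follows.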
81. Geographical Diffusion of Linguistic Forms
Two major theories have been proposed to explain how
new linguistic forms generally spread in language:
The Wave Model states that new forms spread out
radially from their source.
The Gravity Model states that new forms spread out
from one urban area to the next, based on distance
and population size, only later filling in less
populated areas in between.
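The difference between the two models can be made concrete with a toy calculation. The sketch below assumes a simple gravity score (product of populations divided by squared distance), which is one common textbook formulation and not necessarily the one intended in the talk; the place names, populations and distances are invented.

```python
def wave_order(source, places):
    """Wave model: places adopt the form in order of distance from the source."""
    others = [name for name in places if name != source]
    return sorted(others, key=lambda name: places[name]["dist"])


def gravity_order(source, places):
    """Gravity model: places adopt in order of attraction to the source.

    Assumed score: (population of source * population of place) / distance**2.
    """
    source_pop = places[source]["pop"]
    others = [name for name in places if name != source]

    def score(name):
        place = places[name]
        return source_pop * place["pop"] / place["dist"] ** 2

    return sorted(others, key=score, reverse=True)


# Hypothetical places: "dist" is the distance in km from the source city.
places = {
    "SourceCity": {"pop": 1_000_000, "dist": 0},
    "NearbyVillage": {"pop": 2_000, "dist": 10},
    "DistantCity": {"pop": 500_000, "dist": 100},
}

print(wave_order("SourceCity", places))     # nearest place first
print(gravity_order("SourceCity", places))  # large city first despite distance
```

With these numbers the wave model predicts the nearby village adopts first, while the gravity model predicts the form jumps to the distant city before filling in the village.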
82. Assessing the Wave and Gravity Models
We can begin to assess the validity of the wave and
gravity models for lexical spread by comparing their
predictions with the spread of unbothered.
This analysis can be facilitated by focusing on one state
where the form eventually becomes relatively common,
for example Georgia.
97. Assessing the Wave and Gravity Models
The geographical spread of unbothered in Georgia
appears to be more complex than predicted by the
Wave or Gravity Model, although both appear to offer a
partial explanation for this pattern of spread.
The percentage of African Americans, however, also
appears to be an important predictor.
101. Mapping the Spread of New Words on One Map
Presenting a time series of maps is an effective way to
map lexical spread, but another technical issue is how
to map emerging words on one map:
Relative frequency
Date of first (or second...) occurrence
Number of words until first (or second...) occurrence
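The second and third options can be computed in a single pass over the corpus. The sketch below is one possible reading of those measures, assuming rows of (county FIPS code, ISO date string, tweet text) already sorted chronologically; the function name, the input format, and the decision to count words up to and including the tweet containing the occurrence are all assumptions.

```python
from collections import defaultdict


def first_occurrence_by_county(rows, word, nth=1):
    """Date of, and number of words until, the nth occurrence per county.

    rows: iterable of (fips, date, tweet_text) in chronological order
    (a hypothetical simplification of the corpus format).
    Returns {fips: (date_of_nth_occurrence, words_seen_so_far)}; counties
    where the word never reaches its nth occurrence are left out.
    """
    words_seen = defaultdict(int)
    hits = defaultdict(int)
    result = {}
    for fips, date, text in rows:
        if fips in result:
            continue  # this county has already reached its nth occurrence
        tokens = text.lower().split()  # assumption: whitespace tokenisation
        words_seen[fips] += len(tokens)
        hits[fips] += tokens.count(word)
        if hits[fips] >= nth:
            result[fips] = (date, words_seen[fips])
    return result
```

Using the second rather than the first occurrence (nth=2) is one way to reduce the influence of a single stray tweet on the map.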
108. Top 10 Emerging Words on Twitter 2014
Word        ρ      Definition
unbothered  0.926  Unconcerned & Disengaged
gmfu        0.924  Get Me Fucked Up
joggers     0.908  Jogging pants
fuckboys    0.902  Losers, wimps, posers, etc.
rekt        0.900  Wrecked
tfw         0.879  That feel when
xans        0.878  Benzodiazepine pills
baeless     0.875  To be without a bae
boolin      0.857  Hanging out, esp. young men
lordt       0.854  Lord, as exclamation
118. Top 11-20 Emerging Words on Twitter 2014
Word        ρ      Definition
celfie      0.852  selfie
slays       0.843  impresses, succeeds at, etc.
famo        0.840  family and friends
fuckboi     0.838  fuckboy
(on) fleek  0.838  on point, esp. eyebrows
faved       0.836  to favorite something
gainz       0.828  earnings
bruuh       0.817  bro
amirite     0.816  am I right
notifs      0.808  notifications, especially online
120. Summary: Regional Patterns
New words originate from across the US, including the
Southeast (e.g. Unbothered, Baeless, Boolin), the North
(e.g. Fuckboy, Gainz), and the West (e.g. Rekt), and
tend to spread within these regions first.
Otherwise, the spread of new words appears to be highly
complex, affected by numerous factors, including
proximity, population density, and demographic patterns.
121. Traditional Approaches to Historical Linguistics
The empirical analysis of language change is generally
based on historical corpora, which tend to span
centuries, or collections of linguistic interviews, which
tend to span generations (i.e. based on apparent time).
Both sources of data tend to provide a broad temporal
scope but limited temporal resolution and amounts of
data (<1 million words).
122. The Uniformitarian Principle
“Knowledge of processes that operated in the past can
be inferred by observing ongoing processes in the
present” (Christy, 1983: ix).
This Uniformitarian Principle is cited in Labov (2001) to
justify the use of apparent time interview data in place of
historical corpora, but it also justifies the use of
extremely large and dense contemporary corpora in
place of both of these more common approaches.
123. A Modern Approach to Historical Linguistics
Analysing modern language data mined from online
sources allows for unprecedentedly large, rich and
dense natural language corpora to be compiled.
Although historical scope is lost, this approach allows for
language change to be analysed in far greater detail
than would otherwise be possible.
124. Tracking the Emergence of
New Words across Time and Space
Jack Grieve
Centre for Forensic Linguistics
Aston University
Email: j.grieve1@aston.ac.uk
Website: https://sites.google.com/site/jackgrieveaston
Twitter: @JWGrieve