Does Diction Matter?
To what extent is music genre determined by lyrics alone?
George Nisbet
School of Computing
University of Kent
Canterbury, Kent
gfn2@kent.ac.uk
Abstract—This project investigates the relationship between
music lyrics and genre, in order to determine if the genre of
a track can be successfully classified based solely on the
words of the song. In order to test this link, we first gather
genre and lyrics data in order to find the frequency of words
within tracks. This data is then fed into a C5.0 decision tree
generator in order to build a model to predict the genre of
tracks. We then measure the effectiveness of the model by
feeding it a test data set, and comparing the actual genres
against the predicted ones. The model also sheds light on
which genres appear to be easier to classify based on varying
numbers of test cases. The findings of this research will be
useful to those investigating genre classification through
other means by providing another set of factors to continue
their studies.
Keywords—music, genre, classification, lyrics
I. INTRODUCTION
Sitting in a bar in San Francisco (because no good
technical report ever started with “There I was eating a
bowl of salad”), my friend and I started talking about
Country music. I voiced my rather inebriated opinion that
‘all Country music sounds basically the same,’ to which
my friend replied “Well to be honest, it’s all mostly talking
about the same thing anyway.” The conversation went on
and truthfully I don’t remember much of it.
The following morning, the same friend sent me a
YouTube video [1] showing that many Country songs did,
indeed, have similar themes: guns, alcohol, trucks,
moonlight and a lot of use of the word ‘girl.’
The question then presented itself: “Are groups of
themes like this unique to the Country music genre? Or do
other music genres have their own set of particular themes
that present themselves throughout?”
Sadly there will be no further mention of bars in this
report, but fear not – for this is the epic saga of how we
went about answering these questions, filled with tales of
breakthroughs, setbacks and right at the end, just for the
Hitchhiker’s Guide to the Galaxy fans, an answer which is
actually quite close to 42.
II. BACKGROUND
Though the impetus for this research began in San
Francisco, it quickly became apparent that others may have
had similar conversations. The works of McKay et al. have
investigated the use of lyrics features such as the number
of lines/words in the track, the average syllables per word,
and Flesch-Kincaid grade, to name a few [2]. In addition,
Lidy and Rauber have used rhythm histograms in an
attempt at classification, yielding as high as 82.8% success
[3].
Rhythm Histogram features are a descriptor for the
general rhythmic characteristics of an audio document. In a
RH the magnitudes of each modulation frequency bin for
all the critical bands of the human auditory range are
summed up to form a histogram of “rhythmic energy” per
modulation frequency [3].
The works of Haggblade et al. used Mel Frequency
Cepstral Coefficients (MFCCs) and a variety of machine
learning techniques for classifying genre [4]. MFCCs are a
group of coefficients which make up a representation of
the short-term power spectrum of a sound.
The use of MFCCs in conjunction with neural
networks for classification resulted in 96% accuracy [4],
showing these features to be highly effective for
classification. The objective of this research is to
investigate whether lyrics might also yield such high
effectiveness.
As such, our approach treats each track as a ‘bag of
words’ and attempts to analyse the effectiveness of using
the frequency of words within the lyrics.
In order to be able to build on previous research and
realise this project, a platform was needed that had the
versatility to collect data, text mine that data, and then
classify genres based on that text mining. This led into the
investigation of The R Project.
“R is a language and environment for statistical
computing and graphics” [5]. “R provides a wide variety
of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification,
clustering, …) and graphical techniques, and is highly
extensible” [5].
The range of libraries available made it evident that the
entire project could be accomplished within R, giving
unity of platform and an uncomplicated design process.
III. AIMS
A. To calculate the effectiveness of genre classification
using lyrics
B. To find the terms most responsible for classification
C. To identify if any genres are easier to classify than
others: Are their lyrics more distinct?
IV. DATA COLLECTION
The key to starting this research was acquiring the
necessary lyrics and genre data. Before researching the
mechanism by which the data would be downloaded, it
was first necessary to find APIs which had the data
required.
A. Lyrics
Musixmatch and the Echo Nest were investigated to
no avail: the only available lyrics API from Musixmatch
provided just 30% of the lyrics for each track, and the
Echo Nest proved too complicated to implement in good
time.
From here, Chartlyrics was investigated and was
found to provide a Simple Object Access Protocol
(“SOAP”) API which could get the required data. The
initial approach was to have a single .txt file, containing
lines of comma separated artist and track names, which
would be iterated through to perform a SOAP query for
each track using the API, then get the lyrics data.
On investigation, however, SOAP proved to be too
inefficient to be able to get a data set of sufficient size
within a reasonable timeframe. After these initial
setbacks, it was decided that another approach was
needed.
The HTML pages of the Music Lyrics Database
(“MLDB”) were investigated and found to be highly
structured, showing promise for performing systematic
web scraping in order to get lyrics from the site.
The site is structured such that each artist has their
own page which contains URIs to each of their tracks’
lyrics. The structure of the URIs for the artist pages made
it very easy to iterate through each page. The artist page
URI is in the form:
http://mldb.org/artist-i-eminem.html
Where i is an integer. The string “eminem” on the end
of the URI was of no consequence: it was the value of i
that was the sole determinant for navigation to an artist
page. This made it possible to iterate through values of i
to navigate through the corresponding artists of those
pages:
…
i=54 → Madonna
i=55 → Pink
i=56 → Red Hot Chili Peppers
…
From each of these iterations, the track lyrics could
then be extracted. The values of i to be used were selected
as an arbitrary range to ensure that no human preference
was given to any particular genre: a human user choosing
artists might subconsciously give more preference to
tracks they regularly hear on the radio, for example.
Having found a lyrics data source, it was then
necessary to find a mechanism by which the data could be
imported and made usable in R. After some research, the
RCurl package emerged as the most promising candidate:
it was simple to use with MLDB and scaled much better
than SOAP.
“The [RCurl] package allows one to compose general
HTTP requests and provides convenient functions to fetch
URIs, get & post forms, etc. and process the results
returned by the Web server” [6].
The lyrics were contained within the only <p> HTML
tag on each lyrics page of MLDB, making it possible to
easily extract them by parsing the HTML.
The structure of these pages also allowed extraction of
the artist name and track name, thus a directory of .txt
files could be built in sufficient quantity to attempt
classification.
The extracted lyrics for a track could then be written
to a .txt file so that they could be used in the next phase of
the project. A directory structure called Testbed was built
such that the lyrics files were organised thus:
/Testbed/artistName/trackName.txt
This ability to scrape lyrics for a single song proved
effective enough to expand the process to scrape multiple
tracks from multiple artists by iterating through the artist
pages, as above.
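A minimal sketch of this scraping loop is given below,
assuming the RCurl and XML packages; the XPath expressions
and helper names are illustrative assumptions rather than the
project's exact code:

library(RCurl)
library(XML)

# Sketch of the MLDB scraping loop; XPath expressions are assumptions.
scrapeArtist <- function(i) {
  # Only the value of i matters; MLDB ignores the trailing name slug.
  doc <- htmlParse(getURL(sprintf("http://mldb.org/artist-%d-x.html", i)))
  artist <- xpathSApply(doc, "//h1", xmlValue)[1]
  # Links to the artist's track lyrics pages.
  uris <- xpathSApply(doc, "//a[contains(@href, 'song')]/@href")
  dir.create(file.path("Testbed", artist), recursive = TRUE,
             showWarnings = FALSE)
  for (uri in uris) {
    page <- htmlParse(getURL(paste0("http://mldb.org/", uri)))
    track <- xpathSApply(page, "//h1", xmlValue)[1]
    # The lyrics sit in the only <p> tag on each lyrics page.
    lyrics <- xpathSApply(page, "//p", xmlValue)[1]
    writeLines(lyrics, file.path("Testbed", artist,
                                 paste0(track, ".txt")))
  }
}

for (i in 50:60) scrapeArtist(i)  # an arbitrary range of artist IDs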
A total of 2813 files were created, each containing
lyrics for different tracks.
B. Genre
After having acquired the necessary lyrics data, the
next essential step was to map reliable genre data to the
existing lyrics .txt files.
Although Musixmatch proved unviable for lyrics data,
it did have an acceptable API for getting track genres. The
RCurl library used to get the lyrics data also worked when
using the Musixmatch genre API.
A function was defined to return the genre of an
individual track by creating a Musixmatch API query and
extracting the genre from the returned XML document.
An essential part of this process was ensuring that no
mismatches occurred between the data in the /Testbed
directory and the data returned from the Musixmatch API.
This was achieved by testing the Levenshtein (“LV”)
distance between the track and artist names.
“The Levenshtein algorithm (also called Edit-
Distance) calculates the least number of edit operations
that are necessary to modify one string to obtain another
string” [7]. It calculates the total number of substitutions,
insertions and deletions required to modify one string into
another.
More formally, for a string $u = u_1 u_2 \cdots u_n$:

• “A deletion: the transformation $u_1 \cdots u_n \to u_1 \cdots \widehat{u_i} \cdots u_n$ for some $i$, where the hat denotes omission in the $i$-th spot.

• An insertion: the transformation $u_1 \cdots u_n \to u_1 \cdots u_i \, x \, u_{i+1} \cdots u_n$ for some $i$, and some letter $x$.

• A substitution: the transformation $u_1 \cdots u_i \cdots u_n \to u_1 \cdots u_{i-1} \, x \, u_{i+1} \cdots u_n$ for some $i$ and some letter $x$.” [8]
For example, the LV distance between the strings
“race car” and “racing car” is calculated thus:
TABLE I.
CALCULATING THE LEVENSHTEIN DISTANCE BETWEEN TWO STRINGS

R   A   C   E   -   -       C   A   R
=   =   =   S   I   I   =   =   =   =
R   A   C   I   N   G       C   A   R

where I is an insertion, S a substitution, = no change, and
- a gap opened for an inserted letter.
The sum of the substitutions, insertions, and deletions
results in an LV distance of 3 for these strings.
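R's built-in adist function computes exactly this generalised
Levenshtein distance, so the check needs no external library;
whether or not the project used it, it reproduces the example
above:

# utils::adist returns the Levenshtein distance as a matrix.
adist("race car", "racing car")
     [,1]
[1,]    3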
By testing the track and artist names from the genre
query and the existing lyrics .txt files in this way, it was
possible to identify files whose names did not
synchronise. If the LV distance was larger than 3, then the
user would be required to intervene and assert whether the
track and artist names between the two sources were the
same:
[1] Artist from XML: Kanye West
[1] Artist from file: Kanye West ft.
JayZ
[1] LV Dist: 10
[1] Track name from XML: Otis
[1] Track name from file: Otis
[1] LV Dist: 0
[1] Continue?
>
Any tracks whose names did not match were set aside
for removal during data cleanup.
For the tracks which passed the LV condition, the
primary and secondary genres were extracted from the
XML and appended onto the end of each file, separated by
dashes (-):
/Testbed/artistName/trackName-genre1-genre2-genre3.txt
By naming the files in this way instead of including
the genre within the body of the files, the genre data
would not need to be removed from the body of the text
before text mining. It also meant that by splitting the
string about the dash character, the track name would
always be the first item in the list, and the most notable
genre would always be the second:
strsplit("trackName-genre1-genre2-
genre3", split="-")[[1]][1]
[1] "trackName"
strsplit("trackName-genre1-genre2-
genre3", split="-")[[1]][2]
[1] "genre1"
On a number of occasions, the MusixMatch API could
not find a match between the artist and track names from
the .txt file names, likely due to records not existing
within the API. These .txt files without genres were
removed from the Testbed directory as they could not be
used for training or testing.
V. TEXT MINING
The next phase of the project was to build a document
term matrix for the input data. “The term-document
matrix [transposed version of a document term matrix]
is a large grid representing every document and content
word in a collection [9].” Each row corresponds to a
document, and each column corresponds to a term
found within the collection. The matrix is populated
with how many times each term appears in each
document, thus:
S1 = “Because you’re so smooth”
S2 = “A smooth criminal”
TABLE II.
THE DOCUMENT TERM MATRIX OF TWO STRINGS, S1 AND S2

      a   because   criminal   smooth   so   you're
S1    0   1         0          1        1    1
S2    1   0         1          1        0    0
Building a DTM was achieved with an R library called
‘tm,’ for which “The main structure for managing
documents … is a Corpus, representing a collection of
text documents” [10].
Since the objective of the project is to find words
specifically related to certain genres, it can be said that the
most frequent words in the English language – ‘the,’ ‘and’
etc – may be removed without damaging our ability to
classify genres. These common words in English (called
stop words) were removed from documents before any
attempt at classification.
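A minimal sketch of this preparation with the tm package
follows; the exact chain of transformations beyond stop-word
removal is an illustrative assumption:

library(tm)

# Build a corpus from the Testbed lyrics files, normalise the text,
# strip English stop words, then build the document term matrix.
corpus <- VCorpus(DirSource("Testbed", recursive = TRUE))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)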
VI. DOCUMENT TERM MATRIX CLEANUP
One of the major issues discovered when examining a
complete list of terms from the documents in the corpus
was that of spelling errors and synonyms: the tm library
provided no way of determining if words were similar in
meaning.
For example, if the corpus contained the words ‘yeah,’
‘yeahh,’ and ‘yeahhh’ which could be said to mean the
same thing, they would be classed as separate terms
within the matrix, thus affecting the reliability of any
predictive model we would endeavour to build.
As such, it became necessary to group similar terms
into a single column of the matrix, instead of having
multiple terms which map to the same meaning.
The first phase of this was the discovery of similar
terms. By creating a comma separated list containing the
terms deemed similar to each other, further action could
be taken to group these terms together.
Since all the terms in the columns of the matrix were
in alphabetical order, it was a case of determining whether
the LV distance between two terms, i and (i+1), was at most
2. If this was the case, then the same check was carried out
for the terms i and (i+2), i and (i+3), and so forth. This
worked because terms containing spelling errors were likely
to be adjacent in the list.

When the check no longer held, the terms i to (i+n), all
within an LV distance of 2 of term i, would be recorded as a
line in a comma separated file: thesaurus.txt
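A sketch of this discovery pass over the DTM's (already
sorted) terms is shown below; the grouping details are a
reconstruction of the description above, not the project's
exact code:

# Group alphabetically adjacent terms whose LV distance is at most 2.
terms <- Terms(dtm)   # tm's accessor for the DTM's terms
groups <- list()
i <- 1
while (i <= length(terms)) {
  n <- 0
  while (i + n + 1 <= length(terms) &&
         adist(terms[i], terms[i + n + 1])[1, 1] <= 2) {
    n <- n + 1
  }
  if (n > 0) groups[[length(groups) + 1]] <- terms[i:(i + n)]
  i <- i + n + 1
}
# Record each candidate group as a comma separated line.
writeLines(sapply(groups, paste, collapse = ","), "thesaurus.txt")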
The second phase was to find which terms actually
had the same meaning after they had been ‘discovered.’
This was done by inspecting thesaurus.txt and manually
copying the groups of terms that genuinely shared a
meaning into a new file, similarTerms.txt, so that a new
function could act on this insight by modifying the
document term matrix.
This function would merge a collection of columns
together based on their column names. The function read
a line of the similarTerms.txt file and generated a list of
unique terms in that line. It then summed the contents of
every column whose name matched one of those terms,
and replaced the contents of the first matching column
with this newly summed column. The remaining columns
were then set to NULL, and the function moved on to the
next line of the file to repeat the merging operation:
TABLE III.
SUMMATION AND MERGING OF COLUMNS WITH SIMILAR MEANING

…   Yeah   Yeahh   Yeahhh   …         …   Yeah   …
…   1      0       3        …    →    …   4      …
…   3      4       0        …         …   7      …
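A sketch of that merging function, treating the DTM as a
data frame, is given below; mergeSimilarTerms and its
internals are an illustrative reconstruction:

# Merge the columns named in each line of similarTerms.txt into one.
mergeSimilarTerms <- function(df, similarFile = "similarTerms.txt") {
  for (line in readLines(similarFile)) {
    terms <- unique(strsplit(line, split = ",")[[1]])
    matches <- which(colnames(df) %in% terms)
    if (length(matches) < 2) next
    # Sum the matching columns into the first one...
    df[, matches[1]] <- rowSums(df[, matches, drop = FALSE])
    # ...and drop the remainder.
    df <- df[, -matches[-1], drop = FALSE]
  }
  df
}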
VII. TRAINING
On investigation, R had a suitable library for
implementing a C5.0 decision tree. This approach was
selected for classification because the C50 library provides
a function that takes a slightly modified document term
matrix as an argument and returns a predictive model for
the cases within that data frame:

$\mathrm{C5.0}(R) = M$

where $R$ is the training data and $M$ is the model.
This meant that very little data transformation was
needed in order to be able to begin classifying, expediting
the timeframe of the project.
“In C5.0, the application's data file provides
information on the training cases from which C5.0 will
extract patterns” [11]. The data frame is the R C50
library’s equivalent of the data file used to provide data in
the command line C5.0 library.
As previously discussed, each row of the DTM
indicates the frequency of terms within the lyrics of a
single track. The DTM was therefore already in the format
required by the R C50 package, with one exception: the
target vector.

The target vector is a vector of strings (a factor, in R
terms), corresponding to the expected output of the
decision tree. In the case of this project, this is the genre of
each song, extracted from the file name of each track.
The genre data had previously been appended to the
file names of the lyrics .txt files which made it possible to
construct the target vector by extracting the first genre
listed on the end of the file name.
By appending this vector as the last column of our
DTM, the complete data frame required for the C50
library was now created:
TABLE IV.
APPENDING THE TARGET VECTOR TO A DOCUMENT TERM MATRIX

      …   criminal   smooth   so   you're   target
S1    …   0          1        1    1        Latin Rock
S2    …   1          1        0    0        Pop
From here, a sample of 90% of tracks in the DTM was
used to train the model, with the remainder reserved for
testing.
The training data and training target vectors were fed
into the C5.0() function to train the model:
model <- C5.0(trainD, trainT, trials = 3)
The trials parameter is related to the adaptive
boosting functionality of C5.0. “The idea is to generate
several classifiers (either decision trees or rulesets) rather
than just one. When a new case is to be classified, each
classifier votes for its predicted class and the votes are
counted to determine the final class” [11]. This process is
designed to improve the accuracy of the predictive model.
VIII. TESTING
After building the predictive model, the final part of
the process was to have the model predict the genres of a
sample of test tracks using the function call:
p <- C50::predict.C5.0(model, newdata = testD)
This call returned a list of predicted genres which
would then be compared side-by-side to the genres
collected from MusixMatch during the data collection
phase. The ‘result’ of the project is the percentage of the
predicted genres which match their tracks’ actual genres,
thus:
Figure 1. Comparing actual genres side by side to predicted genres.

$$\mathrm{result} = \frac{\#\{\, t \in T : a_t = p_t \,\}}{\#T} \times 100\%$$

where $T$ is the set of all test cases, $a_t$ is the actual
genre of test case $t$, and $p_t$ is its predicted genre,
using $\#$ for the number of elements of a set.
The model was trained and subsequently tested over
10 iterations to ensure reliability.
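A sketch of one such run, reusing the illustrative names from
the earlier sketches; the result is simply the percentage of
matches:

library(C50)

runOnce <- function(df) {
  idx <- sample(nrow(df), size = round(0.9 * nrow(df)))
  model <- C5.0(df[idx, -ncol(df)], factor(df[idx, ncol(df)]),
                trials = 3)
  p <- predict(model, newdata = df[-idx, -ncol(df)])
  # Percentage of predicted genres matching the actual genres.
  mean(as.character(p) == df[-idx, ncol(df)]) * 100
}

results <- replicate(10, runOnce(df))
mean(results)   # average success rate over the 10 runs
sd(results)     # and its standard deviation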
IX. ANALYSIS
A. Assumptions
1. The terms collected from the data are considered
to be lyrics, not metadata that may be contained
within the music.
2. The terms collected are not confined to the
English language.
3. The data set is large enough to enable effective
classification.
4. No spelling errors occur in the data.
B. Calculating the effectiveness of genre classification
Referring back to the project aims, the objective of
finding the percentage of correctly classified tracks was
easily met: after 10 successive training and testing cycles
(referred to as runs, each using different samples of the
same data), the average percentage of correctly classified
tracks was 40.7%, with a standard deviation of 0.03. This
was achieved using a pool of 26 genres across 2813 tracks,
the genre distribution of which is as follows:
Figure 2. Breakdown of cases by genre for training and
testing.
C. Finding the terms most responsible for classification
Finding the terms most responsible for genre required
some additional analysis: Each run provided the ‘attribute
usage’ for all terms over 1%, an example of which is
below:
Attribute usage:
100.00% yankee
99.29% pero
98.89% harris
97.71% dutty
97.23% ludacris
95.02% eminem
93.24% kanye
Figure 3. Attribute usage from the summary of the C5.0 model.
n.b. From here, the words ‘term’ and ‘attribute’ are
used interchangeably.
“The figure before each attribute is the percentage of
training cases … for which the value of that attribute is
known and is used in predicting a class” [11]. The term at
the root of the decision tree will therefore have an
attribute usage of 100%, as every attempt at classification
must start at the root node of the tree.
For the purposes of finding the most relevant terms, it
was decided to pay particular attention to the terms with
attribute usages of 90% and over.
These terms did not necessarily have an attribute
usage of over 90% in every single run; indeed, for the
majority of terms they did not. Two factors were therefore
considered for each individual term: the number of runs in
which the term scored over 90%, and its average attribute
usage over the runs in which it did. Multiplying these two
factors gave a ‘mark’ of the term's relevance over all of
the runs.
TABLE V.
MOST INFLUENTIAL TERMS ON GENRE CLASSIFICATION

Term       Occurrences      Average            Mark
           out of 10 runs   attribute usage
harris     10               98.8               987.95
nigga      10               95.2               952.06
eminem     9                92.4               832.04
dutty      7                97.6               683.21
yankee     6                100.0              600.00
que        4                99.3               397.35
ludacris   4                97.1               388.54
kanye      4                93.2               372.74
soy        3                99.8               299.41
pero       2                99.2               198.38
por        2                99.2               198.38
quien      1                100.0              100.00
como       1                99.3               99.25
inna       1                97.6               97.63
waan       1                97.6               97.55
The maximum possible mark for this system of grading is
1000: a term present in all 10 runs, with 100% average
attribute usage in every run.
D. Identifying the most distinguishable genres
Lastly, it was possible to infer the most
distinguishable genres based on their percentage error and
number of training examples.
The genres ‘Latin,’ ‘Latin Urban,’ and ‘Reggae’
showed noticeably low error rates, even though these
genres had only a fraction of the training examples
available to many of the other genres.
One could infer from this that these genres are the
most distinctive – and the presence of Spanish words in
Table V supports this assertion.
Figures 4-7 inclusive, below, show the proportion of
error incurred during testing for each run across these
three genres, with a graph for Pop for comparison.
Figure 4. Note the low number of training cases and proportionately low error.
Figure 5. Note the low number of training cases and proportionately low error.
Figure 6. Note the low number of training cases and proportionately low error.
Figure 7. Now note the high number of training cases and proportionately high error.
Figure 8. Percentage error when classifying genres against number of training cases.
In Figure 8, the dots in black are the genres represented in
the ‘Incurred Error’ graphs, Figures 4-7. One can see that
despite Pop having many times more training data, its
error rate is still higher than that of Latin, Latin Urban,
and Reggae.
X. CONCLUSION
The project achieved its initial aims: a percentage
value of success was calculated; analysis of the model
summary showed the terms most responsible for the
classifications; and the most distinctive genres were
identified.
One of the outcomes of this project is the significance
of the difference between lyrics and regular text. A news
article, for example, would almost always use the word
‘going’ as opposed to ‘goin’.’ In lyrics, however, the
differences between ‘going’ and ‘goin’’ may well be
significant: the Hip Hop or Rap genres may use the
apostrophised version, while more conservative genres
might retain the full spelling. Although the words mean
the same thing, the classification of their genres may well
depend on this subtle difference in spelling.
The main conclusion we can draw from the results
and analysis is that this approach performs moderately
well when classifying across 26 genres: 41% accuracy.
The works of McKay et al. achieved as high as 89%
classification success using audio, symbolic, and cultural
feature extraction [2], though their lyrical features alone
managed 43% over only a 10-genre data set. Given that
the data set for this research includes 26 genres, the ‘bag
of words’ approach appears to perform better than the
lyrics features used by McKay et al.
It would appear then, based on these combined
findings, that the ability of lyrics alone to classify genre
isn’t sufficient – other features must be considered.
This research has yielded interesting results
surrounding the terms most responsible for
classification. The most noticeable word among these is
‘nigga,’ which is widely considered more socially
acceptable in some genres than in others: it would be
highly unlikely to appear in Country or Heavy Metal
track lyrics.
Finally, one can conclude that Reggae, Latin, and
Latin Urban are the most distinguishable genres when
classifying this data set, due to their low error rate by
comparison to their low number of training examples.
XI. FURTHER WORK
This project amounts to a preliminary investigation
into genre classification, and several other avenues for
further investigation have been identified:
1. Use of alternative classification techniques:
C5.0 fares moderately well with the data
provided. Other mechanisms might include
k-means cluster analysis, which has the
advantage that the number of clusters to use is
already known: the number of unique genres in
the data set (a sketch follows this list).
2. Part of Speech analysis: This research does not
take advantage of examining similarities in
grammatical properties which may prove an
effective addition to the ‘bag of words’ analysis
of this project.
3. N-gram analysis: Further analysis of groups of
words would be an invaluable addition to the
work done in this project, particularly as
themes in lyrics often fall across multiple
words: “You know what I mean,” for example.
4. Combining lyrics with audio and musical
features. This would be a case of adding
additional columns to the end of the document
term matrix to allow for other features, such as
beats per minute, words per minute etc.
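A minimal sketch of the k-means alternative from point 1, on
the same DTM; this is illustrative only, since k-means is
unsupervised and the resulting clusters would still need to be
mapped onto genres before accuracy could be measured:

# k-means on the term-frequency columns, one cluster per unique genre.
m <- as.matrix(df[, -ncol(df)])
genres <- df[, ncol(df)]
k <- length(unique(genres))
clusters <- kmeans(m, centers = k, nstart = 5)
table(clusters$cluster, genres)   # inspect cluster/genre overlap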
XII. ACKNOWLEDGEMENTS
I would like to thank the staff of the Computer
Science department at the University of Kent whose
tuition and support has led me through university and
culminated in this research. I’d like to show my
particular gratitude to Dr. Colin Johnson, whose
suggestions, guidance, and support were pivotal in the
success of this research, and Dr. John Crawford, whose
tutoring and mentoring have been a great source of
support throughout my degree.
I would also like to thank Nagendra Prahalad of
Cisco Systems, Inc, who once said to me “Once you
start getting your head around data mining, you’ll just
want to find patterns in everything.” This research has
been the embodiment of that mentoring.
In addition, I would like to thank the people behind
the R Project and those who built the packages and
libraries used in this project.
And finally to Alex Holden of the University of
Kent and Cisco Systems, Inc.: had we not had that
drunken conversation in San Francisco, we would not
have this research.
XIII. REFERENCES

[1] G. Smith, Why Country Music Was Awful in 2013, [Online video], 2013.
[2] C. McKay et al., “Evaluating the Genre Classification Performance of Lyrical Features Relative to Audio, Symbolic and Cultural Features,” 2010.
[3] T. Lidy and A. Rauber, “Evaluation of Feature Extractors and Psycho-Acoustic Transformations for Music Genre Classification,” 2005.
[4] M. Haggblade et al., “Music Genre Classification,” Stanford, 2011.
[5] The R Foundation, “What is R?,” [Online]. Available: http://www.r-project.org/about.html.
[6] D. Temple Lang, RCurl package reference manual, 27 Jan. 2015. [Online]. Available: http://cran.r-project.org/web/packages/RCurl/RCurl.pdf.
[7] “The Levenshtein-Algorithm: How Levenshtein works,” [Online]. Available: http://www.levenshtein.net/. [Accessed 5 Mar. 2015].
[8] J. Kun, “The Blessing of Distance,” 26 Aug. 2012. [Online]. Available: http://jeremykun.com/tag/levenshtein-distance/.
[9] M. Ceglowski, “The Term-Document Matrix,” SEOBook, 2002. [Online]. Available: http://www.seobook.com/lsi/tdm.htm.
[10] I. Feinerer, “Introduction to the tm Package,” 10 June 2014. [Online]. Available: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf. [Accessed 9 Mar. 2015].
[11] RuleQuest Research, “C5.0: An Informal Tutorial,” 2013. [Online]. Available: http://www.rulequest.com/see5-unix.html.