The document discusses predicting personality traits from Facebook statuses using machine learning techniques. It presents the goals of applying stylometry and natural language processing to a dataset of Facebook statuses labeled with Big Five personality traits. The methodology uses supervised machine learning algorithms like Naive Bayes and k-NN with features extracted from the statuses like word counts and part-of-speech tags. Baseline models using only the status text or derived features performed poorly. Combining status text with other features in a pipeline improved performance slightly but results were still limited by hardware and computational constraints.
2. Agenda
1. Goals
2. Stylometry and its use-cases
3. Predicting Big 5 Personality traits
1. Split the dataset
2. Train and test statistical models
3. Evaluate the performance & show final results
4. Summary
3. Goals
• Getting into the field of stylometry & natural language processing
• Conducting various data experiments on the Facebook dataset
Non-Goals
•Achieving better results than existing studies
4. Stylometry
• Emerged in the second half of the 19th century
• Wincenty Lutosławski coined the term in 1897
• Def.: “the statistical analysis of literary style” dealing with “the study of
individual or group characteristics in written language” (e.g. sentence
length) Holmes & Kardos (2003), Knight (1993)
• Applied to authorship attribution & profiling, plagiarism detection, etc.
5. Examples
• Authorship Identification in Greek Tweets
• Modern Greek Twitter corpus consisting of 12,973 tweets retrieved from 10 popular Greek users
• Character and word n-grams
• Forensic Stylometry for Anonymous Emails
• Frequent pattern technique
• Company email dataset containing 200,399 real-life emails from 158 employees
• Dream of the Red Chamber (1759) by Cao Xueqin
• Initially circulated as 80 hand-written chapters
• Cheng-Gao’s first printed edition added 40 further chapters
• The “chrono-divide” between the two parts was most recently confirmed via SVC-RFE with 10-50 features*
* Hu et al. (2014)
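The SVC-RFE technique used by Hu et al. (2014) can be sketched as follows. The data here is synthetic; only the method itself (a linear SVC combined with recursive feature elimination) comes from the cited study:

```python
# Sketch of SVC-RFE feature selection: recursively drop the weakest
# features (smallest SVC coefficients) until a target count remains.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))      # 120 "chapters" x 50 stylometric features
y = np.array([0] * 80 + [1] * 40)   # first 80 chapters vs. 40 added ones

selector = RFE(LinearSVC(dual=False, max_iter=5000), n_features_to_select=10)
selector.fit(X, y)
print(selector.support_.sum())      # -> 10 features kept
```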
6. Supervised Machine Learning (S-ML)
• Dataset from MyPersonality.org project:
• 9,917 Facebook status updates from 250 users
• Statuses have – for our purposes – not been pre-processed (e.g. tokens like “OMG” remain)
• Statuses are classified into binary Big Five personality traits (Extraversion,
Agreeableness, Neuroticism, Openness to experience, Conscientiousness)
• S-ML (vs. Unsupervised ML)
• Dataset contains many input & (desired) output variables
• S-ML learns by examples and, after several iterations, is able to classify new inputs
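A minimal sketch of this learn-by-examples idea, with invented toy statuses and a binary label in the style of cEXT (not the project’s actual data):

```python
# Fit a Naive Bayes classifier on labelled statuses, then classify an
# unseen status. Texts and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

statuses = ["I love parties and meeting new people",
            "great night out with everyone",
            "I prefer to stay home and read quietly",
            "a calm evening alone with a book"]
labels = ["y", "y", "n", "n"]   # toy extraversion-style label

vec = CountVectorizer()
X = vec.fit_transform(statuses)        # word counts as features
clf = MultinomialNB().fit(X, labels)   # learn by examples

# classify an unseen input
print(clf.predict(vec.transform(["meeting new people tonight"]))[0])  # -> "y"
```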
7. Methodology & Tools
• Tools: NLTK, scikit-learn, jupyter-notebooks, Python3, (R), GitHub* etc.
• Methodology of S-ML:
• Prepare data: extract relevant stylometric (NLP) features
• Split the dataset into training & testing sets
• Train the model on the training set (learn by examples)
• Test the model on the ‘unseen’ testing set (classify)
• Validate the performance of the model (evaluate)
*https://github.com/dmpe/CaseSolvingSeminar/
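The methodology steps above can be sketched end-to-end; the data, features and model here are placeholders, not the project’s actual setup:

```python
# Extract features -> split -> train -> test -> evaluate, on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["omg best day ever", "love this sunny weather",
         "so happy right now", "party tonight with friends",
         "feeling sad and alone", "worst day ever",
         "everything is going wrong", "tired of all this stress"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]      # invented binary trait label

# 1) extract features
X = CountVectorizer().fit_transform(texts)
# 2) split into training & testing sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, train_size=0.5, stratify=labels, random_state=0)
# 3) train (learn by examples)
clf = MultinomialNB().fit(X_tr, y_tr)
# 4) classify the 'unseen' set
pred = clf.predict(X_te)
# 5) evaluate
print(accuracy_score(y_te, pred))
```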
8. Extracted features from statuses
• 5 labels from ODS: cEXT, cNEU, cAGR, cOPN, cCON
• Feature from ODS: STATUS
• Extracted features:
• Lexical (6): # words, # functional words, # personal pronouns, lexical diversity [0-1], parts-of-speech tags, bag-of-words (n-grams)
• Character (8): string length, # dots, # commas, # semicolons, # colons, smileys, # *PROPNAME*, average word length
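A sketch of how a few of the listed lexical and character features could be computed for a single status; the exact definitions used in the project may differ (e.g. the smiley pattern below is an assumption):

```python
# Compute a handful of lexical and character features for one status.
import re

status = "OMG I really love this :) Don't you?"
words = re.findall(r"[A-Za-z']+", status)

features = {
    "n_words": len(words),
    "lexical_diversity": len(set(w.lower() for w in words)) / len(words),
    "string_length": len(status),
    "n_dots": status.count("."),
    "n_commas": status.count(","),
    "n_smileys": len(re.findall(r"[:;]-?[)(DP]", status)),  # assumed pattern
    "avg_word_length": sum(len(w) for w in words) / len(words),
}
print(features["n_words"], features["n_smileys"])  # -> 7 1
```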
9. Splitting dataset using stratified k-fold CV
• Create 5 trait datasets based on our labels
• Use stratified k-fold cross-validation to split into the training and testing set
>>> from sklearn.model_selection import train_test_split
>>> train_X, test_X, train_Y, test_Y = train_test_split(
...     agr.iloc[:, 1:9], agr["cAGR"],
...     train_size=0.66, stratify=agr["cAGR"],
...     random_state=5152)
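Since the heading mentions stratified k-fold CV, a sketch of how scikit-learn’s `StratifiedKFold` preserves the class ratio in every fold (labels here are synthetic and deliberately imbalanced):

```python
# Every test fold keeps the overall 25% positive ratio: 1 of 4 samples.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([1] * 5 + [0] * 15)   # imbalanced binary trait label

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=5152)
for train_idx, test_idx in skf.split(X, y):
    print(y[test_idx].sum(), len(test_idx))   # -> 1 4 in each fold
```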
16. Pipeline 3: Mix of STATUS and NON-STATUS cols.
Achieved results with 10-fold CV:

Trait | Accuracy Mean | Accuracy Std. Dev. | Recall | Precision | F1-score | Best Algorithm
NEU   | +0.064        | +/- 0.25           | +0.170 | +0.050    | +0.030   | Linear SVC
OPN   | -0.017        | +/- 0.03           | -0.002 | +/- 0     | -0.004   | Multinomial-NB
AGR   | -0.082        | +/- 0.07           | +0.082 | -0.020    | -0.008   | Linear SVC
EXT   | -0.073        | +/- 0.02           | -0.070 | -0.060    | -0.070   | k-NN
CON   | +0.001        | +/- 0.08           | +0.096 | +0.030    | +0.020   | Linear SVC
17. Results/Summary
• Hardly any improvement from the head-first approach
• At least over the baseline
• Limited:
• strongly by hardware & CPU
• grid_search: Count(Algorithms) * Count(Parameters) * Count(Labels) → rapidly growing effort
• grid_search for 1 label, 3 parameters and LinearSVC took >20 minutes
• Future research: move computation to the GPU (NVIDIA)
• inconsistent data (multiple languages, e.g. Spanish)
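The combinatorial grid_search cost noted above can be illustrated with a small sketch; the data and parameter grid are invented, and only the tool (scikit-learn’s grid search with LinearSVC and 10-fold CV) comes from the text:

```python
# Each parameter combination is refit once per CV fold, so the number of
# fits grows multiplicatively with the grid size and fold count.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

param_grid = {"C": [0.1, 1, 10], "tol": [1e-3, 1e-4]}
gs = GridSearchCV(LinearSVC(max_iter=10000), param_grid, cv=10)
gs.fit(X, y)

# 3 * 2 = 6 candidates, each fitted 10 times -> 60 fits, per label
print(len(gs.cv_results_["params"]))   # -> 6
```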