Customer Linguistic Profiling
Predicting Personality Traits
on Facebook’s statuses
Vishweshwara Keekan
Dmitrij Petrov
Dustin Nguyen
Agenda
1. Goals
2. Stylometry and its use-cases
3. Predicting Big 5 Personality traits
   1. Split the dataset
   2. Train and test statistical models
   3. Evaluate the performance & show final results
4. Summary
Goals
• Getting into the field of stylometry & natural language processing
• Conducting various data experiments on a Facebook dataset
Non-Goals
• Achieving better results than existing studies
Stylometry
• Emerged in the second half of the 19th century
• Wincenty Lutosławski coined the term in 1897
• Def.: “the statistical analysis of literary style” dealing with “the study of individual or group characteristics in written language” (e.g. sentence length); Holmes & Kardos (2003), Knight (1993)
• Applied to authorship attribution & profiling, plagiarism detection etc.
Examples
• Authorship Identification in Greek Tweets
  • Modern Greek Twitter corpus consisting of 12,973 tweets retrieved from 10 popular Greek users
  • Character and word n-grams
• Forensic Stylometry for Anonymous Emails
  • Frequent pattern technique
  • Company email dataset containing 200,399 real-life emails from 158 employees
• Dream of the Red Chamber (1759) by Cao Xueqin
  • At first, 80 hand-written chapters of the novel were in circulation
  • Cheng-Gao’s first printed edition added 40 further chapters
  • The “chrono-divide” between the two parts was most recently demonstrated via SVC-RFE with 10-50 features*
* Hu et al. (2014)
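Character and word n-grams, as used in the Greek tweets example, can be counted with a few lines of standard-library Python. This is a minimal sketch: the whitespace tokenisation and the sample sentence are assumptions made for illustration, not the setup of the cited studies.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams (naive whitespace tokenisation)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Made-up sample sentence, not taken from any corpus
sample = "the cat sat on the mat"
char_counts = char_ngrams(sample, 2)
word_counts = word_ngrams(sample, 2)

print(char_counts.most_common(3))
print(word_counts.most_common(3))
```

The resulting count vectors are the kind of input an authorship classifier would be trained on.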
Supervised Machine Learning (S-ML)
• Dataset from MyPersonality.org project:
  • 9,917 Facebook status updates from 250 users
  • For our purposes, statuses have not been pre-processed (e.g. “OMG” or smileys remained)
  • Statuses are classified into binary Big-Five personality traits (Extroversion, Agreeableness, Neuroticism, Openness to experience, Conscientiousness)
• S-ML (vs. Unsupervised ML)
  • Dataset contains many input & (desired) output variables
  • S-ML learns from examples and, after several iterations, is able to classify a new input
Methodology & Tools
• Tools: NLTK, scikit-learn, Jupyter notebooks, Python 3, (R), GitHub* etc.
• Methodology of S-ML:
  • Extract relevant stylometric (NLP) features → Prepare data
  • Split dataset into training & testing set → Prepare data
  • Train the model on the training set → Learn by examples
  • Test the model on the ‘unseen’ set → Classify
  • Validate the performance of the model → Evaluate
*https://github.com/dmpe/CaseSolvingSeminar/
Extracted features from statuses
• 5 labels from ODS: cNEU, cAGR, cOPN, cCON, cEXT
• Feature from ODS: STATUS
• Extracted ones:
  • Lexical (6): # functional words, lexical diversity [0-1], # words, # personal pronouns, parts-of-speech tags, bag-of-words (n-grams)
  • Character (8): string length, # dots, # commas, smileys, # semicolons, # colons, # *PROPNAME*, average word length
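Several of the lexical and character features listed above can be derived from a raw status string with the standard library alone. The sketch below is illustrative only: the word lists, the smiley pattern and the function name `extract_features` are assumptions, not the code used in the project.

```python
import re
import string

# Tiny illustrative word lists (assumptions); a real analysis would use fuller lists.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "is", "it"}
PERSONAL_PRONOUNS = {"i", "me", "we", "us", "you", "he", "him", "she", "her", "they", "them"}
# Matches simple western smileys such as :) ;-) =D (assumed pattern)
SMILEY_PATTERN = re.compile(r"[:;=]-?[)(DPp]")


def extract_features(status):
    """Return a dict of simple stylometric features for one status string."""
    # Lower-case tokens with surrounding punctuation stripped; empty tokens dropped
    words = [w.strip(string.punctuation).lower() for w in status.split()]
    words = [w for w in words if w]
    n_words = len(words)
    return {
        "string_length": len(status),
        "n_words": n_words,
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_function_words": sum(w in FUNCTION_WORDS for w in words),
        "n_personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        "n_dots": status.count("."),
        "n_commas": status.count(","),
        "n_semicolons": status.count(";"),
        "n_colons": status.count(":"),  # also counts colons inside smileys
        "n_smileys": len(SMILEY_PATTERN.findall(status)),
        "n_propname": status.count("*PROPNAME*"),
    }


# Made-up status in the style of the dataset
features = extract_features("OMG, I love the new phone :) Thanks *PROPNAME*.")
for name, value in features.items():
    print(name, value)
```

Parts-of-speech tags and bag-of-words n-grams are left out here because they need a tagger and a vectorizer respectively.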
Splitting dataset using stratified k-fold CV
• Create 5 trait datasets based on our labels
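• Split each dataset into a stratified training (66 %) and testing set; stratified k-fold cross-validation is applied later, when validating the results
>>> from sklearn.model_selection import train_test_split
>>> # agr is assumed to be a pandas DataFrame: columns 1-8 hold the features, "cAGR" the label
>>> train_X, test_X, train_Y, test_Y = train_test_split(
...     agr.iloc[:, 1:9],
...     agr["cAGR"],
...     train_size=0.66, stratify=agr["cAGR"],
...     random_state=5152)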
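For comparison with the single hold-out split, stratified k-fold cross-validation can be sketched with scikit-learn's `StratifiedKFold`. The toy arrays below are made up for illustration; only the idea that every fold preserves the class ratio carries over to the real trait datasets.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): 10 samples with an imbalanced binary label, 6 x False and 4 x True
X = np.arange(20).reshape(10, 2)
y = np.array([False] * 6 + [True] * 4)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=5152)
folds = list(skf.split(X, y))

for fold_no, (train_idx, test_idx) in enumerate(folds):
    # Each test fold keeps the 6:4 ratio of the full label vector
    n_true = int(y[test_idx].sum())
    print(fold_no, "test size:", len(test_idx), "True labels:", n_true)
```

Every sample lands in exactly one test fold, so the whole dataset is used for evaluation once.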
Classification Metrics -> Confusion Matrix (1)

                      “Gold standard” (real truth values)
Observed              Positive                            Negative
Predicted positive    True Positive (TP)                  False Positive (FP, Type 1 error)    -> Precision
Predicted negative    False Negative (FN, Type 2 error)   True Negative (TN)
                      -> Recall / Sensitivity             -> (Specificity)

$$\text{Accuracy} = \frac{TP + TN}{TN + FP + FN + TP}$$
$$\text{Precision} = \frac{TP}{FP + TP}$$
$$\text{Recall} = \frac{TP}{FN + TP}$$
$$\text{F1-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
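The four formulas can be checked on a small worked example. The counts below are invented for illustration and are not results from the study; the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 100 statuses, 60 of them truly positive
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 70/100, the precision 40/50 and the recall 40/60, which shows how a classifier can be precise while still missing a third of the positives.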
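Learning and predicting
• Head-on approach: Classifiers only
from sklearn.naive_bayes import MultinomialNB
# Assumption: features are numeric values only
classifier = MultinomialNB()
classifier.fit(train_X, train_Y).predict(test_X)  # results in a prediction for test_X
• But: “Status” is a string and not numeric, so it is vectorized first
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
nb_pipeline = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ('nb', MultinomialNB()),
])
predicted = nb_pipeline.fit(train_X, train_Y).predict(test_X)
• Validation of results
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score, f1_score
# 10-fold CV over the full data: training and testing parts are concatenated, not added
scores = cross_val_score(
    nb_pipeline, np.concatenate([train_X, test_X]), np.concatenate([train_Y, test_Y]),
    cv=10, scoring='accuracy'
)
accuracy, std_deviation = scores.mean(), scores.std() * 2  # note: twice the standard deviation
# precision_score gives TP / (TP + FP); average_precision_score would summarize the precision-recall curve instead
precision = precision_score(test_Y, predicted)
recall = recall_score(test_Y, predicted, labels=[False, True])
f1 = f1_score(test_Y, predicted, labels=[False, True])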
Pipeline: Source Code Example
import sklearn.pipeline, sklearn.preprocessing, sklearn.naive_bayes
import sklearn.feature_extraction.text
# Aggregator, LexicalDiversity and NumberOfFunctionalWords are custom classes defined elsewhere, not part of scikit-learn
pipeline = sklearn.pipeline.Pipeline([
    ('features', sklearn.pipeline.FeatureUnion(
        transformer_list=[
            ('status_string', sklearn.pipeline.Pipeline([  # tfidf on status
                ('tf_idf_vect', sklearn.feature_extraction.text.TfidfVectorizer()),
            ])),
            ('derived_numeric', sklearn.pipeline.Pipeline([  # aggregator creates derived values
                ('derived_cols', Aggregator([LexicalDiversity(), NumberOfFunctionalWords()])),
                ('scaler', sklearn.preprocessing.MinMaxScaler()),
            ])),
        ],
    )),
    ('classifier_naive_bayes', sklearn.naive_bayes.MultinomialNB())
])
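The pipeline relies on custom transformer classes that are not shown on the slide. One way such a transformer could be written is sketched below; the classes here are hypothetical stand-ins (including `NumberOfWords`, which replaces the functional-word counter for brevity) and should not be read as the project's actual implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class LexicalDiversity:
    """Hypothetical feature: unique words / total words, in [0, 1]."""

    def __call__(self, status):
        words = status.lower().split()
        return len(set(words)) / len(words) if words else 0.0


class NumberOfWords:
    """Hypothetical feature: number of whitespace-separated tokens."""

    def __call__(self, status):
        return float(len(status.split()))


class Aggregator(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: applies each feature function to every status."""

    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # One row per status, one column per feature function
        return np.array([[f(s) for f in self.features] for s in X], dtype=float)


derived = Pipeline([
    ("derived_cols", Aggregator([LexicalDiversity(), NumberOfWords()])),
    ("scaler", MinMaxScaler()),  # rescale each column to [0, 1]
])

# Made-up statuses
statuses = ["so so tired", "what a great day", "hello"]
matrix = derived.fit_transform(statuses)
print(matrix)
```

Scaling to [0, 1] matters here because `MultinomialNB` expects non-negative inputs, and it keeps the derived columns on a range comparable to the TF-IDF values.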
Parameter fine-tuning
• Most transformers and classifiers accept different parameters
• Parameters can heavily influence the result
from sklearn.model_selection import GridSearchCV
grid_params = {
    'features__status_string__tf_idf_vect__ngram_range': ((1, 1), (1, 2), (2, 3)),
}
grid_search = GridSearchCV(pipeline, param_grid=grid_params, cv=2, n_jobs=-1, verbose=0)
y_pred_trait = grid_search.fit(train_X, train_Y).predict(test_X)
# print best parameters
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(grid_params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Baseline 1: STATUS (TF-IDF) column only
Achieved results with 10-fold CV

Trait dataset   Accuracy (mean)   Accuracy (std. dev.)   Recall   Precision   F1-score   Best algorithm
NEU             0.426             +/- 0.03               0.81     0.63        0.51       k-NN
OPN             0.747             +/- 0.03               0.998    0.87        0.854      Bernoulli-NB
AGR             0.585             +/- 0.03               0.91     0.76        0.698      Bernoulli-NB
EXT             0.600             +/- 0.03               0.45     0.60        0.48       Linear-SVC
CON             0.514             +/- 0.04               0.90     0.70        0.61       k-NN
Baseline 2: Derived columns
Achieved results with 10-fold CV

Trait dataset   Accuracy (mean)   Accuracy (std. dev.)   Recall    Precision   F1-score   Best algorithm
NEU             +0.196            +/- 0.03               -0.794    -0.238      -0.480     Bernoulli-NB
OPN             -0.004            +/- 0.03               +0.002    +0.001      -0.001     Bernoulli-NB
AGR             -0.054            +/- 0.03               -0.773    -0.231      -0.433     SVC
EXT             -0.015            +/- 0.03               -0.273    -0.071      -0.215     Bernoulli-NB
CON             +0.025            +/- 0.04               -0.5      -0.114      -0.167     Bernoulli-NB
Pipeline 3: Mix of STATUS and NON-STATUS cols.
Achieved results with 10-fold CV

Trait dataset   Accuracy (mean)   Accuracy (std. dev.)   Recall    Precision   F1-score   Best algorithm
NEU             +0.064            +/- 0.25               +0.170    +0.050      +0.030     Linear SVC
OPN             -0.017            +/- 0.03               -0.002    +/- 0       -0.004     Multinomial-NB
AGR             -0.082            +/- 0.07               +0.082    -0.020      -0.008     Linear SVC
EXT             -0.073            +/- 0.02               -0.070    -0.060      -0.070     k-NN
CON             +0.001            +/- 0.08               +0.096    +0.030      +0.020     Linear SVC
Results/Summary
• Hardly any improvement over the head-on approach
  • at least relative to the baseline
• Limited:
  • strongly by hardware & CPU
    • grid_search: Count(Algorithms) * Count(Parameters) * Count(Labels) → rapidly growing effort
    • grid_search for 1 label, 3 parameters and LinearSVC took >20 minutes
    • Future research: look into GPU acceleration (NVIDIA)
  • by inconsistent data (multiple languages, e.g. Spanish)
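The growth of the grid-search effort can be made concrete by counting model fits. The sketch below extends the Count(Algorithms) * Count(Parameters) * Count(Labels) estimate with two assumptions: the number of parameter combinations is the product of the number of values per parameter, and every combination is fitted once per cross-validation fold. All numbers and parameter names are illustrative.

```python
def count_fits(n_algorithms, param_grid, n_labels, cv_folds):
    """Estimate how many model fits an exhaustive grid search needs."""
    n_combinations = 1
    for values in param_grid.values():
        n_combinations *= len(values)  # combinations multiply across parameters
    return n_algorithms * n_combinations * n_labels * cv_folds


# Hypothetical grid with three parameters
param_grid = {
    "ngram_range": [(1, 1), (1, 2), (2, 3)],
    "use_idf": [True, False],
    "alpha": [0.1, 0.5, 1.0],
}

# Assumed setup: 5 algorithms, 5 trait labels, 2-fold cross-validation
total = count_fits(n_algorithms=5, param_grid=param_grid, n_labels=5, cv_folds=2)
print("model fits:", total)
```

Adding one more parameter with three values would triple the total, which is why the search quickly becomes CPU-bound.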

Customer Linguistic Profiling

  • 1. Customer Linguistic Profiling Predicting Personality Traits on Facebook’s statuses Vishweshwara Keekan Dmitrij Petrov Dustin Nguyen
  • 2. Agenda 1. Goals 2. Stylometry and its use-cases 3. Predicting Big 5 Personality traits 1. Split the dataset 2. Train and test statistical models 3. Evaluate the performance & show final results 4. Summary
  • 3. Goals •Getting into the field of stylometry & natural language processing •Conducting various data experiments on FB’s dataset Non-Goals •Achieving better results than existing studies
  • 4. Stylometry • Emerged in the second half of the 19th century • Wincenty Lutosławski coined the term in 1897 • Def.: “the statistical analysis of literary style” dealing with “the study of individual or group characteristics in written language” (e.g. sentence length) Holmes & Kardos (2003), Knight (1993) • Applied to authorship attribution & profiling, plagiarism detection etc.
  • 5. Examples • Authorship Identification in Greek Tweets • Modern Greek Twitter corpus consisting of 12,973 tweets retrieved from 10 popular Greek users • Character and word n-grams • Forensic Stylometry for Anonymous Emails • Frequent pattern technique • Company email dataset containing 200,399 real-life emails from 158 employees • Dream of the Red Chamber (1759) by Cao Xueqin • First circulated as 80 hand-written chapters • Cheng-Gao’s first printed edition added 40 further chapters • The “chrono-divide” was most recently demonstrated via SVC-RFE with 10-50 features* * Hu et al. (2014)
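The character and word n-grams mentioned for the Greek tweets study can be illustrated with a minimal sketch. It uses only the Python standard library and naive whitespace tokenisation; the helper names `char_ngrams` and `word_ngrams` and the sample sentence are assumptions made for illustration, not taken from the cited study.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (lower-cased, split on whitespace)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "to be or not to be"  # made-up sample text
char_counts = Counter(char_ngrams(sample, 3))  # character trigram frequencies
word_counts = Counter(word_ngrams(sample, 2))  # word bigram frequencies

print(char_counts.most_common(3))
print(word_counts.most_common(1))
```

Frequency profiles like these are what a classifier would then compare across candidate authors.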
  • 6. Supervised Machine Learning (S-ML) • Dataset from MyPersonality.org project: • 9,917 Facebook status updates from 250 users • Statuses – for our purposes – have not been pre-processed (e.g. “OMG” or  remained) • Statuses are/will be classified into binary Big Five personality traits (Extroversion, Agreeableness, Neuroticism, Openness to experience, Conscientiousness) • S-ML (vs. Unsupervised ML) • Dataset contains many input & (desired) output variables • S-ML learns from examples and after several iterations is able to classify an input
  • 7. Methodology & Tools • Tools: NLTK, scikit-learn, Jupyter notebooks, Python 3, (R), GitHub* etc. • Methodology of S-ML: • Extract relevant stylometric (NLP) features • Split dataset into training & testing set • Train the model on the training set → Learn by examples • Test the model on the ‘unseen’ set → Classify • Validate the performance of the model → Evaluate > Prepare data *https://github.com/dmpe/CaseSolvingSeminar/
  • 8. Extracted features from statuses • 5 labels from ODS: cNEU, cAGR, cOPN, cCON, cEXT • Feature from ODS: STATUS • Extracted ones, lexical (6): # functional words, lexical diversity [0-1], # words, # personal pronouns, parts-of-speech tags, bag-of-words (ngrams) • Extracted ones, character (8): string length, # dots, # commas, smileys, # semicolons, # colons, # *PROPNAME*, average word length
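A few of the listed features can be derived from a raw status string with the standard library alone. The following is a rough sketch, not the authors' extraction code: the two word lists are deliberately tiny, hypothetical stand-ins, and the regular-expression tokeniser is an assumption.

```python
import re

# Hypothetical, deliberately short word lists; a real run would use fuller lists.
FUNCTION_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "on"}
PERSONAL_PRONOUNS = {"i", "me", "you", "he", "him", "she", "her",
                     "it", "we", "us", "they", "them"}


def extract_features(status):
    """Compute a handful of lexical and character features for one status."""
    words = re.findall(r"[A-Za-z']+", status.lower())
    n_words = len(words)
    return {
        "string_length": len(status),
        "n_words": n_words,
        # share of distinct words, in [0, 1]
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_function_words": sum(w in FUNCTION_WORDS for w in words),
        "n_personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        "n_dots": status.count("."),
        "n_commas": status.count(","),
        "n_semicolons": status.count(";"),
        "n_colons": status.count(":"),
    }


features = extract_features("I like the sea, and the sea likes me.")
print(features)
```

Note that the colon and semicolon counts here also pick up characters that belong to text smileys such as ":)", so a separate smiley feature would overlap with them.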
  • 9. Splitting dataset using stratified k-fold CV • Create 5 trait datasets based on our labels • Use stratified k-fold cross-validation to split into the training and testing set
      >>> from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
      >>> # agr is assumed to be a pandas DataFrame, hence .iloc for positional column slicing
      >>> train_X, test_X, train_Y, test_Y = train_test_split(agr.iloc[:, 1:9], agr["cAGR"], train_size=0.66, stratify=agr["cAGR"], random_state=5152)
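The call on this slide performs a single stratified hold-out split: each class keeps roughly the same share in the training and the testing part. What stratification does can be sketched in plain Python as follows. The function `stratified_split` and the toy labels are illustrative assumptions; this is not how scikit-learn implements it internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_size=0.66, seed=5152):
    """Return (train_idx, test_idx) so each class is split with the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class
        cut = round(len(indices) * train_size)    # per-class training share
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 positive and 40 negative examples (made-up data).
labels = ["y"] * 60 + ["n"] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratified k-fold cross-validation repeats the same idea k times, with each fold serving once as the testing part.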
  • 10. Classification Metrics -> Confusion Matrix (1)
      “Gold Standard” (real truth values) in columns, observed predictions in rows:
                            Positive                        Negative
      Predicted Positive    True Positive                   False Positive (Type 1 error)    Precision
      Predicted Negative    False Negative (Type 2 error)   True Negative
                            Recall / Sensitivity            (Specificity)
      Accuracy  = (TP + TN) / (TN + FP + FN + TP)
      Precision = TP / (FP + TP)
      Recall    = TP / (FN + TP)
      F1-score  = 2 * precision * recall / (precision + recall)
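The four formulas can be checked with a small worked example. The sketch below computes them directly from confusion-matrix counts; the counts themselves are made up for illustration, and the function name `classification_metrics` is an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts: 40 TP, 10 FP, 20 FN, 30 TN.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)  # accuracy 0.7, precision 0.8, recall 2/3, F1 8/11
```

The guards against zero denominators matter in practice: a classifier that never predicts the positive class has no defined precision.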
  • 11. Learning and predicting
      • Head-on approach: Classifiers only
          # Assumption: features are numeric values only
          classifier = MultinomialNB()
          classifier.fit(train_X, train_Y).predict(test_X)  # results in a prediction for test_X
      • But: “Status” is a string and not numeric
          nb_pipeline = Pipeline([
              ('vectorizer_tfidf', TfidfVectorizer(ngram_range=(1, 2))),
              ('nb', MultinomialNB())
          ])
          predicted = nb_pipeline.fit(train_X, train_Y).predict(test_X)
      • Validation of results
          from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
          scores = cross_val_score(nb_pipeline, train_X + test_X, train_Y + test_Y, cv=10, scoring='accuracy')
          accuracy, std_deviation = scores.mean(), scores.std() * 2
          # note: average_precision_score summarises the precision-recall curve; precision_score gives plain precision
          precision = average_precision_score(test_Y, predicted)
          recall = recall_score(test_Y, predicted, labels=[False, True])
          f1 = f1_score(test_Y, predicted, labels=[False, True])
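To make the "learn by examples, then classify" step concrete without any library, here is a minimal sketch of a multinomial Naive Bayes classifier over raw word counts with add-one (Laplace) smoothing. It is a simplification of the slide's pipeline: it uses plain counts instead of TF-IDF weights, and the class `TinyMultinomialNB` and the four training statuses are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes on word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def _predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self._predict_one(t) for t in texts]


# Made-up toy statuses with a binary label (e.g. extroverted or not).
train_texts = ["party tonight with friends", "great party fun",
               "reading alone at home", "quiet evening reading"]
train_labels = [True, True, False, False]

model = TinyMultinomialNB().fit(train_texts, train_labels)
predicted = model.predict(["fun party", "quiet reading"])
print(predicted)
```

With real data, the vectoriser and classifier from scikit-learn shown on the slide replace this hand-rolled version; the sketch only exposes the arithmetic behind `fit` and `predict`.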
  • 12. Pipeline: Source Code Example
      pipeline = sklearn.pipeline.Pipeline([
          ('features', sklearn.pipeline.FeatureUnion(
              transformer_list=[
                  ('status_string', sklearn.pipeline.Pipeline([
                      # tfidf on status
                      ('tf_idf_vect', sklearn.feature_extraction.text.TfidfVectorizer()),
                  ])),
                  ('derived_numeric', sklearn.pipeline.Pipeline([
                      # aggregator creates derived values
                      ('derived_cols', Aggregator([LexicalDiversity(), NumberOfFunctionalWords()])),
                      ('scaler', sklearn.preprocessing.MinMaxScaler()),
                  ])),
              ],
          )),
          ('classifier_naive_bayes', sklearn.naive_bayes.MultinomialNB())
      ])
  • 13. Parameter fine-tuning
      • Most transformers and classifiers accept different parameters
      • Parameters can heavily influence the result
          grid_params = {
              'features__status_string__tf_idf_vect__ngram_range': ((1, 1), (1, 2), (2, 3))
          }
          grid_search = GridSearchCV(pipeline, param_grid=grid_params, cv=2, n_jobs=-1, verbose=0)
          y_pred_trait = grid_search.fit(train_X, train_Y).predict(test_X)
          # print best parameters
          best_parameters = grid_search.best_estimator_.get_params()
          for param_name in sorted(grid_params.keys()):
              print("\t%s: %r" % (param_name, best_parameters[param_name]))
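A grid search tries every combination of the listed parameter values, and each combination is fitted once per cross-validation fold. The sketch below enumerates such a grid with `itertools.product` to show how quickly the number of fits grows. The helpers `expand_grid` and `count_fits` are illustrative, and the `alpha` entry is a hypothetical second parameter added here, not one from the slides.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def count_fits(param_grid, cv):
    """Number of cross-validation fits: combinations times folds."""
    return len(expand_grid(param_grid)) * cv


grid = {
    "features__status_string__tf_idf_vect__ngram_range": [(1, 1), (1, 2), (2, 3)],
    # hypothetical extra parameter, added only to show the multiplication
    "classifier_naive_bayes__alpha": [0.1, 1.0],
}

candidates = expand_grid(grid)
print(len(candidates), "combinations,", count_fits(grid, cv=2), "fits")
```

Every additional parameter multiplies the number of combinations, and the whole search is repeated per algorithm and per trait label.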
  • 14. Baseline 1: STATUS (TF-IDF) column only. Achieved results with 10-fold CV:
      Trait Dataset   Accuracy Mean   Accuracy Stand. Dev   Recall   Precision   F1-score   Best Algorithm
      NEU             0.426           +/- 0.03              0.81     0.63        0.51       k-NN
      OPN             0.747           +/- 0.03              0.998    0.87        0.854      Bernoulli-NB
      AGR             0.585           +/- 0.03              0.91     0.76        0.698      Bernoulli-NB
      EXT             0.600           +/- 0.03              0.45     0.60        0.48       Linear-SVC
      CON             0.514           +/- 0.04              0.90     0.70        0.61       k-NN
  • 15. Baseline 2: Derived columns. Achieved results with 10-fold CV:
      Trait Dataset   Accuracy Mean   Accuracy Stand. Dev   Recall    Precision   F1-score   Best Algorithm
      NEU             + 0.196         +/- 0.03              - 0.794   - 0.238     - 0.480    Bernoulli-NB
      OPN             - 0.004         +/- 0.03              + 0.002   + 0.001     - 0.001    Bernoulli-NB
      AGR             - 0.054         +/- 0.03              - 0.773   - 0.231     - 0.433    SVC
      EXT             - 0.015         +/- 0.03              - 0.273   - 0.071     - 0.215    Bernoulli-NB
      CON             + 0.025         +/- 0.04              - 0.5     - 0.114     - 0.167    Bernoulli-NB
  • 16. Pipeline 3: Mix of STATUS and NON-STATUS cols. Achieved results with 10-fold CV:
      Trait Dataset   Accuracy Mean   Accuracy Stand. Dev   Recall    Precision   F1-score   Best Algorithm
      NEU             + 0.064         +/- 0.25              + 0.170   + 0.050     + 0.030    Linear SVC
      OPN             - 0.017         +/- 0.03              - 0.002   +/- 0       - 0.004    Multinomial-NB
      AGR             - 0.082         +/- 0.07              + 0.082   - 0.020     - 0.008    Linear SVC
      EXT             - 0.073         +/- 0.02              - 0.070   - 0.060     - 0.070    k-NN
      CON             + 0.001         +/- 0.08              + 0.096   + 0.030     + 0.020    Linear SVC
  • 17. Results/Summary • Hardly any improvement from the head-on approach • At least over the baseline • Limitations: • Strongly limited by hardware (CPU) • grid_search: Count(Algorithms) * Count(Parameters) * Count(Labels) → rapidly growing effort • grid_search for 1 label, 3 parameters and LinearSVC took >20 minutes • Inconsistent data (multiple languages, e.g. Spanish) • Future research: look into GPUs (NVIDIA)