6th Author Profiling task at PAN
Multimodal Gender Identification in Twitter
PAN-AP-2018, CLEF 2018
Avignon, 10-14 September
Francisco Rangel
Autoritas Consulting & PRHLT Research Center - Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center - Universitat Politècnica de València
Martin Potthast & Benno Stein
Bauhaus-Universität Weimar
Manuel Montes y Gómez
INAOE - Mexico
Introduction
Author profiling aims to identify personal traits such as age, gender,
personality, native language, and language variety from an author's writings.
This is crucial for:
- Marketing.
- Security.
- Forensics.
PAN’18 Author Profiling
Task goal
To investigate the identification of an author's gender from multimodal
information: texts + images.
Three languages: English, Spanish, Arabic.
Corpus
● PAN-AP'17 subset extended with images shared in the authors' timelines:
○ 100 tweets per author.
○ 10 images per author.
Evaluation measures
Accuracy is calculated per modality and language:
● Text-based.
● Image-based.
● Combined.
The final ranking is the average of the combined* accuracies over the three languages.
* If only the textual approach was submitted, its accuracy is used instead.
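The ranking rule can be sketched in a few lines of Python. This is a minimal illustration, not the organisers' evaluation script: the function name `final_ranking_score`, the dictionary layout, and the example accuracies are all assumptions made for the sketch, and the fallback to the text-based accuracy is applied per language here.

```python
from statistics import mean

LANGUAGES = ("ar", "en", "es")


def final_ranking_score(text_acc, combined_acc=None):
    """Average the per-language combined accuracy.

    text_acc and combined_acc map a language code to an accuracy in [0, 1].
    When no combined accuracy exists for a language, the text-based
    accuracy is used instead (assumed reading of the slide's footnote).
    """
    combined_acc = combined_acc or {}
    per_language = [combined_acc.get(lang, text_acc[lang]) for lang in LANGUAGES]
    return mean(per_language)


# Hypothetical accuracies, for illustration only.
text_only = {"ar": 0.80, "en": 0.82, "es": 0.81}
with_images = {"ar": 0.81, "en": 0.85, "es": 0.82}

score_text_only = final_ranking_score(text_only)
score_combined = final_ranking_score(text_only, with_images)
```

A team that submitted only a textual run is therefore ranked on the mean of its three text-based accuracies.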
Baselines
● BASELINE-stat: A statistical baseline that emulates random choice. Both modalities.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 5,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocessing: lowercasing, removal of punctuation signs and numbers, removal of stopwords.
○ Textual modality.
● BASELINE-rgb:
○ RGB color for each pixel in each of the author's images.
○ The author is represented with the minimum, maximum, mean, median, and standard deviation of the RGB values.
○ Images modality.
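The two feature-based baselines can be approximated with the Python standard library. The sketch below is an illustration under stated assumptions, not the organisers' code: the stopword list is a tiny placeholder, the RGB statistics are computed per channel, and the population standard deviation is used, none of which the slides specify. All function names are hypothetical.

```python
import re
import statistics
from collections import Counter

# Tiny illustrative stopword list; the slides do not say which list was used.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}


def tokenize(text):
    """Lowercase, drop punctuation and numbers, then drop stopwords."""
    text = re.sub(r"[^\w\s]|\d|_", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocabulary(training_docs, size=5000):
    """Keep the `size` most common words of the training set."""
    counts = Counter(tok for doc in training_docs for tok in tokenize(doc))
    return [word for word, _ in counts.most_common(size)]


def bow_vector(doc, vocabulary):
    """Represent a document by the absolute frequency of each vocabulary word."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


def rgb_stats(pixels):
    """Summarise (r, g, b) pixels: min, max, mean, median, std per channel."""
    features = []
    for channel in zip(*pixels):  # one tuple of values per colour channel
        features += [
            min(channel),
            max(channel),
            statistics.mean(channel),
            statistics.median(channel),
            statistics.pstdev(channel),  # population std: an assumption
        ]
    return features


# Toy data, for illustration only.
train = ["The cat and the dog!", "A dog is a dog, 100%"]
vocab = build_vocabulary(train, size=2)
vector = bow_vector("Dog, dog, cat?", vocab)

pixels = [(0, 10, 20), (10, 10, 40), (20, 10, 60)]
stats = rgb_stats(pixels)
```

Either feature vector can then be passed to any off-the-shelf classifier; the slides do not state which one the baselines used.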
23 participants
22 working notes
17 countries, including the Netherlands and Slovenia
Approaches
Approaches - Preprocessing
Texts:
● Punctuation signs: Ciccone et al., Stout et al., HaCohen-Kerner et al., Veenhoven et al.
● Character flooding: Ciccone et al., Raiyani et al.
● Lowercase: Von Däniken et al., Veenhoven et al., Nieuwenhuis et al., Bayot & Gonçalves, Kosse et al., Stout et al., Schaetti, HaCohen-Kerner et al.
● Stopwords: Ciccone et al., Raiyani et al., HaCohen-Kerner et al., Veenhoven et al.
● Twitter-specific components (hashtags, URLs, mentions and RTs): Ciccone et al., Takahashi et al., Stout et al., Raiyani et al., Schaetti, HaCohen-Kerner et al., Von Däniken et al., Martinc et al., Veenhoven et al., Nieuwenhuis et al., Kosse et al.
● Contractions and abbreviations: Stout et al., Raiyani et al.
● Normalisation and diacritics removal in Arabic: Ciccone et al.
Images:
● Resizing, rescaling: Takahashi et al., Martinc et al., Sierra-Loaiza & González
● Normalisation (subtracting the average RGB value per language): Takahashi et al.
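Several of the text preprocessing steps listed on this slide can be combined into one small function. The following is a sketch only and does not reproduce any participant's pipeline: the placeholder tokens `<url>` and `<user>`, the choice to keep the hashtag word, and the rule of collapsing repeated characters to two are assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
RETWEET_RE = re.compile(r"^rt\b\s*", re.IGNORECASE)
FLOOD_RE = re.compile(r"(.)\1{2,}")  # a character repeated three or more times


def preprocess_tweet(text):
    """Normalise one tweet; the placeholder tokens are illustrative choices."""
    text = RETWEET_RE.sub("", text.strip())  # drop a leading "RT" marker
    text = URL_RE.sub("<url>", text)         # replace links
    text = MENTION_RE.sub("<user>", text)    # replace user mentions
    text = HASHTAG_RE.sub(r"\1", text)       # keep the hashtag word, drop '#'
    text = FLOOD_RE.sub(r"\1\1", text)       # collapse character flooding
    text = text.lower()
    return " ".join(text.split())            # squeeze whitespace


cleaned = preprocess_tweet("RT @Alice: Sooooo happy!!!! #Friday https://t.co/abc")
# cleaned == "<user>: soo happy!! friday <url>"
```

Whether to replace, remove, or keep each Twitter-specific component is a design decision that differed between teams; the slide lists only which teams handled them.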
Approaches - Textual Features
● Stylistic features: Patra et al., Karlgren et al., HaCohen-Kerner et al., Von Däniken et al.
○ Ratios of links
○ Hashtags or user mentions
○ Character flooding
○ Emoticons / laughter expressions
○ Domain names
● N-gram models: Stout et al., Sandroni-Dias & Paraboni, López-Santillán et al., Von Däniken et al., Tellez et al., Nieuwenhuis et al., Kosse et al., Daneshvar, HaCohen-Kerner et al., Ciccone et al., Aragón & López
● LSA: Patra et al.
● Second order representation: Aragón & López
● A variation of LDSE: Garibo-Orts
● Word embeddings: Martinc et al., Veenhoven et al., Bayot & Gonçalves, López-Santillán et al., Takahashi et al., Patra et al.
● Character embeddings: Schaetti
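To make the two most common feature families on this slide concrete, here is a minimal sketch of n-gram counting and of simple stylistic ratios. It is an illustration, not any team's feature extractor: the function names are hypothetical, tokenisation is plain whitespace splitting, and a "link" is detected by the substring `http`, all of which are simplifying assumptions.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams after whitespace tokenisation."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def stylistic_ratios(tweets):
    """Share of an author's tweets containing a link, a hashtag or a mention."""
    total = len(tweets) or 1  # avoid division by zero for an empty timeline
    return {
        "links": sum("http" in t for t in tweets) / total,
        "hashtags": sum("#" in t for t in tweets) / total,
        "mentions": sum("@" in t for t in tweets) / total,
    }


# Toy timeline, for illustration only.
tweets = [
    "good morning @bob",
    "good morning #monday",
    "read this http://example.com",
    "hello",
]
trigrams = char_ngrams("banana")             # "ana" occurs twice
bigrams = word_ngrams(" ".join(tweets[:2]))  # ("good", "morning") occurs twice
ratios = stylistic_ratios(tweets)            # each ratio is 0.25
```

The resulting counts are typically weighted (for example with TF-IDF) before being fed to a classifier; the slide does not say which weighting each team used.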
Approaches - Image Features
● Face detection: Stout et al., Ciccone et al., Veenhoven et al.
● Object detection: Ciccone et al.
● Local binary patterns: Ciccone et al.
● Color histogram: Ciccone et al., HaCohen-Kerner et al.
● Image resources and tools (e.g. ImageNet, TorchVision...): Patra et al., Nieuwenhuis et al., Aragón & López, Schaetti, Takahashi et al.
● Hand-crafted features: HaCohen-Kerner et al.
● Bag of Visual Words: Tellez et al.
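Of the image features listed here, the color histogram is the simplest to illustrate. The sketch below assumes 8-bit `(r, g, b)` pixel tuples and a per-channel histogram with a hypothetical `bins` parameter; the slides do not describe how the participating teams binned or normalised their histograms.

```python
def color_histogram(pixels, bins=4):
    """Normalised per-channel histogram of 8-bit (r, g, b) pixels.

    Returns 3 * bins values: the red bins, then the green, then the blue.
    """
    width = 256 // bins
    histogram = [0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            bucket = min(value // width, bins - 1)  # clamp the last bucket
            histogram[channel * bins + bucket] += 1
    total = len(pixels) or 1  # avoid division by zero for an empty image
    return [count / total for count in histogram]


# Four toy pixels: two reddish, two bluish.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 128, 255)]
hist = color_histogram(pixels, bins=4)
# Red channel bins: [0.5, 0.0, 0.0, 0.5]
```

Concatenating or averaging such histograms over an author's ten images would give one fixed-length vector per author, which is one plausible way to use them.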
Approaches - Methods
● Logistic regression: Sandroni-Dias & Paraboni, HaCohen-Kerner et al., Von Däniken et al., Nieuwenhuis et al.
● SVM: López-Santillán et al., Aragón & López, Ciccone et al., Patra et al., Tellez et al., Veenhoven et al.
● Multilayer Perceptron: HaCohen-Kerner et al.
● Basic feed-forward network: Kosse et al.
● Distance-based method: Tellez et al., Karlgren et al.
● IF condition: Garibo-Orts
● RNN: Takahashi et al., Bayot & Gonçalves, Stout et al.
● CNN: Schaetti
● ResNet18: Schaetti
● Bi-LSTM: Veenhoven et al.
Textual modality
● AR: n-grams
● EN: n-grams
● ES: n-grams
Images modality
● Best: CNN pre-trained on ImageNet.
● 2nd (AR): VGG16 + ResNet50 trained on ImageNet.
● 2nd (EN): VGG16 + ResNet50 trained on ImageNet.
● 2nd (ES): Color histogram + faces + objects + local binary patterns.
Improvement with images
● On average, there is almost no improvement.
● Some systems obtain large improvements (up to 7.73%):
○ CNN pre-trained on ImageNet.
Improvement (AR)
Improvement (EN)
Improvement (ES)
Final ranking
PAN-AP 2018 best results
Conclusions
● Several approaches to tackle the task:
○ Deep learning prevails.
● Textual classification:
○ Best results on the textual subtask: n-grams + traditional methods (SVM, logistic regression).
○ Second-best result for Spanish: bi-LSTM with word embeddings.
● Image classification approaches based on:
○ Face recognition. <- Failed!
○ Pre-trained models and image processing tools such as ImageNet. <- Best results obtained with semantic features extracted from the images.
○ Hand-crafted features such as color histograms and bag-of-visual-words.
● Texts vs. images:
○ Textual features discriminate better than images.
○ On average, there is no improvement when images are used.
○ Elaborate representations improve results by up to 7.73% (English).
● Best results:
○ Over 80% on average (EN: 85.84%; ES: 82%; AR: 81.80%).
○ English (85.84%): Takahashi et al. with deep learning techniques (RNN for text, ImageNet + CNN for images).
○ Spanish (82%): Daneshvar with SVM and combinations of n-grams (only textual features).
○ Arabic (81.80%): Tellez et al. with SVM + n-grams, and Bag of Visual Words.
● Insight:
○ Traditional approaches remain competitive, but deep learning is gaining strength.
Task impact
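● PAN-AP 2013: 21 participants, 16 countries.
● PAN-AP 2014: 10 participants, 8 countries.
● PAN-AP 2015: 22 participants, 13 countries.
● PAN-AP 2016: 22 participants, 15 countries.
● PAN-AP 2017: 22 participants, 19 countries.
● PAN-AP 2018: 23 participants, 17 countries.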
Industry at PAN (Author Profiling)
Organisation
Sponsors
Participants
2019 -> Robot or human?
On behalf of the author profiling task organisers:
Thank you very much for participating,
and we hope to see you next year!
