Panel at Web Intelligence, Dec 4-6, 2018, Santiago Chile
Funding Acknowledgement: Research supported in part by:
NSF Award#: CNS 1513721 TWC SBE: Medium: Context-Aware Harassment Detection on Social Media.
Views represented are those of the speaker/author, and not of the sponsor.
Computational Social Science as the Ultimate Web Intelligence
1. Computational Social Science
as the Ultimate Web Intelligence
Kno.e.sis Projects at the Intersection of Big Data, AI, Social Good and Health
Panel at Web Intelligence 2018
Prof. Amit Sheth
LexisNexis Ohio Eminent Scholar
Executive Director, Kno.e.sis - Ohio Center of Excellence in
Knowledge-enabled Computing & BioHealth Innovation
Presentation template by SlidesCarnival
Photographs by Unsplash
Icons by thenounproject
2. Big Data | Social Media | AI
Harnessing Twitter ‘Big Data’ for
Automatic Emotion Identification
2.5 M Tweets with Machine
Learning algorithms
Trends
Emotions
eDrugTrends - Identify emerging trends in
cannabis and synthetic cannabinoid use in the
U.S.
Web Forum Data & Tweets with
NLP, ML & Semantic Web
Technologies
Intents
Sentiments
Hazards SEES - Cross-modal aggregation
of Multi-modal & Multi-disciplinary
Data to support human efforts in disaster
management
Extracting Diverse Sentiment Expressions
with Target-Dependent Polarity from
Twitter
Opinions
400 000 Tweets with an
Optimization Model
People
Places
Times
3. Gender-Based Violence in
140 Characters or Fewer: A
#BigData Case Study of
Twitter
14 million tweets
collected from Twitter
over a period of 10
months
1. Gender-based violence in 140 characters or fewer: A #BigData case study of Twitter, Hemant Purohit, Tanvi Banerjee, Andrew Hampton, Valerie L. Shalin, Nayanesh Bhandutia, and Amit
Sheth, First Monday, Volume 21, Number 1 - 4 January 2016
4. Outcomes of Analysis
◎ Trends of GBV tweets across 5 countries: USA,
India, Philippines, Nigeria, South Africa.
◎ Three thematic groups of GBV tweets: physical
violence, sexual violence, and harmful practices.
◎ Nigeria has the highest percentage of tweets with URLs in
comparison to other countries.
◎ Numerous possible explanations:
○ Literacy,
○ Credibility of the public press
○ Possibility that reliance on external resources somehow reduces
the threat of being identified as the responsible party.
5. Context-Aware
Harassment Detection
on Social Media
24 000 tweets collected
Supervised ML methods
used
1. Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Krishnaprasad Thirunarayan, Valerie L. Shalin, Amit Sheth. A Quality Type-aware Annotated Corpus and
Lexicon for Harassment Research. Web Science, WebSci 2018, Amsterdam, The Netherlands, May 27-30, 2018
2. Mohammadreza Rezvan, Saeedeh Shekarpour, Krishnaprasad Thirunarayan, Valerie L. Shalin, Amit Sheth (2018). Analyzing and learning the language for different types of harassment
Knoesis wiki for Context-Aware Harassment Detection on Social
Media
6. Outcomes and Insights
Lexicon
Covering different types of harassment content
● Sexual
● Political
● Racial
● Intellectual
● Appearance-related
● General
Tweets
24 000 non-redundant annotated tweets, of which 3000 are labeled as harassing
Features
A combination of features resulted in the best accuracy
○ TFIDF
○ word2vec
○ paragraph2vec
○ LIWC vector
ML Methods
Gradient Boosting Machine (GBM) outperformed SVM, KNN and NB
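The idea of combining feature families on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the tweets and the three-word lexicon are hypothetical, a simple lexicon-hit count stands in for the word2vec, paragraph2vec and LIWC vectors, and no classifier (such as GBM) is trained. "Combination" is assumed here to mean concatenating the per-tweet vectors.

```python
import math
from collections import Counter

# Hypothetical toy tweets and lexicon, for illustration only.
tweets = [
    "you are so stupid and ugly",
    "great game last night",
    "nobody likes you stupid loser",
]
lexicon = {"stupid", "ugly", "loser"}  # stand-in for a harassment lexicon


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many docs each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append(
            [(tf[t] / len(doc)) * math.log(n / df[t]) for t in vocab]
        )
    return vocab, vectors


def lexicon_features(doc):
    """A second feature family: count and share of lexicon terms."""
    tokens = doc.split()
    hits = sum(1 for t in tokens if t in lexicon)
    return [hits, hits / len(tokens)]


vocab, tfidf = tfidf_vectors(tweets)
# Assumed meaning of "combination": concatenate feature vectors per tweet.
combined = [vec + lexicon_features(doc) for vec, doc in zip(tfidf, tweets)]
print(len(vocab), len(combined[0]))
```

The combined vectors could then be passed to any supervised classifier; which one works best is an empirical question on the annotated corpus.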
7. Classification of Reddit Content to DSM-5 for Web-based Intervention
3 million posts from 270K Reddit users, collected from 2005 to 2015, with zero-shot learning
Provide clinicians with insights about their patients
Patient
Clinician
EMR
Insight
DSM-5 & Drug Abuse Ontology
Improved Healthcare
1. Gaur, Manas, Ugur Kursuncu, Amanuel Alambo, Amit Sheth, Raminta Daniulaityte, Krishnaprasad Thirunarayan, and Jyotishman Pathak. "Let Me Tell You About Your Mental Health!: Contextualized Classification of Reddit Posts to DSM-5 for Web-based Intervention." In Proceedings of the 27th ACM CIKM 2018.
Knoesis wiki for Modeling Social Behavior for Healthcare
Utilization in Depression
8. Outcomes & Insights
Our sophisticated methods have
reduced the false alarm rate to 3%
- 5% by incorporating domain
knowledge and slang terms in
social media data
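For readers unfamiliar with the metric, a false alarm rate can be computed as sketched below. The slide does not define it, so the usual definition is assumed here (false positives divided by all true negatives-class items, i.e. FP / (FP + TN)); the labels are hypothetical toy data.

```python
def false_alarm_rate(y_true, y_pred):
    """FP / (FP + TN), with 1 = flagged class and 0 = not flagged."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical labels: 20 negatives (one wrongly flagged), 5 positives.
y_true = [0] * 20 + [1] * 5
y_pred = [1] + [0] * 19 + [1] * 4 + [0]

far = false_alarm_rate(y_true, y_pred)
print(f"false alarm rate: {far:.2%}")
```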
9. Views: People - Content - Network
Information in tweets by a user displays
an intent based on the user type:
Personal accounts share opinions, Retail
accounts promote related products for
sale, Media accounts disseminate
information.
Proper incorporation
of each view is
essential to
better represent
characteristics
of users.
User Modeling in Marijuana-related Communications
Multimodality
- The information shared in different
formats contributes to the meaning:
Text, Image, Emoji, Interactions
- Translation of image and emoji to textual
representation using state-of-the-art tools
such as EmojiNet.
People: user description, emoji,
profile pictures.
Content: text, emoji
Network: interactions with other
users: retweets and mentions.
🏈
😉
🍔
1. Ugur Kursuncu, Manas Gaur, Usha Lokala, Anurag Illendula, Krishnaprasad Thirunarayan, Raminta Daniulaityte, Amit Sheth, and I. Budak Arpinar. "'What's ur type?' Contextualized Classification of User Types in Marijuana-related Communications using Compositional Multiview Embedding." In Proceedings of IEEE International Conference on Web Intelligence, 2018
Knoesis wiki for eDrugTrends
10. Outcomes & Insights
◎ Incorporation of multimodal data,
specifically profile pictures and network
interactions, contributes significantly to
the classification of users.
◎ Multimodality significantly improves
classification performance in the case of
an imbalanced dataset, e.g., profile pictures
of users.
◎ Composition of the embeddings of views
(e.g., person, content, network) provides
a more coherent representation of users.
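One simple way to picture composing per-view embeddings into a single user representation is sketched below. This is only an illustration under stated assumptions: the cited paper's compositional method is not reproduced, plain concatenation of L2-normalized vectors is used as a stand-in, and the function names and vector values are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def compose_views(people, content, network):
    """Concatenate normalized per-view vectors into one user vector."""
    composed = []
    for view in (people, content, network):
        composed.extend(l2_normalize(view))
    return composed


# Hypothetical embeddings for the three views of one user.
user = compose_views(
    people=[3.0, 4.0],         # e.g. from description, emoji, profile picture
    content=[1.0, 0.0, 0.0],   # e.g. from tweet text and emoji
    network=[0.0, 2.0],        # e.g. from retweets and mentions
)
print(user)
```

Normalizing each view first keeps one view from dominating the composed vector simply because its raw values are larger.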
#  Features        Personal  Media  Retail
1  Tweet + Desc    0.95      0.42   0.73
2  w/ Composition  0.94      0.18   0.71
3  w/ Metadata     0.94      0.17   0.72
4  w/ Image        0.97      0.72   0.87
5  w/ Network      0.98      0.73   0.91
F-scores for each user type
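For reference, per-class F-scores like those reported for each user type can be computed as sketched below. The slide does not state which F-score variant was used, so the standard F1 (harmonic mean of precision and recall) is assumed; the labels are hypothetical toy data and do not reproduce the reported numbers.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1} computed one-vs-rest for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Hypothetical, imbalanced toy labels: many personal, few media/retail.
y_true = ["personal"] * 6 + ["media"] * 2 + ["retail"] * 2
y_pred = ["personal"] * 6 + ["personal", "media"] + ["retail", "retail"]

scores = f1_per_class(y_true, y_pred, ["personal", "media", "retail"])
for label, score in scores.items():
    print(f"{label:>8}: {score:.2f}")
```

Reporting F-scores per class, rather than overall accuracy, makes weak performance on a minority class such as Media visible.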
11. Fusing Visual, Textual and
Connectivity Clues for Studying
Mental Health
Knoesis wiki for Modeling Social Behavior for Healthcare Utilization in Depression
Develop a multimodal framework that
employs statistical techniques to
fuse heterogeneous sets of features,
obtained by processing visual, textual
and user interaction data, to identify
depressive behavior and infer
demographics.
1. Amir Hossein Yazdavar, Mohammad Saied Mahdavinejad, Goonmeet Bajaj, Krishnaprasad Thirunarayan, Jyotishman Pathak and Amit Sheth. Fusing Visual, Textual and
Connectivity Clues for Studying Mental Health in Population. In: 30th International Conference on World Wide Web (Submitted WWW-2019)
◎ How well does the content of posted images (colors,
aesthetics and facial presentation) reflect depressive
behavior?
◎ Does the choice of profile picture show any psychological
traits of a depressed online persona? Is it reliable
enough to represent demographic information such as
age and gender?
◎ Are there any underlying common themes among
depressed individuals, generated using multimodal
content, that can be used to detect depression reliably?
12. Outcomes & Insights
Characterizing Linguistic Patterns in two aspects: Depressive Behavior and Age Distribution
Gender Biases and Depressive Behavior Association (chi-square test; color code: blue = association, red = repulsion; size = amount of each cell's contribution)
The age distribution for depressed and control users in the ground-truth dataset
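The association plot described above is based on a chi-square test. A minimal sketch of how such a plot's quantities could be derived is given below, assuming Pearson residuals are used: a positive residual indicates association, a negative one repulsion, and the squared residual is the cell's contribution to the statistic. The counts are hypothetical and are not the study's data.

```python
import math


def chi_square_residuals(table):
    """Return (chi-square statistic, Pearson residuals) for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    residuals = []
    chi2 = 0.0
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            r = (observed - expected) / math.sqrt(expected)
            res_row.append(r)
            chi2 += r * r
        residuals.append(res_row)
    return chi2, residuals


# Hypothetical counts. Rows: female, male. Columns: depressed, control.
counts = [[60, 40], [40, 60]]
chi2, residuals = chi_square_residuals(counts)

for row in residuals:
    for r in row:
        kind = "association" if r > 0 else "repulsion"
        print(f"{r:+.3f} ({kind}), contribution {r * r / chi2:.0%}")
```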
13. Outcomes & Insights
The explanation of the log-odds prediction of outcome (0.31) for
a sample user (y-axis shows the outcome probability (depressed
or control), the bar labels indicate the log-odds impact of each
feature)
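How per-feature log-odds impacts could add up to an outcome probability can be sketched as follows. This assumes an additive explanation in log-odds space converted with the logistic function; the feature names and impact values are hypothetical and are not taken from the figure.

```python
import math


def probability_from_log_odds(log_odds):
    """Logistic function: map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


baseline = 0.0  # log-odds of 0 corresponds to a probability of 0.5
# Hypothetical per-feature impacts on the log-odds of "depressed".
impacts = {"feature_a": 0.8, "feature_b": -0.5, "feature_c": 0.3}

running = baseline
for name, impact in impacts.items():
    running += impact
    p = probability_from_log_odds(running)
    print(f"{name:>10}: log-odds {running:+.2f} -> p = {p:.3f}")
```

Each step of the loop corresponds to one bar in a waterfall-style chart: the bar label is the impact, and the running total determines the probability after that feature is added.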
Ranking Features obtained from Different Modalities with
Boruta Algorithm
14. Create value from data that supports action
Big Data & AI
What can we do that
is unique?
Emotions
Sentiments
Intentions Derive Insights
Scale to identify important & relevant
issues for humankind
Floods Earthquake
Wildfires Tsunami
Derive insights from data
Do more exercises
Reduce sugar intake
Increase water intake
More at: http://knoesis.org/projects, http://bit.ly/Kapproach
Opinions - "Time for dabs": Analyzing Twitter data on butane hash oil use.
Sharing behavior analysis. Social media provide the opportunity to distribute information, potentially reflecting both the senders’ judgment of information importance, and reliance on the voice of others. Sharing functions as an amplification of these voices, often through the voices of influential celebrities. We analyze two types of sharing behavior in the social media community surrounding GBV events: direct content resharing as a retweet (RT), and indirect sharing via references to external resources, such as news, blogs, articles, and multimedia, using URLs, etc.
The low retweeting frequency in Nigeria is particularly remarkable (see Table 5). One might hypothesize that a low-literacy country such as Nigeria, in which senders are less able to compose messages, would have the highest retweet ratio. The adjacent analysis of the proportion of URL references with respect to the total corpus suggests a different sociocultural phenomenon at work, concerning the identifiability of the responsible party. Among GBV tweets, Nigeria has the highest percentage containing URLs in comparison to other countries. Numerous explanations can be tested, including literacy, credibility of the public press, and the possibility that reliance on external resources somehow reduces the threat of being identified as the responsible party.
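The two sharing measures discussed above (retweet ratio and URL ratio per country) could be computed along these lines. This is a rough sketch under simple assumptions: a retweet is detected by a leading "RT @", a URL by an "http://" or "https://" substring, and the tweets, country codes and example.org links are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy tweets; not data from the study.
tweets = [
    {"country": "NG", "text": "Report on GBV http://example.org/a"},
    {"country": "NG", "text": "RT @user: must read http://example.org/b"},
    {"country": "US", "text": "RT @user: this has to stop"},
    {"country": "US", "text": "thinking about the news today"},
]


def sharing_ratios(tweets):
    """Return per-country shares of retweets and of tweets with URLs."""
    counts = defaultdict(lambda: {"total": 0, "rt": 0, "url": 0})
    for tw in tweets:
        c = counts[tw["country"]]
        c["total"] += 1
        if tw["text"].startswith("RT @"):  # direct resharing
            c["rt"] += 1
        if "http://" in tw["text"] or "https://" in tw["text"]:
            c["url"] += 1                  # indirect sharing via a link
    return {
        country: {
            "rt_ratio": c["rt"] / c["total"],
            "url_ratio": c["url"] / c["total"],
        }
        for country, c in counts.items()
    }


ratios = sharing_ratios(tweets)
print(ratios)
```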
Goal: understanding an individual's mental health situation
Provide clinicians with insights about their patients
Not all the Reddit content types (Main Posts, Comments, and Replies) are informative.
Identification of Features that represent users on Reddit:
Vertical Linguistic Features (e.g. Inter-Subreddit Similarity)
Horizontal Linguistic Features (e.g. Subordinate Conjunction)
Fine-Grained Features (e.g. Readability scores)
Word Embedding with/without modulation
Coherence-based topic selection that associates subreddits with DSM-5
Enrichment of DAO ontology with DSM-5 lexicon and Slang Terms : DSM-5 Knowledge Hierarchy
DAO - we created
A sophisticated method allowed us to greatly reduce the false alarm rate -
Explain the optimization effort in one sentence
25% reduction in the false alarm rate (2- 5%) while the other methods have higher false alarm rates ()
Takeaway:
Incorporation of domain knowledge and
slang terms in social media data
1) Analysis of the content of posted images in terms of colors,
aesthetics and facial presentation, and their associations with
depressive behavior;
2) Uncovering the underlying relationships between the visual
and contextual content of likely depressed profiles, obtained
using a demographic inference process, which can facilitate
community-level management of depression
Top left: Our findings from social media are consistent with the medical literature: according to the third National Health and Nutrition Examination Survey [29], more women than men were given a diagnosis of depression.
Bottom left: Young people aged below 24 tend to be more depressed, suggesting either that the likely depressed user population is younger, or that youngsters are more likely to disclose their age, say with the intention of connecting to their peers (social homophily).
Right: The waterfall charts represent how the probability of being depressed changes with the addition of each feature variable.
Left: illustrates feature importance obtained by Boruta algorithm.