  1. Extracting Health-related Social Structures from Conversations in Twitter
     Abduljaleel Al Rubaye, Dr. Ronaldo Menezes
     BioComplex Lab, Florida Institute of Technology, Melbourne, FL
  2. Outline / Thesis Structure
     o Health & Social Communities
     o Collecting Data
     o Filtering Process
     o Building Networks
     o Analyzing the Networks
       • Degree Distribution
       • Average Path Length
       • Clustering Coefficient
     o Time Window
  3. Motivation
     o Health is an important aspect of one’s life.
     o The quality of our health defines our general wellbeing.
     o To stay healthy, most of us try to keep informed about the latest developments, medical practices, treatments, drugs, etc.
  4. Motivation
     o The acquisition of knowledge is more prominent among individuals who already suffer from serious health conditions, particularly conditions that may lead to death.
     o People who have been diagnosed with serious health conditions may suffer the symptoms of their condition for a considerable period of time.
     o These people naturally form support groups to share their feelings and daily experiences, for example by contributing to social communities.
  5. Social Communities
     o The benefits of being in a social community:
       • helps people get support
       • prevents loneliness
       • eliminates behavioral risks
       • helps people exchange and share experiences
       • can improve knowledge and health education faster
  6. Online Social Networks
     o There are many ways to communicate socially.
     o A common way is to use Online Social Networks:
       • They have become part of our lives.
       • They are easy to access.
       • They ease the process of finding and connecting individuals who share the same interest.
  7. Twitter
     o One of the most popular social networking applications
     o Latest statistic: 316 million active users as of the 3rd quarter of 2015
  8. The Goal
     o In this work we collected the tweets that mention one of the leading causes of death in the U.S.
     o We generate networks from term co-occurrence in Twitter to form social communities related to the top causes of death in the USA.
     o We reconstruct the structures using the concept of a time window:
       - Ferreira et al. “The small world of seismic events”
       - Meng et al. “Systematic dynamic and heterogeneous analysis of rich social network data”
  9. The Goal
     o Retrieve networks from conversations between Twitter users, even where the users are not talking to each other directly.
     o Find out whether these networks have the characteristics of social networks.
     o Is the time window a good tool to unveil social structures from Twitter timeline conversations?
     o Find out whether there is a specific time window length at which the tweet conversations best resemble typical social network conversations.
  10. Health Conditions
     o An official, updated list was provided by the Centers for Disease Control and Prevention (CDC).
     o The list includes 113 causes of death in the US.
     o The CDC focuses on the top 15 causes of death.
     o These causes include 13 health conditions that lead to death:
       1. Heart Diseases
       2. Malignant Neoplasms (Cancer)
       3. Chronic Lower Respiratory Diseases (CLRD)
       4. Cerebrovascular Diseases (Stroke)
       5. Alzheimer’s Disease
       6. Diabetes Mellitus
       7. Influenza and Pneumonia
       8. Kidney Diseases
       9. Septicemia
       10. Chronic Liver Diseases (CLD)
       11. Hypertension (High Blood Pressure)
       12. Parkinson’s Disease
       13. Pneumonitis due to Solids and Liquids
  11. Collecting Data
     o To track tweets, we used a set of keywords (the keyword list is not preserved in this transcript).
  12. Tweets Tracker
     o The Twitter crawler was written in Python 2.7.
     o It tracked data over a period of 60 days, from Feb 17th to April 17th, 2015.
     o We used the following tools to collect the data:
       • Twitter streaming API
       • MongoDB
       • PyMongo
  13. Statistics (before filtration)
     o Tweets collected worldwide: 12,518,372
     o Number of times each health condition was mentioned across all tweets (chart not preserved in this transcript)
  14. Statistics (before filtration)
     o Tweet distribution per day (chart not preserved in this transcript)
     o Due to a disruption in access to the global tweet stream on day 27, tweet collection stopped for a while.
  15. Filtration Process / Location Type
     o Number of geocoded tweets: 370,376 (about 3% of the total number of tweets)
     o Tweets that include only a textual location: 8,785,834 (70%)
     o No location included: 4,750,895 (37%)
  16. Filtration Process
     o The geocoding library Geopy was used to retrieve a readable address from the location information we have.
     o Geopy supports several geolocation services (Google Maps, Bing Maps, OpenStreetMap, etc.).
     o Due to time limitations we used the OpenStreetMap Nominatim geocoder service, which responds to one request per second.
     o Examples:
       - Valid readable address: input (North of Chicago), output (Chicago, Cook County, Illinois, United States of America)
       - Invalid readable address: input (Behind the tears of a clown), output (Error)
     o We retrieved 5,338,448 (61%) valid addresses from the textual information.
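The address lookup described on the Filtration Process slide could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names `make_nominatim_geocoder` and `resolve_locations`, the `user_agent` string, and the `fake_geocode` stub are assumptions introduced here. Only the use of Geopy with the Nominatim service at one request per second comes from the slides.

```python
def make_nominatim_geocoder(user_agent="health-tweets-example"):
    """Return a geocode callable limited to one request per second.

    Needs the third-party geopy package and network access.
    The user_agent value is a placeholder chosen for this sketch.
    """
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def resolve_locations(texts, geocode):
    """Map each free-text location to a readable address, or None if unresolved."""
    resolved = {}
    for text in texts:
        location = geocode(text)  # geopy returns None when nothing matches
        resolved[text] = location.address if location is not None else None
    return resolved


class FakeLocation:
    """Stand-in for a geopy Location, used only for the offline demo below."""
    def __init__(self, address):
        self.address = address


def fake_geocode(text):
    """Offline stub that knows a single location (hypothetical helper)."""
    known = {
        "North of Chicago": "Chicago, Cook County, Illinois, United States of America",
    }
    return FakeLocation(known[text]) if text in known else None


if __name__ == "__main__":
    samples = ["North of Chicago", "Behind the tears of a clown"]
    # Swap fake_geocode for make_nominatim_geocoder() to query the real service.
    for text, address in resolve_locations(samples, fake_geocode).items():
        print(text, "->", address)
```

Passing the geocoder in as a parameter keeps the filtering logic testable without network access; at one request per second, the 8,785,834 textual locations reported on the previous slide would take roughly 100 days to resolve one by one, which is consistent with the time limitation the slide mentions.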
  17. Statistics (after filtration)
     o Number of tweets originating from the US: 2,351,991
     o Due to a disruption in access to the global tweet stream on day 27, tweet collection stopped for a while.
  18. Statistics (after filtration)
     o We normalized the total collected tweets by each state’s census population to visualize the distribution.
     o Why? According to S. Burton et al. (“Right time, right place” health communication on Twitter), the number of Twitter users per state is correlated with the state’s population.
     o Number of tweets by state:
       • California: 278,771 tweets
       • North Dakota: 2,703 tweets
     o Number of tweets by health condition:
       • Cancer: 1,395,590 tweets (59%)
       • Pneumonitis due to solids and liquids: 44 tweets
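The normalization step can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name `tweets_per_100k` and the per-100,000 scale are choices made here, and the population figures are rounded placeholders, not the census values used in the thesis. The tweet counts are the ones reported on the slide.

```python
def tweets_per_100k(tweet_counts, populations):
    """Return tweets per 100,000 residents for every state in tweet_counts."""
    return {
        state: 100_000 * count / populations[state]
        for state, count in tweet_counts.items()
    }


# Tweet counts taken from the slide.
tweet_counts = {"California": 278_771, "North Dakota": 2_703}

# Placeholder populations (rounded, illustrative only, NOT the census data).
populations = {"California": 39_000_000, "North Dakota": 750_000}

rates = tweets_per_100k(tweet_counts, populations)

# Share of US tweets that mention cancer, from the two totals on the slides.
cancer_share = 100 * 1_395_590 / 2_351_991

if __name__ == "__main__":
    for state, rate in rates.items():
        print(f"{state}: {rate:.1f} tweets per 100k residents")
    print(f"Cancer share: {cancer_share:.0f}%")
```

Dividing by population puts a large state and a small state on the same scale, which is the point of the normalization: raw counts mostly reflect how many people live in each state.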
  19. Building Networks
     o Nodes: users of the same US state
     o Links: two users are linked if both mentioned the same health condition.
     o By this definition, all users who mentioned the same health condition are related to each other.
     o If we built the network for a specific health condition from all tweets in the 60-day collection period, the network would be too densely connected.
  20. Time Window
     o A time window is a predefined period of time.
     o It restricts the definition of relations among entities to that limited interval.
     o Two issues should be addressed before using the time window concept:
     1) The size of the time window:
       • A size that is too large or too small might not yield useful information.
       • A very large time window results in fully connected clusters that are connected to each other; with the largest possible size, the whole network becomes one big clique.
       • A very small time window may generate networks in which many nodes are disconnected.
       • Hence we used 12 different lengths (1, 2, 3, 4, …, 12 hours) to define the relations among Twitter users who mention the same health condition.
       • That gives 12 different structures representing the same data set.
  21. Time Window
     2) How should the time window move over the data set?
       • The simplest way is to move the time window event by event (tweet by tweet).
       • The slide illustrates this as a sequence of window positions: Step 1, Step 2, ..., Step 6, and so on.
  22. Weighted Networks
     o Using a time window produces weighted networks:
       • Some pairs of tweets fall inside the time window many times, which gives their link a higher weight.
       • The closer two tweets are in time, the more iterations of the window they can share.
       • Some pairs of tweets share the window only a few times, so their link weights are lower.
       • Other pairs never fall in the same time window, so there is no link between them.
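One possible reading of the time-window procedure on the preceding slides can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `build_weighted_network`, the `(timestamp, user)` input format, and the rule that the window starts at each tweet in turn and adds 1 to the weight of every pair of distinct users inside it are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations


def build_weighted_network(tweets, window_seconds):
    """Build weighted links between users for one state and one health condition.

    tweets: iterable of (timestamp_in_seconds, user) pairs.
    Returns a Counter mapping (user_a, user_b), with user_a < user_b,
    to the number of window positions in which both users appear.
    """
    tweets = sorted(tweets)
    weights = Counter()
    end = 0  # index one past the last tweet inside the current window
    for start, (window_start, _) in enumerate(tweets):
        # Advance the right edge until it passes the end of the window.
        while end < len(tweets) and tweets[end][0] <= window_start + window_seconds:
            end += 1
        # Every pair of distinct users inside the window gains one unit of weight.
        users = sorted({user for _, user in tweets[start:end]})
        for pair in combinations(users, 2):
            weights[pair] += 1
    return weights


if __name__ == "__main__":
    # Toy data: three tweets close together and one far away (times in seconds).
    sample = [(0, "a"), (10, "b"), (20, "c"), (5000, "d")]
    for (u, v), w in sorted(build_weighted_network(sample, 3600).items()):
        print(u, v, w)
```

In the toy data, users b and c share two window positions while a and b share only one, and user d, whose tweet is more than an hour away from the others, ends up with no links. The resulting pairs could be loaded into NetworkX with `Graph.add_weighted_edges_from` for further analysis.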
  23. An Example (figure not preserved in this transcript)
  24. Results
     o Because very few tweets were collected for it, the health condition Pneumonitis due to solids and liquids was excluded from the analysis.
     o The network building process produced 7344 networks (51 states × 12 time windows × 12 health conditions).
  25. Samples of the Generated Networks
     o A sample of the networks that we generated:
       • State: Florida
       • Health condition: Diabetes
       • TW size: 1 hour
       • 11,760 nodes & 160,664 edges
     o Tools used:
       • NetworkX (Python library)
       • Gephi (network visualization tool)
  26. Samples of the Generated Networks: Alabama – Heart Diseases
  27. Samples of the Generated Networks: Alabama – Heart Diseases
  28. Network Analysis
     o The networks were analyzed using three general properties of networks:
       • Degree Distribution
       • Average Path Length
       • Clustering Coefficient
  29. Degree Distribution Analysis
     o Scale-free networks:
       • Degree distribution: the probability of having nodes with a certain degree.
       • Since the networks are weighted, considering the weighted degrees can capture more information about the structure of the networks.
  30. Degree Distribution Analysis
     o In scale-free networks the degree distribution of the nodes is a power-law distribution and follows the function $P(k) \sim k^{-\alpha}$,
     o where the exponent $\alpha$ in most cases falls in the range $2 < \alpha < 3$.
     o A small number of nodes are highly connected (hubs).
     o A large number of nodes have low degrees.
  31. Degree Distribution Analysis
     o The distribution is displayed as a box plot.
     o Each box plot shows the distribution of the weighted degree distribution’s exponent over all the networks that relate to one health condition and were generated using the same time window size.
  32. Degree Distribution Analysis
     o However, in our networks, an exponent between 2 and 3 does not by itself mean that the corresponding degree distribution is a power law.
     o According to Clauset et al. (2009), “Power-law distributions in empirical data”, fluctuations in the degree distribution make power-law behavior hard to identify.
     o Clauset et al. introduced an approach for comparing the power law with other distributions.
  33. Degree Distribution Analysis
     o Using the powerlaw package, we compared the power law with:
       • the exponential distribution
       • the log-normal distribution
       • the truncated power-law distribution
     o Each comparison, distribution_compare(’dist_1’, ’dist_2’), returns the values R and p:
       • R: the likelihood ratio between the two distributions
       • p: the significance of the result
     o Interpretation:
       • if R > 0 and p is significant (p < 0.05), the data behave more like the first distribution.
       • if R < 0 and p is significant (p < 0.05), the second distribution is favored.
       • otherwise the behavior of the distribution is unclear.
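The decision rule on this slide can be written down as a small Python function. This is a hedged sketch: `interpret_comparison` and `compare_weighted_degrees` are names invented here, and the second function assumes the third-party `powerlaw` package (its `Fit` class and `distribution_compare` method) is installed; the slides do not show how the authors actually called it.

```python
def interpret_comparison(R, p, first, second, alpha=0.05):
    """Apply the slide's rule to a likelihood-ratio result (R, p).

    Returns the name of the favored distribution, or "unclear" when the
    result is not significant at the given alpha level.
    """
    if p < alpha and R > 0:
        return first
    if p < alpha and R < 0:
        return second
    return "unclear"


def compare_weighted_degrees(weighted_degrees, first="power_law", second="exponential"):
    """Fit the data and compare two candidate distributions.

    Assumes the `powerlaw` package is available; imported lazily so the
    rule above can be used without it.
    """
    import powerlaw

    fit = powerlaw.Fit(weighted_degrees)
    R, p = fit.distribution_compare(first, second)
    return interpret_comparison(R, p, first, second)


if __name__ == "__main__":
    # Hand-picked (R, p) values that exercise the three branches of the rule.
    print(interpret_comparison(2.1, 0.01, "power_law", "exponential"))
    print(interpret_comparison(-3.0, 0.001, "power_law", "lognormal"))
    print(interpret_comparison(1.0, 0.40, "power_law", "truncated_power_law"))
```

Keeping the rule separate from the fitting call makes the three outcomes (first favored, second favored, unclear) easy to check by hand before running it over all networks.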
  34. Example of a comparison between different distributions (figure not preserved in this transcript)
  35. Degree Distribution Analysis
     o Only 117 out of 7344 networks have a power-law distribution (2.3% of all the networks).
     o Among the 117 networks that follow a power law, more were generated using the one-hour time window than using any other window size.
  36. Network Analysis
     o Small-world networks:
     o Average path length:
       - determines the average number of hops between a pair of nodes.
       - defined as $\ell = \frac{1}{n(n-1)} \sum_{i \neq j} d(i,j)$, where $n$ is the number of nodes
       - $d(i,j)$ is the length of the shortest path between nodes $i$ and $j$
     o Clustering coefficient:
       - measures how tightly the nodes are connected to each other.
       - defined by a formula on the slide (not preserved in this transcript)
       - it captures the tendency of nodes to cluster with each other.
     o In small-world networks the average path length is low and the clustering coefficient is high.
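Because the clustering coefficient formula did not survive in the transcript, the sketch below assumes one common definition: the average of the local clustering coefficients, where each node's coefficient is the fraction of pairs of its neighbours that are themselves linked, ignoring edge weights. The function name `average_clustering` and the dictionary-of-sets graph format are choices made for this example; the authors may have used a different (for instance weighted) variant.

```python
from itertools import combinations


def average_clustering(adjacency):
    """Average local clustering coefficient of an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours
    (no self-loops). Nodes with fewer than two neighbours count as 0.
    """
    if not adjacency:
        return 0.0
    total = 0.0
    for neighbours in adjacency.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count the links that exist among this node's neighbours.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)


if __name__ == "__main__":
    triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(average_clustering(triangle))  # every neighbour pair is linked
    print(average_clustering(path))      # no neighbour pair is linked
```

A clique scores 1 and a simple chain scores 0, which matches the intuition on the slide: large time windows push the networks toward cliques and therefore toward high clustering.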
  37. Average Path Length Evaluation: the distributions of the networks’ average path length (figure not preserved in this transcript)
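  38. Average Path Length
     o Why does the average path length increase as the time window grows? The slide compares three example networks, i, ii and iii.
     o For a two-node component: $\ell = \frac{1}{2 \cdot 1}(1 + 1)$
     o For a three-node component: $\ell = \frac{1}{3 \cdot 2}(1 + 1 + 1 + 2 + 1 + 2)$
     o i) $\ell = \frac{1}{2 + 6}(2 + 8) = \frac{10}{8} = 1.25$
     o ii) $\ell = \frac{1}{6 + 6}(8 + 8) = \frac{16}{12} = 1.33$
     o iii) $\ell = \frac{1}{30}(58) = 1.933$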
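The worked values for examples i and ii can be reproduced with a short breadth-first search. This sketch assumes, based on the sums shown on the slide, that distances are averaged over all ordered pairs of nodes that can reach each other, so disconnected components each contribute their own pairs. The function name `average_path_length` and the two toy graphs (a linked pair plus a three-node chain, and two three-node chains) are reconstructions for illustration, not the slide's original figures.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over all ordered pairs of mutually reachable nodes.

    adjacency: dict mapping each node to the set of its neighbours (undirected).
    """
    total_distance = 0
    ordered_pairs = 0
    for source in adjacency:
        # Breadth-first search gives hop counts from `source` to its component.
        distances = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distances:
                    distances[neighbour] = distances[node] + 1
                    queue.append(neighbour)
        total_distance += sum(distances.values())
        ordered_pairs += len(distances) - 1  # exclude the source itself
    return total_distance / ordered_pairs if ordered_pairs else 0.0


# Example i: a linked pair plus a three-node chain.
example_i = {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c", "e"}, "e": {"d"}}
# Example ii: two three-node chains.
example_ii = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d", "f"}, "f": {"e"},
}

if __name__ == "__main__":
    print(average_path_length(example_i))
    print(average_path_length(example_ii))
```

As components merge into larger ones under a longer time window, more distant pairs become reachable and enter the average, which is why the value rises from example i to example ii.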
  39. Clustering Coefficients Evaluation: the distributions of the networks’ clustering coefficients (figure not preserved in this transcript)
  40. Conclusion
     o Since only 2.3% of the networks’ weighted degree distributions (WDD) follow a power law, the majority of the generated networks do not have the characteristics of scale-free networks.
     o However, a power-law degree distribution is not a necessary condition for social networks.
     o The networks generated using the one-hour time window have the lowest average path length.
     o Regardless of the time window size, the majority of the networks have a high clustering coefficient.
     o The time window (TW) approach recovered the properties of small-world networks, in which the average path length is low and the clustering coefficient is high.
     o A higher level of awareness about a disease does not lead to more clustered networks.
  41. Thank you
