Alz Hack II

AlzHack
Data-Driven Diagnosis of Alzheimer's Disease
Frank Kelly

Our goal:
Diagnose Alzheimer’s disease as early as possible
Benefit to millions of people (potentially)

Why is Alzheimer’s disease diagnosis important?
Chronic neurodegenerative disease
60-70% of dementia cases = Alzheimer's
48 million people affected worldwide (2015)
Wrecks people’s lives (+ their families’)
800,000 people (in the UK) formally diagnosed
Only 43% of those with the condition get a diagnosis
Figures: wikipedia & http://www.bbc.co.uk/science/0/21878238

A global, mounting problem
Demographic changes mean it will be more widespread
By 2050 the number of dementia sufferers is expected to triple
Chart credit: economist.com

How is Alzheimer’s disease diagnosed today?
Medical history
Mental status tests
Physical and neurological examination
Blood tests and brain imaging
Example test sheet: http://www.ftdrg.org/wp-content/uploads/4a-CCT_revised-Picture-stimulus.pdf

A gradual decline
[Timeline: -20 years, -15 years, -10 years, -5 years, Death]
Stages: earliest Alzheimer’s, mild to moderate, severe
Annotation: common diagnosis period

Who are we?
Full bios: https://alzhack.wordpress.com
What is our approach?
We’re doing citizen science
● No lab, or lab coats
● Readily available data
● Other people’s research

Diagnose Alzheimer’s disease as early as possible
Why?
Participate in clinical drug trials
Benefit from treatment
More time to plan
Take own decisions
Better carer relationship
Reduce anxieties about unknowns
Sketch: http://www.businessfinancenews.com/28526-will-astrazeneca-plc-and-eli-lilly-give-breakthrough-in-alzheimers/

Design of Study & Data Collection

How the disease manifests itself
Protein plaques and tangles accumulate in the brain:
Disrupts communication between nerve cells
Kills nerve cells
Causes loss of brain tissue
Facts: https://www.alzheimers.org.uk/site/scripts/documents_info.php?documentID=100
Imagery: www.alz.org

How the disease manifests itself (1)
Starts in the hippocampus
Harder to form new memories
Difficult to recollect events from days or hours ago
Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA

How the disease manifests itself (2)
...then takes root in other areas
2. Language processing
3. Logical thought
4. Emotions
5. Senses
6. Older memories
7. Balance and coordination
Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA

Relevant symptoms
Confusion with time/place
Spatial memory
Problems with words
Misplacing items
Decreased / poor judgment
Withdrawal from work
Mood change
Difficulty with familiar tasks
Challenges in planning
Speech
Short term memory loss
[Timeline: -20 years, -15 years, -10 years, -5 years, Death]
Stages: earliest Alzheimer’s, mild to moderate, severe

Previously: Analysis of a single user’s emails
● An Alzheimer’s disease sufferer’s emails over 4 years
● Conversion of email text to vectors
● Counts, lengths and other metrics
Features
Memory, language and sentiment related metrics extracted

Results
Some “explainable” trends
Challenges
Single user: lack of data and likely bias
Scaling up: security concerns & deletion

Forum post scraping
First lxml, then BeautifulSoup
● Two sub-forums
● ~3,600 threads
● ~78,000 posts
○ Post content
○ Post metadata
○ User metadata

Data preparation
● Input: post data from the two sub-forums
● Content punctuation sanitised by regexp substitutions
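As a rough illustration of sanitising punctuation with regexp substitutions, here is a minimal sketch. The slides do not list the actual substitution rules, so the three rules below (straightening curly quotes, collapsing repeated marks, collapsing whitespace) and the function name `sanitise` are assumptions chosen for illustration only.

```python
import re


def sanitise(text):
    """Normalise punctuation and whitespace in a raw forum post.

    The specific rules are illustrative assumptions, not the project's own.
    """
    text = re.sub(r"[\u2018\u2019]", "'", text)   # curly single quotes -> '
    text = re.sub(r"[\u201c\u201d]", '"', text)   # curly double quotes -> "
    text = re.sub(r"([!?,;:])\1+", r"\1", text)   # "!!!" -> "!", "??" -> "?"
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip()


if __name__ == "__main__":
    print(sanitise("I  can\u2019t find it!!!  \n Why??"))
```

Ellipses are deliberately left alone here, since hesitation markers such as “ummm...errr” are counted as a feature later in the deck.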

How do we label a user?
● Users frequently post in both sub-forums
● To differentiate:
○ Assume that OPs (thread starters) in a sub-forum are of that category
○ Otherwise look at ratio of posts (replies) between the two sub forums*
FP = First Post in thread SP = Subsequent Post in thread
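The two labelling rules could be sketched roughly as below. The slides give no ratio cut-off, so `ratio_threshold` is a hypothetical parameter, as are the label strings and the choice to fall back to the reply ratio when a user has started threads in both sub-forums.

```python
from collections import Counter


def label_user(posts, ratio_threshold=2.0):
    """Label a user as "dementia", "partner" or "unknown".

    posts: iterable of (subforum, is_first_post) pairs, where subforum is
    "dementia" or "partner". ratio_threshold is a hypothetical cut-off.
    """
    posts = list(posts)
    first_posts = Counter(sf for sf, is_fp in posts if is_fp)
    replies = Counter(sf for sf, is_fp in posts if not is_fp)

    # Rule 1: a user who started threads in exactly one sub-forum
    # is assumed to belong to that category.
    started = [sf for sf in ("dementia", "partner") if first_posts[sf] > 0]
    if len(started) == 1:
        return started[0]

    # Rule 2: otherwise compare reply counts between the two sub-forums.
    d, p = replies["dementia"], replies["partner"]
    if d > 0 and d >= ratio_threshold * p:
        return "dementia"
    if p > 0 and p >= ratio_threshold * d:
        return "partner"
    return "unknown"  # such users would be discarded
```

For example, a user with three replies in the partner sub-forum and one in the dementia sub-forum would be labelled "partner" under the default threshold.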

How do we label a user?
[Flowchart: thread starts and replies lead to a Dementia or Partner label; Unknown users are discarded]

‘Mood change’ as a feature
● Sentiment “polarity” (out-of-the-box via NLTK & TextBlob)
● Alternatively can train your own text classifier:
http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/

● Average of sentence sentiments per post
● Slightly higher sentiment for dementia sufferers’ posts

Language-oriented features
Lexical functions and comprehension functions:
● Empty phrases: “go ahead” phrases, counts of “ummm...errr”
● Paraphasias and neologisms: unintended or invented words, words that are not in common usage
● Vocabulary-related: difficult words count
● Readability: Dale-Chall readability, Flesch-Kincaid, Flesch Reading Ease

Simple language features
● Sentence count
● Word count
● Words per sentence
● Unique word count
● Unique words to total ratio
● “Go Ahead” words (Empty phrases)
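The simple features listed above might be computed along these lines. This is a sketch only: the sentence and word splitting is deliberately naive, and the `FILLER` pattern standing in for “go ahead” words is a hypothetical choice, since the slides do not give the actual word list.

```python
import re

# Hypothetical filler pattern ("um", "ummm", "er", "errr", "uh");
# the real "go ahead" word list is not given in the slides.
FILLER = re.compile(r"u+m+|e+r+|u+h+")


def simple_features(post):
    """Return basic count-based language features for one post."""
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"[a-z']+", post.lower())
    unique = set(words)
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "unique_word_count": len(unique),
        "unique_ratio": len(unique) / len(words) if words else 0.0,
        "filler_count": sum(1 for w in words if FILLER.fullmatch(w)),
    }


if __name__ == "__main__":
    print(simple_features("Ummm... I went to the shop. The shop was closed!"))
```

Note that a standalone “Ummm...” counts as its own sentence under this splitting rule, which a proper sentence tokenizer may treat differently.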

Readability
(package readability-lxml)
● Avg syllables per word
● Avg letters per word
● Flesch reading ease
● Flesch-Kincaid grade
● Polysyllable count
● Automated readability index
● Number of “difficult” words
● Dale-Chall readability score
● Gunning fog
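To show what one of these scores measures, here is a hand-rolled sketch of Flesch reading ease, using the standard formula 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). It is not the package used in the project: the syllable counter is a crude vowel-group heuristic assumed for illustration, so its scores will differ somewhat from a library implementation.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # treat a trailing "e" as silent
    return max(n, 1)


def flesch_reading_ease(text):
    """Flesch reading ease; higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))
```

Short sentences made of one-syllable words score highest, so a drop in this score over time would indicate more complex text and a rise would indicate simpler text.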

Memory-oriented features
● Sort posts by username and timestamp, add a shifted column

● Apply comparison function between post and previous post:
○ NLTK edit_distance (fuzzy match)
○ Cosine similarity between TF-IDF vectors
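The “compare each post with the user's previous post” idea can be sketched without any dependencies as below. The TF-IDF weighting here is a simple smoothed variant written out by hand, and the tuple layout `(username, timestamp, text)` is an assumed input format; a library vectoriser and a pandas shifted column would do the same job in practice.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency (an illustrative choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_to_previous(posts):
    """posts: list of (username, timestamp, text) tuples.

    Returns (username, timestamp, similarity) sorted by user and time;
    similarity is None for each user's first post.
    """
    ordered = sorted(posts, key=lambda p: (p[0], p[1]))
    vecs = tfidf_vectors([p[2] for p in ordered])
    out = []
    for i, (user, ts, _) in enumerate(ordered):
        same_user = i > 0 and ordered[i - 1][0] == user
        sim = cosine(vecs[i], vecs[i - 1]) if same_user else None
        out.append((user, ts, sim))
    return out
```

A similarity close to 1 means the user largely repeated their previous post, which is the kind of signal a memory-oriented feature is meant to pick up.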

Part of speech (POS) features
● Tag words and tally up frequencies
● Calculate “rates”
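A minimal sketch of turning tag tallies into rates follows. It assumes the words have already been tagged with Penn Treebank tags (the format returned by `nltk.pos_tag`), and the grouping of tags into five coarse classes via `PREFIXES` is an illustrative assumption rather than the project's own scheme.

```python
# Coarse word classes keyed by Penn Treebank tag prefix (illustrative grouping).
PREFIXES = {
    "verb": "VB",
    "noun": "NN",
    "adjective": "JJ",
    "adverb": "RB",
    "pronoun": "PRP",
}


def pos_rates(tagged_words):
    """tagged_words: list of (word, tag) pairs for one post.

    Returns each class count divided by the total number of tagged words.
    """
    total = len(tagged_words)
    counts = {name: 0 for name in PREFIXES}
    for _, tag in tagged_words:
        for name, prefix in PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
                break
    return {f"{name}_rate": (c / total if total else 0.0)
            for name, c in counts.items()}


if __name__ == "__main__":
    print(pos_rates([("I", "PRP"), ("lost", "VBD"), ("my", "PRP$"), ("keys", "NNS")]))
```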

Explanatory or predictive modelling?
● Actually both.
● First, ‘interpret’ a classifier (explanatory)
● Second, we need a ‘real-time’ detection system (predictive)

Data modelling strategy (used for initial ML runs)
Aggregation of posts
● pandas: groupby, agg by username
Balancing out the dataset
● Many more partner users than sufferers
● Subsample larger (partner) dataset to even things up
Validate using random train and test sets
● Randomly select 80% of users for training, 20% test
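The balancing and splitting steps could look roughly like this dependency-free sketch. The function name, the `seed` parameter and the dict input format are assumptions; the per-user aggregation itself (pandas groupby and agg) is not shown.

```python
import random


def balance_and_split(user_labels, train_fraction=0.8, seed=0):
    """Subsample every class to the size of the smallest, then split users.

    user_labels: dict mapping username -> label (must not be empty).
    Returns (train_users, test_users).
    """
    rng = random.Random(seed)
    by_label = {}
    for user, label in sorted(user_labels.items()):
        by_label.setdefault(label, []).append(user)

    # Even things up: keep as many users per class as the rarest class has.
    smallest = min(len(users) for users in by_label.values())
    balanced = []
    for users in by_label.values():
        balanced.extend(rng.sample(users, smallest))

    # Random 80/20 split at the user level.
    rng.shuffle(balanced)
    cut = int(len(balanced) * train_fraction)
    return balanced[:cut], balanced[cut:]
```

Splitting by user rather than by post keeps all of one person's posts on the same side of the split, so the test set only contains unseen users.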

Model Results for Misc. Features
● Median values (aggregated over all posts per user)
Best: SVM Radial basis function classifier (with grid search)
User classification accuracy: 57%

Model Results for Memory Features
Best: K-nearest neighbours Classifier

Model Results for Readability Features
● Median values (aggregated over all posts / user)
Best: K-nearest neighbours Classifier

Model Results for Part-Of-Speech Features
Best: SVM Radial basis function classifier (with grid search)

Model Results for All Features
Best: Naïve Bayes Classifier

Re-think: Classify posts, not users
● Currently group by userID
● Some users post more than others
● Posts would utilise full “richness” of the dataset
● Double round of sampling required on post set:
○ 3 - 4 times more “partners” than dementia sufferers
○ Partners post approx. 3 times more posts than sufferers do

Model Results for All Features (by post)
● Filtered set of posts
Best: Random Forest Classifier
Accuracy of 68% in classifying a post

Results in summary
● Best performing feature group so far on aggregated set by user:
○ Memory-based features
● Best performing individual feature on aggregated set by user:
○ Verb rate = ratio of verbs to word count in post
● Best performing individual feature on individual post:
○ Cosine similarity to previous post
● Aligns with symptoms expected in early stage to mild dementia

Future avenues
● Data
○ Further data gathering (more blogs, including blogs on non-Alzheimer's topics)
○ Better user identification (e.g. active learning)
● Features
○ More and better features
○ Distinguish individual types of dementia
○ More memory-related features (e.g. LSI)
● Clustering of posts into ‘topics’ or users into ‘types’
○ gensim / LDA topic modelling
○ Early stage / medium condition / advanced condition posters
● Classification and modelling
○ Time series analysis
○ New sampling techniques, input validation and models

Future: Time series analysis
● Noisy datasets
○ Apply numerical Bayesian inference
● Are we looking for a steady change in the mean?
○ Ramp detection
● Or a sudden change in variance?
○ Step change detection
[Chart legend: Dementia sufferer, Partner]

Conclusions
● Introduction to Alzheimer’s and its impact
● Explanation of our technical approach and surrounding challenges
● Initial observations and predictions
● Tough problem and a worthwhile cause for data science
● Please contact us if you would like to help, or have ideas:
frank.kelly@cantab.net https://alzhack.wordpress.com/contribute-2/
Thank you!
