2. About Me
Position:
• Healthcare data scientist
Tools:
• Python (any library that is useful), R,
PySpark, SQL, Julia
• A well-worded email
Hobbies:
• Backpacking
• Long-distance running
• Eating a lot after running
5. Why healthcare data science is so difficult
Privacy
• Privacy is paramount
• Data is very difficult to anonymize
Data is the new gold
• Healthcare data is extremely valuable
• Models trained on sandboxed data are very valuable and it
is unclear to what extent they can be reverse-engineered
Healthcare Data Is Messy:
• EMR (Electronic Medical Records) data are notoriously
complicated and messy
• Not standardized
• Domain and specialty heavy
6. Problem Statement
Getting from B -> A
● B: Be able to have a system such as Google Smart Reply or a health provider community forum
● …
● …
● …
● …
● …
● A: Use open-source data to demonstrate intelligent, contextually appropriate health-related responses
7. Where can you get text data from health discussions with medical professionals without violating anyone’s privacy?
10. The Data
Summary statistics:
• Data gathered from 2014 to 2018
• % of posts by Reddit certified medical professionals: 49.0%
• ~ 30k Threads
• ~ 106k posts (after cleaning)
• ~ 26.3k Users
• ~ 16.3k Medical Professionals
Notes about data:
• More intent in conversations than is typical of ordinary forum or 1-to-1 conversations (especially on Reddit)
• Signs of a larger male presence, especially in threads about male-specific problems
• Surprisingly tame for Reddit!
14. Scope
Descriptive vs prescriptive:
● Framing the problem for the “patient” as
opposed to trying to solve it for them
Addressing specifics within the
problem:
● Number of speakers
● Change in topic
● Outcome
15. Structuring For ML
For each thread:
○ The thread start becomes the initial query. The first-level responses that follow become the available responses to the initial query
○ E.g.
User 1: initial question
User 2: response
User 3: response
User 2: response
User 3: response
User 1: response
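A minimal sketch of the structuring step above, assuming each post carries a reply depth (0 for the thread start, 1 for a first-level response). The `Post` class and its field names are hypothetical, not the actual Reddit schema or the pipeline used in the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Post:
    user: str   # hypothetical field names, for illustration only
    text: str
    depth: int  # 0 = thread start, 1 = first-level response, 2+ = deeper replies


def thread_to_pairs(posts: List[Post]) -> List[Tuple[str, str]]:
    """Pair the thread start (the query) with each first-level response."""
    if not posts or posts[0].depth != 0:
        return []
    query = posts[0].text
    return [(query, p.text) for p in posts[1:] if p.depth == 1 and p.text.strip()]


thread = [
    Post("User1", "initial question", 0),
    Post("User2", "response A", 1),
    Post("User3", "response B", 1),
    Post("User1", "follow-up", 2),  # deeper reply, not paired with the initial query
]
pairs = thread_to_pairs(thread)
```

Deeper replies could be handled the same way by treating their parent post as the query, which is one way a single thread might yield several query/response pairs.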
16. Looking at Turns
● 96k query/response pairs
Turns
● How dynamic a thread/conversation is
● Mean: 4.5
● Median: 4.0
● Min: 2
● Max: 106
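The slide does not define a turn precisely. The sketch below assumes a turn is a run of consecutive posts by the same speaker, and shows how summary statistics like the ones above could be computed on toy threads; the data here is made up.

```python
from itertools import groupby
from statistics import mean, median


def count_turns(speakers):
    """Count turns as runs of consecutive posts by the same speaker (an assumption)."""
    return sum(1 for _ in groupby(speakers))


# Toy threads: each list is the sequence of posters in one thread.
threads = [
    ["User1", "User2", "User3", "User2", "User3", "User1"],
    ["User1", "User2"],
    ["User1", "User2", "User2", "User1"],  # two posts in a row by User2 count as one turn
]
turns = [count_turns(t) for t in threads]
summary = {
    "mean": mean(turns),
    "median": median(turns),
    "min": min(turns),
    "max": max(turns),
}
```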
17. Query:
Alright. Give me a bit. Any
food you can't eat? Egg
products?
Possible Answer 4:
Thanks It's an evolving situation. It seems that it takes awhile for
everyone to get on the same page regarding his medications and
test results. Apparently the positive staph results they got could
have very well been caused by poor procedure. It is all the little
things that are adding up, like the meals, or coming in and asking
a bedridden 93 year old incontinent man, suffering from
dementia, if he needs to get up and go to the rest room and then
leaving when he fails to answer. I'm trying to find a patient
advocate or something now. Supposed to meet with a 'social
worker' as well, but I don't know when.
Possible Answer 1:
I agree with @stephaniecaseys that it is a skin tag. Also not a doc, but I
had mine removed at the dermatologist. It was very quick. It was
messing with my bra strap. They can grow bigger as well, especially if
you have a tendency to mess with it. It will swell.
Possible Answer 3:
You're not the first person to have that happen to. Happens to me just
about every time I'm out in the bush. Unless you are in an area with
known cases of Lyme disease, I doubt it is of any concern.
Possible Answer 2:
I can eat pretty much anything
21. Modeling Considerations
● Generative modeling is cool, but has disadvantages:
○ Requires lots of data
○ May not be precise enough for a health context
○ Pretty far away from being able to augment responses
Keras Blog: Seq2Seq
Cryptic response with generative
approach:
Q: Husband deteriorating before my eyes,
doctors at a loss, no one will help; Reddit
docs, I need you.
A: I don't think this is a single pain is not
a doctor but I have a similar symptoms
and the story
22. Response Retrieval Approach
● Easier to measure and optimize than a generative approach
● Easier for a business user to apply rules toward a directed outcome:
● Allows a clinician to automate follow-up questions that they might often ask in email/phone/visit exchanges
Changes from Original Paper: The Ubuntu Dialogue Corpus:
A Large Dataset for Research in Unstructured Multi-Turn
Dialogue Systems
Query:
I’ve had pain in my knee after running. Why?
Responses:
Non-relevant answer 1
Non-relevant answer 2
Relevant answer 1
Non-relevant answer 3
Relevant answer 2
23. Dual Encoder
● Embeddings are initialized with pre-trained Common Crawl GloVe vectors (840B tokens, 300d) and allowed to train on new utterances & responses
● 1 hidden layer per encoder: one for the context, one for the response
● Different choices are made for the output, but it is often a binary probability over up to 10 possible responses
Image credit: The Ubuntu Dialogue Corpus: A Large Dataset for Research in
Unstructured Multi-Turn Dialogue Systems
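A rough sketch of the scoring step in a dual encoder, following the bilinear form sigmoid(c^T M r + b) described in the Ubuntu Dialogue Corpus paper. To stay dependency-free, the recurrent encoders are swapped for a mean-of-embeddings stand-in, the embeddings are random 4-d toys rather than GloVe vectors, and `M` is an untrained identity matrix. All of these are assumptions for illustration, not the model from the talk.

```python
import math
import random

DIM = 4  # toy embedding size; the slide uses 300d GloVe vectors


def encode(tokens, embeddings):
    """Stand-in encoder: average the token embeddings (the talk uses an LSTM/GRU)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def score(context_vec, response_vec, M, bias=0.0):
    """Bilinear match probability: sigmoid(c^T M r + b)."""
    Mr = [sum(M[i][j] * response_vec[j] for j in range(DIM)) for i in range(DIM)]
    logit = sum(c * m for c, m in zip(context_vec, Mr)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
vocab = ["knee", "pain", "running", "rest", "ice", "thanks"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
# Identity matrix as an untrained placeholder; in training, M is learned.
M = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]

context = encode(["knee", "pain", "running"], embeddings)
candidates = {"rest ice": ["rest", "ice"], "thanks": ["thanks"]}
scores = {name: score(context, encode(toks, embeddings), M)
          for name, toks in candidates.items()}
best = max(scores, key=scores.get)  # candidate with the highest match probability
```

Each candidate gets its own probability of being the right response to the context, so ranking up to 10 candidates is just a sort over these scores.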
24. How Well Is Our Dual Encoder Doing?
Method/Metric | Expected recall due to chance | TF-IDF | LSTM | GRU
1 in 2 R@1 | 50.0% | 64.0% | 91.1% | 92.1%
1 in 5 R@1 | 20.0% | 54.5% | 75.5% | 79.6%
1 in 10 R@1 | 10.0% | 48.7% | 65.1% | 68.7%
The above metrics seek to quantify how good the models are at extracting the relevant responses. Recall@k is a standard metric for response retrieval models.
E.g. “1 in 2 R@1” translates to: “With 1 correct answer among 2 possible answers, what % of the time is the correct answer chosen when only 1 is selected?”
Not bad!
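A small sketch of how a “1 in N R@k” figure can be computed, assuming each query has exactly one correct response among N candidates and that candidates are already sorted by model score. The function names and toy labels are illustrative only.

```python
def recall_at_k(ranked_labels, k=1):
    """Fraction of queries whose correct response lands in the top k.

    ranked_labels: one list per query, candidates sorted by model score
    (best first), with 1 marking the correct response and 0 a distractor.
    """
    hits = sum(1 for labels in ranked_labels if 1 in labels[:k])
    return hits / len(ranked_labels)


def chance_recall(n_candidates, k=1):
    """Expected recall when picking k of n candidates uniformly at random."""
    return k / n_candidates


# Toy "1 in 5" evaluation: four queries, five candidates each.
ranked = [
    [1, 0, 0, 0, 0],  # correct response ranked first
    [0, 1, 0, 0, 0],  # correct response ranked second
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],  # correct response ranked fourth
]
r_at_1 = recall_at_k(ranked, k=1)
r_at_2 = recall_at_k(ranked, k=2)
```

`chance_recall` gives the 50%, 20%, and 10% baselines for the 1 in 2, 1 in 5, and 1 in 10 settings respectively.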
25. Future Directions
1. Create intents for all utterances using smart indexing from something like UMLS
2. Refine the problem statement to exploit the potential of the data
3. Use transfer learning from large GloVe vectors on a large corpus of Reddit comments to better capture the semantics of comments
4. Experiment with predicting the next response given all thread history
5. Try hierarchical attention instead of an RNN, as these models are showing a lot of promise
27. UMLS
Unified Medical Language System
● Brings together many health
vocabularies and standards
● Contains 3 tools:
○ Metathesaurus: Terms and codes from
many vocabularies, including CPT®,
ICD-10-CM, LOINC®, MeSH®, RxNorm,
and SNOMED CT®
○ Semantic Network: Broad categories
(semantic types) and their relationships
(semantic relations)
○ SPECIALIST Lexicon and Lexical
Tools: Natural language processing
tools
28. Modeling Considerations
● Rule-based:
○ Expensive to build
○ Expensive to maintain
○ Requires precise knowledge
○ E.g.: “If patient has cough, check for fever”; if in flu and fever, then patient has flu
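The example rules can be written out as a tiny hand-coded function. This is only a sketch: it assumes “in flu” on the slide means “in flu season”, and the `findings` dictionary and return strings are hypothetical. It also hints at why such systems get expensive, since every new symptom combination needs another branch.

```python
def next_step(findings, flu_season):
    """Two hand-written rules mirroring the slide's example.

    findings maps a symptom name to True/False; a missing key means
    the symptom has not been checked yet.
    """
    # Rule 1: a cough with no recorded fever status prompts a fever check.
    if findings.get("cough") and "fever" not in findings:
        return "check for fever"
    # Rule 2: fever during flu season (assumed reading of "in flu") implies flu.
    if flu_season and findings.get("fever"):
        return "patient has flu"
    return "no rule matched"


first = next_step({"cough": True}, flu_season=True)
second = next_step({"cough": True, "fever": True}, flu_season=True)
```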
30. Health Concept Entities
• 61 Total Distinct Concept Entities
• Mean of 20 health entity behaviors per
utterance
31. Identifying Health Intents
● Ranking of utterances by amount of health behavior content, normalized by length
[("Hey, how's your husband doing now? Hope everything is okay.", 0.0),
('I know this is beside your point, but youre 16 and know you have high cholesterol. How? Why?',
1.1831455647530769),
("So why are you posting on here then, if you had two 'real' doctors giving you advice? What answer are you looking for here? ",
1.2548015599879458),
('How long ago did you change your diet, as in when did you have the kidney stones?',
1.2584445094145018),
('How old is your partner?\n\nDo you know her diagnosis (ie why they did her surgery)?',
1.2593890745195433),
('How long ago was your thyroid levels checked? Do you have any pain in your abdomen?',
1.2635706685998762),
('Obviously you got blood work done here, any strange findings?',
1.2784053615073325),
("And you want to know if he did the burn on purpose or not? There's absolutely no way of telling, you'll have to ask him I guess.",
1.311204647849376),
('dysarthria (slurring your words) is not a common side effect of synthroid. Do you take birth control pills? Do you have a history of migraines? \n\n Have you had any numbness, weakness, paralysis or changes in the way you walk?\nHave you noticed any other weird things going on with your body when you have these episodes? \n\nHave you had trouble with your vision recently? Any trouble swallowing? Any sudden loss of vision in one eye ever, even a long time ago? ',
1.312565699208395),
('How long have you been suffering? Hope you get rid of dat nasty pain soon.',
1.3201720990394112)]
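A minimal sketch of this kind of ranking, assuming a simple score of health-term matches divided by token count. The scores on the slide exceed 1, so the normalization used in the talk is evidently different and is not specified; the tiny `HEALTH_TERMS` lexicon is a hypothetical stand-in for concepts drawn from something like UMLS.

```python
import re

# Hypothetical mini-lexicon standing in for UMLS-derived health concepts.
HEALTH_TERMS = {"cholesterol", "kidney", "thyroid", "pain", "abdomen", "blood", "migraines"}


def health_density(utterance):
    """Health-term matches per token (an assumed normalization by length)."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in HEALTH_TERMS)
    return hits / len(tokens)


utterances = [
    "Hey, how's your husband doing now? Hope everything is okay.",
    "How long ago was your thyroid levels checked? Do you have any pain in your abdomen?",
    "Obviously you got blood work done here, any strange findings?",
]
# Sort ascending by score, matching the order of the list on the slide.
ranked = sorted(((u, health_density(u)) for u in utterances), key=lambda pair: pair[1])
```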
33. # of Initial Words vs Quantity of Responses
● No real relationship between the # of words in the initial post and the number of responses: that is, writing a long initial post doesn’t help or hurt your chances of getting a lot of input
34. Query:
Because TMAU and many
other metabolic diseases
are very rare. Your other
question is a bit too
general to answer, but
yes, there are advances
all the time in
gastroenterology and
metabolic diseases.
Possible Answer 4:
Could it be tonsil stones perhaps? I have both swollen tonsils
after Is it rare because people don't have the disease, or
because it goes undiagnosed? Or is it difficult to diagnose?
There's a large community over on the MEBO Research site,
with members easily in the thousands. What defines a "rare"
disease? What percentage of the population has to have a
similar or identical diagnosis for it be considered common?
Possible Answer 2:
Thanks a lot! :) You've given me peace of mind.
Possible Answer 3:
Just don't take anymore, you aren't withdrawing from anything. Cut the
b***.
Possible Answer 1:
Curiosity... closure... a few other reasons. I
understand that it would depend on the cause of
death and that if its not a P.E., they might not be
able to tell what it was now depending on how the
body was prepared. But... specifically regarding the
pulmonary embolism... can they at least say
whether it was a P.E. or not?
Editor's Notes
Tried to structure the talk in proportion to the work that has been done so far, and also to give people outside of healthcare some perspective on why, from my perspective, it poses unique challenges for data science
Healthcare data is extremely valuable: always consider that before you share your data with just any app.
You would think that Kaiser Permanente would be above using Reddit data. Not so!
You are, I think, highly susceptible to apophenia: seeing patterns in random data
Two points to make:
1) The consideration of knowing who is who (medical professional or not)
2) Making sure the optimization is happening more on the clinician side as opposed to the “customer” or
Becomes important as you consider how long the thread goes on
Trimethylaminuria is a disorder in which the body is unable to break down trimethylamine, a chemical compound that has a pungent odor. Trimethylamine has been described as smelling like rotting fish, rotting eggs, garbage, or urine
A Dual Encoder Sequence to Sequence Model for Open-Domain Dialogue Modeling: https://arxiv.org/pdf/1710.10520.pdf
E.g. “1 in 2 R@1” means the model selects 1 recommendation out of 2 candidates, of which 1 is correct; recall is how often the selected recommendation is the correct one
Expensive to build:
Some things are easy: “If patient has cough, check for fever”; if in flu and fever, then patient has flu
Healthcare may be one of the worst examples for this, however, since any one manifestation of a symptom probably means a bunch of things. In fact