Extracting medical attributes and finding relations

Sanghamitra Deb
Accenture Technology Laboratory

Personalized Medicine
[Figure labels: drugs, side effects, ethnicity, dosages, diseases, age group, compounds, gender, interactions, ?, ?, ?]

FDA Drug Labels: Examples
• It is indicated for treating respiratory disorder caused due to allergy.
• For the relief of symptoms of depression.
• Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:
• LOTEMAX is a corticosteroid indicated for the treatment of post-operative inflammation and pain following ocular surgery.

Clinical Trials: Meta Data
We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd).
The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily.
We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case.
Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.

Meta Data
  Dosage     single dose: 240 ml
  Drug       methylphenidate
  # of vol   30mg

Clinical Trials: Side Effects
Drug      Adverse Effects
Ritalin   pancreatitis, abdominal pain
Tylenol   nausea, upper stomach pain, itching, loss of appetite
Aspirin   rash, gastrointestinal ulcerations, abdominal pain, upset stomach, heartburn

Problems it Solves
Drug-disease relationships
• Off-label drug uses
• Database completion
• Design of clinical trials
Relationships between meta-data
• How does heart disease correlate with gender and age?
• Which universities have the most successful clinical trials for breast cancer?
• How are genes and phenotypes related?
• What dosage of Ritalin was most effective in treating ADHD with the fewest side effects?

Course of Action
(1) Extract sentences that contain the specific attribute.
(2) POS tag the sentences and extract unigrams, bigrams and trigrams centered on nouns as candidates.
(3) Map the training data onto the candidates to create a balanced positive and negative training set.
(4) Extract features: the words around the nouns (bag of words or word vectors) and the position of the noun.
(5) Train a machine learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.
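The candidate extraction step can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: it assumes the sentence has already been POS tagged (the Penn Treebank tags below are assigned by hand), and it reads "centered on nouns" as "n-grams that end on a noun and contain only nouns and adjectives", which is an assumption. The function name noun_ngrams is hypothetical.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (lemma, POS tag), tags assigned by hand here


def noun_ngrams(tagged: List[TaggedToken], max_n: int = 3) -> List[Tuple[str, int]]:
    """Return (candidate, start_index) pairs for n-grams that end on a noun."""
    candidates = []
    for end, (_, tag) in enumerate(tagged):
        if not tag.startswith("NN"):
            continue  # only nouns can close a candidate
        for n in range(1, max_n + 1):
            start = end - n + 1
            if start < 0:
                break
            window = tagged[start:end + 1]
            # Assumption: words before the head noun must be nouns or adjectives.
            if all(t.startswith(("NN", "JJ")) for _, t in window):
                candidates.append((" ".join(w for w, _ in window), start))
    return candidates


tagged_sentence = [
    ("reduce", "VB"), ("the", "DT"), ("frequency", "NN"), ("of", "IN"),
    ("manic", "JJ"), ("episode", "NN"),
]
candidates = noun_ngrams(tagged_sentence)
print(candidates)
```

Keeping the start index alongside each candidate makes the position of the noun available as a feature later on.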

Creating Labelled Data
lemmatized_sentence: ['maintenance', 'therapy', 'reduce', 'the',
'frequency', 'of', 'manic', 'episode', 'and', 'diminish', 'the',
'intensity', 'of', 'those', 'episode', 'which', 'may', 'occur', '.']

Several Candidates
Typically one of the candidates is the disease that the drug treats. For every drug we create a set of training data. One line of text produces 5 lines of training data with one true positive.

Candidate       Target   rule-prediction
maintenance     0        1
therapy         0        1
manic episode   1        1
intensity       0        1
episode         0        1

Balancing the Training Data
Since the training data contains a higher percentage of zeros than ones, it is important to balance it before modeling, i.e. to build the model I choose an equal number of zeros and ones.
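One way to choose an equal number of zeros and ones is to downsample the larger class. The sketch below is an assumption about how the balancing could be done, not the author's code; the helper name balance and the dictionary layout of the rows are hypothetical, and the five rows are the candidates from the table above.

```python
import random


def balance(rows, label_key="target", seed=0):
    """Downsample so that both classes have the same number of rows."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


rows = [
    {"candidate": "maintenance", "target": 0},
    {"candidate": "therapy", "target": 0},
    {"candidate": "manic episode", "target": 1},
    {"candidate": "intensity", "target": 0},
    {"candidate": "episode", "target": 0},
]
balanced = balance(rows)
print(balanced)
```

With one positive among five candidates, the balanced set keeps that positive and a single randomly chosen negative.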

Feature Extraction: Word Vectors, Disease Combinations
adhd + manic episode = bipolar disorder
respiratory disorder + allergy = common cold
coronary artery + heart disease = angina pectoris
high blood pressure + lipid = diabetes_management
Extract Features: initialize the vocabulary with pre-trained vectors
gensim: train word2vec on a medical corpus with unigrams, bigrams and trigrams
Produce word vectors
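The disease combinations above rely on adding word vectors and looking up the nearest remaining term. The sketch below shows only that arithmetic with the standard library. The three-dimensional vectors are made-up toy values chosen for illustration, not output of a trained word2vec model, and the helper names cosine and nearest are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest(vectors, terms):
    """Add the vectors of `terms` and return the closest other term."""
    query = [sum(parts) for parts in zip(*(vectors[t] for t in terms))]
    others = [w for w in vectors if w not in terms]
    return max(others, key=lambda w: cosine(query, vectors[w]))


# Toy vectors (assumed values, for illustration only).
toy_vectors = {
    "adhd": [1.0, 0.0, 0.2],
    "manic_episode": [0.0, 1.0, 0.2],
    "bipolar_disorder": [0.9, 1.0, 0.4],
    "common_cold": [0.1, 0.0, 1.0],
}
result = nearest(toy_vectors, ["adhd", "manic_episode"])
print(result)
```

In practice the vectors would come from a word2vec model trained on the medical corpus, with multi-word terms joined into single tokens.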

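Pure Python stack: pandas, scikit-learn, gensim, stanford-nlp-parser

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import FeatureUnion, Pipeline
    # ItemSelector is not a scikit-learn class; minimal assumed version: pick one column.
    class ItemSelector(BaseEstimator, TransformerMixin):
        def __init__(self, column):
            self.column = column
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X[self.column]
    # Classifier slot (ML_library on the slide); liblinear supports penalty='l1'.
    pipeline = Pipeline([
        ('union', FeatureUnion(
            transformer_list=[
                # Pipeline for getting the position of the disease candidate
                ('position', Pipeline([
                    ('selector', ItemSelector(column='candidate')),
                    ('vect', DictVectorizer()),
                ])),
                # Pipeline for getting words around candidates
                ('words_around', Pipeline([
                    ('selector', ItemSelector(column='words_around')),
                    ('count', CountVectorizer()),
                ]))
            ])),
        ('clf', LogisticRegression(penalty='l1', solver='liblinear'))])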
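The two branches of the pipeline read a 'candidate' column and a 'words_around' column. The sketch below shows one plausible way to build a row with those two columns from a lemmatized sentence. The dictionary layout, the window size of 3 and the helper name build_row are assumptions made for illustration; the slides do not specify them.

```python
def build_row(tokens, start, length, window=3):
    """Build one feature row for the candidate tokens[start:start + length]."""
    before = tokens[max(0, start - window):start]
    after = tokens[start + length:start + length + window]
    return {
        # A dict, so a DictVectorizer can consume it (assumed layout).
        "candidate": {"position": start},
        # A plain string, so a CountVectorizer can consume it.
        "words_around": " ".join(before + after),
    }


tokens = ['maintenance', 'therapy', 'reduce', 'the', 'frequency', 'of',
          'manic', 'episode', 'and', 'diminish', 'the', 'intensity', 'of',
          'those', 'episode', 'which', 'may', 'occur', '.']
row = build_row(tokens, start=6, length=2)  # candidate: "manic episode"
print(row)
```

A list of such rows can be loaded into a pandas DataFrame, so that each pipeline branch selects its column by name.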

Machine Learning Workflow: Pure Python stack
Libraries: pandas, scikit-learn, gensim, stanford-nlp-parser
(1) Data Cleaning and Tokenization
(2) Feature Extraction / Candidate Selection
(3) Create Labelled Data
(4) ML: Logistic Regression, …
(5) HyperParameter Tuning
(6) Calculate Metrics: precision, recall, ROC curve, etc.
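The metrics step can be illustrated with a small standard-library sketch. It scores the rule-prediction column of the labelled-data table shown earlier against its Target column; the function name precision_recall_f1 is hypothetical, and in a scikit-learn workflow the library's own metric functions would normally be used instead.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Target and rule-prediction for: maintenance, therapy, manic episode, intensity, episode
target = [0, 0, 1, 0, 0]
rule_prediction = [1, 1, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(target, rule_prediction)
print(precision, recall, f1)
```

A rule that marks every candidate as positive reaches full recall but low precision, which is the gap the trained model is meant to close.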

Results: Examples
drug-name           disease candidate   Candidates   ML
Lithium Carbonate   bipolar disorder    1            1
Lithium Carbonate   individual          1            0
Lithium Carbonate   maintenance         1            0
Lithium Carbonate   manic episode       1            1

Cases where it does not work

'silver sulfadiazine cream usp 1 % be a topical antimicrobial drug indicate as a adjunct for the prevention and treatment of wound sepsis in patient with second and third degree burn .'

Drug                  Candidate      Target   Predict
Silver Sulfadiazine   third degree   0        0
Silver Sulfadiazine   sepsis         0        1
Silver Sulfadiazine   burn           0        1
Silver Sulfadiazine   cream          0        0

['Diltiazem', 'hydrochloride', 'tablet', 'USP', 'be', 'indicate', 'for', 'the', 'management', 'of', 'chronic', 'stable', 'angina', 'and', 'angina', 'due', 'to', 'coronary', 'artery', 'spasm', '.']

Drug                      Candidate         Target   Predict
Diltiazem Hydrochloride   spasm             1        0
Diltiazem Hydrochloride   coronary artery   1        0
Diltiazem Hydrochloride   stable angina     0        0
Diltiazem Hydrochloride   angina            0        0

Exploring Modeling Techniques
Method                Precision   Recall   F1     ROC Curve
Logistic Regression   0.95        0.95     0.95   0.92
LR + word2vec         0.94        0.94     0.94   0.9
SVM                   0.96        0.95     0.95   0.92
Random Forest         0.96        0.96     0.96   0.9

Clinical Trials Data
We present a case of a 10-year-old boy who had severe relapsing pancreatitis three times in two months within 3 weeks after starting treatment with methylphenidate ( ritalin ) due to attention deficit hyperactivity disorder (adhd).
The boy was generally healthy except for that he was newly diagnosed with adhd and started the use of methylphenidate ( ritalin ) for the past three weeks at a dose, of 30 mg daily.
We believe that the number of persons suffering from pancreatitis due to the use of ritalin is more than this published case.
Physicians must pay attention regarding this possible complication and it should be taken into consideration in every patient with abdominal pain who started consuming ritalin.

Clinical Trials Data: Labelled Data
Data                   Dosage   Drug   Treats Disease   Side Effects   Age   Gender   Ethnicity   duration
10-year-old            0        0      0                0              1     0        0           0
pancreatitis-ritalin   0        0      0                1              0     0        0           0
adhd-ritalin           0        0      1                0              0     0        0           0
ritalin                0        1      0                0              0     0        0           0
30 mg                  1        0      0                0              0     0        0           0
past three weeks       0        0      0                0              0     0        0           1
boy                    0        0      0                0              0     1        0           0

Creating Labeled Data
(1) Hand label ~100 examples that contain the specific attribute.
(2) Extract candidates: POS tag and extract unigrams, bigrams and trigrams centered on nouns.
(3) Generate rules: automatic creation of labels that satisfy the 100 hand-labelled examples.
(4) Rule-based model: with 95% accuracy.
(5) Iterate: repeat the process a few times.
This process will create a smaller sample (say 5-10%) of data, which can be further crowdsourced for a 100% accurate gold sample.

Example of rules:
Dosage:
(1) Sentence contains numbers
(2) Distance between numbers and "mg", "milligrams" < 5 characters
(3) Contains the word "dose"
Age:
(1) Sentence contains numbers
(2) Contains the word "age", "year-old" within 5 words of the candidate
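The rules above can be written as small labelling functions. The slide does not say how the individual rules are combined, so the combination used here (a number is required, plus either a nearby unit or the word "dose") is an assumption, as are the function names dosage_rule and age_rule and the whitespace tokenization.

```python
import re

NUMBER = re.compile(r"\d")
# A digit followed by fewer than 5 non-digit characters, then the unit.
NUMBER_NEAR_UNIT = re.compile(r"\d\D{0,4}?(?:mg|milligrams)\b", re.IGNORECASE)
DOSE_WORD = re.compile(r"\bdose\b", re.IGNORECASE)


def dosage_rule(sentence):
    """Dosage rules (1)-(3); assumed combination: (1) and ((2) or (3))."""
    has_number = NUMBER.search(sentence) is not None
    near_unit = NUMBER_NEAR_UNIT.search(sentence) is not None
    has_dose = DOSE_WORD.search(sentence) is not None
    return has_number and (near_unit or has_dose)


def age_rule(sentence, candidate, window=5):
    """Age rules (1)-(2): a number, and "age"/"year-old" near the candidate."""
    if NUMBER.search(sentence) is None:
        return False
    words = sentence.lower().split()
    cand = candidate.lower().split()
    for i in range(len(words) - len(cand) + 1):
        if words[i:i + len(cand)] == cand:
            nearby = words[max(0, i - window):i + len(cand) + window]
            return any(w.strip(".,()") == "age" or "year-old" in w for w in nearby)
    return False  # candidate not found in the sentence


dose_sentence = ("started the use of methylphenidate ( ritalin ) for the "
                 "past three weeks at a dose, of 30 mg daily.")
age_sentence = ("We present a case of a 10-year-old boy who had severe "
                "relapsing pancreatitis")
print(dosage_rule(dose_sentence), age_rule(age_sentence, "10-year-old"))
```

Rules like these are cheap to write and imprecise by design; their output is the starting point for the iteration and crowdsourcing steps described above.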

Deepdive: Extracting relationships between entities
Input: PDFs, text files, semi-structured JSON; example: journals available at PubMed and clinicaltrials.gov
Provide examples of data that need to be extracted
Output: structured data

Deepdive: Prototyping with ddlite
https://github.com/HazyResearch/ddlite

Mind Tagger
Show ipython notebook

Final Remarks
• NLP relationship extraction with ML techniques is very successful in the presence of gold labeled data.
• It is very important to invest time and resources in harvesting good training data.
• There is an enormous amount of data in pharma (clinical trials, laboratory notes, doctors' notes, drug manufacturing documents, …). In order to pursue personalized medicine it is important to centralize this data and make joint inferences across all data sets.

Thank You: We are hiring …
blog: https://medium.com/@sangha_deb
@sangha_deb, sanghamitra.a.deb@accenture.com
