Understanding the relationships between drugs and diseases, side effects, and dosages is an important part of drug discovery and clinical trial design. Some of these relationships have been studied and curated in resources such as the UMLS, BioPortal, and SNOMED. Typically this data is incomplete and distributed across various sources. I will address different stages of drug-disease, drug-side effect and drug-dosage relationship extraction. As a first step I will discuss extracting medical attributes (diseases, dosages, side effects) from FDA drug labels and clinical trials. Next I will use simple machine learning techniques to improve the precision and recall of this extraction. I will also discuss bootstrapping a training sample from a smaller training set. Then I will use DeepDive, a dark data extraction framework, to extract relationships between medical attributes and derive conclusive evidence on facts about them. The advantage of using DeepDive is that it masks the complexities of the machine learning techniques and forces the user to think more about features in the data set. At the end of these steps we will have structured (queryable) data that answers questions such as: what is the dosage of 'digoxin' for controlling 'ventricular response rate' in a male adult at 'age 60' with weight '160lbs'?
4. It is indicated for treating respiratory disorder caused
due to allergy.
For the relief of symptoms of depression.
Evidence supporting efficacy of carbamazepine as an
anticonvulsant was derived from active drug-controlled
studies that enrolled patients with the following seizure
types:
LOTEMAX is a corticosteroid indicated for the treatment
of post-operative inflammation and pain following
ocular surgery.
FDA Drug Labels: Examples
5. We present a case of a 10-year-old boy who had
severe relapsing pancreatitis three times in two
months within 3 weeks after starting treatment with
methylphenidate ( ritalin ) due to attention deficit
hyperactivity disorder (adhd).
The boy was generally healthy except for that he
was newly diagnosed with adhd and started the
use of methylphenidate ( ritalin ) for the past
three weeks at a dose, of 30 mg daily.
We believe that the number of persons suffering
from pancreatitis due to the use of ritalin is more
than this published case.
Physicians must pay attention regarding this
possible complication and it should be taken into
consideration in every patient with abdominal
pain who started consuming ritalin.
Meta Data
Dosage
single dose:
240 ml
Drug methylphenidate
# of vol 30mg
Clinical Trials: Meta Data
6. We present a case of a 10-year-old boy who had
severe relapsing pancreatitis three times in two
months within 3 weeks after starting treatment with
methylphenidate ( ritalin ) due to attention deficit
hyperactivity disorder (adhd).
The boy was generally healthy except for that he
was newly diagnosed with adhd and started the use
of methylphenidate ( ritalin ) for the past three weeks
at a dose, of 30 mg daily.
We believe that the number of persons suffering
from pancreatitis due to the use of ritalin is more
than this published case.
Physicians must pay attention regarding this
possible complication and it should be taken into
consideration in every patient with abdominal pain
who started consuming ritalin.
Drug      Adverse Effects
Ritalin   pancreatitis, abdominal pain
Tylenol   nausea, upper stomach pain, itching, loss of appetite
Aspirin   rash, gastrointestinal ulcerations, abdominal pain, upset stomach, heartburn
Clinical Trials: Side Effects
7. Drug-Disease
• Off-Label Drug Uses
• Database completion
• Design of clinical trials
Relationships between metadata
• How does heart disease correlate with gender and age?
• Which universities have the most successful clinical trials for breast cancer?
• How are genes and phenotypes related?
• What dosage of Ritalin was most effective in treating ADHD with the fewest side effects?
Problems it Solves
9. • Extract sentences that contain the specific attribute.
• POS tag and extract unigrams, bigrams and trigrams centered on nouns.
• Extract features: words around nouns (bag of words/word vectors), position of the noun.
• Train a machine learning model to predict which unigrams, bigrams or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.
• Map training data to create a balanced positive and negative training set.
Course of Action
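The candidate-extraction step can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: it assumes the sentence has already been POS tagged (the Penn Treebank tags below are hand-assigned), and it reads "centered on nouns" as "n-grams of nouns and adjectives that end in a noun". The function name `noun_candidates` is hypothetical.

```python
def noun_candidates(tagged_tokens, max_n=3):
    """Return n-grams (n <= max_n) that end in a noun and contain only nouns/adjectives."""
    candidates = []
    for start in range(len(tagged_tokens)):
        for n in range(1, max_n + 1):
            window = tagged_tokens[start:start + n]
            if len(window) < n:
                break  # ran past the end of the sentence
            tags = [tag for _, tag in window]
            if not tags[-1].startswith("NN"):
                continue  # the n-gram must end in a noun
            if all(tag.startswith(("NN", "JJ")) for tag in tags):
                candidates.append(" ".join(word for word, _ in window))
    return candidates


# Hand-tagged fragment of a lemmatized drug-label sentence (tags are an assumption).
tagged = [
    ("reduce", "VB"), ("the", "DT"), ("frequency", "NN"),
    ("of", "IN"), ("manic", "JJ"), ("episode", "NN"),
]
print(noun_candidates(tagged))
```

Each returned candidate would then become one row of training data for the drug in question.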
10. Creating Labelled Data
lemmatized_sentence: ['maintenance',
'therapy', 'reduce', 'the', 'frequency', 'of', 'manic', 'episode',
'and', 'diminish', 'the', 'intensity', 'of',
'those', 'episode', 'which', 'may', 'occur', '.']
Several Candidates
Typically one of them is the disease that the drug treats.
For every drug we create training data.
One line of text produces 5 lines of training data with one true positive.
Balancing the Training Data
Since the training data contains a higher percentage of zeros than ones,
it is important to balance it before modeling, i.e. to build the model
I choose an equal number of zeros and ones.
Candidate       Target   rule-prediction
maintenance     0        1
therapy         0        1
manic episode   1        1
intensity       0        1
episode         0        1
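One simple way to balance such a table is to downsample the majority class. The sketch below assumes each training row is a dict with a `target` key, mirroring the Candidate/Target columns above; the helper name `balance` and the fixed seed are illustrative choices, not something taken from the talk.

```python
import random


def balance(rows, label_key="target", seed=0):
    """Downsample so that zeros and ones appear in equal numbers."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    k = min(len(positives), len(negatives))
    balanced = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(balanced)  # avoid all positives coming first
    return balanced


# Rows taken from the candidate table on this slide.
rows = [
    {"candidate": "maintenance", "target": 0},
    {"candidate": "therapy", "target": 0},
    {"candidate": "manic episode", "target": 1},
    {"candidate": "intensity", "target": 0},
    {"candidate": "episode", "target": 0},
]
balanced = balance(rows)
print(balanced)
```

With one positive among five rows, the balanced set keeps that positive and one randomly chosen negative.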
11. Feature Extraction: Word Vectors, Disease Combinations
adhd + manic episode = bipolar disorder
respiratory disorder + allergy = common cold
coronary artery + heart disease = angina pectoris
high blood pressure + lipid = diabetes_management
Extract Features: Initialize vocabulary with pre-trained vectors
gensim: Train word2vec on a medical corpus with unigrams, bigrams and trigrams
Produce word vectors
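The "disease combination" idea amounts to adding two word vectors and looking up the nearest remaining vector by cosine similarity. The sketch below shows only that mechanism in plain Python. The three-dimensional vectors are invented toy values chosen so the example works; they are not output from word2vec or from any medical corpus, and the helper names are hypothetical.

```python
import math


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, vectors, exclude=()):
    """Return the vocabulary entry whose vector is most similar to the query."""
    return max(
        (w for w in vectors if w not in exclude),
        key=lambda w: cosine(query, vectors[w]),
    )


# Toy vectors, made up purely for illustration (NOT trained embeddings).
toy_vectors = {
    "adhd": [1.0, 0.0, 0.2],
    "manic episode": [0.0, 1.0, 0.2],
    "bipolar disorder": [0.9, 1.1, 0.3],
    "common cold": [-1.0, 0.1, 0.9],
}
query = add(toy_vectors["adhd"], toy_vectors["manic episode"])
print(nearest(query, toy_vectors, exclude={"adhd", "manic episode"}))
```

With real embeddings the same lookup would be done by the embedding library; the point here is only how vector addition plus nearest-neighbour search produces the combinations listed above.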
12. Pure Python stack
pandas
scikit-learn
gensim
stanford-nlp-parser
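from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

class ItemSelector(BaseEstimator, TransformerMixin):
    # Custom helper (not part of scikit-learn): passes one column to each branch
    def __init__(self, column):
        self.column = column
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.column]

pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for getting the position of the disease candidate
            ('position', Pipeline([
                ('selector', ItemSelector(column='candidate')),
                ('vect', DictVectorizer()),
            ])),
            # Pipeline for getting words around candidates
            ('words_around', Pipeline([
                ('selector', ItemSelector(column='words_around')),
                ('count', CountVectorizer()),
            ])),
        ])),
    # Logistic regression stands in for the ML_library placeholder; any L1-capable classifier fits here
    ('clf', LogisticRegression(penalty='l1', solver='liblinear'))])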
13. Data Cleaning and Tokenization
Machine Learning Workflow: Pure Python stack
pandas
scikit-learn
gensim
stanford-nlp-parser
Feature Extraction / Candidate Selection
Create Labelled Data
ML: Logistic Regression, …
Hyperparameter Tuning
Calculate Metrics: precision, recall, ROC curve, etc.
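For the metrics step, precision and recall can be computed directly from the target and prediction columns. The sketch below is plain Python and uses the Target and rule-prediction columns from the "Creating Labelled Data" slide as input; the function name `precision_recall` is an illustrative choice.

```python
def precision_recall(targets, predictions):
    """Compute precision and recall for binary 0/1 labels."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Target and rule-prediction columns for the five candidates of one sentence.
targets = [0, 0, 1, 0, 0]
rule_predictions = [1, 1, 1, 1, 1]
print(precision_recall(targets, rule_predictions))
```

A rule that marks every candidate as positive finds the one true disease (recall 1.0) but is right only one time in five (precision 0.2), which is the gap the trained model is meant to close.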
15. Drug                     Candidate        Target  Predict
    Silver Sulfadiazine      third degree     0       0
    Silver Sulfadiazine      sepsis           0       1
    Silver Sulfadiazine      burn             0       1
    Silver Sulfadiazine      cream            0       0

    Drug                     Candidate        Target  Predict
    Diltiazem Hydrochloride  spasm            1       0
    Diltiazem Hydrochloride  coronary artery  1       0
    Diltiazem Hydrochloride  stable angina    0       0
    Diltiazem Hydrochloride  angina           0       0

'silver sulfadiazine cream usp 1 % be a topical
antimicrobial drug indicate as a adjunct for the
prevention and treatment of wound sepsis in patient with
second and third degree burn .'
['Diltiazem', 'hydrochloride', 'tablet', 'USP', 'be',
'indicate', 'for', 'the', 'management', 'of', 'chronic',
'stable', 'angina', 'and', 'angina', 'due', 'to',
'coronary', 'artery', 'spasm', '.']
Cases where it does not work
17. Clinical Trials Data
We present a case of a 10-year-old boy who had severe relapsing
pancreatitis three times in two months within 3 weeks after starting treatment
with methylphenidate ( ritalin ) due to attention deficit hyperactivity
disorder (adhd).
The boy was generally healthy except for that he was newly diagnosed with
adhd and started the use of methylphenidate ( ritalin ) for the past three
weeks at a dose, of 30 mg daily.
We believe that the number of persons suffering from pancreatitis due to the
use of ritalin is more than this published case.
Physicians must pay attention regarding this possible complication and it
should be taken into consideration in every patient with abdominal pain who
started consuming ritalin.
20. Creating Labeled Data
• Hand-label ~100 examples that contain the specific attribute.
• Extract candidates: POS tag and extract unigrams, bigrams and trigrams centered on nouns.
• Generate rules: automatically create labels that satisfy the 100 hand-labelled examples.
• This process will create a smaller sample (say 5-10%) of data which can be further crowdsourced for a 100% accurate gold sample.
• Rule-based model: with 95% accuracy.
• Iterate: repeat the process a few times.
21. Example of rules:
Dosage:
(1) Sentence contains numbers
(2) Distance between numbers and "mg", "milligrams" < 5 characters
(3) Contains the word "dose"
Age:
(1) Sentence contains numbers
(2) Contains the word "age", "year-old" within 5 words of the candidate
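Rules like these can be written as small labeling functions. The sketch below is one possible reading, under two assumptions that the slide does not state: the rules within a group are combined with AND, and a "candidate" for the age rule is any whitespace-separated token containing a digit. The function names are hypothetical.

```python
import re

# A number followed by "mg" or "milligrams" with fewer than 5 characters in between.
NUMBER_NEAR_UNIT = re.compile(r"\d+(?:\.\d+)?.{0,4}?(?:mg|milligrams)\b")


def is_dosage_sentence(sentence):
    """Dosage rules: a number close to a milligram unit, plus the word 'dose'."""
    s = sentence.lower()
    has_dose_word = re.search(r"\bdose\b", s) is not None
    has_number_near_unit = NUMBER_NEAR_UNIT.search(s) is not None
    return has_dose_word and has_number_near_unit


def age_candidates(sentence, window=5):
    """Age rules: numeric tokens with 'age' or 'year-old' within `window` words."""
    tokens = sentence.lower().split()
    hits = []
    for i, token in enumerate(tokens):
        if not re.search(r"\d", token):
            continue  # only tokens containing a number are candidates
        nearby = tokens[max(0, i - window): i + window + 1]
        if any("year-old" in w or w.strip(".,;:()") == "age" for w in nearby):
            hits.append(token.strip(".,;:()"))
    return hits


print(is_dosage_sentence("weeks at a dose, of 30 mg daily."))
print(age_candidates("We present a case of a 10-year-old boy who had severe relapsing pancreatitis"))
```

Both example sentences come from the case report shown on the clinical trials slides. Labels produced this way are noisy, which is why the rules are checked against the hand-labelled examples and refined over a few iterations.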
22. Deepdive: Extracting relationships between entities
PDFs, text files, semi-structured JSON; for example, journals available at
PubMed and ClinicalTrials.gov
Provide examples of data that need to be extracted
Structured data
26. • NLP relationship extraction with ML techniques is
very successful in the presence of gold-labeled data.
• It is very important to invest time and resources
towards harvesting good training data.
• There is an enormous amount of data in pharma
(clinical trials, laboratory notes, doctors' notes, drug
manufacturing documents, …). In order to pursue
personalized medicine it is important to centralize
this and make joint inferences across all data sets.
Final Remarks
27. Thank You: We are hiring …
blog: https://medium.com/@sangha_deb
@sangha_deb,sanghamitra.a.deb@accenture.com