Leveraging Machine Learning Techniques: Predictive Analytics for Knowledge Discovery in Radiology
1. Leveraging Machine Learning Techniques: Predictive Analytics for Knowledge Discovery in Radiology
Barbaros Selnur Erdal, PhD
Luciano M. S. Prevedello, MD, MPH
Kevin Mader, PhD
Joshy Cyriac
Bram Stieltjes, MD, PhD
2. Materials
a. Slides
i. http://bit.ly/2zK0qFm
b. KNIME + Workflows + Data (zip file to extract)
i. Not required for the lab computers, but needed at home (Windows 7 and above)
1. https://www.dropbox.com/s/3fcjvr0lfxfzmgd/knime_3.4.1.zip?dl=0
ii. Just the workflows - for Mac and Linux; requires KNIME (free from knime.com)
1. https://www.dropbox.com/s/rjp5qmb56q9fjfr/PredictiveAnalytics.knar?dl=0
c. Kaggle Competition
i. http://bit.ly/2zJMVps
3. Learning Objectives (from RSNA Abstract)
1. Review the basic principles of predictive analytics.
2. Be exposed to some of the existing validation methodologies used to test predictive models.
3. Understand how to incorporate radiology data sources (PACS, RIS, etc.) into predictive modeling.
4. Learn how to interpret results and create visualizations.
4. Outline
• Introduction / Starting KNIME (Kevin)
• Why are ML and predictive analytics important? (Luciano)
• Framework Overview (Kevin)
• Value Prop, Decision, ML Task
• Data Sources
• Collecting Data - Preprocessing
• Collecting Data from PACS
• Features
• Data Wrangling
• Building Models
• From Double-Blind to Competitions
• Conclusion / Outlook (Luciano/Selnur)
6. KNIME + Workflows
• Medical workflows are complicated, involving a large number of steps
• We want transparent, reproducible pipelines for running analyses in research and production settings
7. Should I learn KNIME?
• Supports
• MATLAB, R, and Python scripts
• Java code snippets
• Writing your own plugins (Eclipse)
• Natural Language Processing
• Image Processing (full ImageJ / FIJI support, ImgLib2 integration)
• Machine Learning Models (WEKA, scikit-learn, Decision Trees, PMML)
• Deep Learning (DL4J, Keras model import, and full Keras support coming)
• JavaScript Visualization
• Report Generation
• Excel Input / Output
• Database connectivity
8. Notes
• Please do not save the workflows, since the class tomorrow needs them
• You will need to change the path in
10. Benefits of Being Analytical
• Guide through turbulent times
• Improve decision making - know what is working vs. what is not
• Manage risks
• Improve Quality
• Cut costs – increase efficiency
• Anticipate change – competitive advantage
11. Is now the right time?
• YES!!!
• The shift from fee-for-service to value-based care
• Analytics - Guide through turbulent times
16. Example ML Canvas: Priority Inbox (PI), Louis Dorard, Jan. 2017
• Value Proposition
• Make it easier for users of an email client to identify new important emails in their inbox, by automatically detecting them and making them more visible in the inbox (this detection must happen before the user sees the email)
• The objective is that users spend less time in their inbox and reply to important emails more quickly
• Decision
• Move important incoming emails to a dedicated section at the top of the inbox
• ML Task
• We want to be able to answer the question "Is this email important?" before the user gets a chance to see the email
• Input: email
• Output: "Important" (positive class) or "Regular" -> Binary Classification
• Data Sources
• Previous email messages (as mbox files or in another type of database)
• Address book
• Calendar
• Collecting Data
• Explicit labelling: users can manually label emails as important or not, by clicking on an icon next to each email's subject
• Implicit labelling: heuristics based on user behavior after getting the email (e.g. replying fast, deleting without reading, etc.)
• Features
• Content features: subject, body, attachments, size
• Social features: based on info about the sender (e.g. in address book?), previous interactions, contextual (e.g. upcoming meeting with sender)
• Email labels (typically assigned via manual rules defined by the user)
• Making Predictions
• Every time we receive an email addressed to our user that starts a new thread (otherwise the importance is just the same as that of the thread)
• We aim to rapidly deliver the email to the right section of the inbox, within a 2 s period
• Offline Evaluation
• Use the last 3 months of emails for test and the 12 months before for training. We make the PI option available to the user if…
• Cost < baseline heuristic (e.g. "if sender in address book then important"): FP costs 1, FN costs 3
• No more than 1 error per X emails
• Building Models
• One model per user, initially built on the last 12 months of email data, that we update…
• When an error is signaled by the user via manual labelling
• Every 5 minutes by adding new data from implicit labelling, if any
• Live Evaluation and Monitoring (per week)
• Ratio: #errors explicitly signaled by user / #emails received
• Same with errors seen via implicit labelling
• Average time taken to reply to important emails
• Total time spent in the inbox
17. Our Goal
1. NSCLC patients have a large number of scans over the course of their visits to a hospital.
a. We want to predict which scans a patient will have after diagnosis, to try to minimize the number of required visits.
b. Go from a collection with metadata from over 60K scans to a model.
18. Value Proposition
• We want to schedule radiologists better at our Lung Cancer Center so we have faster turnaround times at lower cost
19. Decision
• How many scans do we expect will need to be read?
• How many radiologists need to be on duty in a given week?
20. ML Task
• Given a patient's history and diagnosis, predict the number of future scans
• Input:
• Patient History
• Patient Information
• Output:
• Number of CT scans needed post diagnosis
21. Data Sources
1. This course
a. CSV File of Scans with DICOM Headers
b. CSV File extracted from Tumor Board
2. Your own hospital
a. PACS
b. Tumor Board
c. Other Interesting Sources
i. RIS Reports
ii. Pathology Reports
23. Collecting Data
• This Course
• The data is already prepared; we just have to join it
• At your hospital
• Take list of patients from Tumor Board
• Find all scans for each patient in PACS
• Extract DICOM Header as Table
24. Collecting Data (PACS)
• Beyond this course
• DCMTK
• Python
• https://github.com/joshy/pypacscrawler
• RCC42C - Joshy Cyriac - Open Source Tools for
Rapidly Indexing, Searching, and Processing
Image Data from the PACS
25. Collecting Data
• This Course
• We read in scans from a PACS export
• We read a list of patients from a Tumor Board export
• We join the two tables on Patient ID
• We convert strings into dates
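The join-and-convert steps above can be sketched outside KNIME with pandas; the column names and toy rows below are illustrative assumptions, not the actual schema of the course files.

```python
import pandas as pd

# Toy stand-ins for the two exports used in the course; column names
# and rows are assumptions for illustration, not the real file schema.
scans = pd.DataFrame({
    "PatientID": ["P1", "P1", "P2"],
    "StudyDate": ["20150103", "20150611", "20160220"],  # DICOM-style date strings
    "Modality": ["CT", "CT", "MR"],
})
tumor_board = pd.DataFrame({
    "PatientID": ["P1", "P2"],
    "DiagnosisDate": ["2015-01-01", "2016-02-01"],
})

# Join the two tables on Patient ID
merged = scans.merge(tumor_board, on="PatientID", how="inner")

# Convert string columns into real dates so they can be sorted and compared
merged["StudyDate"] = pd.to_datetime(merged["StudyDate"], format="%Y%m%d")
merged["DiagnosisDate"] = pd.to_datetime(merged["DiagnosisDate"])

print(merged.dtypes)
```

The same result comes from the Joiner and String-to-Date nodes in the KNIME workflow.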
26. Features
• High Value Features
• Number of previous scans (hypothesis: having more scans before means there is less need to scan later)
• Age (older patients will have more complications?)
• Gender (could be gender differences)
• Interesting Features
• Referring Physician (maybe some physicians order more scans than others)
• Institution (some hospitals might order more scans than others)
• Not useful
• Accession Number, Patient ID, Patient Name
27. Pivoting / Pivot Tables
1. Many times the data we want isn’t in the right format
a. We have a fully expanded list of scans
b. We want the number of unique studies per patient organized by scan type
c. This requires a number of different operations
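As a rough illustration of the reshaping (not the actual KNIME Pivoting node), the same operation in pandas; the column names and toy data are assumptions:

```python
import pandas as pd

# A fully expanded list of scans (toy data; column names are assumptions)
scans = pd.DataFrame({
    "PatientID": ["P1", "P1", "P1", "P2", "P2"],
    "Modality": ["CT", "CT", "MR", "CT", "MR"],
    "AccessionNumber": ["A1", "A2", "A3", "A4", "A5"],
})

# One row per patient, one column per scan type,
# values = number of unique studies (accession numbers)
per_patient = scans.pivot_table(
    index="PatientID",
    columns="Modality",
    values="AccessionNumber",
    aggfunc="nunique",
    fill_value=0,
)
print(per_patient)
```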
38. Features
• High Value Features
• Number of previous scans (hypothesis: having more scans before means there is less need to scan later)
• Age (older patients will have more complications?)
• Gender (could be gender differences)
• Interesting Features
• Referring Physician (maybe some physicians order more scans than others)
• Institution (some hospitals might order more scans than others)
• Not useful
• Accession Number, Patient ID, Patient Name
39. Offline Evaluations
• Predict the number of scans
• Penalize the wrong number of scans linearly
• Even better
• Predict the number of scans per week
• Penalize by radiologist hour mismatch per week
40. Making Predictions
• As a patient is diagnosed with NSCLC, gather the whole patient history
• Predict the number of scans required in the future to plan better
for capacity
41. Building a Model
1. Models
a. Partitioning
i. Training Data
ii. Validation Data
b. Model Selection
i. Model Representation
c. Scoring
i. Confusion Matrix
ii. R^2
iii. ROC Curve
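The partition / select / score steps above can be sketched with scikit-learn (one of the ML backends KNIME supports); the synthetic data below is a stand-in for the course table, not real patient data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in: two features (e.g. age and number of previous
# scans) and a numeric target (scans after diagnosis)
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1] + rng.integers(0, 2, size=200)

# a. Partitioning: training data vs. validation data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# b. Model selection: here a simple regression tree
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# c. Scoring: R^2 on the held-out validation data
score = r2_score(y_val, model.predict(X_val))
print(score)
```

A confusion matrix or ROC curve would be the analogous scoring step for a classification task.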
44. Processing Data
● Pipeline / Workflow
○ Data processing should be a clearly defined, transparent workflow
○ Where is the data read from?
○ How can it be combined (Patient ID? KIS ID? Accession Number?)
○ Which fields/columns should be transformed, and how?
○ How can it be reorganized (pivoting)?
○ How can we apply this to any new data and make it clear for people unfamiliar with it?
46. Applying to new data
● We have built a model and tested it a bit
● Now we want to apply it to some new data
● We can take the entire workflow and make a ‘meta-node’ out of it (Node of Nodes)
47. Train, Validation, Test
● We have a training and testing dataset
● We partition the training data into a training set and a validation set
48. Saving Predictions on Test (CSV Writer)
● The CSV Writer node will export the table from KNIME to a
CSV File.
● The node has to be reconfigured in order to export the results, so right-click and select Configure…
● Then click the Browse… button and save the file on the
desktop
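Outside KNIME, the CSV Writer step corresponds to a one-line export; the predictions table and output path below are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy predictions table; the KNIME CSV Writer node does the
# equivalent of this export. The output path is hypothetical.
preds = pd.DataFrame({"PatientID": ["P1", "P2"], "PredictedScans": [3, 5]})

out = Path(tempfile.gettempdir()) / "predictions.csv"
preds.to_csv(out, index=False)
print(out.read_text().splitlines()[0])  # header row: PatientID,PredictedScans
```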
50. Our ‘In-Class’ Competition http://bit.ly/2zJMVps
• Sign up yourself, or use the guest account:
• Username: rsna2017
• Password: rsna2017
• (but you won't be on the leaderboard, your results will be deleted, and only the first 10 can use it)
52. Other Models
● Random Forest Regressor (not classifier)
○ Replace the Learner and Predictor nodes (both)
https://www.dropbox.com/s/pfgz0z8kt6tbdcw/AdvancedWorkflows.knar?dl=0
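In scikit-learn terms the same swap is a one-line change; the data here is synthetic with a known relationship so the prediction can be sanity-checked.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a known relationship (y = x0 + 2*x1)
rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1]

# Swapping models means replacing the learner and its matching predict,
# just like replacing both the Learner and Predictor nodes in KNIME
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = forest.predict([[3, 4]])[0]
print(pred)  # should be close to 3 + 2*4 = 11
```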
53. Review Important Points
● Clearly defined Goal
○ Predict a category
■ Classification (disease type, high/low risk patients)
○ Predict a number
■ Regression (risk factor, life expectancy, treatment dose, number of scans)
○ Think about workflow integration
■ Predicting the past isn’t helpful
○ What is accuracy?
● Collecting and Organizing Data
○ Pipeline Thinking
○ Finding a representative data-set
● Deciding on a validation strategy
○ Train / Test split
○ Cross-validation
● Evaluating Outcomes / Improving Models
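As a sketch of the cross-validation strategy mentioned above (synthetic data; scikit-learn assumed, not the KNIME Cross Validation node):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; in 5-fold cross-validation each fold takes one
# turn as the held-out set and the per-fold scores are averaged
rng = np.random.default_rng(2)
X = rng.integers(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1]

scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                         cv=5, scoring="r2")
print(scores.mean())
```

Compared to a single train/test split, this gives a less noisy estimate of how the model will do on unseen patients.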
54. Above and Beyond
● Just the beginning of predictive analytics and visualization
● Here are some other things we can do with the data
○ Timeline
■ Look at the timeline of events that happen to a given patient
○ Different scan types
■ Which scan types are more likely / less likely