Leveraging Machine Learning Techniques: Predictive Analytics for Knowledge Discovery in Radiology
1. Leveraging Machine Learning Techniques: Predictive Analytics for Knowledge Discovery in Radiology
Barbaros Selnur Erdal, PhD
Luciano M. S. Prevedello, MD, MPH
Kevin Mader, PhD
Joshy Cyriac
Bram Stieltjes, MD, PhD
2. Materials
a. Slides
i. http://bit.ly/2zK0qFm
b. KNIME + Workflows + Data (zip file to extract)
i. Not required for the lab computers, but needed at home (Windows 7 and above)
1. https://www.dropbox.com/s/3fcjvr0lfxfzmgd/knime_3.4.1.zip?dl=0
ii. Just the workflows - for Mac and Linux; requires KNIME (free from knime.com)
1. https://www.dropbox.com/s/rjp5qmb56q9fjfr/PredictiveAnalytics.knar?dl=0
c. Kaggle Competition
i. http://bit.ly/2zJMVps
3. Learning Objectives (from RSNA Abstract)
1. Review the basic principles of predictive analytics.
2. Be exposed to some of the existing validation methodologies used to test predictive models.
3. Understand how to incorporate radiology data sources (PACS, RIS, etc.) into predictive modeling.
4. Learn how to interpret results and create visualizations.
4. Outline
• Introduction / Starting KNIME (Kevin)
• Why are ML and predictive analytics important? (Luciano)
• Framework Overview (Kevin)
• Value Prop, Decision, ML Task
• Data Sources
• Collecting Data - Preprocessing
• Collecting Data from PACS
• Features
• Data Wrangling
• Building Models
• From Double-Blind to Competitions
• Conclusion / Outlook (Luciano/Selnur)
6. KNIME + Workflows
• Medical workflows are complicated, involving a large number of steps
• We want transparent, reproducible pipelines for running analyses in research and production settings
7. Should I learn KNIME?
• Supports
• MATLAB, R, and Python scripts
• Java code snippets
• Writing your own plugins (Eclipse)
• Natural Language Processing
• Image Processing (full ImageJ / FIJI support, ImgLib2 integration)
• Machine Learning Models (WEKA, scikit-learn, Decision Trees, PMML)
• Deep Learning (DL4J, Keras model import, and full Keras support coming)
• JavaScript Visualization
• Report Generation
• Excel Input / Output
• Database connectivity
8. Notes
• Please do not save the workflows, since the class tomorrow needs them
• You will need to change the path in
10. Benefits of Being Analytical
• Guide through turbulent times
• Improve decision making - know what is working vs. what is not
• Manage risks
• Improve Quality
• Cut costs – increase efficiency
• Anticipate change – competitive advantage
11. Is now the right time?
• YES!!!
• The shift from fee-for-service to value-based care
• Analytics - Guide through turbulent times
16. Example ML Canvas: Priority Inbox (PI), Louis Dorard, Jan. 2017
• Value Proposition
• Make it easier for users of an email client to identify new important emails in their inbox, by automatically detecting them and making them more visible in the inbox (this detection must happen before the user sees the email)
• The objective is that users spend less time in their inbox and reply to important emails more quickly
• Decision
• Move important incoming emails to a dedicated section at the top of the inbox
• ML Task
• We want to be able to answer the question "Is this email important?" before the user gets a chance to see the email
• Input: email
• Output: "Important" (positive class) or "Regular" -> Binary Classification
• Data Sources
• Previous email messages (as mbox files or in another type of database)
• Address book
• Calendar
• Collecting Data
• Explicit labelling: users can manually label emails as important or not, by clicking on an icon next to each email's subject
• Implicit labelling: heuristics based on user behavior after getting the email (e.g. replying fast, deleting without reading, etc.)
• Features
• Content features: subject, body, attachments, size
• Social features: based on info about the sender (e.g. in address book?), previous interactions, contextual (e.g. upcoming meeting with sender)
• Email labels (typically assigned via manual rules defined by the user)
• Making Predictions
• Every time we receive an email addressed to our user that starts a new thread (otherwise the importance is just the same as that of the thread)
• We aim to rapidly deliver the email to the right section of the inbox, within a 2 s period
• Offline Evaluation
• Use the last 3 months of emails for test and the 12 months before for training. We make the PI option available to the user if…
• Cost < baseline heuristic (e.g. "if sender in address book then important"): FP costs 1, FN costs 3
• No more than 1 error per X emails
• Building Models
• One model per user, initially built on the last 12 months of email data, that we update…
• When an error is signaled by the user via manual labelling
• Every 5 minutes by adding new data from implicit labelling, if any
• Live Evaluation and Monitoring (per week)
• Ratio: #errors explicitly signaled by user / #emails received
• Same with errors seen via implicit labelling
• Average time taken to reply to important emails
• Total time spent in the inbox
17. Our Goal
1. NSCLC patients have a large number of scans over the course of their visits to a hospital.
a. We want to predict which scans a patient will have after diagnosis, to try to minimize the number of required visits.
b. Go from a collection with metadata from over 60K scans to a model.
18. Value Proposition
• We want to schedule radiologists better at our Lung Cancer Center so we have faster turnaround times at lower cost
19. Decision
• How many scans do we expect will need to be read?
• How many radiologists need to be on duty in a given week?
20. ML Task
• Given a patient's history and diagnosis, predict the number of future scans
• Input:
• Patient History
• Patient Information
• Output:
• Number of CT scans needed post diagnosis
21. Data Sources
1. This course
a. CSV File of Scans with DICOM Headers
b. CSV File extracted from Tumor Board
2. Your own hospital
a. PACS
b. Tumor Board
c. Other Interesting Sources
i. RIS Reports
ii. Pathology Reports
23. Collecting Data
• This Course
• The data is already prepared; we just have to join it
• At your hospital
• Take list of patients from Tumor Board
• Find all scans for each patient in PACS
• Extract DICOM Header as Table
24. Collecting Data (PACS)
• Beyond this course
• DCMTK
• Python
• https://github.com/joshy/pypacscrawler
• RCC42C - Joshy Cyriac - Open Source Tools for
Rapidly Indexing, Searching, and Processing
Image Data from the PACS
25. Collecting Data
• This Course
• We read in scans from a PACS export
• We read a list of patients from a Tumor Board export
• We join the two tables on Patient ID
• We convert strings into dates
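The join-and-convert steps above can be sketched outside KNIME with pandas; the column names and toy rows below are illustrative assumptions, not the actual schema of the course files.

```python
import pandas as pd

# Toy stand-ins for the two exports used in the course; column names
# and rows are assumptions for illustration, not the real file schema.
scans = pd.DataFrame({
    "PatientID": ["P1", "P1", "P2"],
    "StudyDate": ["20150103", "20150611", "20160220"],  # DICOM-style date strings
    "Modality": ["CT", "CT", "MR"],
})
tumor_board = pd.DataFrame({
    "PatientID": ["P1", "P2"],
    "DiagnosisDate": ["2015-01-01", "2016-02-01"],
})

# Join the two tables on Patient ID
merged = scans.merge(tumor_board, on="PatientID", how="inner")

# Convert string columns into real dates so they can be sorted and compared
merged["StudyDate"] = pd.to_datetime(merged["StudyDate"], format="%Y%m%d")
merged["DiagnosisDate"] = pd.to_datetime(merged["DiagnosisDate"])

print(merged.dtypes)
```

The same result comes from the Joiner and String-to-Date nodes in the KNIME workflow.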
26. Features
• High Value Features
• Number of previous scans (hypothesis: having more scans before means there is less need to scan later)
• Age (older patients will have more complications?)
• Gender (could be gender differences)
• Interesting Features
• Referring Physician (maybe some physicians order more scans than others)
• Institution (some hospitals might order more scans than others)
• Not useful
• Accession Number, Patient ID, Patient Name
27. Pivoting / Pivot Tables
1. Many times the data we want isn’t in the right format
a. We have a fully expanded list of scans
b. We want the number of unique studies per patient organized by scan type
c. This requires a number of different operations
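As a rough illustration of the reshaping (not the actual KNIME Pivoting node), the same operation in pandas; the column names and toy data are assumptions:

```python
import pandas as pd

# A fully expanded list of scans (toy data; column names are assumptions)
scans = pd.DataFrame({
    "PatientID": ["P1", "P1", "P1", "P2", "P2"],
    "Modality": ["CT", "CT", "MR", "CT", "MR"],
    "AccessionNumber": ["A1", "A2", "A3", "A4", "A5"],
})

# One row per patient, one column per scan type,
# values = number of unique studies (accession numbers)
per_patient = scans.pivot_table(
    index="PatientID",
    columns="Modality",
    values="AccessionNumber",
    aggfunc="nunique",
    fill_value=0,
)
print(per_patient)
```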
38. Features
• High Value Features
• Number of previous scans (hypothesis: having more scans before means there is less need to scan later)
• Age (older patients will have more complications?)
• Gender (could be gender differences)
• Interesting Features
• Referring Physician (maybe some physicians order more scans than others)
• Institution (some hospitals might order more scans than others)
• Not useful
• Accession Number, Patient ID, Patient Name
39. Offline Evaluations
• Predict the number of scans
• Penalize the wrong number of scans linearly
• Even better
• Predict the number of scans per week
• Penalize by radiologist hour mismatch per week
40. Making Predictions
• As a patient is diagnosed with NSCLC, gather the whole patient history
• Predict the number of scans required in the future to plan better
for capacity
41. Building a Model
1. Models
a. Partitioning
i. Training Data
ii. Validation Data
b. Model Selection
i. Model Representation
c. Scoring
i. Confusion Matrix
ii. R^2
iii. ROC Curve
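The partition / select / score steps above can be sketched with scikit-learn (one of the ML backends KNIME supports); the synthetic data below is a stand-in for the course table, not real patient data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in: two features (e.g. age and number of previous
# scans) and a numeric target (scans after diagnosis)
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1] + rng.integers(0, 2, size=200)

# a. Partitioning: training data vs. validation data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# b. Model selection: here a simple regression tree
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# c. Scoring: R^2 on the held-out validation data
score = r2_score(y_val, model.predict(X_val))
print(score)
```

A confusion matrix or ROC curve would be the analogous scoring step for a classification task.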
44. Processing Data
● Pipeline / Workflow
○ Data processing should be a clearly defined, transparent workflow
○ Where is the data read from?
○ How can it be combined (Patient ID? KIS ID? Accession Number?)
○ Which fields/columns should be transformed, and how?
○ How can it be reorganized (pivoting)?
○ How can we apply this to any new data and make it clear for people unfamiliar with it?
46. Applying to new data
● We have built a model and tested it a bit
● Now we want to apply it to some new data
● We can take the entire workflow and make a ‘meta-node’ out of it (Node of Nodes)
47. Train, Validation, Test
● We have a training and testing dataset
● We partition the training data into a training set and a validation set
48. Saving Predictions on Test (CSV Writer)
● The CSV Writer node will export the table from KNIME to a
CSV File.
● The node has to be reconfigured in order to export the results, so right-click and select Configure…
● Then click the Browse… button and save the file on the
desktop
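Outside KNIME, the CSV Writer step corresponds to a one-line export; the predictions table and output path below are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy predictions table; the KNIME CSV Writer node does the
# equivalent of this export. The output path is hypothetical.
preds = pd.DataFrame({"PatientID": ["P1", "P2"], "PredictedScans": [3, 5]})

out = Path(tempfile.gettempdir()) / "predictions.csv"
preds.to_csv(out, index=False)
print(out.read_text().splitlines()[0])  # header row: PatientID,PredictedScans
```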
50. Our ‘In-Class’ Competition http://bit.ly/2zJMVps
• Sign up yourself, or use the guest account:
• Username: rsna2017
• Password: rsna2017
• (but you won't be on the leaderboard, your results will be deleted, and only the first 10 can use it)
52. Other Models
● Random Forest Regressor (not classifier)
○ Replace the Learner and Predictor nodes (both)
https://www.dropbox.com/s/pfgz0z8kt6tbdcw/AdvancedWorkflows.knar?dl=0
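In scikit-learn terms the same swap is a one-line change; the data here is synthetic with a known relationship so the prediction can be sanity-checked.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a known relationship (y = x0 + 2*x1)
rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1]

# Swapping models means replacing the learner and its matching predict,
# just like replacing both the Learner and Predictor nodes in KNIME
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = forest.predict([[3, 4]])[0]
print(pred)  # should be close to 3 + 2*4 = 11
```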
53. Review Important Points
● Clearly defined Goal
○ Predict a category
■ Classification (disease type, high/low risk patients)
○ Predict a number
■ Regression (risk factor, life expectancy, treatment dose, number of scans)
○ Think about workflow integration
■ Predicting the past isn’t helpful
○ What is accuracy?
● Collecting and Organizing Data
○ Pipeline Thinking
○ Finding a representative data-set
● Deciding on a validation strategy
○ Train / Test split
○ Cross-validation
● Evaluating Outcomes / Improving Models
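As a sketch of the cross-validation strategy mentioned above (synthetic data; scikit-learn assumed, not the KNIME Cross Validation node):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; in 5-fold cross-validation each fold takes one
# turn as the held-out set and the per-fold scores are averaged
rng = np.random.default_rng(2)
X = rng.integers(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1]

scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                         cv=5, scoring="r2")
print(scores.mean())
```

Compared to a single train/test split, this gives a less noisy estimate of how the model will do on unseen patients.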
54. Above and Beyond
● Just the beginning of predictive analytics and visualization
● Here are some other things we can do with the data
○ Timeline
■ Look at the timeline of events that happen to a given patient
○ Different scan types
■ Which scan types are more likely / less likely