DataScience Lab, May 13, 2017
Patient similarity: cleaning out duplicates and predicting missed diagnoses
Viktor Sarapin (CEO at V.I.Tech)
How to efficiently detect duplicates among tens of millions of patients, and how to identify missed diagnoses and treatment actions.
All materials are available at: http://datascience.in.ua/report2017
2. www.vitech.com.ua
The Problem
You have a database of 30M patients with all medical records.
Each patient is described by 250K binary features.
You need a system for finding the N most similar patients to a given one.
Jesus, it’s Big Data, get Hadoop!
Extremes
Pre-compute none:
Compare 30 million pairs by 250K features
37+ Tflops
One Intel i7 would compute it in 10 minutes (pure computing time)
Pre-compute all:
450+ trillion pairs
Stored as key-values, more than 1 PB for values only
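The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-envelope check: `bytes_per_value` is an assumption (the slides do not say how a stored similarity value is encoded), and the names are illustrative.

```python
# Back-of-envelope check of the two extremes.
n_patients = 30_000_000
n_features = 250_000

# Pre-compute all: one stored similarity per unordered pair of patients.
n_pairs = n_patients * (n_patients - 1) // 2
bytes_per_value = 4  # assumption: one 32-bit value per pair, keys not counted
storage_pb = n_pairs * bytes_per_value / 1e15

# Pre-compute none: one query is compared against every patient
# across every feature.
feature_comparisons = n_patients * n_features

print(f"pairs: {n_pairs:.3e}")
print(f"storage for values only: {storage_pb:.2f} PB")
print(f"feature comparisons per query: {feature_comparisons:.3e}")
```

With these assumptions the pair count comes out at about 450 trillion and the value storage at well over 1 PB, while a single brute-force query needs 7.5 trillion feature comparisons. The 37+ Tflops on the slide is consistent with roughly five floating-point operations per feature comparison, which is an inference and not something the slides state.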
Jesus, it’s Big Data, get Hadoop!
Extremes: What to do?
Ideas:
1. We don't need the meaning of each feature, we only care about similarity of the patients;
2. We don't want to compare very different patients, we want to compare only the most similar ones.
Idea 1: Reduce dimensionality
Decrease dimensionality of the data while preserving
similarities: LSH with MinHashing
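To make the idea concrete, here is a minimal MinHash sketch in pure Python. It is not the speaker's implementation: the number of hash functions (256), the linear hash family, and all function names are assumptions chosen for illustration. A patient is represented as the set of indices of the features that are set to 1, and the fraction of matching signature positions estimates the Jaccard similarity of two patients.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, far above the 250K feature ids

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(active_features, params):
    """Signature = minimum hash value over the active features, per hash function."""
    return [min(((a * f + b) % PRIME for f in active_features), default=PRIME)
            for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima coincide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

# Two synthetic patients sharing 500 of 1500 distinct active features.
rng = random.Random(42)
ids = rng.sample(range(250_000), 1500)
patient_a = set(ids[:1000])
patient_b = set(ids[500:])

params = make_hash_params(256, seed=1)
sig_a = minhash_signature(patient_a, params)
sig_b = minhash_signature(patient_b, params)

print("exact Jaccard:    ", exact_jaccard(patient_a, patient_b))
print("estimated Jaccard:", estimated_jaccard(sig_a, sig_b))
```

Each patient shrinks from up to 250K binary features to 256 integers, and the estimate gets closer to the exact Jaccard similarity as the number of hash functions grows.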
Idea 2: Group similar
1. Can’t have ungrouped patients
2. Need to work in minibatches (chunks)
3. Need stochastic guarantees
Size matters.
Idea 2: Group similar
Group similar patients and store groups as separate files
Store centroids of each cluster in a separate file, too
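One possible on-disk layout for this is sketched below. The JSON format, the file names (`centroids.json`, `cluster_<id>.json`) and the helper functions are all assumptions made for illustration; the slides only say that each group and the centroids live in separate files.

```python
import json
import tempfile
from pathlib import Path

def write_clusters(out_dir, centroids, clusters):
    """centroids: {cluster_id: signature}
    clusters:  {cluster_id: {patient_id: signature}}"""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # One small file holding every centroid.
    (out_dir / "centroids.json").write_text(json.dumps(centroids))
    # One file per cluster, so a query only ever reads a single cluster.
    for cluster_id, members in clusters.items():
        (out_dir / f"cluster_{cluster_id}.json").write_text(json.dumps(members))

def load_centroids(out_dir):
    return json.loads((Path(out_dir) / "centroids.json").read_text())

def load_cluster(out_dir, cluster_id):
    return json.loads((Path(out_dir) / f"cluster_{cluster_id}.json").read_text())

# Tiny made-up example: two clusters, three patients, 3-value signatures.
centroids = {"0": [1, 5, 9], "1": [2, 6, 7]}
clusters = {
    "0": {"p1": [1, 5, 9], "p2": [1, 5, 8]},
    "1": {"p3": [2, 6, 7]},
}

with tempfile.TemporaryDirectory() as tmp:
    write_clusters(tmp, centroids, clusters)
    files = sorted(p.name for p in Path(tmp).iterdir())
    reloaded_centroids = load_centroids(tmp)
    reloaded_cluster = load_cluster(tmp, "0")

print(files)
```

The point of the split is that the centroid file is small enough to stay in memory, while each cluster file is read only when a query lands in that cluster.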
The Solution
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show top N similar
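Steps 3 to 7 above can be sketched as a single function. This is an illustration under assumptions, not the production code: the query is assumed to be already reduced to a signature (steps 1 and 2), centroids are assumed to be signatures too, similarity is the fraction of matching signature positions, and `load_cluster` is a hypothetical callable that returns the members of one cluster.

```python
def similarity(sig_a, sig_b):
    """Fraction of signature positions that match."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def find_similar(query_sig, centroids, load_cluster, top_n=5):
    # Step 4: compare the patient to every centroid.
    best_cluster = max(centroids,
                       key=lambda cid: similarity(query_sig, centroids[cid]))
    # Step 5: load only the cluster of the closest centroid.
    members = load_cluster(best_cluster)
    # Step 6: compare the patient with the patients in that cluster.
    scored = [(pid, similarity(query_sig, sig)) for pid, sig in members.items()]
    # Step 7: highest similarity first, ties broken by patient id.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]

# Made-up data standing in for the centroid file and the cluster files.
centroids = {"a": [1, 2, 3, 4], "b": [9, 8, 7, 6]}
cluster_files = {
    "a": {"p1": [1, 2, 3, 4], "p2": [1, 2, 3, 0], "p3": [1, 0, 0, 0]},
    "b": {"p4": [9, 8, 7, 6]},
}

result = find_similar([1, 2, 3, 4], centroids, cluster_files.__getitem__, top_n=2)
print(result)  # [('p1', 1.0), ('p2', 0.75)]
```

Only one cluster is ever scanned per query, which is what keeps the search cheap. The trade-off is that a similar patient sitting in a neighbouring cluster would be missed by this simple version.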
The Results
50,000 clusters, up to ~1000 patients per cluster
~500 KB to 1 MB per cluster file
~18 MB centroid file
To do similarity search you need:
~20 GB HDD
~20 MB RAM
Search works in ~100 milliseconds on a regular office laptop
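A quick calculation shows why the search is this cheap. The sketch below uses only the numbers from the slides, assumes decimal megabytes for the centroid file, and takes the worst case of a full 1000-patient cluster.

```python
n_patients = 30_000_000
n_clusters = 50_000
max_cluster_size = 1_000
centroid_file_bytes = 18 * 10**6  # assumption: ~18 MB read as decimal megabytes

# Average number of patients per cluster.
avg_cluster_size = n_patients / n_clusters

# Per query: every centroid, plus every patient in one (worst-case) cluster.
comparisons_per_query = n_clusters + max_cluster_size

# Reduction compared with scanning all patients.
reduction = n_patients / comparisons_per_query

# Implied size of one stored centroid.
bytes_per_centroid = centroid_file_bytes / n_clusters

print(avg_cluster_size, comparisons_per_query, round(reduction), bytes_per_centroid)
```

A query therefore touches about 51,000 reduced records, not 30 million full ones, which is close to 600 times fewer comparisons, and each of those comparisons runs on a short signature and not on 250K features.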