DataScienceLab2017_Patient similarity: cleaning out duplicates and predicting missing diagnoses_Viktor Sarapin


DataScience Lab, May 13, 2017. Patient similarity: cleaning out duplicates and predicting missing diagnoses. Viktor Sarapin (CEO at V.I.Tech). How to efficiently detect duplicates among tens of millions of patients, and how to identify missed diagnoses and treatment actions. All materials are available at: http://datascience.in.ua/report2017


www.vitech.com.ua
The Problem
You have a database of 30M patients with all their medical records.
Each patient is described by 250K binary features.
You need a system for finding the N most similar patients to a given one.
Jesus, it’s Big Data, get Hadoop!

Extremes
Pre-compute none:
- Compare 30 million pairs by 250K features
- 37+ Tflops
- One Intel i7 would compute it in 10 minutes (pure computing time)

Pre-compute all:
- 450+ trillion pairs
- Stored as key-values, more than 1 PB for values only
Jesus, it’s Big Data, get Hadoop!
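The sizes behind the two extremes can be checked with a few lines of arithmetic. This is only a sketch: the patient and feature counts come from the slide, while `BYTES_PER_VALUE` is an assumption (one 32-bit similarity score per pair). The comparison count is a lower bound on the work per query and does not try to reproduce the slide's 37+ Tflops figure.

```python
# Back-of-the-envelope numbers for the two extremes.
N_PATIENTS = 30_000_000   # from the slide
N_FEATURES = 250_000      # from the slide
BYTES_PER_VALUE = 4       # assumption: one 32-bit similarity score per pair


def unordered_pairs(n: int) -> int:
    """Number of distinct patient pairs, n choose 2."""
    return n * (n - 1) // 2


def precompute_all_storage_bytes(n: int, bytes_per_value: int) -> int:
    """Storage for the similarity values alone, ignoring the keys."""
    return unordered_pairs(n) * bytes_per_value


def precompute_none_comparisons(n: int, features: int) -> int:
    """Feature comparisons needed to score one query against every patient."""
    return n * features


pairs = unordered_pairs(N_PATIENTS)
storage = precompute_all_storage_bytes(N_PATIENTS, BYTES_PER_VALUE)
comparisons = precompute_none_comparisons(N_PATIENTS, N_FEATURES)

print(f"pairs:       {pairs:.3e}")
print(f"storage:     {storage / 1e15:.2f} PB (values only)")
print(f"comparisons: {comparisons:.3e} per query")
```

With these inputs the pair count comes out at roughly 4.5e14, which is where the "450 trillion" on the slide comes from.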

Extremes: What to do?
Ideas:
1. We don’t need the meaning of each feature; we only care about the similarity of the patients.
2. We don’t want to compare very different patients; we want to compare only the most similar ones.

Idea 1: Reduce dimensionality
Data representation:

                    Patient 1   Patient 2   Patient 3
Dictionary Code 1       1           1           0
Dictionary Code 2       0           1           0
Dictionary Code 3       1           0           1

Idea 1: Reduce dimensionality
Jaccard Similarity as metric
J(X,Y) = |X∩Y| / |X∪Y|
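As a small illustration, the metric can be computed directly on sets of dictionary codes. The sketch below encodes the three patients from the data-representation slide as the sets of codes whose value is 1. The function name `jaccard` and the convention for two empty sets are assumptions made for this example.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = x | y
    if not union:
        return 1.0  # assumption: two patients with no codes count as identical
    return len(x & y) / len(union)


# Each patient is the set of dictionary codes set to 1 in the table.
patient_1 = {1, 3}
patient_2 = {1, 2}
patient_3 = {3}

print(jaccard(patient_1, patient_2))  # one shared code out of three
print(jaccard(patient_1, patient_3))  # one shared code out of two
print(jaccard(patient_2, patient_3))  # no shared codes
```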

Idea 1: Reduce dimensionality
Decrease dimensionality of the data while preserving
similarities: LSH with MinHashing
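A minimal MinHash sketch, to show how a short signature can stand in for a 250K-dimensional binary vector. It is not the speaker's implementation: the salted BLAKE2b hashing, the signature length of 128, and all function names are assumptions chosen for illustration, and the LSH banding step is left out.

```python
import hashlib


def _hash(feature_id: int, seed: int) -> int:
    """64-bit hash of a feature id, salted with the index of the hash function."""
    digest = hashlib.blake2b(
        feature_id.to_bytes(8, "little"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features: set, num_hashes: int = 128) -> list:
    """One minimum per hash function, taken over the ids of the features set to 1."""
    if not features:
        raise ValueError("MinHash is undefined for an empty feature set")
    return [min(_hash(f, seed) for f in features) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = set(range(0, 150))
b = set(range(50, 200))  # true Jaccard similarity: 100 / 200 = 0.5
sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(len(sig_a), estimate_jaccard(sig_a, sig_b))
```

The estimate gets tighter as the signature grows, so the signature length is a trade-off between file size and accuracy.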

Idea 2: Group similar
1. Can’t have ungrouped patients
2. Need to work in minibatches (chunks)
3. Need stochastic guarantees
Size matters.

Idea 2: Group similar
Estimating mean
Hoeffding's inequality:

$$p\left(|\hat{\mu} - \mu| \ge \varepsilon\right) \le 2\exp\left(-\frac{2 m \varepsilon^{2}}{\max^{2}}\right)$$
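Hoeffding's inequality can be turned around to ask how large a minibatch has to be for a given accuracy. The sketch below assumes the standard form of the bound for m independent samples confined to a range of width `value_range` (1 for binary features). The function names and the example tolerances are illustrative, not taken from the talk.

```python
import math


def hoeffding_bound(m: int, eps: float, value_range: float = 1.0) -> float:
    """Upper bound on P(|estimated mean - true mean| >= eps) from m samples."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2 / value_range ** 2)


def hoeffding_sample_size(eps: float, delta: float, value_range: float = 1.0) -> int:
    """Smallest m for which the bound above is at most delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


# Example: estimate a binary feature's mean to within 0.05 with 99% confidence.
m = hoeffding_sample_size(eps=0.05, delta=0.01)
print(m, hoeffding_bound(m, eps=0.05))
```

For these tolerances the required batch is on the order of a thousand samples, independent of how many patients are in the database.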

Idea 2: Group similar
Stochastic k-modes
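Joint deviation probability for two estimated centroid components $\hat{c}_{ij}$ (the second is marked with primes):

$$
\begin{aligned}
p\left(|\hat{c}_{ij} - c_{ij}| \ge \varepsilon,\ |\hat{c}_{i'j'} - c_{i'j'}| \ge \varepsilon\right)
&= p\left(|\hat{c}_{ij} - c_{ij}| \ge \varepsilon\right)\, p\left(|\hat{c}_{i'j'} - c_{i'j'}| \ge \varepsilon\right) \\
&\le 2\exp\left(-\frac{2 m \varepsilon^{2}}{\max^{2}}\right) \cdot 2\exp\left(-\frac{2 m \varepsilon^{2}}{\max^{2}}\right) \\
&= 4\exp\left(-\frac{4 m \varepsilon^{2}}{\max^{2}}\right)
\end{aligned}
$$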

Idea 2: Group similar
Stochastic k-modes
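The slides do not spell out the algorithm, so the following is only a generic minibatch k-modes sketch under stated assumptions, not the speaker's method. Centroids are binary vectors, distance is Hamming distance, and each centroid bit is the running majority vote of the points assigned to that cluster so far. All names (`minibatch_kmodes`, `assign`, `batch_size`) are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: hamming(point, centroids[k]))


def minibatch_kmodes(points, k, batch_size, iterations, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = [list(p) for p in rng.sample(points, k)]
    ones = [[0] * dim for _ in range(k)]  # per cluster: how often each bit was 1
    sizes = [0] * k                       # per cluster: points seen so far
    for _ in range(iterations):
        # Only one chunk of the data is touched per iteration.
        for p in rng.sample(points, batch_size):
            c = nearest(p, centroids)
            sizes[c] += 1
            for j, bit in enumerate(p):
                ones[c][j] += bit
        # The mode of a binary feature is 1 when more than half the votes are 1.
        for c in range(k):
            if sizes[c]:
                centroids[c] = [int(2 * ones[c][j] > sizes[c]) for j in range(dim)]
    return centroids


def assign(points, centroids):
    """Final pass: every point gets a cluster, so no patient stays ungrouped."""
    return [nearest(p, centroids) for p in points]


points = [
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 1, 1],
]
centroids = minibatch_kmodes(points, k=2, batch_size=4, iterations=10)
labels = assign(points, centroids)
print(centroids, labels)
```

Keeping running counts instead of recomputing modes from scratch is one simple way to let each minibatch refine the centroids without holding the full dataset in memory.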

Idea 2: Group similar
Stochastic k-modes - convergence
[Figure: convergence plot with two series, "benchmark features changed" and "k-modes features changed"; x-axis ticks from 1 to 97, y-axis from 0 to 14000.]

Idea 2: Group similar
Group similar patients and store groups as separate files
Store the centroids of all clusters in a separate centroid file, too

The Solution
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show top N similar
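The steps above can be sketched as a two-stage lookup over files on disk. This is an illustration under assumptions, not the original system: the JSON file layout, the file names, and the functions `build_index` and `top_n_similar` are hypothetical. To stay short, it compares raw feature sets with Jaccard similarity rather than MinHash signatures (step 2).

```python
import json
import tempfile
from pathlib import Path


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def build_index(clusters, index_dir):
    """Write one file per cluster plus a single centroid file (hypothetical layout)."""
    index_dir = Path(index_dir)
    centroids = {}
    for cluster_id, (centroid, patients) in clusters.items():
        centroids[cluster_id] = sorted(centroid)
        payload = {pid: sorted(feats) for pid, feats in patients.items()}
        (index_dir / f"cluster_{cluster_id}.json").write_text(json.dumps(payload))
    (index_dir / "centroids.json").write_text(json.dumps(centroids))


def top_n_similar(query, index_dir, n):
    index_dir = Path(index_dir)
    # Steps 3-4: load the centroid file and find the closest centroid.
    centroids = json.loads((index_dir / "centroids.json").read_text())
    best = max(centroids, key=lambda cid: jaccard(query, set(centroids[cid])))
    # Steps 5-6: load only that cluster's file and score its patients.
    patients = json.loads((index_dir / f"cluster_{best}.json").read_text())
    scored = [(jaccard(query, set(feats)), pid) for pid, feats in patients.items()]
    # Step 7: highest similarity first, ties broken by patient id.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(pid, score) for score, pid in scored[:n]]


clusters = {
    "a": ({1, 2, 3}, {"p1": {1, 2, 3}, "p2": {1, 2}, "p3": {2, 3, 4}}),
    "b": ({10, 11}, {"p4": {10, 11}, "p5": {10, 11, 12}}),
}
with tempfile.TemporaryDirectory() as index_dir:
    build_index(clusters, index_dir)
    result = top_n_similar({1, 2, 3}, index_dir, n=2)
print(result)
```

Only the centroid file and one cluster file are read per query, which is what keeps the memory footprint small.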

The Results
50,000 clusters, up to ~1000 patients per cluster
~500 KB-1 MB per cluster file
~18 MB centroid file
To do a similarity search you need:
~20 GB HDD
~20 MB RAM
Search works in ~100 milliseconds on a regular office laptop

What’s next?
Other metrics:
- Purpose-specific metrics
- Time introduction
- Hierarchical structuring
- Cause-effect introduction

What’s next?

Other applications:
- Care gaps detection
- Risk/cost management
- Diagnosis recommendation by pattern
- Intervention recommendation


1. Patient Similarity on office laptop
