Near Duplicate Detection for Medical Imaging Data Warehouse Construction
Introduction
Distributed Near Duplicate Detection
●
Integrate medical data from various heterogeneous medical data sources and private
archives using the public APIs.
●
Curate the integrated data into a data warehouse for public access.
●
Store the detected duplicate pairs into a separate data source.
●
Duplicate detection by analyzing the potential data pairs from the original data sources,
using similarity matrices for textual data.
●
Hierarchical metadata attached to the binary medical data is used to identify, classify, and
find duplicates among the binary raw data.
●
Considers the inconsistencies in representation.
– Usage of acronyms instead of the full form of the attributes.
– Using different measurement units.
●
Data is published to various data sources by the medical data publishers
– through the respective write APIs of the data sources.
●
Connects to the original data sources through their read APIs.
●
Output of consolidated data and duplicate pairs
– stored through the relevant write APIs.
●
Medical data consumers consume the data from the warehouse composed by MediCurator
through its read API.
●
The data warehouse is considered to be free of duplicates
– subject to false positives and false negatives,
– depending on the effectiveness of the similarity matrices and similarity join algorithms used.
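The handling of representation inconsistencies described above can be illustrated with a minimal sketch. It is not MediCurator's implementation: the acronym table, the unit table, the record identifiers, and the 0.8 threshold are all assumptions made for illustration, and the plain all-pairs comparison stands in for an efficient similarity join.

```python
import re
from itertools import combinations

# Hypothetical lookup tables; real ones would be curated per data source.
ACRONYMS = {"mri": "magnetic resonance imaging", "ct": "computed tomography"}
UNIT_FACTORS_TO_MM = {"mm": 1.0, "cm": 10.0}

def normalize(text):
    """Lower-case, convert lengths to millimetres, and expand known acronyms."""
    text = text.lower()

    def to_mm(match):
        # Rewrite "<number><unit>" in millimetres so 0.5 cm and 5 mm agree.
        value = float(match.group(1)) * UNIT_FACTORS_TO_MM[match.group(2)]
        return f"{value:g} mm"

    text = re.sub(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", to_mm, text)
    tokens = re.findall(r"[a-z0-9]+(?:\.\d+)?", text)
    expanded = []
    for token in tokens:
        expanded.extend(ACRONYMS.get(token, token).split())
    return set(expanded)

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_near_duplicates(records, threshold=0.8):
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    tokens = {rid: normalize(text) for rid, text in records.items()}
    pairs = []
    for a, b in combinations(sorted(tokens), 2):
        score = jaccard(tokens[a], tokens[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs

records = {
    "src1:1": "Brain MRI, slice 0.5 cm",
    "src2:7": "Brain magnetic resonance imaging slice 5mm",
    "src2:9": "Chest CT 2 cm",
}
duplicate_pairs = find_near_duplicates(records)
```

The returned pairs can be written to a separate data source, while only one record of each pair goes into the warehouse.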
References
●
Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-
duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.
●
Kathiravelu, P., Galhardas, H., & Veiga, L. (2015). ∂u∂u Multi-Tenanted Framework:
Distributed Near Duplicate Detection for Big Data. On the Move to Meaningful Internet
Systems: OTM 2015 Conferences, 237-256. Springer International Publishing.
●
Kathiravelu, P., & Sharma, A. (2015). MEDIator: A Data Sharing Synchronization
Platform for Heterogeneous Medical Image Archives. Workshop on Connected Health at Big
Data Era (BigCHat'15), co-located with the 21st ACM SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD 2015). ACM.
●
Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S.,
Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA):
Maintaining and Operating a Public Information Repository. Journal of Digital Imaging,
26(6), 1045-1057.
●
Hazelcast for distributed near duplicate detection.
●
Metadata attached to the binary images in medical image archives
– The Cancer Imaging Archive (TCIA)
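As a rough illustration of comparing binary images through their attached metadata, the sketch below flattens a nested metadata record and scores two records by the share of attributes that agree. The collection, patient, study, and series levels follow the way TCIA organizes images; the attribute names and values are hypothetical, and the scoring rule is an assumption rather than the measure MediCurator uses.

```python
def flatten(metadata, prefix=""):
    """Turn nested metadata into {"level/attribute": normalized value} pairs."""
    flat = {}
    for key, value in metadata.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = str(value).strip().lower()
    return flat

def metadata_similarity(meta_a, meta_b):
    """Fraction of attribute paths (over the union) whose values agree."""
    flat_a, flat_b = flatten(meta_a), flatten(meta_b)
    paths = set(flat_a) | set(flat_b)
    if not paths:
        return 1.0
    matches = sum(
        1 for p in paths
        if p in flat_a and p in flat_b and flat_a[p] == flat_b[p]
    )
    return matches / len(paths)

# Hypothetical metadata for three images; no pixel data is read at any point.
image_a = {
    "collection": "COLLECTION-A",
    "patient": {"id": "P-001"},
    "study": {"date": "2015-01-01"},
    "series": {"modality": "MR", "description": "Axial T1"},
}
image_b = {
    "collection": "collection-a",
    "patient": {"id": "P-001"},
    "study": {"date": "2015-01-01"},
    "series": {"modality": "MR", "description": "axial t1 "},
}
image_c = {
    "collection": "COLLECTION-A",
    "patient": {"id": "P-002"},
    "study": {"date": "2015-01-01"},
    "series": {"modality": "CT", "description": "Axial T1"},
}
```

Here image_a and image_b differ only in letter case and whitespace, so they score as identical, while image_c disagrees on two of the five attributes.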
Pradeeban Kathiravelu, Ashish Sharma
●
Medical data warehouses and image archives are constructed by integrating multiple private
and public data sources.
●
Finding almost identical entries is crucial for warehouse construction.
●
Medical image archives are huge and consist of structured and hierarchical data, which may
be accessed by querying the metadata.
●
Existing solutions tend to be too specific.
– For example, a Master Patient Index (MPI) covers only patient records.
●
Multiple dimensions and attributes
– including medications, clinical, and pathological data
– should be considered for a complete duplicate detection and elimination.
●
MediCurator is a near duplicate detection framework for heterogeneous medical data
sources in constructing data warehouses.
●
MediCurator has been developed to retrieve medical data from
– various data sources, including MySQL, MongoDB, and CSV files, and
– medical image archives such as TCIA.
●
MediCurator fits as part of the ETL process.
– Duplicates are detected in-memory.
– Merged data is stored in data warehouses hosted in the Hadoop Distributed File System
(HDFS).
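The place of duplicate detection within ETL can be sketched as follows, under several assumptions: CSV text and a list of dictionaries stand in for the real sources, the field names patient and modality are hypothetical, Python's difflib supplies the string similarity, and the load step into HDFS is left out.

```python
import csv
import io
from difflib import SequenceMatcher

def extract_csv(text, source):
    """Extract: read rows from CSV text and tag each with its source name."""
    return [dict(row, source=source) for row in csv.DictReader(io.StringIO(text))]

def record_key(record):
    """Comparable string built from the attributes being matched."""
    return f"{record['patient']} {record['modality']}".lower()

def is_near_duplicate(a, b, threshold):
    """True when the two record keys are at least `threshold` similar."""
    return SequenceMatcher(None, record_key(a), record_key(b)).ratio() >= threshold

def transform(records, threshold=0.9):
    """Transform: keep the first record of each near-duplicate group, in memory."""
    merged, duplicate_pairs = [], []
    for record in records:
        match = next(
            (kept for kept in merged if is_near_duplicate(kept, record, threshold)),
            None,
        )
        if match is None:
            merged.append(record)
        else:
            duplicate_pairs.append((match, record))
    return merged, duplicate_pairs

csv_text = "patient,modality\nJohn Smith,MR\nJane Doe,CT\n"
other_source = [{"patient": "Jon Smith", "modality": "MR", "source": "document-store"}]

records = extract_csv(csv_text, "csv") + other_source
merged, duplicate_pairs = transform(records)
# Load: `merged` would go to the warehouse and `duplicate_pairs` to a separate store.
```

Keeping the first record of a group is the simplest merge policy; a real pipeline might instead combine attributes from both records.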
MediCurator Approach
Design
Implementation
●
A prototype has been implemented.
– Hazelcast as the distributed execution framework.
– Distributed execution of research near duplicate detection algorithms on metadata.
– A ten-fold speed-up compared to existing solutions such as MPI systems.
●
MediCurator functions as an integration middleware
– for data warehouse construction
– with duplicate detection and elimination
– from the raw textual medical data, or from the binary data by leveraging the metadata
attached to it.
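The distributed execution pattern can be approximated on a single machine by splitting the candidate pairs across workers. The sketch below uses Python's standard thread pool purely as a stand-in for cluster members; it does not use or mirror the Hazelcast API, and the similarity function, worker count, and threshold are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def token_similarity(a, b):
    """Jaccard similarity over whitespace-separated, lower-cased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def compare_chunk(chunk, records, threshold):
    """Work done by one worker: score its share of the candidate pairs."""
    return [
        (a, b) for a, b in chunk
        if token_similarity(records[a], records[b]) >= threshold
    ]

def distributed_duplicates(records, workers=4, threshold=0.8):
    """Partition all candidate pairs across workers and gather the matches."""
    pairs = list(combinations(sorted(records), 2))
    # Round-robin partitioning gives each worker a similar number of pairs.
    chunks = [pairs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            compare_chunk, chunks, [records] * workers, [threshold] * workers
        ))
    return sorted(pair for chunk_result in results for pair in chunk_result)

metadata = {
    "a": "axial t1 brain mr",
    "b": "Axial T1 Brain MR",
    "c": "chest ct contrast",
}
matches = distributed_duplicates(metadata)
```

Threads only demonstrate the partition-and-gather structure here; any real speed-up depends on spreading the chunks over separate processes or machines.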
{pkathi2, ashish.sharma}@emory.edu
Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA.
Acknowledgments
* Google Summer of Code 2015
* NCI U01 [1U01CA187013-01], Resources for development and validation of
Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory)