IM Symposium Presentation - OCR and Text Analytics for Medical Chart Review Process
1. OCR and Text Analytics for
Medical Chart Review Process
Alex Zeltov
Darwin Leung
Ravi Chawla
Somesh Nigam
2. BIOGRAPHY
Alex Zeltov
Research Scientist, Advanced Analytics
Independence Blue Cross
Leads the development and research of Big Data initiatives
and predictive analytics across the Informatics Division for
Independence Blue Cross.
Contact Info:
Phone:215.241.9885
Email: alex.zeltov@ibx.com
3. BIOGRAPHY
Darwin Leung
Director, Informatics Application Development and
Operations
Independence Blue Cross
Responsible for the development of analytical applications
across the Informatics Division for Independence Blue
Cross.
Contact Info:
Phone:215.241.2255
Email: darwin.leung@ibx.com
4. Background on Text Analytics and
Medical Documents
Providers have different levels of technology readiness –
varying from Electronic Medical Records (EMR) to paper
charts.
We want to apply text analytics to all information available for
different business cases.
We need to bring all collected information to a level where our
technologies can be applied.
5. OCR for Medical Documents
OCR (Optical Character Recognition) for medical documents
delivers substantial benefits in cost savings and productivity.
High Speed Provided by OCR
OCR software can achieve accuracy rates comparable to manual
data entry, but in a fraction of the time.
6. OCR + Text Analytics Process
IMG/PDF/TIF files → DropBox (share) → ImageMagick + OCR →
store text + PDF version of the EMR in the HADOOP cluster →
Text Analytics / NLP processing (driven by a Clinical Ontology
and Predictive Models) → Results (DB)
7. Custom Distributed OCR Application:
High Performance distributed OCR process runs in the background,
sharing resources with the Informatics Big Data HADOOP cluster.
Customized open source tools used in the OCR process:
• Custom distribution and parallelization framework for OCR
• PDFtk: for normalizing pdf headers and splitting up the PDF
pages
• ImageMagick: used to resize, rotate, increase dpi, apply
various special effects to enhance quality of images. Creates
an image version of the pdf (single page).
• Tesseract OCR:
• extracts the text from the image file and generates a text file
• generates searchable PDFs by creating meta-data in the original
PDF image files
8. OCR Performance Statistics
Per Each Server Node:
• Image Enhancement and Document Slicing + OCR: ≈ 2 sec/pg
• 1,800 pages/hr on 1 node
18 HADOOP cluster nodes run the OCR process in parallel:
• 32,400 pages/hr on cluster
• Assuming a typical chart of 100 pages: ≈ 324 charts/hr
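The cluster throughput follows directly from the per-page timing; a quick check of the arithmetic:

```python
# Throughput arithmetic from the figures on this slide.
SECONDS_PER_PAGE = 2       # image enhancement + slicing + OCR
NODES = 18                 # HADOOP cluster nodes running OCR in parallel
PAGES_PER_CHART = 100      # typical chart length

pages_per_hour_per_node = 3600 // SECONDS_PER_PAGE            # 1,800
pages_per_hour_cluster = pages_per_hour_per_node * NODES      # 32,400
charts_per_hour = pages_per_hour_cluster // PAGES_PER_CHART   # 324
```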
9. Text Analytics Components:
Custom text analysis code using Java and Python
• Lucene – tokenization, shingles, n-gramming
• Weka - collection of machine learning algorithms for data mining.
• Advanced Query Language (AQL) - powerful text analytics engine
developed by IBM and used by IBM Watson. Executes extractors in a highly
efficient manner by using the parallelism provided by Informatics HADOOP
platform.
• OpenNLP - hosts a variety of java-based NLP tools which perform sentence
detection, tokenization, part-of-speech tagging, chunking and parsing,
named-entity detection.
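Lucene's shingle filter chains adjacent tokens into word n-grams; the same idea in a few lines of illustrative Python (the sample sentence is made up):

```python
def shingles(tokens, n=2):
    """Word n-grams ('shingles'): every run of n adjacent tokens, space-joined."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "patient denies chest pain".split()
# shingles(tokens, 2) -> ['patient denies', 'denies chest', 'chest pain']
```

Shingles let multi-word ontology terms like "chest pain" match as a unit rather than as two unrelated tokens.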
10. Ontology and Preprocessing
Ontology terms flow from the Clinical Ontology DB repo through a
preprocessing pipeline into Hadoop GPFS, where the Hadoop text
analytics MR jobs consume them:
• Load ontology terms per medical condition
• Tokenize
• Stop word filters
• N-gram / shingles
• Stemming
• Generate token permutations
• Store intermediate ontology tokens per job type on Hadoop GPFS
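The preprocessing steps above (tokenize, filter stop words, stem, permute) can be sketched as follows; the tiny stop-word list and the toy suffix-stripping "stemmer" are stand-ins for the real components, not the production rules:

```python
from itertools import permutations

STOP_WORDS = {"of", "the", "and"}  # illustrative stand-in for a real stop list

def stem(token):
    # Toy suffix stripper standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def ontology_term_tokens(term):
    """Tokenize an ontology term, drop stop words, stem, and emit every
    token ordering so chart text matches regardless of word order."""
    tokens = [stem(t) for t in term.lower().split() if t not in STOP_WORDS]
    return {" ".join(p) for p in permutations(tokens)}

# ontology_term_tokens("fracture of the femur")
#   -> {'fracture femur', 'femur fracture'}
```

The permutation set is what gets written per job type as the intermediate ontology tokens.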
11. HADOOP
• The HADOOP framework is a mechanism for analyzing huge
datasets, which do not have to be housed in a datastore.
• HADOOP scales out to myriad nodes and can handle all
of the activity and coordination related to data
processing.
• HADOOP MapReduce is a way to process large data
sets by distributing the work across a large number of
nodes.
12. HADOOP Components:
• Common – contains libraries and utilities needed by other Hadoop
modules.
• Hadoop Distributed File System (HDFS)
– Distributed file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster.
– HDFS creates multiple replicas of each data block and
distributes them on computers throughout a cluster to enable
reliable and rapid access.
• MapReduce – a programming model for large scale data
processing.
13. HADOOP Components:
• HBase – is a distributed, column-oriented NoSQL database.
• Hive – is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets.
• Sqoop – is a tool designed for efficiently transferring bulk data between
Hadoop and structured datastores such as relational databases.
• Pig – Scripting platform.
• Oozie – Workflow scheduler.
• Zookeeper – Cluster coordination.
• Mahout – Machine learning library.
14. Map Reduce
Map Reduce is a way to process large data sets by distributing the
work across a large number of nodes
• Map:
o The master node partitions the input into smaller sub-problems
o It distributes the sub-problems to the worker nodes
o Worker nodes may repeat the process, splitting the work further
• Reduce:
o The master node then takes the answers to all the sub-problems
o It combines them in some way to produce the output
15. Map Reduce - Word Count Example
http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
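The word-count example linked above can be condensed into a runnable Python sketch of the map and reduce phases (the framework's shuffle/sort step is simulated here with a dictionary):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for each word in an input split.
    return [(word, 1) for word in line.split()]

def reduce_phase(mapped):
    # Shuffle + Reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, one in mapped:
        counts[word] += one
    return dict(counts)

lines = ["chart review chart", "review process"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(mapped)
# word_counts -> {'chart': 2, 'review': 2, 'process': 1}
```

In real Hadoop the mapped pairs are partitioned across reducers by key; the logic per key is the same.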
16. Business Cases
Product Recall
Entity Extraction from Medical Charts
Nurse Chart Review Process
18. Business Case 1: Product Recall
• The text mining process helps identify manufacturers that are
on the recall list.
• Scheduled report alerts flag potential members that match the
recall manufacturers.
• Creates a database of extracted patient and manufacturer
information.
• The OCR + text mining process analyzes charts that average
300+ pages.
19. Business Case 1: Product Recall
• Generated reports on the OCR results
• BigSheets - Web-based spreadsheet look and feel
20. Business Case 1: Entity Extraction
• Generated reports on the Entity Extraction results
• Creates a database of extracted entity information accessible
via JDBC/ODBC.
21. Business Case 2: Nurse Chart Review
Process
• The text mining process helps identify conditions and
diagnoses, based on medical ontology matches, for the
nurse review.
• The text analytics prioritizes the charts for nurse review: the
highest-scored EMR charts are presented first in the nurse
review process.
• The nurse can open the text version of the chart, created as
part of the OCR process, at the exact location of the matched
terms in the scanned version of the chart.
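Ranking charts by text-analytics score, as described above, reduces to a sort; a minimal sketch with made-up chart IDs and scores:

```python
def prioritize_charts(scored_charts):
    """Order charts so the highest text-analytics score is reviewed first."""
    return sorted(scored_charts, key=lambda c: c["score"], reverse=True)

queue = prioritize_charts([
    {"chart_id": "A100", "score": 3},   # few ontology matches
    {"chart_id": "B200", "score": 9},   # many matches -> review first
    {"chart_id": "C300", "score": 5},
])
# queue[0]["chart_id"] -> 'B200'
```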
22. Summary
OCR software
It can operate at high speed and process batches of medical
documents in various formats (JPG, TIFF, GIF, PDF, etc.)
The extracted text can be stored in a database and then used for
analytics, predictive modeling and data mining
This technology delivers substantial benefits in cost savings and
productivity.
26. AQL: Advanced Text Analytics
• Powerful Text Analytics engine developed by IBM and used by IBM
Watson on the Jeopardy quiz show.
• A declarative Annotation Query Language (AQL) with familiar
SQL-like syntax for specifying text analytics extraction programs (or
extractors) with rich, clean rule semantics.
• A runtime engine for executing extractors in a highly efficient
manner by using the parallelism provided by the IBM InfoSphere
BigInsights engine using HADOOP platform.
• Built-in multilingual support for tokenization and part-of-speech
analysis.
• The text analytics system extracts information from unstructured and
semi-structured data.
28. Sample AQL
/* Dictionary of minor conditions */
create dictionary minorConditions
from file 'minorConditions.dict'
with language as 'en';
/* Dictionary of major conditions */
create dictionary majorConditions
from file 'majorConditions.dict'
with language as 'en';
/* Extract instances of minor conditions and 'score' 1 for each instance */
create view minor as
extract 1 as disposition,
dictionary 'minorConditions' on R.text as match
from Document R;
/* Extract instances of major conditions and 'score' 2 for each instance */
create view major as
extract 2 as disposition,
dictionary 'majorConditions' on R.text as match
from Document R;
/* Union together all instances */
create view RawDisposition as
(select * from minor)
union all
(select * from major);
/* Aggregate per document score */
create view ConsolidatedDisposition as
select Sum(R.disposition) as disposition
from RawDisposition R;
export view ConsolidatedDisposition;
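To make the extractor's scoring concrete, here is the same minor=1 / major=2 logic in illustrative Python; the two-entry sets are stand-ins for the contents of `minorConditions.dict` and `majorConditions.dict`:

```python
MINOR = {"rash", "cough"}      # stand-in for minorConditions.dict
MAJOR = {"sepsis", "stroke"}   # stand-in for majorConditions.dict

def consolidated_disposition(text):
    """Score 1 per minor-condition match and 2 per major-condition match,
    summed over the document -- mirroring the AQL views above."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    return (sum(1 for t in tokens if t in MINOR)
            + sum(2 for t in tokens if t in MAJOR))

score = consolidated_disposition("Persistent cough, rule out stroke")
# score -> 3  (cough: 1, stroke: 2)
```

The AQL version gains over this sketch by running the dictionary matching in parallel across the HADOOP cluster.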
Introduction of Darwin and Alex
Darwin: Responsible for Informatics Solutions Development and Data Warehouse Operations
Alex: Big Data Solutions Architect, Research Scientist
Informatics: a centralized end-to-end development and service unit that provides data architecture, DW development, BI solutions development, and business support across the enterprise: UW, Actuary, Finance, Sales & Marketing
Over the years, we have been very successful collecting and storing data from various data sources; building DW/ETL/BI tools to channel information to our business constituents
But that is only scratching the surface in terms of data collection.
In our business, there are still paper, charts, PDFs, images, photos, faxes that contain vital information that we have not tapped into
Today, we are presenting how we use big data technologies to apply text analytics to unstructured documents
As we know, providers have different levels of electronic readiness – we still receive a large amount of documents as faxes, PDFs, or photocopies
We want to apply text analytics to all information available from these unstructured documents – not only the native electronic forms such as claims submissions
We want to make this unstructured information available to all of our BI solutions
Before the use of OCR for medical documents, manual data entry was used to capture data from medical records.
This was a time-consuming and costly process, as data entry teams spent thousands of hours capturing data and ensuring its accuracy.
This manual process also meant that large data entry teams had to be maintained and paid.
The OCR installation can process documents faster than an entire data entry team, and since it can operate 24/7, documents can enter a database faster and be used for text analytics and data mining.
Unstructured documents such as patient records, test results, patient activities, and historical medical records can all be scanned and converted into a text-searchable format.
This ultimately results not only in faster conversion of documents into a text-searchable format; the resulting data set can also be pushed into databases, something not possible with raw scanned documents.
Apache's Hadoop framework is essentially a mechanism for analyzing huge datasets, which do not necessarily need to be housed in a datastore. Hadoop abstracts MapReduce's massive data-analysis engine, making it more accessible to developers. Hadoop scales out to myriad nodes and can handle all of the activity and coordination related to data sorting.
One of the enabling technologies of the big data revolution is MapReduce, a programming model and implementation developed by Google for processing massive-scale, distributed data sets.