OCR and Text Analytics for
Medical Chart Review Process
Alex Zeltov
Darwin Leung
Ravi Chawla
Somesh Nigam
BIOGRAPHY
Alex Zeltov
• Research Scientist, Advanced Analytics
• Independence Blue Cross
• Leads the development and research of Big Data initiatives
and predictive analytics across the Informatics Division for
Independence Blue Cross.
• Contact Info:
Phone: 215.241.9885
Email: alex.zeltov@ibx.com
BIOGRAPHY
Darwin Leung
• Director, Informatics Application Development and
Operations
• Independence Blue Cross
• Responsible for the development of analytical applications
across the Informatics Division for Independence Blue
Cross.
• Contact Info:
Phone: 215.241.2255
Email: darwin.leung@ibx.com
Background on Text Analytics and
Medical Documents
• Providers have different levels of technology readiness,
ranging from Electronic Medical Records (EMR) to paper
charts.
• We want to apply text analytics to all available information
across different business cases.
• All collected information must first be brought to a form in
which our technologies can be applied.
OCR for medical documents
• OCR (Optical Character Recognition) is useful for medical
documents because it provides benefits in terms of cost savings
and increased productivity.
• High speed provided by OCR
– OCR software can reach accuracy rates comparable to
manual data entry in a fraction of the time
OCR + Text Analytics Process
[Diagram: IMG/PDF/TIF files in DropBox (Share) → ImageMagick + OCR → HADOOP cluster (stores text + PDF version of EMR) → Text Analytics / NLP processing, using Clinical Ontology and Predictive Models → Results, DB]
Custom Distributed OCR Application:
A high-performance distributed OCR process runs in the background,
sharing resources with the Informatics Big Data HADOOP cluster.
Customized open source tools used in the OCR process:
• Custom distribution and parallelization framework for OCR
• PDFtk: normalizes PDF headers and splits the PDF into
pages
• ImageMagick: resizes, rotates, increases DPI, and applies
various special effects to enhance image quality. Creates
an image version of each PDF page.
• Tesseract OCR:
– extracts the text from the image file and generates a text file
– generates searchable PDFs by creating meta-data in the original PDF
image files
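The tool chain above could be driven from a small script. The following is a minimal sketch in Python that only builds the command lines for one chart. The file naming scheme, the 300 dpi density, the deskew threshold, the output base name, and the helper names are assumptions for illustration; they are not taken from the presenters' custom framework.

```python
from pathlib import Path


def split_command(chart_pdf: str, workdir: str) -> list[str]:
    """PDFtk 'burst' writes one PDF per page (pg_0001.pdf, pg_0002.pdf, ...)."""
    return ["pdftk", chart_pdf, "burst", "output", str(Path(workdir) / "pg_%04d.pdf")]


def page_commands(workdir: str, page: int) -> list[list[str]]:
    """Commands that turn one split page into text and a searchable PDF."""
    base = Path(workdir) / f"pg_{page:04d}"
    page_pdf = f"{base}.pdf"
    page_tif = f"{base}.tif"
    return [
        # ImageMagick: rasterize at an assumed 300 dpi and straighten the scan.
        ["convert", "-density", "300", page_pdf, "-deskew", "40%", page_tif],
        # Tesseract: the 'txt' and 'pdf' configs request <base>_ocr.txt and <base>_ocr.pdf.
        ["tesseract", page_tif, f"{base}_ocr", "txt", "pdf"],
    ]
```

On a node where the three tools are installed, each returned list could be passed to subprocess.run(cmd, check=True). Building the commands separately from running them keeps the per-page work easy to hand out to parallel workers.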
OCR Performance Statistics
Per server node:
• Image Enhancement and Document Slicing + OCR: ≈ 2 sec/pg
• 1,800 pages/hr on 1 node
18 HADOOP cluster nodes running the OCR process in parallel:
• 32,400 pages/hr on cluster
• Assuming a typical chart of 100 pages: ≈ 324 charts/hr
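The figures above follow from simple arithmetic, sketched here in Python. The function name is hypothetical, and the calculation assumes perfectly linear scaling across nodes with no coordination overhead, which is what the slide's numbers imply.

```python
def ocr_throughput(seconds_per_page: float, nodes: int, pages_per_chart: int) -> dict:
    """Derive hourly throughput from the per-page OCR time."""
    pages_per_node_hour = 3600 / seconds_per_page          # 3600 s per hour
    pages_per_cluster_hour = pages_per_node_hour * nodes   # assumes linear scaling
    charts_per_hour = pages_per_cluster_hour / pages_per_chart
    return {
        "pages_per_node_hour": pages_per_node_hour,
        "pages_per_cluster_hour": pages_per_cluster_hour,
        "charts_per_hour": charts_per_hour,
    }


stats = ocr_throughput(seconds_per_page=2, nodes=18, pages_per_chart=100)
```

With 2 sec/pg, 18 nodes, and 100 pages per chart, this reproduces the 1,800 pages/hr, 32,400 pages/hr, and 324 charts/hr quoted above.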
Text Analytics Components:
Custom text analysis code using Java and Python
• Lucene – tokenization, shingles, n-gramming
• Weka – collection of machine learning algorithms for data mining.
• Annotation Query Language (AQL) – query language of the text analytics engine
developed by IBM and used by IBM Watson. Extractors are executed in a highly
efficient manner by using the parallelism provided by the Informatics HADOOP
platform.
• OpenNLP – hosts a variety of Java-based NLP tools that perform sentence
detection, tokenization, part-of-speech tagging, chunking and parsing,
and named-entity detection.
Ontology and Preprocessing
[Diagram: Clinical Ontology DB Repo → Load Ontology Terms Per Medical Condition → Tokenize → Stop Word Filters → Ngram / Shingles → Stemming → Generate Token Permutations → Intermediate Ontology Tokens Per Job Type (Hadoop GPFS) → Hadoop Text Analytics MR Jobs]
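The preprocessing steps named in this diagram can be sketched in plain Python. This is only an illustration: the presenters used Lucene for these steps, the stop-word list here is a tiny assumed sample, the function names are hypothetical, and stemming is left out because it needs a real stemmer.

```python
import re
from itertools import permutations

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"of", "the", "and", "with", "in"}


def tokenize(term: str) -> list[str]:
    """Lowercase the term and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", term.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that carry no clinical meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def shingles(tokens: list[str], size: int) -> list[str]:
    """Word n-grams: every run of `size` consecutive tokens."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def token_permutations(tokens: list[str]) -> list[str]:
    """All orderings of the tokens, so reordered phrasings still match."""
    return [" ".join(p) for p in permutations(tokens)]


tokens = remove_stop_words(tokenize("Failure of the Heart"))
variants = token_permutations(tokens)
```

Here the ontology term "Failure of the Heart" reduces to the tokens "failure" and "heart", and the permutation step yields both "failure heart" and "heart failure", which is why a chart that uses the second wording can still match.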
HADOOP
• The HADOOP framework is a mechanism for analyzing huge
datasets, which do not have to be housed in a datastore.
• HADOOP scales out to many nodes and handles
the activity and coordination related to data
processing.
• HADOOP Map Reduce is a way to process large data
sets by distributing the work across a large number of
nodes.
HADOOP Components:
• Common – contains libraries and utilities needed by other Hadoop
modules.
• Hadoop Distributed File System (HDFS)
– Distributed file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster.
– HDFS creates multiple replicas of each data block and
distributes them on computers throughout a cluster to enable
reliable and rapid access.
• MapReduce – a programming model for large scale data
processing.
HADOOP Components:
• HBase – a distributed, column-oriented NoSQL database.
• Hive – a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets.
• Sqoop – a tool designed for efficiently transferring bulk data between
Hadoop and structured datastores such as relational databases.
• Pig – Scripting platform.
• Oozie – Workflow scheduler.
• ZooKeeper – Cluster coordination.
• Mahout – Machine learning library.
Map Reduce
Map Reduce is a way to process large data sets by distributing the
work across a large number of nodes.
• Map:
o Master node partitions the input into smaller sub-problems
o Distributes the sub-problems to the worker nodes
o A worker node may in turn partition its sub-problem further
• Reduce:
o Master node then collects the answers to all the sub-problems
o Combines them in some way to form the output
Map Reduce - Word Count Example
http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
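As a rough illustration of the word count example, the sketch below simulates the map, shuffle, and reduce phases in a single Python process. It is not Hadoop code. The function names are hypothetical, and on a real cluster the map and reduce calls would run on different worker nodes.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: combine the values for one key into a total."""
    return word, sum(counts)


def word_count(lines: list[str]) -> dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


totals = word_count(["deer bear river", "car car river", "deer car bear"])
```

Each input line stands in for one split of the input; because every map call only looks at its own line, the map work can be distributed without coordination until the shuffle step.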
Business Cases
• Product Recall
• Entity Extraction from Medical Charts
• Nurse Chart Review Process
Business Case 1: Product Recall
• The text mining process helps identify the manufacturers that
are on the recall list.
• Scheduled reports alert on potentially affected members whose charts
match the recalled manufacturers.
• Creates a database of extracted patient and manufacturer
information.
• The OCR + text mining process analyzes charts that are 300+ pages
long on average.
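A minimal sketch of this flow in Python follows: match chart text against a recall list, then store the hits in a database that a scheduled report could query. The manufacturer names, member IDs, table name, and function names are all hypothetical, and SQLite stands in for whatever database the presenters actually used.

```python
import re
import sqlite3

# Hypothetical recall list; real entries would come from the recall notice.
RECALL_LIST = ["Acme Orthopedics", "Globex Medical"]


def find_recalled(text: str, recall_list: list[str]) -> list[str]:
    """Return the recalled manufacturers mentioned in one chart's OCR text."""
    return [
        name for name in recall_list
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
    ]


def load_matches(conn: sqlite3.Connection, charts: dict[str, str],
                 recall_list: list[str]) -> int:
    """Store one (member_id, manufacturer) row per match; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recall_match (member_id TEXT, manufacturer TEXT)"
    )
    rows = [
        (member_id, name)
        for member_id, text in charts.items()
        for name in find_recalled(text, recall_list)
    ]
    conn.executemany("INSERT INTO recall_match VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

A scheduled report would then be a query such as SELECT member_id, manufacturer FROM recall_match, run against the stored matches rather than against the charts themselves.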
Business Case 1: Product Recall
• Generated reports on the OCR results
• BigSheets - Web-based spreadsheet look and feel
Business Case 1: Entity Extraction
• Generated reports on the Entity Extraction results
• Create a database of extracted entity information accessible
via JDBC/ODBC.
Business Case 2: Nurse Chart Review
Process
• The text mining process helps identify conditions and
diagnoses for the nurse review, based on matches against the
medical ontology.
• The text analytics prioritizes the charts for nurse review: the
highest-scored EMR charts are presented first in the
review process.
• The nurse can open the text version of the chart
that was created as part of the OCR process at the exact
location of the matched terms in the scanned version of the chart.
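The prioritization and match-location steps can be sketched as follows. This is a simplified illustration: the score here is just the number of ontology term matches, since the actual scoring model is not described in the slides, and the function names and sample charts are hypothetical.

```python
import re


def find_matches(text: str, terms: list[str]) -> list[tuple[str, int]]:
    """Return (term, character offset) for every ontology term hit, in text order."""
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((term, m.start()))
    return sorted(hits, key=lambda hit: hit[1])


def prioritize(charts: dict[str, str], terms: list[str]) -> list[tuple[str, list]]:
    """Order charts so the ones with the most matches are reviewed first."""
    scored = [(chart_id, find_matches(text, terms)) for chart_id, text in charts.items()]
    return sorted(scored, key=lambda item: len(item[1]), reverse=True)


charts = {
    "A": "History of asthma.",
    "B": "Diabetes noted. Patient has diabetes and asthma.",
}
queue = prioritize(charts, ["diabetes", "asthma"])
```

Chart B, with three matches, comes ahead of chart A in the queue. The stored offsets are what would let a viewer jump to the matched term in the text version of the chart.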
Summary
• OCR software
– It can operate at high speeds and often can process batches of
medical documents in various formats (jpg, tiff, gif, pdf, etc.)
– The text data can be stored in a database and then be used for
analytics, predictive modeling and data mining
• This technology provides invaluable benefits in terms of cost
savings and productivity.
Q & A
Appendix
• HADOOP Ecosystem
• AQL
HADOOP
Ecosystem
AQL: Advanced Text Analytics
• Powerful text analytics engine developed by IBM and used by IBM
Watson on the Jeopardy! quiz show.
• A declarative Annotation Query Language (AQL) with familiar SQL-like
syntax for specifying text analytics extraction programs (or
extractors) with rich, clean rule semantics.
• A runtime engine that executes extractors in a highly efficient
manner by using the parallelism provided by the IBM InfoSphere
BigInsights engine on the HADOOP platform.
• Built-in multilingual support for tokenization and part-of-speech
analysis.
• The text analytics system extracts information from unstructured and
semi-structured data.
AQL
Sample AQL
/* Dictionary of minor conditions */
create dictionary minorConditions
    from file 'minorConditions.dict'
    with language as 'en';

/* Dictionary of major conditions */
create dictionary majorConditions
    from file 'majorConditions.dict'
    with language as 'en';

/* Extract instances of minor conditions and 'score' 1 for each instance */
create view minor as
    extract 1 as disposition,
        dictionary 'minorConditions' on R.text as match
    from Document R;

/* Extract instances of major conditions and 'score' 2 for each instance */
create view major as
    extract 2 as disposition,
        dictionary 'majorConditions' on R.text as match
    from Document R;

/* Union together all instances */
create view RawDisposition as
    (select * from minor)
    union all
    (select * from major);

/* Aggregate per document score */
create view ConsolidatedDisposition as
    select Sum(R.disposition) as disposition
    from RawDisposition R;

export view ConsolidatedDisposition;
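For readers unfamiliar with AQL, the same scoring logic can be approximated in Python. This is a rough analogue and not how the AQL runtime works: the dictionary contents are hypothetical (the .dict files are not shown), and simple case-insensitive regular-expression matching stands in for AQL's dictionary matching.

```python
import re


def extract(text: str, dictionary: list[str], disposition: int) -> list[dict]:
    """Analogue of 'create view minor/major': one row per dictionary match."""
    rows = []
    for entry in dictionary:
        pattern = r"\b" + re.escape(entry) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            rows.append({"disposition": disposition, "match": m.group(0)})
    return rows


def consolidated_disposition(text: str, minor: list[str], major: list[str]) -> int:
    """Analogue of RawDisposition (union all) and ConsolidatedDisposition (Sum)."""
    raw = extract(text, minor, 1) + extract(text, major, 2)
    return sum(row["disposition"] for row in raw)


# Hypothetical dictionary entries and chart text.
score = consolidated_disposition(
    "Ankle sprain last year. Diagnosed with heart failure; sprain resolved.",
    minor=["sprain"],
    major=["heart failure"],
)
```

The sample text contains two minor matches and one major match, so its per-document score is 1 + 1 + 2 = 4.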
Developing/Testing AQL query
Entity Integration
END

 
Call Girls Chandigarh 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
Call Girls Chandigarh 👙 7001035870 👙 Genuine WhatsApp Number for Real MeetCall Girls Chandigarh 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
Call Girls Chandigarh 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
 
Jaipur Call Girls 9257276172 Call Girl in Jaipur Rajasthan
Jaipur Call Girls 9257276172 Call Girl in Jaipur RajasthanJaipur Call Girls 9257276172 Call Girl in Jaipur Rajasthan
Jaipur Call Girls 9257276172 Call Girl in Jaipur Rajasthan
 
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In FaridabadCall Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
 
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
 
Krishnagiri call girls Tamil aunty 7877702510
Krishnagiri call girls Tamil aunty 7877702510Krishnagiri call girls Tamil aunty 7877702510
Krishnagiri call girls Tamil aunty 7877702510
 
Call Now ☎ 9999965857 !! Call Girls in Hauz Khas Escort Service Delhi N.C.R.
Call Now ☎ 9999965857 !! Call Girls in Hauz Khas Escort Service Delhi N.C.R.Call Now ☎ 9999965857 !! Call Girls in Hauz Khas Escort Service Delhi N.C.R.
Call Now ☎ 9999965857 !! Call Girls in Hauz Khas Escort Service Delhi N.C.R.
 
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Service In Goa 💋 9316020077💋 Goa Call Girls By Russian Call Girl...
Call Girls Service In Goa  💋 9316020077💋 Goa Call Girls  By Russian Call Girl...Call Girls Service In Goa  💋 9316020077💋 Goa Call Girls  By Russian Call Girl...
Call Girls Service In Goa 💋 9316020077💋 Goa Call Girls By Russian Call Girl...
 
Udaipur Call Girls 📲 9999965857 Call Girl in Udaipur
Udaipur Call Girls 📲 9999965857 Call Girl in UdaipurUdaipur Call Girls 📲 9999965857 Call Girl in Udaipur
Udaipur Call Girls 📲 9999965857 Call Girl in Udaipur
 
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near MeVIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
 
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meetraisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
 
Mangalore Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
Mangalore Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real MeetMangalore Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
Mangalore Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
 
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
 
VIP Call Girl Sector 10 Noida Call Me: 9711199171
VIP Call Girl Sector 10 Noida Call Me: 9711199171VIP Call Girl Sector 10 Noida Call Me: 9711199171
VIP Call Girl Sector 10 Noida Call Me: 9711199171
 

Im symposium presentation - OCR and Text analytics for Medical Chart Review Process

  • 1. OCR and Text Analytics for Medical Chart Review Process. Alex Zeltov, Darwin Leung, Ravi Chawla, Somesh Nigam
  • 2. BIOGRAPHY: Alex Zeltov
      - Research Scientist, Advanced Analytics
      - Independence Blue Cross
      - Leads the development and research of the Big Data initiative and predictive analytics across the Informatics Division for Independence Blue Cross.
      - Contact info: Phone: 215.241.9885, Email: alex.zeltov@ibx.com
  • 3. BIOGRAPHY: Darwin Leung
      - Director, Informatics Application Development and Operations
      - Independence Blue Cross
      - Responsible for the development of analytical applications across the Informatics Division for Independence Blue Cross.
      - Contact info: Phone: 215.241.2255, Email: darwin.leung@ibx.com
  • 4. Background on Text Analytics and Medical Documents
      - Providers have different levels of technology readiness, varying from Electronic Medical Records (EMR) to paper charts.
      - We want to apply text analytics to all information available for different business cases.
      - We need to bring all information collected to a level where our technologies can be applied.
  • 5. OCR for medical documents
      - OCR (Optical Character Recognition) is useful for medical documents because it reduces cost and increases productivity.
      - High speed provided by OCR
      - OCR software can provide accuracy rates comparable to manual data entry in a fraction of the time.
  • 6. OCR + Text Analytics Process (diagram). Components shown: IMG/PDF/TIF input; DropBox (share); ImageMagick + OCR; HADOOP cluster (stores the text and PDF versions of the EMR); text analytics / NLP processing; clinical ontology; predictive models; results; DB.
  • 7. Custom Distributed OCR Application: a high-performance distributed OCR process runs in the background, sharing resources with the Informatics Big Data HADOOP cluster. Customized open source tools used in the OCR process:
      - Custom distribution and parallelization framework for OCR
      - PDFtk: normalizes PDF headers and splits up the PDF pages
      - ImageMagick: resizes, rotates, increases DPI, and applies various special effects to enhance image quality. Creates an image version of the PDF (single page).
      - Tesseract OCR: extracts the text from the image file and generates a text file; generates searchable PDFs by creating metadata in the original PDF image files
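The per-page tool chain on slide 7 can be sketched as a sequence of command lines. This is only an illustration, not the authors' framework: the function name, file naming scheme, and the 300 DPI default are assumptions, and the distribution layer across cluster nodes is left out. The sketch builds the commands without running them.

```python
from pathlib import PurePosixPath


def build_ocr_commands(pdf_path, work_dir, page, dpi=300):
    """Build (but do not run) the commands to split a PDF and OCR one page.

    Hypothetical helper: names, paths, and the DPI default are assumptions.
    """
    work = PurePosixPath(work_dir)
    stem = f"page_{page:04d}"
    page_pdf = str(work / f"{stem}.pdf")
    page_tif = str(work / f"{stem}.tif")
    out_base = str(work / stem)
    return [
        # PDFtk: split the chart into one PDF per page (page_0001.pdf, ...).
        ["pdftk", pdf_path, "burst", "output", str(work / "page_%04d.pdf")],
        # ImageMagick: rasterize the single-page PDF at the requested density.
        # (ImageMagick 7 installs this tool as `magick`; `convert` is the older name.)
        ["convert", "-density", str(dpi), page_pdf, page_tif],
        # Tesseract: write <out_base>.txt with the recognized text.
        ["tesseract", page_tif, out_base],
        # Tesseract: write <out_base>.pdf, a searchable PDF.
        ["tesseract", page_tif, out_base, "pdf"],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where the three tools are installed.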
  • 8. OCR Performance Statistics
      - Per server node: image enhancement and document slicing + OCR ≈ 2 sec/page, i.e. 1,800 pages/hr on 1 node
      - 18 HADOOP cluster nodes run the OCR process in parallel: 32,400 pages/hr on the cluster
      - Assuming a typical chart of 100 pages: ≈ 324 charts/hr
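The throughput figures on slide 8 follow from simple arithmetic, reproduced below. The sketch assumes perfectly linear scaling across nodes, which is how the slide's numbers appear to be derived; the function name is illustrative.

```python
def cluster_throughput(seconds_per_page, nodes, pages_per_chart):
    """Return (pages/hr per node, pages/hr on the cluster, charts/hr).

    Assumes every node works at the same rate with no coordination overhead.
    """
    pages_per_node_hour = 3600 / seconds_per_page
    pages_per_cluster_hour = pages_per_node_hour * nodes
    charts_per_hour = pages_per_cluster_hour / pages_per_chart
    return pages_per_node_hour, pages_per_cluster_hour, charts_per_hour


# Slide 8 inputs: 2 sec/page, 18 nodes, 100 pages per chart.
per_node, per_cluster, charts = cluster_throughput(2, 18, 100)
print(per_node, per_cluster, charts)  # 1800.0 32400.0 324.0
```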
  • 9. Text Analytics Components: custom text analysis code using Java and Python
      - Lucene: tokenization, shingles, n-gramming
      - Weka: collection of machine learning algorithms for data mining
      - Annotation Query Language (AQL): powerful text analytics engine developed by IBM and used by IBM Watson. Executes extractors in a highly efficient manner by using the parallelism provided by the Informatics HADOOP platform.
      - OpenNLP: hosts a variety of Java-based NLP tools that perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, and named-entity detection
  • 10. Ontology and Preprocessing (diagram). Steps shown: Clinical Ontology DB repo; load ontology terms per medical condition; tokenize; stop word filters; n-gram / shingles; stemming; generate token permutations; intermediate ontology tokens per job type; Hadoop GPFS; Hadoop text analytics MR jobs.
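Several of the preprocessing steps named on slide 10 (tokenize, stop word filter, shingles, token permutations) can be illustrated in a few lines. This is a plain-Python sketch, not the Lucene-based code the slides refer to; the stop word list and the example term are hypothetical, and stemming is omitted because it needs an external stemmer.

```python
import re
from itertools import permutations

# Illustrative stop words only; the presenters' actual list is not given.
STOP_WORDS = {"of", "the", "and", "with", "in"}


def tokenize(term):
    """Lowercase the term and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", term.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def shingles(tokens, size):
    """Return all contiguous word n-grams (shingles) of the given size."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def token_permutations(tokens):
    """Return every ordering of the tokens, joined by spaces and sorted."""
    return sorted({" ".join(p) for p in permutations(tokens)})


tokens = remove_stop_words(tokenize("Fracture of the Femur"))
print(tokens)                      # ['fracture', 'femur']
print(shingles(tokens, 2))         # ['fracture femur']
print(token_permutations(tokens))  # ['femur fracture', 'fracture femur']
```

Generating permutations lets a later matching step find "femur fracture" even when the ontology stores the term in a different word order; the number of permutations grows factorially, so this is only practical for short terms.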
  • 11. HADOOP
      - The HADOOP framework is a mechanism for analyzing huge datasets, which do not have to be housed in a datastore.
      - HADOOP scales out to myriad nodes and can handle all of the activity and coordination related to data processing.
      - HADOOP Map Reduce is a way to process large data sets by distributing the work across a large number of nodes.
  • 12. HADOOP Components:
      - Common: contains libraries and utilities needed by other Hadoop modules.
      - Hadoop Distributed File System (HDFS): distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access.
      - MapReduce: a programming model for large-scale data processing.
  • 13. HADOOP Components:
      - HBase: a distributed, column-oriented NoSQL database.
      - Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets.
      - Sqoop: a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
      - Pig: scripting platform.
      - Oozie: workflow scheduler.
      - Zookeeper: cluster coordination.
      - Mahout: machine learning library.
  • 14. Map Reduce: a way to process large data sets by distributing the work across a large number of nodes.
      - Map: the master node partitions the input into smaller sub-problems and distributes them to the worker nodes; worker nodes may do the same in turn.
      - Reduce: the master node then takes the answers to all the sub-problems and combines them in some way to get the output.
  • 15. Map Reduce - Word Count Example http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
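The word count example on slide 15 can be imitated in a single process to show the map, shuffle, and reduce steps. This is a sketch of the programming model only, not Hadoop code; all function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key, here by summing the counts."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


print(word_count(["chart review", "chart OCR"]))
# {'chart': 2, 'review': 1, 'ocr': 1}
```

On a real cluster the map calls run on different nodes and the shuffle moves data across the network; the structure of the computation is the same.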
  • 16. Business Cases
      - Product Recall
      - Entity Extraction from Medical Charts
      - Nurse Chart Review Process
  • 17. Business Case 1: Product Recall
  • 18. Business Case 1: Product Recall
      - The text mining process helps identify the manufacturers that are on the recall list.
      - Scheduled report alerts list the members identified as potential matches for the recalled manufacturers.
      - Create a database of extracted patient and manufacturer information.
      - The OCR + text mining process analyzes charts that are 300+ pages long on average.
  • 19. Business Case 1: Product Recall
      - Generated reports on the OCR results
      - BigSheets: web-based spreadsheet look and feel
  • 20. Business Case 1: Entity Extraction
      - Generated reports on the Entity Extraction results
      - Create a database of extracted entity information accessible via JDBC/ODBC.
  • 21. Business Case 2: Nurse Chart Review Process
      - The text mining process helps identify conditions and diagnoses based on the medical ontology matches for the nurse review.
      - The text analytics prioritizes the charts for nurse review: the highest-scored EMR charts are presented first.
      - The nurse can open the text version of the chart, created as part of the OCR process, at the exact location of the matched terms in the scanned version of the chart.
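Two ideas from slide 21, presenting the highest-scored charts first and jumping to the location of a matched term, can be sketched as follows. The function names, the whole-word matching rule, and the sample data are assumptions for illustration; the slides do not describe how the real system stores scores or positions.

```python
import re


def find_matches(text, terms):
    """Return (term, start offset) for each case-insensitive whole-word hit."""
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


def rank_charts(scores):
    """Return chart ids ordered from highest to lowest score."""
    return sorted(scores, key=lambda chart_id: scores[chart_id], reverse=True)


text = "History of diabetes. Diabetes controlled."
print(find_matches(text, ["diabetes"]))       # [('diabetes', 11), ('diabetes', 21)]
print(rank_charts({"a": 1, "b": 5, "c": 3}))  # ['b', 'c', 'a']
```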
  • 22. Summary: OCR software
      - It can operate at high speeds and often can process batches of medical documents in various formats (jpg, tiff, gif, pdf, etc.).
      - The text data can be stored in a database and then be used for analytics, predictive modeling and data mining.
      - This technology provides invaluable benefits in terms of cost savings and productivity.
  • 23. Q & A
  • 26. AQL: Advanced Text Analytics
      - Powerful text analytics engine developed by IBM and used by IBM Watson on the Jeopardy quiz show.
      - A declarative Annotation Query Language (AQL) with familiar SQL-like syntax for specifying text analytics extraction programs (or extractors) with rich, clean rule semantics.
      - A runtime engine for executing extractors in a highly efficient manner by using the parallelism provided by the IBM InfoSphere BigInsights engine on the HADOOP platform.
      - Built-in multilingual support for tokenization and part-of-speech analysis.
      - The text analytics system extracts information from unstructured and semi-structured data.
  • 27. AQL
  • 28. Sample AQL

        /* Dictionary of minor conditions */
        create dictionary minorConditions
          from file 'minorConditions.dict'
          with language as 'en';

        /* Dictionary of major conditions */
        create dictionary majorConditions
          from file 'majorConditions.dict'
          with language as 'en';

        /* Extract instances of minor conditions and 'score' 1 for each instance */
        create view minor as
          extract 1 as disposition,
            dictionary 'minorConditions' on R.text as match
          from Document R;

        /* Extract instances of major conditions and 'score' 2 for each instance */
        create view major as
          extract 2 as disposition,
            dictionary 'majorConditions' on R.text as match
          from Document R;

        /* Union together all instances */
        create view RawDisposition as
          (select * from minor)
          union all
          (select * from major);

        /* Aggregate per document score */
        create view ConsolidatedDisposition as
          select Sum(R.disposition) as disposition
          from RawDisposition R;

        export view ConsolidatedDisposition;
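For readers unfamiliar with AQL, the scoring logic of the sample on slide 28 (1 point per minor-condition match, 2 points per major-condition match, summed per document) can be approximated in Python. This is a rough equivalent for illustration, not a translation of AQL semantics: AQL's dictionary matching is token-based and may behave differently from the regular expressions used here, and the dictionary entries below are hypothetical stand-ins for the contents of the `.dict` files.

```python
import re


def count_hits(text, dictionary):
    """Count case-insensitive whole-word occurrences of every dictionary entry."""
    total = 0
    for entry in dictionary:
        pattern = r"\b" + re.escape(entry) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def disposition_score(text, minor_conditions, major_conditions):
    """Sum 1 per minor-condition hit and 2 per major-condition hit."""
    return 1 * count_hits(text, minor_conditions) + 2 * count_hits(text, major_conditions)


# Hypothetical dictionary contents, for illustration only.
minor = {"sinusitis"}
major = {"stroke"}
text = "Patient has sinusitis and a history of stroke; stroke in 2010."
print(disposition_score(text, minor, major))  # 5
```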
  • 31. END

Editor's notes

  1. Introduction of Darwin and Alex. Darwin: responsible for Informatics Solutions Development and Data Warehouse Operations. Alex: Big Data Solutions Architect, Research Scientist. Informatics: a centralized end-to-end development and service unit that provides data architecture, DW development, BI solutions development and business support. We provide support to the enterprise, including UW, Actuary, Finance, and Sales & Marketing. Over the years, we have been very successful at collecting and storing data from various data sources, and at building DW/ETL/BI tools to channel information to our business constituents. But that is only scratching the surface in terms of data collection. In our business, there are still paper charts, PDFs, images, photos and faxes that contain vital information we have not tapped into. Today, we are presenting how we apply big data technologies to unstructured documents. As we know, providers have different levels of electronic readiness: we still receive a large number of documents as faxes, PDFs or photocopies. We want to apply text analytics to all information available from these unstructured documents, not only to the native electronic forms such as claims submissions. We want to be able to make this unstructured information available to all of our BI solutions.
  2. Before the use of OCR for medical documents, manual data entry was used to capture data from medical records. This was a time-consuming and costly process, as data entry teams spent thousands of hours capturing data and ensuring its accuracy. This manual process also meant that large data entry teams had to be maintained and paid. The OCR installation can process documents faster than an entire data entry team, and since it can operate 24/7, documents can enter a database faster and be used for text analytics and data mining. Unstructured documents such as patient records, test results, patient activities and historical medical records can all be scanned and converted into text-searchable format. This ultimately results not only in faster conversion of documents into text-searchable format; the resulting data set can also be pushed into databases, something that is not possible with scanned documents.
  3. Apache's Hadoop framework is essentially a mechanism for analyzing huge datasets, which do not necessarily need to be housed in a datastore. Hadoop abstracts MapReduce's massive data-analysis engine, making it more accessible to developers. Hadoop scales out to myriad nodes and can handle all of the activity and coordination related to data sorting. One of the enabling technologies of the big data revolution is MapReduce, a programming model and implementation developed by Google for processing massive-scale, distributed data sets. 