Alyona Medelyan (Pingar), Anna Divoli (Pingar)
presented at Strata O'Reilly Making Data Work Conference on March 1, 2012
The challenge of unstructured data is a top priority for organizations that are looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents in much the same way a person would by reading them. Text mining and analytics tools have recently become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were used to solve key business challenges.
Most organizations dream of a paperless office, but still generate and receive millions of printed documents. Digitizing these documents and intelligently sharing them is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents and, on the fly, categorizes them and generates meaningful metadata.
In forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people’s names, addresses, credit card and bank account numbers and other entities is key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with recent legislation.
In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to using the incredibly valuable information they contain to further medical research. Several approaches that assign unique encrypted identifiers to patients’ IDs have been reported, but each comes with drawbacks. For a number of medical studies, consistent uniform ID mapping is not necessary and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.
Read the full interview with Alyona and Anna at http://radar.oreilly.com/2012/02/unstructured-data-analysis-tools.html
2. Problem 1
New York, London
How do lawyers scan, file, store & share clients’ case documents efficiently?
Images: Ambro / FreeDigitalPhotos.net
3. EHR, EMR, PHR
How do doctors, patients & researchers distribute & share medical records efficiently?
Images: slambo_42@flickr, Anoto AB@flickr
4. The FATCA Legislation (Problem 3)
Takes effect 1 January 2013
[Diagram labels: Foreign Financial Institution; Custodian bank; with IRS agreement; without IRS agreement; U.S. account holders; U.S. ownership entities; annual report; waiver (with / without); 30% withholding tax]
How can a financial institution find U.S. citizens in masses of paperwork efficiently?
5. How much time do we actually spend on …
(average hours / week)
Searching, gathering info: 17
Writing emails: 14
Creating docs: 13
Analyzing info: 10
Reviewing docs: 9
Organizing docs: 7
Creating presentations: 7
Editing images: 6
Entering data: 6
Approving docs: 4
Publishing docs: 4
Translating docs: 1
Translates to annual costs: Search: 17h / week = $37,000 / year
IDC: Hidden cost of information
6. introduction
unstructured data: real life problems
unstructured data & text analytics
metadata issues in legal domain
healthcare records
compliance in finance
conclusions
7. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
8. unstructured data
Text Mining
Fields: Linguistics, Statistics, Text Processing, Machine Learning, Natural Language Processing
Applications: Search, Data Extraction, Document Organization, Business Intelligence, Opinion Mining
9. What can one mine from unstructured data?
keywords
tags
sentiment
genre
categories
taxonomy terms
entities: names, patterns, biochemical entities, …
10. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
11. Structured & unstructured data interplay
People: U.S. politicians
News: News about U.S. politicians
Unique identifiers
Structured biological data
Literature references
Experts’ annotation (free text)
13. Legal document processing pipeline
Steps shown: scan, ocr, metadata, save, dms
Offices: New York, London
Images: Ambro / FreeDigitalPhotos.net
14. jacockshaw@flickr
Assigning metadata
(approximation)
15 docs per day
3 min per doc
0.75 h per day
240 working days per year
$200 hourly charge
$36,000 per year per lawyer
Keyword extraction
0.0027 min per doc
10 min for a year's worth of docs
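The approximation above can be checked with a few lines of arithmetic. The sketch below only restates the slide's figures; the constant and function names are illustrative and not part of the talk.

```python
# Figures copied from the slide; names are illustrative assumptions.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027  # keyword extraction time per document


def manual_hours_per_day():
    """Hours a lawyer spends per day assigning metadata by hand."""
    return DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60


def manual_cost_per_year():
    """Yearly cost of manual metadata assignment per lawyer, in USD."""
    return manual_hours_per_day() * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD


def automated_minutes_per_year():
    """Minutes of keyword extraction for one year's worth of documents."""
    return DOCS_PER_DAY * WORKING_DAYS_PER_YEAR * AUTO_MIN_PER_DOC


if __name__ == "__main__":
    print(manual_hours_per_day())                   # hours per day
    print(manual_cost_per_year())                   # USD per year per lawyer
    print(round(automated_minutes_per_year(), 2))   # minutes per year
```

The automated figure comes to just under ten minutes, which the slide rounds to 10 min.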
15. Integrating metadata extraction with scanning
http://www.youtube.com/watch?v=kluVp25upag
19. National Alliance for Health Information Technology (NAHIT) definitions: EMR, EHR, PHR? Discontinued!
1. Name, birth date, blood type
2. Emergency contact(s)
3. Primary caregiver/phone number
4. Medicines, dosages, and how long taken
5. Allergies/allergic reactions
6. Date of last physical
7. Dates/results of tests and screenings
8. Major illnesses/surgeries and their dates
9. Chronic diseases
10. Family illness history
11. …
http://www.nlm.nih.gov/medlineplus/magazine/
PHI
de-identification process
20. Medical researchers … use patient records with removed PHI:
information from structured fields for discoveries… but mostly from free text!
AMIA 2012
21. “The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”
“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”
siliconangle.com/blog/, www.hcpro.com, www.information-age.com
22. 18 identifiers!
PHI
Names
Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…
Dates (except year): birth, admission, discharge…
Phone / Fax numbers
Email addresses
Social security #
Medical records #
Health plan beneficiary #
Accounts #
Vehicle identifiers & serial numbers, incl. license plate numbers
Device identifiers & serial numbers
URLs / IP addresses
Biometric identifiers, including finger and voice prints
Face photo images & any comparable images
Any other unique IDs etc.
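To make the sanitization idea concrete, here is a minimal sketch that masks a handful of the pattern-like identifiers listed above (URLs, email addresses, IP addresses, social security numbers, phone numbers) with regular expressions. It is an illustration only, not the Pingar API or the method used in the talk: the patterns, labels and function name are assumptions, they cover U.S.-style formats only, and identifiers such as names, dates and addresses would need proper entity extraction rather than regexes.

```python
import re

# Hypothetical, deliberately simple patterns for a few identifier types.
# Order matters: URLs are masked before emails and IP addresses.
PATTERNS = [
    ("URL", re.compile(r"\bhttps?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")),
]


def sanitize(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = ("Contact jane.doe@example.org or 555-123-4567; "
            "SSN 123-45-6789, host 192.168.0.1.")
    print(sanitize(note))
```

Replacing matches with a type label rather than deleting them keeps the free text readable for researchers while removing the identifying value itself.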
23. Thanks for discussions:
Nigam Shah, Stanford
Eneida Mendonca, UWisconsin, Madison
Irena Spasic, Cardiff University
keywords, tags
Images: slambo_42@flickr, Anoto AB@flickr
25. The FATCA Legislation
Takes effect 1 January 2013
[Diagram labels: Foreign Financial Institution; Custodian bank; with IRS agreement; without IRS agreement; U.S. account holders; U.S. ownership entities; annual report; waiver (with / without); 30% withholding tax]
27. Recommended Solution from FATCA Legislation:
• “Query an electronic database using standard queries in programming languages”
• “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements”
• “Note that information, data, or files are not electronically searchable if they are stored as images”
28. FATCA COMPLIANCE – STEP 2
Contact client for additional info or a waiver
Images: walmink, thomwatson@flickr
29. Actual Solution for the FATCA Legislation:
client’s data
ocr: convert all images to text
entity extraction: detect locations, bank numbers
analysis: auto-categorize
check: resolve inconsistencies
link analysis: gather the trail
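The steps on this slide can be sketched roughly as follows. This is a toy illustration under loud assumptions, not the institution's system or the Pingar API: text is assumed to be already OCR'd, locations are found with a tiny hypothetical gazetteer, bank numbers are assumed to be runs of 8 to 17 digits, and the "check" step (resolving inconsistencies) is left to a human reviewer.

```python
import re
from collections import defaultdict

# Hypothetical mini-gazetteer; a real system would use a location extractor.
US_LOCATION_TERMS = ("United States", "USA", "U.S.", "New York", "California")
# Assumption: bank account numbers appear as runs of 8 to 17 digits.
ACCOUNT_NUMBER = re.compile(r"\b\d{8,17}\b")


def extract_entities(text):
    """Entity extraction: detect locations and bank-number-like strings."""
    locations = [term for term in US_LOCATION_TERMS if term in text]
    accounts = ACCOUNT_NUMBER.findall(text)
    return {"locations": locations, "accounts": accounts}


def categorize(entities):
    """Analysis: auto-categorize a document by whether a U.S. location appears."""
    if entities["locations"]:
        return "possible U.S. link"
    return "no U.S. indicator found"


def build_trail(documents):
    """Link analysis: gather per-client findings from (client_id, text) pairs."""
    trail = defaultdict(list)
    for client_id, text in documents:
        entities = extract_entities(text)
        trail[client_id].append({"category": categorize(entities), **entities})
    return dict(trail)


if __name__ == "__main__":
    docs = [
        ("c1", "Wire to account 123456789 in New York, USA."),
        ("c2", "Branch office in Zurich."),
    ]
    for client, findings in build_trail(docs).items():
        print(client, findings)
```

Grouping findings by client is what turns isolated entity hits into a trail that a compliance officer can review and follow up on.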
32. Alyona Medelyan, PhD (@zelandiya): Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning
Anna Divoli, PhD (@annadivoli): Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery
Try out text analytics provided by the Pingar API!
Online demo: apidemo.pingar.com
Free Sandbox account: pingar.com/get-the-api
Editor's notes
To summarize: In this talk we gave a brief overview of what text analytics is and how powerful it is when dealing with unstructured data. We presented 3 real-world examples where text analytics eliminates boring, error-prone manual labor. In the legal domain, keyword and taxonomy term extraction facilitates automated metadata assignment. Healthcare benefits from automated entity extraction for de-identification (sanitization) and mining useful associations. In the area of compliance & forensics, text analytics helps scan massive amounts of data. No matter how much further our technology develops, we will always continue to communicate in human language. The amount of unstructured data will only increase. Already there are areas where manual analytics is not sustainable, and there will be even more need for efficient text analytics in the future.