Mining Unstructured Data:
Practical Applications




Alyona Medelyan @zelandiya
Anna Divoli @annadivoli
Problem 1




          New York                               London

How do lawyers scan, file, store & share
clients’ case documents efficiently?
                                           Images: Ambro / FreeDigitalPhotos.net
Images: slambo_42@flickr, Anoto AB@flickr

EHR / EMR / PHR

How do doctors, patients & researchers distribute & share
medical records efficiently?
Problem 3

The FATCA Legislation
Takes effect 1 January 2013

Diagram labels: U.S. account holders; U.S. ownership entities; with waiver; without waiver; Foreign Financial Institution with IRS agreement; Custodian bank without IRS agreement; annual report; waiver; 30% withholding tax

How can a financial institution find U.S. citizens
in masses of paperwork efficiently?
How much time do we actually spend on …
(average hours / week; source: IDC, Hidden cost of information)

Searching, gathering info    17
Writing emails               14
Creating docs                13
Analyzing info               10
Reviewing docs                9
Organizing docs               7
Creating presentations        7
Editing images                6
Entering data                 6
Approving docs                4
Publishing docs               4
Translating docs              1

Translates to annual costs:
Search: 17h / week = $37,000 / year
introduction
unstructured data: real life problems
unstructured data & text analytics
metadata in legal domain
healthcare records issues
compliance in finance
conclusions
unstructured data

Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs



Linguistics                               Search
   Statistics                        Data Extraction
 Text Processing                   Document Organization
Machine Learning                  Business Intelligence
Natural Language Processing        Opinion Mining
     Text Mining
What can one mine from unstructured data?

keywords, tags, sentiment, genre, categories, taxonomy terms,
entities (names, patterns, biochemical entities, …)
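
As a rough illustration of the simplest item on this list, keywords can be approximated by counting frequent non-stopword terms. This is only a minimal sketch under stated assumptions, not the method used by the speakers or by any particular API; the stopword list and the `top_keywords` helper are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "by", "with"}

def top_keywords(text, n=3):
    """Return the n most frequent non-stopword tokens, lowercased."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(n)]

doc = "The contract binds the client. The client signs the contract for the lawyer."
print(top_keywords(doc, 2))
```

Frequency counting ignores phrases, synonyms and domain vocabulary, which is why production keyword extraction usually adds linguistic and statistical features on top.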
  
Example: News, People, U.S. politicians → News about U.S. politicians


Structured & unstructured data interplay

Structured biological data
Unique identifiers
Literature references
Experts’ annotation (free text)
Legal document processing pipeline




scan → save → ocr → metadata → dms

New York, London
Images: Ambro / FreeDigitalPhotos.net
jacockshaw@flickr

                    Assigning metadata
                         (approximation)

                          15 docs per day
                           3 min per doc
                           0.75 h per day
                      240 working days per year
                         $200 hourly charge

                     $36,000 per year per lawyer




                    Keyword extraction
                         0.0027 min per doc
                    10 min for a year's worth of docs
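
The approximation above can be checked with a few lines of arithmetic. The figures come from the slide; the variable names are illustrative only.

```python
# Figures taken from the slide; variable names are illustrative.
docs_per_day = 15
manual_min_per_doc = 3
working_days = 240
hourly_charge = 200          # USD
auto_min_per_doc = 0.0027

hours_per_day = docs_per_day * manual_min_per_doc / 60       # hours spent tagging per day
manual_cost = hours_per_day * working_days * hourly_charge   # USD per year per lawyer
docs_per_year = docs_per_day * working_days                  # documents per year
auto_minutes = docs_per_year * auto_min_per_doc              # minutes per year, automated

print(f"manual: {hours_per_day} h/day, ${manual_cost:,.0f} per year")
print(f"automatic: {auto_minutes:.1f} min per year for {docs_per_year} docs")
```

The automated figure works out to a little under 10 minutes for 3,600 documents, which the slide rounds to 10.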
Integrating metadata extraction with scanning
http://www.youtube.com/watch?v=kluVp25upag
Efficient (legal) document processing pipeline




keywords, tags → metadata → dms
EMR, PHR, EHR
Images: slambo_42@flickr, Anoto AB@flickr
National Alliance for Health Information Technology (NAHIT) definitions: EMR, EHR, PHR

?

Discontinued!

1.  Name, birth date, blood type
2.  Emergency contact(s)
3.  Primary caregiver/phone number
4.  Medicines, dosages, and how long taken
5.  Allergies/allergic reactions
6.  Date of last physical
7.  Dates/results of tests and screenings
8.  Major illnesses/surgeries and their dates
9.  Chronic diseases
10. Family illness history
11. …

http://www.nlm.nih.gov/medlineplus/magazine/

PHI
de-identification process
Medical researchers use patient records for discoveries…

… records with removed PHI: information from structured fields, but mostly from free text!




AMIA 2012
siliconangle.com/blog/
www.hcpro.com
www.information-age.com

“The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”

“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”
18 identifiers!
PHI

Names
Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…
Dates (except year): birth, admission, discharge…
Phone / Fax numbers
Email addresses
Social security #
Medical records #
Health plan beneficiary #
Accounts #
Vehicle identifiers & serial numbers, incl. license plate numbers
Device identifiers & serial numbers
URLs / IP addresses
Biometric identifiers, including finger and voice prints
Face photo images & any comparable images
Any other unique IDs etc.
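
Some of the identifier types above follow regular surface patterns, so a first de-identification pass over free text can be sketched with regular expressions. This is a minimal sketch under stated assumptions: the patterns are simplified and hypothetical, they cover only a handful of the listed types, and names, addresses and dates would need entity extraction rather than patterns.

```python
import re

# Simplified, illustrative patterns for a few of the identifier types listed
# above. They are assumptions for this sketch, not a complete HIPAA rule set.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("URL",   re.compile(r"\bhttps?://\S+")),
    ("IP",    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\(?\b\d{3}\)?[ .-]\d{3}[ .-]\d{4}\b")),
]

def scrub(text):
    """Replace each match with a bracketed placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt called from 555-123-4567, SSN 123-45-6789, email jo@example.org."
print(scrub(note))
```

Replacing matches with typed placeholders, rather than deleting them, keeps the sentence readable for later text mining.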
  
Thanks for discussions:
   Nigam Shah, Stanford
   Eneida Mendonca, U Wisconsin, Madison
   Irena Spasic, Cardiff University

Image: slambo_42@flickr

text → keywords, tags
Image: Anoto AB@flickr
The FATCA Legislation
        Takes effect 1 January 2013




FATCA COMPLIANCE – STEP 1
Detect U.S. citizenship indicators
Recommended Solution
from FATCA Legislation:




          •  “Query an electronic database using
             standard queries in programming languages”

          •  “Adopt similar approaches as used for the
             Anti-money-laundering and Know-your-customer
             requirements”

          •  “Note that information, data, or files are not
             electronically searchable if they are stored as
             images”
Images: walmink, thomwatson@flickr

FATCA COMPLIANCE – STEP 2
Contact client for additional info or a waiver
Actual Solution
for the FATCA Legislation:
link analysis: gather the trail of the client’s data
ocr: convert all images to text
entity extraction: detect locations, bank numbers
analysis: auto-categorize
check: resolve inconsistencies
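
The entity extraction and categorization steps above can be sketched, very roughly, as pattern checks over text that has already been through OCR. This is only an illustration: the patterns, the short state list, the `us_indicators` helper and the sample documents are all hypothetical, and a real system would rely on trained entity extraction rather than a few regular expressions.

```python
import re

# Illustrative patterns only (assumptions for this sketch); production systems
# would rely on trained entity extraction and curated gazetteers.
US_PHONE = re.compile(r"\+1[ .-]?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b")
US_STATE_ZIP = re.compile(r"\b(?:NY|CA|TX|FL|IL|WA|MA)\s+\d{5}(?:-\d{4})?\b")
US_WORDS = re.compile(r"\b(?:U\.S\. citizen|United States|USA)\b", re.IGNORECASE)

def us_indicators(ocr_text):
    """Return a sorted list of indicator types found in OCR'd text."""
    found = set()
    if US_PHONE.search(ocr_text):
        found.add("us_phone")
    if US_STATE_ZIP.search(ocr_text):
        found.add("us_address")
    if US_WORDS.search(ocr_text):
        found.add("us_mention")
    return sorted(found)

# Made-up sample documents standing in for OCR output.
docs = {
    "letter_01": "Mail to 350 Fifth Ave, New York, NY 10118. Tel +1 212 555 0100.",
    "letter_02": "Account holder resides in Zurich, Switzerland.",
}
for name, text in docs.items():
    print(name, us_indicators(text))
```

Documents with at least one indicator would then go to the check step, where a person resolves inconsistencies.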
  
Efficient FATCA Compliance
Alyona Medelyan, PhD                Anna Divoli, PhD
       @zelandiya                          @annadivoli
       Natural Language Processing         Biomedical Text Mining
       Text Mining                         Search User Interfaces
       Wikipedia Mining                    Human Factors
       Machine Learning                    Knowledge Discovery




Try out text analytics provided by the Pingar API!

             Online demo: apidemo.pingar.com
     Free Sandbox account: pingar.com/get-the-api

Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Mining Unstructured Data: Practical Applications, from the Strata O'Reilly Making Data Work Conference, March 2012

  • 1. Mining Unstructured Data: Practical Applications Alyona Medelyan @zelandiya Anna Divoli @annadivoli
  • 2. Problem 1 New York London How do lawyers scan, file, store & share client’s case documents efficiently? Images: Ambro / FreeDigitalPhotos.net
  • 3. EHR, EMR, PHR. How do doctors, patients & researchers distribute & share medical records efficiently? Images: slambo_42@flickr, Anoto AB@flickr
  • 4. Problem 3: The FATCA Legislation. Takes effect 1 January 2013. Diagram labels: annual report; 30% withholding tax; waiver; Foreign Financial Institution with IRS agreement; U.S. account holders; U.S. ownership entities; with waiver; without waiver; Custodian bank without IRS agreement; 30% withholding tax. How can a financial institution find U.S. citizens in masses of paperwork efficiently?
  • 5. How much time do we actually spend on … (average hours / week): Searching, gathering info 17; Writing emails 14; Creating docs 13; Analyzing info 10; Reviewing docs 9; Organizing docs 7; Creating presentations 7; Editing images 6; Entering data 6; Approving docs 4; Publishing docs 4; Translating docs 1. Translates to annual costs: Search: 17h / week = $37,000 / year. Source: IDC: Hidden cost of information
  • 6. Outline: introduction; unstructured data & text analytics; unstructured data real life problems (metadata issues in legal domain, healthcare records, compliance in finance); conclusions
  • 7. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
  • 8. unstructured data: Linguistics, Search, Statistics, Data Extraction, Text Processing, Document Organization, Machine Learning, Business Intelligence, Natural Language Processing, Opinion Mining, Text Mining
  • 9. What can one mine from unstructured data? keywords, tags, sentiment, genre, categories, taxonomy terms, entities (names, patterns, biochemical entities, …)
  • 10. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
  • 11. Structured & unstructured data interplay. People; U.S. politicians; News; News about U.S. politicians. Structured biological data: unique identifiers; literature references; experts’ annotation (free text)
  • 12. Outline: introduction; unstructured data & text analytics; unstructured data real life problems (metadata issues in legal domain, healthcare records, compliance in finance); conclusions
  • 13. Legal document processing pipeline (New York, London): scan, save, ocr, metadata, dms. Images: Ambro / FreeDigitalPhotos.net
  • 14. Assigning metadata (approximation): 15 docs per day × 3 min per doc = 0.75 h per day; 240 working days per year; $200 hourly charge; $36,000 per year per lawyer. Keyword extraction: 0.0027 min per doc; 10 min for a year's worth of docs. Image: jacockshaw@flickr
  • 15. Integrating metadata extraction with scanning: http://www.youtube.com/watch?v=kluVp25upag
  • 16. Efficient (legal) document processing pipeline: keywords, tags; metadata; dms
  • 17. Outline: introduction; unstructured data & text analytics; unstructured data real life problems (metadata issues in legal domain, healthcare records, compliance in finance); conclusions
  • 18. EMR, PHR, EHR. Images: slambo_42@flickr, Anoto AB@flickr
  • 19. National Alliance for Health Information Technology (NAHIT) definitions: EMR, EHR, PHR? Discontinued! 1. Name, birth date, blood type 2. Emergency contact(s) 3. Primary caregiver/phone number 4. Medicines, dosages, and how long taken 5. Allergies/allergic reactions 6. Date of last physical 7. Dates/results of tests and screenings 8. Major illnesses/surgeries and their dates 9. Chronic diseases 10. Family illness history 11. … (http://www.nlm.nih.gov/medlineplus/magazine/) PHI: de-identification process
  • 20. Medical researchers use patient records for discoveries… … records with removed PHI: information from structured fields, but mostly from free text! AMIA 2012
  • 21. siliconangle.com/blog/ www.hcpro.com www.information-age.com “The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules” “The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”
  • 22. PHI: 18 identifiers! Names; Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…; Dates (except year): birth, admission, discharge…; Phone / Fax numbers; Email addresses; Social security #; Medical records #; Health plan beneficiary #; Accounts #; Vehicle identifiers & serial numbers, incl. license plate numbers; Device identifiers & serial numbers; URLs / IP addresses; Biometric identifiers, including finger and voice prints; Face photo images & any comparable images; Any other unique IDs etc.
  • 23. Thanks for discussions: Nigam Shah, Stanford; Eneida Mendonca, UWisconsin, Madison; Irena Spasic, Cardiff University. keywords, tags. Images: slambo_42@flickr, Anoto AB@flickr
  • 24. Outline: introduction; unstructured data & text analytics; unstructured data real life problems (metadata issues in legal domain, healthcare records, compliance in finance); conclusions
  • 25. The FATCA Legislation. Takes effect 1 January 2013. Diagram labels: annual report; 30% withholding tax; waiver; Foreign Financial Institution with IRS agreement; U.S. account holders; U.S. ownership entities; with waiver; without waiver; Custodian bank without IRS agreement; 30% withholding tax
  • 26. FATCA COMPLIANCE – STEP 1 Detect U.S. citizenship indicators
  • 27. Recommended Solution from FATCA Legislation: •  “Query an electronic database using standard queries in programming languages” •  “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements” •  “Note that information, data, or files are not electronically searchable if they are stored as images”
  • 28. FATCA COMPLIANCE – STEP 2 Contact client for additional info or a waiver. Images: walmink, thomwatson@flickr
  • 29. Actual Solution for the FATCA Legislation: link analysis: gather the client’s data trail; ocr: convert all images to text; entity extraction: detect locations, bank numbers; analysis: auto-categorize; check: resolve inconsistencies
  • 31. Outline: introduction; unstructured data & text analytics; unstructured data real life problems (metadata issues in legal domain, healthcare records, compliance in finance); conclusions
  • 32. Alyona Medelyan, PhD (@zelandiya): Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning. Anna Divoli, PhD (@annadivoli): Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery. Try out text analytics provided by the Pingar API! Online demo: apidemo.pingar.com Free Sandbox account: pingar.com/get-the-api
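
The metadata cost estimate on slide 14 can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on that slide; the constant and function names are illustrative and not from the talk, and the yearly keyword-extraction time is derived from the per-document figure rather than stated exactly on the slide.

```python
# Figures quoted on slide 14; names are illustrative assumptions.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027


def manual_cost_per_year() -> float:
    """Yearly cost (USD) of one lawyer assigning metadata by hand."""
    hours_per_day = DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60
    return hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD


def automated_minutes_per_year() -> float:
    """Minutes of automated keyword extraction for one year's documents."""
    docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR
    return docs_per_year * AUTO_MIN_PER_DOC


if __name__ == "__main__":
    print(f"Manual: ${manual_cost_per_year():,.0f} per year per lawyer")
    print(f"Automated: {automated_minutes_per_year():.1f} min per year")
```

The manual figure reproduces the slide's $36,000, and the automated figure comes to just under the slide's rounded 10 minutes.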

Editor's notes

  1. To summarize: In this talk we gave a brief overview of what text analytics is and how powerful it is when dealing with unstructured data. We presented 3 real world examples where text analytics eliminates boring, error-prone manual labor. In the legal domain, keyword and taxonomy term extraction facilitates automated metadata assignment. Healthcare benefits from automated entity extraction for de-identification (sanitization) and mining useful associations. In the area of compliance & forensics, text analytics helps scan massive amounts of data. No matter how much further our technology develops, we will always continue to communicate in human language. The amount of unstructured data will only increase. Already there are areas where manual analytics is not sustainable. And there will be even more need for efficient text analytics in the future.
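
As a rough illustration of the de-identification mentioned in the notes, the sketch below redacts a handful of the 18 PHI identifier types from free text with regular expressions. It is a minimal sketch under stated assumptions, not the method used in the talk or by the Pingar API: the patterns, labels and the `redact` function are hypothetical, cover U.S.-style formats only, and would miss names, addresses and many other identifiers that need real entity extraction.

```python
import re

# Hypothetical, deliberately simple patterns for a few identifier types.
# Each entry is (compiled pattern, replacement template).
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"\bhttps?://\S+"), "[URL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"), "[PHONE]"),
    # Dates are identifiers "except year", so the year is kept.
    (re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b"), r"[DATE-\1]"),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    note = ("Contact jane.doe@example.org or (415) 555-0132; "
            "SSN 123-45-6789; admitted 03/14/2011 from 192.168.0.12.")
    print(redact(note))
```

Patterns are applied in a fixed order so that more specific formats (email, SSN) are replaced before looser ones (phone) get a chance to match the same characters.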