Health Insurance Predictive Analysis
         with MapReduce and Machine Learning


                                                         Julien Cabot
                                                     Managing Director
                                                               OCTO
                                                                      jcabot@octo.com
                                                                         @julien_cabot


              50, avenue des Champs-Elysées, 75008 Paris - FRANCE
              Tel: +33 (0)1 58 56 10 00   Fax: +33 (0)1 58 56 10 01
© OCTO 2012   www.octo.com
Internet as a Data Source…




              Internet as the voice of the crowd
… in Healthcare




              71% about
              • Illness
              • Symptom
              • Medicine
              • Advice / opinion

              Main sources are old-school
              forums, not social networks




Benefits for Insurance Company?


Understand patients' subjects of interest in order to
design customer-centric products
and marketing actions

Anticipate the psycho-social effects of the
Internet to prevent excessive consultations
(and reimbursements)

Predict claims by monitoring
queries about symptoms and drugs
How to run the predictive analysis?




The data problem


Understand the semantic field of
healthcare as it is used on the Internet

Find correlations between the evolution of
claims and many millions of unidentified
external variables

Find the correlated variables that anticipate
the claims

We need some help from Machine Learning!
Correlation search in external datasets



Three external data pipelines feed the search:

  1. Automated tokenization of messages by posted date, with semantic tagging
       -> Trends of medical keywords used in forums
  2. Google search volume of symptom and drug keywords
       -> Trends of medical keywords searched in Google
  3. Socio-economic context from Open Data initiatives
       -> Trends of socio-economic factors

These trends, together with the health claims by act typology, go into the
Correlation Search Machine, which outputs a matrix sorted by coefficient of
determination (R²).
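The ranking step of a correlation search can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the squared Pearson correlation as a simple stand-in for the R² of a fitted model, and the series names and values are hypothetical, not taken from the talk.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series explains nothing
    return (cov * cov) / (var_x * var_y)


def rank_by_r_squared(claims, candidates):
    """Return (name, R²) pairs sorted from most to least correlated."""
    scores = [(name, r_squared(series, claims)) for name, series in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical monthly figures, for illustration only.
claims = [10, 12, 15, 19, 24, 30]
candidates = {
    "forum_keyword_a": [1, 2, 3, 4, 5, 6],    # rises with the claims
    "search_keyword_b": [5, 1, 4, 2, 6, 3],   # mostly noise
    "open_data_c": [3, 3, 3, 3, 3, 3],        # constant
}
ranking = rank_by_r_squared(claims, candidates)
for name, score in ranking:
    print("%-18s R² = %.3f" % (name, score))
```

Sorting by score puts the most promising variables first, which is the role the sorted R² matrix plays in the slide.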
Understand the semantic field of Healthcare

Processing chain:
   Message tokenization by date
     -> Word stemming, tagging and common word filtering with NLTK
     -> Timelines of healthcare keywords

How to tag healthcare words?
   1 - Build a first list of keywords
   2 - Enrich the list with highly searched keywords
   3 - Learn automatically from Wikipedia Medical Categories
   Result: a database of keywords of the healthcare semantic field
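The filtering and tagging step can be illustrated with the standard library alone. This is a minimal sketch, not the talk's implementation: the keyword set and the common-word list are hypothetical placeholders for the keyword database, and stemming (done with NLTK in the talk) is left out to keep the example dependency-free.

```python
import re

# Hypothetical keyword database and common-word list, for illustration only.
HEALTHCARE_KEYWORDS = {"migraine", "fever", "aspirin", "cough"}
COMMON_WORDS = {"the", "a", "and", "for", "have", "with", "since", "i", "my", "is"}

TOKEN_PATTERN = re.compile(r"\w+")


def tag_message(message):
    """Tokenize a message, drop common words, and flag healthcare keywords."""
    tokens = [t.lower() for t in TOKEN_PATTERN.findall(message)]
    kept = [t for t in tokens if t not in COMMON_WORDS]
    return [(t, t in HEALTHCARE_KEYWORDS) for t in kept]


tagged = tag_message("I have a migraine and fever since Monday")
healthcare_hits = [token for token, is_health in tagged if is_health]
print(healthcare_hits)
```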
How to find correlations between time series?
    Compare the evolution of each variable with the evolution of the claims over time
    Fit a non-linear regression and learn a polymorphic predictive function
    f(x) from the dataset with Support Vector Regression (SVR)

[Figure: y plotted against x, showing f(x) inside the tube bounded by f(x) + ε and f(x) - ε]

Problem to solve

    $$\min_{w} \; \frac{1}{2} w^{T} w$$
    subject to
    $$y_i - (w^{T} \phi(x_i) + b) \le \varepsilon$$
    $$(w^{T} \phi(x_i) + b) - y_i \le \varepsilon$$

Resolution
    • Stochastic gradient descent
    • Test the response through the coefficient of determination R²

              Open source ML library helps!
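To make the two checks on this slide concrete, here is a small sketch of the coefficient of determination and of the ε-tube constraint. It assumes the usual definition R² = 1 - SS_res / SS_tot; the observed and predicted values are hypothetical, and the predictions stand in for the output f(x_i) of an already fitted model.

```python
def coefficient_of_determination(observed, predicted):
    """R² = 1 - SS_res / SS_tot for observed values and model predictions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty series of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R² is undefined")
    return 1.0 - ss_res / ss_tot


def inside_epsilon_tube(observed, predicted, epsilon):
    """True where |y_i - f(x_i)| <= epsilon, i.e. both SVR constraints hold."""
    return [abs(y - p) <= epsilon for y, p in zip(observed, predicted)]


# Hypothetical observations and predictions f(x_i), for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.0, 9.0]
r2 = coefficient_of_determination(observed, predicted)
tube = inside_epsilon_tube(observed, predicted, epsilon=0.5)
print(r2, tube)
```

Points flagged False lie outside the tube; in practice a fitted SVR tolerates such points through slack terms, which the simplified problem on the slide does not show.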
Data Processing Profiles



The current volume of external data grabbed is large but not huge (~10 GB)

Data aggregation
      E.g. SELECT … GROUP BY date
                                      [Chart: data volume]

Correlation search         ~5 GB × 12³ = 8.64 TB
      E.g. SVR computing
                                      [Chart: data volume]

          We need parallel computing to divide
          the RAM requirement and the processing time!
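The volume arithmetic can be checked quickly. The sketch below reproduces the slide's figure, reading the factor as 12 cubed, and shows how the per-worker share shrinks when the data is split evenly. The worker counts are hypothetical, and an even split is a simplifying assumption.

```python
GB_PER_TB = 1000  # decimal units, matching the slide's 8.64 TB figure

base_volume_gb = 5            # from the slide
expansion_factor = 12 ** 3    # from the slide: 1728
total_gb = base_volume_gb * expansion_factor
total_tb = total_gb / GB_PER_TB


def ram_per_worker_gb(total_gb, workers):
    """RAM each worker must hold if the data set is split evenly."""
    return total_gb / workers


print(total_gb, total_tb)
for workers in (1, 6, 48):    # hypothetical worker counts
    print(workers, ram_per_worker_gb(total_gb, workers))
```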
How to build the platform?




IT drivers

                                       Requirements            IT drivers
Aggregate data from MB to GB files     Data aggregation        IO elasticity
  with sequential reading
SVR and NLP tasks, with an execution   Large tasks execution   CPU elasticity
  time of ~100 ms per task
Process many TB of in-memory data      Large RAM execution     RAM elasticity
Increase the ROI of the research       Low CAPEX               Commodity HW, OSS SW
  project while decreasing the TCO     Low OPEX                Cost elasticity
Available solutions




[Comparison matrix; the marks in the individual cells were lost in extraction]

Criteria: IO elasticity, CPU elasticity, RAM elasticity, commodity hardware,
          OSS software, cost elasticity

Solutions compared:
   RDBMS
   In-memory analytics
   HPC
   Hadoop (three criteria annotated "with repartitioning")
   AWS Elastic MapReduce (two criteria annotated "through Task")
AWS Elastic MapReduce Architecture




           Source: AWS

Hadoop components




Applications
   Custom App          Java, C#, PHP, …
   Data mining tools   R, SAS
   BI tools            Tableau, Pentaho, …

Access
   Hue                 Hadoop GUI
   Pig                 Flow processing
   Streaming           MR scripting
   Hive                SQL-like querying

Processing and services
   Oozie               MR workflow
   MapReduce           Parallel processing framework
   Zookeeper           Coordination service
   Mahout              Machine Learning
   Hama                Bulk synchronous processing
   Sqoop               RDBMS integration
   Flume               Data stream integration

Storage and search
   Solr                Full text search
   HBase               NoSQL on HDFS
   HDFS                Distributed file storage

Grid of commodity hardware – storage and processing
General architecture of the platform

DataViz Application

Storage
   AWS S3     • Store raw data
              • Store results files
   Redis      • Store detailed results for drill down

Elastic MapReduce cluster
   Master Instance
   Core Instances 1 & 2          2 x m2.4xlarge
   Task Instances 1, 2, 3 & 4    4 x m2.4xlarge
                                 • For SVR and NLP processing only
Data aggregation with Pig Job flow

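Num_of_messages_by_date.pig

-- Load the forum messages: one record per line (posted date, message text, source URL).
-- Fields are read with the default tab delimiter; add USING PigStorage(';')
-- after the path for semicolon-separated input.
records = LOAD '/input/forums/messages.txt'
    AS (str_date:chararray, message:chararray, url:chararray);

-- Group the messages by posted date
date_grouped = GROUP records BY str_date;

-- Count the messages posted on each date
results = FOREACH date_grouped GENERATE group, COUNT(records);

DUMP results;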
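For checking a small sample locally, the same group-by-date count can be sketched in plain Python. This is an illustrative equivalent, not part of the talk: the tab separator and the sample records (including the forum.example URLs) are assumptions.

```python
from collections import Counter


def count_messages_by_date(lines, sep="\t"):
    """Count records per date; the date is the first field of each line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        str_date = line.split(sep, 1)[0]
        counts[str_date] += 1
    return counts


# Hypothetical records, for illustration only.
sample = [
    "2012-01-03\tfirst message\thttp://forum.example/1",
    "2012-01-03\tsecond message\thttp://forum.example/2",
    "2012-01-04\tthird message\thttp://forum.example/3",
]
counts = count_messages_by_date(sample)
for str_date in sorted(counts):
    print(str_date, counts[str_date])
```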
Hadoop streaming



Hadoop streaming runs map/reduce jobs with any
executable or script, communicating through standard input and
standard output

Conceptually, it behaves like this shell pipeline, distributed over a cluster:
   cat input.txt | map.py | sort | reduce.py



Why Hadoop streaming?
   Intensive use of NLTK for Natural Language Processing
   Intensive use of NumPy and scikit-learn for Machine Learning
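The map, sort and reduce chain can be simulated in a single process, which is handy for testing scripts before submitting a job. The sketch below is a simplified local stand-in under stated assumptions: the function names are illustrative, and a word count replaces the real job logic.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one (word, 1) pair per word, like a streaming map script."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reducer(sorted_pairs):
    """Sum the values of consecutive pairs sharing the same key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))


def run_local(lines):
    """Chain map, sort and reduce the way `map.py | sort | reduce.py` does."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["fever cough", "Fever headache"])
print(result)
```

The reducer relies on its input being sorted by key, which is why the sort step sits between the two scripts.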
Stemmed word distribution with Hadoop streaming, mapper.py

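Stem_distribution_by_date/mapper.py
import sys
from nltk.tokenize import regexp_tokenize
from nltk.stem.snowball import FrenchStemmer

# Build the stemmer once instead of once per input line
stemmer = FrenchStemmer()

# Input comes from STDIN: one "date;message;url" record per line
for line in sys.stdin:
    fields = line.strip().split(";")
    if len(fields) < 3:
        continue  # skip malformed records
    # The message itself may contain ";", so rebuild it from the middle fields
    str_date, message = fields[0], ";".join(fields[1:-1])

    # Split the message into word tokens
    tokens = regexp_tokenize(message, pattern=r'\w+')
    for token in tokens:
        word = stemmer.stem(token)
        # Ignore very short stems
        if len(word) >= 3:
            print('%s;%s' % (word, str_date))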
Stemmed word distribution with Hadoop streaming, reducer.py

Stem_distribution_by_date/reducer.py
import sys
import json
from itertools import groupby
from operator import itemgetter
from nltk.probability import FreqDist

def read(f):
    # Yield [stem, date] pairs from the "stem;date" lines produced by the mapper
    for line in f:
        line = line.strip()
        if line:
            yield line.split(';')

data = read(sys.stdin)

# The input arrives sorted, so each stem forms one contiguous group
for current_stem, group in groupby(data, itemgetter(0)):
    values = [item[1] for item in group]
    # Count how many times the stem appears on each date
    freq_dist = FreqDist(values)

    print("%s;%s" % (current_stem, json.dumps(freq_dist)))
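Each line written by the reducer has the form `stem;{counts by date}`. A possible next step, sketched here under assumptions, is to roll those daily counts up into a monthly timeline for the stem. The ISO date format (`YYYY-MM-DD`) and the sample line are hypothetical; the talk does not state the date format.

```python
import json
from collections import defaultdict


def monthly_timeline(reducer_line):
    """Turn one 'stem;{json counts by date}' line into (stem, {month: count})."""
    stem, payload = reducer_line.rstrip("\n").split(";", 1)
    counts_by_date = json.loads(payload)
    by_month = defaultdict(int)
    for str_date, count in counts_by_date.items():
        month = str_date[:7]  # assumes ISO dates such as 2012-03-14
        by_month[month] += count
    return stem, dict(sorted(by_month.items()))


# Hypothetical reducer output line, for illustration only.
line = 'migrain;{"2012-03-14": 2, "2012-03-20": 1, "2012-04-02": 5}'
stem, timeline = monthly_timeline(line)
print(stem, timeline)
```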
Conclusions


 • The correlation search currently identifies 462 variables correlated with R² >= 80%
   and a lag >= 1 month

 • Amazon Elastic MapReduce provides the elasticity required by the morphology of
   the jobs, as well as cost elasticity
     Monthly cost with zero activity: < 5 €
     Monthly cost with intensive activity: < 1,000 €
     The equivalent cost of the platform would be around 50,000 €

 • The S3 transfer overhead is not a problem given the volume of stored data

 • During correlation search processing, at most 80% of the virtual CPUs are
   used, because job scheduling gives a parallelism factor of 36 instead of 48
   compared with SMP
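The selection rule in the first conclusion (R² of at least 80% with a lag of at least one month) can be sketched as a scan over candidate lags. This is an illustration only: it uses the squared Pearson correlation in place of the SVR-based R² from the talk, and the series, `max_lag` and function names are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return (cov * cov) / (var_x * var_y)


def best_leading_lag(variable, claims, max_lag=6, threshold=0.8):
    """Return (lag, R²) for the best lag >= 1 meeting the threshold, else None.

    A lag of k pairs variable[t] with claims[t + k], so the variable leads the claims.
    """
    best = None
    for lag in range(1, max_lag + 1):
        xs = variable[:len(variable) - lag]
        ys = claims[lag:]
        if len(xs) < 3:
            break  # too few overlapping points to be meaningful
        score = r_squared(xs, ys)
        if score >= threshold and (best is None or score > best[1]):
            best = (lag, score)
    return best


# Hypothetical monthly series: the claims repeat the variable two months later.
variable = [3, 8, 2, 9, 4, 7, 1, 6, 5, 10]
claims = [0, 0] + variable[:-2]
found = best_leading_lag(variable, claims)
print(found)
```

Requiring a lag of at least one keeps only variables that move before the claims do, which is what makes them usable for prediction.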
Future works


Data mining

    Increase the number of data sources
    Test the robustness of the predictive model over time
    Reduce the overfitting of the correlation
    Enhance the correlation search for words by testing combinations

IT
 • Switch only the correlation search to a MapReduce engine for SMP
   architectures and clusters of cores, inspired by the Stanford Phoenix and
   Nokia Disco engines
 • Industrialize the data mining components as a platform for generalization to
   P&C (IARD) insurance, banking, e-commerce, telecoms and retail
OCTO in a nutshell

Big Data Analytics offer
   Business case and benchmark studies
   Business Proof of Concept
   Data feeds: Web Trends
   Big Data and Analytics architecture design
   Big Data project delivery
   Training and seminars: Big Data, Hadoop

IT consulting firm
   Established in 1998
   175 employees
   19.5 million turnover worldwide (2011)
   Verticals-based organization
      Banking – Financial Services
      Insurance
      Media – Internet – Leisure
      Industry – Distribution
      Telecom – Services

[Figure: OCTO offices]
Thank you!





Contenu connexe

Tendances

Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured DataDataWorks Summit
 
Using Big Data to create a data drive organization
Using Big Data to create a data drive organizationUsing Big Data to create a data drive organization
Using Big Data to create a data drive organizationEdward Chenard
 
Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.SlideTeam.net
 
Data mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesData mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesSlideTeam.net
 
The Future of ERP by Bertrand Andries
The Future of ERP by Bertrand Andries  The Future of ERP by Bertrand Andries
The Future of ERP by Bertrand Andries CONFENIS 2012
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceTed Dunning
 
Progress with confidence into next generation IT
Progress with confidence into next generation ITProgress with confidence into next generation IT
Progress with confidence into next generation ITPaul Muller
 
Hadoop: What It Is and What It's Not
Hadoop: What It Is and What It's NotHadoop: What It Is and What It's Not
Hadoop: What It Is and What It's NotInside Analysis
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationLucidworks (Archived)
 
Big Data World Forum
Big Data World ForumBig Data World Forum
Big Data World Forumbigdatawf
 
Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...
Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...
Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...Digiday
 
Opening Keynote: Putting IBM Watson to Work
Opening Keynote: Putting IBM Watson to WorkOpening Keynote: Putting IBM Watson to Work
Opening Keynote: Putting IBM Watson to WorkInnoTech
 
Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...
Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...
Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...Foviance
 
Debs 2012 uncertainty tutorial
Debs 2012 uncertainty tutorialDebs 2012 uncertainty tutorial
Debs 2012 uncertainty tutorialOpher Etzion
 
Jarrar.lecture notes.ontologyintroduction
Jarrar.lecture notes.ontologyintroductionJarrar.lecture notes.ontologyintroduction
Jarrar.lecture notes.ontologyintroductionSinaInstitute
 
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUMETHE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUMEGigaom
 

Tendances (19)

Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured Data
 
Using Big Data to create a data drive organization
Using Big Data to create a data drive organizationUsing Big Data to create a data drive organization
Using Big Data to create a data drive organization
 
Big data primer
Big data primerBig data primer
Big data primer
 
Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.Data mining process powerpoint ppt slides.
Data mining process powerpoint ppt slides.
 
Data mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesData mining process powerpoint presentation templates
Data mining process powerpoint presentation templates
 
The Future of ERP by Bertrand Andries
The Future of ERP by Bertrand Andries  The Future of ERP by Bertrand Andries
The Future of ERP by Bertrand Andries
 
Introduction to R for Data Mining
Introduction to R for Data MiningIntroduction to R for Data Mining
Introduction to R for Data Mining
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
 
Progress with confidence into next generation IT
Progress with confidence into next generation ITProgress with confidence into next generation IT
Progress with confidence into next generation IT
 
Hadoop: What It Is and What It's Not
Hadoop: What It Is and What It's NotHadoop: What It Is and What It's Not
Hadoop: What It Is and What It's Not
 
Informatics technologies in an evolving r & d landscape
Informatics technologies in an evolving r & d landscapeInformatics technologies in an evolving r & d landscape
Informatics technologies in an evolving r & d landscape
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to Information
 
Big Data World Forum
Big Data World ForumBig Data World Forum
Big Data World Forum
 
Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...
Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...
Digiday Exchange: Infersystems Tech Talk: "The Pulse of RTB: A View Through t...
 
Opening Keynote: Putting IBM Watson to Work
Opening Keynote: Putting IBM Watson to WorkOpening Keynote: Putting IBM Watson to Work
Opening Keynote: Putting IBM Watson to Work
 
Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...
Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...
Neil Mason presents on Data Mining and Predictive Analytics at Emetrics San F...
 
Debs 2012 uncertainty tutorial
Debs 2012 uncertainty tutorialDebs 2012 uncertainty tutorial
Debs 2012 uncertainty tutorial
 
Jarrar.lecture notes.ontologyintroduction
Jarrar.lecture notes.ontologyintroductionJarrar.lecture notes.ontologyintroduction
Jarrar.lecture notes.ontologyintroduction
 
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUMETHE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
 

En vedette

Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04Ted Dunning
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Modern Data Stack France
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiModern Data Stack France
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaModern Data Stack France
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopHortonworks
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainModern Data Stack France
 

En vedette (20)

Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
 
Hadoop on Azure
Hadoop on AzureHadoop on Azure
Hadoop on Azure
 
Cascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand DechouxCascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand Dechoux
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy Hanna
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on Hadoop
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Dépasser map() et reduce()
Dépasser map() et reduce()Dépasser map() et reduce()
Dépasser map() et reduce()
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
 
Hadoop chez Kobojo
Hadoop chez KobojoHadoop chez Kobojo
Hadoop chez Kobojo
 
Big Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent HeuschlingBig Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent Heuschling
 
HCatalog
HCatalogHCatalog
HCatalog
 
Hadopp Vue d'ensemble
Hadopp Vue d'ensembleHadopp Vue d'ensemble
Hadopp Vue d'ensemble
 
Hadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas VialHadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas Vial
 
Retour Hadoop Summit 2012
Retour Hadoop Summit 2012Retour Hadoop Summit 2012
Retour Hadoop Summit 2012
 

Similaire à Analyse prédictive en assurance santé par Julien Cabot

Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...Big Data Spain
 
Extending Recommendation Systems With Semantics And Context Awareness
Extending Recommendation Systems With Semantics And Context AwarenessExtending Recommendation Systems With Semantics And Context Awareness
Extending Recommendation Systems With Semantics And Context AwarenessVictor Codina
 
2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dcc.titus.brown
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
Appistry WGDAS Presentation
Appistry WGDAS PresentationAppistry WGDAS Presentation
Appistry WGDAS Presentationelasticdave
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingJeremy Yang
 
Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTERN Australia
 
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip finalDeborah McGuinness
 
Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Miningbutest
 
Data-Mining-ppt (1).pptx
Data-Mining-ppt (1).pptxData-Mining-ppt (1).pptx
Data-Mining-ppt (1).pptxParvathyparu25
 
Data-Mining-ppt.pptx
Data-Mining-ppt.pptxData-Mining-ppt.pptx
Data-Mining-ppt.pptxayush309565
 
Koss 1605 machine_learning_mariocho_t10
Koss 1605 machine_learning_mariocho_t10Koss 1605 machine_learning_mariocho_t10
Koss 1605 machine_learning_mariocho_t10Mario Cho
 
20120419 linkedopendataandteamsciencemcguinnesschicago
20120419 linkedopendataandteamsciencemcguinnesschicago20120419 linkedopendataandteamsciencemcguinnesschicago
20120419 linkedopendataandteamsciencemcguinnesschicagoDeborah McGuinness
 
WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent ...
WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent ...WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent ...
WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent ...Amazon Web Services
 
Scientific data management from the lab to the web
Scientific data management   from the lab to the webScientific data management   from the lab to the web
Scientific data management from the lab to the webJose Manuel Gómez-Pérez
 
IRJET- Factoid Question and Answering System
IRJET-  	  Factoid Question and Answering SystemIRJET-  	  Factoid Question and Answering System
Predictive analysis in health insurance, by Julien Cabot

  • 1. Health Insurance Predictive Analysis with MapReduce and Machine Learning. Julien Cabot, Managing Director, OCTO (jcabot@octo.com, @julien_cabot). 50, avenue des Champs-Elysées, 75008 Paris, FRANCE. Tel: +33 (0)1 58 56 10 00. Fax: +33 (0)1 58 56 10 01. © OCTO 2012, www.octo.com
  • 2. Internet as a Data Source… Internet as the voice of the crowd
  • 3. … in Healthcare. 71% about: illness, symptom, medicine, advice / opinion. Main sources are old-school forums, not social networks.
  • 4. Benefits for Insurance Company? Understand the patient's subjects of interest to design customer-centric products and marketing actions. Anticipate the psycho-social effect of the Internet to prevent excessive consultations (and reimbursements). Predict claims by monitoring requests about symptoms and drugs.
  • 5. How to run the predictive analysis?
  • 6. The data problem. Understand the semantic field of Healthcare… as used on the Internet. Find correlations between the evolution of claims and… many millions of unidentified external variables. Find correlated variables… that anticipate the claims. We need some help from Machine Learning!
  • 7. Correlation search in external datasets
      Inputs:
        Automated tokenization of messages per posted date and semantic tagging: trends of medical keywords used in forums
        Google search volume of symptom and drug keywords: trends of medical keywords searched in Google
        Socio-economical context from Open Data initiatives: trends of socio-economical factors
        Health claims by act typology
      Processing: Correlation Search Machine
      Output: determination coefficient (R²) sorted matrix
  • 8. Understand the semantic field of Healthcare
      Pipeline: message tokenization by date, then word stemming, tagging and common word filtering with NLTK, then timelines of healthcare keywords.
      How to tag Healthcare words?
        1. Build a first list of keywords
        2. Enrich the list with highly searched keywords
        3. Learn automatically from Wikipedia Medical Categories
      Result: a Healthcare semantic field keywords database.
  • 9. How to find correlations between time series?
      Compare the evolution of the variable and the claims over time.
      Find a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR).
      Problem to solve:
        $\min_{w} \frac{1}{2} w^{T} \cdot w$
        subject to
        $y_i - (w^{T} \cdot \phi(x_i) + b) \le \varepsilon$
        $(w^{T} \cdot \phi(x_i) + b) - y_i \le \varepsilon$
      [Figure: f(x) plotted against x with the tube bounded by f(x) + ε and f(x) - ε]
      Resolution:
        Stochastic gradient descent
        Test the response through the coefficient of determination R²
      Open source ML library helps!
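As a rough illustration of the "compare the variable with the claims, then score with R²" step, here is a minimal sketch that scans candidate lags and keeps the one with the highest R². It deliberately swaps the slide's SVR for an ordinary least-squares line so it runs on the standard library alone; the function names, the `max_lag` parameter and the two monthly series are hypothetical and not taken from the deck.

```python
from statistics import mean


def r_squared(xs, ys):
    """R² of the least-squares line ys ~ a * xs + b (simple linear stand-in for SVR)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series explains nothing
    return (sxy * sxy) / (sxx * syy)


def best_lag(variable, claims, max_lag):
    """Return (lag, r2) with the highest R² when `variable` leads `claims` by `lag` months."""
    best = (0, 0.0)
    for lag in range(1, max_lag + 1):
        xs = variable[:-lag]  # variable observed at month t
        ys = claims[lag:]     # claims observed at month t + lag
        r2 = r_squared(xs, ys)
        if r2 > best[1]:
            best = (lag, r2)
    return best


# Hypothetical monthly series: claims follow the keyword volume two months later.
keyword_volume = [10, 12, 15, 11, 9, 14, 18, 20, 16, 13, 12, 17]
claims = [0, 0] + [3 * v + 5 for v in keyword_volume[:-2]]

lag, r2 = best_lag(keyword_volume, claims, max_lag=4)
print(lag, round(r2, 3))
```

A variable would then be retained only if its best R² and lag pass the chosen thresholds; a non-linear model such as SVR would replace `r_squared` without changing the scan over lags.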
  • 10. Data Processing Profiles
      The current volume of external data grabbed is large but not huge (~10 GB).
      Data aggregation (e.g. Select … Group By Date): data volume.
      Correlation search (e.g. SVR computing): data volume ~5 GB × 12³ = 8.64 TB.
      We need parallel computing to divide the RAM requirement and the processing time!
  • 11. How to build the platform?
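The correlation-search volume on this slide reads most naturally as 5 GB × 12³, the exponent having been flattened in extraction. The short check below confirms that this reading reproduces the 8.64 TB figure with decimal units; the slide does not say where the factor 12³ comes from, so it is taken as given, and the variable names are illustrative only.

```python
# Volume figures taken from the slide; the origin of the 12**3 factor is not stated there.
base_gb = 5
factor = 12 ** 3               # 1728

total_gb = base_gb * factor    # gigabytes
total_tb = total_gb / 1000     # decimal terabytes

print(total_gb, total_tb)
```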
  • 12. IT drivers (requirements and resulting IT drivers)
      Data aggregation: aggregate data from MB to GB files with sequential reading. Driver: IO elasticity.
      SVR, NLP (large tasks execution): task execution time is ~100 ms per task. Driver: CPU elasticity.
      Large RAM execution: process many TB of in-memory data. Driver: RAM elasticity.
      Cost: increase the ROI of the research project while decreasing the TCO. Drivers: low CAPEX (commodity HW), low OPEX (OSS SW), cost elasticity.
  • 13. Available solutions [Table: RDBMS, in-memory analytics, HPC, Hadoop and AWS Elastic MapReduce compared on IO elasticity, CPU elasticity, RAM elasticity, cost elasticity, OSS software and commodity hardware; Hadoop entries annotated "with repartitioning", AWS Elastic MapReduce entries annotated "through Task"]
  • 14. AWS Elastic MapReduce Architecture (source: AWS)
  • 15. Hadoop components
      Clients: custom apps (Java, C#, PHP, …), data mining tools (R, SAS), BI tools (Tableau, Pentaho, …)
      Hue: Hadoop GUI
      Pig: flow processing
      Streaming: MR scripting
      Hive: SQL-like querying
      Oozie: MR workflow
      MapReduce: parallel processing framework
      Zookeeper: coordination service
      Mahout: machine learning
      Sqoop: RDBMS integration
      Hama: bulk synchronous processing
      Flume: data stream integration
      Solr: full text search
      HBase: NoSQL on HDFS
      HDFS: distributed file storage
      Grid of commodity hardware: storage and processing
  • 16. General architecture of the platform
      DataViz Application
      Redis: store detailed results for drill down
      AWS S3: store raw data, store results files
      Master Instance
      Core Instances 1 and 2: 2 x m2.4xlarge
      Task Instances 1 to 4: 4 x m2.4xlarge; Task Instances 3 & 4 for SVR and NLP processing only
  • 17. Data aggregation with Pig
      Job flow: Num_of_messages_by_date.pig

        records = LOAD '/input/forums/messages.txt'
            AS (str_date:chararray, message:chararray, url:chararray);
        date_grouped = GROUP records BY str_date;
        results = FOREACH date_grouped GENERATE group, COUNT(records);
        DUMP results;
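For readers less familiar with Pig, the job above amounts to a "group by date, count rows" aggregation. The sketch below reproduces that logic in plain Python on a tiny in-memory sample; the function name, the tab separator (PigStorage's default delimiter, assumed here because the script names no loader) and the sample records are illustrative assumptions.

```python
from collections import Counter


def count_messages_by_date(lines, sep="\t"):
    """Count records per date; the date is assumed to be the first field of each line."""
    counts = Counter()
    for line in lines:
        str_date = line.rstrip("\n").split(sep, 1)[0]
        counts[str_date] += 1
    return counts


# Hypothetical records in the (str_date, message, url) layout declared by the Pig script.
sample = [
    "2012-01-03\tmessage a\thttp://example.org/a",
    "2012-01-03\tmessage b\thttp://example.org/b",
    "2012-01-04\tmessage c\thttp://example.org/c",
]

results = count_messages_by_date(sample)
for str_date, count in sorted(results.items()):
    print(str_date, count)
```

On a single machine this is enough for small files; the Pig version is useful because the same grouping is distributed over the cluster.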
  • 18. Hadoop streaming
      Hadoop streaming runs map/reduce jobs with any executable or script, communicating through standard input and standard output.
      Conceptually, it runs the following pipeline (distributed over a cluster): cat input.txt | map.py | sort | reduce.py
      Why Hadoop streaming?
        Intensive use of NLTK for Natural Language Processing
        Intensive use of NumPy and scikit-learn for Machine Learning
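The map | sort | reduce pipeline can be tried out in a single process before anything is submitted to a cluster. The sketch below is one way to do that, assuming `date;message;url` input lines; it uses a plain regular expression and `collections.Counter` as stand-ins for NLTK's tokenizer, stemmer and `FreqDist`, so words are not stemmed here, and the sample data is hypothetical.

```python
import re
from collections import Counter
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one 'word;date' record per token, as a streaming mapper would on STDOUT."""
    for line in lines:
        str_date, message, _url = line.strip().split(";")
        for token in re.findall(r"\w+", message.lower()):
            if len(token) >= 3:
                yield "%s;%s" % (token, str_date)


def reducer(sorted_lines):
    """Count dates per word; relies on the input being sorted so equal words are adjacent."""
    records = (line.split(";") for line in sorted_lines)
    for word, group in groupby(records, key=itemgetter(0)):
        yield word, Counter(date for _word, date in group)


# Hypothetical forum records in 'date;message;url' form.
sample = [
    "2012-01;toux et fievre;http://example.org/1",
    "2012-01;forte toux;http://example.org/2",
    "2012-02;fievre persistante;http://example.org/3",
]

# map | sort | reduce, all in one process
result = dict(reducer(sorted(mapper(sample))))
print(result["toux"])
```

The `sorted` call plays the role of Hadoop's shuffle-and-sort phase, which is what lets the reducer use `groupby` on adjacent records.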
  • 19. Stemmed word distribution with Hadoop streaming, mapper.py
      Stem_distribution_by_date/mapper.py

        import sys

        from nltk.stem.snowball import FrenchStemmer
        from nltk.tokenize import regexp_tokenize

        # Build the stemmer once rather than once per input line.
        stemmer = FrenchStemmer()

        # Input comes from STDIN: one "date;message;url" record per line.
        for line in sys.stdin:
            line = line.strip()
            str_date, message, url = line.split(";")
            # Keep runs of word characters as tokens.
            tokens = regexp_tokenize(message, pattern=r'\w+')
            for token in tokens:
                word = stemmer.stem(token)
                if len(word) >= 3:
                    # Emit "stem;date" for the reducer.
                    print('%s;%s' % (word, str_date))
  • 20. Stemmed word distribution with Hadoop streaming, reducer.py
      Stem_distribution_by_date/reducer.py

        import sys
        import json
        from itertools import groupby
        from operator import itemgetter

        from nltk.probability import FreqDist

        def read(f):
            # Yield [stem, date] pairs from "stem;date" lines.
            for line in f:
                line = line.strip()
                yield line.split(';')

        data = read(sys.stdin)
        # Hadoop streaming sorts the mapper output, so equal stems are adjacent.
        for current_stem, group in groupby(data, itemgetter(0)):
            values = [item[1] for item in group]
            # Count how often the stem occurs on each date.
            freq_dist = FreqDist(values)
            print("%s;%s" % (current_stem, json.dumps(freq_dist)))
  • 22. Conclusions
      The correlation search currently identifies 462 variables correlated with an R² >= 80% and a lag >= 1 month.
      Amazon Elastic MapReduce provides the elasticity required by the morphology of the jobs, as well as cost elasticity:
        Monthly cost with zero activity: < 5 €
        Monthly cost with intensive activity: < 1 000 €
        The equivalent cost of the platform would be around 50 000 €
      The S3 transfer overhead is not a problem given the volume of stored data.
      During correlation search processing, at most 80% of the virtual CPUs are used, because job scheduling runs with a parallelism factor of 36 instead of 48 regarding SMP.
  • 23. Future work
      Data mining:
        Increase the number of data sources
        Test the robustness of the predictive model over time
        Reduce the overfitting of the correlation
        Enhance the correlation search for words by testing combinations
      IT:
        Switch only the correlation search to a MapReduce engine for SMP architectures and clusters of cores, inspired by the Stanford Phoenix and the Nokia Disco engines
        Industrialize the data mining components as a platform, for generalization to IARD insurance, banking, e-commerce, telecoms and retail
  • 24. OCTO in a nutshell
      Big Data Analytics offer:
        Business case and benchmark studies
        Business Proof of Concept
        Data feeds: Web Trends
        Big Data and Analytics architecture design
        Big Data project delivery
        Training, seminars: Big Data, Hadoop
      IT consulting firm:
        Established in 1998
        175 employees
        19.5 million turnover worldwide (2011)
        Verticals-based organization: Banking and Financial Services; Insurance; Media, Internet and Leisure; Industry and Distribution; Telecom and Services
      OCTO offices