
Predictive analysis in health insurance, by Julien Cabot




  1. Health Insurance Predictive Analysis with MapReduce and Machine Learning
     Julien Cabot, Managing Director, OCTO (jcabot@octo.com, @julien_cabot)
     50, avenue des Champs-Elysées, 75008 Paris, FRANCE. Tel: +33 (0)1 58 56 10 00. Fax: +33 (0)1 58 56 10 01. www.octo.com
     © OCTO 2012
  2. Internet as a Data Source… Internet as the voice of the crowd
  3. … in Healthcare
     71% about:
     • Illness
     • Symptom
     • Medicine
     • Advice / opinion
     The main sources are old-school forums, not social networks.
  4. Benefits for an Insurance Company?
     • Understand the patients' subjects of interest, to design customer-centric products and marketing actions
     • Anticipate the psycho-social effect of the Internet, to prevent excessive consultations (and reimbursements)
     • Predict claims by monitoring searches about symptoms and drugs
  5. How to run the predictive analysis?
  6. The data problem
     • Understand the semantic field of healthcare as it is used on the Internet
     • Find correlations between the evolution of claims and many millions of unidentified external variables
     • Find correlated variables that anticipate the claims
     We need some help from Machine Learning!
  7. Correlation search in external datasets
     Inputs (diagram, three sources feeding one search):
     • Automated tokenization of messages per posted date and semantic tagging, giving trends of medical keywords used in forums
     • Google search volume of symptom and drug keywords, giving trends of medical keywords searched in Google
     • Socio-economic context from Open Data initiatives, giving trends of socio-economic factors
     • Health claims by act typology
     All inputs feed the Correlation Search Machine, which outputs a sorted matrix of determination coefficients (R²).
  8. Understand the semantic field of Healthcare
     Pipeline (diagram): message tokenization by date, then word stemming, tagging and common-word filtering with NLTK, then timelines of healthcare keywords.
     How to tag healthcare words? Build a healthcare semantic field keywords database:
     1- Build a first list of keywords
     2- Enrich the list with highly searched keywords
     3- Learn automatically from Wikipedia Medical Categories
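The pipeline on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the keyword and common-word sets below are hypothetical placeholders for the keywords database, a simple regular expression stands in for the NLTK tokenizer, and stemming is left out.

```python
import re
from collections import Counter, defaultdict

# Hypothetical seed keywords; the real list would come from the keywords database.
HEALTH_KEYWORDS = {"grippe", "toux", "fievre"}
# Hypothetical common words standing in for the common-word filter.
COMMON_WORDS = {"les", "des", "une", "que"}


def keyword_timelines(messages):
    """Return {keyword: Counter({date: occurrences})} for (date, text) pairs."""
    timelines = defaultdict(Counter)
    for date, text in messages:
        # Tokenize on runs of word characters, case-insensitively.
        for token in re.findall(r"\w+", text.lower()):
            if token in COMMON_WORDS:
                continue  # drop common words
            if token in HEALTH_KEYWORDS:
                timelines[token][date] += 1  # tag and count per date
    return timelines


messages = [
    ("2012-01", "J'ai la grippe et de la toux"),
    ("2012-01", "La grippe revient"),
    ("2012-02", "Toujours de la toux"),
]
timelines = keyword_timelines(messages)
```

Each keyword ends up with one counter per posting date, which is the kind of time series the correlation search consumes.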
  9. How to find correlations between time series?
     • Compare the evolution of the variable and of the claims over time
     • Find a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR)
     Problem to solve (figure: y plotted against x, with the fitted f(x) inside the tube bounded by f(x) + ε and f(x) - ε):
     $$\min_{w} \; \frac{1}{2} w^{T} w$$
     subject to
     $$y_i - (w^{T} \phi(x_i) + b) \le \varepsilon, \qquad (w^{T} \phi(x_i) + b) - y_i \le \varepsilon$$
     Resolution:
     • Stochastic gradient descent
     • Test the response through the coefficient of determination R²
     Open source ML libraries help!
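To make the R² test concrete, here is a minimal sketch that scores one candidate variable against the claims at a given lag. It deliberately swaps the SVR of the slide for an ordinary least-squares line so that it runs with the standard library only; the function names and the sample series are assumptions made for illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line ys ~ a * xs + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        return 0.0  # a constant series explains nothing
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


def lagged_r_squared(variable, claims, lag):
    """Score how well variable[t] explains claims[t + lag]."""
    if len(variable) != len(claims):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(variable) - 2:
        raise ValueError("lag must leave at least two aligned points")
    xs = variable[: len(variable) - lag]
    ys = claims[lag:]
    return r_squared(xs, ys)


# Made-up monthly series where claims follow the variable one month later.
variable = [1, 2, 3, 4, 5, 6]
claims = [0, 3, 5, 7, 9, 11]
score_lag0 = lagged_r_squared(variable, claims, 0)
score_lag1 = lagged_r_squared(variable, claims, 1)
```

Running this for every variable and every lag, then sorting by score, gives a small-scale version of the sorted R² matrix. A variable is interesting when its best score occurs at a lag of one month or more, because it then anticipates the claims.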
  10. Data Processing Profiles
     The current volume of external data collected is large but not huge (~10 GB).
     • Data aggregation (e.g. Select … Group By Date)
     • Correlation search (e.g. SVR computing): data volume ~5 GB × 12³ = 8.64 TB
     We need parallel computing to divide the RAM requirement and the processing time!
  11. How to build the platform?
  12. IT drivers
     Requirements and the IT drivers they imply:
     • Data aggregation: aggregate data from MB to GB files with sequential reading. Driver: IO elasticity
     • SVR, NLP, large task execution: task execution time is ~100 ms per task. Driver: CPU elasticity
     • Large RAM execution: process many TB of in-memory data. Driver: RAM elasticity
     • Increase the ROI of the research project while decreasing the TCO. Drivers: commodity HW, OSS SW, low CAPEX, low OPEX, cost elasticity
  13. Available solutions
     [Comparison matrix. Solutions: RDBMS, In Memory analytics, HPC, Hadoop, AWS Elastic MapReduce. Criteria: IO Elasticity, CPU Elasticity, RAM Elasticity, Cost Elasticity, OSS Software, Commodity Hardware. The cell values are not recoverable from the transcript; the surviving annotations are "With repartitioning" (three cells) and "Through Task" (two cells).]
  14. AWS Elastic MapReduce Architecture [figure, source: AWS]
  15. Hadoop components
     Client tools:
     • Custom App: Java, C#, PHP, …
     • Data mining tools: R, SAS
     • BI tools: Tableau, Pentaho, …
     Hadoop stack:
     • Hue: Hadoop GUI
     • Pig: flow processing
     • Streaming: MR scripting
     • Hive: SQL-like querying
     • Oozie: MR workflow
     • MapReduce: parallel processing framework
     • Zookeeper: coordination service
     • Mahout: Machine Learning
     • Sqoop: RDBMS integration
     • Hama: bulk synchronous processing
     • Flume: data stream integration
     • Solr: full text search
     • HBase: NoSQL on HDFS
     • HDFS: distributed file storage
     Underneath: a grid of commodity hardware for storage and processing
  16. General architecture of the platform
     • DataViz Application
     • Redis: stores detailed results for drill down
     • AWS S3: stores raw data and results files
     • Master Instance
     • Core Instances 1 and 2: 2 x m2.4xlarge
     • Task Instances 1 to 4: 4 x m2.4xlarge (Task Instances 3 & 4 for SVR and NLP processing only)
  17. Data aggregation with Pig
     Job flow: Num_of_messages_by_date.pig

         records = LOAD '/input/forums/messages.txt'
             AS (str_date:chararray, message:chararray, url:chararray);
         date_grouped = GROUP records BY str_date;
         results = FOREACH date_grouped GENERATE group, COUNT(records);
         DUMP results;
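For checking the Pig job on a small sample, the same group-and-count can be sketched in plain Python. The function name and the sample records are hypothetical. The tab delimiter is an assumption that matches Pig's default loader, since the script names no delimiter.

```python
from collections import Counter


def count_messages_by_date(lines, delimiter="\t"):
    """Count records per date, where the date is the first field of each line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        str_date = line.split(delimiter, 1)[0]
        counts[str_date] += 1
    return counts


sample = [
    "2012-01-03\tJ'ai de la fievre\thttp://example.org/a",
    "2012-01-03\tGrippe ?\thttp://example.org/b",
    "2012-01-04\tToux persistante\thttp://example.org/c",
]
counts = count_messages_by_date(sample)
```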
  18. Hadoop streaming
     Hadoop streaming runs map/reduce jobs with any executables or scripts, through standard input and standard output.
     Conceptually, a job behaves like this pipeline, distributed over a cluster:

         cat input.txt | map.py | sort | reduce.py

     Why Hadoop streaming?
     • Intensive use of NLTK for Natural Language Processing
     • Intensive use of NumPy and scikit-learn for Machine Learning
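The map, sort, reduce contract can be tried in a single process before submitting anything to a cluster. The sketch below is an illustration only: the `mapper`, `reducer` and `run_local` names, the `date;message` input format and the plain whitespace tokenization are all assumptions, and an in-memory `sorted` call stands in for the shuffle phase.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word;date' line per word of each 'date;message' record."""
    for line in lines:
        str_date, message = line.rstrip("\n").split(";", 1)
        for word in message.lower().split():
            yield "%s;%s" % (word, str_date)


def reducer(sorted_lines):
    """Count the lines of each word; the input must be sorted by word."""
    pairs = (line.split(";", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s;%d" % (word, sum(1 for _ in group))


def run_local(lines):
    """Chain map, sort and reduce the way the shell pipeline does."""
    return list(reducer(sorted(mapper(lines))))


sample = ["2012-01;toux grippe", "2012-02;toux"]
output = run_local(sample)
```

The reducer relies on `groupby`, which only merges adjacent items, so it is correct only because its input arrives sorted by key. Hadoop streaming gives the same guarantee between the map and reduce phases.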
  19. Stemmed word distribution with Hadoop streaming, mapper.py
     Stem_distribution_by_date/mapper.py

         import sys
         from nltk.tokenize import regexp_tokenize
         from nltk.stem.snowball import FrenchStemmer

         # FrenchStemmer is already language-specific; build it once, not per line.
         stemmer = FrenchStemmer()

         # Input comes from STDIN, one "date;message;url" record per line.
         for line in sys.stdin:
             line = line.strip()
             str_date, message, url = line.split(";")
             # \w+ matches runs of word characters.
             tokens = regexp_tokenize(message, pattern=r'\w+')
             for token in tokens:
                 word = stemmer.stem(token)
                 # Emit "stem;date" for stems of at least three characters.
                 if len(word) >= 3:
                     print('%s;%s' % (word, str_date))
  20. Stemmed word distribution with Hadoop streaming, reducer.py
     Stem_distribution_by_date/reducer.py

         import sys
         import json
         from itertools import groupby
         from operator import itemgetter
         from nltk.probability import FreqDist

         def read(f):
             # Yield [stem, date] pairs from "stem;date" lines.
             for line in f:
                 line = line.strip()
                 yield line.split(';')

         data = read(sys.stdin)
         # Input arrives sorted by stem, so groupby sees each stem exactly once.
         for current_stem, group in groupby(data, itemgetter(0)):
             values = [item[1] for item in group]
             # Count how many times the stem appears on each date.
             freq_dist = FreqDist(values)
             print("%s;%s" % (current_stem, json.dumps(freq_dist)))
  21. Conclusions
  22. Conclusions
     • The correlation search currently identifies 462 variables correlated with claims, with R² >= 80% and a lag >= 1 month
     • Amazon Elastic MapReduce provides the elasticity required by the morphology of the jobs, as well as the cost elasticity
       • Monthly cost with zero activity: < 5 €
       • Monthly cost with intensive activity: < 1 000 €
       • The equivalent cost of the platform would be around 50 000 €
     • The S3 transfer overhead is not a problem given the volume of stored data
     • During correlation search processing, at most 80% of the virtual CPUs are used, because job scheduling runs with a parallelism factor of 36 instead of 48 with respect to SMP
  23. Future work
     Data mining
     • Increase the number of data sources
     • Test the robustness of the predictive model over time
     • Reduce the overfitting of the correlation
     • Enhance the correlation search for words by testing combinations
     IT
     • Switch only the correlation search to a MapReduce engine for SMP architectures and clusters of cores, inspired by the Stanford Phoenix and the Nokia Disco engines
     • Industrialize the data mining components as a platform, for generalization to IARD insurance, banking, e-commerce, telecoms and retail
  24. OCTO in a nutshell
     Big Data Analytics offer
     • Business case and benchmark studies
     • Business Proof of Concept
     • Data feeds: Web Trends
     • Big Data and Analytics architecture design
     • Big Data project delivery
     • Training, seminars: Big Data, Hadoop
     IT consulting firm
     • Established in 1998
     • 175 employees
     • 19.5 million turnover worldwide (2011)
     • Verticals-based organization
       • Banking – Financial Services
       • Insurance
       • Media – Internet – Leisure
       • Industry – Distribution
       • Telecom – Services
     OCTO offices [map]
  25. Thank you!
