
DataXDay - A data scientist journey to industrialization of machine learning

Join the journey of a data scientist on the way to industrialization. From notebook to proof of concept, and from proof of concept to production, we will cover what happened at Air France. These are not golden rules, but a true story. What exactly does industrializing data science mean? How do you package data science models? How do the roles of data scientists and data engineers fit together? Is continuous integration a wild dream for data scientists? This journey covers key concepts that worked at Air France and may shed new light on your own data science journey.

Pauline Ballereau - Air France & Nicolas Laille - Xebia
https://dataxday.fr/

video available: https://www.youtube.com/watch?v=ESx6wR6g4ukx

Published in: Technology


  1. A DATA SCIENTIST JOURNEY TO INDUSTRIALIZATION OF MACHINE LEARNING MODELS. DataXDay 2018, 17th May 2018.
  2. @DataXDay. DATA SCIENCE: FOUNDATIONS FOR DATA SCIENCE AT AIR FRANCE. Adoption of Operations Research for crew scheduling. Extension to other business domains: Revenue Management, Cargo, Ground services, … Adoption of Hadoop. Focus on Machine Learning. Ops Research is now 120 engineers in Paris and Amsterdam. Adoption of data science within AFKL IT was favored by existing Operations Research practice.
  3. DATA SCIENCE: MACHINE LEARNING, SPONSORED BY ORGANIZATION. Organization, through Customer Data Management, is one of the key sponsors of industrialized data science within AFKL. Customer Data Management: customer data strategy; customer knowledge; personalization; coordinates IT efforts.
  4. DATA SCIENCE: STARTING POINT FOR DATA SCIENCE PROJECT IS A POC LOGIC. Diagram labels: DWH (Historical Data, Business Intelligence); LOCAL (Data Sample, Proof of Concept).
  5. DATA SCIENCE: WHAT IS AN « INDUSTRIALIZED » ENGINE? From "Jupyter notebook, R" to "Executable package"; from "On my own" to "Integrated within AFKL IT live ecosystem"; from "Manual launch or crontab" to "Automated calibration and prediction"; from "I guess my code is flawless" to "Unit tested"; from "Theoretical performance" to "Live feedback on performance".
  6. DATA SCIENCE: FROM LOCAL STUDIES… TO A ROBUST LIVE DATA PRODUCT. Diagram labels: LOCAL, Data Sample, Proof of Concept; LIVE, Data feed; DWH, Historical Data, Business Intelligence; EXPLORATION, Historical Data, Proof of Concept; MODELS Repository; Predictions; DATA; API.
  7. DATA SCIENTISTS X DATA ENGINEERS: Fellowship.
  8. DATA SCIENTISTS X DATA ENGINEERS: IT TAKES TWO TO BRING DATA PRODUCTS LIVE (AT LEAST). Timeline: PoC; start of industrialization ("Help! How to ingest and expose data?"); Live Product V1. Data Scientist: translates business ideas into data science; stats, ML, AI. Data Engineer: dev, big data, project architecture.
  9. DATA SCIENTISTS X DATA ENGINEERS: KEEP THE FRONTIER LOOSE. Data scientist and data engineer are roles, not persons. Awareness of the data scientist role on live environments is key.
  10. DATA SCIENTISTS X DATA ENGINEERS: A LIVE ECOSYSTEM. Diagram labels: LIVE, Data feed; DWH, Historical Data, Business Intelligence; EXPLORATION, Historical Data, Proof of Concept; MODELS Repository; Predictions; DATA; API.
  11. PACKAGING DATA SCIENCE: Spark and PEX.
  12. PACKAGING DATA SCIENCE: WHAT DO YOU EXPECT? Setup: features engineering and algorithm form the « Model ». Train: model and training data give a trained model. Predict: trained model and prediction data give predictions. We expect two main functionalities: training and predicting.
  13. PACKAGING DATA SCIENCE: STANDARDIZATION WITH THE PIPELINE PATTERN. Diagram labels: Feature Engineering (SQLTransformer, VectorAssembler); Model training (LogisticRegression.fit(dataset)); LogisticRegressionModel.transform(dataset); Dataset; Dataset + Predictions; Pipeline; Model.
  14. PACKAGING DATA SCIENCE: PEX, JUST LIKE UBERJAR. Diagram labels: PEX; Project package; Company packages; External packages.
  15. PACKAGING DATA SCIENCE: A LIVE ECOSYSTEM. Diagram labels: LIVE, Data feed; DWH, Historical Data, Business Intelligence; EXPLORATION, Historical Data, Proof of Concept; MODELS Repository; Predictions; DATA; API.
  16. PACKAGING DATA SCIENCE: A LIVE ECOSYSTEM… BUT TRAINING DATA AND LIVE DATA ARE DIFFERENT. Diagram labels: LIVE, Data feed; DWH, Historical Data, Business Intelligence; EXPLORATION, Historical Data, Proof of Concept; MODELS Repository; Predictions; DATA; API.
  17. FROM DWH TO DATALAKE: A detour.
  18. FROM DWH TO DATALAKE: TRAINING DATA MUST BE THE SAME AS PRODUCTION. • The data warehouse holds the full historical data • The production platform processes only what live apps need from the raw data • Data processing on the two sides is not identical • The production platform has to build its own full historical data
  19. FROM DWH TO DATALAKE: FROM A HISTORICAL/LIVE SYSTEM. Diagram labels: LIVE, Data feed; DWH, Historical Data, Business Intelligence; EXPLORATION, Historical Data, Proof of Concept; MODELS Repository; Predictions; DATA; API.
  20. FROM DWH TO DATALAKE: TO A FULL LIVE SYSTEM. Diagram labels: LIVE; EXPLORATION, Historical Data, Proof of Concept; Predictions; DATA; Data feed; Historical Data; API; MODELS Repository.
  21. CONTINUOUS IMPROVEMENT: Growing up.
  22. CONTINUOUS IMPROVEMENT: FROM BUD TO FLOWER. • Ease of deploying a new model • Ease of extracting a new feature • Ease of accessing new data • Stay innovative • Time To Market
  23. CONTINUOUS IMPROVEMENT: CRAFTSMANSHIP FROM DATA SCIENTIST SIDE.
  24. CONTINUOUS IMPROVEMENT: CONTINUOUS INTEGRATION. Goal: make sure no code modification breaks anything. What to do? Regularly fetch sources, build the project and run tests. Needs: tools to automate all tedious and repetitive tasks. Because we are lazy.
  25. CONTINUOUS IMPROVEMENT: DATA SCIENTIST - SOFTWARE FACTORY. Diagram labels: Development, CI, CD; Exploration; Build PEX; Expose PEX for other IT teams.
  26. CONTINUOUS IMPROVEMENT: TRACK MODEL VERSIONING. • Calibration metadata: dataset used; timestamp + code version • Keep track of the link between models and predictions: model used; unique ID of prediction; input dataset
  27. CONTINUOUS IMPROVEMENT: FEEDBACK LOOP. Diagram labels: LIVE; EXPLORATION, Historical Data, Proof of Concept; MODELS Repository; Predictions; DATA; Data feed; Historical Data; API; feedback; Metrics.
  28. NEXT STEP: Improve and share best practices.
  29. NEXT STEP: TOO MANY JOURNEYS. • How to maintain the momentum after a few teams have started the adventure? • Every team experienced a different journey • But every team finds a different path
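
Slides 12 and 13 describe two expected functionalities, training and predicting, standardized with the pipeline pattern: fitting a pipeline on a dataset yields a model, and transforming a dataset with that model yields the dataset plus predictions. The sketch below illustrates that pattern in plain Python only; it is not Spark's API and not the code used at Air France. Every class name, column name and the toy threshold "algorithm" are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, float]
Dataset = List[Row]


class AddRatio:
    """Hypothetical feature-engineering stage: adds a derived 'ratio' column."""

    def transform(self, dataset: Dataset) -> Dataset:
        return [{**row, "ratio": row["spend"] / row["trips"]} for row in dataset]


@dataclass
class ThresholdModel:
    """Trained model: predicts 1.0 when the feature exceeds the threshold."""

    feature: str
    threshold: float

    def transform(self, dataset: Dataset) -> Dataset:
        return [
            {**row, "prediction": 1.0 if row[self.feature] > self.threshold else 0.0}
            for row in dataset
        ]


@dataclass
class ThresholdEstimator:
    """Toy algorithm standing in for a real learner: the threshold is the
    midpoint between the mean feature value of each class."""

    feature: str
    label: str

    def fit(self, dataset: Dataset) -> ThresholdModel:
        pos = [r[self.feature] for r in dataset if r[self.label] == 1.0]
        neg = [r[self.feature] for r in dataset if r[self.label] == 0.0]
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return ThresholdModel(self.feature, threshold)


@dataclass
class PipelineModel:
    """Fitted pipeline: applies every stage, including the trained model."""

    stages: list

    def transform(self, dataset: Dataset) -> Dataset:
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset


@dataclass
class Pipeline:
    """Unfitted pipeline: feature engineering followed by one estimator."""

    transformers: list
    estimator: ThresholdEstimator

    def fit(self, dataset: Dataset) -> PipelineModel:
        for transformer in self.transformers:
            dataset = transformer.transform(dataset)
        model = self.estimator.fit(dataset)
        return PipelineModel([*self.transformers, model])


# Train: pipeline + training data -> trained pipeline model.
training = [
    {"spend": 100.0, "trips": 1.0, "label": 1.0},
    {"spend": 300.0, "trips": 2.0, "label": 1.0},
    {"spend": 40.0, "trips": 2.0, "label": 0.0},
    {"spend": 60.0, "trips": 6.0, "label": 0.0},
]
pipeline = Pipeline([AddRatio()], ThresholdEstimator("ratio", "label"))
pipeline_model = pipeline.fit(training)

# Predict: trained pipeline model + prediction data -> dataset + predictions.
scored = pipeline_model.transform(
    [{"spend": 200.0, "trips": 2.0}, {"spend": 30.0, "trips": 3.0}]
)
```

Because the fitted object carries its feature-engineering stages with it, the same transformations run at training time and at prediction time, which is one reason a pattern like this is convenient to package as a single artifact.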

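
Slide 26 lists what to track for model versioning: calibration metadata (dataset used, timestamp, code version) and, for each prediction, the model used, a unique prediction ID and the input dataset. A minimal sketch of such records follows, using only the Python standard library. The record layout, the function names and the `"abc1234"` code version are assumptions made for illustration; the slides do not say how this was implemented.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


def fingerprint(rows) -> str:
    """SHA-256 fingerprint of a dataset given as JSON-serializable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class CalibrationRecord:
    """Hypothetical metadata stored alongside a trained model."""

    model_id: str
    dataset_fingerprint: str  # which dataset was used
    code_version: str         # for example a commit hash (assumption)
    trained_at: str           # UTC timestamp, ISO 8601


@dataclass(frozen=True)
class PredictionRecord:
    """Hypothetical link between one prediction and the model that made it."""

    prediction_id: str
    model_id: str
    input_fingerprint: str


def record_calibration(training_rows, code_version: str) -> CalibrationRecord:
    return CalibrationRecord(
        model_id=str(uuid.uuid4()),
        dataset_fingerprint=fingerprint(training_rows),
        code_version=code_version,
        trained_at=datetime.now(timezone.utc).isoformat(),
    )


def record_prediction(calibration: CalibrationRecord, input_rows) -> PredictionRecord:
    return PredictionRecord(
        prediction_id=str(uuid.uuid4()),
        model_id=calibration.model_id,
        input_fingerprint=fingerprint(input_rows),
    )


# "abc1234" is a placeholder code version, not a real identifier.
calibration = record_calibration([{"x": 1, "y": 0}], code_version="abc1234")
first = record_prediction(calibration, [{"x": 2}])
second = record_prediction(calibration, [{"x": 2}])
```

With records like these, any prediction can be traced back to the model that produced it, and from there to the dataset and code version used for calibration.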