DataOps: From the Lab to the Production Line, and How to Work with Data Scientists

Agile Taichung 2019-05-18 Meetup

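  1. From the Lab to the Production Line: How to Work with Data Scientists
  2. The "that's not how you do it" data scientist. Source: http://codewithmax.com/2018/03/06/basic-example-of-a-neural-network-with-tensorflow-and-keras/
  3. What I thought a data scientist was
  4. What a data scientist actually is. Source: Sculley et al., "Hidden Technical Debt in Machine Learning Systems"
  5. When I want to do a "data science" project
  6. When I want to do a "data science" project: data cleaning, data analysis, data validation, data splitting, model training, model validation
  7. When I want to build a "data science" product. Source: https://udn.com/news/story/11320/3222213
  8. When I want to build a "data science" product: data cleaning, data analysis, data validation, data splitting, model training, model validation
  9. When I want to build a "data science" product: data cleaning, data analysis, data validation, data splitting, model training, model validation, training at scale, model updates, model deployment, model monitoring, model logging, model optimization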
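  10. The data science workflow • Consistent orchestration and environments • Scalable team collaboration on modeling • Continuously meeting requirements • Shorter iteration cycles and automated deployment • Reproducible results • Monitoring of quality and performance, with testing and monitoring
  11. Development + Operations • Development + Operations = DevOps • Users, developers, QA, and operations staff work together to solve the problems of software delivery.
  12. Data + Operations • Data + Operations = DataOps • All data practitioners (data analysts, data scientists, data engineers, IT staff, and so on) work together to continuously deliver quality data to applications and business processes.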
  13. Data + Operations. Source: https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7
  14. Data + Operations. Source: https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7
  15. Data + Operations. Source: https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7
  16. Putting DataOps into practice. Source: https://www.kubeflow.org/
  17. Putting DataOps into practice. Source: KubeCon Europe 2018
  18. Putting DataOps into practice. Source: https://blog.paperspace.com/ci-cd-for-machine-learning-ai/
  19. Putting DataOps into practice. Source: https://blog.paperspace.com/ci-cd-for-machine-learning-ai/
  20. Putting DataOps into practice. https://www.infuseai.io
  21. Machine Learning Process • Data sources: SQL DB, Cosmos DB, Data warehouse, Data lake, Blob storage, … • Steps: Prepare Data, Build & Train, Deploy
  22. Machine Learning Problem Example • How much is this car worth?
  23. Model Creation Is Typically Time-Consuming • Which features? Mileage, Condition, Car brand, Year of make, Regulations, … • Which algorithm? Gradient Boosted, Nearest Neighbors, SVM, Bayesian Regression, LGBM, … • Which parameters? Parameter 1, Parameter 2, Parameter 3, Parameter 4, … • Example: Mileage, Car brand, Year of make + Gradient Boosted (Criterion, Loss, Min Samples Split, Min Samples Leaf, Others) → Model
  24. Model Creation Is Typically Time-Consuming • Which features? Mileage, Condition, Car brand, Year of make, Regulations, … • Which algorithm? Gradient Boosted, Nearest Neighbors, SVM, Bayesian Regression, LGBM, … • Which parameters? Gradient Boosted: Criterion, Loss, Min Samples Split, Min Samples Leaf, Others; Nearest Neighbors: N Neighbors, Weights, Metric, P, Others • Iterate: Mileage, Car brand, Year of make + Gradient Boosted → Model; Car brand, Year of make, Condition + Nearest Neighbors → Model
  25. Model Creation Is Typically Time-Consuming • Which features? • Which algorithm? • Which parameters? • Iterate
  26. Machine Learning Complexity. Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
  27. Model Selection & Hyperparameter Tuning • Dataset → Model Training Infrastructure • Training Algorithm 1: Hyperparameter Values (config 1) → Model 1; Hyperparameter Values (config 2) → Model 2; Hyperparameter Values (config 3) → Model 3 • Training Algorithm 2: Hyperparameter Values (config 4) → Model 4
  28. Introducing Automated Machine Learning: Accessible & Faster • Inputs: Dataset, Optimization Metric, Constraints (Time/Cost) → Automated ML → ML Model
  29. Automated ML Accelerates Model Development • Input: Enter data, Define goals, Apply constraints • Intelligently test multiple models in parallel • Output: Optimized model
  30. Automated ML Customer Testimonials • Press coverage from public preview: CNET, VentureBeat, PRNewswire • “I quite like your AutoML function. It gives me good results compared to other libraries I tested before (tpot and auto-sklearn) that I believe was only looking at scores and often gave me models that over-trained my data. And of course the model from your suggested code is better.” - Big oil company • “I will start with AutoML and use the algorithm that AutoML recommends to further tune the model” - Data Scientist • “I actually enjoy being able to use AutoML in a Jupyter notebook. The DataRobot interface was nice for non-experts, but for someone like me, it felt a bit basic.” - Data Scientist
  31. Automated ML Capabilities
  32. Automated ML Capabilities • Based on Microsoft Research • Brain trained with several million experiments • Collaborative filtering and Bayesian optimization • Privacy preserving: No need to “see” the data
  33. Automated ML Capabilities • ML Scenarios: Classification & Regression, Forecasting • Integration: Azure Machine Learning, Azure Notebooks, Jupyter Notebooks • Data Type: Numeric, Text • Languages: Python SDK for deployment and hosting for inference • Training Compute: Local Machine, Remote Azure DSVM (Linux), Azure Batch AI, Databricks • Transparency: View run history, model metrics • Scale: Faster model training using multiple cores and parallel experiments
  34. Feature Engineering • Dropping high cardinality or no variance features • Missing value imputation • Generating additional features • Transformations and encodings
  35. Model Explainability • Feature importance as part of training • Local feature importance for a given sample
  36. Q & A
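
The "Model Selection & Hyperparameter Tuning" slide (27) trains several algorithm and hyperparameter combinations on one dataset and keeps the best model. A minimal sketch of that loop, using the car-price question from slide 22, might look like the following. Everything here is an assumption made for illustration: the toy prices are invented, the two algorithms (k-nearest neighbors and a one-feature least-squares line) stand in for the ones named on the slides, and the function names are hypothetical. It is not how Automated ML works internally.

```python
import math

# Hypothetical toy data, invented for illustration only:
# (mileage in thousands of km, year of make, price).
TRAIN = [
    (20, 2018, 21000), (45, 2016, 16500), (80, 2014, 11000),
    (120, 2011, 7000), (150, 2009, 5000), (30, 2017, 19000),
    (60, 2015, 14000), (100, 2012, 8500),
]
VALID = [(50, 2016, 15500), (90, 2013, 9800), (25, 2018, 20500)]


def knn_predict(train, x, k):
    """Algorithm 1: average the prices of the k nearest training rows.
    Features are left unscaled to keep the sketch short."""
    nearest = sorted(train, key=lambda row: math.dist(row[:2], x))[:k]
    return sum(row[2] for row in nearest) / k


def fit_line(train, feature):
    """Algorithm 2: least-squares line on a single feature
    (0 = mileage, 1 = year of make). Returns a predictor function."""
    xs = [row[feature] for row in train]
    ys = [row[2] for row in train]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x[feature]


def build_candidates(train):
    """Configs 1 to 3 vary k for algorithm 1; config 4 uses algorithm 2."""
    candidates = {}
    for k in (1, 2, 3):
        candidates[f"knn(k={k})"] = lambda x, k=k: knn_predict(train, x, k)
    candidates["line(feature=mileage)"] = fit_line(train, 0)
    return candidates


def rmse(predict, rows):
    """Root mean squared error of a predictor on held-out rows."""
    squared = [(predict(row[:2]) - row[2]) ** 2 for row in rows]
    return math.sqrt(sum(squared) / len(squared))


def select_model(train, valid):
    """Score every candidate on the validation rows and keep the best."""
    scores = {name: rmse(p, valid) for name, p in build_candidates(train).items()}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = select_model(TRAIN, VALID)
    for name, score in sorted(scores.items(), key=lambda item: item[1]):
        print(f"{name}: RMSE {score:.0f}")
    print("selected:", best)
```

Each added algorithm or hyperparameter value multiplies the number of candidates to train, which is the cost the "Model Creation Is Typically Time-Consuming" slides point at.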

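The "Feature Engineering" slide (34) lists dropping high cardinality or no variance features and missing value imputation. A rough sketch of those two steps on a small table could look like this. The column names, the sample rows, the 0.9 cardinality threshold, and the choice of mean imputation are all assumptions made for illustration, not the rules Automated ML actually applies.

```python
def drop_uninformative(rows, max_cardinality_ratio=0.9):
    """Drop columns with a single distinct value (no variance) and text
    columns with almost one distinct value per row (high cardinality)."""
    n = len(rows)
    keep = []
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] is not None]
        distinct = set(values)
        if len(distinct) <= 1:
            continue  # no variance: the column cannot separate rows
        is_text = all(isinstance(v, str) for v in values)
        if is_text and len(distinct) / n > max_cardinality_ratio:
            continue  # high cardinality: behaves like an ID column
        keep.append(col)
    return [{c: r[c] for c in keep} for r in rows]


def impute_mean(rows):
    """Replace None in numeric columns with the mean of the present values."""
    result = [dict(r) for r in rows]
    for col in rows[0]:
        present = [r[col] for r in rows if isinstance(r[col], (int, float))]
        if not present:
            continue  # nothing numeric to average in this column
        mean = sum(present) / len(present)
        for r in result:
            if r[col] is None:
                r[col] = mean
    return result


# Hypothetical sample rows, invented for illustration only.
ROWS = [
    {"vin": "A1", "brand": "Toyota", "country": "TW", "mileage": 20.0},
    {"vin": "B2", "brand": "Honda", "country": "TW", "mileage": None},
    {"vin": "C3", "brand": "Toyota", "country": "TW", "mileage": 80.0},
    {"vin": "D4", "brand": "Honda", "country": "TW", "mileage": 50.0},
]

cleaned = impute_mean(drop_uninformative(ROWS))
print(cleaned[1])  # {'brand': 'Honda', 'mileage': 50.0}
```

Here `vin` is dropped as high cardinality, `country` is dropped as constant, and the missing mileage is filled with the mean of the three known values.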