Crossing the Analytics Chasm and Getting the Models You Developed Deployed

  1. Crossing the Analytic Chasm and Getting the Models You Develop Deployed. Robert L. Grossman, University of Chicago and Analytic Strategy Partners LLC. August 20, 2018, CMI Workshop, KDD 2018, London. Why It Is Important to Understand the Differences Between Deploying Analytic Models and Developing Analytic Models.
  2. 1. Overview of Developing vs Deploying Analytic Models* *This section is adapted from: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
  3. The Analytic Diamond*: analytic strategy, governance, security & compliance; analytic modeling; analytic operations; and analytic infrastructure. *Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
  4. The Analytic Chasm. There are platforms and tools for managing and processing big data (e.g. Hadoop and Spark) and for building analytic models (e.g. R and SAS), but fewer options for deploying analytics into operations or for embedding analytics into products and services. The chasm separates the data scientists developing analytic models & algorithms from the enterprise IT deploying analytics into products, services and operations, with the analytic infrastructure supporting both.
  5. Effort over time: get the data, set up the infrastructure, put in place the compliance and security, etc.; analyze & model the data (build a model); then deploy the solution with the model in a manner that has an impact on the organization.
  6. The Five Main Approaches (E3RW): 1. Embed analytics in databases. 2. Export models and deploy them by importing into Scoring Engines. 3. Encapsulate models using containers (and virtual machines). 4. Read a table of parameters. 5. Wrap algorithm code or an analytic system (and perhaps create a service).
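To make approach 4 concrete, here is a minimal Python sketch of a deployed scorer that reads a table of parameters instead of requiring a code push; the file name, column names, and features are hypothetical:

```python
import csv

def load_parameters(path):
    """Read a table of model parameters: one (feature, coefficient) pair per row."""
    with open(path, newline="") as f:
        return {row["feature"]: float(row["coefficient"]) for row in csv.DictReader(f)}

def score(record, params, intercept=0.0):
    """Linear score: dot product of the record's features with the parameter table."""
    return intercept + sum(w * record.get(feature, 0.0) for feature, w in params.items())

# Updating the model means replacing the parameter file; no code is redeployed.
params = load_parameters("model_parameters.csv")   # hypothetical file
print(score({"age": 42.0, "balance": 1200.0}, params))
```

The same pattern works for any model whose structure is fixed and whose parameters change between releases.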
  7. [Decision tree for selecting a deployment approach.] The questions in the tree: Can you push code to deploy models? Do you have a single model with different parameters? Do you require workflows or custom models? Does a database have the analytic functionality required*? Are there stringent enterprise controls on models? The leaves of the tree: embed the analytics in a database; code the model & read parameters; code the models and encapsulate with containers or VMs; use a PMML analytic engine; use a PFA analytic engine; use an analytic engine. *Assumes that a UDF is pushed to the database; otherwise embedding analytics into databases is on the other side of the tree.
  8. 2. Scoring Engines Typical use cases: regulated environments, healthtech, high availability applications, applications requiring long term reproducibility, etc.
  9. Life cycle of a model*. ModelDev (data scientists): select the analytic problem & approach; exploratory data analysis; get and clean the data; build the model in the dev/modeling environment. Deployment and AnalyticOps (enterprise IT): initial deployment; scale up the deployment; use a champion-challenger methodology with performance data to improve the model; retire the model and deploy the improved model. *Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
  10. Differences Between the Modeling and Deployment Environments • Typically modelers use specialized languages such as SAS, SPSS or R. • Usually, developers responsible for products and services use languages such as Java, Python, C++, etc. • This can result in significant delays moving the model from the modeling environment to the deployment environment.
  11. The Analytic Diamond, viewed from deployment: a model* producer (analytic models & workflows) exports the model, and a model* consumer, also known as a scoring engine or analytic engine, imports it into analytic operations, running on the analytic infrastructure, so that the analytics reach products, services, and internal operations. How quick are updates of: model parameters? new features? new pre- & post-processing? *Model here also includes analytic workflows.
  12. What is a Scoring Engine? • A scoring engine is a component that is integrated into products or enterprise IT and that deploys analytic models in operational workflows for products and services. • A Model Interchange Format is a format that supports the exporting of a model by one application and the importing of that model by another application. • Model Interchange Formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats. • Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
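As a rough sketch of the pattern, assuming an in-house JSON interchange format with hypothetical field names (not PMML or PFA): the scoring engine is integrated into the product once and interprets whatever model description it is handed, so swapping models is just re-reading a file.

```python
import json

class ScoringEngine:
    """Imports a model description from an interchange file and interprets it."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.reload()

    def reload(self):
        # Deploying a new model only requires writing a new interchange file.
        with open(self.model_path) as f:
            self.model = json.load(f)

    def score(self, record):
        return self._eval(self.model["tree"], record)

    def _eval(self, node, record):
        # Leaf nodes carry a score; internal nodes test one feature against a threshold.
        if "score" in node:
            return node["score"]
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        return self._eval(node[branch], record)

# Hypothetical interchange document for a two-leaf decision tree:
# {"tree": {"feature": "balance", "threshold": 500.0,
#           "left": {"score": 0.8}, "right": {"score": 0.1}}}
```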
  13. A Brief History of the DMG (timeline): DMG founded; PMML v0.7 released; CCSR founded; PMML v0.9, v1.0, v1.1, v2.0, v2.1, v3.0, v3.1, v3.2, v4.0, v4.1 and v4.2.1 released; Portable Format for Analytics (PFA) introduced; PMML v4.3 released; support begins; membership drive.
  14. 3. Case Study: Deploying Analytics Using a Scoring Engine
  15. Alice, Data Scientist: "I write all my models in R, why don't you do the same?" Bob, Data Scientist: "I write all my models in scikit-learn, why don't you do the same?" Joe, IT: "Would you mind writing all your models in Java?"
  16. Deploying analytic models with PMML & PFA: a model* producer exports the model and a model* consumer imports it. • PMML is an XML language for describing analytic models. • PFA is a JSON language for describing analytic models and workflows. • Arbitrary models and workflows can be expressed in PFA. The not-for-profit Data Mining Group (DMG) develops the PMML and PFA standards. *Model here also includes analytic workflows.
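For a sense of what PFA looks like on the wire, here is a minimal sketch built as a Python dict and serialized to JSON. It follows the basic input/output/action shape of the simplest PFA examples (an engine that adds 100 to each input); the file name is illustrative, and real models are of course far richer:

```python
import json

# Minimal PFA-style document: a scoring engine that adds 100 to each input value.
pfa_doc = {
    "input": "double",                     # type of each incoming record
    "output": "double",                    # type of each score
    "action": [{"+": ["input", 100]}],     # expression applied to every input
}

with open("add_100.pfa", "w") as f:        # hypothetical file name
    json.dump(pfa_doc, f, indent=2)
```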
  17. How a startup used PFA-compliant scoring engines: • A 20+ person data science group develops models in R, Python, scikit-learn and MATLAB. • All the data scientists export their models in the Portable Format for Analytics (PFA). • The company's product imports models in PFA and runs them on their customers' data as required. The company's data scientists build models and export PFA; the company's services embed an analytic engine that can interpret PFA, turning widget records into widget scores.
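In production the company's services would embed a full PFA engine (implementations such as Hadrian for the JVM and Titus for Python have existed). Purely to illustrate the import-and-execute flow, here is a toy interpreter that handles only the tiny PFA subset used in the previous sketch:

```python
import json

def load_engine(path):
    """Load a PFA document and return a callable scoring function.

    Toy interpreter: supports only a single arithmetic '+' expression over the
    input, as in the add-100 example above; a real deployment would embed a
    complete PFA engine instead.
    """
    with open(path) as f:
        doc = json.load(f)

    def evaluate(expr, value):
        if expr == "input":
            return value
        if isinstance(expr, (int, float)):
            return expr
        if isinstance(expr, dict) and "+" in expr:
            a, b = expr["+"]
            return evaluate(a, value) + evaluate(b, value)
        raise ValueError("unsupported PFA expression: %r" % (expr,))

    return lambda value: evaluate(doc["action"][-1], value)

score = load_engine("add_100.pfa")   # the document written in the previous sketch
print(score(3.0))                    # -> 103.0
```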
  18. 4. Case Study: Scaling Bioinformatics Pipelines for the Genomic Data Commons by Encapsulating Analytics in Docker Containers
  19. NCI Genomic Data Commons* • The GDC was launched in 2016 with over 4 PB of data. • Used by 1,500–3,000+ users per day and over 100,000 researchers each year. • Based upon an open source software stack that can be used to build other data commons. *Source: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
  20. AnalyticOps for the Genomic Data Commons. TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites. 2.5+ PB of cancer genomics data plus the Bionimbus data commons technology running multiple community-developed variant-calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months.
  21. GDC Pipelines Are Complex and Are Mostly Written by Others. This is an example of one of the pipelines run by the GDC. Source: Center for Data Intensive Science, University of Chicago.
  22. Computations for a Single Genome Can Take Over a Week Source: Center for Data Intensive Science, University of Chicago.
  23. DevOps • Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code"). • The requirement to reduce the time to deploy code created tools for continuous integration and testing.
  24. ModelDev AnalyticOps • Use virtualization/containers, infrastructure automation and scale-out to support large-scale analytics. • Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
  25. GDC Pipeline Automation System (GPAS) • Bioinformatics pipelines are written using the Common Workflow Language (CWL). • CWL uses DAGs to describe workflows, with each node a program. • We developed a pipeline automation system (GPAS) to execute CWL pipelines with the GDC. • We used Docker containers and Kubernetes for automating the software deployment and simplifying the scale-out. • Our main work was the development of the pipelines and automating the processing of submitted data, QC, exception handling and monitoring. Source: Center for Data Intensive Science, University of Chicago.
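As a rough Python sketch of the pattern GPAS automates (not the actual GPAS code): a pipeline is a DAG whose nodes are containerized programs, run in dependency order. The image names, commands, and two-step pipeline below are illustrative assumptions.

```python
import subprocess
from graphlib import TopologicalSorter   # Python 3.9+

# Hypothetical two-step pipeline: align reads, then call variants.
steps = {
    "align":    {"image": "example/aligner:1.0",        "cmd": ["align.sh"], "needs": []},
    "variants": {"image": "example/variant-caller:1.0", "cmd": ["call.sh"],  "needs": ["align"]},
}

def run_pipeline(steps):
    """Run each step in its own container, respecting the DAG's dependencies."""
    order = TopologicalSorter({name: set(s["needs"]) for name, s in steps.items()})
    for name in order.static_order():
        step = steps[name]
        print("running step:", name)
        subprocess.run(["docker", "run", "--rm", step["image"], *step["cmd"]], check=True)

# run_pipeline(steps)   # requires Docker and the (hypothetical) images above
```

A production system like GPAS layers data submission, QC, exception handling and monitoring on top of this basic execute-the-DAG loop, and hands scheduling and scale-out to Kubernetes rather than a local loop.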
  26. Ten Factors Affecting AnalyticOps • Model quality (confusion matrix) • Data quality (six dimensions) • Lack of ground truth • Software errors • Sufficient monitoring of workflows • Scheduling inefficiencies • The ability to accurately predict problem jobs • Bottlenecks, stragglers, hot spots, etc. • Analytic configuration problems • System failures • Human errors
  27. New Effort: Portable Format for Biomedical Data (PFB) • Based upon our experience with the GDC and with data commons for other biomedical applications, we are developing a portable format so we can version and serialize biomedical data. • This includes the serialization of the data dictionary, pointers to third-party ontologies, the data model, and all of the data, except for "large objects" such as BAM files and image files. • We track "large files" as digital objects with immutable GUIDs. • In practice, the large objects are often 1000x larger than the rest of the data. • Talk to us if you would like to get involved.
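Purely as an illustration of the idea, and not the actual PFB schema (which was still being designed at the time of this talk), a serialized package might bundle the data dictionary, ontology pointers, the data model, and the data, while large objects are referenced only by immutable GUIDs:

```python
import json
import uuid

# Illustrative only: a self-describing package in the spirit of PFB.
package = {
    "data_dictionary": {"case": ["case_id", "primary_site", "disease_type"]},
    "ontology_refs": {"primary_site": "http://example.org/ontologies/anatomy"},  # hypothetical pointer
    "data_model": {"case": {"links": ["sample"]}},
    "data": {"case": [{"case_id": "C-0001", "primary_site": "lung", "disease_type": "LUAD"}]},
    # Large objects (e.g. BAM or image files) are not serialized inline; they are
    # tracked as digital objects with immutable GUIDs.
    "large_objects": [{"guid": str(uuid.uuid4()), "type": "BAM", "size_bytes": 123456789}],
}

print(json.dumps(package, indent=2))
```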
  28. 5. Summary
  29. E3RW Recap. Approaches (E3RW): 1. Embed analytics in databases. 2. Export models and deploy them by importing into Scoring Engines. 3. Encapsulate models using containers (and virtual machines). 4. Read a table of parameters. 5. Wrap algorithm code or an analytic system (and perhaps create a service). Techniques: • Use languages for analytics, such as PMML and PFA, and analytic engines. • Use languages for workflows, such as CWL, and workflow engines. • Use containers and container-orchestration systems, such as Docker and Kubernetes, for automating software deployment and scale-out.
  30. Five Best Practices When Deploying Models 1. Mature analytic organizations have an environment to automate the testing and deployment of analytic models. 2. Don't think just about deploying analytic models; make sure that you have a process for deploying analytic workflows. 3. Focus not just on reducing Type 1 and Type 2 errors, but also data input errors, data quality errors, software errors, systems errors and human errors. People only remember that the model didn't work, not whose fault it was. 4. Track the value obtained by the deployed analytic model, even if it is not your explicit responsibility. 5. It is often easier to increase the value of a deployed model by improving the pre- and post-processing than by chasing smaller improvements in the lift curve.
  31. Five Common Mistakes When Deploying Models 1. Not understanding all the subtle differences between the data supplied to train the model and the actual run-time data the model sees. 2. Thinking that the features are fixed and all that you will need to do is update the parameters. 3. Thinking the model is done and not realizing how much work is required to keep all of the required pre- and post-processing up to date. 4. Not checking in production to see if the inputs to the models drift slowly over time. 5. Not checking that the model will keep running despite missing values, garbage values, etc. (even values that should never be missing in the first place).
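A minimal sketch of the production checks that mistakes 4 and 5 call for; the drift statistic, thresholds, and feature names are illustrative choices, not recommendations from the talk:

```python
import math

def drift_alert(train_mean, train_std, live_values, threshold=3.0):
    """Flag an input feature whose live mean has drifted away from its training mean."""
    live_mean = sum(live_values) / len(live_values)
    z = abs(live_mean - train_mean) / (train_std / math.sqrt(len(live_values)))
    return z > threshold

def safe_value(record, feature, default):
    """Guard against missing or garbage values before they reach the model."""
    value = record.get(feature)
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    return value

print(drift_alert(train_mean=41.7, train_std=12.3, live_values=[55.0, 61.2, 58.4, 60.1]))
print(safe_value({"age": float("nan")}, "age", default=41.7))
```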
  32. Summary 1. Deploying analytic models is a core technical competency. 2. A discipline of AnalyticOps is emerging that defines best practices for running analytics at scale. 3. Building an analytic model is just the first step in its life cycle, which includes deployment, integration into a value chain, improvement, and replacement. 4. The Portable Format for Analytics (PFA) is a model interchange format for building analytic models and workflows in one environment and deploying them in another. Analytic engines can be used to execute PFA. 5. Analytic containers are a good way of encapsulating everything needed to deploy analytic models and analytic workflows into production.
  33. Questions? @bobgrossman