
Software engineering practices for the data science and machine learning lifecycle



With the advent of newer frameworks and toolkits, data scientists are more productive than ever and are proving indispensable to enterprises. Typical organizations have large teams of data scientists who build key analytics assets that are used daily and are an integral part of live transactions. However, the current state of the industry also introduces considerable chaos and complexity. Many of the packages data scientists rely on are open source, and even where they are well curated, there is a growing tendency to pick cutting-edge or unstable packages and frameworks to accelerate analytics. Different data scientists may use different versions of runtimes, different Python or R versions, or even different versions of the same packages. Data scientists predominantly work on their laptops, which makes it difficult to reproduce their environments for use by others. Since data science is now a team sport across multiple personas, involving non-practitioners, traditional application developers, execs, and IT operators, how does an enterprise create a platform for productive cross-role collaboration?

Enterprises need a very reliable and repeatable process, especially when it results in something that affects their production environments. They also require a well-managed approach that enables the graduation of an asset from development through a testing and staging process to production. Given the pace of business today, the process needs to be agile and flexible as well, even enabling an easy path to reversing a change. Compliance and audit processes require clear lineage and history, as well as approval chains.

In the traditional software engineering world, this lifecycle is well understood and best practices have been followed for ages. But what does it mean when you have non-programmers, or users who are not trained in software engineering philosophies, or who perceive all of this as "big process" roadblocks in their daily work? How do we engage them in a productive manner and yet support enterprise requirements for reliability, tracking, and a clear continuous integration and delivery practice? In this session, the presenters will share techniques based on their user research, real-life customer interviews, and productized best practices. The presenters also invite the audience to share their stories and best practices to make this a lively conversation.

Speaker
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM



Software engineering practices for the data science and machine learning lifecycle

  1. Applying Software Engineering Practices for the Data Science & ML Lifecycle
     DataWorks Summit, San Jose 2018
     Sriram Srinivasan, Architect - IBM Data Science & Machine Learning, Cloud Private for Data
     IBM Data Science Experience
     © 2018 IBM Corporation
  2. Overview
      Enterprises, as usual, want a quick return on investments in data science
      But with a shrinking dev -> prod cycle, applying new techniques and making course corrections on a continuous basis is the norm
      Data science & machine learning are increasingly cross-team endeavors
      Data engineers, business analysts, DBAs and data stewards are frequently involved in the lifecycle
      The "Cloud" has been a great influence
      Economies of scale with regard to infrastructure cost and quick re-assignment of resources are expected
      Automation, APIs, repeatability, reliability and elasticity are essential for "operations"
      Data science & ML need to exhibit the same maturity as other enterprise-class apps:
      Compliance & regulations are still a critical mandate for large enterprises to adhere to
      Security, auditability and governance need to be in place from the start, not an afterthought
  3. Data Scientist Concerns
      Where is the data I need to drive business insights?
      I don't want to do all the plumbing: connecting to databases, Hadoop, etc.
      How do I collaborate and share my work with others?
      What visualization techniques exist to tell my story?
      How do I bring my familiar R/Python libraries to this new data science platform?
      How do I use the latest libraries/techniques, or newer versions?
      How do I procure compute resources for my experimentation, including specialized compute such as GPUs?
      How are my machine learning models performing, and how do I improve them?
      I have this machine learning model; how do I help deploy it in production?
  4. Data Science Experience
     Access to libraries & tools: an ever-growing list
      Multiple programming languages: Python, R, Scala...
      Modern data scientists are programmers/software developers too!
      Build your favorite libraries or experiment with new ones
      Modularization via packages & dependency management are problems, just as with any software development
      Publish apps and expose APIs; share & collaborate
      Work with a variety of data sources and technologies, easily
     (Diagram: Machine Learning Environments, Deep Learning Environments, SPSS Modeler, ...)
  5. Challenges for the Enterprise
      Ensure secure data access & auditability, for governance and compliance
      Control and curate access to data and to all open source libraries used
      Explainability and reproducibility of machine learning activities
      Improve trust in analytics and predictions
      Efficient collaboration and versioning of all source, sample data and models
      Easy teaming with accountability
      Establish continuous integration practices, just as with any enterprise software
      Agility in delivery and problem resolution in production
      Publish/share and identify provenance/lineage with confidence
      Visibility and access control
      Effective resource utilization and the ability to scale out on demand
      Guarantee SLAs for production work; balance resources among different data scientists' and machine learning practitioners' workloads
      Goal: Operationalize Data Science!
  6. 5 tenets for operationalizing data science
     • Managed: Analytics-Ready Data
     • Trusted: Quality, Provenance and Explainability
     • Resilient: At Scale & Always On
     • Measurable: Monitor + Measure
     • Evolution: Deliver & Improve
  7. Where's my data?
     Managed: Analytics-Ready Data
     • access to data, with techniques to track & deal with sensitive content
     • data virtualization
     • automatable pipelines for data preparation and transformations
     Need: An Enterprise Catalog & Data Integration capabilities
  8. How can I convince you to use this model? How was the model built?
     Trusted: Quality, Provenance and Explainability
     • provenance of data used to train & test
     • lineage of the model: transforms, features & labels
     • model explainability: algorithm, size of data, compute resources used to train & test, evaluation thresholds, repeatability
     Need: An enterprise Catalog for Analytics & Model assets
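To make that "Need" concrete, here is a minimal sketch of the kind of metadata such a catalog entry might record per model; the field names and values are illustrative assumptions, not a fixed catalog schema.

```python
# Sketch: a per-model catalog record covering provenance, lineage and
# explainability. All field names here are hypothetical examples.
model_catalog_entry = {
    "name": "churn-predictor",
    "version": "v2",
    "algorithm": "GradientBoostingClassifier",            # explainability
    "training_data": {                                    # provenance
        "source": "db2://sales.CUSTOMERS",                # illustrative URI
        "rows": 1_200_000,
    },
    "features": ["tenure", "monthly_charges", "contract_type"],
    "label": "churned",
    "transforms": ["one-hot encode contract_type"],       # lineage
    "evaluation": {"metric": "auc", "value": 0.91, "threshold": 0.85},
    "compute": {"cpus": 8, "gpus": 0, "train_minutes": 42},
}
```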
  9. Dependable for your business
     Resilient: At Scale & Always On
     • reliable & performant for (re-)training
     • highly available, low-latency model serving in real time, even with sophisticated data prep
     • outage-free model/version upgrades in production
     ML infused in real-time, critical business processes
     Must have: A platform for elasticity, reliability & load-balancing
 10. Is the model still good enough?
     Measurable: Monitor + Measure
     • latency metrics for real-time scoring
     • frequent accuracy evaluations with thresholds
     • health monitoring for model decay
     Desired: Continuous model evaluations & instrumentation
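As an illustration of such a threshold-based evaluation, here is a minimal sketch of a decay check that a scheduled job could run; the threshold value, model path and the recent labeled batch passed in are assumptions, not part of any product API.

```python
# Sketch: periodic model-decay check against a threshold.
import joblib
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # illustrative; tune per model

def model_still_healthy(model_path, X_recent, y_recent):
    """Return True if the model still meets its accuracy threshold
    on a recent batch of labeled data."""
    model = joblib.load(model_path)  # model serialized at training time
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    print(f"accuracy on recent labeled data: {accuracy:.3f}")
    return accuracy >= ACCURACY_THRESHOLD

# A scheduler could run this frequently and trigger re-training
# (or an alert) whenever it returns False.
```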
 11. Growth & Maturity
     Evolution: Deliver & Improve
     • versioning: champion/challenger, experimentation and hyper-parameterization
     • process efficiencies: automated re-training, auto deployments (with curation & approvals)
     Must-have: Delta deployments & outage-free upgrades
 12. A git- and Docker/Kubernetes-based approach, from lessons learnt during the implementation of:
     IBM Data Science Experience Local & Desktop: https://www.ibm.com/products/data-science-experience
     IBM Cloud Private for Data: http://ibm.biz/data4ai
 13. Part 1: Establish a way to organize models, scripts & all other assets
      A "Data Science Project":
     o is just a folder of assets grouped together
     o contains "models" (say .pickle/.joblib or R objects, with metadata)
     o contains scripts for data wrangling and transformations, used for training & testing, evaluations, and batch scoring
     o contains interactive notebooks and apps (R Shiny etc.)
     o contains sample data sets & references to remote data sources
     o perhaps even your own home-grown libraries & modules
     o is a git repository
       • why? Easy to share, version & publish across different teams or users with different roles
       • track the history of commit changes, set up approval practices & versions
     Familiar concept: projects exist in most IDEs & tools
     Open & portable: even works on your laptop
     Sharable & versionable: courtesy of .git
      Use Projects with all tools/IDEs and services: one container for all artifacts & for tracking dependencies
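A minimal sketch of what saving a trained model into such a project might look like, assuming the project folder is already a git repository; the paths, metadata fields and commit flow below are illustrative, not a mandated layout.

```python
# Sketch: persist a model plus metadata into a git-backed project folder.
import json
import subprocess
from pathlib import Path
import joblib

def save_to_project(model, project_dir, name, version):
    project = Path(project_dir)
    (project / "models").mkdir(parents=True, exist_ok=True)
    joblib.dump(model, project / "models" / f"{name}.joblib")
    # Record metadata alongside the serialized model for explainability.
    meta = {"name": name, "version": version,
            "algorithm": type(model).__name__}
    (project / "models" / f"{name}.json").write_text(json.dumps(meta, indent=2))
    # The project folder is itself a git repository, so sharing and
    # versioning are ordinary git operations.
    subprocess.run(["git", "-C", project_dir, "add", "models"], check=True)
    subprocess.run(["git", "-C", project_dir, "commit",
                    "-m", f"Add {name} {version}"], check=True)
```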
 14. Part 2: Provide reproducible environments
      Enabled by Docker & Kubernetes
     o A Docker image represents a "Runtime" and essentially enables repeatability by other users
       • For example, a Python 2.7 runtime with Anaconda 4.x, or an R 3.4.3 runtime environment
       • Can also include IDEs or tooling, such as Jupyter/JupyterLab, Zeppelin, or RStudio, exposed via an http address
       • Allows for many different types and versions of packages to be surfaced; compatibility of package versions can be maintained, avoiding typical package-conflict issues
     o Docker containers with Project volumes provide reproducible compute environments
     o With Kubernetes: orchestrate & scale out compute, and scale for users; automate via Kube cron jobs
     (Diagram: a JupyterLab container reached via port forwarding + auth, with the project .git repo mounted as a volume, or just git cloned & pushed when work is done; and a scoring service for a specific version of a model, "Churn predictor v2" scoring-server pods behind a Kube svc load balancer and auth proxy, with replicas created for scale/load-balancing)
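One way to feed such a reproducible image is to pin the exact package set of the runtime that produced a result; the file name and approach in this sketch are an assumption, not a DSX convention.

```python
# Sketch: snapshot the current Python environment so a Docker build can
# reproduce it exactly for other users or Kubernetes pods.
import subprocess
import sys
from pathlib import Path

def freeze_environment(project_dir):
    # 'pip freeze' pins every installed package to its exact version,
    # which is what makes the rebuilt runtime repeatable.
    frozen = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                            capture_output=True, text=True, check=True).stdout
    Path(project_dir, "requirements.txt").write_text(frozen)
    # A Dockerfile can then 'pip install -r requirements.txt' to rebuild
    # the same runtime, avoiding package-version conflicts across users.
```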
 15. Part 3: A dev-ops process & an Enterprise Catalog
      Establish a "release"-to-production mechanism
     o git tag the state of a Data Science project to identify a version that is a release candidate
     o Take that released project tag through a conventional dev -> stage -> production pipeline
     o An update to the release simply translates to a "pull" from a new git tag
     o Stand up Docker containers (in Kubernetes deployments + svc) to deploy data science artifacts and expose web services, or simply use Kube Jobs to run scripts as needed
      A catalog for metadata and lineage
     o All asset metadata is recorded in this repository, including all data assets in the enterprise
       • Enables tracking of relationships between assets, including references to data sources such as relational tables/views or HDFS files
       • Manage projects and versions in development, and releases in production
       • Track APIs/URL end-points and apps being exposed (and their consumers)
     o Establish policies for governance & compliance (apart from just access control)
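A small sketch of the git-tag step of this release mechanism, assuming a remote named origin; the tag naming scheme is illustrative.

```python
# Sketch: mark the current state of a project as a release candidate.
import subprocess

def tag_release(project_dir, version):
    tag = f"release-{version}"  # hypothetical naming scheme
    subprocess.run(["git", "-C", project_dir, "tag", "-a", tag,
                    "-m", f"Release candidate {version}"], check=True)
    subprocess.run(["git", "-C", project_dir, "push", "origin", tag],
                   check=True)
    return tag

# A staging or production cluster deploys by checking out this tag; an
# update is just a pull of a newer tag, and checking out the previous
# tag is the rollback path.
```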
 16. Summary: The governed data science lifecycle is a team sport
     (Diagram: roles spanning development & prototyping, production, and admin/ops)
     • CDO (Data Steward), Organizes: the Data & Analytics Asset Enterprise Catalog; lineage; governance of data & models; audits & policies; enables model explainability
     • Data Engineer, Collects: builds data lakes and warehouses; gets data ready for analytics
     • Data Scientist, Analyzes: finds, explores & understands data; collects, preps & persists data; shapes data & extracts features for ML; trains models; experiments; deploys & monitors accuracy
     • Exec: states the problem or target opportunity; sets goals & measures results; consumes trusted predictions
     • App Developer: infuses ML into real-time apps & business processes
     • Production: secure & governed; monitored & instrumented; high throughput with load balancing & scale; reliable deployments with outage-free upgrades; auto-retrain & upgrade; refine features
     Lineage/relationships are recorded in the Governance Catalog
 17. Backup: IBM Cloud Private for Data & Data Science Experience Local (June 2018, © 2018 IBM Corporation)
 18. Build & Collaborate
     (Screenshots: collaborating within git-backed projects; sampling data; understanding data distributions & profiles)
 19. Understand, Analyze, Train ML Models
     • Jupyter notebook environments: Python 2.7/3.5 with Anaconda, Scala, R
     • RStudio environment with > 300 packages, R Markdown, R Shiny
     • Zeppelin notebook with Python 2.7 and Anaconda
     The data scientist analyzes, experiments, trains models and evaluates model accuracy: exploring & understanding data and distributions, ML feature engineering, visualizations, notebooks, training models, and building dashboards & apps. Models, scoring services & apps are published to the Catalog; features and lineage are recorded in the Governance Catalog.
 20. Self-service compute environments
     • Servers/IDEs, with the lifecycle easily controlled by each data scientist
     • Self-serve reservations of compute resources
     • Worker compute resources for batch jobs, run on demand or on a schedule
     • Environments are essentially Kubernetes pods, with high availability & compute scale-out baked in (load-balancing/auto-scaling is being planned for a future sprint)
     • On-demand or leased compute
 21. Extend: roll your own environments
     • Add libs/packages to the existing Jupyter, RStudio, Zeppelin IDE environments, or introduce new Job "Worker" environments
       https://content-dsxlocal.mybluemix.net/docs/content/local/images.html
     • DSX Local also provides a Docker Registry (replicated for HA). These images are managed by DSX and used to help build out custom environments
     • Plug-n-play extensibility; reproducibility, courtesy of Docker images
 22. Automate
     Jobs trigger on demand or on a schedule, such as for model evaluations, batch scoring, or even continuous (re-)training.
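As a sketch of what such a job's payload could look like, here is a minimal batch-scoring script a scheduler might invoke; the command-line arguments and the assumption that the CSV columns match the model's training features are illustrative.

```python
# Sketch: a batch-scoring script suited to an on-demand or scheduled job.
import sys
import joblib
import pandas as pd

def batch_score(model_path, input_csv, output_csv):
    model = joblib.load(model_path)
    batch = pd.read_csv(input_csv)   # columns assumed to match training features
    scored = batch.copy()
    scored["prediction"] = model.predict(batch)
    scored.to_csv(output_csv, index=False)

if __name__ == "__main__":
    # e.g. a Kube cron job runs: python batch_score.py model.joblib in.csv out.csv
    batch_score(sys.argv[1], sys.argv[2], sys.argv[3])
```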
 23. Deploy, monitor and manage
     • Monitor models through a dashboard: model versioning, evaluation history
     • Publish versions of models, supporting the dev/stage/production paradigm
     • Monitor scalability through the cluster dashboard
     • Adapt scalability by redistributing compute/memory/disk resources
 24. Deployment manager: Project Releases
     • Project releases are deployed & (delta-)updatable, tracking the current git tag
     • Develop in one DSX Local instance & deploy/manage in another (or the same one)
     • Easy support for hybrid use cases: develop & train on-prem, deploy in the cloud (or vice versa)
 25. Bring in a new "release" to production
     New releases can come from a "source" project:
     • in the same cluster
     • pulled from github/bitbucket
     • created from a .tar.gz package
 26. Expose an ML model via a REST API
     • Pick a version to expose (multiple deployments are possible too)
     • Create replicas for scale/load-balancing; optionally reserve compute
     • The model is pre-loaded into memory inside the scoring containers, behind a scoring end-point
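For illustration, a minimal scoring service of the kind such a deployment wraps might look like the following; Flask and the model path are assumptions here, not the service layer DSX itself generates.

```python
# Sketch: a scoring web service with the model pre-loaded at start-up.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load once, when the scoring container starts, so each request pays no
# deserialization cost.
MODEL = joblib.load("models/churn_predictor.joblib")  # illustrative path

@app.route("/score", methods=["POST"])
def score():
    features = request.get_json()["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
    predictions = MODEL.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    # Replicas of this container sit behind a Kube svc for load balancing.
    app.run(host="0.0.0.0", port=8080)
```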
 27. Expose Python and R scripts as a Web Service
     Custom scripts can be externalized as a REST service, say for custom prediction functions.
