The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.


1. The Function, the Context, and the Data: Building an Abstraction for Simpler ML Ops at Stitch Fix. Elijah ben Izzy, Data Platform Engineer - Model Lifecycle. @elijahbenizzy | linkedin.com/in/elijahbenizzy | Try out Stitch Fix → goo.gl/Q3tCQ3
2. Agenda: Stitch Fix / Data Science (DS) @ Stitch Fix; Common Workflows / Motivation; Representing a Model; Unlocked Capabilities; Future Musings.
3. Take Home: the right abstraction enables separation of concerns between DS and Platform.
4. whoami
5. Stitch Fix?
6. Stitch Fix is a Personal Styling Service. Shop at your personal, curated store; check out what you like.
7. Data Science is Behind Everything We Do (algorithms-tour.stitchfix.com). Algorithms Org.: 145+ data scientists and platform engineers; 3 main verticals + Data Platform.
8. Data Science @ Stitch Fix
9. Common Approaches to Data Science. Typical organization: horizontal teams (data science/research, ETL, engineering); handoffs between functions; coordination required.
10. Data Scientists (DS) are Full Stack. At Stitch Fix: single organization; no handoffs; end-to-end ownership across data science, ETL, and engineering; lots of DS; built on top of data platform tools & abstractions. See https://cultivating-algos.stitchfix.com/
11. The Problem
12. The Problem with Verticals. “DS are full stack” != “DS build the stack from the ground up.” Goal: scale without more complex infrastructure and without more cognitive burden on DS. DS should always be full stack... but can we shorten the stack? (Answer: the ML platform.)
13. Examining Workflows. Training (run at a regular cadence): etl.py → save on S3 → copy to production. Inference: model microservice, predictions in batch, streaming predictions. Analysis: track metrics, share with other teams.
14. Optimizing the Workflow. Goal: build an abstraction that gives DS all of these capabilities for free (model microservice, predictions in batch, streaming predictions, metrics tracking, sharing with other teams, ...). Caveat: largely uniform workflows, but built on independent technologies.
15. The Lede
16. Build or Buy? We built our own and called it the Model Envelope: seamless integration with current infrastructure gives leverage; our model tracking/management data model was not standard; we have lots of segments and varying ways to slice and dice our models; a custom build allows for pivoting as needed; we invest in interface design to allow plug-and-play with open-source options. Hats off to MLflow, TFX, and ModelDB!
17. What We Built. DS only writes the training script; the rest is configuration-driven.

      import model_envelope as me
      from sklearn import linear_model, metrics

      df_X, df_y = load_data_somehow()
      model = linear_model.LogisticRegression(multi_class='auto')
      model.fit(df_X, df_y)
      my_envelope = me.save_model(
          instance_name='my_model_instance_name',
          instance_description='my_model_instance_description',
          model=model,
          query_function='predict',
          api_input=df_X,
          api_output=df_y,
          tags={'canonical_name': 'foo-bar'})
      my_envelope.log_metrics(
          validation_loss=metrics.log_loss(df_y, model.predict_proba(df_X)))
18. Model Envelope (ctd.). The model envelope registry feeds the model microservice, predictions in batch, streaming predictions, metrics tracking, sharing with other teams, and more.
19. Representing a Model
20. Writing a Recipe: the instructions, the cookware, the ingredients.
21. Representing a Model. The function: what the model does. The context: where/how to run the model. The data: the data the model needs to run.
22. The Function: Artifact + Shape
23. The Function: Artifact + Shape. The artifact is the serialized model (bytes) including state, plus serialization metadata. The DS passes the object; the platform serializes it and derives the metadata.

      my_envelope = me.save_model(
          instance_name='my_model_instance_name',
          instance_description='my_model_instance_description',
          model=model,              # DS passes the object, platform serializes it
          query_function='predict',
          api_input=df_X,
          api_output=df_y,
          tags={'canonical_name': 'foo-bar'})
24. The Function: Artifact + Shape. The shape is the function's inputs and outputs. The DS passes a sample dataframe or specifies type annotations; the platform serializes the shape and represents it in a custom format.

      my_envelope = me.save_model(
          instance_name='my_model_instance_name',
          instance_description='my_model_instance_description',
          model=model,
          query_function='predict',
          api_input=df_X,           # DS passes a sample dataframe or type annotations
          api_output=df_y,          # platform serializes the shape into a custom format
          tags={'canonical_name': 'foo-bar'})
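   A rough illustration of what deriving a shape from a sample dataframe can look like (a sketch, not the Model Envelope internals; derive_schema and the column names are hypothetical):

      import pandas as pd

      def derive_schema(df: pd.DataFrame) -> dict:
          # Map each column of the sample dataframe to the name of its dtype.
          return {column: str(dtype) for column, dtype in df.dtypes.items()}

      df_X = pd.DataFrame({'num_items': [3, 5], 'avg_price': [19.99, 42.0]})
      print(derive_schema(df_X))  # {'num_items': 'int64', 'avg_price': 'float64'}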
25. The Context: Environment + Index
26. The Context: Environment + Index. The environment covers installed packages (automagically derived by the platform, or the DS passes pointers), custom code (the DS passes it in as needed), and the language + version (automagically derived by the platform).

      import my_custom_fancy_ml_module

      my_envelope = me.save_model(
          instance_name='my_model_instance_name',
          instance_description='my_model_instance_description',
          model=model,
          query_function='predict',
          api_input=df_X,
          api_output=df_y,
          tags={'canonical_name': 'foo-bar'},
          # pip_env=['scikit-learn', 'pandas'],  # edge case, only if needed
          custom_modules=[my_custom_fancy_ml_module])
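   A minimal sketch of how a platform might "automagically" derive the environment (not Stitch Fix's actual code; snapshot_environment is hypothetical):

      import sys
      import importlib.metadata

      def snapshot_environment() -> dict:
          # Capture the interpreter version and every installed package's version.
          return {
              'python_version': sys.version.split()[0],
              'packages': {
                  dist.metadata['Name']: dist.version
                  for dist in importlib.metadata.distributions()
              },
          }

      env = snapshot_environment()
      print(env['python_version'], len(env['packages']), 'packages captured')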
27. The Context: Environment + Index. The index is a set of key-value tags, which form the spine of the envelope registry. The platform derives base tags; the DS passes custom tags as desired.

      import my_custom_fancy_ml_module

      my_envelope = me.save_model(
          instance_name='my_model_instance_name',
          instance_description='my_model_instance_description',
          model=model,
          query_function='predict',
          api_input=df_X,
          api_output=df_y,
          tags={'canonical_name': 'foo-bar'},   # DS passes custom tags; platform derives base tags
          custom_modules=[my_custom_fancy_ml_module])
28. The Data: Training Data + Metrics
29. The Data: Training Data + Metrics. Training data covers features (the DS optionally passes a spec for features) and summary statistics (the platform derives them from the passed data).

      my_envelope = me.save_model(
          instance_name='my_model_instance_name',
          instance_description='my_model_instance_description',
          model=model,
          query_function='predict',
          api_input=df_X,
          api_output=df_y,
          feature_store_pointers=...)   # optional spec for features
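   An illustration of the kind of summary statistics a platform could derive from the passed training data (a sketch, not the Model Envelope implementation; summarize_training_data is hypothetical):

      import pandas as pd

      def summarize_training_data(df: pd.DataFrame) -> dict:
          # Per-column stats that can be stored with the model and later compared
          # against production inputs.
          stats = {}
          for column in df.select_dtypes(include='number').columns:
              series = df[column]
              stats[column] = {
                  'mean': float(series.mean()),
                  'std': float(series.std()),
                  'min': float(series.min()),
                  'max': float(series.max()),
                  'null_rate': float(series.isna().mean()),
              }
          return stats

      print(summarize_training_data(pd.DataFrame({'num_items': [3, 5, 8]})))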
30. The Data: Training Data + Metrics. Metrics cover scalars and fancier metric types; the DS logs them using the platform's metric-schema library.

      my_envelope = me.save_model(
          instance_name='my_model_instance_name',
          instance_description='my_model_instance_description',
          model=model,
          query_function='predict',
          api_input=df_X,
          api_output=df_y,
          feature_store_pointers=...)

      evaluations = model.predict_proba(df_X)[:, 1]
      my_envelope.log_metrics(
          validation_loss=metrics.log_loss(df_y, evaluations),
          roc_curve=metrics.roc_curve(df_y, evaluations))
31. Unlocked Capabilities
32. Online Inference: Approach. Generate and automatically deploy a microservice for model predictions. Platform: (1) runs a cron job to determine which models to deploy; (2) generates the code to run the model microservice; (3) deploys the models, with config, to AWS; (4) monitors and manages the model infrastructure. DS: (1) generates and tests out the service locally; (2) sets up an automatic deployment “rule”; (3) publishes the model and waits.
33. Online Inference: The Function. The serialized artifact is loaded on service instantiation and called by the prediction endpoints. The function shape is used to create the OpenAPI spec and validate inputs.
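   A minimal sketch of what such a generated service could look like, using FastAPI/pydantic as stand-ins for the platform-generated code (the request fields, file name, and endpoint are hypothetical; FastAPI derives the OpenAPI spec and input validation from the request model, mirroring the stored function shape):

      import pickle

      import pandas as pd
      from fastapi import FastAPI
      from pydantic import BaseModel

      class PredictRequest(BaseModel):
          # Fields generated from the stored function shape (hypothetical here).
          num_items: int
          avg_price: float

      app = FastAPI()

      with open('model.pkl', 'rb') as f:  # artifact loaded once, at service instantiation
          model = pickle.load(f)

      @app.post('/predict')
      def predict(request: PredictRequest) -> float:
          df = pd.DataFrame([request.dict()])
          return float(model.predict(df)[0])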
34. Online Inference: The Context. The tag spec is used to automatically deploy whenever a new model is published (note: the user never has to call deploy(); this happens through system-managed CD). Stored package versions are used to build Docker images, and custom code is made accessible to the model for deserialization and execution. [Diagram: Docker image = installed Python packages + custom code, rolled out via CD.]
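   A toy sketch of the kind of check a deployment cron could run, with in-memory stand-ins for the registry (EnvelopeRecord, InMemoryRegistry, and should_redeploy are illustrative, not real Model Envelope APIs):

      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class EnvelopeRecord:
          id: int
          tags: dict

      class InMemoryRegistry:
          def __init__(self, records):
              self.records = records

          def latest_envelope(self, tags: dict) -> Optional[EnvelopeRecord]:
              # Newest envelope whose tags contain all of the rule's key-value pairs.
              matches = [r for r in self.records
                         if all(r.tags.get(k) == v for k, v in tags.items())]
              return max(matches, key=lambda r: r.id, default=None)

      def should_redeploy(registry, deployed_id: Optional[int], rule_tags: dict) -> bool:
          latest = registry.latest_envelope(rule_tags)
          return latest is not None and latest.id != deployed_id

      registry = InMemoryRegistry([EnvelopeRecord(1, {'canonical_name': 'foo-bar'}),
                                   EnvelopeRecord(2, {'canonical_name': 'foo-bar'})])
      print(should_redeploy(registry, deployed_id=1,
                            rule_tags={'canonical_name': 'foo-bar'}))  # True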
35. Online Inference: The Data. Summary stats are used to validate and monitor the input (data drift); the feature pointer is used to load feature data from the feature store.
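   A toy example of the data-drift idea: compare live input statistics against the summary stats stored at training time (drifted and its threshold are illustrative, not the platform's monitoring logic):

      import pandas as pd

      def drifted(live: pd.Series, training_stats: dict, n_stds: float = 3.0) -> bool:
          # Flag a batch of live inputs whose mean is far from the training mean.
          return abs(live.mean() - training_stats['mean']) > n_stds * training_stats['std']

      stats = {'mean': 30.0, 'std': 2.0}                      # stored at training time
      print(drifted(pd.Series([29.5, 30.2, 31.0]), stats))    # False
      print(drifted(pd.Series([55.0, 60.0, 58.0]), stats))    # True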
36. Batch Inference: Approach. Generate a batch job in the Stitch Fix workflow system (built on top of Airflow/Flotilla). DS: (1) creates the config for the batch job (local or Spark), including (a) a tag query to choose the model and (b) input/output tables; (2) executes it as part of ETL. Platform: (1) spins up a Spark cluster (if specified); (2) loads the input data and optionally joins it with features; (3) executes the model’s predict function over the input; (4) saves the result to the output table.
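   To make the DS side concrete, the config for such a batch job might look something like the following (the keys and table names are hypothetical, not the real workflow-system schema):

      batch_job_config = {
          'model_selector': {'tags': {'canonical_name': 'foo-bar'}},  # tag query picks the model
          'input_table': 'schema.inference_inputs',
          'output_table': 'schema.inference_outputs',
          'execution': {'mode': 'spark', 'num_workers': 8},           # or 'local'
      }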
37. Batch Inference: The Function. The serialized artifact is loaded when the batch job starts, and the function shape is used to validate inputs and outputs. mapPartitions + PyArrow are used to run models that take in dataframes efficiently on Spark; this is abstracted away from the user.
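   The slide mentions mapPartitions + PyArrow; a comparable sketch using Spark's built-in Arrow-backed mapInPandas is below (the envelope machinery does this for the user; the table names, schema, and StubModel are placeholders):

      from typing import Iterator

      import pandas as pd
      from pyspark.sql import SparkSession

      class StubModel:                          # stand-in for the deserialized envelope model
          def predict(self, df: pd.DataFrame):
              return df['avg_price'] * 0.0      # placeholder predictions

      model = StubModel()
      spark = SparkSession.builder.getOrCreate()

      def predict_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
          # Each Arrow batch arrives as a pandas DataFrame; score it and hand it back.
          for batch in batches:
              batch['prediction'] = model.predict(batch)
              yield batch

      input_df = spark.table('schema.inference_inputs')
      output_schema = 'num_items long, avg_price double, prediction double'
      predictions = input_df.mapInPandas(predict_partition, schema=output_schema)
      predictions.write.mode('overwrite').saveAsTable('schema.inference_outputs')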
38. Batch Inference: The Context. Frozen package and language versions are used when installing dependencies, custom code is made accessible to the model for deserialization and execution, and tags are used to determine which model to run. [Diagram: Docker image = installed Python packages + custom code.]
39. Batch Inference: The Data. The feature pointer is used to load feature data from the feature store (if IDs are specified), and evaluation table pointers are stored in the registry.
40. Metrics Tracking: Approach. Allow metrics tracking with tag-based querying. DS: (1) logs metrics using the Python client; (2) explores them in the Model Operations Dashboard; (3) saves the URL for a favorite viz. Platform: (1) builds and manages the dashboard; (2) adds fancy new metric types!
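   A toy illustration of tag-based metric querying using pandas (the record layout is illustrative, not the actual registry schema):

      import pandas as pd

      records = pd.DataFrame([
          {'instance': '2021-05-01', 'canonical_name': 'foo-bar', 'metric': 'validation_loss', 'value': 0.41},
          {'instance': '2021-05-08', 'canonical_name': 'foo-bar', 'metric': 'validation_loss', 'value': 0.38},
          {'instance': '2021-05-08', 'canonical_name': 'other',   'metric': 'validation_loss', 'value': 0.52},
      ])

      foo_bar = records[records['canonical_name'] == 'foo-bar']
      print(foo_bar.pivot(index='instance', columns='metric', values='value'))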
41. Metrics Tracking [dashboard screenshot]
  42. Metrics Tracking [dashboard screenshot]
  43. Metrics Tracking [dashboard screenshot]
44. In Summation
45. Value Added by Separating Concerns. Platform concerned with: making deployment easy; ensuring the environment in prod == the environment in training; providing easy metrics analysis; wrapping up complex systems; behind-the-scenes best practices. DS concerned with: creating the best model; choosing the best libraries; determining the right metrics to log. DS focuses on creating the best model [writing the recipe]; Platform focuses on optimal infrastructure [cooking it].
46. Future Musings
47. Some Ideas... More advanced use of the data: production monitoring, using training data/stats for visibility into prod/training drift. More deployment contexts: predictions on streaming/Kafka topics. More sophisticated feature tracking/integration: feature stores are all the rage... Lambda-like architecture: rather than requiring a deploy, can we query the system for a model’s predictions? (Requires more unified environments...) Attach external capabilities to replace home-built components of our own system...
48. Questions? Find me at: @elijahbenizzy | linkedin.com/in/elijahbenizzy | elijah.benizzy@stitchfix.com. Try out Stitch Fix → goo.gl/Q3tCQ3
