Re-imagine Data Monitoring with whylogs and Spark

In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It works with different modes of data operations, including streaming, batch, and IoT data.

In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large-scale data profiling, and we will show how users can apply this integration to existing data and ML pipelines.
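
The idea of collecting lightweight statistics per partition and then merging them can be illustrated with a small, self-contained Python sketch. This is not the whylogs API: `ColumnProfile` and `profile_partition` are hypothetical names introduced here, and the in-memory lists stand in for Spark partitions. The sketch only assumes that each statistic can be combined associatively, so partitions can be profiled independently and merged in any order.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass(frozen=True)
class ColumnProfile:
    """Hypothetical mergeable summary of one numeric column (not the whylogs API)."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def merge(self, other: "ColumnProfile") -> "ColumnProfile":
        # Every field combines associatively, so merge order does not matter.
        mins = [m for m in (self.minimum, other.minimum) if m is not None]
        maxs = [m for m in (self.maximum, other.maximum) if m is not None]
        return ColumnProfile(
            count=self.count + other.count,
            nulls=self.nulls + other.nulls,
            total=self.total + other.total,
            minimum=min(mins) if mins else None,
            maximum=max(maxs) if maxs else None,
        )

    @property
    def mean(self) -> Optional[float]:
        valid = self.count - self.nulls
        return self.total / valid if valid else None


def profile_partition(values: Iterable[Optional[float]]) -> ColumnProfile:
    """Profile one partition in a single pass, keeping only constant-size state."""
    profile = ColumnProfile()
    for v in values:
        if v is None:
            single = ColumnProfile(count=1, nulls=1)
        else:
            single = ColumnProfile(1, 0, float(v), float(v), float(v))
        profile = profile.merge(single)
    return profile


# Three in-memory lists stand in for partitions of one column.
partitions = [[1.0, 2.0, None], [10.0, 4.0], [None, 3.0]]
global_profile = reduce(
    ColumnProfile.merge, map(profile_partition, partitions), ColumnProfile()
)
print(global_profile, global_profile.mean)
```

Because the per-partition state has a fixed size, the merge step moves only a handful of numbers per partition rather than the raw rows.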

Re-imagine Data Monitoring with whylogs and Spark

  1. Re-imagine Data Monitoring with whylogs and Apache Spark. Andy Dang, Co-Founder & Lead Engineer, WhyLabs
  2. Outline
     ● ML Data Challenges: how traditional data analysis techniques fail ML data pipelines
     ● Lightweight Profiling for Big ML Data: profiling techniques for detecting data quality problems
     ● The Open Source whylogs Library: building the standard for data logging
  3. ML Lifecycle (Source: Google Cloud AI)
  4. Issues encountered in production (small sample)... or it simply doesn’t work, and nobody knows why...
     ● Experiment/production environment mismatch
     ● Wrong model version deployed
     ● Underprovisioned hardware
     ● Inappropriate hardware
     ● Latency/SLA issues
     ● Data permissions misconfigured
     ● Untracked changes broke prod
     ● Traffic sent to the wrong model
     ● Computational instability
     ● Customers gaming the model (adversarial attacks)
     ● PII data exposed
     ● Expected accuracy doesn’t materialize
     ● Pre-processing mismatch in experiments vs. production
     ● Retrained on faulty data
     ● Accuracy improves on one segment, regresses in others
     ● Outliers predicted incorrectly
     ● Bias identified
     ● Correlation with protected features
     ● Overfitting on training/test
     ● Surge in missing values
     ● Surge in duplicates
     ● Poor performance on new categories
     ● Poor performance on new customer segments
     ● Poor performance on outliers
     ● Data quality issues affect accuracy
     ● Production data doesn’t match test/training
     ● Accuracy is decaying over time
     ● Data drift in inputs
     ● Concept drift in outputs
     ● Extreme predictions for out-of-distribution data
     ● Model not generalizing on new data / new segments
     ● Major customer behavior shift
  5. Issues encountered in production (small sample)... issues caused by data
     ● Experiment/production environment mismatch
     ● Wrong model version deployed
     ● Underprovisioned hardware
     ● Inappropriate hardware
     ● Latency/SLA issues
     ● Data permissions misconfigured
     ● Untracked changes broke prod
     ● Traffic sent to the wrong model
     ● Computational instability
     ● Customers gaming the model (adversarial attacks)
     ● PII data exposed
     ● Expected accuracy doesn’t materialize
     ● Pre-processing mismatch in experiments vs. production
     ● Retrained on faulty data
     ● Accuracy improves on one segment, regresses in others
     ● Outliers predicted incorrectly
     ● Bias identified
     ● Correlation with protected features
     ● Overfitting on training/test
     ● Surge in missing values
     ● Surge in duplicates
     ● Poor performance on new categories
     ● Poor performance on new customer segments
     ● Poor performance on outliers
     ● Data quality issues affect accuracy
     ● Production data doesn’t match test/training
     ● Accuracy is decaying over time
     ● Data drift in inputs
     ● Concept drift in outputs
     ● Extreme predictions for out-of-distribution data
     ● Model not generalizing on new data / new segments
     ● Major customer behavior shift
  6. Data monitoring starts with logging
     ● Data Logs, Model Metadata, Pipeline Metadata
     ● i.e. data profiling: "Data profiling refers to the analysis of information [...] in order to clarify the structure, content, relationships, and derivation rules of the data" [Wikipedia]
  7. Data logs: sampling vs. profiling (slide columns: Sampling, Profiling)
     Pros
     ● Easy to build
     ● Little upfront design
     ● Log & raw data analysis identical
     ● Scalable & lightweight
     ● Flexible & configurable
     ● Rare events and outlier-dependent metrics
     ● Directly interpretable results
     Cons
     ● I/O & storage
     ● Noisy
     ● Requires statistical analysis
     ● Rare events & outliers
     ● Min/max, unique values, etc.
     ● Data-dependent output format
     ● No existing widespread solutions
     ● Mathematical & engineering challenges
  8. Data logs: must be accurate. Median: errors in the estimate of the median for sampling vs. profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
  9. Data logs: must be scalable
     Dataset        Size    # of entries   # of features   Memory consumption   Output size
     Lending Club   1.6GB   2.2M           151             14MB                 7.4MB
     NYC Tickets    1.9GB   10.8           43              14MB                 2.3MB
     Pain pills     75GB    178M           42              15MB                 2MB
  10. Logging ML data at scale. Four key paradigms:
     ● Approximations rather than exact results
     ● Lightweight
     ● Additive
     ● Batch and streaming support
     Profile: a collection of lightweight metrics that provide these properties
  11. Lightweight
     ● Old approach: [diagram] processes, Data Warehouse / Data Lake, Processing Engine
     ● New approach: [diagram] each process paired with profiling, Profile Store, Analysis
     ● Only feasible if profiling is fast and not memory intensive
  12. Additive
     ● [Diagram] dataset 1, dataset 2, dataset 3: sort (shuffle), reduce step, Median
     ● [Diagram] dataset 1 to profile 1, dataset 2 to profile 2, dataset 3 to profile 3: add(profile1, 2, 3), Estimated Median
  13. Batch and streaming support
     ● [Diagram] Spark/Hive Query Engine: partition 1, partition 2, partition 3, ..., partition n, each profiled. No shuffle!
     ● [Diagram] day 0, day 1, day 2, day 3, ..., each profiled, then sum(profiles)
  14. Approximate Statistics
     ● Using stochastic streaming algorithms
       ○ Model the problem as a stochastic process
       ○ Apache DataSketches is the open source implementation
     ● Statistics that we focus on at the moment:
       ○ Histograms
       ○ Frequent items
       ○ Cardinality
  15. whylogs: The Data Logging Library
     ● Multi-language support: Python + Java
     ● Supports both data engineering and data science workflows
     ● Extensibility: image support. Text, video, audio & embeddings support to come
     ● Growing integration list
  16. whylogs: Python
     ● A few lines of code to start logging
     ● Integrates with popular data science libraries
     ● Out-of-the-box visualization utilities
  17. whylogs in Apache Spark
     ● [Diagram] Data Lake (col1, col2, …, coln): partition 1, partition 2, …, partition k, one profile per partition, merge(profile1, profile2, …, profilek), global profile
     ● [Diagram] profile: Schema, Metadata, Sketches, Metrics
  18. Simple Spark API
  19. PySpark support
  20. Catch distribution drift in a few lines of code
  21. Scalable monitoring at input feature granularity
  22. Monitoring layer for ML applications
  23. bit.ly/whylogs
  24. Thank you! Help build the open standard for data logging! bit.ly/whylogs, andy@whylabs.ai, @andy_dng
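
The "Additive" and "Approximate Statistics" slides describe estimating a median by adding per-dataset profiles instead of sorting the raw data. A minimal sketch of that idea follows, assuming a fixed-bin histogram as a simple stand-in for a streaming sketch. `HistogramProfile` and its methods are hypothetical names, not whylogs or Apache DataSketches APIs, and the known value range `[low, high)` is an assumption that real quantile sketches do not require.

```python
from functools import reduce
from typing import Iterable, List


class HistogramProfile:
    """Hypothetical fixed-bin histogram; a simple stand-in for a streaming sketch."""

    def __init__(self, low: float, high: float, bins: int = 100) -> None:
        self.low, self.high, self.bins = low, high, bins
        self.counts: List[int] = [0] * bins

    def update(self, values: Iterable[float]) -> "HistogramProfile":
        width = (self.high - self.low) / self.bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(self.bins - 1, max(0, int((v - self.low) / width)))
            self.counts[idx] += 1
        return self

    def add(self, other: "HistogramProfile") -> "HistogramProfile":
        # Additive: merging is element-wise addition of bin counts, no sort or shuffle.
        if (self.low, self.high, self.bins) != (other.low, other.high, other.bins):
            raise ValueError("profiles must share the same binning")
        merged = HistogramProfile(self.low, self.high, self.bins)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def estimated_median(self) -> float:
        total = sum(self.counts)
        if total == 0:
            raise ValueError("empty profile")
        width = (self.high - self.low) / self.bins
        seen = 0
        for idx, c in enumerate(self.counts):
            if c and seen + c >= total / 2:
                # Interpolate linearly inside the bin that contains the midpoint.
                fraction = (total / 2 - seen) / c
                return self.low + (idx + fraction) * width
            seen += c
        return self.high


# Three datasets profiled independently, then added together.
datasets = [list(range(0, 40)), list(range(40, 70)), list(range(70, 100))]
profiles = [HistogramProfile(0, 100, bins=10).update(d) for d in datasets]
merged = reduce(HistogramProfile.add, profiles)
estimate = merged.estimated_median()
print(estimate)
```

The estimate is only accurate to within one bin width, which is the trade-off the "approximations rather than exact results" paradigm accepts in exchange for a constant-size, mergeable profile.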
