Scaling AutoML-driven Anomaly Detection with Luminaire
Sayan Chakraborty
Smit Shah
Who We Are
Data Governance Platform Team
@ Zillow
Sayan Chakraborty - Senior Applied Scientist
Smit Shah - Senior Software Development Engineer, Big Data
Agenda
● What is Zillow?
● Why Monitor Data Quality
● Data Quality Challenges
● Luminaire and Scaling
● Key Takeaways
Zillow
About Zillow
● Reimagining real estate to make it easier to unlock life’s next chapter
● Offer customers an on-demand experience for selling, buying, renting and financing with transparency and nearly seamless end-to-end service
● Most-visited real estate website in the United States*
* As of Q4-2020
Why Monitor Data Quality
Why Monitor Data Quality?
● Many customer-facing and internal services at Zillow rely on high-quality data
○ Zestimate
○ Zillow Offers
○ Zillow Premier Agent
○ Econ and many more
● Reliable performance of ML models and services requires a certain level of data quality
Why Detect Anomalies?
Anomaly: a data instance or behavior significantly different from ‘regular’ patterns
Anomalies are complex, time-sensitive, and inevitable
Catching anomalies in important metrics helps keep our business healthy
Ways to Monitor Data Quality
Rule Based
● Domain experts set pre-specified rules or thresholds
○ Example: the percent of null data should be less than 2% per day for a given metric (a check like this is sketched after this slide)
● Less complicated to set up and easy to interpret
● Works well when the properties of data are simple and remain stationary over time
ML Based
● Rules are set through mathematical modeling
● Works well when properties of data are complex and change over time
● A more hands-off approach
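A minimal sketch of the rule-based null-rate check referenced above, assuming the metric arrives as a daily pandas DataFrame with a datetime 'date' column and a 'value' column (the column names are illustrative; the 2% default mirrors the example):

import pandas as pd

def null_rate_violations(df: pd.DataFrame, threshold: float = 0.02) -> pd.Series:
    # Share of null values per day, keeping only the days that break the rule.
    daily_null_rate = df['value'].isna().groupby(df['date'].dt.date).mean()
    return daily_null_rate[daily_null_rate > threshold]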
Data Quality Challenges
Data Quality is Context Dependent
● Depends on the use case
● Depends on the reference time frame under consideration
○ Example: the same fluctuation can be interpreted differently when compared against shorter vs. longer reference time frames
● Depends on externalities such as holidays, product launches, market-specific events, etc.
Challenges
● Modeling
○ Wide range of time series patterns from different data sources - one model doesn’t fit all
○ The definition of an anomaly changes at different levels of aggregation of the same data
● Scaling and Standardization
○ Everyone (Analyst, PM, DE) should be able to use ML for anomaly detection and get trustworthy data (but not everyone is an ML expert)
○ Requires scalability to handle large amounts of data across teams
Wishlist for the system
● Able to catch any data irregularities
● Scales to large amounts of data and many metrics
● Minimal configuration
● Minimal maintenance over time
No existing solution meets the above requirements
Luminaire
Luminaire Python Package
Key Features
● Integrated with different models
● AutoML built-in
● Proven to outperform many existing methods
● Time series data profiling enabled
● Built for batch and streaming use cases
Github: https://github.com/zillow/luminaire
Tutorials: https://zillow.github.io/luminaire/
Scientific Paper (IEEE BigData 2020): Building an Automated and Self-Aware Anomaly Detection System (arxiv link)
Luminaire Components
● Training components: AutoML, Data Profiling / Preprocessing, Batch Data Modeling, Streaming Data Modeling
● Scoring components: Scoring/Alerting, Pull Batch Model, Pull Streaming Model
Data Profiling / Preprocessing
>>> from luminaire.exploration.data_exploration import DataExploration
>>> de_obj = DataExploration(freq='D', data_shift_truncate=False, is_log_transformed=True, fill_rate=0.9)
>>> data, pre_prc = de_obj.profile(data)
>>> print(pre_prc)
{'success': True, 'trend_change_list': ['2020-04-01 00:00:00'], 'change_point_list': ['2020-03-16 00:00:00'], 'is_log_transformed': 1, 'min_ts_mean': None, 'ts_start': '2020-01-01 00:00:00', 'ts_end': '2020-06-07 00:00:00'}
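For context, `data` above is a pandas time series DataFrame of the metric being profiled; per the Luminaire tutorials it is indexed by timestamp with the values in a 'raw' column. A minimal sketch with made-up values covering the window reported in the profile output:

import numpy as np
import pandas as pd

# Illustrative daily series; replace the random values with the real metric.
idx = pd.date_range('2020-01-01', '2020-06-07', freq='D')
data = pd.DataFrame({'raw': np.random.default_rng(0).poisson(2000, len(idx))}, index=idx)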
Training - Batch
>>> from luminaire.model.lad_structural import LADStructuralModel
>>> hyper_params = {"include_holidays_exog": True, "is_log_transformed": False, "max_ft_freq": 5, "p": 3, "q": 3}
>>> lad_struct_obj = LADStructuralModel(hyper_params=hyper_params, freq='D')
>>> success, model_date, model = lad_struct_obj.train(data=data, **pre_prc)
>>> print(success, model_date, model)
(True, '2020-06-07 00:00:00', <luminaire_models.model.lad_structural.LADStructuralModel object at 0x7f97e127d320>)
Training - Streaming
>>> from luminaire.model.window_density import WindowDensityHyperParams, WindowDensityModel
>>> from luminaire.exploration.data_exploration import DataExploration
>>> config = WindowDensityHyperParams().params
>>> de_obj = DataExploration(**config)
>>> data, pre_prc = de_obj.stream_profile(df=data)
>>> config.update(pre_prc)
>>> wdm_obj = WindowDensityModel(hyper_params=config)
>>> success, training_end, model = wdm_obj.train(data=data)
>>> print(success, training_end, model)
True 2020-07-03 00:00:00 <luminaire.model.window_density.WindowDensityModel object at 0x7fb6fab80b00>
AutoML - Configuration Optimization
>>> from luminaire.optimization.hyperparameter_optimization import HyperparameterOptimization
>>> hopt_obj = HyperparameterOptimization(freq='D')
>>> opt_config = hopt_obj.run(data=data)
>>> print(opt_config)
{'LuminaireModel': 'LADStructuralModel', 'data_shift_truncate': 0, 'fill_rate': 0.742353444620679, 'include_holidays_exog': 1, 'is_log_transformed': 1, 'max_ft_freq': 2, 'p': 1, 'q': 1}
>>> model_class_name = opt_config['LuminaireModel']
>>> module = __import__('luminaire.model', fromlist=[''])
>>> model_class = getattr(module, model_class_name)
>>> model_object = model_class(hyper_params=opt_config, freq='D')
>>> success, model_date, trained_model = model_object.train(data=training_data, **pre_prc)
>>> print(success, model_date, trained_model)
(True, '2020-06-07 00:00:00', <luminaire_models.model.lad_structural.LADStructuralModel object at 0x7fe2b47a7978>)
Scoring - Batch
>>> model.score(2000, '2020-06-08')
{'Success': True, 'IsLogTransformed': 1, 'LogTransformedAdjustedActual': 7.601402334583733, 'LogTransformedPrediction': 7.85697078664991, 'LogTransformedStdErr': 0.05909378128162875, 'LogTransformedCILower': 7.759770166178546, 'LogTransformedCIUpper': 7.954171407121274, 'AdjustedActual': 2000.000000000015, 'Prediction': 1913.333800801316, 'StdErr': 111.1165409184448, 'CILower': 1722.81265596681, 'CIUpper': 2093.854945635823, 'ConfLevel': 90.0, 'ExogenousHolidays': 0, 'IsAnomaly': False, 'IsAnomalyExtreme': False, 'AnomalyProbability': 0.9616869199903785, 'DownAnomalyProbability': 0.21915654000481077, 'UpAnomalyProbability': 0.7808434599951892, 'ModelFreshness': 0.1}
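A minimal sketch of acting on this score downstream; the fields come from the output above, while the 0.99 cutoff is an illustrative choice rather than a Luminaire default:

def should_alert(score: dict, prob_cutoff: float = 0.99) -> bool:
    # Alert on flagged anomalies, or when the anomaly probability is near-certain.
    return (bool(score.get('IsAnomaly'))
            or bool(score.get('IsAnomalyExtreme'))
            or score.get('AnomalyProbability', 0.0) >= prob_cutoff)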
Scoring - Streaming
>>> freq = model._params['freq']
>>> de_obj = DataExploration(freq=freq)
>>> processed_data, pre_prc = de_obj.stream_profile(df=data, impute_only=True, impute_zero=True)
>>> score, scored_window = model.score(processed_data)
>>> print(score)
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': True, 'AnomalyProbability': 1.0}
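The scoring components ("Pull Batch Model" / "Pull Streaming Model") assume the trained model object is persisted at training time and pulled back before scoring. A minimal sketch of that handoff with pickle; the file-based storage is an assumption for illustration, not part of Luminaire:

import pickle

# At training time: serialize the trained Luminaire model object.
with open('met_1_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# At scoring time: pull the model back and score new data.
with open('met_1_model.pkl', 'rb') as f:
    model = pickle.load(f)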
Scaling
Scaling - Distributed Training/Scoring
(Architecture diagrams for the distributed training and scoring flow; see the referenced blog post.)
Reference: https://www.zillow.com/tech/anomaly-detection-at-zillow-using-luminaire/
Scaling - Distributed using Spark
Training input:
metrics | time_series [data-date, observed-value] | run_date
met_1 | [[2021-01-01, 125], [2021-01-02, 135], [2021-01-03, 140], ...] | 2021-02-01 00:00:00
met_2 | [[2021-01-01, 0.17], [2021-01-02, 0.19], [2021-01-03, 0.22], ...] | 2021-02-01 00:00:00

→ UDF (Train) →

Trained models:
metrics | time_series [data-date, observed-value] | run_date | model_object
met_1 | [[2021-01-01, 125], [2021-01-02, 135], [2021-01-03, 140], ...] | 2021-02-01 00:00:00 | <object_met_1>
met_2 | [[2021-01-01, 0.17], [2021-01-02, 0.19], [2021-01-03, 0.22], ...] | 2021-02-01 00:00:00 | <object_met_2>

Scoring input:
metrics | time_series [data-date, observed-value] | run_date | model_object
met_1 | [[2021-04-01, 115], [2021-04-02, 113]] | 2021-04-02 00:00:00 | <object_met_1>
met_2 | [[2021-04-01, 0.45], [2021-04-02, 0.36]] | 2021-04-02 00:00:00 | <object_met_2>

→ UDF (Score) →

Score results:
metrics | time_series [data-date, observed-value] | run_date | score_results
met_1 | [[2021-04-01, 115], [2021-04-02, 113]] | 2021-04-02 00:00:00 | [{"success": True, "AnomalyProbability": 0.85, ...}, ...]
met_2 | [[2021-04-01, 0.45], [2021-04-02, 0.36]] | 2021-04-02 00:00:00 | [{"success": True, "AnomalyProbability": 0.995, ...}, ...]

* These values are simulated
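A minimal sketch of the per-metric train step with a grouped pandas UDF, assuming PySpark 3.x; the DataFrame contents mirror the simulated table above, the helper names (metrics_df, train_one_metric) are illustrative, and the Luminaire training call itself is elided rather than prescribed:

import pickle
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per (metric, data-date, observed-value), i.e. the exploded form of
# the time_series arrays shown above.
metrics_df = spark.createDataFrame(
    [('met_1', '2021-01-01', 125.0), ('met_1', '2021-01-02', 135.0),
     ('met_2', '2021-01-01', 0.17), ('met_2', '2021-01-02', 0.19)],
    ['metrics', 'data_date', 'observed_value'])

def train_one_metric(pdf: pd.DataFrame) -> pd.DataFrame:
    """Train one model per metric group and return it as serialized bytes."""
    ts = (pdf.sort_values('data_date')
             .set_index('data_date')[['observed_value']]
             .rename(columns={'observed_value': 'raw'}))
    # ... run Luminaire profiling / AutoML / training on `ts` as on the
    # earlier slides, producing a trained model ...
    model = ts  # placeholder so the sketch runs end to end
    return pd.DataFrame({'metrics': [pdf['metrics'].iloc[0]],
                         'model_object': [pickle.dumps(model)]})

models_df = (metrics_df
             .groupBy('metrics')
             .applyInPandas(train_one_metric, schema='metrics string, model_object binary'))

# Scoring follows the same shape: join new observations with model_object,
# unpickle the model inside a second UDF, and call model.score(...).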
Our Integrations with Central Data Systems
● Self-service UI for easier on-boarding
● Surfacing health metrics of data sources in the central data catalog
● Tagging producers and consumers of the anomaly detection jobs
● Smart Alerting based on scoring output sensitivity
Future Direction
● Support anomaly detection beyond temporal context
● Build decision systems for ML pipelines using Luminaire
● Root Cause Analysis to go a step beyond detection, toward diagnosis
● User feedback to collect labeled anomalies
Key Takeaways
Key Takeaways
● Luminaire is a Python library that supports anomaly detection for a wide variety of time series patterns and use cases
● We proposed a technique to build a fully automated anomaly detection system that scales to big data use cases and requires minimal maintenance
Questions?
Thank you!
https://www.zillow.com/careers/