Predicting Real Estate Prices in Moscow
A Kaggle Competition
University of Washington Professional & Continuing Education
BIG DATA 220B SPRING 2017 FINAL PROJECT
Team D-Hawks
Leo Salemann, Karunakar Kotha, Shiva Vuppala, John Bever, Wenfan Xu
Keywords: Big Data, Kaggle, Machine Learning, Azure ML Studio, Boosted Decision Tree, Neural Network, Regression, Tableau
Problem Description & Datasets
Input Data       Description                                  Features  Observations
Housing Data     Property, neighborhood, sales date & price   292       30,473
Macroeconomics   Daily commodity prices, indicators like GDP  100       2,485
Data Dictionary  Feature definitions
Shapefiles       Spatial data for maps
ML Studio Flow
1. Load data; select columns
2. Edit Metadata (set datatype)
3. Clean Missing Data
4. Clip, Normalize, Split
5. Train & Evaluate: Boosted Decision Tree, Neural Network
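The split and evaluate steps of the flow can be sketched outside ML Studio in a few lines of Python. This is only scaffolding under stated assumptions: the (area, price) rows are made up, the 70/30 split fraction is a guess, and a mean-price predictor stands in for the Boosted Decision Tree and Neural Network modules the team actually used.

```python
import math
import random

def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (area_sq_m, price) rows; not taken from the Kaggle data.
rows = [(area, 100_000 * area) for area in range(30, 130, 10)]

train, test = split_rows(rows)

# Stand-in "model": always predict the mean training price.
mean_price = sum(price for _, price in train) / len(train)
test_rmse = math.sqrt(
    sum((price - mean_price) ** 2 for _, price in test) / len(test)
)
print(f"train={len(train)} test={len(test)} baseline RMSE={test_rmse:,.0f}")
```

Any real model should beat this baseline RMSE on the held-out rows; if it does not, something upstream (cleaning, clipping, joins) deserves a second look.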
Azure ML Studio Experiments - Variations
Name       Strategy                         Cols  Rows    Root Mean Squared Error  RMSE / STDEV(price)
Wenfan     Baseline                         13    27,909  2,505,749.58             0.524203184
Leo        Incremental add                  64    15,693  2,573,721.30             0.538422878
Shiva      Feature Selection Pre-processor  21    30,471  2,425,862.34             0.507490762
Karunakar  Filter Based Feature Selection   38    14,853  3,054,675.32             0.639038531
John       Parallel Cleansing Paths         391   30,471  2,263,084.20             0.473437552

Experiment Characteristics
Wenfan
● Basic 12 real estate features
● Tried 4 regression models, kept 2
Leo
● Incrementally add more real estate features
● Omit macroeconomic features
● Detailed Human-in-The-Loop process
Shiva
● Separate Experiment for Feature Selection (Permutation Feature Importance)
● Joined Macro Data
● Added Retail-specific Features
● Added Decision Forest Regression Module
Karunakar
● Filter Based Feature Selection
● Boosted Decision Tree
● Decision Forest Regression
John
● Joined Macro Data
● Start with all fields, gradually remove
● Parallel cleansing paths (set to zero; set to …)
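The RMSE / STDEV(price) column normalizes each experiment's error by the spread of the prices, so a value below 1 means the model does better than always predicting the mean price. A minimal sketch of that ratio follows; whether the team used the population or the sample standard deviation is not stated, so the population form here is an assumption. The second half divides each reported RMSE by its reported ratio to back out the price standard deviation the table implies.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def rmse_over_stdev(actual, predicted):
    """RMSE divided by the population standard deviation of the actuals."""
    return rmse(actual, predicted) / statistics.pstdev(actual)

# Toy check: predicting the mean everywhere gives a ratio of 1.
toy_actual = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_ratio = rmse_over_stdev(toy_actual, [3.0] * 5)

# (RMSE, RMSE / STDEV(price)) pairs copied from the table above.
table = {
    "Wenfan": (2_505_749.58, 0.524203184),
    "Leo": (2_573_721.30, 0.538422878),
    "Shiva": (2_425_862.34, 0.507490762),
    "Karunakar": (3_054_675.32, 0.639038531),
    "John": (2_263_084.20, 0.473437552),
}
implied_stdev = {name: r / ratio for name, (r, ratio) in table.items()}
for name, s in implied_stdev.items():
    print(f"{name:<10} implied STDEV(price) = {s:,.0f}")
```

All five rows imply the same standard deviation of roughly 4.78 million rubles, which suggests the ratio was computed against one common price column rather than each experiment's own filtered rows.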
The Winning Experiment
1. Collecting Data
2. Clean Missing Data - Try Three Modes
a. Custom Value Substitution (a fixed value, e.g. 0)
b. Replace with Mean
c. Replace using Probabilistic PCA
3. Clip, Normalize, Split (same for all 3 paths)
- Handling Categorical & Continuous Variables
- Outlier clipping (per-value; not via SQL)
- Data Normalization or Feature Scaling
4. Train & Evaluate - Compare Three Different Models
a. Poisson Regression
b. Neural Network Regression
c. Boosted Decision Tree Regression
[Figure: panels A, B, C]
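The parallel cleansing idea can be illustrated in plain Python. This sketch covers only two of the three modes (fixed-value and mean substitution; Probabilistic PCA is left out), followed by the shared clip and normalize steps. The column values, the clip bounds, and the choice of min-max scaling are all assumptions made for illustration, not settings from the experiment.

```python
import statistics

def fill_missing(values, mode):
    """Replace None with 0.0 ('zero') or with the mean of present values ('mean')."""
    present = [v for v in values if v is not None]
    fill = 0.0 if mode == "zero" else statistics.fmean(present)
    return [fill if v is None else v for v in values]

def clip(values, low, high):
    """Clamp every value into the range [low, high]."""
    return [min(max(v, low), high) for v in values]

def min_max_normalize(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical square-meter column with one gap and one outlier.
full_sq = [38.0, None, 52.0, 5000.0, 44.0]

# Same clip and normalize steps on both paths; only the fill mode differs.
path_a = min_max_normalize(clip(fill_missing(full_sq, "zero"), 0.0, 200.0))
path_b = min_max_normalize(clip(fill_missing(full_sq, "mean"), 0.0, 200.0))
print(path_a)
print(path_b)
```

In this toy column the mean fill is dragged up by the outlier because the fill happens before clipping, so the two paths end up with quite different values for the missing row. That kind of difference is what comparing the paths side by side is meant to expose.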
Final Algorithm Parameters
[Figure: panels A, B, C]
Predictions Based on Normalized Inputs
389 input columns
Visualization
Predicted price vs. real price
● Actual “waveform” tracks quite well (peaks and valleys line up)
● Fairly consistent delta - always undershooting by about 500K Rubles
Visualization
Error analysis based on house property: square meters
Visualization
Error analysis based on geometry property: districts
Conclusion & Further Work
TWL (Today We Learned)
● Azure ML Studio is great for trying multiple techniques in parallel (try that in Python!)
● Many ways to approach the problem.
○ Effort required varies a lot …
○ So does the quality of the results.
Next time …
● Watch those row counts … did you lose any?
● Deploy Web Service earlier and more often.
Someday/One day …
● Use different models for different subclasses of real estate.
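The "watch those row counts" lesson can be checked with simple arithmetic on the numbers already reported. The sketch below assumes every experiment started from the 30,473 housing observations listed in the datasets table; the slides do not say so explicitly, and joins or filters may have changed the starting point.

```python
# Housing observations from the datasets table (assumed common starting point).
INPUT_ROWS = 30_473

# Rows that reached training in each experiment, from the experiments table.
rows_used = {
    "Wenfan": 27_909,
    "Leo": 15_693,
    "Shiva": 30_471,
    "Karunakar": 14_853,
    "John": 30_471,
}

def rows_lost(before, after):
    """Return the absolute and fractional row loss between two pipeline stages."""
    lost = before - after
    return lost, lost / before

report = {name: rows_lost(INPUT_ROWS, n) for name, n in rows_used.items()}
for name, (lost, fraction) in report.items():
    print(f"{name:<10} lost {lost:>6,} rows ({fraction:.1%})")
```

Under that assumption, two experiments dropped roughly half of the data while two others lost only 2 rows. A check like this after each cleaning step makes such losses visible early.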
THANK YOU!
Appendix
Experiment Variation Details & Results
More experiment screenshots
Azure ML Studio Experiments - Variations
Wenfan - Baseline
Experiment Characteristics:
● Basic 12 real estate features
Regression Models:
● Boosted Decision Tree
● Neural Network
● Bayesian Linear
● Linear
Notes: Kept Boosted Decision Tree and Neural Network; dropped the others.

Leo - Incremental add
Experiment Characteristics:
● Incrementally add more real estate features
● Omit macroeconomic features
Regression Models:
● Boosted Decision Tree
● Neural Network
Notes: Detailed Human-In-The-Loop (HITL) process.

Shiva - Feature Selection Pre-processor, add Macro & Retail
Experiment Characteristics:
● Joined Macro Data
● Added Retail-specific Features
Regression Models:
● Boosted Decision Tree
● Decision Forest Regression
Notes: Separate Experiment for Feature Selection (Permutation Feature Importance).

Karunakar - Filter Based Feature Selection
Experiment Characteristics:
● Filter Based Feature Selection
● Remove features that aren’t helping
Regression Models:
● Boosted Decision Tree
● Decision Forest Regression
Notes: Kept Filter Based Feature Selection, Boosted Decision Tree, and Decision Forest Regression.

John - Parallel Cleansing Paths (set to 0 vs. median vs. Probabilistic PCA)
Experiment Characteristics:
● Joined Macro Data
● Start with all fields, gradually remove
● Parallel cleansing paths
Regression Models:
● Multiple Boosted Decision Tree Models
● Poisson
● Neural Network
Notes: Multiple simultaneous parallel paths.
Evaluation Metrics
Name       Strategy                         Cols  Rows    MAE           RMSE          RMSE / STDEV(price)  RAE       RSE       R²
Wenfan     Baseline                         13    27,909  1,448,475.24  2,505,749.58  0.524203184          0.535641  0.386980  0.6130200
Leo        Incremental add                  64    15,693  1,577,436.18  2,573,721.30  0.538422878          0.507266  0.284116  0.7158840
Shiva      Feature Selection Pre-processor  21    30,471  1,390,695.31  2,425,862.34  0.507490762          0.521245  0.352367  0.6476330
Karunakar  Filter Based Feature Selection   38    14,853  1,874,864.85  3,054,675.32  0.639038531          0.626830  0.439601  0.5603993
John       Parallel Cleansing Paths         391   30,471  1,358,929.12  2,263,084.20  0.473437552          0.487444  0.315758  0.6842420

MAE: Mean Absolute Error; RMSE: Root Mean Squared Error; RAE: Relative Absolute Error; RSE: Relative Squared Error; R²: Coefficient of Determination
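For readers who want to reproduce these columns outside ML Studio, here is a small sketch using the standard textbook definitions of the five metrics. The actual and predicted values are toy numbers, and the function name is made up for this example.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, RAE, RSE and R² from their standard definitions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    rse = sq_err / sq_dev
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae": abs_err / abs_dev,  # error relative to predicting the mean
        "rse": rse,
        "r2": 1.0 - rse,           # coefficient of determination
    }

# Toy values, not from the competition data.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 10.0])
print(metrics)
```

The table is consistent with the identity R² = 1 - RSE: for example, John's row gives 1 - 0.315758 = 0.684242.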
Shiva’s Pre-Processor Experiment
Uses the Permutation Feature Importance algorithm to compute an importance score for each feature variable in the dataset.
1. Load housing and macro data; join data.
2. Select ALL columns; Edit Metadata (set datatype).
3. Split Data.
4. Add the Permutation Feature Importance module. Connections: left input from Train Model, right input from the dataset. Works only for regression or classification models.
5. Execute Permutation Feature Importance (40 mins).
6. The result lists the highest-scoring features in the dataset.
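The idea behind permutation feature importance can be sketched without ML Studio: shuffle one column at a time and measure how much the model's error grows. The version below is a simplified illustration, not the module's implementation. The fixed `toy_model` function stands in for a trained model, and the rows, the RMSE scoring, and the repeat count are assumptions.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def permutation_importance(predict, rows, targets, n_repeats=5, seed=0):
    """Mean increase in RMSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    scores = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only this column's value replaced.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(rmse(targets, [predict(r) for r in permuted]) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores

def toy_model(row):
    """Hypothetical trained model: price depends on area only."""
    area, _unused = row
    return 100_000 * area

# Hypothetical [area, unrelated_feature] rows.
rows = [[30, 7], [45, 1], [60, 4], [75, 9], [90, 2], [120, 5]]
targets = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, targets)
print(scores)
```

Shuffling the area column breaks the predictions, so its score is large. Shuffling the unused column changes nothing, so its score is exactly zero. Features with scores near zero are candidates for removal.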
Karunakar Pre-Processor Experiment
Boosting in a decision tree ensemble tends to improve accuracy, with some small risk of less coverage.
1. Load housing data.
2. Select columns; Edit Metadata (set datatype).
3. Apply SQL transformations.
4. Apply Filter Based Feature Selection, normalize the data, and split the data.
5. Chose Boosted Decision Tree and Decision Forest Regression to find the best predictor.
6. Train and score a model for each algorithm.
7. Evaluate the models.
Karunakar Variation
1. Filter Based Feature Selection (remove features that aren’t helping)
2. Decision Forest
Filter Based Feature Selection:
1. Feature selection is the process of selecting the attributes (columns) in a dataset that are most relevant to the predictive modeling task.
2. Choosing the right features can potentially improve the accuracy and efficiency of classification.
3. The Filter Based Feature Selection module identifies the columns in your input dataset that have the greatest predictive power.
Pearson Correlation:
1. Pearson’s correlation statistic, or Pearson’s correlation coefficient, is also known in statistical models as the r value. For any two variables, it returns a value that indicates the strength of the correlation.
2. Pearson's correlation coefficient is computed by taking the covariance of two variables and dividing by the product of their standard deviations. The coefficient is not affected by changes of scale in the two variables.
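The definition above translates directly into code. The sketch below computes Pearson's r as covariance divided by the product of the standard deviations, then ranks candidate columns by the absolute value of r against the target. Ranking by |r| is one simple form of filter-based selection, shown here as an illustration; the column names and values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

def rank_features(columns, target, top_k):
    """Return the top_k column names ordered by |r| against the target."""
    ranked = sorted(
        columns, key=lambda name: abs(pearson_r(columns[name], target)), reverse=True
    )
    return ranked[:top_k]

# Hypothetical columns and prices (in millions of rubles).
price = [3.0, 4.5, 6.0, 7.5, 9.0]
columns = {
    "full_sq": [30.0, 45.0, 60.0, 75.0, 90.0],
    "floor": [5.0, 1.0, 4.0, 2.0, 3.0],
}
best = rank_features(columns, price, top_k=1)

# Rescaling a variable (e.g. square meters to square centimeters) leaves r unchanged.
r_original = pearson_r(columns["full_sq"], price)
r_rescaled = pearson_r([v * 10_000 for v in columns["full_sq"]], price)
print(best, r_original, r_rescaled)
```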
Karunakar Variation
Decision Forest Regression Model:
Decision trees are nonparametric models that perform a sequence of simple tests for each instance, traversing a binary tree data structure until a leaf node (decision) is reached.
Decision trees have these advantages:
1. They are efficient in both computation and memory usage during training and prediction.
2. They can represent non-linear decision boundaries.
3. They perform integrated feature selection and classification and are resilient in the presence of noisy features.
This regression model consists of an ensemble of decision trees. Each tree in a regression decision forest outputs a Gaussian distribution by way of prediction. An aggregation is performed over the ensemble of trees to find a Gaussian distribution closest to the combined distribution for all trees in the model.
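One common way to find a single Gaussian close to an equally weighted mixture of per-tree Gaussians is moment matching: keep the mixture's mean and variance. Whether ML Studio's Decision Forest Regression aggregates exactly this way is not stated in the slides, so treat the sketch below as an assumption that only illustrates the idea.

```python
def combine_gaussians(means, variances):
    """Moment-match an equally weighted mixture of Gaussians with one Gaussian."""
    n = len(means)
    mean = sum(means) / n
    # E[X^2] of the mixture is the average of (variance + mean^2) over the trees.
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / n
    return mean, second_moment - mean * mean

# Two hypothetical trees predicting N(1, 1) and N(3, 1).
combined_mean, combined_var = combine_gaussians([1.0, 3.0], [1.0, 1.0])
print(combined_mean, combined_var)
```

The combined variance (2.0) is larger than either tree's variance (1.0) because it also reflects the disagreement between the trees' means.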

Predicting Moscow Real Estate Prices with Azure Machine Learning

  • 1. Predicting Real Estate Prices in Moscow: A Kaggle Competition
      University of Washington Professional & Continuing Education
      BIG DATA 220B SPRING 2017 FINAL PROJECT
      Team D-Hawks: Leo Salemann, Karunakar Kotha, Shiva Vuppala, John Bever, Wenfan Xu
      Keywords: Big Data, Kaggle, Machine Learning, Azure ML Studio, Boosted Decision Tree, Neural Network, Regression, Tableau
  • 2. Problem Description & Datasets
      | Input Data | Description | Features | Observations |
      |---|---|---|---|
      | Housing Data | Property, neighborhood, sales date & price | 292 | 30,473 |
      | Macroeconomics | Daily commodity prices, indicators like GDP | 100 | 2,485 |
      | Data Dictionary | Feature definitions | | |
      | Shapefiles | Spatial data for maps | | |
  • 3. ML Studio Flow
      1. Load data; select columns
      2. Edit Metadata (set datatype)
      3. Clean Missing Data
      4. Clip, Normalize, Split
      5. Train & Evaluate: Boosted Decision Tree, Neural Network
  • 4. Azure ML Studio Experiments - Variations
      Wenfan (Baseline): 13 cols, 27,909 rows; RMSE 2,505,749.58; RMSE / STDEV(price) 0.524203184
        ● Basic 12 real estate features
        ● Tried 4 regression models, kept 2
      Leo (Incremental add): 64 cols, 15,693 rows; RMSE 2,573,721.30; RMSE / STDEV(price) 0.538422878
        ● Incrementally add more real estate features
        ● Omit macroeconomic features
        ● Detailed Human-in-The-Loop process
      Shiva (Feature Selection Pre-processor): 21 cols, 30,471 rows; RMSE 2,425,862.34; RMSE / STDEV(price) 0.507490762
        ● Separate Experiment for Feature Selection (Permutation Feature Importance)
        ● Joined Macro Data
        ● Added Retail-specific Features
        ● Added Decision Forest Regression Module
      Karunakar (Filter Based Feature Selection): 38 cols, 14,853 rows; RMSE 3,054,675.32; RMSE / STDEV(price) 0.639038531
        ● Filter Based Feature Selection
        ● Boosted Decision Tree
        ● Decision Forest Regression
      John (Parallel Cleansing Paths): 391 cols, 30,471 rows; RMSE 2,263,084.20; RMSE / STDEV(price) 0.473437552
        ● Joined Macro Data
        ● Start with all fields, gradually remove
        ● Parallel cleansing paths (set to zero; set to …
  • 5. The Winning Experiment
      1. Collecting Data
      2. Clean Missing Data - try three modes
         a. Custom Value Substitution (a fixed value, e.g. 0)
         b. Replace with Mean
         c. Replace using Probabilistic PCA
      3. Clip, Normalize, Split (same for all 3 paths)
         - Handling categorical & continuous variables
         - Outlier clipping (per-value; not via SQL)
         - Data normalization or feature scaling
      4. Train & Evaluate - compare three different models
         a. Poisson Regression
         b. Neural Network Regression
         c. Boosted Decision Tree Regression
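The cleansing and scaling steps in slide 5 can be illustrated outside ML Studio. The following is only a minimal sketch, assuming a single numeric column held as a Python list with `None` for missing values. It covers two of the three cleaning modes (fixed value and mean); Probabilistic PCA is omitted. The function names, the sample column, and the clip bounds are hypothetical and do not come from the slides.

```python
from statistics import mean


def fill_missing(values, mode="zero"):
    """Replace None with 0 ("zero") or with the mean of the observed values ("mean")."""
    observed = [v for v in values if v is not None]
    if mode == "zero":
        fill = 0.0
    elif mode == "mean":
        fill = float(mean(observed))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [fill if v is None else float(v) for v in values]


def clip(values, low, high):
    """Clip each value into [low, high] (per-value outlier clipping)."""
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column with two missing entries and one outlier.
area = [40, None, 60, 1000, None, 50]
for mode in ("zero", "mean"):
    cleaned = fill_missing(area, mode)
    print(mode, min_max_normalize(clip(cleaned, 0, 200)))
```

Running both modes side by side mirrors the idea of parallel cleansing paths: the same downstream steps are applied to each cleaned copy so the results can be compared.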
  • 7. Predictions Based on Normalized Inputs: 389 input columns
  • 8. Visualization: Predicted Price vs. Real Price
      ● Actual “waveform” tracks quite well (peaks and valleys line up)
      ● Fairly consistent delta - always undershooting by about 500K rubles
  • 11. Conclusion & Further Work
      TWL (Today We Learned)
        ● Azure ML Studio is great for trying multiple techniques in parallel (try that in Python!)
        ● Many ways to approach the problem.
          ○ Effort required varies a lot …
          ○ So does the quality of the results.
      Next time …
        ● Watch those row counts … did you lose any?
        ● Deploy Web Service earlier and more often.
      Someday/Oneday …
        ● Use different models for different subclasses of real estate.
  • 13. Appendix: Experiment Variation Details & Results; more experiment screenshots
  • 14. Azure ML Studio Experiments - Variations
      Wenfan (Baseline)
        Characteristics: ● Basic 12 real estate features
        Regression models: ● Boosted Decision Tree ● Neural Network ● Bayesian Linear ● Linear
        Notes: Kept Boosted Decision Tree and Neural Network; dropped the others.
      Leo (Incremental add)
        Characteristics: ● Incrementally add more real estate features ● Omit macroeconomic features
        Regression models: ● Boosted Decision Tree ● Neural Network
        Notes: Detailed Human-In-The-Loop (HITL) process.
      Shiva (Feature Selection Pre-processor, add Macro & Retail)
        Characteristics: ● Joined Macro Data ● Added Retail-specific Features
        Regression models: ● Boosted Decision Tree ● Decision Forest Regression
        Notes: Separate Experiment for Feature Selection (Permutation Feature Importance).
      Karunakar (Filter Based Feature Selection)
        Characteristics: ● Filter Based Feature Selection ● Remove features that aren’t helping
        Regression models: ● Boosted Decision Tree ● Forest Regression
        Notes: Kept Filter Based Feature Selection, Boosted Decision Tree and Forest Regression.
      John (Parallel Cleansing Paths - set to 0 vs. median vs. Probabilistic PCA)
        Characteristics: ● Joined Macro Data ● Start with all fields, gradually remove ● Parallel cleansing paths
        Regression models: ● Multiple Boosted Decision Tree Models ● Poisson ● Neural Network
        Notes: Multiple simultaneous parallel paths.
  • 15. Evaluation Metrics
      | Name | Strategy | Cols | Rows | Mean Absolute Error | Root Mean Squared Error | RMSE / STDEV(price) | Relative Absolute Error | Relative Squared Error | Coefficient of Determination |
      |---|---|---|---|---|---|---|---|---|---|
      | Wenfan | Baseline | 13 | 27,909 | 1,448,475.24 | 2,505,749.58 | 0.524203184 | 0.535641 | 0.386980 | 0.6130200 |
      | Leo | Incremental add | 64 | 15,693 | 1,577,436.18 | 2,573,721.30 | 0.538422878 | 0.507266 | 0.284116 | 0.7158840 |
      | Shiva | Feature Selection Pre-processor | 21 | 30,471 | 1,390,695.31 | 2,425,862.34 | 0.507490762 | 0.521245 | 0.352367 | 0.6476330 |
      | Karunakar | Filter Based Feature Selection | 38 | 14,853 | 1,874,864.85 | 3,054,675.32 | 0.639038531 | 0.626830 | 0.439601 | 0.5603993 |
      | John | Parallel Cleansing Paths | 391 | 30,471 | 1,358,929.12 | 2,263,084.20 | 0.473437552 | 0.487444 | 0.315758 | 0.6842420 |
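For readers unfamiliar with the metric columns in slide 15, the sketch below computes them from lists of actual and predicted prices using the standard textbook definitions. It is an illustration, not the ML Studio implementation. The slides do not say which set of prices STDEV(price) was taken over, so using the population standard deviation of the same actual values is an assumption, as are the function name and the toy numbers.

```python
import math


def regression_metrics(actual, predicted):
    """Return the slide-15 metric set for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)

    rmse = math.sqrt(sq_err / n)
    stdev = math.sqrt(sq_dev / n)  # assumption: population stdev of the actual values
    rse = sq_err / sq_dev
    return {
        "mae": abs_err / n,
        "rmse": rmse,
        "rmse_over_stdev": rmse / stdev,
        "rae": abs_err / abs_dev,   # relative absolute error
        "rse": rse,                 # relative squared error
        "r2": 1.0 - rse,            # coefficient of determination
    }


# Toy values, not taken from the competition data.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0]))
```

The identity R² = 1 − RSE is visible in the table itself: for every row, the Relative Squared Error and Coefficient of Determination columns sum to 1 (up to rounding).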
  • 16. Shiva’s Pre-Processor Experiment
      Uses the Permutation Feature Importance algorithm to compute an importance score for each feature variable in the dataset.
      1. Load housing and macro data; join data.
      2. Select ALL columns; Edit Metadata (set datatype).
      3. Split Data.
      4. Add the Permutation Feature Importance module. Connections: left input is the Train Model output, right input is the dataset. Works only for regression or classification.
      5. Execute Permutation Feature Importance (40 mins).
      6. The result lists the highest-scoring features in the dataset.
  • 17. Karunakar Pre-Processor Experiment
      Boosting in a decision tree ensemble tends to improve accuracy, with some small risk of less coverage.
      1. Load housing data.
      2. Select columns; Edit Metadata (set datatype).
      3. Apply SQL transformations.
      4. Apply Filter Based Feature Selection, normalize the data, and split the data.
      5. Chose Boosted Decision Tree and Decision Forest Regression as candidates for the best predictive model.
      6. Apply Train Model and Score Model for each algorithm.
      7. Evaluate the model.
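The idea behind permutation feature importance can be sketched in a few lines: shuffle one feature column at a time and measure how much the model's error grows. This is a generic illustration, not the ML Studio module. The `predict` function, the toy rows, and the use of RMSE increase as the score are all assumptions.

```python
import math
import random


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Score each feature by the RMSE increase after shuffling its column."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    scores = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(rmse(targets, [predict(r) for r in permuted]) - baseline)
    return scores


# Hypothetical trained model that depends on feature 0 and ignores feature 1.
def predict(row):
    return 3.0 * row[0]


rows = [[float(i), float((i * 7) % 10)] for i in range(50)]
targets = [predict(r) for r in rows]
print(permutation_importance(predict, rows, targets, n_features=2))
```

A feature the model ignores scores exactly zero, while a feature the model relies on scores well above zero. Because every feature requires a full re-scoring pass, run time grows with the number of columns, which is consistent with the long execution time noted on the slide.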
  • 18. Karunakar Variation
      1. Filter Based Feature Selection (remove features that aren’t helping)
      2. Decision Forest
      Filter Based Feature Selection:
        1. Feature selection is the process of selecting the attributes (columns) in a dataset that are most relevant to the predictive modeling.
        2. Choosing the right features can improve the accuracy and efficiency of classification.
        3. The Filter Based Feature Selection module identifies the columns in the input dataset that have the greatest predictive power.
      Pearson Correlation:
        1. Pearson’s correlation statistic, or Pearson’s correlation coefficient, is also known in statistical models as the r value. For any two variables, it returns a value that indicates the strength of the correlation.
        2. Pearson's correlation coefficient is computed by taking the covariance of two variables and dividing by the product of their standard deviations. The coefficient is not affected by changes of scale in the two variables.
  • 19. Karunakar Variation
      Decision Forest Regression Model:
      Decision trees are nonparametric models that perform a sequence of simple tests for each instance, traversing a binary tree data structure until a leaf node (decision) is reached.
      Decision trees have these advantages:
        1. They are efficient in both computation and memory usage during training and prediction.
        2. They can represent non-linear decision boundaries.
        3. They perform integrated feature selection and classification and are resilient in the presence of noisy features.
      This regression model consists of an ensemble of decision trees. Each tree in a regression decision forest outputs a Gaussian distribution by way of prediction. An aggregation is performed over the ensemble of trees to find a Gaussian distribution closest to the combined distribution for all trees in the model.
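The Pearson definition above (covariance divided by the product of the standard deviations) translates directly into code. The sketch below also ranks candidate columns by the absolute value of r against the target, which is one simple form of filter-based selection. The column names and values are made up for illustration, and `rank_features` is a hypothetical helper, not an ML Studio API.

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)


def rank_features(features, target):
    """Return (name, r) pairs sorted by |r|, strongest correlation first."""
    scored = [(name, pearson_r(col, target)) for name, col in features.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical columns; price is in millions of rubles.
price = [4.1, 5.9, 7.0, 9.5, 13.8]
features = {
    "area": [35, 50, 62, 80, 120],
    "floor": [3, 1, 9, 2, 5],
}
print(rank_features(features, price))

# Rescaling a variable (e.g. square meters to square centimeters) leaves r unchanged.
area_cm2 = [a * 10_000 for a in features["area"]]
print(pearson_r(features["area"], price), pearson_r(area_cm2, price))
```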
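One common way to find a single Gaussian close to a combination of per-tree Gaussians is moment matching: treat the trees as an equally weighted mixture and keep that mixture's mean and variance. Whether ML Studio's Decision Forest Regression aggregates in exactly this way is an assumption, as are the equal weights and the function name. The sketch only illustrates the idea.

```python
def aggregate_gaussians(tree_outputs):
    """Moment-match an equal-weight mixture of (mean, variance) pairs to one Gaussian."""
    n = len(tree_outputs)
    mean = sum(mu for mu, _ in tree_outputs) / n
    # Second moment of the mixture: average of (variance + mean^2) over the trees.
    second_moment = sum(var + mu ** 2 for mu, var in tree_outputs) / n
    return mean, second_moment - mean ** 2


# Hypothetical per-tree predictions as (mean, variance) pairs.
print(aggregate_gaussians([(10.0, 1.0), (14.0, 1.0)]))
```

The combined variance is larger than any single tree's variance when the trees disagree on the mean, so disagreement between trees shows up as extra uncertainty in the aggregated prediction.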