The document summarizes experiments conducted by Team D-Hawks to predict real estate prices in Moscow using data from a Kaggle competition. It describes five experiments that varied the features used, the data cleaning techniques, and the machine learning models. The winning experiment used parallel data cleaning paths and multiple boosted decision tree models to achieve the lowest root mean squared error. The team's work demonstrated that feature selection, additional data sources, and testing multiple approaches can improve price prediction accuracy.
Predicting Moscow Real Estate Prices with Azure Machine Learning
1. Predicting Real Estate Prices in Moscow
A Kaggle Competition
University of Washington Professional & Continuing Education
BIG DATA 220B SPRING 2017 FINAL PROJECT
Team D-Hawks
Leo Salemann, Karunakar Kotha, Shiva Vuppala, John Bever, Wenfan Xu
Keywords: Big Data, Kaggle, Machine Learning, Azure ML Studio, Boosted Decision Tree, Neural Network, Regression, Tableau
2. Problem Description & Datasets
Input Data      | Description                                 | Features | Observations
Housing Data    | Property, neighborhood, sales date & price  | 292      | 30,473
Macroeconomics  | Daily commodity prices, indicators like GDP | 100      | 2,485
Data Dictionary | Feature definitions                         |          |
Shapefiles      | Spatial data for maps                       |          |
4. Azure ML Studio Experiments - Variations
Wenfan - Baseline
● Basic 12 real estate features
● Tried 4 regression models, kept 2
Cols: 13 | Rows: 27,909 | RMSE: 2,505,749.58 | RMSE/STDEV(price): 0.524203184

Leo - Incremental add
● Incrementally add more real estate features
● Omit macroeconomic features
● Detailed Human-in-The-Loop process
Cols: 64 | Rows: 15,693 | RMSE: 2,573,721.30 | RMSE/STDEV(price): 0.538422878

Shiva - Feature Selection Pre-processor
● Separate Experiment for Feature Selection (Permutation Feature Importance)
● Joined Macro Data
● Added Retail-specific Features
● Added Decision Forest Regression Module
Cols: 21 | Rows: 30,471 | RMSE: 2,425,862.34 | RMSE/STDEV(price): 0.507490762

Karunakar - Filter Based Feature Selection
● Filter Based Feature Selection
● Boosted Decision Tree
● Decision Forest Regression
Cols: 38 | Rows: 14,853 | RMSE: 3,054,675.32 | RMSE/STDEV(price): 0.639038531

John - Parallel Cleansing Paths
● Joined Macro Data
● Start with all fields, gradually remove
● Parallel cleansing paths (set to zero; set to median; Probabilistic PCA)
Cols: 391 | Rows: 30,471 | RMSE: 2,263,084.20 | RMSE/STDEV(price): 0.473437552
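For reference, the RMSE/STDEV(price) column is simply the RMSE normalized by the standard deviation of the sale price. A minimal Python sketch of both metrics (the prediction values are made up for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Hypothetical actual vs. predicted sale prices (rubles)
y_true = np.array([5_200_000, 7_800_000, 4_100_000, 9_500_000])
y_pred = np.array([5_000_000, 8_100_000, 4_500_000, 9_000_000])

score = rmse(y_true, y_pred)
normalized = score / np.std(y_true)  # RMSE / STDEV(price), as in the table
print(f"RMSE: {score:,.2f}  RMSE/STDEV(price): {normalized:.4f}")
```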
5. The Winning Experiment
1. Collect Data
2. Clean Missing Data - try three modes (one per parallel path)
   a. Custom Value Substitution (a fixed value, i.e. 0)
   b. Replace with Mean
   c. Replace using Probabilistic PCA
3. Clip, Normalize, Split (same for all 3 paths)
   - Handling categorical & continuous variables
   - Outlier clipping (per-value; not via SQL)
   - Data normalization / feature scaling
4. Train & Evaluate - compare three different models
   a. Poisson Regression
   b. Neural Network Regression
   c. Boosted Decision Tree Regression
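A rough scikit-learn analogue of these parallel cleansing paths, for readers without Azure ML Studio. Azure ML's Probabilistic PCA imputer has no direct scikit-learn equivalent, so IterativeImputer is used as a model-based stand-in, GradientBoostingRegressor stands in for Boosted Decision Tree Regression, and the data is synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic data: 500 rows, 10 features, ~10% missing values
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)
X[rng.random(X.shape) < 0.10] = np.nan

# Three parallel cleansing paths, as in the winning experiment
paths = {
    "set to zero": SimpleImputer(strategy="constant", fill_value=0.0),
    "replace with mean": SimpleImputer(strategy="mean"),
    "model-based (stand-in for Probabilistic PCA)": IterativeImputer(random_state=0),
}

for name, imputer in paths.items():
    pipe = make_pipeline(imputer, GradientBoostingRegressor(random_state=0))
    score = -cross_val_score(pipe, X, y, cv=3,
                             scoring="neg_root_mean_squared_error").mean()
    print(f"{name:45s} RMSE = {score:.3f}")
```

Running all three imputation strategies through the same downstream pipeline is exactly what made the parallel-paths layout cheap to compare in ML Studio.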
11. Conclusion & Further Work
TWL (Today We Learned)
● Azure ML Studio is great for trying multiple techniques in parallel (try that in Python!)
● Many ways to approach the problem.
○ Effort required varies a lot …
○ So does the quality of the results.
Next time …
● Watch those row counts … did you lose any?
● Deploy Web Service earlier and more often.
Someday/One day …
● Use different models for different subclasses of real estate.
14. Azure ML Studio Experiments - Variations
Wenfan - Baseline
● Basic 12 real estate features
Regression models: Boosted Decision Tree, Neural Network, Bayesian Linear, Linear
Notes: Kept Boosted Decision Tree and Neural Network; dropped the others.

Leo - Incremental add
● Incrementally add more real estate features
● Omit macroeconomic features
Regression models: Boosted Decision Tree, Neural Network
Notes: Detailed Human-In-The-Loop (HITL) process.

Shiva - Feature Selection Pre-processor, add Macro & Retail
● Joined Macro Data
● Added Retail-specific Features
Regression models: Boosted Decision Tree, Decision Forest Regression
Notes: Separate Experiment for Feature Selection (Permutation Feature Importance).

Karunakar - Filter Based Feature Selection
● Filter Based Feature Selection
● Remove features that aren't helping
Regression models: Boosted Decision Tree, Decision Forest Regression
Notes: Kept Filter Based Feature Selection, Boosted Decision Tree, and Decision Forest Regression.

John - Parallel Cleansing Paths (set to 0 vs. median vs. Probabilistic PCA)
● Joined Macro Data
● Start with all fields, gradually remove
● Parallel cleansing paths
Regression models: Multiple Boosted Decision Tree models, Poisson, Neural Network
Notes: Multiple simultaneous parallel paths.
16. Shiva’s Pre-Processor Experiment
Uses the Permutation Feature Importance algorithm to compute importance scores for each feature variable in the dataset.
1. Load housing and macro data; join the datasets.
2. Select ALL columns; Edit Metadata (set data types).
3. Split Data.
4. Add the Permutation Feature Importance module. Connections: left input = Train Model, right input = Dataset. Works only for regression or classification.
5. Execute Permutation Feature Importance (~40 mins).
6. The result lists the highest-scoring features in the dataset.
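scikit-learn's permutation_importance computes the same kind of score as the Azure ML module, by shuffling one column at a time and measuring the drop in model performance. A minimal sketch on synthetic data (the feature matrix is a hypothetical stand-in for the joined housing + macro table):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only columns 0 and 1 actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each column in turn and measure the drop in score
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```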
17. Karunakar Pre-Processor Experiment
A boosted decision tree ensemble tends to improve accuracy, with some small risk of reduced coverage.
1. Load housing data.
2. Select columns; Edit Metadata (set data types).
3. Apply SQL transformations.
4. Filter Based Feature Selection; normalize data; split data.
5. Chose Boosted Decision Tree and Decision Forest Regression to find the best predictor.
6. Apply Train Model and Score Model for each algorithm.
7. Evaluate the model.
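A compact scikit-learn sketch of steps 4-7, under stated assumptions: SelectKBest with an f_regression (Pearson-style) score stands in for Azure ML's Filter Based Feature Selection, GradientBoostingRegressor and RandomForestRegressor stand in for the two Azure ML models, and the data is synthetic:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic housing stand-in: 20 features, only a few informative
rng = np.random.default_rng(1)
X = rng.normal(size=(800, 20))
y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=800)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for name, model in [("Boosted Decision Tree", GradientBoostingRegressor(random_state=1)),
                    ("Decision Forest", RandomForestRegressor(random_state=1))]:
    # Filter-based feature selection, then normalize, then train and score
    pipe = make_pipeline(SelectKBest(f_regression, k=8), MinMaxScaler(), model)
    pipe.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, pipe.predict(X_test)) ** 0.5
    print(f"{name}: RMSE = {rmse:.3f}")
```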
18. Karunakar Variation
1. Filter Based Feature Selection (remove features that aren't helping)
2. Decision Forest
Filter Based Feature Selection:
1. Feature selection is the process of selecting the attributes (columns) in a dataset that are most relevant to predictive modeling.
2. Choosing the right features can improve the accuracy and efficiency of classification.
3. The Filter Based Feature Selection module identifies the columns in the input dataset that have the greatest predictive power.
Pearson Correlation:
1. Pearson's correlation coefficient, also known in statistical models as the r value, returns a value indicating the strength of the correlation between any two variables.
2. It is computed by dividing the covariance of the two variables by the product of their standard deviations, so the coefficient is not affected by changes of scale in either variable.
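As a quick check of that formula, a minimal sketch comparing the covariance/standard-deviation computation against numpy's built-in (the data is made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x, so r should be near 1

# r = cov(x, y) / (std(x) * std(y))
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both values agree

# Scale invariance: rescaling x does not change r
print(np.corrcoef(100 * x + 7, y)[0, 1])  # same value as above
```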
19. Karunakar Variation
Decision Forest Regression Model:
Decision trees are nonparametric models that perform a sequence of simple tests for each instance, traversing a binary tree data structure until a leaf node (decision) is reached.
Decision trees have these advantages:
1. They are efficient in both computation and memory usage during training and prediction.
2. They can represent non-linear decision boundaries.
3. They perform integrated feature selection and classification and are resilient in the presence of noisy features.
This regression model consists of an ensemble of decision trees. Each tree in a regression decision forest outputs a Gaussian distribution as its prediction. An aggregation is performed over the ensemble of trees to find the Gaussian distribution closest to the combined distribution of all trees in the model.
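A rough illustration of that per-tree aggregation, with assumptions: scikit-learn's RandomForestRegressor stands in for Azure ML's Decision Forest, and the per-point Gaussian is summarized by the mean and variance of the individual tree predictions rather than Azure ML's exact aggregation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data: noisy sine wave
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

forest = RandomForestRegressor(n_estimators=100, random_state=7).fit(X, y)

# Collect every individual tree's prediction for one new point
x_new = np.array([[1.5]])
per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])

# Aggregate the ensemble into a single Gaussian (mean, variance)
mu, sigma2 = per_tree.mean(), per_tree.var()
print(f"prediction ~ N(mu={mu:.3f}, var={sigma2:.4f})")
```

The variance across trees gives a crude uncertainty estimate alongside the point prediction, which is the practical payoff of the Gaussian view described above.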