Real-estate analytics: A Vietnam case study
Viet-Trung Tran
School of Communication and Information Technology
Hanoi University of Science and Technology
Outline
• Problem
• Where big data analytics can help
• Geographically weighted regression for
property appraisal
• Conclusion
Problem
• A national database is needed to support investors and home buyers.
– "After more than 20 years of establishment and development, information on Vietnam’s real estate market is still ranked low on transparency"
Where is my data?
• The good
– Property listings are mostly public on the Internet
• The bad
– Thousands of sites
– Semi-structured text that requires NLP
• The ugly
– Spam and duplication
– Fake or incorrect listings, low data quality
There is a boom in trading floors, and many use tricks similar to those adopted by multi-level marketing companies, such as sending messages to customers and providing misleading information about real estate products, causing price bubbles.
News site ABC
News site XYZ
Vietnam real estate vs. stock market
• Real estate
– 300 billion USD (FPT Securities, 2015)
– Lack of high-quality data, tons of scams
– Weak governmental control
– No national database
• Stock market
– 33 billion USD (quandl.com)
– Clear reports and plots, curated data
– Strong governmental control
– Centralized, real-time monitoring
Vietnam real estate vs. e-commerce goods
• Real estate
– High value, high ROI
– Immobile
• E-commerce goods
– Low value, no ROI
– Mobile, disappear over time
Vietnam property listings are advertised in the same manner as fridges and TVs
Where big data analytics can help
• Index the entire real estate market
– 8.5 million listings to date (02/2017)
• Deliver real-time market insights
– Powered by machine learning and Vietnamese language processing
• Market data transparency for all
• Save time and avoid overpaying, for buyers
Big data processing
• Pipeline components: crawlers, QC (filters/deduplication), natural language processing, big data processing, distributed database
• Outputs: report, chatbot, website
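The QC step (filters/deduplication) can be illustrated with a minimal sketch. It assumes that reposted listings differ only in case, punctuation and spacing, which is a simplification; the function names `fingerprint` and `deduplicate` are hypothetical and not taken from the talk.

```python
import hashlib
import re
import unicodedata


def fingerprint(listing_text):
    """Return a stable hash of a listing after light normalization."""
    # NFC makes visually identical Vietnamese strings compare equal.
    text = unicodedata.normalize("NFC", listing_text).lower()
    # Collapse punctuation and whitespace, which reposts often change.
    text = re.sub(r"[^\w]+", " ", text).strip()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def deduplicate(listings):
    """Keep the first listing seen for each fingerprint (assumed policy)."""
    seen = set()
    unique = []
    for listing in listings:
        key = fingerprint(listing)
        if key not in seen:
            seen.add(key)
            unique.append(listing)
    return unique
```

Exact hashing only catches near-verbatim reposts; listings that were reworded would need a similarity-based method instead.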
Vietnamese language processing
• Tasks
– Named Entity Recognition (NER)
– Vietnamese address normalization (Critical!)
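As a rough illustration of address normalization, the sketch below strips diacritics and expands a few common abbreviations so that differently written addresses map to the same key. The abbreviation table and the function names are assumptions made for this example; the system described in the talk is not specified at this level of detail.

```python
import re
import unicodedata

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {
    "p": "phuong",      # ward
    "q": "quan",        # district
    "tp": "thanh pho",  # city
}


def strip_diacritics(text):
    """Remove Vietnamese tone and vowel marks."""
    # The letter d-with-stroke does not decompose under NFD, so map it by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_address(raw):
    """Lowercase, strip diacritics, drop separators, expand abbreviations."""
    text = strip_diacritics(raw).lower()
    text = re.sub(r"[.,]", " ", text)
    tokens = text.split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)
```

Stripping diacritics makes matching easier but can merge distinct place names, so a production system would likely keep the accented form alongside the key.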
Big data processing
• Tasks
– Price timelines for every road, ward, district and city
– Automatic property appraisal
– More analytics to come
• About our data
– 8.5 million listings (to date)
– Stored on HBase
– Processed on Spark
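A price timeline can be sketched as a grouped aggregation. The example below computes the median price per square metre for each district and month in plain Python; the record fields (`district`, `date`, `price`, `area_m2`) and the choice of the median are assumptions for illustration, and on the real data the same grouping would run on Spark.

```python
from collections import defaultdict
from statistics import median


def price_timeline(listings):
    """Map (district, 'YYYY-MM') to the median price per square metre."""
    buckets = defaultdict(list)
    for item in listings:
        month = item["date"][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        unit_price = item["price"] / item["area_m2"]
        buckets[(item["district"], month)].append(unit_price)
    # The median is less sensitive than the mean to mispriced listings.
    return {key: median(values) for key, values in sorted(buckets.items())}
```

The same function works for roads, wards or cities by changing the grouping key.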
Prototype (to date)
Automatic property appraisal
• Tran, Hung Tien, Hiep Tuan Nguyen, and Viet-Trung Tran. "Large-scale
geographically weighted regression on Spark." Knowledge and Systems
Engineering (KSE), 2016 Eighth International Conference on. IEEE, 2016.
GWR + Spark =
– Large-scale spatial data
– Improved performance
– Distributed
Background
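• First Law of Geography (Waldo Tobler): “Everything is related to everything else, but near things are more related than distant things.”
• GWR model
– Each location $u$ has its own coefficients:
$$y_i(u) = \beta_{0i}(u) + \beta_{1i}(u)\,x_{1i} + \beta_{2i}(u)\,x_{2i} + \dots + \beta_{mi}(u)\,x_{mi}$$
– The coefficients are estimated by weighted least squares, where $W(u)$ is the diagonal matrix of kernel weights at $u$:
$$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$$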
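The estimator above can be written out for a single location as a small, dependency-free sketch. It builds $X^T W X$ and $X^T W Y$ from per-point weights and solves the resulting system by Gaussian elimination. The function names and the explicit intercept column are choices made for this example, not the authors' implementation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T W X is singular for this location")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def weighted_least_squares(features, targets, weights):
    """Return beta = (X^T W X)^{-1} X^T W Y, with an intercept column added."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for row, y, w in zip(rows, targets, weights):
        for i in range(k):
            xtwy[i] += w * row[i] * y
            for j in range(k):
                xtwx[i][j] += w * row[i] * row[j]
    return solve_linear_system(xtwx, xtwy)
```

GWR repeats this fit once per regression point, each time with weights computed from the distances to that point.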
Background
• Kernel function
– Gaussian function
• Bandwidth
– Fixed bandwidth
– Adaptive bandwidth
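The difference between the two bandwidth types can be sketched as follows. The Gaussian form $w_j = \exp(-\tfrac{1}{2}(d_j/h)^2)$ is one common convention and is assumed here, as is the k-nearest-neighbour rule for the adaptive bandwidth; the slides do not state which variants were used.

```python
import math


def gaussian_weights(point, data_points, bandwidth):
    """w_j = exp(-0.5 * (d_j / h)^2) using planar Euclidean distance d_j."""
    return [
        math.exp(-0.5 * (math.dist(point, p) / bandwidth) ** 2)
        for p in data_points
    ]


def adaptive_bandwidth(point, data_points, k):
    """Distance from `point` to its k-th nearest data point (k >= 1)."""
    distances = sorted(math.dist(point, p) for p in data_points)
    return distances[k - 1]
```

A fixed bandwidth uses the same `bandwidth` everywhere, while an adaptive one calls `adaptive_bandwidth` per regression point, so the kernel widens where data are sparse and narrows where they are dense.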
Problem
• Estimating a local model
• Bandwidth selection
– Which bandwidth is good?
• Model evaluation
– Choosing a kernel function
$$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$$
Source: http://rose.bris.ac.uk
$O(n^3)$
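One common way to answer "which bandwidth is good" is leave-one-out cross-validation, sketched below. This is an assumption about the selection method, not something the slides state. To stay short, the example uses one-dimensional locations and an intercept-only local model (a kernel-weighted mean), which is a simplification of full GWR.

```python
import math


def local_mean(x0, xs, ys, bandwidth, skip=None):
    """Gaussian-weighted mean of ys around x0, optionally skipping one point."""
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        num += w * y
        den += w
    return num / den


def cv_score(xs, ys, bandwidth):
    """Sum of squared leave-one-out prediction errors for one bandwidth."""
    return sum(
        (ys[i] - local_mean(xs[i], xs, ys, bandwidth, skip=i)) ** 2
        for i in range(len(xs))
    )


def select_bandwidth(xs, ys, candidates):
    """Return the candidate bandwidth with the lowest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(xs, ys, h))
```

Every candidate bandwidth requires refitting a local model at each point, which is part of why bandwidth selection is expensive on large datasets.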
Problem
• How can the model be applied to large-scale data?
– Data points
– Features
– Regression points
Large-Scale GWR on Spark
• Why Spark?
– In-memory cluster-computing platform
– Parallel programming
– Resilient distributed datasets
Large-Scale GWR on Spark
• We propose three approaches to scaling GWR
– Scaling Weighted Linear Regression
– Parallel multiple WLR models
– Parallel Geographically Weighted Regression (combines the first two approaches)
Scalable GWR on Spark
• Naïve approach: scaling Weighted Linear Regression (WLR)
– For each regression point (regPoint): compute weights, fit a weighted linear regression, summarize the model
– Weights are computed in parallel
– Each WLR model is computed in parallel
Scalable GWR on Spark
• Parallel multiple WLR models
– Inputs: regression dataset and training dataset
– Compute weights, then fit multiple WLR models in parallel
– Summarize the models
Scalable GWR on Spark
• Parallel Geographically Weighted Regression
– The regression dataset (R) and the training dataset (T) are combined into one dataset (RT)
– Distributed GWR computation on the combined dataset
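The combine-then-compute idea can be imitated on one machine as a rough sketch. Here `itertools.product` stands in for a Cartesian join such as Spark's `RDD.cartesian`, and the per-key accumulation stands in for `reduceByKey`. The single-feature model, the tuple layouts and the Gaussian kernel are assumptions made to keep the example short; the paper's actual data flow may differ.

```python
import math
from itertools import product


def pair_statistics(reg_point, train_row, bandwidth):
    """Weighted sums contributed by one (regression point, training row) pair."""
    rx, ry = reg_point
    tx, ty, feature, label = train_row
    w = math.exp(-0.5 * (math.hypot(rx - tx, ry - ty) / bandwidth) ** 2)
    # Entries of X^T W X and X^T W Y for the model y = b0 + b1 * feature.
    return (w, w * feature, w * feature * feature, w * label, w * feature * label)


def gwr_by_cartesian(reg_points, training, bandwidth):
    """Fit one local model per regression point from the combined pairs."""
    sums = {}
    for r, t in product(reg_points, training):  # combined dataset RT
        stats = pair_statistics(r, t, bandwidth)
        acc = sums.get(r)
        sums[r] = stats if acc is None else tuple(a + b for a, b in zip(acc, stats))
    models = {}
    for r, (s_w, s_wx, s_wxx, s_wy, s_wxy) in sums.items():
        det = s_w * s_wxx - s_wx * s_wx  # determinant of the 2x2 X^T W X
        slope = (s_w * s_wxy - s_wx * s_wy) / det
        intercept = (s_wy - slope * s_wx) / s_w
        models[r] = (intercept, slope)
    return models
```

Because the weighted sums are additive, partial results can be computed on different workers and merged, which is what makes this formulation a fit for a distributed engine.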
Experiments
• Environment
– Cluster: 8 nodes on Amazon Web Services
• 4 cores, Intel Xeon E5-2670 v2 2.5 GHz
• 16 GB RAM, 2x40 GB SSD
• Hadoop 2.7.2 and Spark 1.6.1
– Dataset
|-- x: double (nullable = false)
|-- y: double (nullable = false)
|-- label: double (nullable = false)
|-- features: vector (nullable = false)
Large training dataset
[Figure: running time (sec) vs. number of training points (10,000 to 5,000,000) for Distributed WLR computation, Parallel WLR, Distributed GWR NE and Distributed GWR GD]
Large regression dataset
[Figure: running time (sec) vs. number of regression points (1,000 to 50,000) for Distributed WLR computation, Parallel WLR, Distributed GWR NE and Distributed GWR GD]
Cluster performance
[Figure: running time (sec) on 2-node, 4-node and 8-node clusters for Distributed WLR computation, Parallel WLR, Distributed GWR NE and Distributed GWR GD]
Land value prediction (GWR)
Land value heat map
Conclusion
• Vietnam real-estate analytics just works!
– Large-scale crawlers
– Big data processing
– Specialized NLP for the listing corpus
• However
– A lot of value in the data remains undiscovered
– A lot of room for improvement and further research
Call for collaboration!
Thanks for your attention!
trungtv@soict.hust.edu.vn

giasan.vn real-estate analytics: a Vietnam case study


Editor's notes

  1. Scalability, performance, user-friendly APIs