AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and an Amazon Simple Storage Service Data Lake for Strategic Advantage in Real Estate (MAC302)
The Howard Hughes Corporation partnered with 47Lining to develop a managed enterprise data lake (EDL) based on Amazon S3. The managed EDL fuses relevant on-premises and third-party data so that Howard Hughes can answer its most valuable business questions. Their first analysis was a lead-scoring model that uses Amazon Machine Learning (Amazon ML) to predict propensity to purchase high-end real estate. The model is based on a combined set of public and private data sources, including all publicly recorded real estate transactions in the US for the past 35 years. By changing its business process for identifying and qualifying leads to use the results of data-driven analytics from its managed data lake in AWS, Howard Hughes increased the number of identified qualified leads in its pipeline by over 400% and cut the acquisition cost per lead to less than a tenth of its previous level. In this session, you will see a practical example of how to use Amazon ML to improve business results, how to architect a data lake with Amazon S3 that fuses on-premises, third-party, and public data sets, and how to train and run an Amazon ML model to attain predictive accuracy.
3. What to Expect from the Session
• How to use machine learning to improve business results
• How to architect a data lake atop S3 that fuses on-premises, 3rd-party, and public data sets
• How to commission Lakeshore Analytics in Amazon Redshift
• Strategies for development of summaries and aggregates for Amazon Machine Learning
• Training and running Amazon Machine Learning to attain predictive accuracy
6. About
• Seaport District (New York, NY)
• Ward Village (Honolulu, HI)
• The Woodlands (The Woodlands, TX)
• Downtown Summerlin (Summerlin, NV)
• Summerlin (Summerlin, NV)
• Downtown Columbia (Columbia, MD)
7. About Real Estate
• Capital intensive
• Lots of human touch
• Micro/macro exposure
• Long product cycles
• Commodity offering
• Fragmented market
8. “Big Data” problems we want to solve.
• Can we more accurately predict trends?
• Can we better forecast product demand?
• Can we speed up our sales cycles & time to money?
• Can we more accurately assess value & price?
• Can we use non-traditional data to find causality & correlation?
9. The Team’s Task:
Design a scalable solution that’s cost effective.
1. Combine large public & private data sources.
2. Perform lots of complex joins, simply.
3. Improve the company’s data hygiene.
4. Enrich pricing & valuation models.
5. Build proprietary models & sources.
10. And… do all of this without adding labor costs or exploding our infrastructure & license costs.
11. The big mental shifts…
Talent in the group is a profit center, not a cost center.
Our data is a key asset, worth a true dollar figure.
func(digital != “IT”);
13. Luxury Leads Business Requirements:
The test:
Can we accurately identify potential buyers using data?
The conditions:
• New luxury product in an untested market.
• Need new leads beyond in-bound requests.
• Must drive down our cost per lead.
• Build a machine to provide continuous insight.
14. Luxury Leads High-Level Process
[Diagram: whole-US and target-market transaction sets (Transactions A and B) are combined through a "Union View" of the sourced data; augmented features/signatures are generated for clustering and ML; cluster analysis yields clusters, segments, and personas, which refine engagement mechanisms for both the US and the target market; a machine learning propensity predictor scores leads (historic TAM) into a leads database and a proactive call list.]
16. What is a Data Lake?
A "Data Lake" is a repository that holds raw data in its native format until it is needed by downstream analytics processes.
17. Why is S3 a Natural Fit for Data Lakes?
No need to build a complicated stack
Simplicity = Freedom
• Inherent redundancy at low cost
• Massively parallel IO
• Separate storage from compute
• Integration with Amazon Redshift, EMR, etc.
18. Can I Just Put Data in S3 and Call it a Lake?
Unmanaged Lake = Swamp
• How do you find things?
• Does everyone just have access to everything?
• Does all data stay there forever?
19. What Separates a Data Lake from a Data Swamp?
• Intelligent use of storage conventions in S3
• Fine-grained permissions to contribute, discover, transform, and consume data
• Data ingest standards: defining and enforcing data governance processes
• Metadata to enable search and discovery
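The "storage conventions" point can be made concrete with a small sketch. The zone names, dataset name, and date-partitioned key layout below are illustrative assumptions, not the convention the talk used; the idea is simply that a predictable prefix scheme enables prefix-scoped IAM policies and date-based pruning by downstream tools.

```python
from datetime import date

def managed_dataset_key(zone: str, dataset: str, d: date, filename: str) -> str:
    """Build an S3 key under a hypothetical lake convention:
    <zone>/<dataset>/year=YYYY/month=MM/day=DD/<file>.
    Partitioned prefixes support fine-grained permissions and let
    readers prune by date instead of scanning the whole dataset."""
    assert zone in {"raw", "managed", "published"}, "unknown lake zone"
    return (f"{zone}/{dataset}/"
            f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/{filename}")

key = managed_dataset_key("managed", "us_re_transactions",
                          date(2016, 11, 30), "part-0000.parquet")
```

With a scheme like this, a governance policy can grant a contributor write access only to `raw/<dataset>/` while consumers read only from `published/`.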
20. Data Lake Reference Architecture
[Architecture diagram: Data contributors (owned/on-prem systems, 3rd-party partners and vendors, customers) submit raw, untransformed data in batch or stream via S3 and Amazon Kinesis submissions. Ingest workers and loaders (SQS queue, Lambda worker tier) move submissions into managed datasets in Amazon S3, where data and metadata support schema-on-read usage, with S3 also holding work in progress. Published data (indices, history) is indexed and made consumable via an HA DataLake API, backed by Amazon DynamoDB (discovery views, HA published results), Amazon CloudSearch (facets, indices, views), and RDS for UI, app, and API state. Data lake governors enforce rules, policies, and entitlements for contribute, manage, transform, and access operations, with rule-driven incremental loads, transforms, cataloging/indexing, and publishing; identity and security rest on single sign-on and unified policy-based entitlements (AWS Directory Service roles, AWS IAM permissions, AWS KMS). Data consumers (B2E | B2B | B2C direct users, business processes, API users) search, manage, and consume through DataLake web UIs on Elastic Beanstalk. Agile Lakeshore Analytics runs on Amazon Redshift on-demand warehouses, on-demand Hadoop/Spark (Elastic MapReduce, Qubole), Amazon Machine Learning, and R Studio, with data management and orchestration via AWS Data Pipeline and Airflow, and BI/visualization via Tableau Server and Amazon QuickSight. Monitoring: Amazon CloudWatch, AWS CloudTrail, CloudCheckr, DataDog.]
21. Luxury Leads High-Level Process
High-Level Data Flow & Ops Model
[Diagram: owned and sourced data flow from data contributors through defined submission mechanisms into submission datasets, a "Union View", and analysis WIP, using Amazon Kinesis, S3, EMR, and Amazon Redshift to generate clustering & ML features.]
Data lake governors:
• Define datasets managed within the data lake
• Define submission mechanisms for each dataset
• Manage submission & access entitlements
• Govern costs associated with datasets & Lakeshore Analytics
• Work with business owners to define required Lakeshore Analytics
Data contributors:
• Submit data using the defined submission mechanisms
Lakeshore Analytics:
• Consume datasets from the data lake
• Use analysis WIP
• Manufacture published results
22. Luxury Leads High-Level Process
Generate Clustering & ML Features
[Diagram: extracted features from the "Union View" of sourced and owned data split into clustering dimensions (with a distance heuristic) and profiling dimensions; R cluster analysis (hierarchical | model-based), driven by the feature definitions, feeds segment analysis; N distinct buyer personas emerged from the leads cluster analysis.]
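The clustering step can be sketched in miniature. The talk performed this in R (hierarchical and model-based clustering); below is a toy single-linkage agglomerative clustering in pure Python over hand-made buyer "signatures", purely to illustrate how a distance heuristic over clustering dimensions collapses candidates into persona groups. The feature tuples and distance choice are illustrative assumptions.

```python
import math

def euclid(a, b):
    """Distance heuristic over numeric buyer-signature features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, n_clusters):
    """Toy agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest, until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclid(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge j into i
        del clusters[j]
    return clusters

# Toy signatures: (number of purchases, average price in $M)
signatures = [(1, 0.3), (2, 0.4), (9, 2.5), (8, 2.2), (1, 0.2)]
personas = single_linkage(signatures, 2)
```

At production scale this would run over the full feature set in R or Spark, but the mechanics are the same: a distance heuristic plus a linkage rule yields the persona groupings.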
23. Luxury Leads High-Level Process
Buyer Personas for Marketing
Descriptive analytics on buyer personas = ability to refine engagement models
[Chart: example persona profile, Cluster 9]
24. Luxury Leads High-Level Process
Lead Scoring
1. Train the Model
[Diagram: ML training inputs per qualified candidate, with history "rewound" to a cutoff: transaction history, buy/sell quantities, property types, time, and 3rd-party data, drawn from US real estate activity (all buyers & sellers, all transaction types, past 30 years) and per-candidate statistics (number & size of purchases/sales, locations, …). For each candidate, the model predicts the total amount of future real estate purchases, expressed as a percentile rank from 0% (bought nothing) through "bought little" to 100% (bought most). The model predicts a candidate's rank to within +/- 20% of the actual rank 70% of the time. The training process detects complex patterns in the training inputs that the model uses to make predictions; these patterns are not available externally.]
Train Model → Generate Predictions
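The training target described above, a percentile rank of each candidate's total purchases after a "rewind" cutoff, can be sketched as follows. The record shapes, field names, and cutoff logic are illustrative assumptions, not the actual pipeline.

```python
from datetime import date

def percentile_rank_labels(candidates, cutoff):
    """For each candidate, sum purchase amounts strictly after the cutoff
    date (the 'future' the model must learn to predict) and convert the
    totals into a 0-100 percentile rank: 0 = bought nothing in the future
    window, 100 = bought the most."""
    futures = {
        name: sum(amt for d, amt in txns if d > cutoff)
        for name, txns in candidates.items()
    }
    ordered = sorted(futures, key=futures.get)   # ascending by future total
    n = len(ordered) - 1
    return {name: 100 * i / n for i, name in enumerate(ordered)}

txns = {
    "a": [(date(2014, 1, 1), 500_000), (date(2016, 6, 1), 2_000_000)],
    "b": [(date(2013, 5, 1), 300_000)],   # no purchases after the cutoff
    "c": [(date(2015, 9, 1), 750_000), (date(2016, 2, 1), 900_000)],
}
labels = percentile_rank_labels(txns, cutoff=date(2015, 1, 1))
```

Only pre-cutoff history would be fed to the model as input features; the post-cutoff rank is what it is trained to predict.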
25. Lead Scoring
2. Use the Model
[Diagram: current data per qualified candidate (transaction history, buy/sell quantities, property types, time, 3rd-party data) feeds the trained model to produce a predicted rank of candidates on the same 0-100% scale, from "will buy nothing" through "will buy some" to "will buy most"; sales focus is on candidates ranked above 60%. Current data and rank predictions are re-generated each night.]
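The nightly "use the model" step then reduces to a filter over predicted ranks. The 60% sales-focus threshold comes from the slide; the data shapes here are assumptions for illustration.

```python
def sales_focus_list(predicted_ranks, threshold=60.0):
    """Keep candidates whose predicted percentile rank (toward the
    'will buy most' end of the scale) meets the sales-focus threshold,
    ordered highest-propensity first."""
    focus = [(name, r) for name, r in predicted_ranks.items() if r >= threshold]
    return sorted(focus, key=lambda nr: nr[1], reverse=True)

ranks = {"a": 92.0, "b": 14.0, "c": 61.5, "d": 59.9}
call_list = sales_focus_list(ranks)
```

Because the ranks are regenerated each night, the call list stays current without any manual re-scoring.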
26. Scored Leads for Sales Team
Scored call list of real people who have bought high-end real estate, tied to 9 personas
28. Improving ML predictive accuracy
…use Amazon Redshift to extract use-case-specific features
Examples:
• Aggregate computation, e.g., average consumption per month/year
• Periodic behavior frequency extraction
• Volatility analysis & extraction
• Time-series difference analysis (e.g., average time between A and B, time-adjusted values)
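The talk computed such features in Redshift SQL; the same time-series-difference idea can be sketched in Python as the average gap between a "sell" event (A) and the next "buy" event (B) per candidate. The event names and record shape are illustrative assumptions.

```python
from datetime import date

def avg_sell_to_buy_days(events):
    """events: chronologically sorted (date, kind) pairs, kind in
    {'sell', 'buy'}. Returns the average gap in days from each sell to
    the next buy, or None if the candidate never sold then bought."""
    gaps, pending_sell = [], None
    for d, kind in events:
        if kind == "sell":
            pending_sell = d
        elif kind == "buy" and pending_sell is not None:
            gaps.append((d - pending_sell).days)
            pending_sell = None
    return sum(gaps) / len(gaps) if gaps else None

history = [(date(2015, 1, 1), "sell"), (date(2015, 1, 31), "buy"),
           (date(2015, 6, 1), "sell"), (date(2015, 7, 11), "buy")]
feature = avg_sell_to_buy_days(history)
```

In the warehouse this would be a window-function query over the transaction table, producing one such feature column per candidate for the ML training set.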
29. Amazon Redshift + Amazon Machine Learning
…better together
Time-series difference analysis example
[Diagram: raw behavior timelines (events A, B, C, D occurring from today back to the recent past, or back to a long time ago) are distilled by Amazon Redshift into per-candidate features such as net value, frequency, and sell-to-buy hold time; the features feed Amazon ML, and results flow back into Amazon Redshift.]
30. Technical Benefits of Approach:
• Managed services that "just work", providing speed, agility, and scale
• Amazon ML delivered higher predictive accuracy for propensity to buy
32. Business benefits of approach:
• Extensible.
• Adaptive.
• Open standards. Can work with lots of partners.
• On demand.
• Ever-growing talent pool.