AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and an Amazon Simple Storage Service Data Lake for Strategic Advantage in Real Estate (MAC302)
The Howard Hughes Corporation partnered with 47Lining to develop a managed enterprise data lake (EDL) based on Amazon S3. The managed EDL fuses relevant on-premises and third-party data so that Howard Hughes can answer its most valuable business questions. Their first analysis was a lead-scoring model that uses Amazon Machine Learning (Amazon ML) to predict propensity to purchase high-end real estate. The model is based on a combined set of public and private data sources, including all publicly recorded real estate transactions in the US for the past 35 years. By changing its business process for identifying and qualifying leads to use the results of data-driven analytics from its managed data lake in AWS, Howard Hughes increased the number of identified qualified leads in its pipeline by over 400% and cut the acquisition cost per lead to less than a tenth of its previous level. In this session, you will see a practical example of how to use Amazon ML to improve business results, how to architect a data lake with Amazon S3 that fuses on-premises, third-party, and public data sets, and how to train and run an Amazon ML model to attain predictive accuracy.
3. What to Expect from the Session
• How to use machine learning to improve business results
• How to architect a data lake atop S3 that fuses on-premises, 3rd-party, and public data sets
• How to commission Lakeshore Analytics in Amazon Redshift
• Strategies for development of summaries and aggregates for Amazon Machine Learning
• Training and running Amazon Machine Learning to attain predictive accuracy
6. About
• Seaport District (New York, NY)
• Ward Village (Honolulu, HI)
• The Woodlands (The Woodlands, TX)
• Downtown Summerlin (Summerlin, NV)
• Summerlin (Summerlin, NV)
• Downtown Columbia (Columbia, MD)
7. About Real Estate
• Capital intensive
• Lots of human touch
• Micro/macro exposure
• Long product cycles
• Commodity offering
• Fragmented market
8. “Big Data” problems we want to solve.
• Can we more accurately predict trends?
• Can we better forecast product demand?
• Can we speed up our sales cycles & time to money?
• Can we more accurately assess value & price?
• Can we use non-traditional data to find causality & correlation?
9. The Team’s Task:
Design a scalable solution that’s cost effective.
1. Combine large public & private data sources.
2. Perform lots of complex joins, simply.
3. Improve the company’s data hygiene.
4. Enrich pricing & valuation models.
5. Build proprietary models & sources.
10. And… do all of this without adding labor costs or exploding our infrastructure & license costs.
11. The big mental shifts…
Talent in the group is a profit center, not a cost center.
Our data is a key asset, worth a true dollar figure.
func(digital != “IT”);
13. Luxury Leads Business Requirements:
The test:
Can we accurately identify potential buyers using data?
The conditions:
• New luxury product in an untested market.
• Need new leads beyond in-bound requests.
• Must drive down our cost per lead.
• Build a machine to provide continuous insight.
14. Luxury Leads High-Level Process
[Diagram: whole-US and target-market transaction sets (Transactions A and B) are combined through a "Union View" of the sourced data; augmented features/signatures are generated for clustering and ML; cluster analysis yields clusters, segments, and personas, which refine engagement mechanisms for both the US and the target market; a machine learning propensity predictor scores leads (historic TAM) into a leads database and a proactive call list.]
16. What is a Data Lake?
A "Data Lake" is a repository that holds raw data in its native format until it is needed by downstream analytics processes.
17. Why is S3 a Natural Fit for Data Lakes?
No need to build a complicated stack
Simplicity = Freedom
• Inherent redundancy at low cost
• Massively parallel IO
• Separate storage from compute
• Integration with Amazon Redshift, EMR, etc.
18. Can I Just Put Data in S3 and Call it a Lake?
Unmanaged Lake = Swamp
• How do you find things?
• Does everyone just have access to everything?
• Does all data stay there forever?
19. What Separates a Data Lake from a Data Swamp?
• Intelligent use of storage conventions in S3
• Fine-grained permissions to contribute, discover, transform, and consume data
• Data ingest standards: defining and enforcing data governance processes
• Metadata to enable search and discovery
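The "storage conventions" point can be made concrete with a small sketch. The zone names, dataset name, and date-partitioned key layout below are illustrative assumptions, not the convention the talk used; the idea is simply that a predictable prefix scheme enables prefix-scoped IAM policies and date-based pruning by downstream tools.

```python
from datetime import date

def managed_dataset_key(zone: str, dataset: str, d: date, filename: str) -> str:
    """Build an S3 key under a hypothetical lake convention:
    <zone>/<dataset>/year=YYYY/month=MM/day=DD/<file>.
    Partitioned prefixes support fine-grained permissions and let
    readers prune by date instead of scanning the whole dataset."""
    assert zone in {"raw", "managed", "published"}, "unknown lake zone"
    return (f"{zone}/{dataset}/"
            f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/{filename}")

key = managed_dataset_key("managed", "us_re_transactions",
                          date(2016, 11, 30), "part-0000.parquet")
```

With a scheme like this, a governance policy can grant a contributor write access only to `raw/<dataset>/` while consumers read only from `published/`.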
20. Data Lake Reference Architecture
[Architecture diagram: Data contributors (owned/on-prem systems, 3rd-party partners and vendors, customers) submit raw, untransformed data in batch or stream via S3 and Amazon Kinesis submissions. Ingest workers and loaders (SQS queue, Lambda worker tier) move submissions into managed datasets in Amazon S3, where data and metadata support schema-on-read usage, with S3 also holding work in progress. Published data (indices, history) is indexed and made consumable via an HA DataLake API, backed by Amazon DynamoDB (discovery views, HA published results), Amazon CloudSearch (facets, indices, views), and RDS for UI, app, and API state. Data lake governors enforce rules, policies, and entitlements for contribute, manage, transform, and access operations, with rule-driven incremental loads, transforms, cataloging/indexing, and publishing; identity and security rest on single sign-on and unified policy-based entitlements (AWS Directory Service roles, AWS IAM permissions, AWS KMS). Data consumers (B2E | B2B | B2C direct users, business processes, API users) search, manage, and consume through DataLake web UIs on Elastic Beanstalk. Agile Lakeshore Analytics runs on Amazon Redshift on-demand warehouses, on-demand Hadoop/Spark (Elastic MapReduce, Qubole), Amazon Machine Learning, and R Studio, with data management and orchestration via AWS Data Pipeline and Airflow, and BI/visualization via Tableau Server and Amazon QuickSight. Monitoring: Amazon CloudWatch, AWS CloudTrail, CloudCheckr, DataDog.]
21. Luxury Leads High-Level Process
High-Level Data Flow & Ops Model
[Diagram: owned and sourced data flow from data contributors through defined submission mechanisms into submission datasets, a "Union View", and analysis WIP, using Amazon Kinesis, S3, EMR, and Amazon Redshift to generate clustering & ML features.]
Data lake governors:
• Define datasets managed within the data lake
• Define submission mechanisms for each dataset
• Manage submission & access entitlements
• Govern costs associated with datasets & Lakeshore Analytics
• Work with business owners to define required Lakeshore Analytics
Data contributors:
• Submit data using the defined submission mechanisms
Lakeshore Analytics:
• Consume datasets from the data lake
• Use analysis WIP
• Manufacture published results
22. Luxury Leads High-Level Process
Generate Clustering & ML Features
[Diagram: extracted features from the "Union View" of sourced and owned data split into clustering dimensions (with a distance heuristic) and profiling dimensions; R cluster analysis (hierarchical | model-based), driven by the feature definitions, feeds segment analysis; N distinct buyer personas emerged from the leads cluster analysis.]
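The clustering step can be sketched in miniature. The talk performed this in R (hierarchical and model-based clustering); below is a toy single-linkage agglomerative clustering in pure Python over hand-made buyer "signatures", purely to illustrate how a distance heuristic over clustering dimensions collapses candidates into persona groups. The feature tuples and distance choice are illustrative assumptions.

```python
import math

def euclid(a, b):
    """Distance heuristic over numeric buyer-signature features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, n_clusters):
    """Toy agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest, until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclid(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge j into i
        del clusters[j]
    return clusters

# Toy signatures: (number of purchases, average price in $M)
signatures = [(1, 0.3), (2, 0.4), (9, 2.5), (8, 2.2), (1, 0.2)]
personas = single_linkage(signatures, 2)
```

At production scale this would run over the full feature set in R or Spark, but the mechanics are the same: a distance heuristic plus a linkage rule yields the persona groupings.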
23. Luxury Leads High-Level Process
Buyer Personas for Marketing
Descriptive analytics on buyer personas = ability to refine engagement models
[Chart: example persona profile, Cluster 9]
24. Luxury Leads High-Level Process
Lead Scoring
1. Train the Model
[Diagram: ML training inputs per qualified candidate, with history "rewound" to a cutoff: transaction history, buy/sell quantities, property types, time, and 3rd-party data, drawn from US real estate activity (all buyers & sellers, all transaction types, past 30 years) and per-candidate statistics (number & size of purchases/sales, locations, …). For each candidate, the model predicts the total amount of future real estate purchases, expressed as a percentile rank from 0% (bought nothing) through "bought little" to 100% (bought most). The model predicts a candidate's rank to within +/- 20% of the actual rank 70% of the time. The training process detects complex patterns in the training inputs that the model uses to make predictions; these patterns are not available externally.]
Train Model → Generate Predictions
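The training target described above, a percentile rank of each candidate's total purchases after a "rewind" cutoff, can be sketched as follows. The record shapes, field names, and cutoff logic are illustrative assumptions, not the actual pipeline.

```python
from datetime import date

def percentile_rank_labels(candidates, cutoff):
    """For each candidate, sum purchase amounts strictly after the cutoff
    date (the 'future' the model must learn to predict) and convert the
    totals into a 0-100 percentile rank: 0 = bought nothing in the future
    window, 100 = bought the most."""
    futures = {
        name: sum(amt for d, amt in txns if d > cutoff)
        for name, txns in candidates.items()
    }
    ordered = sorted(futures, key=futures.get)   # ascending by future total
    n = len(ordered) - 1
    return {name: 100 * i / n for i, name in enumerate(ordered)}

txns = {
    "a": [(date(2014, 1, 1), 500_000), (date(2016, 6, 1), 2_000_000)],
    "b": [(date(2013, 5, 1), 300_000)],   # no purchases after the cutoff
    "c": [(date(2015, 9, 1), 750_000), (date(2016, 2, 1), 900_000)],
}
labels = percentile_rank_labels(txns, cutoff=date(2015, 1, 1))
```

Only pre-cutoff history would be fed to the model as input features; the post-cutoff rank is what it is trained to predict.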
25. Lead Scoring
2. Use the Model
[Diagram: current data per qualified candidate (transaction history, buy/sell quantities, property types, time, 3rd-party data) feeds the trained model to produce a predicted rank of candidates on the same 0-100% scale, from "will buy nothing" through "will buy some" to "will buy most"; sales focus is on candidates ranked above 60%. Current data and rank predictions are re-generated each night.]
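The nightly "use the model" step then reduces to a filter over predicted ranks. The 60% sales-focus threshold comes from the slide; the data shapes here are assumptions for illustration.

```python
def sales_focus_list(predicted_ranks, threshold=60.0):
    """Keep candidates whose predicted percentile rank (toward the
    'will buy most' end of the scale) meets the sales-focus threshold,
    ordered highest-propensity first."""
    focus = [(name, r) for name, r in predicted_ranks.items() if r >= threshold]
    return sorted(focus, key=lambda nr: nr[1], reverse=True)

ranks = {"a": 92.0, "b": 14.0, "c": 61.5, "d": 59.9}
call_list = sales_focus_list(ranks)
```

Because the ranks are regenerated each night, the call list stays current without any manual re-scoring.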
26. Scored Leads for Sales Team
Scored call list of real people who have bought high-end real estate, tied to 9 personas
28. Improving ML predictive accuracy
…use Amazon Redshift to extract use-case-specific features
Examples:
• Aggregate computation, e.g., average consumption per month/year
• Periodic behavior frequency extraction
• Volatility analysis & extraction
• Time-series difference analysis (e.g., average time between A and B, time-adjusted values)
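The talk computed such features in Redshift SQL; the same time-series-difference idea can be sketched in Python as the average gap between a "sell" event (A) and the next "buy" event (B) per candidate. The event names and record shape are illustrative assumptions.

```python
from datetime import date

def avg_sell_to_buy_days(events):
    """events: chronologically sorted (date, kind) pairs, kind in
    {'sell', 'buy'}. Returns the average gap in days from each sell to
    the next buy, or None if the candidate never sold then bought."""
    gaps, pending_sell = [], None
    for d, kind in events:
        if kind == "sell":
            pending_sell = d
        elif kind == "buy" and pending_sell is not None:
            gaps.append((d - pending_sell).days)
            pending_sell = None
    return sum(gaps) / len(gaps) if gaps else None

history = [(date(2015, 1, 1), "sell"), (date(2015, 1, 31), "buy"),
           (date(2015, 6, 1), "sell"), (date(2015, 7, 11), "buy")]
feature = avg_sell_to_buy_days(history)
```

In the warehouse this would be a window-function query over the transaction table, producing one such feature column per candidate for the ML training set.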
29. Amazon Redshift + Amazon Machine Learning
…better together
Time-series difference analysis example
[Diagram: raw behavior timelines (events A, B, C, D occurring from today back to the recent past, or back to a long time ago) are distilled by Amazon Redshift into per-candidate features such as net value, frequency, and sell-to-buy hold time; the features feed Amazon ML, and results flow back into Amazon Redshift.]
30. Technical Benefits of Approach:
• Managed services that "just work", providing speed, agility, and scale
• Amazon ML delivered higher predictive accuracy for propensity to buy
32. Business benefits of approach:
• Extensible.
• Adaptive.
• Open standards. Can work with lots of partners.
• On demand.
• Ever-growing talent pool.