Consumer Analytics in Real Time: How InfoScout Tracks Purchase Behavior with Mechanical Turk

Jon Brelig, CTO, InfoScout
Sharon Chiarella, Vice President, Amazon Mechanical Turk
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Overview

– Receipt workflow
– Quality control
– Analytics

Wish I knew who that shopper was!

Helping brands answer…
• Who’s buying my product?
• Who’s the end consumer?
• Why did they buy?
• When and where?
• How many?
• At what price?
• With what else?

Who’s the shopper? What’s their motive?

How do we build a better panel?
Capture receipts through mobile

Our mobile apps
• Receipt Hog: Put $ in your pocket!
• Shoparoo: Fundraise for a cause!

Architecture

1. Capture receipt
2. Convert to structured data
– Computer vision + OCR + MTurk
3. Link to masterdata
– Scraping + classification models + human training
– Example: GAT G2 LMN LIME = UPC 052000209648
4. Data warehouse & prematerialize
– MySQL, Amazon Redshift, Hadoop (Amazon EMR)
5. Build cool stuff on top of it!
– Analytics, data firehose, hacks, etc.

[Diagram labels: target.com; Masterdata (MySQL); Tlog (Redshift)]
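As a rough illustration of step 3, the sketch below links an abbreviated receipt description to a UPC and checks the UPC-A check digit. The in-memory `MASTERDATA` dictionary and the function names are hypothetical stand-ins; the talk describes scraping, classification models, and human training rather than an exact-match lookup.

```python
import re

# Hypothetical stand-in for the masterdata table (the talk stores it in MySQL).
MASTERDATA = {"GAT G2 LMN LIME": "052000209648"}

def normalize(text: str) -> str:
    """Uppercase and collapse whitespace so minor spacing differences still match."""
    return re.sub(r"\s+", " ", text.strip().upper())

def upc_a_is_valid(upc: str) -> bool:
    """Verify the UPC-A check digit: 12 digits, the last one is the check digit."""
    if not re.fullmatch(r"\d{12}", upc):
        return False
    digits = [int(c) for c in upc]
    odd_sum = sum(digits[0:11:2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1:10:2])  # positions 2, 4, ..., 10
    check = (10 - (3 * odd_sum + even_sum) % 10) % 10
    return check == digits[11]

def link_to_masterdata(receipt_text: str):
    """Return the UPC for a receipt line, or None if unknown or malformed."""
    upc = MASTERDATA.get(normalize(receipt_text))
    return upc if upc is not None and upc_a_is_valid(upc) else None
```

A line that fails the lookup would, under this assumption, fall through to the classification and human-training paths named on the slide.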

Digitizing Receipts
Task is to convert image(s) of receipts => structured data

Transcribing Receipts
• Isn’t OCR good enough?
– It is a solved problem… for books
– Low recognition on wrinkled receipts from mobile
• Hybrid of computer + human
– Leverage OCR & computer vision, fill gaps with humans
• Human = MTurk + small audit staff
– We leverage a 6-person team to act as the top audit layer of the system

[Workflow diagram stages: Auto Extract (OpenCV, OCR, Regex); Can we skip?; Summary Extraction (Mechanical Turk); Itemized Extraction (Mechanical Turk); Score & Audit (Staff / Mechanical Turk); Complete. Feedback path: User marks or staff rejects HIT]
Summary Transcription

[Chart: Receipts by Month, y-axis from 0 to 1,200,000]

How do we scale quality control with growing volume?

Known Answers
• Publish HIT with at least one known answer to audit Worker accuracy
• Additional support provided by Amazon API
• Most effective when there is a concrete, expected answer
– e.g., multiple-choice answers
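A minimal sketch of how known answers can be used to audit Worker accuracy: only questions with a known answer are scored, and each Worker gets the share they answered correctly. The tuple layout and the IDs are hypothetical; this does not show the Amazon API support mentioned above.

```python
from collections import defaultdict

def worker_accuracy(submissions, known_answers):
    """Score Workers on known-answer questions only.

    submissions: iterable of (worker_id, question_id, answer)
    known_answers: dict mapping question_id to the expected answer
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker_id, question_id, answer in submissions:
        if question_id not in known_answers:
            continue  # real work, no expected answer to compare against
        total[worker_id] += 1
        if answer == known_answers[question_id]:
            correct[worker_id] += 1
    return {worker: correct[worker] / total[worker] for worker in total}

known = {"q1": "Target", "q2": "Walmart"}
subs = [
    ("W1", "q1", "Target"), ("W1", "q2", "Walmart"), ("W1", "q9", "Costco"),
    ("W2", "q1", "Target"), ("W2", "q2", "Kroger"),
]
accuracy = worker_accuracy(subs, known)
```

Exact-match comparison is why the technique works best for concrete answers such as multiple choice; free-form transcription would need a fuzzier comparison.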

Known Answers
[Chart: Net Cost per Receipt ($0 to $0.03), split into InfoScout Review Cost and MTurk Cost; annotations: "Developed more efficient review process", "Transitioned to Known Answers"]

Known Answers lowered our net cost per receipt from 2 cents to 1 cent

Itemized Extraction
• Transcribe every item on receipt
• HITs audited by review team, priority scored by:
– Comparing output to known OCR extraction
– Comparison to master data (e.g., did they “fat finger” a price or UPC?)
– Worker approval history
– Worker tenure (for InfoScout HITs)
– Additional features
• Not a great candidate for Known Answers…

How do we scale quality control for itemized extraction?
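The audit priority scoring listed above could be sketched as a weighted sum of the named signals. The feature names, weights, and the 100-HIT tenure cutoff are all hypothetical; the talk does not say how the signals are combined.

```python
from dataclasses import dataclass

@dataclass
class HitFeatures:
    ocr_agreement: float         # 0..1, share of fields matching the OCR extraction
    masterdata_mismatches: int   # prices or UPCs that disagree with master data
    worker_approval_rate: float  # 0..1, from the Worker's approval history
    worker_tenure_hits: int      # InfoScout HITs the Worker has completed

def audit_priority(f: HitFeatures) -> float:
    """Return a score in 0..1; higher means review this HIT sooner (assumed weights)."""
    score = 0.4 * (1.0 - f.ocr_agreement)
    score += 0.3 * min(f.masterdata_mismatches, 3) / 3
    score += 0.2 * (1.0 - f.worker_approval_rate)
    score += 0.1 * (1.0 if f.worker_tenure_hits < 100 else 0.0)
    return score

safe = HitFeatures(1.0, 0, 1.0, 5000)
risky = HitFeatures(0.5, 2, 0.8, 10)

# Review the riskiest HITs first.
queue = sorted([safe, risky], key=audit_priority, reverse=True)
```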

Plurality

• HIT completed by >1 Worker
– InfoScout only sends HITs with low confidence to multiple Workers
• Higher quality, higher cost
– Limit costs by scientifically selecting HITs to send to a second Worker
• Multiple strategies when an answer discrepancy is found
– Ask a third Worker
– Leverage internal auditors

[Diagram: Publish HIT; Worker 1 Submits; Worker 2 Submits; Match? YES: Accept]
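The plurality flow above can be sketched as two small functions: one decides whether a HIT goes to a second Worker, the other accepts an answer only when a strict majority of submissions agree. The 0.8 confidence threshold and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Optional, Sequence

def needs_second_worker(confidence: float, threshold: float = 0.8) -> bool:
    """Send only low-confidence HITs to an additional Worker (assumed threshold)."""
    return confidence < threshold

def resolve(answers: Sequence[str]) -> Optional[str]:
    """Return the accepted answer, or None when the discrepancy must be escalated.

    One answer is accepted as-is, two must match, and three need two in
    agreement. None means: ask another Worker or hand off to internal auditors.
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes > len(answers) / 2 else None
```

With two Workers, `resolve(["3.99", "8.99"])` returns None, and a third submission of "3.99" would settle it; three different answers still return None and would go to the audit staff.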

HIT Acceptance Latency
[Chart: Minutes to Accept (0 to 700) by date, 12/22/12 to 6/22/13; annotation: "Changed Template"]
• Measures HIT demand
• Template change decreased demand temporarily, but Workers acclimated

Worker Retention
[Chart: Total HITs Completed (0 to 700,000), split into HITs Complete (New Workers) and HITs Complete (Retained Workers), with % Completed by retained Workers (0% to 100%)]

Within two months, 80% of HITs were completed by returning Workers

Pareto of Worker Volume
[Chart: % of all HITs completed (0% to 90%) by Worker Percentile: Top 5%, 6-10%, 10-20%, 21-50%, 51-100%]

Our top 5% (~500) active Workers account for >80% of all HITs completed
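A figure like the one above can be computed from per-Worker HIT counts. The sketch below is one way to do it; the function name and the sample counts are made up for illustration and are not InfoScout data.

```python
def top_share(hits_per_worker: dict, top_percent: int = 5) -> float:
    """Share of all HITs completed by the top `top_percent` percent of Workers."""
    counts = sorted(hits_per_worker.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Integer ceiling of len(counts) * top_percent / 100, with at least one Worker.
    k = max(1, (len(counts) * top_percent + 99) // 100)
    return sum(counts[:k]) / total

# Made-up example: 20 Workers, one of whom does most of the work.
sample = {"W0": 810}
sample.update({f"W{i}": 10 for i in range(1, 20)})
share = top_share(sample)  # top 5% of 20 Workers is 1 Worker
```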

Please give us your feedback on this presentation (BDT206).
As a thank you, we will select prize winners daily for completed surveys!

Quality Control Strategies
• Filter incoming Workers
– Qualifications
– Template validation
– Template instructions
• Increase quality during completion
• Post submission
– Plurality (multiple HITs per task)
– Known Answers
– Workers audit Workers

[Diagram labels: Enhance HIT; Approve/Reject?]

Multiple strategies can yield high accuracy

HIT templates
• Clear & concise instructions
– The first time, each Worker sees detailed instructions and can hide them once they’re comfortable
• Keyboard shortcuts
• Maximize validation
– Client-side and/or AJAX validation
• Bonus rewards
– Nice option for rewarding Workers, especially when HITs are variable in length & time
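As an illustration of the validation point, here is a minimal sketch of a server-side check that an AJAX call from a summary HIT template might hit before the Worker submits. The form field names, the MM/DD/YYYY date format, and the error messages are assumptions; the talk does not describe the actual template.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

def validate_summary(form: dict, today: date) -> list:
    """Return a list of error messages; an empty list means the form is valid."""
    errors = []
    try:
        total = Decimal(form.get("total", ""))
        if total <= 0:
            errors.append("total must be positive")
    except InvalidOperation:
        errors.append("total must be a number like 12.34")
    try:
        purchased = datetime.strptime(form.get("date", ""), "%m/%d/%Y").date()
        if purchased > today:
            errors.append("date cannot be in the future")
    except ValueError:
        errors.append("date must be MM/DD/YYYY")
    return errors

today = date(2013, 11, 14)
ok = validate_summary({"total": "1.39", "date": "11/13/2013"}, today)
bad = validate_summary({"total": "1,39", "date": "13/45/2013"}, today)
```

Catching a mistyped total or an impossible date before submission saves an audit or a rejected HIT later.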
