Millions of users visit Intuit product portals every day. With web analytics, we know what user behavior looks like, but not why. By tapping into in-product search and social data, we began to understand the types of questions, pain points, and suggestions users have. This was made possible with text analytics, via unguided machine learning at scale.
Topic discovery was just the beginning though. Trending, segmentation, integration with clickstream data and association with business goals made voice of customer insights actionable. In this presentation, learn about:
Text analytics at Intuit (case study)
Building decision support around text analytics
Technical approach & scaling
Protecting data privacy
Open source & commercial solutions
Heather Wasserlein is a Senior Product Manager at Intuit, where she partners with Data Science to create data-driven New Business Initiatives. Prior to Intuit, Heather worked on advertising marketplaces and web content classification at Yahoo! Heather holds a Master’s degree in Mechanical Engineering from MIT.
8. Overwhelming data volumes
You can read a few thousand customer comments, but not millions.
And, new themes come up every day..
9. You can pull a “top 1000” list, but..
Is it telling you anything new? Actionable?
Top: hello, help, call, login
Mid: password, cant find pwd, account, multiple accounts, print, import error 5514, phone, printing blank page, phone number, call customer sevice
Long tail: change password, charged twice cancel, print function not working new version of IE error msg 87956, please call back at 555-555-5555
10. Insights often in the tail
Needle-in-the-haystack problem – valuable details hidden in descriptive, tail verbatims
Top: hello, help, call, login
Mid: password, cant find pwd, account, multiple accounts, print, import error 5514, phone, printing blank page, phone number, call customer sevice
Long tail: print function not working, version of IE error msg 87956, change password, charged twice cancel, please call back at 555-555-5555
11. Related topics dispersed
The “top 1000” can be misleading – the most common verbatims may not represent the most common themes
Top: hello, help, call, login
Mid: password, cant find pwd, account, multiple accounts, print, import error 5514, phone, printing blank page, phone number, call customer sevice
Long tail: print function not working new version of IE error msg 87956, change password, charged twice cancel, please call back at 555-555-5555
12. What is text analytics?
With numeric data, you can run summary stats; summarizing textual data is more complex.
Statistics + Linguistics
You can mix and match various statistical and linguistic tools, depending on the problem.
14. Case Studies
Applying text analytics to simple and complex problems at Travelocity, Yahoo! and Intuit
15. Travelocity search
Where is Albekerke?
Example searches: San Jose; San Jose, CA; San Jose, Costa Rica; San Jose Intl Airport; NY; NYC; JFK; New York, NY, USA; NY, New York; Grand Canyon; Disneyland
16. Travelocity search solution
Finite set of airports, but many variations in search
San Jose
San Jose, CA
San Jose International
Mineta San Jose Airport
San Josee Airport
Silicon Valley
SJC
Simple, but manually intensive solution –
Mapping of all known search variations to relevant airport codes. Plus, sound-ex phonetic matching to catch unforeseen misspellings.
“Rules-based” approach – no statistics, minimal linguistics (sounds)
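The phonetic fallback described above can be sketched in a few lines. This is a minimal, generic Soundex plus a hypothetical alias table — an illustration of the rules-based approach, not Travelocity's actual code or data:

```python
def soundex(name):
    """Classic Soundex: first letter plus three digits encoding consonant sounds."""
    codes = {c: d for d, group in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in group}
    name = name.lower()
    digits, prev = [], codes.get(name[0])
    for ch in name[1:]:
        code = codes.get(ch)
        if code is not None and code != prev:
            digits.append(str(code))
        if ch not in "hw":            # h and w do not reset the previous code
            prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]

# Hypothetical alias table mapping known search variations to an airport code
ALIASES = {"san jose": "SJC", "san jose, ca": "SJC", "sjc": "SJC",
           "albuquerque": "ABQ"}

def resolve(query):
    """Exact alias lookup first; fall back to phonetic match for misspellings."""
    q = query.lower().strip()
    if q in ALIASES:
        return ALIASES[q]
    return next((code for alias, code in ALIASES.items()
                 if soundex(alias) == soundex(q)), None)
```

With this, the misspelling from the previous slide still resolves: `resolve("Albekerke")` phonetically matches "albuquerque" and returns `"ABQ"`.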
17. Yahoo! web site classification
Is this site clean? Does it contain any illegal or sensitive content?
– alcohol
– tobacco
– drug
– online gambling
– violence or weapons
– adult content
Does the web site meet advertiser standards?
18. Yahoo! web site classification solution
Verbose, rapidly-changing data, but finite set of topics.
100,000’s of web sites in Y! and partner Ad Networks.
Training data (human-labeled) –
– 5K positive examples
– 30K negative examples
Multiple approaches –
Classifiers, keyword matching, image matching, and human-review process.
Supervised machine learning –
Pattern detection, phrases and contexts associated with finite set of “risk categories.”
Emphasis on recall, catching true positives.
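A keyword-matching pass — one of the “multiple approaches” above — can be sketched as follows. The category keyword lists here are purely illustrative assumptions, not Yahoo!'s actual lists:

```python
import re

# Illustrative keyword lists per "risk category" (not Yahoo!'s actual lists)
RISK_CATEGORIES = {
    "gambling": ["casino", "poker", "betting"],
    "alcohol": ["beer", "whiskey", "brewery"],
    "weapons": ["rifle", "ammunition", "firearm"],
}

def flag_site(page_text):
    """Return every risk category whose keywords appear in the page text.
    A single hit is enough to flag the site -- biased toward recall,
    with human review downstream to weed out false positives."""
    text = page_text.lower()
    return {cat for cat, words in RISK_CATEGORIES.items()
            if any(re.search(r"\b%s\b" % re.escape(w), text) for w in words)}
```

Flagging on a single keyword hit trades precision for recall, which is why the slide pairs it with a human-review process.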
20. Intuit tax support solution
Millions of questions daily, of all types.
Google-like search, but often in natural language –
PIN number
Where can I find my PIN?
Newly married, file jointly
File married or separately?
Home mortgage deduction
Can I deduct my dog?
Why is 1099-int import slow?
Where’s my refund??
Solution –
Clustering of site searches, topic “discovery”.
Discovered topics: PIN, file married, deduct, 1099int, import, refund
Unsupervised machine learning –
Statistics and linguistics. Part of speech tagging. Detection of words that “go together more often than not”.
21. Results for 3 algorithms
LDA (bag of words) –
File, free, taxes
File, extension, get
File, security, social
Income, state, business
Payment, state, filed
State, refund, check
Lingo (hierarchical clustering) –
File: File 2012, File an extension, File state
Deduction: Deduction car, Deduction sales tax, Deduction standard
Custom (n-gram clustering, in-house solution) –
File extension
Social security
Business income
Sales tax deduction
Refund check
Payment
22. Words + numbers = insights
Emerging Topics
Funnel Analysis
Trending & (pre) Segmentation
Sentiment
Example topics: Refund, deduct, Late legislation, File extension, Error 576, Enter w2, Import error.., Taxes done!, etc.
23. Use Cases
Product Managers –
1. User needs
– Identify product enhancements
– Rapidly diagnose product defects
– Tune site search
– Personalize content
2. Emerging issues
– Early insight to new issues
Customer Care –
1. Common questions
– Train agents & staff appropriately
2. Call routing
– Segment by VOC
Marketing –
1. Address common questions to retain users
2. Segment by sentiment and empower promoters
3. Customer dialogue
– Listen to feedback & respond 1:1 or 1:many
24. Our journey
Y1 – Proof of concept
– Science project
– Clustering 2M searches, 2-day lag
– Emerging issues detection
– Vocal early adopters
– Data volume grew, system crawled
Y2 – Productize
– Transfer from science to eng
– Scaled to 15M searches, 1-day lag
– Report email
– Campaign to grow adoption
– Site search & FAQ tuning
Y3 – Scale..!
– Scaled to 30M searches, next day 9am SLA
– Viral adoption, 50+ users
– “VOC team” meets weekly
– X-functional value: 100’s of items actioned, $10M’s, 2 new products
25. Scaling
Reduce problem size –
1. Pre-process
– de-dup
– remove PII, system generated info, etc.
– remove stop words
– map synonyms
– stemming
2. Reduce data size
– sample
– segment
– narrow time period
– remove tail terms (cautiously)
Add hardware –
1. Add memory
– text clustering is memory constrained
– verbose text is harder
2. Distribute processes
– rule-based categorization scales linearly
– clustering of segments can be run in parallel
– data sourcing
– pre-processing
Optimize algorithm –
1. Tradeoffs & tuning
– Choose approach to balance accuracy vs. performance
– Tune algorithm parameters
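A minimal sketch of the pre-processing column above. The stop-word list, synonym map, and PII pattern are illustrative assumptions, not Intuit's actual rules:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "i", "my", "to", "at", "where", "can"}
SYNONYMS = {"pwd": "password", "cant": "cannot"}   # illustrative mapping
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")  # simple PII pattern

def preprocess(queries):
    """Scrub PII, drop stop words, map synonyms, then de-dup the queries."""
    seen, cleaned = set(), []
    for q in queries:
        q = PHONE_RE.sub("<phone>", q.lower())           # remove PII
        tokens = [SYNONYMS.get(t, t) for t in re.findall(r"[a-z<>]+|\d+", q)]
        tokens = [t for t in tokens if t not in STOP_WORDS]
        key = " ".join(tokens)
        if key and key not in seen:                      # de-dup
            seen.add(key)
            cleaned.append(key)
    return cleaned
```

Each step shrinks the term-document matrix before clustering ever runs, which is where most of the scaling win comes from.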
26. Results
1. Faster time to insights
– Customer issues detected up to 1 week earlier
– Search is a leading indicator for call drivers – a canary in the coal mine
2. Better customer experience
– Using text insights to tune search results improved relevancy
– Identifying users with common questions made it possible to personalize the experience
– VOC data + user behavior led to a whole new understanding of product use
3. $10’s of millions in revenue
– Detecting and resolving customer pain points generated $10’s of millions
27. Getting started?
1. Read a sample of verbatims + scope the problem
– Topic discovery or known topics?
– Sources of text and verbosity (few words, sentences, pages)?
– Estimate data volumes and define SLA’s
2. Build vs. buy
– Compare tools, build proofs of concept
– Compare results relative to a “golden set”
3. Start small
– One data source, non-verbose text, small volumes
– 1000’s of documents for statistically valid results
– Beta test reporting, QA topic-verbatim fit
4. Establish business processes
– X-functional process to action insights, let reports go viral
– Scale and incorporate domain knowledge later (“phase 2”)
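For the “golden set” comparison in the build-vs-buy step, a simple precision/recall check against hand-labeled topics is usually enough. A sketch (the labels are hypothetical):

```python
def precision_recall(predicted, golden):
    """Score a tool's topic labels against a hand-labeled "golden set"."""
    predicted, golden = set(predicted), set(golden)
    true_pos = len(predicted & golden)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(golden) if golden else 0.0
    return precision, recall
```

Running each candidate tool's output through the same scorer makes the comparison apples-to-apples.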
31. “Home grown” Algorithm
Unsupervised machine learning / clustering
1. Identify candidate phrases
– Sparse: Identify all combinations of bi-grams, tri-grams, four-grams
– Verbose: Use linguistic approaches to identify phrases
• Split text into sentences + identify part-of-speech for each word (noun, adj, etc.)
• Apply linguistic filters to parse candidate phrases (adj noun, verb adv, etc.)
2. Determine which phrases are “significant”
– Count word frequencies and calculate likelihood ratios
• L1 = words are independent, L2 = words are dependent
• If L2 > L1, the words appear together more often than not
3. Cluster related topics
– Represent n-grams and searches as vectors, calculate similarity (cosine distance), and cluster related topics when similarity > pre-defined threshold
4. Identify topic “title”
– Construct “title” representative of the cluster (ex. most common search)
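The steps above (for the sparse case) can be sketched roughly as follows. This is a toy reconstruction, not Intuit's code: step 2 uses a PMI-style independence score as a stand-in for the full likelihood-ratio test, and step 3 uses a greedy single-pass grouping rather than a production clustering algorithm:

```python
import math
from collections import Counter

def significant_bigrams(queries, threshold=0.5):
    """Steps 1+2: count bi-grams, keep those whose words co-occur more
    often than independence would predict (PMI-style score)."""
    words, bigrams, total = Counter(), Counter(), 0
    for q in queries:
        toks = q.lower().split()
        words.update(toks)
        bigrams.update(zip(toks, toks[1:]))
        total += len(toks)
    return {pair: score for pair, n12 in bigrams.items()
            if (score := math.log(n12 * total /
                                  (words[pair[0]] * words[pair[1]]))) > threshold}

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(queries, min_sim=0.5):
    """Steps 3+4: group queries whose vectors are similar enough; the
    first member of each group serves as the cluster "title"."""
    vecs = [Counter(q.lower().split()) for q in queries]
    groups = []
    for i, v in enumerate(vecs):
        for g in groups:
            if cosine(v, vecs[g[0]]) > min_sim:
                g.append(i)
                break
        else:
            groups.append([i])
    return {queries[g[0]]: [queries[i] for i in g] for g in groups}
```

The real pipeline would title each cluster by its most common search; using the first member keeps the sketch short.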
32. What’s next for text at Intuit?
1. Finalize evaluation of new algorithms (ex. Lingo3G, LDA, etc.)
2. Scale through distributed processing (i.e. move to Hadoop)
3. Support more types of text (ex. verbose)
4. Continue to integrate topics & usage data for complete picture of end-to-end user experience
5. Provide text analytics as a service
6. Semantic search
7. Internationalization (future)
Editor’s notes
In a digital world, businesses give customers many channels to communicate – throughout the end-to-end customer experience of shop, browse, buy, use, etc. Ideally, we’d “listen” equally well across all of these touch points. Yet, much of the analytics focus is either upstream (ex. search engines) or downstream (ex. social media). This provides insight into user intent and feedback, but misses very important insights into customer experience with your product and services. For example, site search, customer support channels (call centers, chat) and communities are valuable sources of insights. Rather than wait for feedback on Yelp or Twitter, there’s an opportunity to be proactive and address customer questions during product use. Also, with many channels, there are many formats for data. A tweet doesn’t look like a blog post. Voice data often gets converted to text (by a machine or an agent summarizing a call, for example).
It is not uncommon to see people trying to read through 1000’s of customer surveys, suggestions, etc. One of my first text analytics requirements sessions was with a sharp User Experience Designer who would spend her Friday afternoons reading as many feedback reports as possible. CEO’s often personally read a subset of emails from customers or listen in on support calls; our CEO does. This is commendable, but doesn’t scale when you receive millions of communications every day. Nor is it possible to keep up with ever-changing topics – today’s customer questions could be completely different than yesterday’s.
While language has some structure, there is ambiguity. Words have multiple meanings, different forms, and can be used in metaphor (ex. can and can, tin can vs. we can; colorful fish vs. let’s go fish vs. a fish out of water). In addition, we are human. We have our own unique way of saying things. Some of us are polite and punctuate. Others misspell and abbreviate.. Sometimes we share TMI, including our PII. With text analytics, all of our data gets thrown in the mix. The goal is to make sense of it all.
In order to accurately “summarize” text data, the trick is to count all related topics across the corpus.
At the most basic level, we’re trying to understand the meaning of words – with uncertainty due to context, morphology, and accuracy (ex. misspellings). More generally, we’re trying to understand user intent, sentiment, etc. Note: as documents become more verbose (ex. a blog is verbose, a tweet is sparse), the more linguistics can help. Linguistics – sounds; words (literal meaning); bi-grams, etc. (words that go together, like “new york”); phrases (“who let the dogs out?”); sentences and part of speech / POS (subject, object, noun, adj, verb, etc.); context within a large block of text. Terminology: corpus; documents (text data, could be a tweet, search query, blog entry, etc.) – called a “verbatim” if in the user’s words; words vs. tokens; topics / themes.
Everyone has a particular writing (and speaking) style. Some people use some vocabulary more than others. I bet you could distinguish a paragraph from NYT vs. Cosmopolitan. Statistics can be used here – to find distributions for every word (ex. how many times is “the” used in general publications) and compare it to your writing (ex. do you use “the” more than the average person)? Note: women use adjectives more than men.
Taxes are complex; people have tons of questions from start to finish.
Intuit also uses Clarabridge (rules-based solution) for categorization of support call logs and Radian6 for monitoring and sentiment analysis of social media. The primary driver for unsupervised clustering of in-product search queries was to capture “emerging issues” – things we couldn’t foresee ahead of time when building rules (ex. a bug introduced in a product launch, late legislation issues with the IRS, etc.). Another benefit of unsupervised approaches is they don’t require human input or maintenance (low effort).
Numbers tell us what is happening, but not why. This is where text completes the story. For example, you may see conversion going up or down. But what’s driving this change? By looking at emerging issues (what people are talking about today), you can see if a bug was introduced in your recent launch, etc. Trending is also valuable – to determine if a particular topic is gaining strength or has gone away (ex. after making a product enhancement). Segmentation enables you to see the types of questions new vs. returning users have. Better yet, questions from non-converters. But, unlike numeric data, where you can slice and dice results after aggregating, with text you get more accurate results if you segment before clustering. Are tax filers procrastinators? ;-) File extension is a perennial top theme the night before tax day. Integrating text into “funnel analysis” was extremely valuable. Clickstream data tells us where users drop off, but not why. Verbatims helped pinpoint user pain points / road blocks. Resolving just one of these pain points was worth $5M. Analysis of adjectives provides a directional gauge for sentiment. Perhaps a more accurate way to gauge sentiment is to segment promoters from detractors and see what each group has to say.
When I began working at Intuit 3 years ago, there were text analytics efforts centered around call logs and social. We used a rules-based categorization tool called Clarabridge to classify logs from call center agents. We obtained a data feed from Facebook, Twitter, blogs, etc. and evaluated results with Radian6, a Salesforce tool. Both of these tools work well for their respective use cases, but we noticed a gap – we didn’t have a good way to detect emerging issues. Thus began a 3-year journey in unguided machine learning for automated topic discovery (i.e. no human input required)..
Pre-processing is 90% of the solution – you can greatly reduce complexity by removing stop words, stemming, mapping synonyms, etc. This reduces the term-doc matrix. With a 30% sampling rate, we saw an equivalent set of “top themes” as with a complete, 100%, data set. Rules-based categorization scales linearly, but clustering is memory constrained, because everything is compared with everything else. Segmentation helps, because segments can be processed in parallel. With 64GB memory, clustering of 5 million searches took < 2 hrs, enabling next-day reporting on yesterday’s clickstream by 9AM. Optimizing upstream processes helps too. Note: as text becomes more verbose, computation time slows, a lot. Using part-of-speech parsing to focus on nouns can help identify what a document is about, although you miss sentiment (adjectives). Rules-based approaches, categorization based on keywords, are also easier. It depends what type of problem you are solving.