Top 3 Considerations for Machine Learning on Big Data

Top 3 Things to Consider with Machine Learning on Big Data
Karen Hsu
Elliott Cordo


About our Speakers
Karen Hsu
• Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, Karen Hsu has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.
• Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.
• Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.


About our Speakers
Elliott Cordo
• Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.
• At Caserta Concepts, Elliott oversees large-scale, major technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.


Drivers & Challenges
Use Cases
Key Criteria
Best Practices
Next Steps

Big Data Drives Results
Big Data Analytics Drives Results

[Chart: Amazon vs Barnes & Noble. Vertical axis $0 to $300; horizontal axis quarterly dates from 12/31/09 into early 2013.]

[Chart: NetFlix vs Blockbuster. Vertical axis $0 to $300; horizontal axis quarterly dates from 12/31/09 into early 2013.]


Alternatives Are Lacking

Data Mining
• Hard to use
• Requires PhD experts
• Must write code
• Expensive

Traditional BI
• Fixed DW models
• Must write code for analytics
• Very high IT labor costs
• Not agile

Visualization
• Easy for small teams
• Can’t manage large data volume
• Lacks support for advanced analytics

Costs of Building Can be $1M+

$1M+ in Capital

Solution          Cost / 100TB
Teradata EDW      $1,650,000.00
Oracle Exadata    $1,400,000.00
IBM Netezza       $1,000,000.00

$1M+ in Salaries

Job Title                        Bay Area         New York
IT Project Manager               $140,000.00      $126,000.00
System Administrator             $117,000.00      $105,000.00
Network Administrator            $119,000.00      $107,000.00
Database Administrator           $125,000.00      $119,000.00
IT Security Manager              $116,000.00      $104,000.00
Business Intelligence Analyst    $137,000.00      $133,000.00
Data Scientist                   $138,000.00      $133,000.00
Java Developer                   $136,000.00      $133,000.00
QA Engineer                      $120,000.00      $114,000.00
Total                            $1,148,000.00    $1,074,000.00
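
The "$1M+ in Salaries" totals can be checked with a few lines of arithmetic. The sketch below assumes one hire per job title, and it pairs each salary with a title in the order the figures appear on the slide (that pairing is an assumption; the totals do not depend on it). The name `team_cost` is made up for this illustration.

```python
# Annual salaries per role as listed on the slide: (Bay Area, New York).
# The role-to-figure pairing follows slide order and is an assumption.
SALARIES = {
    "IT Project Manager": (140_000, 126_000),
    "System Administrator": (117_000, 105_000),
    "Network Administrator": (119_000, 107_000),
    "Database Administrator": (125_000, 119_000),
    "IT Security Manager": (116_000, 104_000),
    "Business Intelligence Analyst": (137_000, 133_000),
    "Data Scientist": (138_000, 133_000),
    "Java Developer": (136_000, 133_000),
    "QA Engineer": (120_000, 114_000),
}


def team_cost(salaries):
    """Return (bay_area_total, new_york_total) for one person per role."""
    bay_area = sum(bay for bay, _ in salaries.values())
    new_york = sum(ny for _, ny in salaries.values())
    return bay_area, new_york


bay_area_total, new_york_total = team_cost(SALARIES)
print(f"Bay Area: ${bay_area_total:,}")  # Bay Area: $1,148,000
print(f"New York: ${new_york_total:,}")  # New York: $1,074,000
```

Both sums match the totals shown on the slide, so a nine-person team alone passes the $1M mark in either location.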


Use Cases

Use Case                                        What is Revealed
Profiling and segmentation                      Customer, product, market characteristics and segments
Acquisition and retention                       What leads a person to become a customer or stop being a customer
Product development and operations optimization What led to product or network failure
Campaign management                             Patterns of successful campaigns
Cross-sell / up-sell                            Recommendations on services, products, or advisors for a given user/customer profile


Customer Examples
Industry and Use Case

Financial Services
• Show correlation between services purchased and investments/trades made
• Identify customer segments
• Recommendations for research articles to drive trading

eCommerce
• Show types of events person will like
• Decision tree based on likelihood to click through
• Recommendations for a large “cold start” population

Gaming
• Clustering for user profiles
• Correlation between attributes of a game and behavior
• Churn analysis

Healthcare
• Recommend tests or other offerings
• Identify factors/trends that lead to disease

Ease of Use


Quality

Clustering Overview
• K-means is a popular and versatile general-purpose clustering algorithm.
• Commonly used to group people and objects together to form segments
• Often leveraged to enhance recommendation and search systems

K-Means

How it works
1. Treats items as coordinates
2. Places a number of random “centroids” and assigns the nearest items
3. Moves the centroids around based on average location
4. Process repeats until the assignments stop changing
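
The four steps above can be sketched in plain Python. This is a minimal illustration, not the code behind any tool shown in this deck. It assumes squared Euclidean distance and picks the initial centroids from the items themselves, which is one common way to "place random centroids". The function name `kmeans` and its parameters are chosen for this example.

```python
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index
    of the centroid that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 2: start from k randomly chosen items as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assign every item to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Two well-separated groups of made-up points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, 2)
print(labels)
```

On data this cleanly separated, the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random start.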

*Diagram from Programming Collective Intelligence by Toby Segaran

Ease of Use
First, the set up...

In Datameer, you select the columns... And
get the results

And then run the results...

And the quality of results increases with larger
data sets…
And write additional code to scale...

Ease of Use
First, you have to set up...
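# Project the four numeric iris measurements onto principal components
pca <- princomp(iris[1:4])
# Cluster the same measurements into 3 groups with k-means
clusters <- kmeans(iris[1:4], 3)$cluster
# Plot the first two components, colored by cluster assignment
plot(pca$scores[, 1], pca$scores[, 2], col = clusters, pch = 5)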

And then write more code to scale...

get the results

Ease of Use
First, select the data...

get the results
Second, you need to create the cluster...

And then see the results


Ease of Use
1. First, a dataset’s attributes must be converted to numeric representations

User       Location      Company     Favorite Algo
Elliott    New Jersey    Caserta     K-Means
Karen      California    Datameer    K-Means

User       Location      Company     Favorite Algo
1001       1             101         1001
1002       2             102         1001

2. This numeric dataset is then converted to a sequence file, then to sparse vectors, using seqdirectory and seq2sparse
3. Mahout is called, specifying the number of clusters and the distance calculation

bin/mahout kmeans -i /user/kmeans/vectors -c /user/kmeans/input -o /user/kmeans/output -k 200 -dm CosineSimilarity -x 20 -ow

4. The sparse vector output is then converted back to a delimited format
5. Textual attributes will be appended back to the record, with numeric values preserved for ad-hoc distance comparison of members within a cluster
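
Step 1 (turning textual attributes into numbers) can be sketched as follows. This is only an illustration of the idea, not part of Mahout. The function `encode_columns` and the per-column start keys are hypothetical, chosen so that the output reproduces the numbers in the small example table on this slide.

```python
def encode_columns(rows, columns, start_keys):
    """Replace each textual value with an integer surrogate key, per column.

    Returns (encoded_rows, lookups), where lookups[column] maps text -> key
    so the text can be appended back to the clustered records later (step 5).
    """
    lookups = {col: {} for col in columns}
    encoded = []
    for row in rows:
        encoded_row = []
        for col in columns:
            mapping = lookups[col]
            value = row[col]
            if value not in mapping:
                # Next unused key for this column, counting up from its start key.
                mapping[value] = start_keys[col] + len(mapping)
            encoded_row.append(mapping[value])
        encoded.append(encoded_row)
    return encoded, lookups


COLUMNS = ["User", "Location", "Company", "Favorite Algo"]
# Start keys are assumptions, picked only to match the slide's example numbers.
START_KEYS = {"User": 1001, "Location": 1, "Company": 101, "Favorite Algo": 1001}
ROWS = [
    {"User": "Elliott", "Location": "New Jersey", "Company": "Caserta", "Favorite Algo": "K-Means"},
    {"User": "Karen", "Location": "California", "Company": "Datameer", "Favorite Algo": "K-Means"},
]

encoded, lookups = encode_columns(ROWS, COLUMNS, START_KEYS)
for encoded_row in encoded:
    print(",".join(str(value) for value in encoded_row))
# 1001,1,101,1001
# 1002,2,102,1001
```

Keeping the `lookups` dictionaries around is what makes step 5 possible: inverting them maps each numeric key back to its original text.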

*Diagram from Programming Collective Intelligence by Toby Segaran

get the results

Quality Comparison


Column Dependencies Overview

A    B    C    D
a    x    a    x
b    y    b    x
b    y    b    y
a    x    a    z
c    z    c    y
a    y    a    y

Column Dependency ~ 0.99
Column Dependency ~ 0.01

Value
• See how data is related after joining multiple sets of data
• See column dependencies on multiple types of data
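
The deck does not say how the column dependency score is computed. As a rough illustration of the idea only, the sketch below uses a stand-in measure, the uncertainty coefficient (the share of one column's entropy that another column explains, from 0 to 1). This is an assumption and not Datameer's formula. The column values are the A, B and D columns of the small example table on this slide, read row by row, which is also an assumption about the original layout.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in nats) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in Counter(values).values())


def dependency(a, b):
    """Uncertainty coefficient U(B|A): fraction of B's entropy explained by A."""
    h_b = entropy(b)
    if h_b == 0:
        return 1.0  # a constant column is trivially predictable
    groups = defaultdict(list)
    for a_value, b_value in zip(a, b):
        groups[a_value].append(b_value)
    # Entropy of B that remains once the value of A is known.
    h_b_given_a = sum(len(g) / len(b) * entropy(g) for g in groups.values())
    return (h_b - h_b_given_a) / h_b


col_a = ["a", "b", "b", "a", "c", "a"]
col_b = ["x", "y", "y", "x", "z", "y"]  # mostly follows column A
col_d = ["x", "x", "y", "z", "y", "y"]  # largely unrelated to column A

print(round(dependency(col_a, col_b), 2))
print(round(dependency(col_a, col_d), 2))
```

With only six rows, the scores will not reproduce the ~0.99 and ~0.01 shown on the slide, but the ordering is the same: the pair whose values move together scores clearly higher than the pair that does not.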


Quality Comparison
[Scatter plots of Column B against Column A, for numeric and string-valued columns. Recoverable annotations: ColumnDependency(A,B) = 0.5; ColumnDependency(A,B) = 0.]

Decision Tree Overview
Goal: Create a model that predicts the value of a target
based on several inputs.
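
To make the goal concrete, here is a minimal decision tree sketch in plain Python. It is an illustration under stated assumptions, not the algorithm used by R, Weka or Datameer: it splits on numeric thresholds chosen by Gini impurity, and the function names and the six toy rows (petal length, petal width) are made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Find the (feature index, threshold) that most reduces weighted Gini impurity."""
    best = None
    best_score = gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Return a nested dict tree; leaves are the majority class label."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1),
    }


def predict(tree, row):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Made-up training rows: (petal length, petal width) -> class.
rows = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (5.9, 2.1), (6.1, 2.3)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
tree = build_tree(rows, labels)
print(predict(tree, (5.0, 1.8)))  # virginica
```

The inputs are the numeric columns, the target is the class label, and the fitted tree is just a chain of threshold questions ending in a predicted label.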


Ease of Use
First, you need to code...
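# Install once, then load the recursive-partitioning package
install.packages("rpart")
library(rpart)
# Read the iris data; the CSV is expected to contain a "class" column
treeInput <- read.csv("/PathToData/iris.csv")
# Fit a classification tree that predicts class from the four measurements
fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth,
             data = treeInput)
# Draw the tree and label each node with its observation counts
par(mfrow = c(1, 2), xpd = NA)
plot(fit)
text(fit, use.n = TRUE)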

And then write more code to scale...


get the results

Ease of Use

First, select the data...

get the results
Second, you conﬁgure the settings...

And then see the results


Quality Comparison

            Iris      Wine      Breast Cancer Wisconsin
R           92.66%    86.47%    92.86%
Weka        95.33%    89.33%    93.5%
Datameer    93.33%    91.18%    93.04%


Recommendations Overview
• Increased revenue
• Your customers expect them
• What makes a good recommendation?
• Combination of algorithms and Hadoop makes an effective recommendations platform achievable


Ease of Use
First, the set up...
# run factorization of ratings matrix
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
  --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 \
  --numThreadsPerSolver 2
# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
  --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
  --numRecommendations 6 --maxRating 5 --numThreads 2
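
Conceptually, the second command scores items for each user from the two factor matrices produced by the first. The sketch below illustrates that idea in plain Python and is not Mahout's code. It assumes the predicted rating is the dot product of the user and item feature vectors, capped at the maximum rating, with already-rated items skipped. The function name `recommend` and the tiny two-feature vectors are invented for this example.

```python
def recommend(user_features, item_features, rated_items,
              num_recommendations=6, max_rating=5.0):
    """Rank unseen items for one user by the dot product of latent feature vectors."""
    scored = []
    for item_id, features in item_features.items():
        if item_id in rated_items:
            continue  # do not recommend what the user has already rated
        score = sum(u * v for u, v in zip(user_features, features))
        scored.append((item_id, min(score, max_rating)))  # cap at the top of the rating scale
    # Highest predicted rating first; ties broken by item id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:num_recommendations]


# Made-up factor vectors with 2 features (the commands above use 20).
ITEM_FEATURES = {
    10: [2.0, 1.0],
    20: [3.0, 2.5],
    30: [0.5, 0.5],
    40: [1.0, 3.0],
}
user = [1.5, 0.5]

print(recommend(user, ITEM_FEATURES, rated_items={30}, num_recommendations=2))
# [(20, 5.0), (10, 3.5)]
```

Capping at the maximum rating is one reason many predicted scores in factorized output can sit exactly at 5.0.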

get the results

1  [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
2  [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
3  [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
4  [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
5  [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
6  [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]

Quality Comparison

           Shawshank    Godfather    Pulp Fiction    Fight Club
Dianna     4.76         4.98         1.95            2.44
Jon        1.99         2.51         2.87            4.83
Karen      3.28         4.72         1.89            2.95
Elliott    2.92         3.64         2.97            4.83


Same Results

Big Data Analytics Process
[Process diagram. Labels: Integrate; Define; Ad Hoc; Prepare and Analyze; Deploy; Visualize; Production]

Clustering
• Leverage Hierarchies
• If possible, use numbering schemes
• Scale the surrogate key of attributes
• Try different cluster sizes
• Avoid numeric similarities when building your data


Recommendations
• Leverage a combination of algorithms
• Clustering is your friend!
• Treat cold start situations differently
• Think about ranking
• Don’t let recommendations go wild

[Diagram labels: K-Means: Similar; Item-Based; Item Similarity; Best Recommendations]

Process Best Practices

Map


Chain

Iterate

Return on Investment

Funnel Optimization: Increase customer conversion by 3x
Behavioral Analytics: Increase revenue by 2x
Fraud Prevention: Identify $2B in potential fraud
EDW Optimization: 98% OpEx savings, $1M+ CapEx savings
Customer Segmentation: Lower customer acquisition costs by 30%


Call to Action
Workshop
Contact

• Elliott Cordo elliott@casertaconcepts.com
• Karen Hsu khsu@datameer.com

