2. Top 3 Things to Consider with Machine Learning on Big Data
Karen Hsu
Elliott Cordo
© 2013 Datameer, Inc. All rights reserved.
3. About our Speakers
Karen Hsu
• Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, she has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.
• Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.
• Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.
4. About our Speakers
Elliott Cordo
• Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.
• At Caserta Concepts, Elliott oversees major large-scale technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.
7. Big Data Drives Results
Big Data Analytics Drives Results
[Chart: Amazon vs Barnes & Noble. Y-axis: $0 to $300; x-axis: quarterly dates, 12/31/09 through 03/13.]
[Chart: Netflix vs Blockbuster. Y-axis: $0 to $300; x-axis: quarterly dates, 12/31/09 through 03/13.]
8. Alternatives Are Lacking
Data Mining
• Hard to use
• Requires PhD experts
• Must write code
• Expensive
Traditional BI
• Fixed DW models
• Must write code for analytics
• Very high IT labor costs
• Not agile
Visualization
• Easy for small teams
• Can’t manage large data volume
• Lacks support for advanced analytics
9. Costs of Building Can be $1M+
$1M+ in Capital
Solution | Cost / 100TB
Teradata EDW | $1,650,000
Oracle Exadata | $1,400,000
IBM Netezza | $1,000,000
$1M+ in Salaries
Job Title | Bay Area | New York
IT Project Manager | $140,000 | $126,000
System Administrator | $117,000 | $105,000
Network Administrator | $119,000 | $107,000
Database Administrator | $125,000 | $119,000
IT Security Manager | $116,000 | $104,000
Business Intelligence Analyst | $137,000 | $133,000
Data Scientist | $138,000 | $133,000
Java Developer | $136,000 | $133,000
QA Engineer | $120,000 | $114,000
Total | $1,148,000 | $1,074,000
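The salary totals on this slide can be checked by summing the nine listed salaries per region. A minimal Python sketch follows; it assumes the figures pair with the job titles in the order they appear on the slide (that pairing is inferred from the slide text, though the totals do not depend on it).

```python
# Annual salaries in USD as (Bay Area, New York), taken from the slide.
# Assumption: titles and figures pair up in slide order.
SALARIES = {
    "IT Project Manager": (140_000, 126_000),
    "System Administrator": (117_000, 105_000),
    "Network Administrator": (119_000, 107_000),
    "Database Administrator": (125_000, 119_000),
    "IT Security Manager": (116_000, 104_000),
    "Business Intelligence Analyst": (137_000, 133_000),
    "Data Scientist": (138_000, 133_000),
    "Java Developer": (136_000, 133_000),
    "QA Engineer": (120_000, 114_000),
}


def regional_totals(salaries):
    """Return (Bay Area total, New York total) for a title -> (bay, ny) mapping."""
    bay_total = sum(bay for bay, _ in salaries.values())
    ny_total = sum(ny for _, ny in salaries.values())
    return bay_total, ny_total


bay_total, ny_total = regional_totals(SALARIES)
print(f"Bay Area: ${bay_total:,}  New York: ${ny_total:,}")
```

Both sums match the totals printed on the slide, which is where the "$1M+ in Salaries" headline comes from.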
11. Use Cases
Use Case | What is Revealed
Profiling and segmentation | Customer, product, market characteristics and segments
Acquisition and retention | What leads a person to become a customer or stop being a customer
Product development and operations optimization | What led to product or network failure
Campaign management | Patterns of successful campaigns
Cross-sell / up-sell | Recommendations on services, products, or advisors for a given user/customer profile
12. Customer Examples
Industry / Use Case
Financial Services
• Show correlation between services purchased and investments/trades made
• Identify customer segments
• Recommendations for research articles to drive trading
eCommerce
• Show types of events a person will like
• Decision tree based on likelihood to click through
• Recommendations for a large “cold start” population
Gaming
• Clustering for user profiles
• Correlation between attributes of a game and behavior
• Churn analysis
Healthcare
• Recommend tests or other offerings
• Identify factors/trends that lead to disease
15. Ease of Use
Quality
17. Clustering Overview
• K-means is a popular and versatile general-purpose clustering algorithm.
• Commonly used to group people and objects together to form segments
• Often leveraged to enhance recommendation and search systems
K-Means: How it works
1. Treats items as coordinates
2. Places a number of random “centroids” and assigns the nearest items
3. Moves the centroids around based on average location
4. Process repeats until the assignments stop changing
*Diagram from Programming Collective Intelligence by Toby Segaran
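The four steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not Datameer's or Mahout's implementation. It assumes items are tuples of numbers, uses squared Euclidean distance, and picks the starting centroids from the items themselves (one common way to place "random" centroids). The function name `kmeans` and its parameters are made up for this example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index
    of the centroid nearest to points[i].
    """
    rng = random.Random(seed)
    # Step 2: use k randomly chosen items as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

On the six sample points, the first three items end up in one segment and the last three in the other. With real data the result can depend on the starting centroids, which is one reason to try several values of `k` and several seeds.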
18. Ease of Use
First, the set up...
And then run the results...
And write additional code to scale...
And the quality of results increases with larger data sets…
In Datameer, you select the columns... and get the results
19. Ease of Use
First, you have to set up...
# Project the four iris measurements onto principal components for plotting
pca <- princomp(iris[1:4])
# Cluster the same four columns into 3 groups with k-means
clusters <- kmeans(iris[1:4], 3)$cluster
# Plot the first two components, colored by cluster
plot(pca$scores[, 1], pca$scores[, 2], col = clusters, pch = 5)
And then run the results...
And then write more code to scale...
In Datameer, you select the columns... and get the results
20. Ease of Use
In Datameer, you select the columns... and get the results
First, select the data...
Second, you need to create the cluster...
And then see the results
21. Ease of Use
1. First, a dataset’s attributes must be converted to numeric representations
User | Location | Company | Favorite Algo
Elliott | New Jersey | Caserta | K-Means
Karen | California | Datameer | K-Means
User | Location | Company | Favorite Algo
1001 | 1 | 101 | 1001
1002 | 2 | 102 | 1001
2. This numeric dataset is then converted to a sequence file, then to sparse vectors using seqdirectory and seq2sparse
3. Mahout is called; the number of clusters and the distance calculation are specified
bin/mahout kmeans -i /user/kmeans/vectors -c /user/kmeans/input -o /user/kmeans/output -k 200 -dm CosineSimilarity -x 20 -ow
4. The sparse vector output is then converted back to a delimited format
5. Textual attributes will be appended back to the record, with numeric values preserved for ad-hoc distance comparison of members within a cluster
*Diagram from Programming Collective Intelligence by Toby Segaran
In Datameer, you select the columns... and get the results
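Steps 1 and 5 (replacing textual attributes with numeric keys, then appending the text back afterwards) can be sketched as follows. This is a minimal illustration, not the speakers' code. The function names are hypothetical, and it assigns keys 1, 2, 3, ... per column in order of first appearance, whereas the slide uses its own numbering (for example 101 and 102 for companies).

```python
def encode_records(records, columns):
    """Replace each textual attribute with an integer surrogate key.

    Returns (encoded_rows, lookups), where lookups[column] maps each
    original text value to the key that replaced it.
    """
    lookups = {col: {} for col in columns}
    encoded = []
    for rec in records:
        row = {}
        for col in columns:
            mapping = lookups[col]
            # The first time a value is seen it gets the next unused integer.
            row[col] = mapping.setdefault(rec[col], len(mapping) + 1)
        encoded.append(row)
    return encoded, lookups


def decode_record(row, lookups):
    """Step 5: look the original text back up from the numeric keys."""
    reverse = {col: {key: text for text, key in mapping.items()}
               for col, mapping in lookups.items()}
    return {col: reverse[col][key] for col, key in row.items()}


records = [
    {"Location": "New Jersey", "Company": "Caserta", "Favorite Algo": "K-Means"},
    {"Location": "California", "Company": "Datameer", "Favorite Algo": "K-Means"},
]
encoded, lookups = encode_records(records, ["Location", "Company", "Favorite Algo"])
print(encoded)
print(decode_record(encoded[0], lookups))
```

Keeping the lookup tables around is what makes step 5 possible: the clustering output only carries the numeric keys, so the text has to be joined back from the mapping built in step 1.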
25. Quality Comparison
[Figure: six scatter plots of Column B against Column A, each labeled with its ColumnDependency(A,B) value: 0.5, 0.5, 0, 1, 0.5 and 1. Two of the panels plot Column B (STRING) against Column A (NUMBER).]
27. Decision Tree Overview
Goal: Create a model that predicts the value of a target based on several inputs.
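To make the goal concrete, here is a small decision tree learner in plain Python. It is a sketch for illustration only: it assumes numeric inputs, chooses splits by Gini impurity, and is not the algorithm used by rpart or Datameer. All function names and the toy data are made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree of nested dicts; each leaf is the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Toy inputs (two made-up measurements per item) and their target classes.
rows = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (6.0, 2.5), (5.9, 2.1)]
labels = ["small", "small", "medium", "medium", "large", "large"]
tree = build_tree(rows, labels)
print(predict(tree, (5.0, 1.6)))
```

The inputs here are the two measurements and the target is the size class; the tree learns thresholds on the inputs that separate the classes, then applies them to unseen rows.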
28. Ease of Use
First, you need to code...
# Install and load the rpart package for recursive partitioning trees
install.packages("rpart")
library(rpart)
# Load the iris measurements and class labels
treeInput <- read.csv("/PathToData/iris.csv")
# Fit a tree that predicts class from the four measurements
fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth,
             data = treeInput)
# Draw the tree and label each node with its observation counts
par(mfrow = c(1, 2), xpd = NA)
plot(fit)
text(fit, use.n = TRUE)
And then run the results...
And then write more code to scale...
In Datameer, you select the columns... and get the results
29. Ease of Use
In Datameer, you select the columns... and get the results
First, select the data...
Second, you configure the settings...
And then see the results
32. Recommendations Overview
Increased revenue
Your customers expect them
What makes a good recommendation?
The combination of algorithms and Hadoop makes an effective recommendations platform achievable
33. Ease of Use
First, the set up...
# run factorization of ratings matrix
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
  --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 \
  --numThreadsPerSolver 2
# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
  --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
  --numRecommendations 6 --maxRating 5 --numThreads 2
And then run the results...
1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]
In Datameer, you select the columns... and get the results
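The second command turns the factorized matrices into per-user recommendation lists. The general idea can be sketched as follows, assuming the usual matrix factorization model in which a predicted rating is the dot product of a user's feature vector and an item's feature vector. This is a simplified illustration, not Mahout's code; the function name, argument names and sample vectors are hypothetical.

```python
def recommend(user_vec, item_vecs, already_rated, num_recommendations, max_rating):
    """Rank unseen items for one user by predicted rating.

    user_vec: the user's feature vector (one row of U).
    item_vecs: item id -> feature vector (rows of M).
    already_rated: item ids the user has rated, which are skipped.
    """
    scored = []
    for item_id, item_vec in item_vecs.items():
        if item_id in already_rated:
            continue
        # Predicted rating is the dot product of the two feature vectors,
        # capped at the highest rating the scale allows.
        score = sum(u * m for u, m in zip(user_vec, item_vec))
        scored.append((item_id, min(score, max_rating)))
    # Highest predicted rating first; ties broken by item id for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:num_recommendations]


user_vec = [1.0, 2.0]
item_vecs = {10: [1.0, 1.0], 20: [3.0, 2.0], 30: [0.5, 0.5], 40: [2.0, 2.0]}
print(recommend(user_vec, item_vecs, already_rated={40},
                num_recommendations=2, max_rating=5.0))
```

Capping at the maximum rating is one way a list can show many items tied at exactly 5.0, as in the sample output on the slide.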
36. Big Data Analytics Process
Integrate
Define
Ad Hoc
Prepare and
Analyze
Deploy
Visualize
Production
37. Clustering
• Leverage Hierarchies
• If possible, use numbering schemes
• Scale the surrogate key of attributes
• Try different cluster sizes
• Avoid numeric similarities when building your data
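One possible reading of the hierarchy, numbering-scheme and scaling tips is sketched below. Because k-means treats surrogate keys as coordinates, keys that happen to be numerically close look "similar" even when the underlying values are unrelated. Building the key from a parent level and a member level, with a scale factor between them, keeps related values close and unrelated ones far apart. The region/state hierarchy, the function name and the scale of 100 are assumptions chosen for illustration.

```python
def hierarchical_key(parent_index, member_index, scale=100):
    """Surrogate key whose high part encodes the parent and low part the member."""
    if not 0 <= member_index < scale:
        raise ValueError("member_index must fit below the scale factor")
    return parent_index * scale + member_index


# Naive keys in alphabetical order: New Jersey looks exactly as close to
# California as it does to New York.
naive = {"California": 1, "New Jersey": 2, "New York": 3}

# Hierarchical keys: region 1 = Northeast, region 2 = West (hypothetical).
scaled = {
    "New Jersey": hierarchical_key(1, 1),
    "New York": hierarchical_key(1, 2),
    "California": hierarchical_key(2, 1),
}

for name, keys in (("naive", naive), ("scaled", scaled)):
    to_ny = abs(keys["New Jersey"] - keys["New York"])
    to_ca = abs(keys["New Jersey"] - keys["California"])
    print(f"{name}: NJ-NY distance {to_ny}, NJ-CA distance {to_ca}")
```

With the scaled keys, two states in the same region differ by 1 while states in different regions differ by about 100, so the distance calculation reflects the hierarchy instead of an accident of numbering.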
38. Recommendations
• Leverage a combination of algorithms
• Clustering is your friend!
• Treat cold start situations differently
• Think about ranking
• Don’t let recommendations go wild
[Diagram labels: K-Means: Similar; Item-Based; Item Similarity; Best Recommendations]
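As one hedged illustration of combining algorithms and treating cold start differently: when a user has no personalized recommendations yet, fall back to the items most popular within that user's cluster. This is an interpretation of the tips, not the speakers' implementation; every name and the sample data are hypothetical.

```python
from collections import Counter


def cluster_popular_items(cluster_id, purchases_by_cluster, n):
    """Most frequently purchased items among members of one cluster."""
    counts = Counter(purchases_by_cluster.get(cluster_id, []))
    # Most popular first; ties broken by item name for stable output.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:n]]


def recommend_for(user_id, personalized, cluster_of, purchases_by_cluster, n=3):
    """Use personalized results when available, otherwise the cluster fallback."""
    recs = personalized.get(user_id, [])
    if recs:
        return recs[:n]
    # Cold start: no history for this user, so rely on the user's segment.
    return cluster_popular_items(cluster_of[user_id], purchases_by_cluster, n)


personalized = {"karen": ["item-7", "item-2"]}
cluster_of = {"karen": "A", "new-user": "B"}
purchases_by_cluster = {"B": ["item-5", "item-9", "item-5", "item-1", "item-9", "item-5"]}

print(recommend_for("karen", personalized, cluster_of, purchases_by_cluster))
print(recommend_for("new-user", personalized, cluster_of, purchases_by_cluster))
```

The cluster assignment for a brand-new user would have to come from profile attributes alone, since there is no behavior to cluster on yet.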