2. Little about me …
➢ 2015.04~ Research Engineer at Treasure Data, Inc.
• My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
➢ 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
• Developed Hivemall as a personal research project
➢ 2009.03 Ph.D. in Computer Science from NAIST
• Majored in Parallel Data Processing, not ML then
➢ Visiting scholar in CWI, Amsterdam and Univ. Edinburgh
2016/10/29 @Dots 2
3. Treasure Data
Hiro Yoshikawa, CEO: Open source business veteran
Kaz Ota, CTO: Founder - world’s largest Hadoop group
Sada Furuhashi, Chief Architect: Invented Fluentd, MessagePack
TODAY: 100+ Employees, 30M+ funding
2015: New office in Seoul, Korea
2013: New office in Tokyo, Japan
2012: Founded in Mountain View, CA
Investors
Jerry Yang (Yahoo! Founder)
Bill Tai (Angel Investor)
Yukihiro Matsumoto (Ruby Inventor)
Sierra Ventures - Tim Guleri (Enterprise Software)
Scale Ventures - Andy Vitus (B2B SaaS)
21. List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting the probability of the positive class
Factorization Machines work well where features are sparse and categorical
23. Top-k query processing
List top-2 students for each class
Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
Output:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60
24. Top-k query processing
List top-2 students for each class
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2
25. Top-k query processing
List top-2 students for each class
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t
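The each_top_k query above can be read as "walk over rows that are already clustered by class and keep only the best k per class". The Python sketch below illustrates that idea on the slide's student table. It is not Hivemall's implementation: the function name, the tuple layout, and the tie handling (ties are cut off at k rather than sharing a rank) are assumptions made for illustration.

```python
import heapq
from itertools import groupby


def each_top_k(k, rows, group_key, cmp_key):
    """Yield (rank, cmp_value, row) for the top-k rows of each group.

    Assumes `rows` is already clustered by group, which is what
    DISTRIBUTE BY class SORT BY class arranges in the query.
    """
    for _, members in groupby(rows, key=group_key):
        # nlargest only keeps k candidates at a time for the group.
        top = heapq.nlargest(k, members, key=cmp_key)
        for rank, row in enumerate(top, start=1):
            yield rank, cmp_key(row), row


# (student, class, score) rows from the slide
students = [
    (1, "b", 70), (2, "a", 80), (3, "a", 90),
    (4, "b", 50), (5, "a", 70), (6, "b", 60),
]

# Stand-in for DISTRIBUTE BY class SORT BY class
clustered = sorted(students, key=lambda r: r[1])

result = list(
    each_top_k(2, clustered, group_key=lambda r: r[1], cmp_key=lambda r: r[2])
)
for rank, score, (student, cls, _) in result:
    print(rank, score, cls, student)
```

Unlike the rank() version, this never has to order a whole partition before filtering, which is the point of the comparison between the two queries.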
31. Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
http://www.slideshare.net/eventdotsjp/hivemall
32. Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
minne.com
33. Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
34. Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
• User score calculation
  • Algorithm: Regression
  • Klout
bit.ly/klout-hivemall
Influencer marketing
klout.com
38. How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
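To make the row format concrete, here is a small Python sketch of how one text line would map onto the three columns of the table above: tab-separated fields, with the last field split on commas into the features array. The sample line and its "index:value" feature strings are made up for illustration and are not taken from the E2006-tfidf dataset.

```python
def parse_row(line):
    """Split one tab-delimited line into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    # COLLECTION ITEMS TERMINATED BY "," -> one array element per item
    items = features.split(",") if features else []
    return int(rowid), float(label), items


# Hypothetical input line; the feature strings are illustrative only.
sample = "1\t-3.5\t101:0.2,205:0.7\n"
print(parse_row(sample))
```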
43. How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..)
    as (feature,weight)
  FROM train
) t
GROUP BY feature
Training by logistic regression
The inner query runs as a map-only task that learns a prediction model
Map outputs are shuffled to reducers by feature
Reducers perform model averaging in parallel
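The train-then-average flow described above can be sketched in plain Python. Each "mapper" learns weights on its own partition with a simple SGD update for logistic loss, and the "reducer" step averages the weights per feature, mirroring avg(weight) ... GROUP BY feature. The function names, the learning rate, and the dict-based feature encoding are assumptions for illustration; this is not how logress is implemented.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1):
    """One mapper: SGD for logistic regression over its own rows.

    rows: iterable of (features, label), features is {name: value},
    label is 0 or 1. Returns {feature: weight}.
    """
    w = defaultdict(float)
    for features, label in rows:
        z = sum(w[f] * v for f, v in features.items())
        p = 1.0 / (1.0 + math.exp(-z))
        for f, v in features.items():
            w[f] += eta * (label - p) * v
    return dict(w)


def average_models(mapper_outputs):
    """Reducer side: average the weights emitted for each feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in mapper_outputs:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


partitions = [
    [({"x": 1.0}, 1), ({"y": 1.0}, 0)],
    [({"x": 1.0}, 1)],
]
model = average_models(train_partition(p) for p in partitions)
print(model)
```

As with the SQL avg, a feature is averaged only over the mappers that actually emitted a weight for it.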
44. How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features,label)
      as (feature,weight)
  FROM
    news20b_train
) t
GROUP BY feature
Training of Confidence Weighted (CW) Classifier
voted_avg votes on whether to use the negative or the positive weights for the average
Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
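A minimal Python sketch of the voting idea, applied to the five example weights: count the positive and the negative weights, then average only the side that wins the vote. The tie rule (falling back to the negative side), the handling of zeros (ignored), and the 0.0 result for empty input are assumptions of this sketch, not documented behavior of voted_avg.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # assumption: nothing to vote on
    if len(positives) > len(negatives):
        return sum(positives) / len(positives)
    # assumption: ties fall back to the negative side
    return sum(negatives) / len(negatives)


example = [+0.7, +0.3, +0.2, -0.1, +0.7]
print(voted_avg(example))
```

Here four of the five weights are positive, so the result is the mean of 0.7, 0.3, 0.2 and 0.7; a plain average would be pulled down by the single -0.1.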
46. How to use Hivemall - Prediction
CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model
No need to load the entire model into memory
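The join-and-aggregate logic of the prediction query can be mirrored in a few lines of Python. Each (rowid, feature) pair of the exploded test data looks up its weight in the model; features missing from the model contribute nothing, as a NULL would under sum; the per-row total then goes through the sigmoid. The sample model and rows are made up, and the dict lookup is only for brevity: in Hive the same work is done by a distributed join, which is why the whole model never has to sit in one process's memory.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """Return {rowid: prob}; prob is None if no feature matched."""
    totals = {}
    for rowid, feature in testing_exploded:
        totals.setdefault(rowid, None)   # LEFT OUTER JOIN keeps every row
        weight = model.get(feature)      # None plays the role of NULL
        if weight is not None:
            totals[rowid] = (totals[rowid] or 0.0) + weight
    return {
        rowid: None if total is None else sigmoid(total)
        for rowid, total in totals.items()
    }


# Hypothetical model and exploded test rows
lr_model = {"a": 1.0, "b": -1.0}
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "zzz"), (3, "zzz")]
probs = predict(rows, lr_model)
print(probs)
```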
64. Other new features to come
✓ Spark 2.0 DataFrame support
✓ XGBoost Integration
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization