Dots20161029 myui

Apache Hivemall:
Machine Learning Library for
Apache Hive/Spark/Pig
Research Engineer
Makoto YUI @myui
<myui@treasure-data.com>
2016/10/29 @Dots

A little about me …
• 2015.04~ Research Engineer at Treasure Data, Inc.
  • My mission is developing ML-as-a-Service in a Hadoop-as-a-Service company
• 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
  • Developed Hivemall as a personal research project
• 2009.03 Ph.D. in Computer Science from NAIST
  • Majored in Parallel Data Processing, not ML, at the time
• Visiting scholar at CWI, Amsterdam and Univ. Edinburgh

Treasure Data
Hiro Yoshikawa, CEO: open source business veteran
Kaz Ota, CTO: founder of the world’s largest Hadoop group
Sada Furuhashi, Chief Architect: invented Fluentd, MessagePack
Today: 100+ employees, 30M+ funding
2015: New office in Seoul, Korea
2013: New office in Tokyo, Japan
2012: Founded in Mountain View, CA
Investors
Jerry Yang: Yahoo! Founder
Bill Tai: Angel Investor
Yukihiro Matsumoto: Ruby Inventor
Sierra Ventures - Tim Guleri: Enterprise Software
Scale Ventures - Andy Vitus: B2B SaaS

Big Data Stats in Treasure Data

We Open-source! TD invented ..
• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Streaming Query Processor
• Machine learning on Hadoop
• Workflow engine (Beta): digdag.io

Treasure Data’s Solution

Agenda
1. What is Hivemall (introduction)
2. How to use Hivemall
3. Roadmap and coming new features

Hivemall entered Apache Incubator
on Sept 13, 2016 🎉
hivemall.incubator.apache.org
@ApacheHivemall

Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  • Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  • Hivemall on Apache Pig
  • Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  • Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Project mentors (Champion and Nominated Mentors)
• Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop/Incubator PMC member

What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
Multi/Cross platform, Versatile, Scalable, Ease-of-use

Hivemall is easy and scalable …
Ease-of-use: ML made easy for SQL developers
Scalable: born to be parallel and scalable
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This SQL query automatically runs in parallel on a Hadoop cluster

Hivemall is Multi/Cross platform ..
Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive

Hivemall’s Technology Stack

Hivemall on Apache Hive

Hivemall on Apache Spark Dataframe

Hivemall on SparkSQL

Hivemall on Apache Pig

Hivemall is a Versatile library ..
✓ Hivemall is not only for Machine Learning
✓ Hivemall provides a bunch of generic utility functions
Each organization has its own set of UDFs for data preprocessing!
Don’t Repeat Yourself!

Hivemall generic functions
• Array and Map
• Bit and compress
• String and NLP
We welcome contributions of your generic UDFs to Hivemall!

List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting the probability of a positive class
Factorization Machines are good where features are sparse and categorical

List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items

Top-k query processing
List top-2 students for each class
Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60
Output:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60

SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2

SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t

Top-k query processing by RANK OVER()
partition by class -> on each node: sort by class, score -> rank over() -> filter rank <= 2

Top-k query processing by EACH_TOP_K
distribute by class -> on each node: sort by class -> each_top_k -> output only K items

Comparison between RANK and EACH_TOP_K
• RANK: sort by class, score -> rank over() -> filter rank <= 2. Sorting is heavy, and all rows need to be processed.
• EACH_TOP_K: distribute by class -> sort by class -> each_top_k -> output only K items. A bounded priority queue is utilized.
each_top_k is very efficient where the number of classes is large
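The difference between the two plans can be sketched in plain Python. This is only an illustration of the bounded-priority-queue idea, not Hivemall's implementation; the function names `top_k_by_sort` and `each_top_k_sketch` are hypothetical, and the rows are the (student, class, score) example from the earlier slide.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def top_k_by_sort(rows, k):
    """rank() over style: fully sort by (class, score desc), then keep rank <= k."""
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    out = []
    for cls, group in groupby(ordered, key=itemgetter(1)):
        for rank, (student, _, score) in enumerate(group, start=1):
            if rank > k:
                break
            out.append((rank, score, cls, student))
    return out

def each_top_k_sketch(rows, k):
    """Rows only need to be grouped by class; a heap of at most k entries keeps the best scores."""
    out = []
    # Stands in for DISTRIBUTE BY class SORT BY class (scores are NOT sorted here).
    grouped = sorted(rows, key=itemgetter(1))
    for cls, group in groupby(grouped, key=itemgetter(1)):
        heap = []  # min-heap of (score, student), bounded to k entries
        for student, _, score in group:
            if len(heap) < k:
                heapq.heappush(heap, (score, student))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, student))
        for rank, (score, student) in enumerate(sorted(heap, reverse=True), start=1):
            out.append((rank, score, cls, student))
    return out

rows = [(1, "b", 70), (2, "a", 80), (3, "a", 90), (4, "b", 50), (5, "a", 70), (6, "b", 60)]
result = each_top_k_sketch(rows, 2)
```

Within each class the heap never holds more than k entries, so the per-class cost depends on k rather than on a full sort by score.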

Performance reported by TD customer
• 1,000 students in each class
• 20 million classes
RANK over() query does not finish in 24 hours :(
EACH_TOP_K finishes in 2 hours :)
For details, refer to
https://speakerdeck.com/kaky0922/hivemall-meetup-20160908

Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LOF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
  • http://www.slideshare.net/eventdotsjp/hivemall
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo (minne.com)
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
• User score calculation
  • Algorithm: Regression
  • Klout (klout.com, influencer marketing)
  • bit.ly/klout-hivemall

Churn Detection of Monthly Payment Service
OISIX, a leading food delivery service company in Japan, used Hivemall’s Logistic Regression to get churn probability
Churn rate dropped almost by half by giving gift points to customers predicted to leave :)


How to use Hivemall
[Figure: ML workflow. Training: feature vectors and labels feed machine learning to build a prediction model. Prediction: the model maps a feature vector to a label. Step shown: Data preparation]

How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall
[Figure: ML workflow diagram. Step shown: Feature Engineering]

How to use Hivemall - Feature Engineering
Applying Min-Max Normalization: transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
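As a rough illustration of what min-max normalization computes, here is a minimal Python sketch. It is not Hivemall's code; the handling of a zero-width range (returning 0.5) is an assumption made for this example.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumption for this sketch: a degenerate range maps to the midpoint
    return (value - min_value) / (max_value - min_value)

labels = [-3.0, -1.0, 1.0]
scaled = [rescale(v, min(labels), max(labels)) for v in labels]
```

In the view above, `${min_label}` and `${max_label}` play the role of `min(labels)` and `max(labels)` and would be computed beforehand over the training table.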

How to use Hivemall
[Figure: ML workflow diagram. Step shown: Training]

How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) -- map-only task to learn a prediction model
    as (feature,weight)
  FROM train
) t
GROUP BY feature -- shuffle map outputs to reducers by feature
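The map/reduce split in this query can be mimicked in a few lines of Python. This is a simplified sketch under stated assumptions, not Hivemall's `logress`: each "mapper" runs plain SGD for logistic regression on its own shard (the learning rate `eta`, the epoch count, and the dict-based feature format are all made up for illustration), and the "reducers" average the emitted weights per feature.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_shard(rows, eta=0.1, epochs=20):
    """One 'mapper': SGD for logistic regression over its own shard only."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            p = sigmoid(sum(w[f] * v for f, v in features.items()))
            for f, v in features.items():
                w[f] += eta * (label - p) * v  # gradient step on the logistic loss
    return list(w.items())  # emitted as (feature, weight) pairs

def average_models(emitted):
    """The 'reducers': GROUP BY feature, avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

shard1 = [({"a": 1.0}, 1), ({"b": 1.0}, 0)]
shard2 = [({"a": 1.0}, 1), ({"c": 1.0}, 0)]
emitted = train_shard(shard1) + train_shard(shard2)
model = average_models(emitted)
```

No mapper needs to see another mapper's data or weights, which is why the training step can run as a map-only task.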

How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote to use negative or positive weights for avg
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature
Vote example: +0.7, +0.3, +0.2, -0.1, +0.7
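One plausible reading of "vote to use negative or positive weights" is sketched below in Python. It is an assumption about the behaviour, not a copy of Hivemall's `voted_avg`: the sign with more votes wins and only the weights of that sign are averaged. How ties and zero weights are treated here is a choice made for this sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: on a tie the negative side is used; zero weights do not vote.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0

# The weights shown on the slide: four positive votes against one negative.
avg = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

With the slide's example the positive side wins, so the single negative weight is ignored instead of pulling the average down.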

How to use Hivemall
[Figure: ML workflow diagram. Step shown: Prediction]

How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
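What the join computes can be shown with a small Python sketch. The function name `predict`, the sample weights, and the feature ids are made up for illustration; `test_rows` stands in for `testing_exploded` with one (rowid, feature) pair per line.

```python
import math

def predict(test_rows, model):
    """test_rows: (rowid, feature) pairs; model: feature -> weight."""
    totals = {}
    for rowid, feature in test_rows:
        # LEFT OUTER JOIN: a feature missing from the model adds nothing to the sum.
        # Simplification: SQL would yield NULL if no feature of a row matched;
        # this sketch keeps the sum at 0.0 instead.
        totals[rowid] = totals.get(rowid, 0.0) + model.get(feature, 0.0)
    return {rowid: 1.0 / (1.0 + math.exp(-s)) for rowid, s in totals.items()}

model = {"f1": -0.5, "f2": 0.1}  # hypothetical weights
test_rows = [(1, "f1"), (1, "f2"), (2, "unseen")]
probs = predict(test_rows, model)
```

Each test row only touches the weights of the features it actually contains, which matches the slide's point that the whole model never has to sit in memory at once.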

Real-time prediction
[Figure: batch training on Hadoop builds a prediction model from feature vectors and labels; the model is exported for online prediction on an RDBMS, which maps a feature vector to a label]
bit.ly/hivemall-rtp

Export Prediction Model to an RDBMS
Periodical export is very easy in Treasure Data (TD export to any RDBMS)
Prediction model (feature, weight):
103  -0.4896543622016907
104  -0.0955817922949791
105  0.12560302019119263
106  0.09214721620082855

Real-time Prediction on MySQL
SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  prediction_model m ON (t.feature = m.feature)
Index lookups are very efficient in RDBMSs!

Online Prediction by Apache Streaming

RandomForest in Hivemall
Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest


Roadmap
• IP clearance and project/repository site setup
• Create contribution guidelines
• Move repository from GitHub to ASF
• Add more tests and documentation
• Initial Apache release will be in Dec or Jan

Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time-series data
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.

Change-point detection by Singular Spectrum Transformation
Fewer hyper-parameters than ChangeFinder :)
T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.

Evaluation Metrics

Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Example: map ages into 3 bins
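Quantile-based binning can be sketched in Python as follows. This is an illustration under assumptions, not the Hivemall function: the helper names, the way boundaries are picked from the sorted values, and the sample ages are all made up for this example.

```python
import bisect

def quantile_boundaries(values, num_bins):
    """Pick num_bins - 1 cut points so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // num_bins] for i in range(1, num_bins)]

def to_bin(value, boundaries):
    """Return the 0-based index of the bin that value falls into."""
    return bisect.bisect_right(boundaries, value)

ages = [5, 12, 19, 23, 31, 38, 44, 52, 67]  # hypothetical sample
bounds = quantile_boundaries(ages, 3)
bins = [to_bin(a, bounds) for a in ages]
```

Because the cut points come from the data's own distribution, each of the 3 bins ends up with a similar number of rows even when the ages are unevenly spread.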

Feature Selection – Signal Noise Ratio

Feature Selection – Chi-Square

Feature Transformation – Onehot encoding
Maps a categorical variable to a unique number starting from 1
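The mapping described here can be sketched in Python. This is only an illustration of the idea; the helper name `fit_ids` and the first-seen ordering of the ids are assumptions for this example, not a statement about how Hivemall assigns them.

```python
def fit_ids(values):
    """Assign each distinct category a unique integer id, starting from 1."""
    ids = {}
    for v in values:
        if v not in ids:
            ids[v] = len(ids) + 1  # assumption: ids follow first-seen order
    return ids

species = ["cat", "dog", "cat", "bird"]  # hypothetical categorical column
ids = fit_ids(species)
encoded = [ids[v] for v in species]
```

The resulting ids can then be used as feature indices, with one index switched on per row for that categorical column.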

Other new features to come
✓ Spark 2.0 Dataframe support
✓ XGBoost Integration
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization

Conclusion and Takeaway
Hivemall is a machine learning library that is …
Multi/Cross platform, Versatile, Scalable, Ease-of-use
Hivemall’s Positioning
• For Data Engineers who need ML
• Deep Learning is out of scope
• Recommendation is high-priority for us
We welcome your contributions to Apache Hivemall :)
hivemall.incubator.apache.org

Any questions or comments?
