Introduction to Apache Hivemall v0.5.0:
Machine Learning on Hive/Spark
Makoto YUI @myui
ApacheCon North America 2018
Takashi Yamamuro @maropu
@ApacheHivemall
1). Principal Engineer,
2). Research Engineer,
Plan of the talk
1. Introduction to Hivemall
   A quick walk-through of features, usage, what's new in v0.5.0, and the future roadmap
2. Hivemall on Spark
   The new top-k join enhancement, and plans for supporting Spark 2.3 and feature selection
We made the first Apache release, v0.5.0, on March 3rd, 2018!
hivemall.incubator.apache.org
We plan to start voting on the 2nd Apache release (v0.5.2) next month (Oct 2018).
What’s new in v0.5.0?
Anomaly/Change Point Detection
Algorithm: ChangeFinder, SST
Topic Modeling (Soft Clustering)
Algorithm: LDA, pLSA
Hivemall on Spark 2.0/2.1/2.2
SparkSQL/DataFrame support, Top-k data processing
What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
Ease-of-use, Scalable, Multi/Cross platform, Versatile
Hivemall is easy and scalable …
ML made easy for SQL developers
Born to be parallel and scalable
Ease-of-use
Scalable
100+ lines
of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
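The query above trains partial models in parallel map tasks and averages their weights per feature in the reducers. A minimal Python sketch of that idea follows; `train_partition` and `average_models` are hypothetical names for illustration, and the plain SGD update is an assumption, not Hivemall's actual `logress` implementation.

```python
import math
from collections import defaultdict

def train_partition(rows, eta=0.1, epochs=5):
    """Train logistic regression with SGD on one partition (the map-only step).

    rows: list of (features, label), features is a dict {name: value}, label is 0 or 1.
    Returns (feature, weight) pairs, like the rows emitted by the inner SELECT.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return list(w.items())

def average_models(partial_models):
    """Average weights per feature (the GROUP BY feature / avg(weight) step)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in partial_models:
        for feature, weight in model:
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two partitions trained independently, then merged by averaging.
part1 = [({"bias": 1.0, "x": 1.0}, 1), ({"bias": 1.0, "x": -1.0}, 0)]
part2 = [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)]
model = average_models([train_partition(part1), train_partition(part2)])
```

Each partition only needs its own rows, which is why the training step can run as a map-only task.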
Hivemall is Multi/Cross platform
Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
Hivemall’s Technology Stack
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
Hivemall on Apache Hive
Hivemall on Apache Spark DataFrame
Hivemall on SparkSQL
Hivemall on Apache Pig
Online Prediction by Apache Streaming
Hivemall is a Versatile library
✓ Not only for machine learning
✓ Provides a bunch of generic utility functions
Each organization has its own set of UDFs for data preprocessing.
Don’t Repeat Yourself!
Hivemall generic functions
Array and Map
Bit and compress
> BASE91
> UNBASE91
String and NLP
> NORMALIZE_UNICODE
> SPLIT_WORDS
> IS_STOPWORD
> TOKENIZE
> TOKENIZE_JA/CN
> TF/IDF
> SINGULARIZE
Geo Spatial
> TILE
> MAP_URL
> HAVERSINE_DISTANCE
Top-k processing
JSON
> TO_JSON
> FROM_JSON
Brickhouse UDFs are merged in the v0.5.2 release.
We welcome contributions of your generic UDFs to Hivemall.
Top-k query processing
List top-2 students for each class

student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc)
      as rank
  FROM table
) t
WHERE rank <= 2

The RANK() OVER query does not finish in 24 hours :-(
on data with 20 million MOOC classes and an average of 1,000 students in each class.
Top-k query processing
List top-2 students for each class

SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t

EACH_TOP_K finishes in 2 hours :-)
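To make the per-group top-k idea concrete, here is a minimal Python sketch that keeps a k-length heap per group instead of ranking every row. The function below is a hypothetical stand-in written for illustration; it borrows the name and output column order of `each_top_k` from the query above, but its tie handling and the fact that it does not require sorted input are assumptions, not Hivemall's actual behavior.

```python
import heapq
from collections import defaultdict

def each_top_k(k, rows):
    """rows: iterable of (group, score, payload).

    Returns (rank, score, group, payload) tuples with the k highest scores per group.
    """
    heaps = defaultdict(list)  # group -> min-heap holding at most k items
    for seq, (group, score, payload) in enumerate(rows):
        item = (score, -seq, payload)  # -seq breaks ties in favour of earlier rows
        heap = heaps[group]
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the current smallest
    result = []
    for group in sorted(heaps):
        ranked = sorted(heaps[group], reverse=True)
        for rank, (score, _, payload) in enumerate(ranked, start=1):
            result.append((rank, score, group, payload))
    return result

# (class, score, student) rows from the table on the slide
students = [("b", 70, 1), ("a", 80, 2), ("a", 90, 3),
            ("b", 50, 4), ("a", 70, 5), ("b", 60, 6)]
top2 = each_top_k(2, students)
```

Memory per group stays bounded by k, whereas a window function has to order every row of each partition before filtering.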
Map tiling functions
Tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^zoom
Zoom=10
Zoom=15
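A minimal Python sketch of the tile formula above. It assumes `xtile` and `ytile` follow the common Web Mercator "slippy map" tiling convention; the slides do not spell out those two functions, so treat their bodies as an assumption rather than Hivemall's exact code.

```python
import math

def xtile(lon, zoom):
    """Column index of the tile containing the longitude (assumed Web Mercator tiling)."""
    n = 2 ** zoom
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    return min(max(x, 0), n - 1)  # clamp to the valid range

def ytile(lat, zoom):
    """Row index of the tile containing the latitude (assumed Web Mercator tiling)."""
    n = 2 ** zoom
    lat_rad = math.radians(lat)
    y = int(math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n))
    return min(max(y, 0), n - 1)

def tile(lat, lon, zoom):
    """Single integer tile id: xtile + ytile * 2^zoom, as in the formula on the slide."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (2 ** zoom)
```

A larger zoom gives smaller tiles, so nearby points share a tile id only at coarse zoom levels.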
List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of a positive class.
Factorization Machines are good where features are sparse and categorical.
Generic Classifier/Regressor
Old style vs. new style (from v0.5.0)
Generic Classifier/Regressor
Available Loss functions
For Binary Classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For Regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
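For readers unfamiliar with the loss names listed above, here is a short Python sketch of several of them using their textbook definitions. The function names, default parameters, and scaling constants are assumptions chosen for illustration and may differ from Hivemall's implementation.

```python
import math

# Binary classification losses: p is the raw prediction, y is the label in {-1, +1}.
def hinge_loss(p, y):
    return max(0.0, 1.0 - y * p)

def squared_hinge_loss(p, y):
    return hinge_loss(p, y) ** 2

def log_loss(p, y):
    return math.log1p(math.exp(-y * p))

# Regression losses: p is the prediction, y is the real-valued target.
def squared_loss(p, y):
    return 0.5 * (p - y) ** 2

def epsilon_insensitive_loss(p, y, eps=0.1):
    return max(0.0, abs(p - y) - eps)  # errors smaller than eps cost nothing

def huber_loss(p, y, delta=1.0):
    r = abs(p - y)
    # quadratic near zero, linear for large residuals
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
```

Hinge and log loss penalize confident wrong answers, while the Huber and epsilon-insensitive losses make regression less sensitive to outliers and small errors respectively.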
RandomForest in Hivemall
Ensemble of Decision Trees
Training of RandomForest
Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vector input.
Prediction of RandomForest
Decision Tree Visualization
XGBoost support in Hivemall (beta version)

SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data

SELECT rowid, AVG(predicted) as predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
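The prediction query pairs every test record with every trained model and averages the per-model predictions by row id. A minimal Python sketch of that pattern follows; the models here are plain callables invented for illustration, not real XGBoost models.

```python
from collections import defaultdict
from itertools import product

def ensemble_predict(models, test_rows):
    """Average the predictions of all models for each row id.

    models: list of callables mapping a feature list to a score (hypothetical stand-ins).
    test_rows: list of (rowid, features).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    # product() plays the role of the CROSS JOIN between models and test rows
    for model, (rowid, features) in product(models, test_rows):
        sums[rowid] += model(features)
        counts[rowid] += 1
    # the GROUP BY rowid / AVG(predicted) step
    return {rowid: sums[rowid] / counts[rowid] for rowid in sums}

models = [lambda f: f[0], lambda f: f[0] + 2.0]
predictions = ensemble_predict(models, [(1, [1.0]), (2, [3.0])])
```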
Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items.
Other Supported Algorithms
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ English/Japanese/Chinese Tokenizer
Evaluation metrics
✓ AUC, nDCG, logloss, precision/recall@K, etc.
Feature Engineering – Feature Hashing
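Feature hashing maps arbitrary feature names to a fixed range of integer indices so that the model size stays bounded. The sketch below is a minimal illustration; the "name:value" input format, the helper names, the MD5-based hash, and the bucket count are all assumptions and do not reproduce the hash function Hivemall uses.

```python
import hashlib

def hash_feature(name, num_buckets=2 ** 24):
    """Map a feature name to a stable index in [0, num_buckets)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def hash_features(features, num_buckets=2 ** 24):
    """Replace the name part of each 'name:value' feature with its hashed index."""
    hashed = []
    for feature in features:
        name, _, value = feature.rpartition(":")
        hashed.append(f"{hash_feature(name, num_buckets)}:{value}")
    return hashed

hashed = hash_features(["user#123:1.0", "price:2.5"], num_buckets=16)
```

Different names can collide in the same bucket; a larger `num_buckets` trades memory for fewer collisions.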
Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Example: map ages into 3 bins
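A minimal Python sketch of quantile-based binning, shown on the "ages into 3 bins" example. The helper names and the simple rank-based cut points are assumptions for illustration, not Hivemall's binning functions.

```python
import bisect

def quantile_cuts(values, num_bins):
    """Pick num_bins - 1 cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // num_bins] for i in range(1, num_bins)]

def bin_index(value, cuts):
    """Return the 0-based bin of a value; values equal to a cut go to the upper bin."""
    return bisect.bisect_right(cuts, value)

ages = [10, 20, 30, 40, 50, 60]
cuts = quantile_cuts(ages, 3)
bins = [bin_index(age, cuts) for age in ages]
```

The cut points are learned once from training data and then reused to bin new values consistently.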
Evaluation Metrics
Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LoF)
✓ ChangeFinder
Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA
Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation
Anomaly/Change-point Detection by ChangeFinder
An efficient algorithm for finding change points and outliers in time-series data
J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
Take this…
…and do this!
Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
Online mini-batch LDA
Probabilistic Latent Semantic Analysis - training
Probabilistic Latent Semantic Analysis - predict
What’s new in the coming v0.5.2
✓ Spark 2.3 support
✓ Merged Brickhouse UDFs
✓ Field-aware Factorization Machines
  State-of-the-art method for CTR prediction, often used in Kaggle
  Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR Prediction", Proc. RecSys, 2016.
✓ SLIM recommendation
  Very promising algorithm for top-k recommendation
  Xia Ning and George Karypis, "SLIM: Sparse Linear Methods for Top-N Recommender Systems", Proc. ICDM, 2011.
Future work for v0.6 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ More efficient XGBoost support
✓ LightGBM support
✓ Gradient Boosting
✓ Kafka KSQL UDF porting
PR#91
PR#116
Copyright©2018 NTT corp. All Rights Reserved.
// Downloads Spark v2.3 and launches a spark-shell with Hivemall
K-length
priority queue
Computes top-K rows
by using a priority queue
Only joins top-K rows
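The surviving slide text describes a join that keeps a K-length priority queue and only joins the top-K rows. A minimal Python sketch of that idea follows; the function name, its signature, and the nested-loop strategy are assumptions for illustration and do not reproduce the Hivemall on Spark implementation.

```python
import heapq

def top_k_join(left_rows, right_rows, k, score):
    """For each left row, join only the k right rows with the highest score."""
    joined = []
    for left in left_rows:
        heap = []  # K-length priority queue of (score, -index)
        for idx, right in enumerate(right_rows):
            item = (score(left, right), -idx)  # -idx prefers earlier rows on ties
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        # only the surviving top-K rows are materialized in the join output
        for s, neg_idx in sorted(heap, reverse=True):
            joined.append((left, right_rows[-neg_idx], s))
    return joined

# Join each left value with its 2 nearest right values (score = negative distance).
pairs = top_k_join([1, 10], [0, 2, 9, 11], 2, lambda l, r: -abs(l - r))
```

Compared with joining everything and ranking afterwards, the intermediate result never grows beyond K rows per left row.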
Data Extraction (e.g., by SQL) → Feature Selection (e.g., by scikit-learn) → Selected Features
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, "To Join or Not to Join?: Thinking Twice about Joins before Feature Selection", Proc. SIGMOD, 2016.
Data Extraction + Feature Selection
Join Pruning by Data Statistics
Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs
HiveQL, SparkSQL/DataFrame API, Pig Latin
The 2nd Apache release (v0.5.2) will appear soon!
We welcome your contributions to Apache Hivemall :-)
Thank you! Questions?

Contenu connexe

Tendances

Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsJim Dowling
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBCarol McDonald
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Balancing Automation and Explanation in Machine Learning
Balancing Automation and Explanation in Machine LearningBalancing Automation and Explanation in Machine Learning
Balancing Automation and Explanation in Machine LearningDatabricks
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
 
Hadoop and Storm - AJUG talk
Hadoop and Storm - AJUG talkHadoop and Storm - AJUG talk
Hadoop and Storm - AJUG talkboorad
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark Summit
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Makoto Yui
 

Tendances (20)

Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on Hops
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Balancing Automation and Explanation in Machine Learning
Balancing Automation and Explanation in Machine LearningBalancing Automation and Explanation in Machine Learning
Balancing Automation and Explanation in Machine Learning
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Hadoop and Storm - AJUG talk
Hadoop and Storm - AJUG talkHadoop and Storm - AJUG talk
Hadoop and Storm - AJUG talk
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas Dinsmore
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17Hivemall meets Digdag @Hackertackle 2018-02-17
Hivemall meets Digdag @Hackertackle 2018-02-17
 

Similaire à Introduction to Apache Hivemall v0.5.0

Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Idea behind Apache Hivemall
Idea behind Apache HivemallIdea behind Apache Hivemall
Idea behind Apache HivemallMakoto Yui
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsDataWorks Summit
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSChristoforos Kachris
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 

Introduction to Apache Hivemall v0.5.0

  • 1. Introduction to Apache Hivemall v0.5.0: Machine Learning on Hive/Spark. Makoto YUI (@myui) 1), Takashi Yamamuro (@maropu) 2). @ApacheHivemall. ApacheCon North America 2018. 1) Principal Engineer, 2) Research Engineer.
  • 2. Plan of the talk. 1. Introduction to Hivemall: a quick walk-through of features, usage, what's new in v0.5.0, and the future roadmap. 2. Hivemall on Spark: the new top-k join enhancement, and plans for supporting Spark 2.3 and feature selection.
  • 3. We released the first Apache release, v0.5.0, on Mar 3rd, 2018! hivemall.incubator.apache.org. We plan to start voting for the 2nd Apache release (v0.5.2) next month (Oct 2018).
  • 4. What’s new in v0.5.0? Anomaly/Change Point Detection (algorithms: ChangeFinder, SST). Topic Modeling (Soft Clustering) (algorithms: LDA, pLSA). Hivemall on Spark 2.0/2.1/2.2: SparkSQL/DataFrame support, top-k data processing.
  • 5. What is Apache Hivemall? A scalable machine learning library built as a collection of Hive UDFs. Multi/Cross platform, Versatile, Scalable, Ease-of-use.
  • 6. Hivemall is easy and scalable. Ease-of-use: ML made easy for SQL developers. Scalable: born to be parallel and scalable. 100+ lines of code.
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      This query automatically runs in parallel on Hadoop.
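The query on this slide trains in a map-only task and then averages weights per feature in the reducers. The following Python sketch mimics those two steps on a single machine. It is only an illustration under assumptions: `train_partition`, `average_models`, the learning rate, the epoch count, and the toy data are all hypothetical and are not Hivemall's `logress` implementation.

```python
import math
from collections import defaultdict

def train_partition(rows, eta=0.1, epochs=20):
    """Map-side step (hypothetical): fit sparse logistic-regression weights by SGD."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:  # features: {name: value}, label: 0 or 1
            margin = sum(weights[f] * v for f, v in features.items())
            predicted = 1.0 / (1.0 + math.exp(-margin))
            for f, v in features.items():
                weights[f] += eta * (label - predicted) * v
    return dict(weights)

def average_models(models):
    """Reduce-side step: the equivalent of GROUP BY feature with avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two toy partitions standing in for the input splits of the `train` table.
partitions = [
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)],
    [({"bias": 1.0, "x": 1.5}, 1), ({"bias": 1.0, "x": -1.0}, 0)],
]
lr_model = average_models(train_partition(p) for p in partitions)
```

Each partition is trained independently, so the only data exchanged between the two steps is the (feature, weight) pairs, which is what lets the SQL version run as a map-only task followed by a shuffle.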
  • 7. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/DataFrame API, Pig Latin. Prediction models built by Hive can be used from Spark and, conversely, prediction models built by Spark can be used from Hive.
  • 8. Hivemall’s Technology Stack. Machine Learning: Hivemall, MLlib. Query Processing: Hive, Pig, SparkSQL. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Resource Management: Apache YARN, MESOS. Distributed File System: Hadoop HDFS. Cloud Storage: Amazon S3.
  • 9. Hivemall on Apache Hive
  • 10. Hivemall on Apache Spark DataFrame
  • 11. Hivemall on SparkSQL
  • 12. Hivemall on Apache Pig
  • 13. Online Prediction by Apache Streaming
  • 14. Hivemall is a versatile library: ✓ not only for machine learning; ✓ provides a bunch of generic utility functions. Each organization has its own set of UDFs for data preprocessing. Don’t Repeat Yourself!
  • 15. Hivemall generic functions. Categories: Array and Map; Bit and compress; String and NLP; Geo Spatial; Top-k processing; JSON. Examples: BASE91, UNBASE91, NORMALIZE_UNICODE, SPLIT_WORDS, IS_STOPWORD, TOKENIZE, TOKENIZE_JA/CN, TF/IDF, SINGULARIZE, TILE, MAP_URL, HAVERSINE_DISTANCE, TO_JSON, FROM_JSON. Brickhouse UDFs are merged in the v0.5.2 release. Contributions of your own generic UDFs to Hivemall are welcome.
  • 16. Top-k query processing: list the top-2 students for each class.
      student  class  score
      1        b      70
      2        a      80
      3        a      90
      4        b      50
      5        a      70
      6        b      60
      SELECT * FROM (
        SELECT *, rank() over (partition by class order by score desc) as rank
        FROM table
      ) t
      WHERE rank <= 2
      The RANK() OVER query does not finish within 24 hours on a dataset with 20 million MOOC classes and an average of 1,000 students per class.
  • 17. Top-k query processing: list the top-2 students for each class (same student/class/score table as on the previous slide).
      SELECT each_top_k(2, class, score, class, student) as (rank, score, class, student)
      FROM (
        SELECT * FROM table DISTRIBUTE BY class SORT BY class
      ) t
      EACH_TOP_K finishes in 2 hours.
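One way to see why a per-group top-k can be much cheaper than a full window ranking is that only k rows per group ever need to be held and ordered. The sketch below is a plain-Python illustration of that idea using a bounded min-heap. The function name mirrors the slide, but the signature and the tie-breaking rule (earlier rows win) are assumptions, not Hivemall's implementation.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def each_top_k(k, rows):
    """Yield (rank, score, group, value) for the k best rows of each group.

    `rows` is an iterable of (group, score, value) that is already clustered
    by group, mirroring the DISTRIBUTE BY / SORT BY subquery on the slide.
    """
    for group, members in groupby(rows, key=itemgetter(0)):
        heap = []  # min-heap that never holds more than k entries
        for seq, (_, score, value) in enumerate(members):
            entry = (score, -seq, value)  # -seq: on equal scores, earlier rows win
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)  # evict the current k-th best
        ranked = sorted(heap, reverse=True)
        for rank, (score, _, value) in enumerate(ranked, start=1):
            yield rank, score, group, value

# The student/class/score table from the slide, clustered by class.
rows = [
    ("a", 80, 2), ("a", 90, 3), ("a", 70, 5),
    ("b", 70, 1), ("b", 50, 4), ("b", 60, 6),
]
top2 = list(each_top_k(2, rows))
```

Memory per group is bounded by k, and each row costs at most one heap operation, so no group is ever fully sorted.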
  • 18. Map tiling functions
  • 19. Map tiling functions: Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n. Figures: Zoom=10, Zoom=15.
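The tile formula can be tried out with a small Python sketch. Two things here are assumptions that the slide does not state: `n` is taken to be the zoom level, and `xtile`/`ytile` use the common Web Mercator "slippy map" tile formulas, clamped to the valid tile range.

```python
import math

def xtile(lon, zoom):
    """Column index of the tile containing longitude `lon` (assumed slippy-map formula)."""
    n = 1 << zoom
    x = int((lon + 180.0) / 360.0 * n)
    return min(max(x, 0), n - 1)

def ytile(lat, zoom):
    """Row index of the tile containing latitude `lat` (assumed Web Mercator formula)."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return min(max(y, 0), n - 1)

def tile(lat, lon, zoom):
    """Single integer tile id: xtile + ytile * 2^zoom, as in the slide's formula."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)
```

Because the row index is multiplied by the number of columns, every (x, y) pair at a given zoom maps to a distinct id in the range 0 to 4^zoom - 1, which makes the id usable as a grouping key.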
  • 20. List of Supported Algorithms.
      Classification: ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification
      Regression: ✓ Logistic Regression (SGD) ✓ AdaGrad (logistic loss) ✓ AdaDELTA (logistic loss) ✓ PA Regression ✓ AROW Regression ✓ Factorization Machines ✓ RandomForest Regression
      SCW is a good first choice. Try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines work well where features are sparse and categorical.
  • 21. Generic Classifier/Regressor: old style vs. new style (from v0.5.0).
  • 22. Generic Classifier/Regressor.
      Available loss functions for binary classification: HingeLoss, LogLoss (synonym: logistic), SquaredHingeLoss, ModifiedHuberLoss.
      Available loss functions for regression: Squared Loss, Quantile Loss, Epsilon Insensitive Loss, Squared Epsilon Insensitive Loss, Huber Loss.
      Optimizer: SGD, AdaGrad, AdaDelta, ADAM.
      Regularization: L1, L2, ElasticNet, RDA.
      Other options: iteration support, mini-batch, early stopping.
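For readers unfamiliar with the loss names, here is a small Python sketch of several of them in their common textbook forms. These definitions are assumptions for illustration only: scaling constants, default parameters, and label conventions may differ from what Hivemall actually implements.

```python
import math

# Binary classification: p is the raw prediction, y is the label in {-1, +1}.
def hinge_loss(p, y):
    return max(0.0, 1.0 - y * p)

def log_loss(p, y):
    return math.log1p(math.exp(-y * p))

def squared_hinge_loss(p, y):
    return hinge_loss(p, y) ** 2

# Regression: p is the prediction, y is the real-valued target.
def squared_loss(p, y):
    return 0.5 * (p - y) ** 2

def epsilon_insensitive_loss(p, y, epsilon=0.1):
    # Errors smaller than epsilon are ignored entirely.
    return max(0.0, abs(y - p) - epsilon)

def huber_loss(p, y, delta=1.0):
    # Quadratic near zero, linear for residuals larger than delta.
    r = abs(y - p)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
```

The classification losses depend only on the margin y * p, while the regression losses depend only on the residual y - p, which is why the slide lists them separately.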
  • 23. RandomForest in Hivemall: an ensemble of decision trees.
  • 24. Training of RandomForest. Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vector input.
  • 25. Prediction of RandomForest
  • 26. Decision Tree Visualization
  • 27. Decision Tree Visualization
  • 28. XGBoost support in Hivemall (beta version).
      Training:
      SELECT train_xgboost_classifier(features, label) as (model_id, model)
      FROM training_data
      Prediction:
      SELECT rowid, AVG(predicted) as predicted
      FROM (
        -- predict with each model
        SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        -- join each test record with each model
        FROM xgboost_models CROSS JOIN test_data_with_id
      ) t
      GROUP BY rowid;
  • 29. Supported Algorithms for Recommendation. K-Nearest Neighbor: ✓ Minhash and b-Bit Minhash (LSH variant); ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular). Matrix Completion: ✓ Matrix Factorization; ✓ Factorization Machines (regression). The each_top_k function of Hivemall is useful for recommending top-k items.
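As background on the Minhash entry, the sketch below shows the general technique in Python: the fraction of matching positions in two Minhash signatures estimates the Jaccard similarity of the underlying sets. All names, the CRC32 base hash, and the linear hash family are assumptions chosen for a compact example; this is not Hivemall's Minhash UDF.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus

def make_hashers(num_hashes, seed=42):
    """Random (a, b) pairs defining hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_signature(items, hashers):
    """Signature of a non-empty item set: the minimum hash value per hash function."""
    base = [zlib.crc32(str(item).encode("utf-8")) for item in items]
    return [min((a * h + b) % PRIME for h in base) for a, b in hashers]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

hashers = make_hashers(64)
sig_1 = minhash_signature({"a", "b", "c"}, hashers)
sig_2 = minhash_signature({"b", "c", "d"}, hashers)
estimate = estimate_jaccard(sig_1, sig_2)
```

Signatures have a fixed length regardless of set size, so candidate neighbors can be compared, or bucketed for LSH, without touching the original item sets.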
  • 30. Other Supported Algorithms. Feature Engineering: ✓ Feature Hashing ✓ Feature Scaling (normalization, z-score) ✓ Feature Binning ✓ TF-IDF vectorizer ✓ Polynomial Expansion ✓ Amplifier. NLP: ✓ Basic English text tokenizer ✓ English/Japanese/Chinese tokenizer. Evaluation metrics: ✓ AUC, nDCG, logloss, precision/recall@K, etc.
  • 31. Feature Engineering: Feature Hashing
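Feature hashing (the "hashing trick") maps arbitrary feature names to a fixed range of integer indices so that no dictionary of feature names has to be built. The Python sketch below illustrates the general technique. The function signature, the bucket count, and the use of CRC32 as the hash are assumptions for the example and are not taken from Hivemall.

```python
import zlib

def feature_hashing(features, num_buckets=2 ** 24):
    """Map {feature_name: value} to {bucket_index: value}.

    Names that collide in the same bucket have their values summed.
    CRC32 is used only because it is stable across runs.
    """
    hashed = {}
    for name, value in features.items():
        index = zlib.crc32(name.encode("utf-8")) % num_buckets
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed

row = {"user=alice": 1.0, "country=JP": 1.0, "price": 0.5}
hashed_row = feature_hashing(row, num_buckets=16)
```

A smaller bucket count saves memory at the cost of more collisions, so it is usually chosen well above the expected number of distinct features.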
  • 32. Feature Engineering: Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
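The "ages into 3 bins" example can be sketched in Python with quantile-based bin edges. The helper names `quantile_edges` and `assign_bin`, the open-ended outer bins, and the sample ages are hypothetical and only illustrate the idea. The quantile method is the default of Python's `statistics.quantiles`.

```python
import bisect
import statistics

def quantile_edges(values, num_bins):
    """Bin edges at the quantile cut points, with open-ended outer bins."""
    cuts = statistics.quantiles(values, n=num_bins)  # num_bins - 1 cut points
    return [float("-inf")] + cuts + [float("inf")]

def assign_bin(value, edges):
    """Zero-based index of the bin that a finite value falls into."""
    return bisect.bisect_right(edges, value) - 1

ages = [10, 20, 30, 40, 50, 60]
edges = quantile_edges(ages, 3)
bins = [assign_bin(age, edges) for age in ages]
```

With these ages, each of the three bins receives two values, which is the point of quantile-based (rather than equal-width) binning.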
  • 33. Feature Engineering: Feature Binning
  • 35. Other Supported Features. Anomaly Detection: ✓ Local Outlier Factor (LoF) ✓ ChangeFinder. Clustering / Topic models: ✓ Online mini-batch LDA ✓ Online mini-batch PLSA. Change Point Detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation.
  • 36. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  • 37. Anomaly/Change-point Detection by ChangeFinder: take this…
  • 38. Anomaly/Change-point Detection by ChangeFinder: …and do this!
  • 39. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  • 40. Change-point detection by Singular Spectrum Transformation. T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005. T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  • 41. Online mini-batch LDA
  • 42. Probabilistic Latent Semantic Analysis: training
  • 43. Probabilistic Latent Semantic Analysis: prediction
  • 44. What’s new in the coming v0.5.2: ✓ Spark 2.3 support ✓ Merged Brickhouse UDFs ✓ Field-aware Factorization Machines ✓ SLIM recommendation.
      Field-aware Factorization Machines: a state-of-the-art method for CTR prediction, often used in Kaggle competitions. Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR Prediction", Proc. RecSys, 2016.
      SLIM: a very promising algorithm for top-k recommendation. Xia Ning and George Karypis, "SLIM: Sparse Linear Methods for Top-N Recommender Systems", Proc. ICDM, 2011.
  • 45. Future work for v0.6 and later: ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ More efficient XGBoost support ✓ LightGBM support ✓ Gradient Boosting ✓ Kafka KSQL UDF porting. (PR#91, PR#116)
  • 46. Copyright©2018 NTT corp. All Rights Reserved.
  • 57. // Downloads Spark v2.3 and launches a spark-shell with Hivemall
  • 67. K-length priority queue: computes top-K rows by using a priority queue.
  • 68. K-length priority queue: computes top-K rows by using a priority queue. Only joins top-K rows.
  • 73. Data Extraction (e.g., by SQL); Feature Selection (e.g., by scikit-learn); Selected Features. Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
  • 74. Data Extraction + Feature Selection; Join Pruning by Data Statistics. Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
  • 75. Conclusion and Takeaway: Hivemall is a multi/cross-platform ML library (HiveQL, SparkSQL/DataFrame API, Pig Latin) providing a collection of machine learning algorithms as Hive UDFs/UDTFs. The 2nd Apache release (v0.5.2) will appear soon! We welcome your contributions to Apache Hivemall.
  • 76. Thank you! Questions?
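
The top-K join on slides 67 and 68 keeps a K-length priority queue and joins only the rows that survive it. The following is a minimal Python sketch of that idea, not Hivemall's or Spark's actual implementation: the function names `top_k_per_key` and `top_k_join`, and the `(key, score, payload)` row layout, are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def top_k_per_key(rows, k):
    """Keep the k highest-scoring rows per key with a bounded min-heap.

    rows: iterable of (key, score, payload) tuples (hypothetical layout).
    """
    heaps = defaultdict(list)
    for seq, (key, score, payload) in enumerate(rows):
        heap = heaps[key]
        # seq breaks score ties, so payloads never need to be comparable.
        entry = (score, seq, payload)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            # The heap root is the smallest kept score; evict it.
            heapq.heapreplace(heap, entry)
    return {
        key: [(score, payload) for score, _, payload in sorted(heap, reverse=True)]
        for key, heap in heaps.items()
    }


def top_k_join(left, right, k):
    """Join each (key, value) row of left with only its top-k rows of right."""
    best = top_k_per_key(right, k)
    return [
        (key, value, score, payload)
        for key, value in left
        for score, payload in best.get(key, [])
    ]


if __name__ == "__main__":
    right = [("a", 0.9, "x"), ("a", 0.1, "y"), ("a", 0.5, "z"), ("b", 0.3, "w")]
    left = [("a", "L1"), ("b", "L2")]
    for row in top_k_join(left, right, k=2):
        print(row)
```

Because each heap never holds more than K entries, memory per key stays bounded by K, and the join emits at most K output rows per left row instead of every matching pair.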