Db tech show - hivemall

Introduction to Machine Learning on [logo] using Hivemall
Research Engineer Makoto YUI @myui
<myui@treasure-data.com>
2014/09/17 Talk@Japan DataScientist Society

Ø 2015.04
Joined
Treasure
Data,
Inc.
1st Research
Engineer
in
Treasure
Data
My
mission
in
TD
is
developing
ML-‐as-‐a-‐Service
Ø 2010.04-‐2015.03
Senior
Researcher
at
National

Institute
of
Advanced
Industrial
Science
and

Technology,
Japan.

Worked
on
a
large-‐scale
Machine
Learning
project

and
Parallel
Databases

Ø 2009.03
Ph.D.
in
Computer
Science
from
NAIST
Ø Super
programmer
award
from
the
MITOU

Foundation

Super
creators
in
TD:

Sada
Furuhashi,
Keisuke
Nishida
Who
am

I
?
2014/09/17
Talk@Japan
DataScientist
Society 2

Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)

What is Hivemall
A scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2.
https://github.com/myui/hivemall

What is Hivemall
[Figure: where Hivemall sits in the Hadoop stack, from top to bottom]
Machine Learning: Hivemall
Query Processing: Hive / PIG
Parallel Data Processing Framework: MapReduce (MR v1), MR v2, Apache Tez (DAG processing)
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS

MapReduce and DAG engine
[Figure: a chain of MapReduce jobs, where map (M) and reduce (R) stages read from and write to HDFS between jobs, next to a DAG engine (Tez / Spark) that runs the same stages with HDFS only at the ends]
DAG engine (Tez / Spark): No intermediate DFS reads/writes!

Won IDG's InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem
bit.ly/hivemall-award

List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3

List of Features in Hivemall
Hivemall supports the state-of-the-art online learning algorithms (for classification and regression).
Algorithm                                  News20.binary classification accuracy
Perceptron                                 0.9460
Passive-Aggressive (a.k.a. Online-SVM)     0.9604
LibLinear                                  0.9636
LibSVM/TinySVM                             0.9643
Confidence Weighted (CW)                   0.9656
AROW [1]                                   0.9660
SCW [2]                                    0.9662
Accuracy gets better toward the bottom of the table. CW variants are very smart online ML algorithms.

Why are CW variants so good?
Suppose a binary classification setting that classifies sentences as positive or negative → learn the weight for each word (each word is a feature).
Label       Feature Vector
Positive    I like this author
Negative    I like this author, but found this book dull
A naïve update will reduce both W_like and W_dull at the same rate.
CW variants adjust weights at different rates.

Why are CW variants so good?
[Figure: a conventional learner adjusts only a weight (values shown: 0.6, 0.8), while a CW variant adjusts a weight and its confidence (covariance). Annotation: "At this confidence, the weight is 0.5".]
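The "different rates" idea can be sketched with a diagonal AROW-style update (AROW is one of the CW variants listed earlier). This is a minimal illustration, not Hivemall's implementation: the function name `arow_update`, the hyperparameter `r=1.0`, the binary bag-of-words encoding, and the tokenization of the two example sentences are all assumptions made for this sketch.

```python
from collections import defaultdict

def arow_update(weights, variances, features, label, r=1.0):
    """One diagonal AROW-style update for a set of binary word features.

    weights and variances map word -> float; label is +1 or -1.
    A high variance means low confidence (the word was rarely seen),
    so that word's weight is allowed to move further.
    """
    margin = sum(weights[f] for f in features)
    if margin * label >= 1.0:
        return  # classified with enough margin: nothing to update
    v = sum(variances[f] for f in features)  # total uncertainty of this example
    beta = 1.0 / (v + r)
    alpha = (1.0 - label * margin) * beta
    for f in features:
        weights[f] += alpha * label * variances[f]  # per-feature step size
        variances[f] -= beta * variances[f] ** 2    # confidence grows

weights = defaultdict(float)
variances = defaultdict(lambda: 1.0)  # assumed initial variance

positive = {"i", "like", "this", "author"}
negative = positive | {"but", "found", "book", "dull"}

arow_update(weights, variances, positive, +1)
like_before = weights["like"]
arow_update(weights, variances, negative, -1)

like_step = abs(weights["like"] - like_before)  # "like" was seen before
dull_step = abs(weights["dull"])                # "dull" is new
```

On the negative sentence, `dull_step` comes out larger than `like_step`: the already-seen word "like" has a smaller variance, so its weight is reduced less than the weight of the new word "dull".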

Features to be supported from Hivemall v0.4
1. RandomForest
• classification, regression
2. Gradient Tree Boosting
• classifier, regression
3. Factorization Machine
• classification, regression (factorization)
4. Online LDA
• topic modeling, clustering
Planned to release v0.4 in Oct.
Gradient Boosting and Factorization Machine are often used by data science competition winners (very important for practitioners).

Factorization Machine
Matrix Factorization

Factorization Machine
Context information (e.g., time) can be considered
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

Factorization Machine
Factorization model with degree=2 (2-way interaction)
Terms of the model equation:
• Global bias
• Regression coefficient of the j-th variable
• Pairwise interaction
• Factorization
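The equation image from this slide did not survive extraction. For reference, the degree-2 factorization machine model as commonly written (following the Rendle paper cited on the previous slide) is shown below; the symbol names are the usual ones and are an assumption about what the slide displayed.

```latex
\hat{y}(\mathbf{x}) =
  \underbrace{w_0}_{\text{global bias}}
  + \underbrace{\sum_{j=1}^{n} w_j x_j}_{\text{regression coefficient of the } j\text{-th variable}}
  + \underbrace{\sum_{j=1}^{n} \sum_{j'=j+1}^{n}
      \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}}_{\text{pairwise interaction (factorized)}}
```

Here each variable j has a k-dimensional latent vector v_j, and the inner product of two latent vectors replaces a separate weight for every pair of variables.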

Ø CTR
prediction
of
Ad
click
logs
• Algorithm:
Logistic
regression
• Freakout Inc.,
Smartnews,
and
more
Ø Gender
prediction
of
Ad
click
logs
• Algorithm:
Classification
• Scaleout Inc.
Ø Churn
Detection
• Algorithm:
Regression
• OISIX
and
more
Ø Item/User
recommendation
• Algorithm:
Recommendation
(Matrix
Factorization
/
kNN)

• Wish.com,
DAC,
Real-‐estate
Portal,
and
more
Ø Value
prediction
of
Real
estates
• Algorithm:

Regression
• Livesense
Industry
use
cases
of
Hivemall
162014/09/17
Talk@Japan
DataScientist
Society


Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.
That's why I built Hivemall.

Data Moving in Data Analytics
[Figure: event data flows through Data Collection, Data Lake, Data Processing, Data Mart, and Data Analysis to reach insights and decisions. Services shown: Amazon S3, Amazon EMR, Redshift, Amazon RDS. The stages are handled in turn by a data engineer, a data scientist, and a data engineer.]

Data Moving in Data Analytics
[Figure: "What Data Scientists Actually Do" versus "What Data Scientists Should Do"]
Hive is a great data preprocessing tool because joins, filtering, and selection are easy and efficient in it.

How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Figure: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (height:173cm, weight:60kg, age:34, gender: man, …) → Machine Learning]

How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory) before machine learning.

How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
The machine learning step on the feature-vector file does not scale, and you have to learn R/Python APIs.

How I used to do ML before Hivemall
Given raw data stored on Hadoop HDFS
Does not meet my needs in terms of its scalability, ML algorithms, and usability.
I ❤ scalable SQL query

Survey on existing ML frameworks
Framework                                User interface
Mahout                                   Java API programming
Spark MLlib/MLI                          Scala API programming, Scala Shell (REPL)
H2O                                      R programming, GUI
Cloudera Oryx                            HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)      C++ API programming, Command Line
Existing distributed machine learning frameworks are NOT easy to use.

Motivation:
Machine learning needs to be easier for developers (esp. data engineers)!
People are saying that ..

Hivemall's Vision: ML on SQL
[Figure: classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature,
  -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop.


How Hivemall works in training
Machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs). A UDTF is a function that returns a relation.
[Figure: the training table holds rows of tuple<label, array<features>>, for example (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>). Parallel train UDTF tasks each emit tuple<feature, weights>. The tuples are shuffled by feature to param-mix tasks, which produce the prediction model as a relation <feature, weights>.]
● Resulting prediction model is a relation of feature and its weight
● # of mappers and reducers is configurable
Parallelism is Powerful
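The train / shuffle-by-feature / param-mix flow above can be sketched in a few lines of Python. This is only a single-process stand-in for what Hive runs in parallel: `train_partition` and `param_mix` are hypothetical names, the learner is a plain SGD logistic regression over binary features, and the learning rate `eta` is an arbitrary choice. The rows are the ones shown in the figure.

```python
import math
from collections import defaultdict

def train_partition(rows, eta=0.1):
    """Stand-in for one 'train' task: SGD logistic regression over one
    partition of the training table, emitting (feature, weight) tuples."""
    w = defaultdict(float)
    for label, features in rows:  # label is +1 or -1, features are ids
        margin = sum(w[f] for f in features)
        # derivative of the log-loss log(1 + exp(-label * margin))
        g = -label / (1.0 + math.exp(label * margin))
        for f in features:
            w[f] -= eta * g
    return list(w.items())

def param_mix(emitted):
    """Stand-in for the reduce side: group tuples by feature and
    average the weights that the different tasks learned."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two partitions of the training table, one per "mapper".
partitions = [
    [(+1, [1, 2]), (+1, [1, 7, 9])],
    [(-1, [1, 3, 9]), (+1, [3, 8])],
]
emitted = [pair for part in partitions for pair in train_partition(part)]
model = param_mix(emitted)  # relation of feature -> averaged weight
```

Because each task only emits tuples and the averaging is keyed by feature, the merge work is spread across reducers instead of funnelling into one final step.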

Why not UDAF
Machine learning as an aggregate function
[Figure: training-table rows of tuple<label, array<features>> feed train tasks (4 ops in parallel), each producing array<weight>. Merge tasks (2 ops in parallel) combine them into array<sum of weight>, array<count>. A single final merge (no parallelism) produces the prediction model.]
• Bottleneck in the final merge
• Throughput limited by its fan out
• Memory consumption grows
• Parallelism decreases

Problem that I faced: Iterations
Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because the IN/OUT of an MR job goes through HDFS
• Spark avoids this by in-memory computation
[Figure: with MapReduce, Input → iter. 1 → iter. 2 → . . ., with an HDFS read and an HDFS write around every iteration; with Spark, Input → iter. 1 → iter. 2 without the intermediate HDFS reads and writes]

Training with Iterations in Spark
Logistic Regression example of Spark
// each node loads the data into memory once
val data = spark.textFile(...).map(readPoint).cache()
// repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
This is just a toy example! Why?
Input to the gradient computation should be shuffled for each iteration (without it, more iterations are required).

What does MLlib actually do?
Mini-batch Gradient Descent with Sampling
val data = ..
for (i <- 1 to numIterations) {
  val sampled = ..   // sample a subset of data (partitioned RDD)
  val gradient = ..  // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}
Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.
GradientDescent.scala: bit.ly/spark-gd

Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps.
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
  *
FROM (
  SELECT
    amplify(${xtimes}, *) as (rowid, label, features)
  FROM
    training
) t
CLUSTER BY rand()

Map-only shuffling and amplifying
The rand_amplify UDTF randomly shuffles the input rows for each Map task.
CREATE VIEW training_x3
as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *)
    as (rowid, label, features)
FROM
  training;
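What the two amplifiers do to a stream of rows can be sketched as Python generators. This is a rough model of the behaviour described on the slides, not Hivemall's code: the chunked buffer strategy in `rand_amplify`, the `seed` parameter, and the sample rows are assumptions for illustration.

```python
import random

def amplify(rows, xtimes):
    """Emit every scanned row xtimes times, in scan order."""
    for row in rows:
        for _ in range(xtimes):
            yield row

def rand_amplify(rows, xtimes, buffer_size, seed=None):
    """Amplify rows and shuffle them through a bounded in-memory buffer.

    The shuffle happens inside the task that scans the rows, so no extra
    shuffle stage is needed (a map-local shuffle).
    """
    rng = random.Random(seed)
    buffer = []
    for row in amplify(rows, xtimes):
        buffer.append(row)
        if len(buffer) >= buffer_size:
            rng.shuffle(buffer)
            yield from buffer
            buffer.clear()
    rng.shuffle(buffer)  # flush whatever is left at the end of the scan
    yield from buffer

# (rowid, label, features) rows, made up for this sketch
training = [(1, +1, ["1", "2"]), (2, -1, ["1", "3"]), (3, +1, ["3", "8"])]
amplified = list(rand_amplify(training, xtimes=3, buffer_size=4, seed=42))
rowids = sorted(row[0] for row in amplified)
```

Every row still appears exactly `xtimes` times, only the order changes. The plain `amplify` followed by `CLUSTER BY rand()` gives a global shuffle at the cost of an extra shuffle stage, which is the trade-off measured a few slides later.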

Detailed plan w/ map-local shuffle
[Figure: query plan. Each Map task runs Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write. Map outputs are shuffled (distributed by feature) to the Reduce tasks, each of which runs Merge → Aggregate → Reduce write.]
Scanned entries are amplified and then shuffled. Note this is a pipeline op.
The Rand Amplifier operator is interleaved between the table scan and the training operator.

Performance effects of amplifiers
Method                                              Elapsed time (sec)    AUC
Plain                                               89.718                0.734805
amplifier + clustered by (a.k.a. global shuffle)    479.855               0.746214
rand_amplifier (a.k.a. map-local shuffle)           116.424               0.743392
With the map-local shuffle, prediction accuracy improved with an acceptable overhead.


How to use Hivemall
[Figure: machine learning workflow. Training takes labels and feature vectors and produces a prediction model; prediction applies the model to feature vectors and outputs labels. Current step: Data preparation]

How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall
[Figure: machine learning workflow (training and prediction). Current step: Feature Engineering]

How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0
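Min-max normalization itself is a one-line formula. The sketch below shows it in Python with made-up label values; the function here only mirrors the name of the `rescale` call in the view and is not Hivemall's code.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)

labels = [2.0, 4.0, 10.0]  # made-up label values
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```

The smallest label maps to 0.0, the largest to 1.0, and 4.0 lands at 0.25. In the view above, `${min_label}` and `${max_label}` play the role of `min_label` and `max_label` and have to be computed over the training table beforehand.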

How to use Hivemall
[Figure: machine learning workflow (training and prediction). Current step: Training]

How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight  -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..)  -- map-only task to learn a prediction model
    as (feature,weight)
  FROM train
) t
GROUP BY feature  -- shuffle map outputs to reducers by feature

How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM
  (SELECT
     train_cw(features,label)
       as (feature,weight)
   FROM
     news20b_train
  ) t
GROUP BY feature
Vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
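One plausible reading of the "vote" annotation is sketched below: the sign that occurs more often wins, and only the weights with that sign are averaged. This is an interpretation of the slide, not a description of Hivemall's actual `voted_avg`; in particular the handling of ties and zero weights here is an assumption.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: ties go to the positive side.
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0

# The weights shown on the slide: four positive votes, one negative.
result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

With the slide's values the positive side wins 4 to 1, so the result is the mean of 0.7, 0.3, 0.2, and 0.7 (about 0.475), whereas a plain `avg` would be pulled down to 0.36 by the dissenting -0.1.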

Ensemble learning for stable prediction performance
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from
  (select
     train_multiclass_cw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_arow(addBias(features),label)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_scw(addBias(features),label)
   from
     news20mc_train_x3
  ) t
group by label,feature;
Just stack prediction models by union all

How to use Hivemall
[Figure: machine learning workflow (training and prediction). Current step: Prediction]

How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
Prediction is done by LEFT OUTER JOIN between test data and prediction model.
No need to load the entire model into memory.
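The join-based prediction can be tried outside Hive. The sketch below runs essentially the same query in an in-memory SQLite database from Python; it is only an illustration, with made-up model weights and test rows, a column named `row_id` instead of `rowid` (to stay clear of SQLite's built-in rowid), and a `sigmoid` function registered by hand because SQLite does not provide one.

```python
import math
import sqlite3

def sigmoid(x):
    # SUM over only unmatched rows yields NULL; treat that as a zero score.
    if x is None:
        x = 0.0
    return 1.0 / (1.0 + math.exp(-x))

conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, sigmoid)
conn.executescript("""
    CREATE TABLE lr_model (feature TEXT, weight REAL);
    CREATE TABLE testing_exploded (row_id INTEGER, feature TEXT);
    INSERT INTO lr_model VALUES ('a', 2.0), ('b', -1.0);
    -- one row per (test example, feature); 'zzz' is unknown to the model
    INSERT INTO testing_exploded VALUES (1, 'a'), (1, 'b'), (2, 'b'), (3, 'zzz');
""")

query = """
    SELECT t.row_id, sigmoid(sum(m.weight)) AS prob
    FROM testing_exploded t
    LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
    GROUP BY t.row_id
    ORDER BY t.row_id
"""
predictions = dict(conn.execute(query).fetchall())
conn.close()
```

Row 1 scores sigmoid(2.0 - 1.0), row 2 scores sigmoid(-1.0), and row 3, whose only feature is missing from the model, falls back to 0.5. The model stays a table that the engine joins against, which is why it never has to be loaded into memory as a whole.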

How to use Hivemall
[Figure: machine learning workflow with batch training on Hadoop and online prediction on an RDBMS. The prediction model is exported from the training side to the prediction side.]

Real-time Prediction on Treasure Data
• Run batch training job periodically
• Periodical export
• Real-time prediction on an RDBMS


Supervised Learning: Recommendation
Rating prediction of a Matrix
Can be applied for user/item recommendation

Matrix Factorization
Factorize a matrix into a product of matrices having k latent factors

Criteria of Biased MF
Terms of the objective:
• Mean rating
• Bias for each user/item
• Matrix factorization
• Regularization
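The objective shown on this slide was an image and is missing from the transcript. A commonly used form of the biased matrix factorization criterion, with the four labelled terms marked, is given below. The notation (mu, b_u, b_i, p_u, q_i, lambda) is the conventional one and is an assumption; the exact regularization used in Hivemall may differ.

```latex
\min_{p_*,\, q_*,\, b_*}
\sum_{(u,i) \in \mathcal{K}}
\Bigl( r_{ui}
  - \underbrace{\mu}_{\text{mean rating}}
  - \underbrace{b_u - b_i}_{\text{bias for each user/item}}
  - \underbrace{\mathbf{p}_u^{\top} \mathbf{q}_i}_{\text{matrix factorization}}
\Bigr)^2
+ \underbrace{\lambda \bigl( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2
  + b_u^2 + b_i^2 \bigr)}_{\text{regularization}}
```

Here K is the set of observed (user, item) ratings r_ui, and p_u and q_i are the k-dimensional latent factor vectors of user u and item i.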

Training of Matrix Factorization
Supports iterative training using a local disk cache

Prediction of Matrix Factorization

ØAlgorithm
is
different
Spark:
ALS-‐WR

(considers
regularization)
Hivemall:
Biased-‐MF

(considers
regularization
and
biases)
ØUsability
Spark:
100+
line
Scala
coding
Hivemall:
SQL
ØPrediction
Accuracy
Almost
same
for
MovieLens 10M
datasets
2014/09/17
Talk@Japan
DataScientist
Society 57
Comparison
to
Spark
MLlib

Unsupervised Learning: Anomaly Detection
Sensor data etc.
rowid    features
1        ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
…
Anomaly detection runs on a series of SQL queries

Anomalies in Sensor Data
Source: https://codeiq.jp/q/207

Local Outlier Factor (LOF)
Basic idea of LOF: comparing the local density of a point with the densities of its neighbors
Image source: https://en.wikipedia.org/wiki/Local_outlier_factor
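The density comparison can be made concrete with a small brute-force implementation of the standard LOF definition. This is a sketch for intuition only: Hivemall computes LOF through a series of SQL queries, whereas the function below, its name `lof_scores`, the choice k=2, and the five sample points are assumptions for illustration. Ties in neighbor distance are broken arbitrarily, and duplicate points (zero distances) are not handled.

```python
import math

def lof_scores(points, k=2):
    """Local Outlier Factor of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # indices of the k nearest neighbours of each point, excluding itself
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # distance from each point to its k-th nearest neighbour
    k_distance = [dist[i][neighbors[i][-1]] for i in range(n)]

    def reach_dist(i, j):
        # reachability distance of point i from its neighbour j
        return max(k_distance[j], dist[i][j])

    # local reachability density: inverse of the mean reachability distance
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: mean density of the neighbours divided by the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four points on a unit square plus one point far away from them.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

The four corner points get a score of 1.0 (their density matches their neighbors'), while the distant point at (5, 5) gets a score well above 1, which is what flags it as an anomaly.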

DEMO: Local Outlier Factor

RandomForest in Hivemall v0.4
Ensemble of Decision Trees
Already available on a development (smile) branch, and its usage is explained in the project wiki.

Training of RandomForest
Out-of-bag tests and Variable Importance


Prediction of RandomForest

Jupyter Integration DEMO

Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
• For SQL users that need ML
• For those already using Hive
• Ease of use and scalability in mind
Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
Hivemall's Positioning
v0.4 will make a developmental leap

Announcement: Hivemall meetup
At the first meetup on 5/12, Freakout and Scaleout presented their use cases.
At the second meetup on 10/20 (Tue), OISIX and Livesense will present their use cases.
Registration opens soon on dots.

Beyond Query-as-a-Service!
We [image] Open-source! We invented ..
We are hiring machine learning engineers!
