This document introduces Hivemall, an open-source machine learning library built as a collection of Hive user-defined functions (UDFs). Hivemall allows users to perform scalable machine learning on large datasets stored in Hive/Hadoop. It supports various classification, regression, recommendation, and feature engineering algorithms. Some key algorithms include logistic regression, matrix factorization, random forests, and anomaly detection. Hivemall is designed to perform machine learning efficiently by avoiding intermediate data reads/writes to HDFS. It has been used in industry for applications such as click-through rate prediction, churn detection, and product recommendation.
1. Introduction to Machine Learning using Hivemall
Makoto YUI, Research Engineer
@myui <myui@treasure-data.com>
2014/09/17 Talk@Japan DataScientist Society
2. Who am I?
Ø 2015.04 Joined Treasure Data, Inc.
1st Research Engineer in Treasure Data. My mission in TD is developing ML-as-a-Service.
Ø 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan.
Worked on a large-scale Machine Learning project and Parallel Databases.
Ø 2009.03 Ph.D. in Computer Science from NAIST
Ø Super programmer award from the MITOU Foundation
Super creators in TD: Sada Furuhashi, Keisuke Nishida
3. Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)
4. What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2.
https://github.com/myui/hivemall
5. What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2.
Where Hivemall sits in the Hadoop stack:
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing, MR v2)
• Resource Management: Apache YARN
• Distributed File System: HDFS
6. MapReduce vs. DAG engine (Tez / Spark)
[Figure: a chain of MapReduce jobs writes intermediate results to HDFS between every map (M) and reduce (R) stage, while a DAG engine pipelines the stages directly: no intermediate DFS reads/writes!]
7. Won IDG's InfoWorld 2014 Bossie Award
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem
bit.ly/hivemall-award
8. List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using k-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3
9. Algorithms
Hivemall supports state-of-the-art online learning algorithms (for classification and regression).

News20.binary classification accuracy:
Algorithm                                 Accuracy
Perceptron                                0.9460
Passive-Aggressive (a.k.a. Online-SVM)    0.9604
LibLinear                                 0.9636
LibSVM/TinySVM                            0.9643
Confidence Weighted (CW)                  0.9656
AROW [1]                                  0.9660
SCW [2]                                   0.9662

CW-variants are very smart online ML algorithms.
10. Why are CW variants so good?
Suppose a binary classification setting to classify sentences as positive or negative → learn the weight for each word (each word is a feature).

Label     Feature Vector
Positive  I like this author
Negative  I like this author, but found this book dull

A naïve update will reduce both Wlike and Wdull at the same rate; CW-variants adjust the weights at different rates.
11. Why are CW variants so good?
[Figure: a plain learner adjusts only a weight (e.g., 0.6 → 0.8), whereas a CW-style learner adjusts both the weight and its confidence (covariance), e.g., "at this confidence, the weight is 0.5".]
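The per-feature behavior sketched above can be made concrete with a small example. This is a minimal, illustrative AROW-style diagonal update in Python, not Hivemall's actual implementation; the function and parameter names (including the regularizer r) are made up for the sketch:

```python
# Confidence-weighted style update: each feature keeps a weight and a
# confidence (variance). Rarely seen (low-confidence) features receive
# large updates; frequently seen features change little.

def cw_style_update(weights, variances, features, label, r=1.0):
    """One AROW-like update. `features`: list of feature names (binary
    features), `label`: +1 or -1, `r`: regularization parameter."""
    margin = label * sum(weights.get(f, 0.0) for f in features)
    conf = sum(variances.get(f, 1.0) for f in features)  # x^T Sigma x
    loss = max(0.0, 1.0 - margin)                        # hinge loss
    if loss > 0.0:
        beta = 1.0 / (conf + r)
        alpha = loss * beta
        for f in features:
            v = variances.get(f, 1.0)
            # the step size scales with the feature's own variance: an
            # uncertain word ("dull") moves more than a frequent one ("like")
            weights[f] = weights.get(f, 0.0) + alpha * label * v
            variances[f] = v - beta * v * v              # confidence grows
    return weights, variances
```

Running the slide's scenario (train on "I like this author" as positive, then a negative example containing both "like" and "dull") shows "dull" receiving a larger correction than the already-confident "like".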
12. Features to be supported from Hivemall v0.4
1. RandomForest: classification, regression
2. Gradient Tree Boosting: classification, regression
3. Factorization Machine: classification, regression (factorization)
4. Online LDA: topic modeling, clustering
Planned to release v0.4 in Oct. Gradient Boosting and Factorization Machines are often used by data science competition winners (very important for practitioners).
14. Factorization Machine
Context information (e.g., time) can be considered.
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
15. Factorization Machine
Factorization model with degree=2 (2-way interaction):
ŷ(x) = w0 + Σj wj xj + Σj Σj'>j ⟨vj, vj'⟩ xj xj'
• w0: global bias
• wj: regression coefficient of the j-th variable
• ⟨vj, vj'⟩: pairwise interaction, factorized into latent vectors
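The degree-2 model above can be evaluated in O(n·k) time using Rendle's reformulation of the pairwise term. A small self-contained Python sketch; the names w0, w, V are illustrative, not Hivemall API:

```python
def fm_predict(x, w0, w, V):
    """Degree-2 Factorization Machine prediction.
    x: feature values (length n), w0: global bias, w: linear weights,
    V: n x k factor matrix (one k-dim latent vector per feature)."""
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[j] * x[j] for j in range(n))
    pairwise = 0.0
    for f in range(k):
        # identity: sum_{j<j'} v_jf v_j'f x_j x_j'
        #         = 0.5 * ((sum_j v_jf x_j)^2 - sum_j (v_jf x_j)^2)
        s = sum(V[j][f] * x[j] for j in range(n))
        s2 = sum((V[j][f] * x[j]) ** 2 for j in range(n))
        pairwise += 0.5 * (s * s - s2)
    return linear + pairwise
```

The reformulation avoids the naive O(n²·k) double loop over feature pairs, which is what makes FMs practical on sparse, high-dimensional data.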
16. Industry use cases of Hivemall
Ø CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more
Ø Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.
Ø Churn Detection
• Algorithm: Regression
• OISIX and more
Ø Item/User recommendation
• Algorithm: Recommendation (Matrix Factorization / kNN)
• Wish.com, DAC, Real-estate Portal, and more
Ø Value prediction of Real estates
• Algorithm: Regression
• Livesense
17. Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)
18. Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable. That's why I built Hivemall.
19. Data Moving in Data Analytics
[Figure: Event Data flows through Data Collection → Data Lake (Amazon S3) → Data Processing (Amazon EMR) → Data Mart (Redshift / Amazon RDS) → Data Analysis → Insights and Decisions; Data Engineers handle the pipeline stages, a Data Scientist handles the analysis.]
20. Data Moving in Data Analytics
What Data Scientists actually do vs. what Data Scientists should do.
Hive is a great data preprocessing tool due to its ease and efficiency for join, filtering, and selection (data preprocessing).
21. How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS:
Raw Data (HDFS / S3) → Extract-Transform-Load → Feature Vector file (e.g., height:173cm, weight:60kg, age:34, gender:man, …) → Machine Learning
22. How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS:
Raw Data (HDFS / S3) → Extract-Transform-Load → Feature Vector file → Machine Learning
Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory).
23. How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS:
Raw Data (HDFS / S3) → Extract-Transform-Load → Feature Vector file → Machine Learning
Problems: the ML step does not scale, and you have to learn R/Python APIs.
24. How I used to do ML before Hivemall
Given raw data stored on Hadoop HDFS:
Raw Data (HDFS / S3) → Extract-Transform-Load → Feature Vector file → Machine Learning
This did not meet my needs in terms of scalability, ML algorithms, and usability. I ❤ scalable SQL queries.
25. Survey on existing ML frameworks
Framework                              User interface
Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming, Scala Shell (REPL)
H2O                                    R programming, GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming, Command Line

Existing distributed machine learning frameworks are NOT easy to use.
26. People are saying that ..
Motivation: Machine Learning needs to be easier for developers (esp. data engineers)!
27. Hivemall's Vision: ML on SQL
Compared to classification with Mahout, in Hivemall it is just SQL:

CREATE TABLE lr_model AS
SELECT
 feature,
 avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
 SELECT logress(features, label, ..) as (feature, weight)
 FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop.
28. Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)
29. How Hivemall works in training
Machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs). A UDTF is a function that returns a relation.
[Figure: the training table, tuples of <label, array<features>> such as (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>), is split across parallel train UDTF instances (with param-mix between them); their <feature, weight> outputs are shuffled by feature into the prediction model relation.]
• The resulting prediction model is a relation of feature and its weight
• The # of mappers and reducers is configurable
Parallelism is Powerful
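The train-then-average scheme above can be mimicked in a few lines of Python. This is an illustrative sketch, not Hivemall's code: each "map task" runs SGD logistic regression on its shard and emits (feature, weight) tuples, and a "reducer" averages the weights per feature, like avg(weight) ... GROUP BY feature:

```python
import math
from collections import defaultdict

def train_shard(examples, lr=0.1, epochs=10):
    """Map side: SGD logistic regression on one shard.
    examples: list of (label in {+1,-1}, list of feature names)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for y, feats in examples:
            z = sum(w[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            g = (1.0 if y > 0 else 0.0) - p   # log-loss gradient
            for f in feats:
                w[f] += lr * g
    return list(w.items())                    # emit (feature, weight) tuples

def reduce_avg(tuples):
    """Reduce side: group by feature and average the weights."""
    sums, counts = defaultdict(float), defaultdict(int)
    for f, wt in tuples:
        sums[f] += wt
        counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}

shards = [
    [(+1, ["a", "b"]), (-1, ["c"])],
    [(+1, ["a"]), (-1, ["b", "c"])],
]
tuples = [t for shard in shards for t in train_shard(shard)]
model = reduce_avg(tuples)   # feature -> averaged weight
```

The key point the slide makes is that the model itself is just a relation of (feature, weight), so training parallelizes like any other Hive query.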
30. Why not UDAF: machine learning as an aggregate function
[Figure: partial models produced by parallel train operators are merged in a tree: 4 ops in parallel, then 2 ops in parallel, then a final merge with no parallelism.]
• Bottleneck in the final merge
• Throughput limited by its fan-out
• Memory consumption grows and parallelism decreases
31. Problem that I faced: Iterations
Iterations are mandatory to get a good prediction model.
• However, MapReduce is not suited for iterations because the IN/OUT of an MR job goes through HDFS: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .
• Spark avoids this by in-memory computation: Input → iter. 1 → iter. 2 → . . .
32. Training with Iterations in Spark
Logistic Regression example of Spark:

val data = spark.textFile(...).map(readPoint).cache()
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

Repeated MapReduce steps to do gradient descent; each node loads the data into memory once. But this is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration (without it, more iterations are required).
33. What MLlib actually does
Mini-batch Gradient Descent with Sampling (GradientDescent.scala, bit.ly/spark-gd):

val data = ..
for (i <- 1 to numIterations) {
  val sampled = ..  // sample a subset of data (partitioned RDD)
  val gradient = .. // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}

Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.
34. Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to emulate iteration effects in machine learning without several MapReduce steps.

SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
 *
FROM (
 SELECT
  amplify(${xtimes}, *) as (rowid, label, features)
 FROM
  training
) t
CLUSTER BY rand();
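What amplify plus CLUSTER BY rand() achieves can be shown with a toy Python sketch (illustrative only; the function name and seed are made up): each input row is duplicated xtimes times and the result is randomly shuffled, so a single sequential training pass sees every example several times in different orders, approximating multiple iterations.

```python
import random

def amplify(rows, xtimes, seed=42):
    """Duplicate each row `xtimes` times, then shuffle the result.
    The shuffle stands in for CLUSTER BY rand() in the Hive query."""
    out = [row for row in rows for _ in range(xtimes)]
    random.Random(seed).shuffle(out)
    return out
```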
35. Map-only shuffling and amplifying
The rand_amplify UDTF randomly shuffles the input rows for each Map task.

CREATE VIEW training_x3
as
SELECT
 rand_amplify(${xtimes}, ${shufflebuffersize}, *)
  as (rowid, label, features)
FROM
 training;
36. Detailed plan w/ map-local shuffle
[Figure: in each Map task, a Table scan feeds a Rand Amplifier, then the Logress UDTF and a Partial aggregate before the Map write; the map outputs are shuffled (distributed by feature) to Reduce tasks, which Merge, Aggregate, and Reduce-write.]
Scanned entries are amplified and then shuffled. Note that this is a pipeline operation: the Rand Amplifier operator is interleaved between the table scan and the training operator.
37. Performance effects of amplifiers
Method                                             Elapsed time (sec)   AUC
Plain                                              89.718               0.734805
amplifier + clustered by (a.k.a. global shuffle)   479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)          116.424              0.743392

With the map-local shuffle, prediction accuracy improved with an acceptable overhead.
38. Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)
39. How to use Hivemall
[Figure: the Machine Learning workflow: Training takes <Label, Feature Vector> pairs and produces a Prediction Model; Prediction applies the model to a Feature Vector to output a Label. This slide highlights the Data preparation step.]
40. How to use Hivemall - Data preparation
Define a Hive table for training/testing data:

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
41. How to use Hivemall
[Figure: the same Machine Learning workflow; this slide highlights the Feature Engineering step.]
43. How to use Hivemall
[Figure: the same Machine Learning workflow; this slide highlights the Training step.]
44. How to use Hivemall - Training
Training by logistic regression:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..)
         as (feature, weight)
  FROM train
) t
GROUP BY feature;

The inner query is a map-only task that learns a prediction model; the map outputs are shuffled to reducers by feature, and the reducers perform model averaging in parallel.
45. How to use Hivemall - Training
Training of a Confidence Weighted classifier:

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label)
         as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature;

train_cw trains the CW classifier; voted_avg votes on whether to use the negative or positive weights for averaging (e.g., given +0.7, +0.3, +0.2, -0.1, +0.7, the positive values win the vote).
46. Ensemble learning for stable prediction performance
Just stack prediction models by union all:

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features), label)
      as (label, feature, weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features), label)
      as (label, feature, weight)
  from news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features), label)
      as (label, feature, weight)
  from news20mc_train_x3
) t
group by label, feature;
47. How to use Hivemall
[Figure: the same Machine Learning workflow; this slide highlights the Prediction step.]
48. How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m
    ON (t.feature = m.feature)
GROUP BY t.rowid;

Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model. There is no need to load the entire model into memory.
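The join-based prediction above can be sketched in plain Python (illustrative only; the data and names mirror the slide's tables but are made up): exploded test rows of (rowid, feature) are joined to the model's feature → weight mapping, the weights are summed per rowid, and a sigmoid turns each sum into a probability.

```python
import math

def predict(testing_exploded, lr_model):
    """testing_exploded: list of (rowid, feature) pairs;
    lr_model: dict mapping feature -> weight."""
    sums = {}
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN semantics: features absent from the model
        # contribute nothing to the sum
        sums[rowid] = sums.get(rowid, 0.0) + lr_model.get(feature, 0.0)
    return {rowid: 1.0 / (1.0 + math.exp(-s)) for rowid, s in sums.items()}
```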
49. How to use Hivemall
Export the prediction model: Batch Training on Hadoop produces the Prediction Model, which is exported for Online Prediction on an RDBMS.
50. Real-time Prediction on Treasure Data
Run the batch training job periodically on Hadoop, export the prediction model periodically, and serve real-time predictions from an RDBMS.
51. Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)
57. Comparison to Spark MLlib
Ø Algorithm is different
Spark: ALS-WR (considers regularization)
Hivemall: Biased-MF (considers regularization and biases)
Ø Usability
Spark: 100+ lines of Scala coding
Hivemall: SQL
Ø Prediction Accuracy
Almost the same for the MovieLens 10M dataset
58. Unsupervised Learning: Anomaly Detection
Sensor data etc. Anomaly detection runs on a series of SQL queries.

rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
61. DEMO: Local Outlier Factor

rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
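The Local Outlier Factor idea behind the demo can be sketched compactly in Python. This is not Hivemall's SQL implementation, just the standard algorithm on small dense points (requires Python 3.8+ for math.dist): a point whose local density is much lower than that of its k nearest neighbors gets an LOF score well above 1.

```python
import math

def knn(points, i, k):
    """k nearest neighbors of point i and its k-distance."""
    d = sorted((math.dist(points[i], points[j]), j)
               for j in range(len(points)) if j != i)
    return [j for _, j in d[:k]], d[k - 1][0]

def lof(points, k=2):
    neigh = {i: knn(points, i, k) for i in range(len(points))}

    def reach_dist(i, j):
        # reachability distance of i from j: max(k-distance(j), d(i, j))
        return max(neigh[j][1], math.dist(points[i], points[j]))

    # local reachability density of each point
    lrd = {i: k / sum(reach_dist(i, j) for j in neigh[i][0])
           for i in range(len(points))}
    # LOF: ratio of neighbors' densities to the point's own density
    return [sum(lrd[j] for j in neigh[i][0]) / (k * lrd[i])
            for i in range(len(points))]
```

On a tight cluster plus one far-away point, the cluster points score near 1 while the outlier scores much higher.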
62. RandomForest in Hivemall v0.4
Ensemble of Decision Trees. Already available on a development (smile) branch, and its usage is explained in the project wiki.
67. Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.
Ø For SQL users that need ML
Ø For those already using Hive
Ø Designed with ease-of-use and scalability in mind
It does not require coding, packaging, compiling, or introducing a new programming language or APIs.
Hivemall's positioning: v0.4 will make a developmental leap.
69. Beyond Query-as-a-Service!
We Open-source! We invented ..
We are hiring machine learning engineers!