Introduction to Machine Learning on [image] using Hivemall
Research Engineer
Makoto YUI @myui
<myui@treasure-data.com>
2014/09/17 Talk@Japan DataScientist Society
Ø 2015.04	
  Joined	
  Treasure	
  Data,	
  Inc.
1st Research	
  Engineer	
  in	
  Treasure	
  Data
My	
  mission	
  in	
  TD	
  is	
  developing	
  ML-­‐as-­‐a-­‐Service
Ø 2010.04-­‐2015.03	
  Senior	
  Researcher	
  at	
  National	
  
Institute	
  of	
  Advanced	
  Industrial	
  Science	
  and	
  
Technology,	
  Japan.	
  
Worked	
  on	
  a	
  large-­‐scale	
  Machine	
  Learning	
  project	
  
and	
  Parallel	
  Databases	
  
Ø 2009.03	
  Ph.D.	
  in	
  Computer	
  Science	
  from	
  NAIST
Ø Super	
  programmer	
  award	
  from	
  the	
  MITOU	
  
Foundation	
  
Super	
  creators	
  in	
  TD:	
  	
  Sada	
  Furuhashi,	
  Keisuke	
  Nishida
Who	
  am	
  	
  I	
  ?
2014/09/17	
  Talk@Japan	
  DataScientist	
  Society 2
Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall
What is Hivemall
[Figure: Hivemall in the Hadoop stack, top to bottom]
Machine Learning: Hivemall
Query Processing: Hive / PIG
Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing), MR v2
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS
MapReduce and DAG engine
[Figure: MapReduce vs. DAG engine (Tez / Spark); nodes are labeled M (map), R (reduce), and HDFS]
No intermediate DFS reads/writes!
Won IDG's InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem
bit.ly/hivemall-award
List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3
List of Features in Hivemall
Algorithms                                 News20.binary Classification Accuracy
Perceptron                                 0.9460
Passive-Aggressive (a.k.a. Online-SVM)     0.9604
LibLinear                                  0.9636
LibSVM/TinySVM                             0.9643
Confidence Weighted (CW)                   0.9656
AROW [1]                                   0.9660
SCW [2]                                    0.9662
Better: accuracy improves toward the bottom of the table
CW-variants are very smart online ML algorithms
Hivemall supports the state-of-the-art online learning algorithms (for classification and regression)
Why are CW variants so good?
Suppose a binary classification setting to classify sentences as positive or negative
→ learn the weight for each word (each word is a feature)
Label      Feature Vector
Positive   I like this author
Negative   I like this author, but found this book dull
A naïve update will reduce both W_like and W_dull at the same rate
CW-variants adjust weights at different rates
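The different update rates can be illustrated with a minimal Python sketch of a diagonal AROW-style update (AROW is one of the CW variants). This is an illustration under assumptions, not Hivemall's implementation: the names `arow_update` and `bag_of_words`, the regularization value `r`, and the binary bag-of-words encoding are all hypothetical.

```python
from collections import defaultdict

def arow_update(w, sigma, x, y, r=1.0):
    """Apply one diagonal AROW-style update in place.

    w: feature -> weight, sigma: feature -> variance (low variance = high confidence),
    x: feature -> value, y: label (+1 or -1), r: regularization (assumed value).
    """
    margin = sum(w[f] * v for f, v in x.items())
    variance = sum(sigma[f] * v * v for f, v in x.items())
    loss = max(0.0, 1.0 - y * margin)
    if loss == 0.0:
        return  # classified with enough margin: no update
    beta = 1.0 / (variance + r)
    alpha = loss * beta
    for f, v in x.items():
        w[f] += alpha * y * sigma[f] * v        # step size is scaled by the feature's variance
        sigma[f] -= beta * (sigma[f] * v) ** 2  # every observation makes the feature more confident

def bag_of_words(sentence):
    return {word: 1.0 for word in sentence.split()}

w = defaultdict(float)             # all weights start at 0
sigma = defaultdict(lambda: 1.0)   # all features start equally uncertain

# "like" appears repeatedly in positive sentences, so its confidence grows.
for _ in range(3):
    arow_update(w, sigma, bag_of_words("i like this author"), +1)

before = {f: w[f] for f in ("like", "dull")}
arow_update(w, sigma, bag_of_words("i like this author but found this book dull"), -1)

delta_like = w["like"] - before["like"]
delta_dull = w["dull"] - before["dull"]
print(delta_like, delta_dull)  # both negative; "dull" moves further than "like"
```

Because "like" has already been seen several times, its variance is smaller and the negative sentence moves it less than the previously unseen "dull".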
Why are CW variants so good?
[Figure: a plain online update adjusts only a weight; CW variants adjust a weight and its confidence (covariance). Values shown: 0.6, 0.8, and "At this confidence, the weight is 0.5"]
Features to be supported from Hivemall v0.4
1. RandomForest
• classification, regression
2. Gradient Tree Boosting
• classifier, regression
3. Factorization Machine
• classification, regression (factorization)
4. Online LDA
• topic modeling, clustering
Planned to release v0.4 in Oct.
Gradient Boosting and Factorization Machine are often used by data science competition winners (very important for practitioners)
Factorization Machine
[Figure: Matrix Factorization]
Factorization Machine
Context information (e.g., time) can be considered
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
Factorization Machine
Factorization Model with degree=2 (2-way interaction)
[Equation annotations: Global Bias; Regression coefficient of j-th variable; Pairwise Interaction; Factorization]
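The annotations on this slide map onto the degree-2 factorization machine model of the cited Rendle (2010) paper. As a reference sketch, with symbols following that paper rather than the slide (n is the number of variables, k the number of latent factors):

```latex
\hat{y}(\mathbf{x}) =
  \underbrace{w_0}_{\text{global bias}}
  + \underbrace{\sum_{j=1}^{n} w_j x_j}_{\text{regression coefficients}}
  + \underbrace{\sum_{j=1}^{n} \sum_{j'=j+1}^{n}
      \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}}_{\text{pairwise interactions}},
\qquad
\langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle = \sum_{f=1}^{k} v_{j,f} \, v_{j',f}
```

The inner product of the latent vectors is the "factorization" part: the interaction weight of each pair of variables is not learned independently but derived from k-dimensional factors.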
Ø CTR	
  prediction	
  of	
  Ad	
  click	
  logs
• Algorithm:	
  Logistic	
  regression
• Freakout Inc.,	
  Smartnews,	
  and	
  more
Ø Gender	
  prediction	
  of	
  Ad	
  click	
  logs
• Algorithm:	
  Classification
• Scaleout Inc.
Ø Churn	
  Detection
• Algorithm:	
  Regression
• OISIX	
  and	
  more
Ø Item/User	
  recommendation
• Algorithm:	
  Recommendation	
  (Matrix	
  Factorization	
  /	
  kNN)	
  
• Wish.com,	
  DAC,	
  Real-­‐estate	
  Portal,	
  and	
  more
Ø Value	
  prediction	
  of	
  Real	
  estates
• Algorithm:	
  	
  Regression
• Livesense
Industry	
  use	
  cases	
  of	
  Hivemall
162014/09/17	
  Talk@Japan	
  DataScientist	
  Society
Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.
That's why I built Hivemall.
Data Moving in Data Analytics
[Figure: Event Data flows through Data Collection, Data Lake, Data Processing, and Data Mart to Data Analysis, ending in Insights and Decisions. Services shown: Amazon S3, Amazon EMR, Redshift, Amazon RDS. Roles shown: Data Engineer, Data Scientist, Data Engineer.]
Data Moving in Data Analytics
[Figure: What Data Scientists actually Do vs. What Data Scientists Should Do]
Hive is a great data preprocessing tool: joins, filtering, and selection are easy and efficient
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Figure: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (height:173cm, weight:60kg, age:34, gender: man, …) → Machine Learning]
• Extract-Transform-Load: need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)
• Machine Learning: does not scale; have to learn R/Python APIs
• Does not meet my needs in terms of its scalability, ML algorithms, and usability
I ❤ scalable SQL query
Survey on existing ML frameworks
Framework                              User interface
Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming, Scala Shell (REPL)
H2O                                    R programming, GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming, Command Line
Existing distributed machine learning frameworks are NOT easy to use
People are saying that ..
Motivation: Machine Learning needs to be easier for developers (esp. data engineers)!
Hivemall's Vision: ML on SQL
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
How Hivemall works in training
Implemented machine learning algorithms as User-Defined Table generating Functions (UDTFs)
[Figure: the training table, a relation of tuple<label, array<features>> (e.g. +1, <1,2>; +1, <1,7,9>; -1, <1,3,9>; +1, <3,8>), is processed by parallel train UDTF instances with param-mix; each emits tuple<feature, weights>; the output is shuffled by feature into the prediction model, a relation of <feature, weights>]
● Resulting prediction model is a relation of feature and its weight
● # of mappers and reducers are configurable
UDTF is a function that returns a relation
Parallelism is Powerful
Why not UDAF
Machine learning as an aggregate function
[Figure: aggregation tree over the training table (tuple<label, array<features>>): train → merge → final merge → prediction model. train emits array<weight>; merge emits array<sum of weight>, array<count>. Levels are labeled "4 ops in parallel", "2 ops in parallel", and "No parallelism"; toward the final merge, memory consumption grows and parallelism decreases]
Bottleneck in the final merge
Throughput limited by its fan out
Problem that I faced: Iterations
Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because IN/OUT of an MR job is through HDFS
• Spark avoids it by in-memory computation
[Figure: MapReduce: Input → iter. 1 → iter. 2 → . . ., with an HDFS read and an HDFS write around each iteration. Spark: Input → iter. 1 → iter. 2]
Training with Iterations in Spark
Logistic Regression example of Spark
// For each node, loads data in memory once
val data = spark.textFile(...).map(readPoint).cache()
// w is the weight vector, assumed to be initialized before the loop
// Repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
This is just a toy example! Why?
Input to the gradient computation should be shuffled for each iteration (without it, more iterations are required)
What does MLlib actually do?
Mini-batch Gradient Descent with Sampling
val data = ..
for (i <- 1 to numIterations) {
  val sampled =   // sample subset of data (partitioned RDD)
  val gradient =  // averaging the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}
Iterations are mandatory for convergence because each iteration uses only a small fraction of the data
GradientDescent.scala
bit.ly/spark-gd
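The loop skeleton above can be made concrete with a minimal single-machine sketch in plain Python. It is not MLlib's code: `minibatch_gd`, its parameters (`fraction`, `step`, `seed`), and the toy data are assumptions made for illustration only.

```python
import math
import random

def minibatch_gd(points, num_iterations=200, fraction=0.5, step=0.5, seed=42):
    """Logistic regression by mini-batch gradient descent with sampling.

    points: list of (x, y) with x a list of floats and y in {0, 1}.
    Each iteration sees only a random fraction of the data, which is why
    many iterations are needed for convergence.
    """
    rng = random.Random(seed)
    dim = len(points[0][0])
    w = [0.0] * dim
    batch_size = max(1, int(len(points) * fraction))
    for _ in range(num_iterations):
        sampled = rng.sample(points, batch_size)   # sample a subset of the data
        gradient = [0.0] * dim
        for x, y in sampled:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                gradient[j] += (p - y) * x[j] / batch_size   # average the subgradients
        w = [wi - step * gi for wi, gi in zip(w, gradient)]
    return w

# Toy data: the label is 1 when the first coordinate is positive; the second entry is a bias term.
toy = [([x / 10.0, 1.0], 1 if x > 0 else 0) for x in range(-10, 11) if x != 0]
weights = minibatch_gd(toy)
```

The structure mirrors the three steps on the slide: sample, average the subgradients over the sample, update `w`.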
Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
  *
FROM (
  SELECT
    amplify(${xtimes}, *) as (rowid, label, features)
  FROM
    training
) t
CLUSTER BY rand()
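What this view computes can be sketched in plain Python: every training row is emitted `xtimes` times and the result is globally reordered at random. The function names and the toy rows are hypothetical stand-ins, not Hivemall code.

```python
import random

def amplify(xtimes, rows):
    """Emit every input row xtimes times, like repeating the training data."""
    for row in rows:
        for _ in range(xtimes):
            yield row

def cluster_by_rand(rows, seed=0):
    """Stand-in for CLUSTER BY rand(): a global random reordering of all rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# (rowid, label, features), mirroring the columns of the view
training = [(1, +1, ["1", "2"]), (2, -1, ["1", "3", "9"]), (3, +1, ["3", "8"])]
training_x3 = cluster_by_rand(amplify(3, training))
```

A learner that makes one pass over `training_x3` sees each example three times in random order, which approximates three shuffled iterations in a single job.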
Map-only shuffling and amplifying
rand_amplify UDTF randomly shuffles the input rows for each Map task
CREATE VIEW training_x3
as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *)
    as (rowid, label, features)
FROM
  training;
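A map-local shuffle can be sketched in Python as amplification followed by shuffling inside a bounded buffer. How the real UDTF manages its buffer is not described here, so the flush-when-full policy below is an assumption, as are the function signature and the `seed` parameter.

```python
import random

def rand_amplify(xtimes, buffer_size, rows, seed=0):
    """Amplify rows and shuffle them inside a bounded buffer (map-local shuffle).

    Assumed behavior for illustration: rows are collected into a buffer of
    buffer_size entries; a full buffer is shuffled and flushed downstream.
    """
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        for _ in range(xtimes):
            buffer.append(row)
            if len(buffer) >= buffer_size:
                rng.shuffle(buffer)
                yield from buffer
                buffer = []
    rng.shuffle(buffer)   # flush whatever is left at the end of the map task
    yield from buffer

map_task_input = list(range(10))   # stand-in for the rows seen by a single map task
output = list(rand_amplify(3, 8, map_task_input))
```

Unlike a global shuffle, no row leaves its map task, so the randomization costs memory for one buffer instead of an extra shuffle stage.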
Detailed plan w/ map-local shuffle
[Figure: query plan. Each Map task: Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write. Shuffle (distributed by feature). Each Reduce task: Merge → Aggregate → Reduce write.]
Scanned entries are amplified and then shuffled. Note this is a pipeline op.
The Rand Amplifier operator is interleaved between the table scan and the training operator
Performance effects of amplifiers
Method                                            ELAPSED TIME (sec)   AUC
Plain                                             89.718               0.734805
amplifier+clustered by (a.k.a. global shuffle)    479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)         116.424              0.743392
With the map-local shuffle, prediction accuracy improved (AUC 0.734805 → 0.743392) with an acceptable overhead (116.424 sec vs. 89.718 sec for Plain)
How to use Hivemall
[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Highlighted step: Data preparation]
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label,${min_label},${max_label})
    as label,
  features
from
  e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0
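Min-max normalization itself is a one-line formula. The Python sketch below shows the arithmetic that `rescale(value, min, max)` is expected to perform; the handling of a degenerate range and the sample label values are assumptions for illustration.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumed convention when the range is degenerate
    return (value - min_value) / (max_value - min_value)

labels = [-7.9, -3.6, -0.5]              # hypothetical raw label values
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```

In the Hive query, `${min_label}` and `${max_label}` play the role of `min_label` and `max_label` and have to be computed over the training table beforehand.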
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..)
    as (feature,weight)
  FROM train
) t -- map-only task to learn a prediction model
GROUP BY feature -- shuffle map-outputs to reducers by feature
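The three stages of this query (map-only training, shuffle by feature, averaging in the reducers) end in a plain per-feature average. A minimal Python sketch of that last step, with hypothetical mapper outputs standing in for what independently trained map tasks would emit:

```python
from collections import defaultdict

def average_models(mapper_outputs):
    """Average per-feature weights emitted by independently trained mappers.

    mapper_outputs: iterable of (feature, weight) pairs from all map tasks.
    Mirrors GROUP BY feature with avg(weight).
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for feature, weight in mapper_outputs:
        total[feature] += weight
        count[feature] += 1
    return {feature: total[feature] / count[feature] for feature in total}

mapper_1 = [("height", 0.5), ("age", -0.25)]     # hypothetical output of one map task
mapper_2 = [("height", 1.0), ("weight", 0.75)]   # hypothetical output of another
lr_model = average_models(mapper_1 + mapper_2)
print(lr_model)  # {'height': 0.75, 'age': -0.25, 'weight': 0.75}
```

A feature that only one mapper has seen keeps that mapper's weight unchanged, since the average is taken over the mappers that emitted it.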
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote to use negative or positive weights for avg
FROM
  (SELECT
     train_cw(features,label)
       as (feature,weight)
   FROM
     news20b_train
  ) t
GROUP BY feature
Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
Training for the CW classifier
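The voting step can be sketched in Python as follows. The exact semantics are an assumption made for illustration: the sign held by the majority of weights wins, and only the weights with that sign are averaged.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote.

    Assumed semantics: if positive weights outnumber negative ones, return the
    mean of the positive weights; otherwise return the mean of the negative ones.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if len(positives) > len(negatives):
        return sum(positives) / len(positives)
    if negatives:
        return sum(negatives) / len(negatives)
    return 0.0  # no non-zero weights at all

result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

For the example weights, four of five votes are positive, so the single negative weight is ignored and the result is the mean of 0.7, 0.3, 0.2, and 0.7 (about 0.475) rather than the plain mean of all five (0.36).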
Ensemble learning for stable prediction performance
Just stack prediction models by union all
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from
  (select
     train_multiclass_cw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_arow(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_scw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
  ) t
group by label,feature;
How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
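The join-based prediction can be mimicked in a few lines of Python. This is a sketch of the computation only: the model and test rows are hypothetical, and unlike SQL (where a row with no matching feature at all yields NULL) this version yields 0.5 for such a row.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, lr_model):
    """testing_exploded: (rowid, feature) pairs; lr_model: feature -> weight."""
    score = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model contributes nothing
        score[rowid] += lr_model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in score.items()}

lr_model = {"a": 2.0, "b": -1.0}                                 # hypothetical trained weights
testing_exploded = [(1, "a"), (1, "b"), (2, "b"), (2, "zzz")]    # one row per (rowid, feature)
prob = predict(testing_exploded, lr_model)
```

Each test row only touches the model entries for its own features, which is why the whole model never has to be held in memory by a single task.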
How to use Hivemall
[Figure: Batch Training on Hadoop (Label + Feature Vector → Machine Learning → Prediction Model) → Export prediction model → Online Prediction on RDBMS (Feature Vector → Label)]
Real-time Prediction on Treasure Data
[Figure: Run batch training job periodically → Periodical export → Real-time prediction on a RDBMS]
Supervised Learning: Recommendation
Rating prediction of a Matrix
Can be applied for user/Item Recommendation
Matrix Factorization
Factorize a matrix into a product of matrices having k-latent factor
Matrix Factorization
Criteria of Biased MF
[Equation annotations: Mean Rating; Factorization; Bias for each user/item; Regularization]
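The annotations correspond to the terms of the commonly used biased matrix factorization formulation, sketched below for reference. The symbols are assumptions, not taken from the slide: μ is the mean rating, b_u and b_i are the user and item biases, p_u and q_i are the k-dimensional latent factor vectors, K is the set of observed ratings, and λ is the regularization weight. Whether Hivemall uses exactly this regularization term is likewise an assumption.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^{\top} \mathbf{q}_i
\qquad
\min_{\mathbf{p}, \mathbf{q}, b}
  \sum_{(u,i) \in K} \left( r_{ui} - \mu - b_u - b_i - \mathbf{p}_u^{\top} \mathbf{q}_i \right)^2
  + \lambda \left( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 + b_u^2 + b_i^2 \right)
```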
Training of Matrix Factorization
Support iterative training using local disk cache
Prediction of Matrix Factorization
ØAlgorithm	
  is	
  different
Spark:	
  ALS-­‐WR	
  
(considers	
  regularization)
Hivemall:	
  Biased-­‐MF	
  
(considers	
  regularization	
  and	
  biases)
ØUsability
Spark:	
  100+	
  line	
  Scala	
  coding
Hivemall:	
  SQL
ØPrediction	
  Accuracy
Almost	
  same	
  for	
  MovieLens 10M	
  datasets
2014/09/17	
  Talk@Japan	
  DataScientist	
  Society 57
Comparison	
  to	
  Spark	
  MLlib
Unsupervised Learning: Anomaly Detection
Sensor data etc.
Anomaly detection runs on a series of SQL queries
rowid   features
1       ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2       ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3       ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
Anomalies in a Sensor Data
Source: https://codeiq.jp/q/207
Local Outlier Factor (LOF)
Basic idea of LOF: comparing the local density of a point with the densities of its neighbors
Image Source: https://en.wikipedia.org/wiki/Local_outlier_factor
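The density comparison behind LOF can be sketched in plain Python. This is a brute-force illustration, not Hivemall's SQL-based implementation: the function name, the choice k=2, and the toy points are assumptions, and ties at the k-th neighbor distance are ignored for simplicity.

```python
import math

def local_outlier_factors(points, k=2):
    """Return the LOF score of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors of each point (excluding itself) and its k-distance
    neighbors = []
    k_distance = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])
    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbors relative to the point's own density
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Four points on a unit square plus one point far away from them
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
scores = local_outlier_factors(points)
```

Points whose density matches that of their neighbors score about 1 (the four corners here), while the isolated point scores far above 1.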
DEMO: Local Outlier Factor
RandomForest in Hivemall v0.4
Ensemble of Decision Trees
Already available on a development (smile) branch, and its usage is explained in the project wiki
Training of RandomForest
Out-of-bag tests and Variable Importance
Prediction of RandomForest
Jupyter Integration
DEMO
Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
• For SQL users that need ML
• For those already using Hive
• Ease of use and scalability in mind
Does not require coding, packaging, compiling or introducing a new programming language or APIs.
Announcement: Hivemall meetup
v0.4 will make a developmental leap
At the first meetup on 5/12, Freakout and Scaleout presented their use cases
At the second meetup on 10/20 (Tue), OISIX and Livesense will present their use cases
Registration opens soon on dots
Beyond Query-as-a-Service!
We [image] Open-source! We invented ..
We are hiring machine learning engineers!

Contenu connexe

Tendances

Postgres: The NoSQL Cake You Can Eat
Postgres: The NoSQL Cake You Can EatPostgres: The NoSQL Cake You Can Eat
Postgres: The NoSQL Cake You Can EatEDB
 
[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructure
[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructure[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructure
[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructureオラクルエンジニア通信
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow PresentationFelix Liao
 
Reducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with PostgresReducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with PostgresEDB
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 
Azure_Business_Opportunity
Azure_Business_OpportunityAzure_Business_Opportunity
Azure_Business_OpportunityNojan Emad
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
Girish Juneja - Intel Big Data & Cloud Summit 2013
Girish Juneja - Intel Big Data & Cloud Summit 2013Girish Juneja - Intel Big Data & Cloud Summit 2013
Girish Juneja - Intel Big Data & Cloud Summit 2013IntelAPAC
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
A Peek in the Elephant's Trunk
A Peek in the Elephant's TrunkA Peek in the Elephant's Trunk
A Peek in the Elephant's TrunkEDB
 
Unattended Apache BigTop installer CD using preseed
Unattended Apache BigTop installer CD using preseedUnattended Apache BigTop installer CD using preseed
Unattended Apache BigTop installer CD using preseedJazz Yao-Tsung Wang
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupReal Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupCaserta
 

Tendances (20)

Postgres: The NoSQL Cake You Can Eat
Postgres: The NoSQL Cake You Can EatPostgres: The NoSQL Cake You Can Eat
Postgres: The NoSQL Cake You Can Eat
 
[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructure
[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructure[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructure
[Cloudera World Tokyo 2018] Cloudera on Oracle Cloud Infrastructure
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation
 
Reducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with PostgresReducing Database Pain & Costs with Postgres
Reducing Database Pain & Costs with Postgres
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 

Db tech show - hivemall

  • 1. Introduction to Machine Learning on using Hivemall. Research Engineer Makoto YUI @myui <myui@treasure-data.com>. 2014/09/17 Talk@Japan DataScientist Society
  • 2. Who am I? 2015.04: Joined Treasure Data, Inc. as the 1st Research Engineer in Treasure Data; my mission in TD is developing ML-as-a-Service. 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan; worked on a large-scale Machine Learning project and Parallel Databases. 2009.03: Ph.D. in Computer Science from NAIST. Super programmer award from the MITOU Foundation. Super creators in TD: Sada Furuhashi, Keisuke Nishida.
  • 3. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall Internals; 4. How to use Hivemall: Logistic regression (RDBMS integration), Matrix Factorization, Anomaly Detection (demo), Random Forest (demo)
  • 4. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. https://github.com/myui/hivemall
  • 5. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2.
      Software stack, top to bottom:
      Machine Learning: Hivemall
      Query Processing: Hive / Pig
      Parallel Data Processing Framework: MapReduce (MR v1); Apache Tez DAG processing; MR v2
      Resource Management: Apache YARN
      Distributed File System: Hadoop HDFS
  • 6. MapReduce and DAG engine [figure: with MapReduce, each chained job (M = map task, R = reduce task) reads from and writes to HDFS; with a DAG engine (Tez / Spark), the stages are chained directly]. No intermediate DFS reads/writes!
  • 7. Won IDG's InfoWorld Bossie Awards 2014: The best open source big data tools. InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem. bit.ly/hivemall-award
  • 8. List of Features in Hivemall v0.3.2 (Treasure Data supports Hivemall v0.3.2-3)
      Classification (both binary- and multi-class): Perceptron; Passive Aggressive (PA); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA
      Regression: Logistic Regression (SGD); PA Regression; AROW Regression; AdaGrad; AdaDELTA
      kNN and Recommendation: Minhash and b-Bit Minhash (LSH variant); Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular); Matrix Factorization
      Feature engineering: Feature Hashing; Feature Scaling (normalization, z-score); TF-IDF vectorizer; Polynomial Expansion
      Anomaly Detection: Local Outlier Factor
  • 9. List of Features in Hivemall: Hivemall supports the state-of-the-art online learning algorithms (for classification and regression). CW variants are very smart online ML algorithms.
      Algorithm | News20.binary classification accuracy (rows sorted from worst to best)
      Perceptron | 0.9460
      Passive-Aggressive (a.k.a. Online-SVM) | 0.9604
      LibLinear | 0.9636
      LibSVM/TinySVM | 0.9643
      Confidence Weighted (CW) | 0.9656
      AROW [1] | 0.9660
      SCW [2] | 0.9662
  • 10. Why are CW variants so good? Suppose a binary classification setting that classifies sentences as positive or negative → learn the weight for each word (each word is a feature).
      Label | Feature Vector
      Positive | I like this author
      Negative | I like this author, but found this book dull
      A naïve update will reduce both W_like and W_dull at the same rate. CW variants adjust weights at different rates.
  • 11. Why are CW variants so good? [figure: one panel adjusts a weight only; the other adjusts a weight and its confidence (covariance); axis values shown: 0.5, 0.6, 0.8; annotation: "At this confidence, the weight is 0.5"]
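The "different rates" idea can be sketched with AROW, one of the CW variants listed earlier. This is a minimal sketch that assumes a diagonal covariance, binary bag-of-words features, and a regularization value `r=1.0`; the function name `arow_update` and the data layout are hypothetical and are not Hivemall's API.

```python
from collections import defaultdict

def arow_update(weights, variances, features, label, r=1.0):
    """One AROW step with a diagonal covariance.

    features maps word -> value, label is +1 or -1.
    r is a regularization hyperparameter (assumed value).
    """
    margin = label * sum(weights[f] * v for f, v in features.items())
    if margin >= 1.0:
        return  # confidently correct: no update
    # Variance of the margin under the current per-feature confidence.
    variance = sum(variances[f] * v * v for f, v in features.items())
    beta = 1.0 / (variance + r)
    alpha = (1.0 - margin) * beta
    for f, v in features.items():
        # Features with high variance (low confidence) move more.
        weights[f] += alpha * label * variances[f] * v
        # Every feature that was seen becomes more confident.
        variances[f] -= beta * (variances[f] * v) ** 2

weights = defaultdict(float)          # mean weight per word
variances = defaultdict(lambda: 1.0)  # initial uncertainty per word

positive = {w: 1.0 for w in "i like this author".split()}
negative = {w: 1.0 for w in "i like this author but found book dull".split()}

# "like" is seen twice in positive sentences, so its confidence grows.
for _ in range(2):
    arow_update(weights, variances, positive, +1)

like_before, dull_before = weights["like"], weights["dull"]
arow_update(weights, variances, negative, -1)
like_shift = abs(weights["like"] - like_before)
dull_shift = abs(weights["dull"] - dull_before)
```

After the negative sentence, `dull_shift` is larger than `like_shift`: the never-seen word "dull" absorbs more of the correction than the already-confident word "like", which is the behavior the slide attributes to CW variants.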
  • 12. Features to be supported from Hivemall v0.4: 1. RandomForest (classification, regression); 2. Gradient Tree Boosting (classification, regression); 3. Factorization Machine (classification, regression (factorization)); 4. Online LDA (topic modeling, clustering). Planned to release v0.4 in Oct. Gradient Boosting and Factorization Machine are often used by data science competition winners (very important for practitioners).
  • 13. Factorization Machine and Matrix Factorization [figure]
  • 14. Factorization Machine: context information (e.g., time) can be considered. Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
  • 15. Factorization Machine: factorization model with degree = 2 (2-way interaction). Terms of the model equation: global bias; regression coefficient of the j-th variable; pairwise interaction; factorization.
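The equation on this slide was an image and did not survive extraction. For reference, the degree-2 model in the paper cited on the previous slide (Rendle 2010) has the following standard form; the symbol names here follow common usage and are an assumption about what the slide showed.

```latex
\hat{y}(\mathbf{x}) =
  \underbrace{w_0}_{\text{global bias}}
  + \underbrace{\sum_{j=1}^{n} w_j x_j}_{\text{regression coefficient of the } j\text{-th variable}}
  + \underbrace{\sum_{j=1}^{n} \sum_{j'=j+1}^{n}
      \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}}_{\text{pairwise interaction}},
\qquad
\langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle = \sum_{f=1}^{k} v_{j,f} \, v_{j',f}
```

The "factorization" label refers to the inner product: each pairwise weight is not a free parameter but is factorized into two k-dimensional vectors.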
  • 16. Industry use cases of Hivemall
      CTR prediction of Ad click logs. Algorithm: Logistic regression. Freakout Inc., Smartnews, and more
      Gender prediction of Ad click logs. Algorithm: Classification. Scaleout Inc.
      Churn Detection. Algorithm: Regression. OISIX and more
      Item/User recommendation. Algorithm: Recommendation (Matrix Factorization / kNN). Wish.com, DAC, Real-estate Portal, and more
      Value prediction of Real estates. Algorithm: Regression. Livesense
  • 17. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall Internals; 4. How to use Hivemall: Logistic regression (RDBMS integration), Matrix Factorization, Anomaly Detection (demo), Random Forest (demo)
  • 18. Why Hivemall: 1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive. 2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable. That's why I built Hivemall.
  • 19. Data Moving in Data Analytics [figure: pipeline from Event Data to Insights and Decisions through Data Collection, Data Lake, Data Processing, Data Mart, and Data Analysis; services shown: Amazon S3, Amazon EMR, Redshift, Amazon RDS; roles shown: Data Engineer, Data Scientist]
  • 20. Data Moving in Data Analytics [figure: What Data Scientists actually Do vs. What Data Scientists Should Do]. Hive is a great data preprocessing tool due to its ease of use and efficiency for join, filtering, and selection.
  • 21. How I used to do ML projects before Hivemall: given raw data stored on Hadoop HDFS [figure: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (height:173cm, weight:60kg, age:34, gender: man, …) → Machine Learning]
  • 22. How I used to do ML projects before Hivemall: given raw data stored on Hadoop HDFS [same figure]. The Extract-Transform-Load step needs expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory).
  • 23. How I used to do ML projects before Hivemall: given raw data stored on Hadoop HDFS [same figure]. The Machine Learning step does not scale, and you have to learn R/Python APIs.
  • 24. How I used to do ML before Hivemall: given raw data stored on Hadoop HDFS [same figure]. Does not meet my needs in terms of scalability, ML algorithms, and usability. I ❤ scalable SQL query.
  • 25. Survey on existing ML frameworks: existing distributed machine learning frameworks are NOT easy to use.
      Framework | User interface
      Mahout | Java API programming
      Spark MLlib/MLI | Scala API programming; Scala Shell (REPL)
      H2O | R programming; GUI
      Cloudera Oryx | HTTP REST API programming
      Vowpal Wabbit (w/ Hadoop streaming) | C++ API programming; Command Line
  • 26. Motivation: Machine learning needs to be easier for developers (esp. data engineers)! People are saying that ..
  • 27. Hivemall's Vision: ML on SQL. Machine Learning made easy for SQL developers (ML for the rest of us). Interactive and stable APIs w/ SQL abstraction. Classification with Mahout [figure]. This SQL query automatically runs in parallel on Hadoop:
      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) as weight  -- reducers perform model averaging in parallel
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t  -- map-only task
      GROUP BY feature;  -- shuffled to reducers
  • 28. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall Internals; 4. How to use Hivemall: Logistic regression (RDBMS integration), Matrix Factorization, Anomaly Detection (demo), Random Forest (demo)
  • 29. How Hivemall works in training: machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs). A UDTF is a function that returns a relation. [figure: the training table of tuple<label, array<features>> rows (+1, <1,2>; +1, <1,7,9>; -1, <1,3,9>; +1, <3,8>) is split across parallel train tasks; each emits tuple<feature, weights>, which are shuffled by feature to param-mix tasks that produce the prediction model, a relation <feature, weights>]. The resulting prediction model is a relation of features and their weights. The numbers of mappers and reducers are configurable. Parallelism is powerful.
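The train / shuffle-by-feature / param-mix flow above can be sketched in plain Python. This is a minimal sketch under stated assumptions: `train_logress` and `average_by_feature` are hypothetical names, labels are mapped from +1/-1 to 1/0, features are binary, and the learning rate `eta` is arbitrary. It is not Hivemall's implementation of `logress`.

```python
import math
from collections import defaultdict

def train_logress(rows, eta=0.1):
    """Emulates one map task: SGD logistic regression over its own rows.

    rows holds (label, features) with label in {0, 1} and features a list
    of feature ids. Returns (feature, weight) pairs, like a UDTF emitting
    a relation.
    """
    w = defaultdict(float)
    for label, features in rows:
        z = sum(w[f] for f in features)
        p = 1.0 / (1.0 + math.exp(-z))
        for f in features:
            w[f] += eta * (label - p)
    return list(w.items())

def average_by_feature(emitted):
    """Emulates the reducers: GROUP BY feature, avg(weight)."""
    total, count = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        total[feature] += weight
        count[feature] += 1
    return {f: total[f] / count[f] for f in total}

# Two partitions of the training table, one per (emulated) map task.
partitions = [
    [(1, [1, 2]), (1, [1, 7, 9]), (0, [1, 3, 9])],
    [(0, [2, 7, 9]), (1, [3, 8])],
]
emitted = [pair for part in partitions for pair in train_logress(part)]
model = average_by_feature(emitted)
```

Each partition is trained independently, so the map side needs no coordination; the only communication is the shuffle of `(feature, weight)` pairs, and each feature's average can be computed by a different reducer.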
  • 30. Why not UDAF: machine learning as an aggregate function. [figure: train tasks run 4 ops in parallel over tuple<label, array<features>> rows and emit array<weight>; merge tasks run 2 ops in parallel over array<sum of weight>, array<count>; the final merge has no parallelism and produces the prediction model]. Bottleneck in the final merge: throughput is limited by its fan-out, memory consumption grows, and parallelism decreases.
  • 31. Problem that I faced: Iterations. Iterations are mandatory to get a good prediction model. However, MapReduce is not suited for iterations because the input and output of each MR job go through HDFS. Spark avoids this by in-memory computation. [figure: with MapReduce, Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …; with Spark, Input → iter. 1 → iter. 2]
  • 32. Training with Iterations in Spark: logistic regression example of Spark.
      val data = spark.textFile(...).map(readPoint).cache()  // for each node, loads data in memory once
      for (i <- 1 to ITERATIONS) {
        // repeated MapReduce steps to do gradient descent
        val gradient = data.map(p =>
          (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
      }
      This is just a toy example! Why? Input to the gradient computation should be shuffled for each iteration (without it, more iterations are required).
  • 33. What does MLlib actually do? Mini-batch Gradient Descent with Sampling (GradientDescent.scala, bit.ly/spark-gd).
      val data = ..
      for (i <- 1 to numIterations) {
        val sampled =   // sample a subset of the data (partitioned RDD)
        val gradient =  // average the subgradients over the sampled data using Spark MapReduce
        w -= gradient
      }
      Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.
  • 34. Alternative Approach in Hivemall: Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps.
      SET hivevar:xtimes=3;
      CREATE VIEW training_x3 as
      SELECT * FROM (
        SELECT amplify(${xtimes}, *) as (rowid, label, features)
        FROM training
      ) t
      CLUSTER BY rand()
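The effect of the query above can be sketched in Python. This is a minimal sketch: the `amplify` function below is a hypothetical stand-in for the UDTF, and a seeded in-memory shuffle stands in for `CLUSTER BY rand()`, which in Hive is a distributed shuffle.

```python
import random

def amplify(xtimes, rows):
    """Emit every input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row

# (rowid, label, features), matching the column list in the view.
training = [(1, +1, ["a", "b"]), (2, -1, ["c"]), (3, +1, ["a"])]

training_x3 = list(amplify(3, training))
random.Random(42).shuffle(training_x3)  # stands in for CLUSTER BY rand()
```

A single pass of an online learner over `training_x3` then sees each example three times in mixed order, which approximates three epochs inside one job.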
  • 35. Map-only shuffling and amplifying: the rand_amplify UDTF randomly shuffles the input rows for each Map task.
      CREATE VIEW training_x3 as
      SELECT rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
      FROM training;
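One way to picture a map-local shuffle is a bounded buffer that is shuffled and flushed whenever it fills. This is a sketch of that idea only; the function below is hypothetical, and the actual buffering strategy inside Hivemall's `rand_amplify` may differ.

```python
import random

def rand_amplify(xtimes, buffer_size, rows, seed=None):
    """Amplify rows xtimes and shuffle them within a bounded buffer.

    Rows are mixed only within buffer_size consecutive amplified rows, so
    memory stays bounded and no reduce-side shuffle is needed.
    """
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        for _ in range(xtimes):
            buffer.append(row)
            if len(buffer) >= buffer_size:
                rng.shuffle(buffer)
                yield from buffer
                buffer.clear()
    # Flush whatever is left at the end of the input split.
    rng.shuffle(buffer)
    yield from buffer

rows = list(range(5))
shuffled = list(rand_amplify(3, 4, rows, seed=0))
```

The trade-off is that the shuffle is local: a larger buffer mixes rows more thoroughly at the cost of more memory per map task.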
  • 36. Detailed plan w/ map-local shuffle. Map task: Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write. Shuffle (distributed by feature). Reduce task: Merge → Aggregate → Reduce write. Scanned entries are amplified and then shuffled. Note that this is a pipeline op: the Rand Amplifier operator is interleaved between the table scan and the training operator.
  • 37. Performance effects of amplifiers: with the map-local shuffle, prediction accuracy improved with an acceptable overhead.
      Method | Elapsed time (sec) | AUC
      Plain | 89.718 | 0.734805
      amplifier + clustered by (a.k.a. global shuffle) | 479.855 | 0.746214
      rand_amplifier (a.k.a. map-local shuffle) | 116.424 | 0.743392
  • 38. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall Internals; 4. How to use Hivemall: Logistic regression (RDBMS integration), Matrix Factorization, Anomaly Detection (demo), Random Forest (demo)
  • 39. How to use Hivemall: Data preparation [figure: Machine Learning workflow: Training takes Label and Feature Vector and produces a Prediction Model; Prediction takes a Feature Vector and the Prediction Model and outputs a Label]
  • 40. How to use Hivemall - Data preparation: define a Hive table for training/testing data.
      CREATE EXTERNAL TABLE e2006tfidf_train (
        rowid int,
        label float,
        features ARRAY<STRING>
      )
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        COLLECTION ITEMS TERMINATED BY ","
      STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
  • 41. How to use Hivemall: Feature Engineering [figure: Machine Learning workflow: Training takes Label and Feature Vector and produces a Prediction Model; Prediction takes a Feature Vector and the Prediction Model and outputs a Label]
  • 42. How to use Hivemall - Feature Engineering: applying a Min-Max Feature Normalization, transforming a label value to a value between 0.0 and 1.0.
      create view e2006tfidf_train_scaled as
      select
        rowid,
        rescale(target,${min_label},${max_label}) as label,
        features
      from e2006tfidf_train;
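Min-max normalization maps a value from `[min, max]` onto `[0.0, 1.0]`. A minimal Python sketch of the arithmetic follows; the function is a hypothetical illustration, and the fallback for `min == max` is an assumption made only to avoid division by zero.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0, 1]."""
    if max_value == min_value:
        return 0.5  # assumed fallback for a degenerate range
    return (value - min_value) / (max_value - min_value)

labels = [-7.9, -3.4, -0.5]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

In the query above, `${min_label}` and `${max_label}` play the role of `lo` and `hi`, so they have to be computed over the training data beforehand.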
  • 43. How to use Hivemall: Training [figure: Machine Learning workflow: Training takes Label and Feature Vector and produces a Prediction Model; Prediction takes a Feature Vector and the Prediction Model and outputs a Label]
  • 44. How to use Hivemall - Training: training by logistic regression.
      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) as weight  -- reducers perform model averaging in parallel
      FROM (
        SELECT logress(features,label,..) as (feature,weight)  -- map-only task to learn a prediction model
        FROM train
      ) t
      GROUP BY feature  -- shuffle map outputs to reducers by feature
  • 45. How to use Hivemall - Training: training of the Confidence Weighted (CW) classifier.
      CREATE TABLE news20b_cw_model1 AS
      SELECT
        feature,
        voted_avg(weight) as weight  -- vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
      FROM (
        SELECT train_cw(features,label) as (feature,weight)
        FROM news20b_train
      ) t
      GROUP BY feature
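The voting step can be sketched in Python. This is a sketch of the idea as the slide describes it (vote on the sign, then average the winning side); the function is hypothetical, and the tie-breaking and empty-input behavior are assumptions rather than Hivemall's documented semantics.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: ties go to the negative side.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0

# The example from the slide: four positive votes, one negative.
result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

For the slide's example the positive side wins 4 to 1, so the result is the mean of the four positive weights, about 0.475; a plain `avg` would give 0.36 because the dissenting -0.1 pulls it down.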
  • 46. Ensemble learning for stable prediction performance: just stack prediction models by union all.
      create table news20mc_ensemble_model1 as
      select
        label,
        cast(feature as int) as feature,
        cast(voted_avg(weight) as float) as weight
      from (
        select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
        from news20mc_train_x3
        union all
        select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
        from news20mc_train_x3
        union all
        select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
        from news20mc_train_x3
      ) t
      group by label,feature;
  • 47. How to use Hivemall: Prediction [figure: Machine Learning workflow: Training takes Label and Feature Vector and produces a Prediction Model; Prediction takes a Feature Vector and the Prediction Model and outputs a Label]
  • 48. How to use Hivemall - Prediction: prediction is done by a LEFT OUTER JOIN between the test data and the prediction model. No need to load the entire model into memory.
      CREATE TABLE lr_predict as
      SELECT
        t.rowid,
        sigmoid(sum(m.weight)) as prob
      FROM testing_exploded t
      LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
      GROUP BY t.rowid
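The join-then-aggregate prediction can be sketched in Python. This is a minimal sketch with hypothetical data: `testing_exploded` is modeled as `(rowid, feature)` pairs and the model as a dict, and, like the query, it sums raw weights, which assumes binary feature values.

```python
import math
from collections import defaultdict

def predict(testing_exploded, lr_model):
    """Join (rowid, feature) rows with the model, then sum and squash per row."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model contributes nothing.
        totals[rowid] += lr_model.get(feature, 0.0)
    return {rowid: 1.0 / (1.0 + math.exp(-s)) for rowid, s in totals.items()}

lr_model = {"price:high": 2.0, "rating:low": -1.0}
testing_exploded = [
    (1, "price:high"),
    (2, "price:high"), (2, "rating:low"),
    (3, "never_seen"),
]
probs = predict(testing_exploded, lr_model)
```

One difference to keep in mind: in SQL, a row whose features all miss the model yields a NULL sum, whereas this sketch returns 0.5 for such a row (row 3 here).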
  • 49. How to use Hivemall: Export prediction model [figure: Machine Learning workflow: Batch Training on Hadoop takes Label and Feature Vector and produces a Prediction Model; Online Prediction on RDBMS takes a Feature Vector and the exported Prediction Model and outputs a Label]
  • 50. Real-time Prediction on Treasure Data: run the batch training job periodically, export the model periodically, and do real-time prediction on an RDBMS.
  • 51. Agenda: 1. What is Hivemall; 2. Why Hivemall (motivations etc.); 3. Hivemall Internals; 4. How to use Hivemall: Logistic regression (RDBMS integration), Matrix Factorization, Anomaly Detection (demo), Random Forest (demo)
  • 52. Supervised Learning: Recommendation. Rating prediction of a matrix. Can be applied to user/item recommendation.
  • 53. Matrix Factorization: factorize a matrix into a product of matrices having k latent factors.
  • 54. Matrix Factorization: criteria of Biased MF. Terms of the objective: mean rating; bias for each user/item; factorization; regularization.
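The formula on this slide was an image and did not survive extraction. The standard biased matrix factorization criterion, which matches the four labels on the slide, is written below; the symbols are the conventional ones and are an assumption about what the slide showed.

```latex
\hat{r}_{ui} = \underbrace{\mu}_{\text{mean rating}}
             + \underbrace{b_u + b_i}_{\text{bias for each user/item}}
             + \underbrace{\mathbf{p}_u^{\top} \mathbf{q}_i}_{\text{factorization}}

\min_{\mathbf{p}, \mathbf{q}, b} \;
  \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \underbrace{\lambda \left( \lVert \mathbf{p}_u \rVert^2
      + \lVert \mathbf{q}_i \rVert^2 + b_u^2 + b_i^2 \right)}_{\text{regularization}}
```

Here $\mathcal{K}$ is the set of observed (user, item) ratings, and $\mathbf{p}_u$, $\mathbf{q}_i$ are the k-dimensional latent factor vectors from the previous slide.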
  • 55. Training of Matrix Factorization: supports iterative training using a local disk cache.
  • 56. Prediction of Matrix Factorization
  • 57. Comparison to Spark MLlib
      Algorithm is different. Spark: ALS-WR (considers regularization). Hivemall: Biased-MF (considers regularization and biases).
      Usability. Spark: 100+ lines of Scala coding. Hivemall: SQL.
      Prediction accuracy. Almost the same for the MovieLens 10M dataset.
  • 58. Unsupervised Learning: Anomaly Detection. Sensor data etc. Anomaly detection runs as a series of SQL queries.
      rowid | features
      1 | ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
      2 | ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
      3 | ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
  • 59. Anomalies in Sensor Data. Source: https://codeiq.jp/q/207
  • 60. Local Outlier Factor (LOF). Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. Image source: https://en.wikipedia.org/wiki/Local_outlier_factor
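The density comparison can be sketched in a few lines of Python. This is a brute-force sketch of the textbook LOF definition, not Hivemall's SQL-based implementation; it assumes distinct points, uses exactly k neighbors (ties at the k-th distance are ignored), and requires Python 3.8+ for `math.dist`.

```python
import math

def local_outlier_factors(points, k):
    """Brute-force LOF for a small list of points (tuples of floats)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance.
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Four points on a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = local_outlier_factors(points, k=2)
```

Points inside the cluster score close to 1 (their density matches their neighbors'), while the isolated point scores well above 1, which is what flags it as an anomaly.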
  • 61. DEMO: Local Outlier Factor
      rowid | features
      1 | ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
      2 | ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
      3 | ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
  • 62. RandomForest in Hivemall v0.4: an ensemble of Decision Trees. Already available on a development (smile) branch, and its usage is explained in the project wiki.
  • 63. Training of RandomForest
  • 64. Out-of-bag tests and Variable Importance
  • 65. Prediction of RandomForest
  • 66. Jupyter Integration DEMO
  • 67. Conclusion and Takeaway. Hivemall's positioning: Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs. For SQL users who need ML. For those already using Hive. Designed with ease of use and scalability in mind. Does not require coding, packaging, compiling, or introducing a new programming language or APIs. v0.4 will make a developmental leap.
  • 69. Beyond Query-as-a-Service! We Open-source! We invented .. We are hiring machine learning engineers!