
20160908 hivemall meetup



A slide for Hivemall Meetup#3

Published in: Engineering


  1. Copyright©2016 NTT corp. All Rights Reserved.
     Hivemall Meets XGBoost in DataFrame/Spark
     2016/9/8 Takeshi Yamamuro (maropu) @ NTT
  2. Who am I?
  3. XGBoost is...
     • Short for eXtreme Gradient Boosting
       • https://github.com/dmlc/xgboost
     • It is...
       • a variant of the gradient boosting machine
       • a tree-based model
       • an open-sourced tool (Apache2 license)
         • written in C++
         • R/Python/Julia/Java/Scala interfaces provided
       • widely used in Kaggle competitions
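The "gradient boosting machine" idea behind XGBoost can be sketched in a few lines: each round fits a small tree (here a depth-1 stump) to the current residuals and adds it, scaled by a learning rate (XGBoost's `eta`), to the ensemble. This is an illustrative toy for squared loss, not XGBoost itself:

```python
import numpy as np

def fit_stump(X, residual):
    """Exhaustively pick the (feature, threshold) split minimizing squared error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # splitting at the max puts everything left
            left = X[:, j] <= t
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = ((residual[left] - lv) ** 2).sum() + ((residual[~left] - rv) ** 2).sum()
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    return best

def stump_predict(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def gbm_fit(X, y, eta=0.1, rounds=50):
    """Gradient boosting for squared loss: the residual is simply y - prediction."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(rounds):
        s = fit_stump(X, y - pred)  # fit the next tree to the current residuals
        stumps.append(s)
        pred = pred + eta * stump_predict(s, X)
    return base, stumps

def gbm_predict(base, stumps, X, eta=0.1):
    pred = np.full(X.shape[0], base)
    for s in stumps:
        pred = pred + eta * stump_predict(s, X)
    return pred
```

XGBoost adds regularized, second-order tree learning on top of this basic loop, but the round-by-round structure is the same.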
  4. Hivemall in DataFrame/Spark
     • Most Hivemall functions supported in Spark v1.6 and v2.0
       • the v2.0 support is not released yet
     • XGBoost integration under development
       • distributed/parallel predictions
       • native libraries bundled for major platforms
         • Mac/Linux on x86_64
       • how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
  5. Spark Quick Examples
     • Fetch a binary Spark v2.0.0
       • http://spark.apache.org/downloads.html

     $ <SPARK_HOME>/bin/spark-shell
     scala> :paste
     val textFile = sc.textFile("hoge.txt")
     val counts = textFile.flatMap(_.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
  6. Fetch training and test data
     • E2006 tfidf regression dataset
       • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

     $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
     $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
  7. XGBoost in spark-shell
     • Scala interface bundled in the Hivemall jar

     $ bunzip2 E2006.train.bz2
     $ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
     scala> import ml.dmlc.xgboost4j.scala._
     scala> :paste
     // Read training data
     val trainData = new DMatrix("E2006.train")
     // Define parameters
     val paramMap = List(
       "eta" -> 0.1,
       "max_depth" -> 2,
       "objective" -> "reg:logistic"
     ).toMap
     // Train the model
     val model = XGBoost.train(trainData, paramMap, 2)
     // Save the model to a file
     model.saveModel("xgboost_models_dir/xgb_0001.model")
  8. Load test data in parallel

     $ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
     // Create a DataFrame for the test data
     scala> val testDf = sqlContext.sparkSession.read.format("libsvm")
       .load("E2006.test.bz2")
     scala> testDf.printSchema
     root
      |-- label: double (nullable = true)
      |-- features: vector (nullable = true)
  9. Load test data in parallel
     [diagram: testDf split into Partition1..PartitionN — the file is loaded in
      parallel because bzip2 is splittable]
     • #partitions depends on three parameters
       • spark.default.parallelism: #cores by default
       • spark.sql.files.maxPartitionBytes: 128MB by default
       • spark.sql.files.openCostInBytes: 4MB by default
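Roughly, Spark 2.0's file-scan planner combines the three parameters above into a per-partition byte target: the input bytes (plus a per-file "open cost") are spread over the available cores, then clamped between the open cost and the max partition size. A simplified sketch of that formula, not Spark's exact code:

```python
def max_split_bytes(total_bytes, num_files,
                    default_parallelism=8,          # spark.default.parallelism (#cores)
                    max_partition_bytes=128 << 20,  # spark.sql.files.maxPartitionBytes
                    open_cost=4 << 20):             # spark.sql.files.openCostInBytes
    """Approximate the per-partition byte target used when planning file scans."""
    # Each file "costs" open_cost extra bytes, so many small files still
    # spread across the cores instead of collapsing into one partition.
    bytes_per_core = (total_bytes + num_files * open_cost) // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))
```

For example, a single 1 GiB splittable file on 8 cores is capped at 128 MB per split, so it loads as roughly 8 partitions.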
  10. Do predictions in parallel
      • XGBoost in DataFrame
        • Load built models and do cross-joins for predictions

      scala> import org.apache.spark.hive.HivemallOps._
      scala> :paste
      // Load built models from persistent storage
      val modelsDf = sqlContext.sparkSession.read.format(xgboost)
        .load("xgboost_models_dir")
      // Do predictions in parallel via cross-joins
      val predict = modelsDf.join(testDf)
        .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
        .groupBy("rowid")
        .avg()
  11. Do predictions in parallel
      • XGBoost in DataFrame
        • Load built models and do cross-joins for predictions
      • Broadcast cross-joins expected
        • the size of `modelsDf` must be less than or equal to
          spark.sql.autoBroadcastJoinThreshold (10MB by default)
      [diagram: testDf (rowid, label, features) cross-joined in parallel with
       modelsDf (model_id, pred_model)]
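The final `groupBy("rowid").avg()` is plain ensemble averaging: the cross-join emits one prediction per (row, model) pair, and those are averaged per row. A minimal stand-alone sketch with made-up values:

```python
from collections import defaultdict

# One prediction per (rowid, model_id) pair, as the cross-join produces
# (the values here are made up for illustration).
rows = [
    (1, "xgb_0001.model", 0.40), (1, "xgb_0002.model", 0.60),
    (2, "xgb_0001.model", 0.10), (2, "xgb_0002.model", 0.30),
]

def average_per_rowid(rows):
    """Group predictions by rowid and average them, like groupBy + avg."""
    acc = defaultdict(list)
    for rowid, _model_id, pred in rows:
        acc[rowid].append(pred)
    return {rowid: sum(ps) / len(ps) for rowid, ps in acc.items()}

ensemble = average_per_rowid(rows)  # one averaged prediction per rowid
```

Averaging several models trained with different parameters (as on the model-building slide) is what makes the cross-join worthwhile: each test row is scored by every model.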
  12. Do predictions for streaming data
      • Structured Streaming in Spark 2.0
        • Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
        • an alpha component in v2.0

      scala> :paste
      // Initialize a streaming DataFrame
      val testStreamingDf = spark.readStream
        .format("libsvm") // Not supported in v2.0
      …
      // Do predictions for streaming data
      val predict = modelsDf.join(testStreamingDf)
        .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
        .groupBy("rowid")
        .avg()
  13. Build models in parallel
      • One model per partition
        • WIP: Build models with different parameters

      scala> :paste
      // Set options for XGBoost
      val xgbOptions = XGBoostOptions()
        .set("num_round", "10000")
        .set("max_depth", "32,48,64") // Randomly selected by workers
      // Set # of models to output
      val numModels = 4
      // Build models and save them in persistent storage
      trainDf.repartition(numModels)
        .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
        .write
        .format(xgboost)
        .save("xgboost_models_dir")
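"Randomly selected by workers" means each of the `numModels` partitions draws its own value from the comma-separated candidate list, so the resulting models differ in their hyperparameters. A hypothetical sketch of that per-worker selection (the function name is mine, not Hivemall's):

```python
import random

def pick_candidate(csv_options, seed=None):
    """Draw one value from a comma-separated candidate list, as each
    worker might do for an option like max_depth = "32,48,64"."""
    rng = random.Random(seed)
    return int(rng.choice(csv_options.split(",")))

# Each of the 4 model-building partitions picks independently:
depths = [pick_candidate("32,48,64", seed=i) for i in range(4)]
```

Training diverse models this way feeds naturally into the averaged predictions on the earlier slide.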
  14. Compile a binary on your platform
      • If you get stuck on an UnsatisfiedLinkError, you need to compile a binary yourself

      $ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
      $ ls target
      hivemall-core-0.4.2-rc.2-with-dependencies.jar
      hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
      hivemall-core-0.4.2-rc.2.jar
      hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
      hivemall-mixserv-0.4.2-rc.2-fat.jar
      hivemall-xgboost-0.4.2-rc.2.jar
      hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
      hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
      hivemall-nlp-0.4.2-rc.2.jar
      hivemall-xgboost_0.60-0.4.2-rc.2.jar
  15. Future Work
      • Rabit integration for parallel learning
        • http://dmlc.cs.washington.edu/rabit.html
      • Python support
      • spark.ml interface support
      • Bundle more binaries for portability
        • Windows and x86 platforms
      • Others?
