Hadoop meetup: HUGFR. Building the fastest cluster for data analysis: benchmarks on a regressor, by Christopher Bourez (Axa Global Direct)

Building the fastest cluster for data analysis: benchmarks on a regressor
Christopher Bourez
16 / 02 / 2016
http://christopher5106.github.io/

Outline
1. Distributed Random Forest on a cluster (Spark MLlib)
2. Auto-terminating cloud clusters
3. Random Forest on GPU (Scala BIDMach)
4. GPU cluster
5. Benchmarks

Random Forest MLlib (MLlib Python)
Local development with pyspark
pyspark --master local[4]
Submitting the script locally
spark-submit --master local[4] src/main/python/compute_rf.py myfile.csv
Submitting to the cluster

Script .py
import sys

from pyspark import SparkConf, SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext()
# Assumption: the CSV path is the first argument given to spark-submit.
filename = sys.argv[1]
file = sc.textFile(filename)

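Conversion to the LabeledPoint format
# index_label, index_expl_start and index_expl_stop are column positions
# that must be defined beforehand for the CSV file at hand.
def transform_line_to_labeledpoint(line):
    values = line.split(";")
    # An empty label field is read as 0.0
    if values[index_label] == "":
        label = 0.0
    else:
        label = float(values[index_label])
    # Explanatory variables: empty fields are read as 0.0 as well
    vector = []
    for i in range(index_expl_start, index_expl_stop):
        if values[i] == "":
            vector.append(0.0)
        else:
            vector.append(float(values[i]))
    return LabeledPoint(label, vector)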
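
To make the conversion concrete, here is a Spark-free sketch of the same parsing logic applied to a single line. The column positions (label in column 0, explanatory variables in columns 1 to 3), the sample line and the name parse_line are assumptions chosen for illustration; the slides do not give them. The function returns a plain tuple rather than a LabeledPoint so that it runs without pyspark.

```python
# Assumed column layout, for illustration only (not given on the slides).
index_label = 0
index_expl_start = 1
index_expl_stop = 4


def parse_line(line):
    """Return (label, features), reading empty fields as 0.0."""
    values = line.split(";")
    label = float(values[index_label]) if values[index_label] != "" else 0.0
    features = [
        float(values[i]) if values[i] != "" else 0.0
        for i in range(index_expl_start, index_expl_stop)
    ]
    return label, features


# The third field is empty, so it becomes 0.0 in the feature vector.
label, features = parse_line("3.5;1.0;;2.0")
print(label, features)
```

Reading missing values as 0.0 is the choice made on the slide; it is simple, but it makes a missing value indistinguishable from a true zero.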

Model computation
# Note: a local file URI normally takes an absolute path, i.e. file:///...
file = sc.textFile("file://test.csv")
# header holds the first line of the CSV file and must be defined beforehand
data = file.filter(lambda w: not w.startswith(header)).map(transform_line_to_labeledpoint)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
trainingData.cache()
testData.cache()
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=2500, featureSubsetStrategy="sqrt",
                                    impurity='variance')
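
One point worth keeping in mind about the 70/30 split: the weights act as probabilities, so the resulting sizes are only approximately 70% and 30%. The following is a pure-Python sketch of a per-element random assignment that illustrates this; it is not Spark's implementation, and the name random_split and the seed are hypothetical.

```python
import random


def random_split(items, weights, seed=42):
    """Assign each item to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for item in items:
        x = rng.random()
        for k, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if x < upper or k == len(bounds) - 1:
                splits[k].append(item)
                break
    return splits


training, test = random_split(list(range(1000)), [0.7, 0.3])
print(len(training), len(test))  # close to 700 and 300, but not exact
```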

Predictions
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair instead of unpacking it in the
# lambda signature, which only Python 2 accepts.
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
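
The test error computed here is the mean of the squared differences between labels and predictions. As a small worked example outside Spark, with made-up values and a hypothetical helper name:

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(v - p) ** 2 for v, p in zip(labels, predictions)]
    return sum(squared_errors) / float(len(squared_errors))


# Made-up values, for illustration only: the errors are 1, 0 and -2.
labels = [3.0, 1.0, 2.0]
predictions = [2.0, 1.0, 4.0]
mse = mean_squared_error(labels, predictions)
print(mse)  # (1 + 0 + 4) / 3
```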

Random Forest MLlib (MLlib Scala)
Build: sbt package
bin/spark-submit --master local[4] --class "App" target/scala-2.10/project_2.10-1.0.jar myfile.csv
object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Application")
    val sc = new SparkContext(conf)
    // ... rest of the job
  }
}

Scala script
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

Scala transformation
val LabeledPointData = data.map(r => {
  val values = r.split(";")
  // Explanatory variables: every column from index_expl_start onwards,
  // with empty fields read as 0.0
  val expl = values.zipWithIndex.flatMap {
    case ("", i: Int) if (i >= index_expl_start) => Some(0.0)
    case (s: String, i: Int) if (i >= index_expl_start) => Some(s.toDouble)
    case _ => None
  }
  val label = if (values(index_label) == "") 0.0 else values(index_label).toDouble
  LabeledPoint(label, Vectors.dense(expl))
})

Model computation
val data = sc.textFile("file://test.csv")
val LabeledPointData = data.map(r => { … } )
val splits = LabeledPointData.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val model = RandomForest.trainRegressor(trainingData,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth,
  maxBins)

Evaluation
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression forest model:\n" + model.toDebugString)

AWS auto-terminating cloud clusters

Logistic regression Criteo Dataset

Logistic Regression on Reuters data

Same on LDA, Matrix Factorization...

Model computation
val (mm,opts) = RandomForest.learner("data%02d.fmat.lz4","label%02d.imat.lz4")
opts.batchSize = 1000
opts.nend = 50
opts.depth = 5
opts.ncats = 2 // number of categories of label
opts.ntrees = 20
opts.impurity = 0
opts.nsamps = 12
opts.nnodes = 50000
opts.nbits = 16
opts.gain = 0.001f
opts.useGPU = true // set before training so that it takes effect
mm.train

Spark+GPU
Creating several AMIs:
- AMI with the NVIDIA driver and CUDA 7.5
- AMI with the NVIDIA driver, CUDA 7.5, JCuda, BIDMat and BIDMach
- AMI with the NVIDIA driver, CUDA 7.5, JCuda, BIDMat, BIDMach and Spark
Compiling Spark with Scala 2.11
Launching the cluster by hand:
- configure conf/slaves and conf/spark-env.sh
- add the key to the ssh-agent
- start the cluster with sbin/start-all.sh

Run Spark-shell
./bin/spark-shell \
  --master=spark://ec2-54-229-155-126.eu-west-1.compute.amazonaws.com:7077 \
  --jars /home/ec2-user/BIDMach/BIDMach.jar,/home/ec2-user/BIDMach/lib/BIDMat.jar,/home/ec2-user/BIDMach/lib/jhdf5.jar,/home/ec2-user/BIDMach/lib/commons-math3-3.2.jar,/home/ec2-user/BIDMach/lib/lz4-1.3.jar,/home/ec2-user/BIDMach/lib/json-io-4.1.6.jar,/home/ec2-user/BIDMach/lib/jcommon-1.0.23.jar,/home/ec2-user/BIDMach/lib/jcuda-0.7.5.jar,/home/ec2-user/BIDMach/lib/jcublas-0.7.5.jar,/home/ec2-user/BIDMach/lib/jcufft-0.7.5.jar,/home/ec2-user/BIDMach/lib/jcurand-0.7.5.jar,/home/ec2-user/BIDMach/lib/jcusparse-0.7.5.jar \
  --driver-library-path="/home/ec2-user/BIDMach/lib" \
  --conf "spark.executor.extraLibraryPath=/home/ec2-user/BIDMach/lib"

Creating the data files "data%02d.fmat.lz4","label%02d.imat.lz4"
val file = sc.textFile("myfile.csv")
val header_line = file.first()
val tail_file = file.filter( _ != header_line)
val allData = tail_file.mapPartitionsWithIndex( upload_lz4_fmat_to_S3 )
allData.collect()
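
The %02d in the two file names is a printf-style placeholder: it is replaced by a block number, zero-padded to two digits. The sketch below only illustrates that naming scheme. It assumes, which the slides do not state, that upload_lz4_fmat_to_S3 writes one data file and one label file per partition and numbers them with the partition index; block_file_names is a hypothetical helper.

```python
data_pattern = "data%02d.fmat.lz4"
label_pattern = "label%02d.imat.lz4"


def block_file_names(partition_index):
    """Data and label file names for one block, with a zero-padded index."""
    return data_pattern % partition_index, label_pattern % partition_index


# Names of the first three blocks under the assumed numbering.
for index in range(3):
    print(block_file_names(index))
```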

Hyperparameter tuning
import BIDMat.{CMat, CSMat, DMat, Dict, FMat, FND, GMat, GDMat, GIMat, GLMat, GSMat, GSDMat, HMat, IDict, Image, IMat, LMat, Mat, SMat, SBMat, SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMach.models.RandomForest
val ndepths = icol(1, 2, 3, 4, 5) // 5 values
val ntrees = icol(5, 10, 20) // 3 values
// Kronecker products repeat each column so that every depth meets every number of trees
val ndepthsparams = iones(ntrees.nrows, 1) ⊗ ndepths
val ntreesparams = ntrees ⊗ iones(ndepths.nrows,1)
// \ concatenates the two columns side by side: one row per (depth, ntrees) pair
val hyperparameters = ndepthsparams \ ntreesparams
val hyperparamSeq = for( i <- Range(0, hyperparameters.nrows) ) yield(hyperparameters(i,?))
val hyperparamRDD = sc.parallelize(hyperparamSeq,2)
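
The grid built above pairs every depth with every number of trees, giving 5 x 3 = 15 rows. A plain-Python sketch of the same grid, assuming ⊗ denotes the Kronecker product so that the depth varies fastest and the number of trees slowest:

```python
ndepths = [1, 2, 3, 4, 5]  # 5 values
ntrees = [5, 10, 20]       # 3 values

# Depth varies fastest and the number of trees slowest, which is the row
# order produced by the two Kronecker products.
hyperparameters = [(depth, trees) for trees in ntrees for depth in ndepths]

print(len(hyperparameters))
for row in hyperparameters[:6]:
    print(row)
```

Each row is then one independent training job, which is what makes the grid easy to distribute over Spark partitions.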

Hyperparameter tuning
hyperparamRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[BIDMat.IMat]) => {
  it.toList.map(x => {
    // get the data
    val (mm,opts) = RandomForest.learner("data%02d.fmat.lz4","label%02d.imat.lz4")
    opts.batchSize = 1000
    opts.nend = 50
    opts.depth = x(0,0)
    opts.ncats = 2
    opts.ntrees = x(0,1)
    opts.impurity = 0
    opts.nsamps = 12
    opts.nnodes = 50000
    opts.nbits = 16
    opts.gain = 0.001f
    mm.train
    index + ": ndepth "+x(0,0) + " & ntrees "+x(0,1)
  } ).iterator
}).collect

Benchmarks
2500 trees
Tool          Settings                        Hardware                  Time    Price
Scikit-learn  Grid 30 param (x6), sqrt(250)   8 CPUs                    4 days  $0.532 per hour
Spark MLlib   sqrt(250)                       100 instances / 400 CPUs  5 min   $26 per hour
BIDMach       max-depth 4                     1 GPU                     2.5 h   $0.65 per hour
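
The run times and hourly prices can be combined into a rough cost per run. The sketch below simply multiplies the two. It assumes that each hourly price covers the whole configuration listed, that "4 days" means 96 hours of continuous run time, and that billing is proportional to run time (no rounding up to full hours). The three runs use different settings, so this is not a like-for-like comparison.

```python
# (tool, run time in hours, price in dollars per hour), as read from the slide.
runs = [
    ("Scikit-learn", 4 * 24.0, 0.532),  # 4 days on 8 CPUs
    ("Spark MLlib", 5 / 60.0, 26.0),    # 5 minutes on 100 instances
    ("BIDMach", 2.5, 0.65),             # 2.5 hours on 1 GPU
]


def run_cost(hours, price_per_hour):
    """Cost of one run, ignoring any billing granularity."""
    return hours * price_per_hour


for tool, hours, price in runs:
    print("%-12s %6.2f h  $%.2f" % (tool, hours, run_cost(hours, price)))
```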

Thank you for your attention!
http://christopher5106.github.io/
