TensorFlowOnSpark (TFoS) was open sourced in Q1 2017, and it has gained strong adoption within the Spark community for running TensorFlow training and inferencing jobs on Spark clusters. At Spark Summit 2017, we explained how TFoS enables Python applications to conduct distributed TensorFlow training and inference efficiently by leveraging key built-in capabilities of PySpark and TensorFlow.
In this talk, we cover the major enhancements of TFoS in recent months. We will introduce a new Scala API for users who want to integrate previously trained models into an existing Scala/Spark workflow. We will describe a new Python API for Spark ML pipelines to train all types of TensorFlow models, and conduct inference/featurization without any custom code. Additionally, we will cover the support for TensorFlow Keras API, and TensorFlow Datasets.
2. Review
• Distributed TensorFlow on Spark clusters
• Access to datasets on existing Spark infrastructure
• Open-sourced Feb 2017
2#DLSAIS16
3. Goals
• Run distributed TF with minimal code changes
• Support all TF functionality
• Integrate with existing data pipelines
• Lightweight with minimal dependencies
8. Spark ML Pipelines API
• Training & Inferencing APIs
– TFEstimator
– TFModel
• Helper APIs
– TFNode.export_saved_model()
– dfutil
• Developed in collaboration with Databricks*
* Thanks: Tim Hunter, Sue Ann Hong, Philip Yang
12. dfutil
# Load TFRecords as a Spark DataFrame
# Note: String and Binary features are both stored as `bytes_list`
df = dfutil.loadTFRecords(sc, args.train_data,
                          binary_features=['image/encoded'])
# Save Spark DataFrame as TFRecords
dfutil.saveAsTFRecords(df, args.tfrecord_dir)
18. tf.estimator.train_and_evaluate()
“This utility function provides consistent behavior for both local (non-distributed) and distributed configurations. Currently, the only supported distributed training configuration is between-graph replication.”
https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate
19. tf.estimator.train_and_evaluate()
• Eliminates “distributed” boilerplate code:
– tf.train.Server
– if PS else worker
– tf.train.Supervisor/tf.train.MonitoredTrainingSession
– sess.run()
• Uses TF_CONFIG environment variable for cluster_spec
• Requires a “master” node in addition to “ps” and “worker”
• TFCluster.run(sc, map_fun, …, master_node='master')
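The cluster_spec that `tf.estimator.train_and_evaluate()` reads arrives via the TF_CONFIG environment variable; the sketch below shows its layout for the three-role configuration above. The host:port values are hypothetical, and in TFoS this variable is assembled on each executor by `TFCluster.run(..., master_node='master')` rather than by hand.

```python
# Sketch of the TF_CONFIG layout consumed by tf.estimator.train_and_evaluate()
# for its cluster_spec. Hosts and ports are illustrative placeholders.
import json
import os

tf_config = {
    "cluster": {
        "master": ["host0:2222"],            # the required "master" node
        "ps":     ["host1:2222"],            # parameter server(s)
        "worker": ["host2:2222", "host3:2222"],
    },
    "task": {"type": "worker", "index": 0},  # this process's role in the cluster
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```

Each process in the cluster gets the same `cluster` block but its own `task` entry, which is how the boilerplate-free Estimator code figures out whether it is the master, a ps, or a worker.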
22. Failure Recovery
• TF exceptions in InputMode.SPARK*
• Reservation timeout*
• GPU allocation hang
• Cluster size mismatch
• But, still no dynamic clusters in TF
* Thanks: Erik Ordentlich
23. Failure Recovery Goals
• Surface errors and let Spark attempt recovery
– Task: data partition feed or TF worker
– Job: dataset feed or TFCluster
• Let higher-level orchestration try to recover the Spark application, e.g. restart training from the last checkpoint
24. What Else is New?
• tf.data.Dataset Example
• Keras Example
26. Keras Example
Then:
• K.set_session(sess)
• model.fit() OR model.fit_generator()
• HDFS checkpoints via LambdaCallback
Now:
• tf.keras.estimator.model_to_estimator()
• tf.data.Datasets AND input_fn
• tf.estimator.train_and_evaluate()
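The “Now” path above can be sketched as follows, assuming a TensorFlow release that still ships `tf.estimator` (removed in TF 2.16); the model architecture, feature shapes, and random data are illustrative only.

```python
# Sketch: convert a compiled tf.keras model to an Estimator, then pair it
# with a tf.data input_fn for tf.estimator.train_and_evaluate().
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,)),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# tf.keras.estimator.model_to_estimator() wraps the Keras model.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)

def input_fn():
    # tf.data pipeline feeding the estimator; random data for illustration.
    x = np.random.rand(32, 784).astype(np.float32)
    y = np.random.randint(0, 10, size=(32,)).astype(np.int32)
    # Keras-converted estimators key features by the input layer's name.
    name = model.input_names[0]
    return tf.data.Dataset.from_tensor_slices(({name: x}, y)).batch(8)

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=4)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=1)
# tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

The final call is commented out here since it would start an actual training run; under TFoS it executes per-executor with TF_CONFIG set by `TFCluster.run()`.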
28. Summary
• TensorFlowOnSpark continues to evolve to support additional use cases and APIs.
• Can often leverage new TensorFlow APIs without any changes.
• Runs distributed TensorFlow applications (training and inferencing) on your existing Spark infrastructure.