TensorFlowOnSpark (TFoS) was open sourced in Q1 2017, and it has gained strong adoption within the Spark community for running TensorFlow training and inferencing jobs on Spark clusters. At Spark Summit 2017, we explained how TFoS enables Python applications to conduct distributed TensorFlow training and inference efficiently by leveraging key built-in capabilities of PySpark and TensorFlow.
In this talk, we cover the major enhancements of TFoS in recent months. We will introduce a new Scala API for users who want to integrate previously trained models into an existing Scala/Spark workflow. We will describe a new Python API for Spark ML pipelines to train all types of TensorFlow models, and conduct inference/featurization without any custom code. Additionally, we will cover the support for TensorFlow Keras API, and TensorFlow Datasets.
2. Review
• Distributed TensorFlow on Spark clusters
• Access to datasets on existing Spark infrastructure
• Open-sourced Feb 2017
2#DLSAIS16
3. Goals
• Run distributed TF with minimal code changes
• Support all TF functionality
• Integrate with existing data pipelines
• Lightweight with minimal dependencies
8. Spark ML Pipelines API
• Training & Inferencing APIs
– TFEstimator
– TFModel
• Helper APIs
– TFNode.export_saved_model()
– dfutil
• Developed in collaboration with Databricks*
* Thanks: Tim Hunter, Sue Ann Hong, Philip Yang
12. dfutil
# Load TFRecords as a Spark DataFrame
# Note: String and Binary features are both stored as `bytes_list`
df = dfutil.loadTFRecords(sc, args.train_data,
                          binary_features=['image/encoded'])
# Save Spark DataFrame as TFRecords
dfutil.saveAsTFRecords(df, args.tfrecord_dir)
18. tf.estimator.train_and_evaluate()
“This utility function provides consistent behavior for both local (non-distributed) and distributed configurations. Currently, the only supported distributed training configuration is between-graph replication.”
https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate
19. tf.estimator.train_and_evaluate()
• Eliminates “distributed” boilerplate code:
– tf.train.Server
– if PS else worker
– tf.train.Supervisor/tf.train.MonitoredTrainingSession
– sess.run()
• Uses TF_CONFIG environment variable for cluster_spec
• Requires a “master” node in addition to “ps” and “worker”
• TFCluster.run(sc, map_fun, …, master_node='master')
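The cluster_spec that `tf.estimator.train_and_evaluate()` reads arrives via the TF_CONFIG environment variable; the sketch below shows its layout for the three-role configuration above. The host:port values are hypothetical, and in TFoS this variable is assembled on each executor by `TFCluster.run(..., master_node='master')` rather than by hand.

```python
# Sketch of the TF_CONFIG layout consumed by tf.estimator.train_and_evaluate()
# for its cluster_spec. Hosts and ports are illustrative placeholders.
import json
import os

tf_config = {
    "cluster": {
        "master": ["host0:2222"],            # the required "master" node
        "ps":     ["host1:2222"],            # parameter server(s)
        "worker": ["host2:2222", "host3:2222"],
    },
    "task": {"type": "worker", "index": 0},  # this process's role in the cluster
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```

Each process in the cluster gets the same `cluster` block but its own `task` entry, which is how the boilerplate-free Estimator code figures out whether it is the master, a ps, or a worker.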
22. Failure Recovery
• TF exceptions in InputMode.SPARK*
• Reservation timeout*
• GPU allocation hang
• Cluster size mismatch
• But, still no dynamic clusters in TF
* Thanks: Erik Ordentlich
23. Failure Recovery Goals
• Surface errors and let Spark attempt recovery
– Task: data partition feed or TF worker
– Job: dataset feed or TFCluster
• Let higher-level orchestration try to recover the Spark application, e.g. restart training from the last checkpoint
24. What Else is New?
• tf.data.Dataset Example
• Keras Example
26. Keras Example
Then:
• K.set_session(sess)
• model.fit() OR model.fit_generator()
• HDFS checkpoints via LambdaCallback
Now:
• tf.keras.estimator.model_to_estimator()
• tf.data.Datasets AND input_fn
• tf.estimator.train_and_evaluate()
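The “Now” path above can be sketched as follows, assuming a TensorFlow release that still ships `tf.estimator` (removed in TF 2.16); the model architecture, feature shapes, and random data are illustrative only.

```python
# Sketch: convert a compiled tf.keras model to an Estimator, then pair it
# with a tf.data input_fn for tf.estimator.train_and_evaluate().
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,)),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# tf.keras.estimator.model_to_estimator() wraps the Keras model.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)

def input_fn():
    # tf.data pipeline feeding the estimator; random data for illustration.
    x = np.random.rand(32, 784).astype(np.float32)
    y = np.random.randint(0, 10, size=(32,)).astype(np.int32)
    # Keras-converted estimators key features by the input layer's name.
    name = model.input_names[0]
    return tf.data.Dataset.from_tensor_slices(({name: x}, y)).batch(8)

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=4)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=1)
# tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

The final call is commented out here since it would start an actual training run; under TFoS it executes per-executor with TF_CONFIG set by `TFCluster.run()`.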
28. Summary
• TensorFlowOnSpark continues to evolve to support additional use cases and APIs.
• Can often leverage new TensorFlow APIs without any changes.
• Runs distributed TensorFlow applications (training and inferencing) on your existing Spark infrastructure.