TensorFlow is the most popular machine learning framework nowadays. TensorFlow Lite (TFLite), open sourced in late 2017, is TensorFlow’s runtime designed for mobile devices, especially Android phones. TFLite is getting more and more mature. Among the most interesting components introduced recently are its GPU delegate and new NNAPI delegate. The GPU delegate uses OpenGL ES compute shaders on Android and Metal shaders on iOS. The original NNAPI delegate was an all-or-nothing design: if any op in the compute graph was not supported by NNAPI, the whole graph was not delegated. The new one works per op: when an op in a graph is not supported by NNAPI, that op automatically falls back to the CPU runtime. I’ll give a quick review of TFLite and its interpreter, then walk the audience through example usage of the two delegates and the important parts of their source code.
TFLite NNAPI and GPU Delegates
1. TFLite NNAPI and
GPU Delegates
Koan-Sin Tan
freedom@computer.org
Aug 18th, 2019
COSCUP 2019, Taipei, Taiwan
2. • disclaimer: Opinions Are My Own
• feel free to interrupt me if you have any questions
• questions in English, Taiwanese, and Mandarin are fine
• note that I am gonna skip memory-related code in this talk
because of time constraints. Memory management,
including locality and zero-copy, is always a crucial part of
high-performance computing
3. who i am
• Used open source before the term “open
source” was coined
• A software guy, learned to use Unix and open
source software on VAX-11/780 running 4.3BSD
• Used to be a programming language junkie
• Worked on various system software, e.g., CPU
scheduling and power management of non-
CPU components
• Recently, working on NN performance on edge
devices
• Contributed from time to time to TensorFlow
Lite
• started a command line label_image for
TFLite
https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
4. Delegation
• Delegation: one of the commonly
used mechanisms described in
the GoF book
• presumably, you know this well
already
• if not, dictionary definitions of
“delegate” work
figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
5. So, what is a TFLite
delegate?
• “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another
executor.”
• Why delegates?
• running computation-intensive NN models on mobile devices is demanding for
mobile CPUs; processing power and energy consumption could be problems
• and matrix multiplication, which is the core of convolution and fully connected ops, is
highly parallel
• Thus, some devices have hardware accelerators, such as GPU or DSP, that provide better
performance and higher energy efficiency thru Android NNAPI
• To use NNAPI, TFLite has an NNAPI delegate
• Why I want to share what I know
• used TFLite, contributed some code, e.g., label_image for TFLite
• wrote quick-and-dirty TFLite GPU delegate benchmarks
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
6. What is TFLite
• A lightweight inference engine
• originally for Android and
similar platforms. Extended to
micro-controllers (e.g., ARM
Cortex-M series)
• Interpreter-based (what other
choices do they have?)
• ops are organized as a
directed acyclic graph (DAG)
• execute / interpret ops one by
one if no delegates are involved
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
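The per-op execution above can be pictured with a toy sketch. `Op` and `RunGraph` are my own illustrative names, not TFLite’s API; the real loop lives in Subgraph::Invoke() linked above.

```cpp
#include <functional>
#include <vector>

// Toy sketch of the interpreter loop (illustrative names, not TFLite's
// API): ops sit in a topologically ordered execution plan and, with no
// delegate involved, are invoked one by one.
struct Op {
  std::function<int(int)> invoke;  // toy kernel: one int in, one int out
};

int RunGraph(const std::vector<Op>& execution_plan, int input) {
  int activation = input;
  for (const Op& op : execution_plan)
    activation = op.invoke(activation);  // interpret ops one by one
  return activation;
}
```

In the real Subgraph::Invoke(), each step looks up the node’s TfLiteRegistration and calls its invoke() with the TfLiteContext and TfLiteNode.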
7. TfLiteContext
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors
• TfLiteNode: a single node or
operation
• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485
ResizeTensor()
ReportError()
AddTensors()
GetNodeAndRegistration()
ReplaceNodeSubsetsWithDelegateKernels()
GetExternalContext()
SetExternalContext()
…
tensors_size
tensors
impl_
recommended_num_threads
allow_fp32_relax_to_fp16
profiler
…
TfLiteContext
8. TfLiteNode
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors
• TfLiteNode: a single node or
operation
• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409
inputs
outputs
intermediates
temporaries
user_data
builtin_data
custom_initial_data
custom_initial_data_size
delegate
…
TfLiteNode
9. TfLiteRegistration
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors
• TfLiteNode: a single node or
operation
• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544
init()
free()
prepare()
invoke()
profiling_string()
…
builtin_code
custom_name
version
…
TfLiteRegistration
10. To know more
• Reading [1][2] and creating a custom op will help
you understand TfLiteRegistration, TfLiteNode, and
TfLiteContext more deeply
[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/
inference.md#write-a-custom-operator
[2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/
ops_custom.md
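As a warm-up for [1][2], here is a hedged mock of the TfLiteRegistration life cycle. All type and function names below are mine, not the real headers: init() builds per-node state, prepare()/invoke() use it, free() releases it.

```cpp
#include <cstddef>

// Mock of TfLiteRegistration's function-pointer layout (my own types,
// not the real c_api_internal.h). A custom op fills in these slots.
struct MockContext { /* reporting facilities, tensor accessors, ... */ };
struct MockNode { void* user_data = nullptr; };

struct MockRegistration {
  void* (*init)(MockContext*, const char* buffer, std::size_t length);
  void (*free_fn)(MockContext*, void* buffer);
  int (*prepare)(MockContext*, MockNode*);  // 0 stands in for kTfLiteOk
  int (*invoke)(MockContext*, MockNode*);
};

// A trivial "custom op": init() allocates per-node state, invoke() reads it.
void* TimesTwoInit(MockContext*, const char*, std::size_t) { return new int(2); }
void TimesTwoFree(MockContext*, void* buf) { delete static_cast<int*>(buf); }
int TimesTwoPrepare(MockContext*, MockNode*) { return 0; }
int TimesTwoInvoke(MockContext*, MockNode* node) {
  return 21 * *static_cast<int*>(node->user_data);  // toy computation
}

MockRegistration RegisterTimesTwo() {
  return {TimesTwoInit, TimesTwoFree, TimesTwoPrepare, TimesTwoInvoke};
}
```

The real registration additionally carries builtin_code, custom_name, and version, as shown in the TfLiteRegistration slide.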
11. TfLiteDelegate: the
interface
• In case you didn’t notice it
yet, TFLite is mainly written in
C++
• C API for FFI from other
high level languages
• I hacked a Smalltalk one
• many classes are structs with
no member functions so that they
can easily be used from the C API
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
…
data_
flags
…
TfLiteDelegate
12. How do TFLite delegates
work?
• Let's say we have a simple model graph such as the following:
• Let's assume that there is a delegate "MyDelegate," which has a faster
implementation for Conv2D and Mean operations. The resulting main graph
will be updated to look like below.
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
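The rewrite can be sketched in plain C++. This is illustrative only; `ApplyDelegate` and the string-based plan are my own simplification of what the runtime does with real node structures:

```cpp
#include <string>
#include <vector>

// Sketch of the graph rewrite a delegate triggers (illustrative, not
// TFLite internals): each maximal run of delegate-supported ops is
// collapsed into a single "MyDelegate" node; the rest stay as-is.
std::vector<std::string> ApplyDelegate(
    const std::vector<std::string>& plan,
    bool (*supported)(const std::string&)) {
  std::vector<std::string> out;
  for (const std::string& op : plan) {
    if (supported(op)) {
      if (out.empty() || out.back() != "MyDelegate")
        out.push_back("MyDelegate");  // start a new delegated partition
      // else: fold into the current delegated partition
    } else {
      out.push_back(op);
    }
  }
  return out;
}
```

With a "MyDelegate" that supports Conv2D and Mean, a plan like [Conv2D, Conv2D, Mean, Cast, Conv2D] becomes [MyDelegate, Cast, MyDelegate].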
14. delegates in TFLite
• NNAPI delegate
• mainly for Android
• GPU delegate: NNAPI, which was introduced in Android O MR1 (late 2017), is not
popular (yet)
• GL ES Compute shader on Android
• Metal shader on iOS
• FlexDelegate: uses eager mode to run some ops
• useful when not all ops are supported by TFLite or accelerators (thru something
like NNAPI or GPU delegate)
• not in TensorFlow repo: EdgeTPU delegate
15. NNAPI-enabled devices ~ 25.8% around May 7, 2019
https://developer.android.com/about/dashboards
17. Android NN API
• Announced/published with Android 8.1
Preview 1
• Available to developers in the NDK
• yes, NDK
• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices
• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks
• The API is available on all devices running
Android 8.1 (API level 27) or higher
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
18. So, what a delegate is
supposed to implement
• Understanding how to
add a delegate helps
• define a kernel node,
which means to
implement
TfLiteRegistration
• create an instance of
TfLiteDelegate, then
register the kernel node in
Prepare()
typedef struct TfLiteDelegate {
void* data_;
TfLiteStatus (*Prepare)(TfLiteContext* context,
struct TfLiteDelegate* delegate);
TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context,
struct TfLiteDelegate* delegate,
TfLiteBufferHandle buffer_handle,
TfLiteTensor* tensor);
TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context,
struct TfLiteDelegate* delegate,
TfLiteBufferHandle buffer_handle,
TfLiteTensor* tensor);
void (*FreeBufferHandle)(TfLiteContext* context,
struct TfLiteDelegate* delegate,
TfLiteBufferHandle* handle);
int64_t flags;
} TfLiteDelegate;
typedef struct _TfLiteRegistration {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
  const char* (*profiling_string)(const TfLiteContext* context, const TfLiteNode* node);
  int32_t builtin_code;
  const char* custom_name;
  int version;
} TfLiteRegistration;
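A hedged sketch of that two-step recipe, with mock types standing in for the real structs above (nothing here is the actual TFLite API):

```cpp
// Hedged sketch of the two-step recipe (mock types, not TFLite's): a
// kernel registration plus a delegate whose Prepare() hands the kernel
// back to the runtime for the nodes it claims.
struct MockReg { int (*invoke)(); };
struct MockRuntime { MockReg registered{nullptr}; };
struct MockDelegate {
  void* data_ = nullptr;
  int (*Prepare)(MockRuntime*, MockDelegate*) = nullptr;
};

int FastKernel() { return 42; }  // stands in for the accelerated kernel

int MyDelegatePrepare(MockRuntime* runtime, MockDelegate*) {
  // The real Prepare() would pick the supported nodes and call
  // context->ReplaceNodeSubsetsWithDelegateKernels(...) with its
  // TfLiteRegistration; here we just "register" the kernel.
  runtime->registered = MockReg{FastKernel};
  return 0;  // kTfLiteOk
}
```

The runtime then routes every node in a delegated partition through the registered kernel instead of the builtin op implementations.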
19. NNAPI delegate
• C++ code instead of C-style
code
• derived from TfLiteDelegate
• Some private data
structures
• extra member functions
corresponding to private
data structures
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/
nnapi_delegate.h#L29-L161
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
…
data_
flags
…
TfLiteDelegate
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
GetOptions()
RegisterNnapiMemory()
GetTensorMemoryMap()
…
data_
flags
accelerator_name
(options)
(memory_registration)
…
StatefulNnApiDelegate
20. data
• execution_preference
• power/perf tradeoff: not
widely supported as far as I
can tell
• accelerator_name: e.g.,
“fallback” and “hvx”
• cache_dir
• model_token
• tensor_memory_map:
MemoryRegistration
struct Data {
// Preferred Power/perf trade-off.
Options::ExecutionPreference execution_preference;
// Selected NNAPI accelerator name.
std::string accelerator_name;
// The cache dir for NNAPI model.
std::string cache_dir;
// The unique token string for NNAPI model.
std::string model_token;
// Tensor to ANeuralNetworksMemory mapping.
std::vector<MemoryRegistration> tensor_memory_map;
};
// Encapsulates all fields related to memory registration for internal
// bookkeeping only.
struct MemoryRegistration {
ANeuralNetworksMemory* memory;
CopyToHostTensorFnPtr callback;
void* callback_context;
};
22. Init() of NNAPI Delegate
Kernel
• mainly for NNAPI initialization:
ANeuralNetworksCompilation_*()
• and building the graph
• if NNAPI >= 1.2, check that
there is a “real” NNAPI device
• one interesting conversion is
INT8 -> UINT8
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
23. INT8 —> UINT8 conversion
• Original TFLite and NNAPI use asymmetric UINT8 quantization
• the asymmetric one provides more flexibility, but symmetric INT8 is usually more
hardware friendly
• more and more INT8 code in TFLite
• NNAPI doesn’t change as fast as TFLite, so conversion is needed
• See the quantization paper for TFLite [1] and MLIR’s quantization doc [2]
[1] Jacob, B et al., ”Quantization and Training of Neural Networks for Efficient Integer-
Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877
[2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
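The conversion itself is mechanical: real = scale × (q − zero_point), so shifting both the stored values and the zero point by 128 maps symmetric INT8 (zero point 0) onto asymmetric UINT8 with zero point 128 while representing exactly the same real numbers. A small sketch (my own helper types, not TFLite’s converter):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the INT8 -> UINT8 requantization idea (not TFLite's code):
// real = scale * (q - zero_point). Adding 128 to both the value and the
// zero point leaves `real` unchanged, so a symmetric int8 tensor
// (zero_point == 0) becomes an asymmetric uint8 tensor with zero_point 128.
struct QuantizedTensor {
  std::vector<int32_t> values;  // wide type to keep the sketch simple
  float scale;
  int32_t zero_point;
};

QuantizedTensor Int8ToUint8(const QuantizedTensor& t) {
  QuantizedTensor out{t.values, t.scale, t.zero_point + 128};
  for (int32_t& v : out.values) v += 128;  // shift [-128, 127] into [0, 255]
  return out;
}

float Dequantize(const QuantizedTensor& t, std::size_t i) {
  return t.scale * static_cast<float>(t.values[i] - t.zero_point);
}
```

Dequantizing either representation yields the same real value, which is why the delegate can do this conversion transparently.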
24. Invoke() of NNAPI Delegate
Kernel
• mainly memory management
and
ANeuralNetworksExecution*()
• To dig deeper we have to go
thru more TFLite and NNAPI
data structures
• asking NNAPI to do the work
is quite trivial when everything
is well prepared
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
25. DoPrepare
• for NNAPI >= 1.2 (Android Q and
later), if there is no real accelerator,
i.e., only the NNAPI CPU fallback,
computation is not
offloaded
• Check for every node to see if it is
supported
• NN API Delegate Registration:
previous pages
• Request TFLite to partition the
graph and make kernels for each
independent node subset a new
nnapi_delegate_kernel
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
26. partition graph
• at the end of DoPrepare(),
ReplaceNodeSubsetsWithDelegateKernels()
is called
• DoPrepare() →
Subgraph::ReplaceNodeSubsetsWithDelegateKernels() →
tflite::PartitionGraphIntoIndependentNodeSubsets() →
tflite::Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L298-L363
27. tflite::Partition() does most of the
partitioning work
• part of Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
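Ignoring the DAG dependency analysis that Partition() really does, the core idea can be sketched on a linear execution plan: split it into maximal runs of nodes the delegate does or doesn’t support (helper names are mine, not from graph_info.cc):

```cpp
#include <vector>

// Rough sketch of what tflite::Partition() produces (greatly simplified:
// a linear plan instead of a DAG): split the execution plan into maximal
// runs of nodes with the same "delegate can handle this" answer. Each
// supported run then becomes one nnapi_delegate_kernel.
struct Partition {
  bool delegated;
  std::vector<int> nodes;  // node indices in execution order
};

std::vector<Partition> PartitionPlan(const std::vector<bool>& supported) {
  std::vector<Partition> parts;
  for (int i = 0; i < static_cast<int>(supported.size()); ++i) {
    if (parts.empty() || parts.back().delegated != supported[i])
      parts.push_back({static_cast<bool>(supported[i]), {}});
    parts.back().nodes.push_back(i);
  }
  return parts;
}
```

The real implementation also has to respect tensor dependencies between nodes, which is why partitions are "independent node subsets" rather than simple runs.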
30. GPU OpenGL Delegate
TfLiteRegistration
• TfLiteRegistration in
DelegatePrepare()
• init()
• no free()
• prepare() is quite simple
• invoke(): simply calls node->Invoke()
• context->
ReplaceNodeSubsetsWithDelegateKernels()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
31. GPU Metal Delegate
• TfLiteDelegate
• Prepare: yup, just Prepare()
• class Delegate, which is quite
large
• NewGpuDelegate()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
32. GPU delegate kernels
• GPU backends require initialization
involving shader compilation and
optimization by the driver before
inference
• PHWC4: P stands for plane
• Reshape is expensive on GPU
• RGBA is better than RGB on GPU
• a tensor of shape [B,H,W,5], for
instance, is twice as expensive as [B,H,
W,4], but about the same as [B,H,W,8],
so the architect can tune around
those 4-channel boundaries rather than
trying to optimize on other boundaries
https://arxiv.org/pdf/1907.01989.pdf
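The 4-channel economics are easy to reproduce: PHWC4 pads the channel dimension up to a multiple of 4 (one RGBA texel per 4 channels), so [B,H,W,5] stores as much data as [B,H,W,8] and twice as much as [B,H,W,4]. A tiny sketch (my own helper, not from the TFLite source):

```cpp
#include <cstdint>

// PHWC4 pads channels up to a multiple of 4 (one RGBA texel per group of
// 4 channels), so memory and shader work scale with ceil(C / 4).
int64_t Phwc4Elements(int64_t b, int64_t h, int64_t w, int64_t c) {
  const int64_t padded_c = ((c + 3) / 4) * 4;  // round C up to multiple of 4
  return b * h * w * padded_c;
}
```

This is why the paper recommends designing models around 4-channel boundaries when targeting the GPU delegate.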
33. Flex Delegate
• Another delegate is the
one that provides a
selected set of ops in
eager mode
• It’s much easier to check
what it does
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
34. Edge TPU’s canned model
• supported ops are packed into
a single op for the Edge TPU
The compiler creates a single custom op for all Edge TPU
compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
[figure: MobileNet V1 — input 1×224×224×3 → edgetpu-custom-op → Softmax → 1×1001]
[figure: SSD MobileNet V1 — normalized_input_image_tensor 1×300×300×3 → edgetpu-custom-op → TFLite_Detection_PostProcess → four outputs (1×10×4 boxes, 1×10 classes, 1×10 scores, 1 count)]
35. Edge TPU C++ API
https://coral.withgoogle.com/docs/edgetpu/api-intro/
36. EdgeTPU Delegate
• There is a dynamic delegate plugin interface. Currently it’s
only used by the Edge TPU delegate
https://coral.withgoogle.com/docs/edgetpu/api-intro/
37. There are still many trivial bugs in
TensorFlow
• There are many typos in the comments in TensorFlow code
• Many things are not well-documented
• There are many warnings when building TensorFlow from source
code
• a trivial fix in May, 2019 by me
https://github.com/tensorflow/tensorflow/pull/28618