TensorFlow is the most popular machine learning framework nowadays. TensorFlow Lite (TFLite), open sourced in late 2017, is TensorFlow’s runtime designed for mobile devices, especially Android phones. TFLite is getting more and more mature. Among the most interesting components introduced recently are its GPU delegate and new NNAPI delegate. The GPU delegate uses OpenGL ES compute shaders on Android and Metal shaders on iOS. The original NNAPI delegate was an all-or-nothing design: if any op in the compute graph was not supported by NNAPI, the whole graph was not delegated. The new one works per op: when an op in a graph is not supported by NNAPI, that op automatically falls back to the CPU runtime. I’ll give a quick review of TFLite and its interpreter, then walk the audience through example usage of the two delegates and the important parts of their source code.
TFLite NNAPI and GPU Delegates
1. TFLite NNAPI and
GPU Delegates
Koan-Sin Tan
freedom@computer.org
Aug 18th, 2019
COSCUP 2019, Taipei, Taiwan
2. • disclaimer: Opinions Are My Own
• feel free to interrupt me if you have any questions
• questions in English, Taiwanese, and Mandarin are fine
• note that I am gonna skip memory-related code in this talk
because of time constraints. Memory management,
including locality and zero-copy, is always a crucial part of
high-performance computing
3. who i am
• Used open source before the term “open
source” was coined
• A software guy, learned to use Unix and open
source software on VAX-11/780 running 4.3BSD
• Used to be a programming language junkie
• Worked on various system software, e.g., CPU
scheduling and power management of non-
CPU components
• Recently, working on NN performance on edge
devices
• Contributed from time to time to TensorFlow
Lite
• started a command line label_image for
TFLite
https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
4. Delegation
• Delegation: one of the commonly
used mechanisms described in
the GoF book
• presumably, you know this well
already
• if not, dictionary definitions of
“delegate” work
figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
5. So, what is a TFLite
delegate?
• “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another
executor.”
• Why delegates?
• running computation-intensive NN models on mobile devices is demanding for
mobile CPUs; processing power and energy consumption could be problems
• and matrix multiplication, which is the core of convolution and fully connected ops, is
highly parallel
• Thus, some devices have hardware accelerators, such as GPU or DSP, that provide better
performance and higher energy efficiency thru Android NNAPI
• To use NNAPI, TFLite has an NNAPI delegate
• Why I want to share what I know
• used TFLite, contributed some code, e.g., label_image for TFLite
• wrote quick-and-dirty TFLite GPU delegate benchmarks
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
6. What is TFLite
• A lightweight inference engine
• originally for Android and
similar platforms. Extended to
micro-controllers (e.g., ARM
Cortex-M series)
• Interpreter-based (what other
choices do they have?)
• ops are organized as a
directed acyclic graph (DAG)
• execute / interpret ops one by
one if no delegates are involved
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
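The per-op execution above can be pictured with a toy sketch. `Op` and `RunGraph` are my own illustrative names, not TFLite’s API; the real loop lives in Subgraph::Invoke() linked above.

```cpp
#include <functional>
#include <vector>

// Toy sketch of the interpreter loop (illustrative names, not TFLite's
// API): ops sit in a topologically ordered execution plan and, with no
// delegate involved, are invoked one by one.
struct Op {
  std::function<int(int)> invoke;  // toy kernel: one int in, one int out
};

int RunGraph(const std::vector<Op>& execution_plan, int input) {
  int activation = input;
  for (const Op& op : execution_plan)
    activation = op.invoke(activation);  // interpret ops one by one
  return activation;
}
```

In the real Subgraph::Invoke(), each step looks up the node’s TfLiteRegistration and calls its invoke() with the TfLiteContext and TfLiteNode.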
7. TfLiteContext
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors
• TfLiteNode: a single node or
operation
• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485
ResizeTensor()
ReportError()
AddTensors()
GetNodeAndRegistration()
ReplaceNodeSubsetsWithDelegateKernels()
GetExternalContext()
SetExternalContext()
…
tensors_size
tensors
impl_
recommended_num_threads
allow_fp32_relax_to_fp16
profiler
…
TfLiteContext
8. TfLiteNode
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors
• TfLiteNode: a single node or
operation
• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409
inputs
outputs
intermediates
temporaries
user_data
builtin_data
custom_initial_data
custom_initial_data_size
delegate
…
TfLiteNode
9. TfLiteRegistration
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors
• TfLiteNode: a single node or
operation
• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544
init()
free()
prepare()
invoke()
profiling_string()
…
builtin_code
custom_name
version
…
TfLiteRegistration
10. To know more
• Reading [1][2] and creating a custom op will help
you understand TfLiteRegistration, TfLiteNode, and
TfLiteContext more deeply
[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/
inference.md#write-a-custom-operator
[2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/
ops_custom.md
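As a warm-up for [1][2], here is a hedged mock of the TfLiteRegistration life cycle. All type and function names below are mine, not the real headers: init() builds per-node state, prepare()/invoke() use it, free() releases it.

```cpp
#include <cstddef>

// Mock of TfLiteRegistration's function-pointer layout (my own types,
// not the real c_api_internal.h). A custom op fills in these slots.
struct MockContext { /* reporting facilities, tensor accessors, ... */ };
struct MockNode { void* user_data = nullptr; };

struct MockRegistration {
  void* (*init)(MockContext*, const char* buffer, std::size_t length);
  void (*free_fn)(MockContext*, void* buffer);
  int (*prepare)(MockContext*, MockNode*);  // 0 stands in for kTfLiteOk
  int (*invoke)(MockContext*, MockNode*);
};

// A trivial "custom op": init() allocates per-node state, invoke() reads it.
void* TimesTwoInit(MockContext*, const char*, std::size_t) { return new int(2); }
void TimesTwoFree(MockContext*, void* buf) { delete static_cast<int*>(buf); }
int TimesTwoPrepare(MockContext*, MockNode*) { return 0; }
int TimesTwoInvoke(MockContext*, MockNode* node) {
  return 21 * *static_cast<int*>(node->user_data);  // toy computation
}

MockRegistration RegisterTimesTwo() {
  return {TimesTwoInit, TimesTwoFree, TimesTwoPrepare, TimesTwoInvoke};
}
```

The real registration additionally carries builtin_code, custom_name, and version, as shown in the TfLiteRegistration slide.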
11. TfLiteDelegate: the
interface
• In case you didn’t notice it
yet, TFLite is mainly written in
C++
• C API for FFI from other
high level languages
• I hacked a Smalltalk one
• many classes are structs with
no member functions so that they
can easily be used from the C API
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
…
data_
flags
…
TfLiteDelegate
12. How do TFLite delegates
work?
• Let's say we have a simple model graph such as the following:
• Let's assume that there is a delegate "MyDelegate," which has a faster
implementation for Conv2D and Mean operations. The resulting main graph
will be updated to look like below.
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
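The rewrite can be sketched in plain C++. This is illustrative only; `ApplyDelegate` and the string-based plan are my own simplification of what the runtime does with real node structures:

```cpp
#include <string>
#include <vector>

// Sketch of the graph rewrite a delegate triggers (illustrative, not
// TFLite internals): each maximal run of delegate-supported ops is
// collapsed into a single "MyDelegate" node; the rest stay as-is.
std::vector<std::string> ApplyDelegate(
    const std::vector<std::string>& plan,
    bool (*supported)(const std::string&)) {
  std::vector<std::string> out;
  for (const std::string& op : plan) {
    if (supported(op)) {
      if (out.empty() || out.back() != "MyDelegate")
        out.push_back("MyDelegate");  // start a new delegated partition
      // else: fold into the current delegated partition
    } else {
      out.push_back(op);
    }
  }
  return out;
}
```

With a "MyDelegate" that supports Conv2D and Mean, a plan like [Conv2D, Conv2D, Mean, Cast, Conv2D] becomes [MyDelegate, Cast, MyDelegate].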
14. delegates in TFLite
• NNAPI delegate
• mainly for Android
• GPU delegate: NNAPI, which was introduced in Android O MR1 (late 2017), is not
popular (yet)
• GL ES Compute shader on Android
• Metal shader on iOS
• FlexDelegate: uses eager mode to run some ops
• useful when not all ops are supported by TFLite or accelerators (thru something
like NNAPI or GPU delegate)
• not in TensorFlow repo: EdgeTPU delegate
15. NNAPI-enabled devices ~ 25.8% around May 7, 2019
https://developer.android.com/about/dashboards
17. Android NN API
• Announced/published with Android 8.1
Preview 1
• Available to developers in the NDK
• yes, NDK
• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices
• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks
• The API is available on all devices running
Android 8.1 (API level 27) or higher
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
18. So, what a delegate is
supposed to implement
• Understanding how to
add a delegate helps
• define a kernel node,
which means to
implement
TfLiteRegistration
• create an instance of
TfLiteDelegate, then
register the kernel node in
Prepare()
typedef struct TfLiteDelegate {
void* data_;
TfLiteStatus (*Prepare)(TfLiteContext* context,
struct TfLiteDelegate* delegate);
TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context,
struct TfLiteDelegate* delegate,
TfLiteBufferHandle buffer_handle,
TfLiteTensor* tensor);
TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context,
struct TfLiteDelegate* delegate,
TfLiteBufferHandle buffer_handle,
TfLiteTensor* tensor);
void (*FreeBufferHandle)(TfLiteContext* context,
struct TfLiteDelegate* delegate,
TfLiteBufferHandle* handle);
int64_t flags;
} TfLiteDelegate;
typedef struct _TfLiteRegistration {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
  const char* (*profiling_string)(const TfLiteContext* context, const TfLiteNode* node);
  int32_t builtin_code;
  const char* custom_name;
  int version;
} TfLiteRegistration;
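A hedged sketch of that two-step recipe, with mock types standing in for the real structs above (nothing here is the actual TFLite API):

```cpp
// Hedged sketch of the two-step recipe (mock types, not TFLite's): a
// kernel registration plus a delegate whose Prepare() hands the kernel
// back to the runtime for the nodes it claims.
struct MockReg { int (*invoke)(); };
struct MockRuntime { MockReg registered{nullptr}; };
struct MockDelegate {
  void* data_ = nullptr;
  int (*Prepare)(MockRuntime*, MockDelegate*) = nullptr;
};

int FastKernel() { return 42; }  // stands in for the accelerated kernel

int MyDelegatePrepare(MockRuntime* runtime, MockDelegate*) {
  // The real Prepare() would pick the supported nodes and call
  // context->ReplaceNodeSubsetsWithDelegateKernels(...) with its
  // TfLiteRegistration; here we just "register" the kernel.
  runtime->registered = MockReg{FastKernel};
  return 0;  // kTfLiteOk
}
```

The runtime then routes every node in a delegated partition through the registered kernel instead of the builtin op implementations.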
19. NNAPI delegate
• C++ code instead of C-style
code
• derived from TfLiteDelegate
• Some private data
structures
• extra member functions
corresponding to private
data structures
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/
nnapi_delegate.h#L29-L161
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
…
data_
flags
…
TfLiteDelegate
Prepare()
CopyFromBufferHandle()
CopyToBufferHandle()
FreeBufferHandle()
GetOptions()
RegisterNnapiMemory()
GetTensorMemoryMap()
…
data_
flags
accelerator_name
(options)
(memory_registration)
…
StatefulNnApiDelegate
20. data
• execution_preference
• power/perf tradeoff: not
widely supported as far as I
can tell
• accelerator_name: e.g.,
“fallback” and “hvx”
• cache_dir
• model_token
• tensor_memory_map:
MemoryRegistration
struct Data {
// Preferred Power/perf trade-off.
Options::ExecutionPreference execution_preference;
// Selected NNAPI accelerator name.
std::string accelerator_name;
// The cache dir for NNAPI model.
std::string cache_dir;
// The unique token string for NNAPI model.
std::string model_token;
// Tensor to ANeuralNetworksMemory mapping.
std::vector<MemoryRegistration> tensor_memory_map;
};
// Encapsulates all fields related to memory registration for internal
// bookkeeping only.
struct MemoryRegistration {
ANeuralNetworksMemory* memory;
CopyToHostTensorFnPtr callback;
void* callback_context;
};
22. Init() of NNAPI Delegate
Kernel
• mainly for NNAPI initialization:
ANeuralNetworksCompilation_*()
• and building the graph
• if NNAPI >= 1.2, check that
there is a “real” NNAPI device
• one interesting conversion is
INT8 -> UINT8
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
23. INT8 —> UINT8 conversion
• Original TFLite and NNAPI use asymmetric UINT8 quantization
• the asymmetric one provides more flexibility, but symmetric INT8 is usually more
hardware friendly
• more and more INT8 code in TFLite
• NNAPI doesn’t change as fast as TFLite, so conversion is needed
• See the quantization paper for TFLite [1] and MLIR’s quantization doc [2]
[1] Jacob, B et al., ”Quantization and Training of Neural Networks for Efficient Integer-
Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877
[2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
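The conversion itself is mechanical: real = scale × (q − zero_point), so shifting both the stored values and the zero point by 128 maps symmetric INT8 (zero point 0) onto asymmetric UINT8 with zero point 128 while representing exactly the same real numbers. A small sketch (my own helper types, not TFLite’s converter):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the INT8 -> UINT8 requantization idea (not TFLite's code):
// real = scale * (q - zero_point). Adding 128 to both the value and the
// zero point leaves `real` unchanged, so a symmetric int8 tensor
// (zero_point == 0) becomes an asymmetric uint8 tensor with zero_point 128.
struct QuantizedTensor {
  std::vector<int32_t> values;  // wide type to keep the sketch simple
  float scale;
  int32_t zero_point;
};

QuantizedTensor Int8ToUint8(const QuantizedTensor& t) {
  QuantizedTensor out{t.values, t.scale, t.zero_point + 128};
  for (int32_t& v : out.values) v += 128;  // shift [-128, 127] into [0, 255]
  return out;
}

float Dequantize(const QuantizedTensor& t, std::size_t i) {
  return t.scale * static_cast<float>(t.values[i] - t.zero_point);
}
```

Dequantizing either representation yields the same real value, which is why the delegate can do this conversion transparently.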
24. Invoke() of NNAPI Delegate
Kernel
• mainly memory management
and
ANeuralNetworksExecution*()
• To dig deeper we have to go
thru more TFLite and NNAPI
data structures
• asking NNAPI to do the work
is quite trivial when everything
is well prepared
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
25. DoPrepare
• for NNAPI >= 1.2 (Android Q and
later), if there is no real accelerator,
i.e., only the NNAPI CPU fallback,
computation is not
offloaded
• Check for every node to see if it is
supported
• NN API Delegate Registration:
previous pages
• Request TFLite to partition the
graph and make kernels for each
independent node subset a new
nnapi_delegate_kernel
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
26. partition graph
• at the end of DoPrepare(),
ReplaceNodeSubsetsWithDelegateKernels()
is called
• DoPrepare() →
Subgraph::ReplaceNodeSubsetsWithDelegateKernels() →
tflite::PartitionGraphIntoIndependentNodeSubsets() →
tflite::Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L298-L363
27. tflite::Partition() does most of the
partitioning work
• part of Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
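Ignoring the DAG dependency analysis that Partition() really does, the core idea can be sketched on a linear execution plan: split it into maximal runs of nodes the delegate does or doesn’t support (helper names are mine, not from graph_info.cc):

```cpp
#include <vector>

// Rough sketch of what tflite::Partition() produces (greatly simplified:
// a linear plan instead of a DAG): split the execution plan into maximal
// runs of nodes with the same "delegate can handle this" answer. Each
// supported run then becomes one nnapi_delegate_kernel.
struct Partition {
  bool delegated;
  std::vector<int> nodes;  // node indices in execution order
};

std::vector<Partition> PartitionPlan(const std::vector<bool>& supported) {
  std::vector<Partition> parts;
  for (int i = 0; i < static_cast<int>(supported.size()); ++i) {
    if (parts.empty() || parts.back().delegated != supported[i])
      parts.push_back({static_cast<bool>(supported[i]), {}});
    parts.back().nodes.push_back(i);
  }
  return parts;
}
```

The real implementation also has to respect tensor dependencies between nodes, which is why partitions are "independent node subsets" rather than simple runs.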
30. GPU OpenGL Delegate
TfLiteRegistration
• TfLiteRegistration in
DelegatePrepare()
• init()
• no free()
• prepare() is quite simple
• invoke(): simply calls node->Invoke()
• context->
ReplaceNodeSubsetsWithDelegateKernels()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
31. GPU Metal Delegate
• TfLiteDelegate
• Prepare: yup, just Prepare()
• class Delegate, which is quite
large
• NewGpuDelegate()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
32. GPU delegate kernels
• GPU backends require initialization
involving shader compilation and
optimization by the driver before
inference
• PHWC4: P stands for plane
• Reshape is expensive on GPU
• RGBA is better than RGB on GPU
• a tensor of shape [B,H,W,5], for
instance, is twice as expensive as [B,H,
W,4], but about the same as [B,H,W,8],
so the architect can tune around
those 4-channel boundaries rather than
trying to optimize on other boundaries
https://arxiv.org/pdf/1907.01989.pdf
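The 4-channel economics are easy to reproduce: PHWC4 pads the channel dimension up to a multiple of 4 (one RGBA texel per 4 channels), so [B,H,W,5] stores as much data as [B,H,W,8] and twice as much as [B,H,W,4]. A tiny sketch (my own helper, not from the TFLite source):

```cpp
#include <cstdint>

// PHWC4 pads channels up to a multiple of 4 (one RGBA texel per group of
// 4 channels), so memory and shader work scale with ceil(C / 4).
int64_t Phwc4Elements(int64_t b, int64_t h, int64_t w, int64_t c) {
  const int64_t padded_c = ((c + 3) / 4) * 4;  // round C up to multiple of 4
  return b * h * w * padded_c;
}
```

This is why the paper recommends designing models around 4-channel boundaries when targeting the GPU delegate.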
33. Flex Delegate
• Another delegate is the
one that provides a
selected set of ops in
eager mode
• It’s much easier to check
what it does
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
34. Edge TPU’s canned model
• supported ops are packed into
a single op for the Edge TPU
The compiler creates a single custom op for all Edge TPU
compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
[figure: MobileNet V1 — input 1×224×224×3 → edgetpu-custom-op → Softmax → 1×1001]
[figure: SSD MobileNet V1 — normalized_input_image_tensor 1×300×300×3 → edgetpu-custom-op → TFLite_Detection_PostProcess → four outputs (1×10×4 boxes, 1×10 classes, 1×10 scores, 1 count)]
35. Edge TPU C++ API
https://coral.withgoogle.com/docs/edgetpu/api-intro/
36. EdgeTPU Delegate
• There is a dynamic delegate plugin interface. Currently it’s
only used by the Edge TPU delegate
https://coral.withgoogle.com/docs/edgetpu/api-intro/
37. There are still many trivial bugs in
TensorFlow
• There are many typos in the comments in TensorFlow code
• Many things are not well-documented
• There are many warnings when building TensorFlow from source
code
• a trivial fix in May, 2019 by me
https://github.com/tensorflow/tensorflow/pull/28618