TFLite NNAPI and
GPU Delegates
Koan-Sin Tan

freedom@computer.org

Aug 18th, 2019

COSCUP 2019, Taipei, Taiwan
• disclaimer: Opinions Are My Own

• feel free to interrupt me if you have any questions

• questions in English, Taiwanese, and Mandarin are fine

• note that I am going to skip memory-related code in this talk
because of time constraints. Memory management,
including locality and zero-copy, is always a crucial part of
high-performance computing
Who I am
• Used open source before the term “open
source” was in use

• A software guy, learned to use Unix and open
source software on VAX-11/780 running 4.3BSD

• Used to be a programming language junkie

• Worked on various system software, e.g., CPU
scheduling and power management of non-
CPU components

• Recently, working on NN performance on edge
devices

• Contributed from time to time to TensorFlow
Lite

• started a command line label_image for
TFLite
https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
Delegation
• Delegation: one of the commonly
used old mechanisms mentioned in
the GoF book

• presumably, you know this well
already

• if not, the dictionary definitions
of “delegate” work

figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
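The GoF book illustrates delegation with a Window that reuses a Rectangle by holding one and forwarding requests to it, instead of inheriting from it. The sketch below follows that idea; the exact members are simplified for illustration.

```cpp
#include <cassert>

// Delegate: the object that actually knows how to compute an area.
class Rectangle {
 public:
  Rectangle(int width, int height) : width_(width), height_(height) {
    assert(width >= 0 && height >= 0);
  }
  int Area() const { return width_ * height_; }

 private:
  int width_;
  int height_;
};

// Delegator: a Window "has a" Rectangle rather than "is a" Rectangle,
// and forwards Area() requests to it.
class Window {
 public:
  explicit Window(const Rectangle& shape) : shape_(shape) {}
  int Area() const { return shape_.Area(); }

 private:
  Rectangle shape_;
};
```

A TFLite delegate plays the Rectangle role here: the interpreter keeps the public interface and hands part of the work to another executor.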
So, what is a TFLite
delegate?
• “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another
executor.”

• Why delegates?

• running computation-intensive NN models on mobile devices is demanding for
mobile CPUs: both processing power and energy consumption can be problems

• matrix multiplication, which is the core of convolution and fully connected ops, is
highly parallel

• thus, some devices have hardware accelerators, such as a GPU or DSP, that provide better
performance and higher energy efficiency through Android NNAPI

• To use NNAPI, TFLite has an NNAPI delegate

• Why I want to share what I know

• used TFLite, contributed some code, e.g., label_image for TFLite

• wrote quick-and-dirty TFLite GPU delegate benchmarks
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
What is TFLite
• A lightweight inference engine

• originally for Android and
similar platforms. Extended to
micro-controllers (e.g., ARM
Cortex-M series)

• Interpreter-based (what other
choices do they have?)

• ops are organized as a
directed acyclic graph (DAG)

• executes / interprets ops one by
one if no delegates are involved
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
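A minimal sketch of the "interpret ops one by one" idea, assuming the ops are already sorted in execution order. Op, Status, and InvokeAll are hypothetical stand-ins, not the real TFLite types; the real loop in subgraph.cc also deals with tensor allocation, profiling, and delegates.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical, simplified stand-ins; not the real TFLite types.
enum class Status { kOk, kError };

struct Op {
  // Each op reads and writes a shared tensor buffer.
  std::function<Status(std::vector<float>&)> invoke;
};

// Runs the ops in execution-plan order and stops at the first failure.
Status InvokeAll(const std::vector<Op>& execution_plan,
                 std::vector<float>& tensors) {
  for (const Op& op : execution_plan) {
    assert(op.invoke != nullptr);
    if (op.invoke(tensors) != Status::kOk) return Status::kError;
  }
  return Status::kOk;
}
```

With a delegate involved, some entries of the execution plan would be replaced by a single delegate op, and the loop itself would stay the same.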
TfLiteContext
• TfLiteContext: reporting
facilities and access to global
objects, including all the
tensors

• TfLiteNode: a single node or
operation

• TfLiteRegistration: the
implementation of a
conceptual operation
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485
TfLiteContext (struct diagram)
functions: ResizeTensor(), ReportError(), AddTensors(), GetNodeAndRegistration(), ReplaceNodeSubsetsWithDelegateKernels(), GetExternalContext(), SetExternalContext(), …
fields: tensors_size, tensors, impl_, recommended_num_threads, allow_fp32_relax_to_fp16, profiler, …
TfLiteNode
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409
TfLiteNode (struct diagram)
fields: inputs, outputs, intermediates, temporaries, user_data, builtin_data, custom_initial_data, custom_initial_data_size, delegate, …
TfLiteRegistration
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544
TfLiteRegistration (struct diagram)
functions: init(), free(), prepare(), invoke(), profiling_string(), …
fields: builtin_code, custom_name, version, …
To know more
• Reading [1][2] and creating a custom op will help you
understand TfLiteRegistration, TfLiteNode, and
TfLiteContext more deeply

[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/inference.md#write-a-custom-operator

[2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/ops_custom.md
TfLiteDelegate: the
interface
• In case you haven’t noticed
yet, TFLite is mainly written in
C++

• C API for FFI from other
high-level languages

• I hacked a Smalltalk one

• many classes are structs with
no member functions so that they
can be used from the C API easily
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602
TfLiteDelegate (struct diagram)
functions: Prepare(), CopyFromBufferHandle(), CopyToBufferHandle(), FreeBufferHandle(), …
fields: data_, flags, …
How do TFLite delegates
work?
• Let's say we have a simple model graph such as the following:

• Let's assume that there is a delegate "MyDelegate," which has a faster
implementation for Conv2D and Mean operations. The resulting main graph
will be updated to look like below.
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
What does a real model
look like?
• Since the NNAPI delegate
rewrite back in Nov 2018,
a subgraph delegated to
an “accelerator” is a single op
(named Delegate) in TFLite

• subgraph

• all-or-nothing -> per op
[Figure: the model after NNAPI delegation. input (1×224×224×3) feeds a single TfLiteNnapiDelegate node, which takes 57 constant tensors (weights, biases, and similar) as inputs 1 to 57 and produces Reshape_1 (1×1001).]
[Figure: the original, undelegated graph of a MobileNet V1 model. input (1×224×224×3) -> Conv2D (weights 32×3×3×3) -> 13 pairs of DepthwiseConv2D (3×3) and Conv2D (1×1), growing from 32 to 1024 channels while the feature map shrinks from 112×112 to 7×7 -> AveragePool2D (1×1×1×1024) -> Conv2D (weights 1001×1×1×1024) -> Squeeze -> Softmax -> Reshape_1 (1×1001).]
delegates in TFLite
• NNAPI delegate

• mainly for Android

• GPU delegate: NNAPI, which was introduced in Android O MR1 (late 2017), is not
popular (yet)

• GL ES Compute shader on Android

• Metal shader on iOS

• FlexDelegate: eager mode to run some ops

• useful when not all ops are supported by TFLite or accelerators (thru something
like NNAPI or GPU delegate)

• not in TensorFlow repo: EdgeTPU delegate
NNAPI-enabled devices ~ 25.8% around May 7, 2019
https://developer.android.com/about/dashboards
GL ES compute shader capable devices ~ 50%
https://developer.android.com/about/dashboards
Android NN API
• Announced/published with Android 8.1
Preview 1

• Available to developers in the NDK

• yes, NDK

• The Android Neural Networks API (NNAPI)
is an Android C API designed for running
computationally intensive operations for
machine learning on mobile devices

• NNAPI is designed to provide a base layer
of functionality for higher-level machine
learning frameworks (such as TensorFlow
Lite, Caffe2, or others) that build and train
neural networks

• The API is available on all devices running
Android 8.1 (API level 27) or higher
https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
So, what is a delegate
supposed to implement?
• Understanding how to
add a delegate helps

• define a kernel node,
which means to
implement
TfLiteRegistration

• create an instance of
TfLiteDelegate, then
register the kernel node in
Prepare()
typedef struct TfLiteDelegate {
  void* data_;
  TfLiteStatus (*Prepare)(TfLiteContext* context,
                          struct TfLiteDelegate* delegate);
  TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context,
                                       struct TfLiteDelegate* delegate,
                                       TfLiteBufferHandle buffer_handle,
                                       TfLiteTensor* tensor);
  TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context,
                                     struct TfLiteDelegate* delegate,
                                     TfLiteBufferHandle buffer_handle,
                                     TfLiteTensor* tensor);
  void (*FreeBufferHandle)(TfLiteContext* context,
                           struct TfLiteDelegate* delegate,
                           TfLiteBufferHandle* handle);
  int64_t flags;
} TfLiteDelegate;

typedef struct _TfLiteRegistration {
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
  void (*free)(TfLiteContext* context, void* buffer);
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
  const char* (*profiling_string)(const TfLiteContext* context,
                                  const TfLiteNode* node);
  int32_t builtin_code;
  const char* custom_name;
  int version;
} TfLiteRegistration;
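To make the two steps concrete (define a kernel registration, then hand it over in Prepare()), here is a rough sketch with miniature, hypothetical versions of the structs. Context, Node, Registration, Delegate, and every function below are illustrative only; a real delegate would use the TFLite headers and call ReplaceNodeSubsetsWithDelegateKernels() on the context instead of filling a vector. The op names follow the "MyDelegate" example, which is faster for Conv2D and Mean.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical miniature structs; the real ones carry many more fields.
enum Status { kOk = 0, kError = 1 };

struct Context;
struct Node { void* user_data; };

struct Registration {
  Status (*prepare)(Context* context, Node* node);
  Status (*invoke)(Context* context, Node* node);
  const char* custom_name;
};

struct Delegate {
  void* data_;
  Status (*Prepare)(Context* context, Delegate* delegate);
};

struct Context {
  std::vector<std::string> ops;                    // op names in execution order
  std::vector<const Registration*> kernel_for_op;  // filled in by the delegate
};

// Step 1: the kernel node, i.e. a registration with prepare/invoke.
Status MyPrepare(Context*, Node*) { return kOk; }
Status MyInvoke(Context*, Node*) { return kOk; }
const Registration kMyKernel = {MyPrepare, MyInvoke, "MyDelegateKernel"};

// Step 2: the delegate's Prepare() claims the ops it can run faster.
Status DelegatePrepare(Context* context, Delegate* /*delegate*/) {
  assert(context != nullptr);
  context->kernel_for_op.assign(context->ops.size(), nullptr);
  for (std::size_t i = 0; i < context->ops.size(); ++i) {
    const std::string& op = context->ops[i];
    if (op == "Conv2D" || op == "Mean") context->kernel_for_op[i] = &kMyKernel;
  }
  return kOk;
}

Delegate MakeMyDelegate() { return Delegate{nullptr, DelegatePrepare}; }
```

Ops that the delegate does not claim keep a null entry here, which stands for "stay on the default CPU kernel".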
NNAPI delegate
• C++ code: instead of C style
one

• derived from TfLiteDelegate

• Some private data
structures

• extra member functions
corresponding to private
data structures
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.h#L29-L161
TfLiteDelegate (struct diagram)
functions: Prepare(), CopyFromBufferHandle(), CopyToBufferHandle(), FreeBufferHandle(), …
fields: data_, flags, …
StatefulNnApiDelegate (class diagram, derived from TfLiteDelegate)
functions: Prepare(), CopyFromBufferHandle(), CopyToBufferHandle(), FreeBufferHandle(), GetOptions(), RegisterNnapiMemory(), GetTensorMemoryMap(), …
fields: data_, flags, accelerator_name, (options), (memory_registration), …
data
• execution_preference

• power/perf tradeoff: not
widely supported as far as I
can tell

• accelerator_name: e.g.,
“fallback” and “hvx”

• cache_dir

• model_token

• tensor_memory_map:
MemoryRegistration
struct Data {
  // Preferred Power/perf trade-off.
  Options::ExecutionPreference execution_preference;
  // Selected NNAPI accelerator name.
  std::string accelerator_name;
  // The cache dir for NNAPI model.
  std::string cache_dir;
  // The unique token string for NNAPI model.
  std::string model_token;
  // Tensor to ANeuralNetworksMemory mapping.
  std::vector<MemoryRegistration> tensor_memory_map;
};

// Encapsulates all fields related to memory registration for internal
// bookkeeping only.
struct MemoryRegistration {
  ANeuralNetworksMemory* memory;
  CopyToHostTensorFnPtr callback;
  void* callback_context;
};
TfLiteRegistration for
nnapi_delegate_kernel
• init()

• free()

• prepare()

• invoke()

• no profiling_string()

• builtin_code = …

• custom_name
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3575-L3607
Init() of NNAPI Delegate
Kernel
• mainly NNAPI initialization,
ANeuralNetworksCompilation_*(),
and building the graph

• if NNAPI >= 1.2, checks whether
there is a “real” NNAPI device

• one interesting conversion is
INT8 -> UINT8
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
INT8 -> UINT8 conversion
• Originally, TFLite and NNAPI used asymmetric UINT8 quantization

• the asymmetric scheme provides more flexibility, but symmetric INT8 is usually more
hardware friendly

• more and more INT8 code in TFLite

• NNAPI doesn’t change as fast as TFLite, so a conversion is needed

• See the quantization paper for TFLite [1] and MLIR’s quantization doc [2]

[1] Jacob, B. et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877

[2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
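A small sketch of why an INT8 -> UINT8 conversion can be lossless, assuming the usual affine scheme real = scale * (q - zero_point) from [1]. Shifting both the stored value and the zero point by 128 leaves the represented real value unchanged. QuantParams and the function names are hypothetical, not taken from nnapi_delegate.cc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for affine quantization parameters.
struct QuantParams {
  float scale;
  int32_t zero_point;
};

// real = scale * (q - zero_point)
float Dequantize(int q, const QuantParams& p) {
  return p.scale * static_cast<float>(q - p.zero_point);
}

// Maps [-128, 127] onto [0, 255] by adding 128.
uint8_t Int8ToUint8(int8_t v) {
  return static_cast<uint8_t>(static_cast<int>(v) + 128);
}

// The zero point must be shifted by the same amount as the values.
QuantParams ToUint8Params(const QuantParams& p) {
  assert(p.zero_point >= -128 && p.zero_point <= 127);
  return QuantParams{p.scale, p.zero_point + 128};
}
```

Because only an offset changes, the scale stays as it is and no precision is lost in the conversion.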
Invoke() of NNAPI Delegate
Kernel
• mainly memory management
and ANeuralNetworksExecution*()

• To dig deeper, we have to go
through more TFLite and NNAPI
data structures

• asking NNAPI to do the work
is quite trivial once everything
is well prepared
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
DoPrepare
• for NNAPI >= 1.2 (Android Q and
later), if there is no real accelerator,
i.e., only the NNAPI CPU fallback is
present, computation is not
offloaded

• Check every node to see if it is
supported

• NNAPI delegate registration:
see the previous pages

• Ask TFLite to partition the
graph and create a new
nnapi_delegate_kernel for each
independent node subset
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
partition graph
• at the end of DoPrepare(),
ReplaceNodeSubsetsWithDelegateKernels() is called

• DoPrepare() ->
Subgraph::ReplaceNodeSubsetsWithDelegateKernels() ->
tflite::PartitionGraphIntoIndependentNodeSubsets() ->
tflite::Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L298-L363
tflite::Partition() does most of the
partitioning work
• part of Partition()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
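A heavily simplified sketch of the partitioning idea, assuming the graph is a plain linear chain of ops in execution order. The real tflite::Partition() works on a DAG and tracks tensor dependencies between nodes; NodeSubset and PartitionLinear here are hypothetical names that only show how supported and unsupported nodes end up in alternating subsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified node subset.
struct NodeSubset {
  bool delegated;             // true if every node in this run is supported
  std::vector<int> node_ids;  // indices into the execution plan
};

// Groups maximal runs of consecutive nodes that share the same
// "supported by the delegate" flag.
std::vector<NodeSubset> PartitionLinear(const std::vector<bool>& supported) {
  std::vector<NodeSubset> subsets;
  for (std::size_t i = 0; i < supported.size(); ++i) {
    const bool is_supported = supported[i];
    if (subsets.empty() || subsets.back().delegated != is_supported) {
      subsets.push_back(NodeSubset{is_supported, {}});
    }
    subsets.back().node_ids.push_back(static_cast<int>(i));
  }
  return subsets;
}
```

Each subset with delegated set would then become one delegate kernel node, and the others would stay on the CPU.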
GPU GL Delegate
TfLiteRegistration
• TfLiteRegistration in
DelegatePrepare()

• init()

• no free()

• prepare() is quite simple

• invoke(): simply calls node->Invoke()

• context->ReplaceNodeSubsetsWithDelegateKernels()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
GPU GL Delegate
• TfLiteDelegate

• Prepare

• CopyFromBufferHandle

• CopyToBufferHandle

• class Delegate

• TfLiteGpuDelegateCreate()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L464-L470
GPU Metal Delegate
TfLiteRegistration
• TfLiteRegistration in
DelegatePrepare()

• init()

• no free()

• prepare() is quite simple

• invoke(): simply calls node->Invoke()

• context->ReplaceNodeSubsetsWithDelegateKernels()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
GPU Metal Delegate
• TfLiteDelegate

• Prepare: yup, just Prepare()

• class Delegate, which is quite
large

• NewGpuDelegate()
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
GPU delegate kernels
• GPU backends require initialization
involving shader compilation and
optimization by the driver before
inference

• PHWC4: P stands for plane

• Reshape is expensive on GPU

• RGBA is better than RGB on GPU

• a tensor of shape [B, H, W, 5], for
instance, is twice as expensive as [B, H,
W, 4], but about the same as [B, H, W,
8], so the architect can tune around
those 4-channel boundaries rather than
trying to optimize on other boundaries

• https://arxiv.org/pdf/1907.01989.pdf
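The 4-channel boundary can be checked with a bit of arithmetic. The sketch below assumes channels are padded up to a multiple of 4 and stored slice by slice; the offset formula is my reading of the name PHWC4 (slice, row, column, 4 channels), not code taken from the GPU delegate, and all function names are hypothetical.

```cpp
#include <cassert>

// Number of 4-channel slices ("planes") needed for `channels` channels.
int SliceCount(int channels) {
  assert(channels > 0);
  return (channels + 3) / 4;
}

// Channel count after padding up to the next multiple of 4.
int PaddedChannels(int channels) { return 4 * SliceCount(channels); }

// Flat element offset of (y, x, channel c) in an assumed P-H-W-C4 ordering:
// slice index outermost, then row, then column, then the 4 channels of a slice.
int Phwc4Offset(int height, int width, int y, int x, int c) {
  assert(y >= 0 && y < height && x >= 0 && x < width && c >= 0);
  const int slice = c / 4;
  return ((slice * height + y) * width + x) * 4 + c % 4;
}
```

With this padding, 5 channels occupy as much storage as 8, which matches the cost claim above.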
Flex Delegate
• Another delegate is the
one that provides a
selected set of ops in
eager mode

• It’s much easier to check
what it does
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148
https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
Edge TPU’s canned model
• supported ops are packed into
a single op for Edge TPU
The compiler creates a single custom op for all Edge TPU
compatible ops; anything else stays the same
https://coral.withgoogle.com/docs/edgetpu/models-intro/
[Figure: MobileNet V1 compiled for Edge TPU: input (1×224×224×3) -> edgetpu-custom-op -> Softmax (1×1001).]
[Figure: SSD MobileNet V1 compiled for Edge TPU: normalized_input_image_tensor (1×300×300×3) -> edgetpu-custom-op -> TFLite_Detection_PostProcess (tensor shapes shown: 1×1917×91, 1917×4) -> four outputs TFLite_Detection_PostProcess, TFLite_Detection_PostProcess:1, TFLite_Detection_PostProcess:2, TFLite_Detection_PostProcess:3 (1×10×4, 1×10, 1×10, 1).]
Edge TPU C++ API
https://coral.withgoogle.com/docs/edgetpu/api-intro/
EdgeTPU Delegate
• There is a dynamic delegate plugin interface. Currently it is
used only by the EdgeTPU delegate
https://coral.withgoogle.com/docs/edgetpu/api-intro/
There are still many trivial bugs in
TensorFlow
• There are many typos in comments in the TensorFlow code
• Many things are not well documented
• There are many warnings when building TensorFlow from source
• a trivial fix in May 2019 by me
https://github.com/tensorflow/tensorflow/pull/28618

Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
 
open source nn frameworks on cellphones
open source nn frameworks on cellphonesopen source nn frameworks on cellphones
open source nn frameworks on cellphones
 
Caffe2 on Android
Caffe2 on AndroidCaffe2 on Android
Caffe2 on Android
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016
 
A peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserA peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk User
 
Android Wear and the Future of Smartwatch
Android Wear and the Future of SmartwatchAndroid Wear and the Future of Smartwatch
Android Wear and the Future of Smartwatch
 
Understanding Android Benchmarks
Understanding Android BenchmarksUnderstanding Android Benchmarks
Understanding Android Benchmarks
 
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsDark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
 
Smalltalk and ruby - 2012-12-08
Smalltalk and ruby  - 2012-12-08Smalltalk and ruby  - 2012-12-08
Smalltalk and ruby - 2012-12-08
 

Dernier

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 

Dernier (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 

TFLite NNAPI and GPU Delegates

  • 1. TFLite NNAPI and GPU Delegates Koan-Sin Tan freedom@computer.org Aug 18th, 2019 COSCUP 2019, Taipei, Taiwan
  • 2. • disclaimer: Opinions Are My Own • feel free to interrupt me if you have any questions • questions in English, Taiwanese, and Mandarin are fine • note that i am gonna skip memory related code in the talk because of time constraints. Memory management, including locality and zero-copy, is always a crucial part of high-performance computing
  • 3. who i am • Used open source before the term “open source” was used • A software guy, learned to use Unix and open source software on VAX-11/780 running 4.3BSD • Used to be a programming language junkie • Worked on various system software, e.g., CPU scheduling and power management of non-CPU components • Recently, on NN performance on edge devices related stuff • Contributed from time to time to TensorFlow Lite • started a command line label_image for TFLite https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0 http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
  • 4. Delegation • Delegation: one of the commonly used old mechanisms mentioned in the GoF book • presumably, you know this well already • in case not, delegate definitions from dictionaries work figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
  • 5. So, what is a TFLite delegate? • “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another executor.” • Why delegates? • running computation-intensive NN models on mobile devices is resource demanding for mobile CPUs; processing power and energy consumption could be problems • and matrix multiplication, which is the core of convolution and fully connected ops, is highly parallel • Thus, some devices have hardware accelerators, such as GPU or DSP, that provide better performance and higher energy efficiency thru Android NNAPI • To use NNAPI, TFLite has an NNAPI delegate • Why I want to share what I know • used TFLite, contributed some code, e.g., label_image for TFLite • wrote quick-and-dirty TFLite GPU delegate benchmarks https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
  • 6. What is TFLite • A lightweight inference engine • originally for Android and similar platforms. Extended to micro-controllers (e.g., ARM Cortex-M series) • Interpreter-based (what other choices do they have?) • ops are organized as a directed acyclic graph (DAG) • execute / interpret ops one by one if no delegates are involved https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
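The "interpret ops one by one" idea can be sketched in a few lines. This is only a minimal illustration, not the code in subgraph.cc: `Node`, `Status`, and `InvokeAll` are hypothetical stand-ins, and the single shared tensor is a simplification.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not the real TFLite types.
enum class Status { kOk, kError };

struct Node {
  std::string name;
  // Runs the op in place on one shared tensor (a simplification).
  std::function<Status(std::vector<float>&)> invoke;
};

// Walk the execution plan (node indices, already in topological order) and
// run each op in turn, stopping at the first failure.
Status InvokeAll(const std::vector<Node>& nodes,
                 const std::vector<int>& execution_plan,
                 std::vector<float>& tensor) {
  for (int index : execution_plan) {
    assert(index >= 0 && index < static_cast<int>(nodes.size()));
    if (nodes[index].invoke(tensor) != Status::kOk) return Status::kError;
  }
  return Status::kOk;
}
```

With a delegate in the picture the loop stays the same; some entries of the plan simply become delegate kernel nodes that run a whole subgraph.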
  • 7. TfLiteContext • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485 [Diagram: TfLiteContext, with functions ResizeTensor(), ReportError(), AddTensors(), GetNodeAndRegistration(), ReplaceNodeSubsetsWithDelegateKernels(), GetExternalContext(), SetExternalContext(), … and fields tensors_size, tensors, impl_, recommended_num_threads, allow_fp32_relax_to_fp16, profiler, …]
  • 8. TfLiteNode • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409 [Diagram: TfLiteNode, with fields inputs, outputs, intermediates, temporaries, user_data, builtin_data, custom_initial_data, custom_initial_data_size, delegate, …]
  • 9. TfLiteRegistration • TfLiteContext: reporting facilities and access to global objects, including all the tensors • TfLiteNode: a single node or operation • TfLiteRegistration: the implementation of a conceptual operation https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544 [Diagram: TfLiteRegistration, with functions init(), free(), prepare(), invoke(), profiling_string(), … and fields builtin_code, custom_name, version, …]
  • 10. To know more • Reading [1][2] and creating a custom op will help you understand TfLiteRegistration, TfLiteNode, and TfLiteContext more deeply [1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/inference.md#write-a-custom-operator [2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/ops_custom.md
  • 11. TfLiteDelegate: the interface • In case you didn’t notice it yet, TFLite is mainly written in C++ • C API for FFI from other high level languages • I hacked a Smalltalk one • many classes are structs with no member functions so that they can be used in the C API easily https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602 [Diagram: TfLiteDelegate, with functions Prepare(), CopyFromBufferHandle(), CopyToBufferHandle(), FreeBufferHandle(), … and fields data_, flags, …]
  • 12. How do TFLite delegates work? • Let's say we have a simple model graph such as the following: • Let's assume that there is a delegate "MyDelegate," which has a faster implementation for Conv2D and Mean operations. The resulting main graph will be updated to look like below. https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
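To make the init / free / prepare / invoke lifecycle of a registration concrete, here is a small sketch of a made-up "Scale" op. Everything in it is hypothetical: `Context`, `Node`, and `Registration` are cut-down mirrors of the TFLite structs, not the real definitions, and `RunOnce` only imitates what the interpreter would do for one node.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, cut-down mirrors of TfLiteContext / TfLiteNode /
// TfLiteRegistration; the real structs have many more members.
struct Context { int invoke_count = 0; };
struct Node { void* user_data = nullptr; float input = 0.0f; float output = 0.0f; };
enum Status { kOk = 0, kError = 1 };

struct Registration {
  void* (*init)(Context*, const char* buffer, std::size_t length);
  void (*free)(Context*, void* buffer);
  Status (*prepare)(Context*, Node*);
  Status (*invoke)(Context*, Node*);
  const char* custom_name;
};

// Per-node state of the made-up "Scale" op.
struct ScaleState { float factor; };

void* ScaleInit(Context*, const char*, std::size_t) { return new ScaleState{2.0f}; }
void ScaleFree(Context*, void* buffer) { delete static_cast<ScaleState*>(buffer); }
Status ScalePrepare(Context*, Node* node) { return node->user_data ? kOk : kError; }
Status ScaleInvoke(Context* context, Node* node) {
  auto* state = static_cast<ScaleState*>(node->user_data);
  node->output = node->input * state->factor;
  ++context->invoke_count;
  return kOk;
}

Registration* RegisterScale() {
  static Registration r = {ScaleInit, ScaleFree, ScalePrepare, ScaleInvoke, "Scale"};
  return &r;
}

// Drive one node through the lifecycle: init -> prepare -> invoke -> free.
float RunOnce(Registration* reg, float input) {
  assert(reg != nullptr);
  Context context;
  Node node;
  node.input = input;
  node.user_data = reg->init(&context, nullptr, 0);  // state lands in user_data
  float result = 0.0f;
  if (reg->prepare(&context, &node) == kOk && reg->invoke(&context, &node) == kOk) {
    result = node.output;
  }
  reg->free(&context, node.user_data);
  return result;
}
```

Calling `RunOnce(RegisterScale(), 3.0f)` doubles the input. The point is only that a registration is a bag of function pointers plus a name, which is what makes it easy to expose through a C API.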
  • 13. What does a real model look like? • With the NNAPI delegate rewrite back in Nov 2018, a subgraph delegated to an “accelerator” is an op (named Delegate) in TFLite now • subgraph • all-or-nothing -> per op [Figure: the MobileNet V1 graph (input 1×224×224×3, a chain of Conv2D and DepthwiseConv2D ops followed by AveragePool2D, Conv2D, Squeeze, and Softmax, output 1×1001) next to the same model with all ops collapsed into a single TfLiteNnapiDelegate node] http://localhost:8080/, http://localhost:8090/
  • 14. delegates in TFLite • NNAPI delegate • mainly for Android • GPU delegate: NNAPI, which was introduced in Android O MR1 (late 2017), is not popular (yet) • GL ES Compute shader on Android • Metal shader on iOS • FlexDelegate: eager mode to run some ops • useful when not all ops are supported by TFLite or accelerators (thru something like NNAPI or GPU delegate) • not in TensorFlow repo: EdgeTPU delegate
  • 15. NNAPI-enabled devices ~ 25.8% around May 7, 2019 https://developer.android.com/about/dashboards
  • 16. GL ES compute shader capable devices ~ 50% https://developer.android.com/about/dashboards
  • 17. Android NN API • Announced/published with Android 8.1 Preview 1 • Available to developers in the NDK • yes, NDK • The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on mobile devices • NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks (such as TensorFlow Lite, Caffe2, or others) that build and train neural networks • The API is available on all devices running Android 8.1 (API level 27) or higher https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
  • 18. So, what a delegate is supposed to implement • Understanding how to add a delegate helps • define a kernel node, which means to implement TfLiteRegistration • create an instance of TfLiteDelegate, then register the kernel node in Prepare()
      typedef struct TfLiteDelegate {
        void* data_;
        TfLiteStatus (*Prepare)(TfLiteContext* context,
                                struct TfLiteDelegate* delegate);
        TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context,
                                             struct TfLiteDelegate* delegate,
                                             TfLiteBufferHandle buffer_handle,
                                             TfLiteTensor* tensor);
        TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context,
                                           struct TfLiteDelegate* delegate,
                                           TfLiteBufferHandle buffer_handle,
                                           TfLiteTensor* tensor);
        void (*FreeBufferHandle)(TfLiteContext* context,
                                 struct TfLiteDelegate* delegate,
                                 TfLiteBufferHandle* handle);
        int64_t flags;
      } TfLiteDelegate;

      typedef struct _TfLiteRegistration {
        void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
        void (*free)(TfLiteContext* context, void* buffer);
        TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
        TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
        const char* (*profiling_string)(const TfLiteContext* context,
                                        const TfLiteNode* node);
        int32_t builtin_code;
        const char* custom_name;
        int version;
      } TfLiteRegistration;
  • 19. NNAPI delegate • C++ code instead of C-style code • derived from TfLiteDelegate • Some private data structures • extra member functions corresponding to private data structures https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.h#L29-L161 [Diagram: TfLiteDelegate, with functions Prepare(), CopyFromBufferHandle(), CopyToBufferHandle(), FreeBufferHandle(), … and fields data_, flags, …; StatefulNnApiDelegate, with the same functions plus GetOptions(), RegisterNnMemory(), GetTensorMemoryMap(), … and fields data_, flags, accelerator_name, (options), (memory_registration), …]
  • 20. data • execution_preference • power/perf tradeoff: not widely supported as far as I can tell • accelerator_name: e.g., “fallback” and “hvx” • cache_dir • model_token • tensor_memory_map: MemoryRegistration
      struct Data {
        // Preferred Power/perf trade-off.
        Options::ExecutionPreference execution_preference;
        // Selected NNAPI accelerator name.
        std::string accelerator_name;
        // The cache dir for NNAPI model.
        std::string cache_dir;
        // The unique token string for NNAPI model.
        std::string model_token;
        // Tensor to ANeuralNetworksMemory mapping.
        std::vector<MemoryRegistration> tensor_memory_map;
      };

      // Encapsulates all fields related to memory registration for internal
      // bookkeeping only.
      struct MemoryRegistration {
        ANeuralNetworksMemory* memory;
        CopyToHostTensorFnPtr callback;
        void* callback_context;
      };
  • 21. TfLiteRegistration for nnapi_delegate_kernel • init() • free() • prepare() • invoke() • no profiling_string() • builtin_code = … • custom_name https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3575-L3607 [Diagram: TfLiteRegistration, with functions init(), free(), prepare(), invoke(), profiling_string(), … and fields builtin_code, custom_name, version, …]
  • 22. Init() of NNAPI Delegate Kernel • mainly for NNAPI initialization ANeuralNetworksCompilation_*() • and building the graph • if NNAPI >= 1.2, check whether there is a “real” NNAPI device • one interesting conversion is INT8 -> UINT8 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
  • 23. INT8 -> UINT8 conversion • The original TFLite and NNAPI use asymmetric UINT8 quantization • the asymmetric one provides more flexibility, but symmetric INT8 is usually more hardware friendly • more and more INT8 code for TFLite • NNAPI doesn’t change as fast as TFLite, so conversion is needed • See the quantization paper for TFLite [1] and MLIR’s quantization doc [2] [1] Jacob, B. et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877 [2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
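Why the conversion is cheap: with affine quantization a real value is scale * (q - zero_point), so shifting both the stored value and the zero point by 128 leaves the real value unchanged. The sketch below shows that arithmetic only; `QuantParams` and the helper names are made up for illustration and are not the delegate's own code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type: affine quantization parameters of one tensor.
struct QuantParams {
  float scale;
  std::int32_t zero_point;
};

// Shift signed 8-bit data by 128 so it fits the unsigned 8-bit range [0, 255].
std::vector<std::uint8_t> Int8ToUint8(const std::vector<std::int8_t>& values) {
  std::vector<std::uint8_t> out;
  out.reserve(values.size());
  for (std::int8_t v : values) {
    out.push_back(static_cast<std::uint8_t>(static_cast<int>(v) + 128));
  }
  return out;
}

// The zero point moves by the same offset, so real values are preserved.
QuantParams ShiftZeroPoint(QuantParams p) {
  assert(p.zero_point >= -128 && p.zero_point <= 127);
  return {p.scale, p.zero_point + 128};
}

// real = scale * (q - zero_point)
float Dequantize(int q, QuantParams p) {
  return p.scale * static_cast<float>(q - p.zero_point);
}
```

For example, with scale 0.5 and zero point -3, the INT8 value -128 and the UINT8 value 0 (zero point 125) both dequantize to -62.5.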
  • 24. Invoke() of NNAPI Delegate Kernel • mainly memory management and ANeuralNetworksExecution*() • To dig deeper we have to go thru more TFLite and NNAPI data structures • asking NNAPI to work for you is quite trivial when everything is well-prepared https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
  • 25. DoPrepare • for NNAPI >= 1.2 (Android Q and later), if there are no real accelerators, i.e., only the NNAPI CPU fallback is there, computation is not offloaded • Check every node to see if it is supported • NN API Delegate Registration: previous pages • Request TFLite to partition the graph and turn each independent node subset into a new nnapi_delegate_kernel https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
  • 26. partition graph • at the end of DoPrepare(), ReplaceNodeSubsetsWithDelegateKernels() is called • DoPrepare() -> Subgraph::ReplaceNodeSubsetsWithDelegateKernels() -> tflite::PartitionGraphIntoIndependentNodeSubsets() -> tflite::Partition() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L298-L363
  • 27. tflite::Partition() does most of the partitioning job • part of Partition() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
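The first two steps of DoPrepare (bail out when only the CPU fallback exists, otherwise check every node) can be pictured roughly as below. This is a sketch under loud assumptions: `Op`, `CollectSupportedNodes`, and matching on a type string are all invented here; the real delegate looks at builtin codes, op versions, and parameters.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical op record, identified only by a type name for illustration.
struct Op {
  std::string type;
};

// Return the indices of the nodes the delegate is willing to take.
// If there is no real accelerator, nothing is offloaded at all.
std::vector<int> CollectSupportedNodes(const std::vector<Op>& ops,
                                       const std::set<std::string>& supported,
                                       bool has_real_accelerator) {
  std::vector<int> result;
  if (!has_real_accelerator) return result;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    assert(!ops[i].type.empty());
    if (supported.count(ops[i].type) > 0) result.push_back(i);
  }
  return result;
}
```

A list of node indices like this is what gets handed back to TFLite so that it can partition the graph.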
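A toy version of the partitioning idea, assuming a purely linear model: group consecutive nodes by whether the delegate supports them. The real tflite::Partition() works on tensor dependencies in a general graph, so treat `NodeSubset` and `PartitionByRuns` here as hypothetical simplifications rather than the actual algorithm.

```cpp
#include <cassert>
#include <vector>

// Hypothetical result type: a run of nodes that is either handed to the
// delegate or left for the regular TFLite kernels.
struct NodeSubset {
  bool delegated;
  std::vector<int> nodes;
};

// Split nodes 0..num_nodes-1 (assumed to be in execution order) into maximal
// runs of supported and unsupported nodes.
std::vector<NodeSubset> PartitionByRuns(int num_nodes,
                                        const std::vector<int>& supported) {
  std::vector<bool> is_supported(num_nodes, false);
  for (int index : supported) {
    assert(index >= 0 && index < num_nodes);
    is_supported[index] = true;
  }
  std::vector<NodeSubset> subsets;
  for (int i = 0; i < num_nodes; ++i) {
    const bool flag = is_supported[i];
    if (subsets.empty() || subsets.back().delegated != flag) {
      subsets.push_back(NodeSubset{flag, {}});  // start a new run
    }
    subsets.back().nodes.push_back(i);
  }
  return subsets;
}
```

Each delegated run would then be replaced by one delegate kernel node, which is why a fully supported model collapses into a single node.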
  • 28. GPU GL Delegate TfLiteRegistration • TfLiteRegistration in DelegatePrepare() • init() • no free() • prepare() is quite simple • invoke(): simply calls node->Invoke() • context->ReplaceNodeSubsetsWithDelegateKernels() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
  • 29. GPU GL Delegate • TfLiteDelegate • Prepare • CopyFromBufferHandle • CopyToBufferHandle • class Delegate • TFLiteGpuDelegateCreate() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L464-L470
  • 30. GPU Metal Delegate TfLiteRegistration • TfLiteRegistration in DelegatePrepare() • init() • no free() • prepare() is quite simple • invoke(): simply calls node->Invoke() • context->ReplaceNodeSubsetsWithDelegateKernels() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
  • 31. GPU Metal Delegate • TfLiteDelegate • Prepare: yup, just Prepare() • class Delegate, which is quite large • NewGpuDelegate() https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
  • 32. GPU delegate kernels • GPU backends require initialization involving shader compilation and optimization by the driver before inference • PHWC4: P stands for plane • Reshape is expensive on GPU • RGBA is better than RGB on GPU • since a tensor of shape [B, H, W, 5], for instance, is twice as expensive as [B, H, W, 4] but about the same as [B, H, W, 8], a network architect can tune around those 4-channel boundaries rather than trying to optimize on other boundaries • https://arxiv.org/pdf/1907.01989.pdf
  • 33. Flex Delegate • Another delegate is the one that provides a selected set of ops in Eager mode • It’s much easier to check what it does https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148 https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
  • 34. Edge TPU’s canned model • supported ops are packed into a single op for Edge TPU • The compiler creates a single custom op for all Edge TPU compatible ops; anything else stays the same https://coral.withgoogle.com/docs/edgetpu/models-intro/ [Figure: MobileNet V1 compiled for Edge TPU: input 1×224×224×3 -> edgetpu-custom-op -> Softmax, output 1×1001; SSD MobileNet V1 compiled for Edge TPU: normalized_input_image_tensor 1×300×300×3 -> edgetpu-custom-op -> TFLite_Detection_PostProcess with four outputs]
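The 4-channel boundary argument is just rounding arithmetic: in a PHWC4-style layout the channels are cut into slices ("planes") of 4, and a partly filled slice costs as much as a full one. A small sketch follows, with the caveat that the function names and the exact index formula are my own illustration of such a layout, not code from the GPU delegate.

```cpp
#include <cassert>

// Number of 4-channel slices ("planes") needed to hold `channels` channels.
int SlicesFor(int channels) {
  assert(channels >= 0);
  return (channels + 3) / 4;  // round up to a multiple of 4
}

// Channels actually stored once the last slice is padded to 4.
int PaddedChannels(int channels) { return SlicesFor(channels) * 4; }

// Flat offset of element (h, w, c) of an H x W x C tensor in a
// plane-major layout: plane, then row, then column, then channel-in-slice.
int Phwc4Index(int h, int w, int c, int height, int width) {
  const int plane = c / 4;
  return ((plane * height + h) * width + w) * 4 + c % 4;
}
```

So 5 channels occupy 2 slices, the same as 8 channels and twice as many as 4, which is where the "twice as expensive" remark comes from.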
  • 35. Edge TPU C++ API https://coral.withgoogle.com/docs/edgetpu/api-intro/
  • 36. EdgeTPU Delegate • There is a dynamic delegate plugin interface. Currently it’s only used by the EdgeTPU delegate https://coral.withgoogle.com/docs/edgetpu/api-intro/
  • 37. There still are many trivial bugs in TensorFlow • There are many typos in comments of TensorFlow code • Many things are not well-documented • There are many many warnings when building TensorFlow from source code • a trivial fix in May, 2019 by me https://github.com/tensorflow/tensorflow/pull/28618