Koan-Sin Tan,
freedom@computer.org
COSCUP, Aug 2nd, 2020
TensorFlow Runtime
A Peek into the Future of TensorFlow
1
• disclaimer: opinions are my own

• feel free to interrupt me if you have any questions during the presentation

• questions could be Taiwanese, English, or Mandarin

• most of TFRT materials are adapted from TFRT deep dive in MLIR design meeting [1] and TFRT docs [2]

• code around Aug 1, 2020 (git commit ecf1c20 [3])

[1] TFRT Deep Dive, slides and recording: https://mlir.llvm.org/talks/

[2] https://github.com/tensorflow/runtime/tree/master/documents

[3] https://github.com/tensorflow/runtime/commit/ecf1c20
2
• Used open source before the term “open source” was coined
• A software guy, learned to use Unix and open source software on a VAX-11/780 running 4.3BSD
• Used to be a programming language junkie
• Worked on various system software, e.g., CPU scheduling and power management of non-CPU components
• Recently, on NN performance on edge devices and related work
• Contributed from time to time to TensorFlow Lite
• started the command-line label_image example for TFLite
who i am
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
3
What is TFRT
• TensorFlow Runtime (TFRT) is one of the two new MLIR-based runtimes that have emerged in 2020 so far.

• The other one is IREE, the Intermediate Representation Execution Environment. So far TFRT seems to have the better design documentation.

• Both of them have mobile / edge environments in mind.

• I haven’t seen mobile acceleration code in TFRT yet.

• IREE has some Vulkan-related code, and some simple code already works on Android.

• ResNet GPU inference is reported to be 28% faster with TFRT.

• https://github.com/tensorflow/runtime, https://youtu.be/15tiQoPpuZ8
4
Build it
• if you follow the instructions described in README.md, it should just work, at least on x86_64 Linux.

• however, it’s not tested on non-Linux environments yet

• ssize_t and int64_t

• on Mac OS X: ssize_t is long, int64_t is long long
• the current code mixes the use of ssize_t and int64_t

• test: one of the acclaimed features of TFRT, like MLIR, is its use of LLVM FileCheck

• with my hacks, shape-related (ssize_t) tests are not fixed yet

• it’s not tested on non-x86 platforms, such as aarch64, either
5
• The three key directories under the TFRT root directory are

• lib: Contains core TFRT infrastructure code

• backends: Contains device specific infrastructure and op/kernel implementations

• include: Contains public header files for core TFRT infrastructure
6
Walking thru the tutorial
• unfortunately, it seems it’s not easy to jump directly into the source code without some background knowledge

• so we’ll walk thru the tutorial [1]

• What’s in the tutorial:

• print hello world

• print integer

• adding kernels

[1] https://github.com/tensorflow/runtime/blob/master/documents/tutorial.md
7
using tfrt and tfrt_test
hello.mlir
func @hello() {
%chain = tfrt.new.chain
// Create a string containing "hello world" and store it in %hello.
%hello = "tfrt_test.get_string"() { string_attr = "hello world" } : () -> !tfrt.string
// Print the string in %hello.
"tfrt_test.print_string"(%hello, %chain) : (!tfrt.string, !tfrt.chain) -> !tfrt.chain
tfrt.return
}
The @hello function above shows how to create and print a string. The text after each ‘:’ specifies the types involved:

• () -> !tfrt.string means that tfrt_test.get_string takes no arguments and returns a !tfrt.string. tfrt is the MLIR dialect prefix (or namespace) for TFRT

• (!tfrt.string, !tfrt.chain) -> !tfrt.chain means that tfrt_test.print_string takes two arguments (!tfrt.string and !tfrt.chain) and returns a !tfrt.chain. A chain [1] is a TFRT abstraction to manage dependencies

[1] https://github.com/tensorflow/runtime/blob/master/documents/explicit_dependency.md
8
hello world in MLIR
func @stringconstant() -> !llvm<"[12 x i8]"> {
  %1 = llvm.constant("Hello world!") : !llvm<"[12 x i8]">
  // CHECK: ret [12 x i8] c"Hello world!"
  llvm.return %1 : !llvm<"[12 x i8]">
}
func @main() {
  %0 = llvm.constant(0) : !llvm.i64
  %1 = call @stringconstant() : () -> !llvm<"[12 x i8]">
  %2 = llvm.getelementptr %1[%0] : (!llvm<"[12 x i8]">, !llvm.i64) -> !llvm<"i8*">
  %3 = llvm.call @puts(%2) : (!llvm<"i8*">) -> !llvm.i32
  return
}
func @puts(!llvm<"i8*">) -> !llvm.i32
• the MLIR “standard dialect” doesn’t have I/O functions

• there is the LLVM dialect, so of course we can use LLVM to call a standard libc function
9
Hello integer
func @hello_integers() {
%chain = tfrt.new.chain
// Create an integer containing 42.
%forty_two = tfrt.constant.i32 42
// Print 42.
tfrt.print.i32 %forty_two, %chain
tfrt.return
}
• as stated in the tutorial, we can run other functions in the same module

• we can turn to more basic types, such as integers or floating point numbers

• @hello_integers shows how to create and print integers

• This example does not have the verbose type information we saw in @hello because there are custom parsers for the tfrt.constant.i32 and tfrt.print.i32 kernels in basic_kernels.td
10
basic_kernels.td
• .td files are for LLVM TableGen [1] (historically, “target description” files)

[1] TableGen, https://llvm.org/docs/TableGen/
class ConstantOp<string suffix, Type baseType, Attr attr>
: TFRT_Op<"constant." # suffix, [NoSideEffect]> {
let summary = "host executor constant value constructor";
let arguments = (ins attr:$value);
let results = (outs baseType);
}
class PrintOp<string suffix, Type type> : TFRT_Op<"print." # suffix> {
let summary = "tfrt.print operation";
let description = [{
An operation takes a number input and a chain input.
It prints the number to stdout and returns a chain output.
The chain input must be the second operand.
Example:
%2 = tfrt.print.i32 %0, %1
}];
let arguments = (ins type, TFRT_ChainType);
let results = (outs TFRT_ChainType);
let assemblyFormat = "operands attr-dict";
let verifier = ?;
}
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L376-L390
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L58-L64
11
Define kernels
12
user defined kernels
func @print_coordinate() {
%chain = tfrt.new.chain
%two = tfrt.constant.i32 2
%four = tfrt.constant.i32 4
%coordinate = "my.create_coordinate"(%two, %four) : (i32, i32) -> !my.coordinate
"my.print_coordinate"(%coordinate, %chain) : (!my.coordinate, !tfrt.chain) -> !tfrt.chain
tfrt.return
}
coordinate.mlir shows several TFRT features:

• MLIR types that begin with exclamation mark (!) are user-defined types like !my.coordinate,
compared to built-in types like i32

• Kernels are just C++ functions with a name in MLIR: my.print_coordinate is the MLIR name for
the C++ PrintCoordinate function

• Kernels may pass arbitrary user-defined types: my.create_coordinate passes a custom Coordinate struct to my.print_coordinate
13
to dig into some code we need
more system information
14
Host Runtime
15
• A TensorFlow user passes into TFRT a TensorFlow graph created via high-level TensorFlow APIs, and

• TFRT then calls the MLIR-based graph compiler to optimize and lower the graph into BEF, a Binary Executable Format for TFRT graph execution (MLIR is the compiler infrastructure used to represent TFRT host programs).

• The blue arrows in the simplified TensorFlow training stack diagram show this flow.
16
• In the README.md we are told to build two binaries: tfrt_translate and bef_executor

• tfrt_translate

• The tfrt_translate program does round trip
translation between MLIR and BEF, similar
to an assembler and disassembler.

• bef_executor

• The bef_executor program is the execution driver for BEF files. It reads in a BEF file, sets up the runtime, and asynchronously executes function(s) in that file.
17
TFRT Host Runtime
• Foundation of TFRT: schedules work on the host and devices

• Clean separation between host and device runtimes:

• Host runtime does not know anything about devices, just their runtimes (sets of kernels) 

• Key design points:

• Fully asynchronous - kernel executions cannot block

• Excellent error propagation in the presence of asynchrony

• Performance as a first-class concern, for graph and eager

• Outline:

• Common runtime infrastructure

• Graph execution

• Op-by-op execution (“eager”)
18
• Container for data or resources

• Not Tensor specific

• A “future” type, fulfilled with exactly one value, or an error

• Lock-free, low memory overhead, type-erased, reference counted

• Helper class AsyncValueRef<T> provides type safety when
contained type is known
• AsyncValues enable efficient asynchronous compute

• Asynchronous functions return unavailable AsyncValues
• Caller can schedule dependent
computations with AsyncValue::AndThen()
• Caller need not block until AsyncValue
becomes available
Key Abstraction: AsyncValue 

https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/async_value.h
19
Kernels
• Kernel: unit of computation scheduled by the runtime

• Similar to kernel concept in current TensorFlow

• Kernels accept AsyncValue inputs and produce AsyncValue outputs

• Runtime coordinates dataflow of AsyncValues between kernels

• Outputs may not be immediately available, unlike current TensorFlow

• Runtime generally does not understand kernel semantics
// Kernel that adds two integers.
// AsyncKernelFrame holds the kernel’s arguments and results.
static void TFRTAdd(AsyncKernelFrame* frame) {
  // Fetch the kernel’s 0th argument.
  AsyncValue* arg1 = frame->GetArgAt(0);
  // Fetch the kernel’s 1st argument.
  AsyncValue* arg2 = frame->GetArgAt(1);
  int v1 = arg1->get<int>();
  int v2 = arg2->get<int>();
  // Set the kernel’s 0th result.
  frame->EmplaceResultAt<int>(0, v1 + v2);
}
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md
https://github.com/tensorflow/runtime/blob/master/lib/basic_kernels/integer_kernels.cc#L39-L45
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/kernel_utils.h#L61-L149
20
Host Program
• Host programs encode a dataflow graph

• Similar to GraphDef in current TensorFlow

• Expressed in MLIR. Typically compiler generated

• Designed for low-level dispatch efficiency

• Designed for compiler transformations and analysis, e.g., 

• Use dataflow analysis for buffer reuse
func @sample_function() -> i32 {
%one = tfrt.constant.i32 1 // Make AsyncValue with value 1
%two = tfrt.constant.i32 2 // Make AsyncValue with value 2
%three = tfrt.add.i32 %one, %two // Make AsyncValue with value 3 (1+2)
%ch0 = tfrt.new.chain
tfrt.print.i32 %three, %ch0 // Print AsyncValue %three
tfrt.return %three : i32 // Return AsyncValue %three
}
21
TFRT Binary Executable Format (BEF)
• BEF encodes a hardware-specific lowered graph
function

• Primary interface between compiler and runtime 

• Designed for efficient execution

• Low overhead: execute program by reading mmap’d
byte array 

• Persistent and stable: Compile once offline, run many times online. Great for inference use-cases

• Composed of sections, similar to ELF. Each section
has its own format 

• Extensible: BEF is versioned, reader ignores unknown
sections, new versions may define new sections 
 https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md
22
BEF Executor
• BEF Executor evaluates a BEF dataflow graph “executor” style:

• Not a bytecode-like interpreter: no concept of program counter

• “Strict” execution by default: run a kernel only when all its inputs are available

• Executor features:

• Lock-free: atomics instead of mutexes

• Non-blocking: defer dependent work with AsyncValue::AndThen

• Supports “non-strict” execution: may run a kernel when some of its
inputs are available

• Good for efficiently forwarding unavailable inputs to outputs

• Key concepts:

• BEF: dataflow graph

• Kernel: dataflow node

• AsyncValues: dataflow edge
https://github.com/tensorflow/runtime/blob/master/lib/bef_executor/bef_interpreter.cc#L223-L254
23
Host Runtime Summary 

24
How about Core Runtime?
• Surely, we can do a similar walkthrough, but that would take more time

• Two things

• Op Execution API, Execute()

• BEF Executor can handle it too
void CoreRuntime::Impl::Execute(const ExecutionContext& exec_ctx,
                                string_view op_name, OpHandler* op_handler,
                                MutableArrayRef<TensorHandle> arguments,
                                const OpAttrsRef& attrs,
                                MutableArrayRef<TensorHandle> results,
                                AsyncValueRef<Chain>* chain) {
  // Ask the op_handler to execute the op. If successful, we're done.
  auto op_handle = op_handler->MakeOp(op_name);
  if (op_handle) {
    op_handle.get()(exec_ctx, arguments, attrs, results, chain);
    return;
  }
  // Otherwise, we fail with an 'unknown op' error.
  auto err =
      EmitErrorAsync(exec_ctx, "op '" + op_name.str() + "' is not supported");
  for (auto& result : results) result = TensorHandle(err.CopyRef());
  if (chain) *chain = std::move(err);
}
25
https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/core_runtime.cc#L124-L143
https://github.com/tensorflow/runtime/blob/master/documents/
tfrt_op_by_op_execution_design.md
BEF Executor for “op” graph
• corert.executeop

• sample
26
https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/kernels.cc
func @example() -> !tfrt.chain {
%cpu = corert.get_op_handler("cpu")
// Create TensorHandles
%lhs = corert.executeop(%cpu)
"test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] }
%rhs = corert.executeop(%cpu)
"test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] }
%result = corert.executeop(%cpu) "test.add" (%lhs, %rhs)
%ch0 = tfrt.new.chain
%ch1 = corert.print_tensorhandle(%result, %ch0)
tfrt.return %ch1 : !tfrt.chain
}
func @example() -> !tfrt.chain {
%ch0 = tfrt.new.chain
%cpu = corert.get_op_handler %ch0 "cpu"
// Create TensorHandles
%lhs = corert.executeop(%cpu)
"test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] } : 1
%rhs = corert.executeop(%cpu)
"test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] } : 1
%result = corert.executeop(%cpu) "test.add" (%lhs, %rhs) : 1
%ch1 = "corert.print_tensorhandle"(%result, %ch0) : (!corert.tensorhandle, !tfrt.chain) -> !tfrt.chain
tfrt.return %ch1 : !tfrt.chain
}
Device Runtime
CPU
27
//===----------------------------------------------------------------------===//
// CPU Relu kernels
//===----------------------------------------------------------------------===//

// Computes B = Relu(A).
template <typename T>
static AsyncValueRef<Chain> Relu(const DenseHostTensor& A, DenseHostTensor* B,
                                 const ExecutionContext& exec_ctx) {
  auto fn = [](auto& a, auto& b) { return a.cwiseMax(static_cast<T>(0)); };
  return ::tfrt::compat::UnaryEigenKernelAsync<T, T>(A, B, std::move(fn),
                                                     exec_ctx);
}

//===----------------------------------------------------------------------===//
// CPU BiasAdd kernels
//===----------------------------------------------------------------------===//

// A special case of tf.add where bias is restricted to be 1-D.
// Currently only supports the NHWC data format.
template <typename T, size_t RANK>
static AsyncValueRef<Chain> BiasAdd(const DenseHostTensor& input,
                                    const DenseHostTensor& bias,
                                    DenseHostTensor* output,
                                    const ExecutionContext& exec_ctx) {
  DHTIndexableView<T, RANK> input_view(&input);
  MutableDHTIndexableView<T, RANK> output_view(output);
  DHTIndexableView<T, 1> bias_view(&bias);
  const auto& shape_input = input_view.FixedShape();
  const auto& shape_bias = bias_view.FixedShape();
  const auto& shape_output = output_view.FixedShape();
  if (shape_input != shape_output) {
    return EmitErrorAsync(exec_ctx, "unexpected output shape");
  }
  if (shape_bias[0] != shape_input[RANK - 1]) {
    return EmitErrorAsync(exec_ctx, "bias shape does not match input shape");
  }
  // Reshape bias to the shape of input. Broadcast along the last axis of input.
  Eigen::array<Eigen::Index, RANK> reshape_dims;
  Eigen::array<Eigen::Index, RANK> broadcast_dims;
  for (size_t i = 0; i < RANK - 1; ++i) {
    reshape_dims[i] = static_cast<Eigen::Index>(1);
    broadcast_dims[i] = static_cast<Eigen::Index>(shape_input[i]);
  }
  reshape_dims[RANK - 1] = static_cast<Eigen::Index>(shape_bias[0]);
  broadcast_dims[RANK - 1] = static_cast<Eigen::Index>(1);
  auto input_t = AsEigenConstTensor(input_view);
  auto bias_t = AsEigenConstTensor(bias_view);
  auto output_t = AsEigenTensor(output_view);
  auto expr = input_t + bias_t.reshape(reshape_dims).broadcast(broadcast_dims);
  return AsyncAssign(
      exec_ctx.host()->GetOrCreateSharedContext<EigenHostContext>(),
      std::move(output_t), std::move(expr),
      KeepBuffers::alive(&input, &bias, output));
}
https://github.com/tensorflow/runtime/blob/master/backends/cpu/lib/kernels/cpu_kernels.h
Dialects we can see now
• tfrt: we know what this is for

• tfrt_test: to test tfrt

• tfrt_data: tf.data, to deal with input pipeline

• tfrt_dht: dense host tensor

• corert: Core Runtime, eager execution

• ts: tensor shape

• coo: COOrdinate list sparse tensor

• eigen: wrapper around the eigen library

• btf: binary tensor format

• cuda: you know what cuda means :-)
28
Concluding Remarks
• MLIR related talks and publications, https://mlir.llvm.org/talks/

• We scratched the surface of TFRT host runtime and core runtime. There are more details

• threading model: thread pool / work queue,

• memory allocation: tcmalloc for server, other small allocators for embedded systems,

• non-strict execution, and

• registers: BEF executor is a register machine

• we didn’t touch other important components such as device runtimes, esp. the GPU part, and the distributed environment
29
Fin
30
Device Runtime Design Principles 

• A thin wrapper of low-level (driver) APIs, exposing device capabilities to graph compiler

• Memory Allocation

• Async host <-> device transfer, and kernel execution

• Dependency management

• Focus on mechanism instead of policy

• E.g. No built-in special-purpose streams for GPU support:
• For pure eager execution, can default to one stream for everything 

• For tf.function execution, compiler can pick streams
31

Contenu connexe

Tendances

Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
inside-BigData.com
 
Programming Language
Programming  LanguageProgramming  Language
Programming Language
Adeel Hamid
 

Tendances (20)

Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
 
Building Embedded Linux Systems Introduction
Building Embedded Linux Systems IntroductionBuilding Embedded Linux Systems Introduction
Building Embedded Linux Systems Introduction
 
EE5440 – Computer Architecture - Lecture 1
EE5440 – Computer Architecture - Lecture 1EE5440 – Computer Architecture - Lecture 1
EE5440 – Computer Architecture - Lecture 1
 
Paradigmas de programação
Paradigmas de programaçãoParadigmas de programação
Paradigmas de programação
 
RISC-V Unconstrained
RISC-V UnconstrainedRISC-V Unconstrained
RISC-V Unconstrained
 
Embedded C - Day 2
Embedded C - Day 2Embedded C - Day 2
Embedded C - Day 2
 
1 Computer Architecture
1 Computer Architecture1 Computer Architecture
1 Computer Architecture
 
Programming languages
Programming languagesProgramming languages
Programming languages
 
RISC and CISC Processors
RISC and CISC ProcessorsRISC and CISC Processors
RISC and CISC Processors
 
Declarative programming language
Declarative programming languageDeclarative programming language
Declarative programming language
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
 
Aula javascript
Aula  javascriptAula  javascript
Aula javascript
 
Qemu Introduction
Qemu IntroductionQemu Introduction
Qemu Introduction
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
 
Yocto project and open embedded training
Yocto project and open embedded trainingYocto project and open embedded training
Yocto project and open embedded training
 
Aula01-JavaScript
Aula01-JavaScriptAula01-JavaScript
Aula01-JavaScript
 
RISC-V Introduction
RISC-V IntroductionRISC-V Introduction
RISC-V Introduction
 
Lecture_ASCII and Unicode.ppt
Lecture_ASCII and Unicode.pptLecture_ASCII and Unicode.ppt
Lecture_ASCII and Unicode.ppt
 
Programming Language
Programming  LanguageProgramming  Language
Programming Language
 
Introduction to the LLVM Compiler System
Introduction to the LLVM  Compiler SystemIntroduction to the LLVM  Compiler System
Introduction to the LLVM Compiler System
 

Similaire à A Peek into TFRT

Virtual platform
Virtual platformVirtual platform
Virtual platform
sean chen
 

Similaire à A Peek into TFRT (20)

A Sneak Peek of MLIR in TensorFlow
A Sneak Peek of MLIR in TensorFlowA Sneak Peek of MLIR in TensorFlow
A Sneak Peek of MLIR in TensorFlow
 
Dynamic Instrumentation- OpenEBS Golang Meetup July 2017
Dynamic Instrumentation- OpenEBS Golang Meetup July 2017Dynamic Instrumentation- OpenEBS Golang Meetup July 2017
Dynamic Instrumentation- OpenEBS Golang Meetup July 2017
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
Os lectures
Os lecturesOs lectures
Os lectures
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
 
Threads
ThreadsThreads
Threads
 
TFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU DelegatesTFLite NNAPI and GPU Delegates
TFLite NNAPI and GPU Delegates
 
Virtual platform
Virtual platformVirtual platform
Virtual platform
 
44CON London 2015 - Reverse engineering and exploiting font rasterizers: the ...
44CON London 2015 - Reverse engineering and exploiting font rasterizers: the ...44CON London 2015 - Reverse engineering and exploiting font rasterizers: the ...
44CON London 2015 - Reverse engineering and exploiting font rasterizers: the ...
 
freertos-proj.pdf
freertos-proj.pdffreertos-proj.pdf
freertos-proj.pdf
 
Tensorflow internal
Tensorflow internalTensorflow internal
Tensorflow internal
 
2004 ugm-tips-tricks
2004 ugm-tips-tricks2004 ugm-tips-tricks
2004 ugm-tips-tricks
 
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
 
C++ Advanced Features
C++ Advanced FeaturesC++ Advanced Features
C++ Advanced Features
 
OpenSAF Symposium_Python Bindings_9.21.11
OpenSAF Symposium_Python Bindings_9.21.11OpenSAF Symposium_Python Bindings_9.21.11
OpenSAF Symposium_Python Bindings_9.21.11
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin
 
Standard Library Functions
Standard Library FunctionsStandard Library Functions
Standard Library Functions
 
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
 
Linux Perf Tools
Linux Perf ToolsLinux Perf Tools
Linux Perf Tools
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 

Plus de Koan-Sin Tan

Understanding Android Benchmarks
Understanding Android BenchmarksUnderstanding Android Benchmarks
Understanding Android Benchmarks
Koan-Sin Tan
 
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsDark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Koan-Sin Tan
 

Plus de Koan-Sin Tan (15)

running stable diffusion on android
running stable diffusion on androidrunning stable diffusion on android
running stable diffusion on android
 
Exploring Your Apple M1 devices with Open Source Tools
Exploring Your Apple M1 devices with Open Source ToolsExploring Your Apple M1 devices with Open Source Tools
Exploring Your Apple M1 devices with Open Source Tools
 
Running TFLite on Your Mobile Devices, 2020
Running TFLite on Your Mobile Devices, 2020Running TFLite on Your Mobile Devices, 2020
Running TFLite on Your Mobile Devices, 2020
 
Exploring Thermal Related Stuff in iDevices using Open-Source Tool
Exploring Thermal Related Stuff in iDevices using Open-Source ToolExploring Thermal Related Stuff in iDevices using Open-Source Tool
Exploring Thermal Related Stuff in iDevices using Open-Source Tool
 
A Peek into Google's Edge TPU
A Peek into Google's Edge TPUA Peek into Google's Edge TPU
A Peek into Google's Edge TPU
 
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
Why You Cannot Use Neural Engine to Run Your NN Models on A11 Devices?
 
open source nn frameworks on cellphones
open source nn frameworks on cellphonesopen source nn frameworks on cellphones
open source nn frameworks on cellphones
 
Caffe2 on Android
Caffe2 on AndroidCaffe2 on Android
Caffe2 on Android
 
Tensorflow on Android
Tensorflow on AndroidTensorflow on Android
Tensorflow on Android
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016
 
A peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk UserA peek into Python's Metaclass and Bytecode from a Smalltalk User
A peek into Python's Metaclass and Bytecode from a Smalltalk User
 
Android Wear and the Future of Smartwatch
Android Wear and the Future of SmartwatchAndroid Wear and the Future of Smartwatch
Android Wear and the Future of Smartwatch
 
Understanding Android Benchmarks
Understanding Android BenchmarksUnderstanding Android Benchmarks
Understanding Android Benchmarks
 
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source SolutionsDark Silicon, Mobile Devices, and Possible Open-Source Solutions
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions
 
Smalltalk and ruby - 2012-12-08
Smalltalk and ruby  - 2012-12-08Smalltalk and ruby  - 2012-12-08
Smalltalk and ruby - 2012-12-08
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

A Peek into TFRT

  • 1. Koan-Sin Tan, freedom@computer.org COSCUP, Aug 2nd, 2020 TensorFlow Runtime A Peek into the Future of TensorFlow 1
  • 2. • disclaimer: opinions are my own • feel free to interrupt me if you have any questions during the presentation • questions could be Taiwanese, English, or Mandarin • most of TFRT materials are adapted from TFRT deep dive in MLIR design meeting [1] and TFRT docs [2] • code around Aug 1, 2020 (git commit ecf1c20 [3]) [1] TFRT Deep Dive,  slides - recording, https://mlir.llvm.org/talks/ [2] https://github.com/tensorflow/runtime/tree/master/documents [3] https://github.com/tensorflow/runtime/commit/ecf1c20 2
  • 3. • Used open source before the term “open source” was coined • A software guy, learned to use Unix and open source software on a VAX-11/780 running 4.3BSD • Used to be a programming language junkie • Worked on various system software, e.g., CPU scheduling and power management of non- CPU components • Recently, working on NN performance on edge devices • Contributed from time to time to TensorFlow Lite • started a command line label_image for TFLite who i am https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg 3
  • 4. What is TFRT • TensorFlow Runtime (TFRT) is one of the two new MLIR-based runtimes that have emerged in 2020 so far. • The other one is the Intermediate Representation Execution Environment, IREE. So far, TFRT seems to have the better design documentation • Both of them have mobile / edge environments in mind. • I didn’t see mobile-accelerated code in TFRT yet. • IREE has some Vulkan-related code, and some simple code already works on Android • ResNet GPU inference is 28% faster with TFRT • https://github.com/tensorflow/runtime, https://youtu.be/15tiQoPpuZ8 4
  • 5. Build it • if you follow the instructions described in README.md, it should just work, at least on x86_64 Linux • however, it’s not tested for non-Linux environments yet • ssize_t and int64_t • on Mac OS X: ssize_t is long, int64_t is long long • current code mixes the use of ssize_t and int64_t • test: one of the acclaimed features of TFRT, like MLIR, is its use of LLVM FileCheck • my hacks: shape-related (ssize_t) tests are not fixed yet • it’s not tested on non-x86 platforms, such as aarch64, either • 5
  • 6. • The three key directories under the TFRT root directory are • lib: Contains core TFRT infrastructure code • backends: Contains device specific infrastructure and op/kernel implementations • include: Contains public header files for core TFRT infrastructure 6
  • 7. Walking thru the tutorial • unfortunately, it seems it’s not easy to jump directly into the source code without some background knowledge • so we’ll walk thru the tutorial [1] • What’s in the tutorial: • print hello world • print integer • adding kernels [1] https://github.com/tensorflow/runtime/blob/master/documents/tutorial.md 7
  • 8. using tfrt and tfrt_test hello.mlir func @hello() { %chain = tfrt.new.chain // Create a string containing "hello world" and store it in %hello. %hello = "tfrt_test.get_string"() { string_attr = "hello world" } : () -> !tfrt.string // Print the string in %hello. "tfrt_test.print_string"(%hello, %chain) : (!tfrt.string, !tfrt.chain) -> !tfrt.chain tfrt.return } The @hello function above shows how to create and print a string. The text after each ‘:’ specifies the types involved: • () -> !tfrt.string means that tfrt_test.get_string takes no arguments and returns a !tfrt.string. tfrt is an MLIR dialect prefix (or namespace) for TFRT • (!tfrt.string, !tfrt.chain) -> !tfrt.chain means that tfrt_test.print_string takes two arguments (!tfrt.string and !tfrt.chain) and returns a !tfrt.chain. chain [1] is a TFRT abstraction to manage dependencies [1] https://github.com/tensorflow/runtime/blob/master/documents/explicit_dependency.md 8
  • 9. hello world in MLIR func @stringconstant() -> !llvm<"[12 x i8]"> { %1 = llvm.constant("Hello world!") : !llvm<"i8*"> // CHECK: ret [12 x i8] c"Hello world!" llvm.return %1 : !llvm<"i8*"> } func @main() { %0 = llvm.constant(0) : !llvm.i64 %1 = call @stringconstant() : () -> !llvm<"[12 x i8]"> %2 = llvm.getelementptr %1[%0] : (!llvm<"[12 x i8]">, !llvm.i64) -> !llvm<"i8*"> %3 = llvm.bitcast %2 : !llvm<"i8*"> to !llvm<"i8*"> %32 = llvm.call @puts(%2) : (!llvm<"i8*">) -> !llvm.i32 return } func @puts(!llvm<"i8*">) -> !llvm.i32 • the MLIR “standard dialect” doesn’t have I/O functions • there is the LLVM dialect, of course, so we can use it to call the standard libc puts function 9
  • 10. Hello integer func @hello_integers() { %chain = tfrt.new.chain // Create an integer containing 42. %forty_two = tfrt.constant.i32 42 // Print 42. tfrt.print.i32 %forty_two, %chain tfrt.return } • as stated in the tutorial, we can run other functions in the same module • we can turn to more basic ones, such as integers or floating-point numbers • @hello_integers shows how to create and print integers • This example does not have the verbose type information we saw in @hello because there are custom parsers for the tfrt.constant.i32 and tfrt.print.i32 kernels in basic_kernels.td 10
  • 11. basic_kernels.td • .td (target description) files are for LLVM TableGen [1] TableGen, https://llvm.org/docs/TableGen/ class ConstantOp<string suffix, Type baseType, Attr attr> : TFRT_Op<"constant." # suffix, [NoSideEffect]> { let summary = "host executor constant value constructor"; let arguments = (ins attr:$value); let results = (outs baseType); } class PrintOp<string suffix, Type type> : TFRT_Op<"print." # suffix> { let summary = "tfrt.print operation"; let description = [{ An operation takes a number input and a chain input. It prints the number to stdout and returns a chain output. The chain input must be the second operand. Example: %2 = tfrt.print.i32 %0, %1 }]; let arguments = (ins type, TFRT_ChainType); let results = (outs TFRT_ChainType); let assemblyFormat = "operands attr-dict"; let verifier = ?; } https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L376-L390 https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L58-L64 11
  • 13. user defined kernels func @print_coordinate() { %chain = tfrt.new.chain %two = tfrt.constant.i32 2 %four = tfrt.constant.i32 4 %coordinate = "my.create_coordinate"(%two, %four) : (i32, i32) -> !my.coordinate "my.print_coordinate"(%coordinate, %chain) : (!my.coordinate, !tfrt.chain) -> !tfrt.chain tfrt.return } coordinate.mlir shows several TFRT features: • MLIR types that begin with exclamation mark (!) are user-defined types like !my.coordinate, compared to built-in types like i32 • Kernels are just C++ functions with a name in MLIR: my.print_coordinate is the MLIR name for the C++ PrintCoordinate function • Kernels may pass arbitrary user-defined types: my.create_coordinate passes a custom Coordinate struct to my.print_coordinate 13
  • 14. to dig into some code we need more system information 14
  • 16. • TensorFlow user passes into TFRT a TensorFlow graph created via high-level TensorFlow APIs, and • TFRT then calls the MLIR-based graph compiler to optimize and lower the graph into BEF, a Binary Executable Format for TFRT graph execution (MLIR is the compiler infrastructure that we use to represent TFRT host programs). • The blue arrows in the simplified TensorFlow training stack diagram show this flow. 16
  • 17. • In the README.md we are told to build two binaries: tfrt_translate and bef_executor • tfrt_translate • The tfrt_translate program does round-trip translation between MLIR and BEF, similar to an assembler and disassembler. • bef_executor • The bef_executor program is the execution driver of BEF files. It reads in a BEF file, sets up the runtime, and asynchronously executes function(s) in that file. 17
  • 18. TFRT Host Runtime • Foundation of TFRT: schedules work on the host and devices • Clean separation between host and device runtimes: • Host runtime does not know anything about devices, just their runtimes (sets of kernels) • Key design points: • Fully asynchronous: kernel executions cannot block • Excellent error propagation in the presence of asynchrony • Performance as a first-class concern, for graph and eager • Outline: • Common runtime infrastructure • Graph execution • Op-by-op execution (“eager”) 18
  • 19. • Container for data or resources • Not Tensor specific • A “future” type, fulfilled with exactly one value, or an error • Lock-free, low memory overhead, type erased, reference counted • Helper class AsyncValueRef<T> provides type safety when contained type is known • AsyncValues enable efficient asynchronous compute • Asynchronous functions return unavailable AsyncValues • Caller can schedule dependent computations with AsyncValue::AndThen() • Caller need not block until AsyncValue becomes available Key Abstraction: AsyncValue https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/async_value.h 19
  • 20. Kernels • Kernel: unit of computation scheduled by the runtime • Similar to kernel concept in current TensorFlow • Kernels accept AsyncValue inputs and produce AsyncValue output • Runtime coordinates dataflow of AsyncValues between kernels • Outputs may not be immediately available, unlike current TensorFlow • Runtime generally does not understand kernel semantics // Kernel that adds two integers. // AsyncKernelFrame holds the kernel’s arguments and results. static void TFRTAdd(AsyncKernelFrame* frame) { // Fetch the kernel’s 0th argument. AsyncValue* arg1 = frame->GetArgAt(0); // Fetch the kernel’s 1st argument. AsyncValue* arg2 = frame->GetArgAt(1); int v1 = arg1->get<int>(); int v2 = arg2->get<int>(); // Set the kernel’s 0th result. frame->EmplaceResultAt<int>(0, v1 + v2); } https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md https://github.com/tensorflow/runtime/blob/master/lib/basic_kernels/integer_kernels.cc#L39-L45 https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/kernel_utils.h#L61-L149 20
  • 21. Host Program • Host programs encode a dataflow graph • Similar to GraphDef in current TensorFlow • Expressed in MLIR. Typically compiler generated • Designed for low-level dispatch efficiency • Designed for compiler transformations and analysis, e.g., • Use dataflow analysis for buffer reuse func @sample_function() -> i32 { %one = tfrt.constant.i32 1 // Make AsyncValue with value 1 %two = tfrt.constant.i32 2 // Make AsyncValue with value 2 %three = tfrt.add.i32 %one, %two // Make AsyncValue with value 3 (1+2) %ch0 = tfrt.new.chain tfrt.print.i32 %three, %ch0 // Print AsyncValue %three tfrt.return %three : i32 // Return AsyncValue %three } 21
  • 22. TFRT Binary Executable Format (BEF) • BEF encodes a hardware-specific lowered graph function • Primary interface between compiler and runtime • Designed for efficient execution • Low overhead: execute program by reading mmap’d byte array • Persistent and stable: compile once offline, run many times online. Great for inference use-cases • Composed of sections, similar to ELF. Each section has its own format • Extensible: BEF is versioned, reader ignores unknown sections, new versions may define new sections https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md 22
  • 23. BEF Executor • BEF Executor evaluates a BEF dataflow graph “executor” style: • Not a bytecode-like interpreter: no concept of a program counter • “Strict” execution by default: run a kernel only when all its inputs are available • Executor features: • Lock-free: atomics instead of mutexes • Non-blocking: defer dependent work with AsyncValue::AndThen • Supports “non-strict” execution: may run a kernel when only some of its inputs are available • Good for efficiently forwarding unavailable inputs to outputs • Key concepts: • BEF: dataflow graph • Kernel: dataflow node • AsyncValue: dataflow edge https://github.com/tensorflow/runtime/blob/master/lib/bef_executor/bef_interpreter.cc#L223-L254 23
  • 25. How about Core Runtime? • Surely, we can do a similar walkthrough, but that will take more time • Two things • Op Execution API, Execute() • BEF Executor can handle it too void CoreRuntime::Impl::Execute(const ExecutionContext& exec_ctx, string_view op_name, OpHandler* op_handler, MutableArrayRef<TensorHandle> arguments, const OpAttrsRef& attrs, MutableArrayRef<TensorHandle> results, AsyncValueRef<Chain>* chain) { // Ask the op_handler to execute the op. If successful, we're done. auto op_handle = op_handler->MakeOp(op_name); if (op_handle) { op_handle.get()(exec_ctx, arguments, attrs, results, chain); return; } // Otherwise, we fail with an 'unknown op' error. auto err = EmitErrorAsync(exec_ctx, "op '" + op_name.str() + "' is not supported"); for (auto& result : results) result = TensorHandle(err.CopyRef()); if (chain) *chain = std::move(err); } 25 https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/core_runtime.cc#L124-L143 https://github.com/tensorflow/runtime/blob/master/documents/tfrt_op_by_op_execution_design.md
  • 26. BEF Executor for “op” graph • corert.executeop • sample 26 https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/kernels.cc func @example() -> !tfrt.chain { %cpu = corert.get_op_handler("cpu") // Create TensorHandles %lhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] } %rhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] } %result = corert.executeop(%cpu) "test.add" (%lhs, %rhs) %ch0 = tfrt.new.chain %ch1 = corert.print_tensorhandle(%result, %ch0) tfrt.return %ch1 : !tfrt.chain } func @example() -> !tfrt.chain { %ch0 = tfrt.new.chain %cpu = corert.get_op_handler %ch0 "cpu" // Create TensorHandles %lhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] } : 1 %rhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] } : 1 %result = corert.executeop(%cpu) "test.add" (%lhs, %rhs) : 1 %ch1 = "corert.print_tensorhandle"(%result, %ch0) : (!corert.tensorhandle, !tfrt.chain) -> !tfrt.chain tfrt.return %ch1 : !tfrt.chain }
  • 27. Device Runtime CPU 27 //===----------------------------------------------------------------------===// // CPU Relu kernels //===----------------------------------------------------------------------===// // Computes B = Relu(A). template <typename T> static AsyncValueRef<Chain> Relu(const DenseHostTensor& A, DenseHostTensor* B, const ExecutionContext& exec_ctx) { auto fn = [](auto& a, auto& b) { return a.cwiseMax(static_cast<T>(0)); }; return ::tfrt::compat::UnaryEigenKernelAsync<T, T>(A, B, std::move(fn), exec_ctx); } //===----------------------------------------------------------------------===// // CPU BiasAdd kernels //===----------------------------------------------------------------------===// // A special case of tf.add where bias is restricted to be 1-D. // Currently only support NHWC data format. template <typename T, size_t RANK> static AsyncValueRef<Chain> BiasAdd(const DenseHostTensor& input, const DenseHostTensor& bias, DenseHostTensor* output, const ExecutionContext& exec_ctx) { DHTIndexableView<T, RANK> input_view(&input); MutableDHTIndexableView<T, RANK> output_view(output); DHTIndexableView<T, 1> bias_view(&bias); const auto& shape_input = input_view.FixedShape(); const auto& shape_bias = bias_view.FixedShape(); const auto& shape_output = output_view.FixedShape(); if (shape_input != shape_output) { return EmitErrorAsync(exec_ctx, "unexpected output shape"); } if (shape_bias[0] != shape_input[RANK - 1]) { return EmitErrorAsync(exec_ctx, "bias shape does not match input shape"); } // Reshape bias to the shape of input. Broadcast along the last axis of input. Eigen::array<Eigen::Index, RANK> reshape_dims; Eigen::array<Eigen::Index, RANK> broadcast_dims; for (size_t i = 0; i < RANK - 1; ++i) { reshape_dims[i] = static_cast<Eigen::Index>(1); broadcast_dims[i] = static_cast<Eigen::Index>(shape_input[i]); } reshape_dims[RANK - 1] = static_cast<Eigen::Index>(shape_bias[0]); broadcast_dims[RANK - 1] = static_cast<Eigen::Index>(1); auto input_t = AsEigenConstTensor(input_view); auto bias_t = AsEigenConstTensor(bias_view); auto output_t = AsEigenTensor(output_view); auto expr = input_t + bias_t.reshape(reshape_dims).broadcast(broadcast_dims); return AsyncAssign( exec_ctx.host()->GetOrCreateSharedContext<EigenHostContext>(), std::move(output_t), std::move(expr), KeepBuffers::alive(&input, &bias, output)); } https://github.com/tensorflow/runtime/blob/master/backends/cpu/lib/kernels/cpu_kernels.h
  • 28. Dialects we can see now • tfrt: we know what this is for • tfrt_test: to test tfrt • tfrt_data: tf.data, to deal with input pipeline • tfrt_dht: dense host tensor • corert: Core Runtime, eager execution • ts: tensor shape • coo: COOrdinate list sparse tensor • eigen: wrapper around the eigen library • btf: binary tensor format • cuda: you know what cuda means :-) 28
  • 29. Concluding Remarks • MLIR-related talks and publications, https://mlir.llvm.org/talks/ • We scratched the surface of TFRT host runtime and core runtime. There are more details: • threading model: thread pool / work queue, • memory allocation: tcmalloc for servers, other small allocators for embedded systems, • non-strict execution, and • registers: the BEF executor is a register machine • we didn’t touch other important components such as device runtimes, esp. the GPU part, and distributed environments 29
  • 31. Device Runtime Design Principles • A thin wrapper of low-level (driver) APIs, exposing device capabilities to graph compiler • Memory Allocation • Async host <-> device transfer, and kernel execution • Dependency management • Focus on mechanism instead of policy • E.g. No built-in special-purpose streams for GPU support: • For pure eager execution, can default to one stream for everything • For tf.function execution, compiler can pick streams 31