2. • disclaimer: opinions are my own
• feel free to interrupt me if you have any questions during the presentation
• questions can be in Taiwanese, English, or Mandarin
• most of the TFRT material is adapted from the TFRT deep dive at the MLIR open design meeting [1] and the TFRT documents [2]
• code is from around Aug 1, 2020 (git commit ecf1c20 [3])
[1] TFRT Deep Dive, MLIR open design meeting (slides and recording), https://mlir.llvm.org/talks/
[2] https://github.com/tensorflow/runtime/tree/master/documents
[3] https://github.com/tensorflow/runtime/commit/ecf1c20
3. Who I am
• Used open source before the term “open source” was coined
• A software guy; learned to use Unix and open source software on a VAX-11/780 running 4.3BSD
• Used to be a programming language junkie
• Worked on various system software, e.g., CPU scheduling and power management of non-CPU components
• Recently, working on things related to NN performance on edge devices
• Contributed from time to time to TensorFlow Lite
• started the command-line label_image example for TFLite
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
4. What is TFRT
• TensorFlow Runtime (TFRT) is one of the two new MLIR-based runtimes that have emerged in 2020 so far.
• The other one is the Intermediate Representation Execution Environment (IREE). So far, TFRT seems to have the better design documentation.
• Both of them have mobile / edge environments in mind.
• I haven’t seen mobile acceleration code in TFRT yet.
• IREE already has some Vulkan-related code, and some simple examples run on Android.
• ResNet GPU inference is 28% faster with TFRT
• https://github.com/tensorflow/runtime, https://youtu.be/15tiQoPpuZ8
5. Build it
• if you follow the instructions described in README.md, it should just work, at least on x86_64 Linux
• however, it’s not tested in non-Linux environments yet
• ssize_t vs. int64_t
• on Mac OS X, ssize_t is long while int64_t is long long, so they are distinct types even though both are 64-bit
• the current code mixes ssize_t and int64_t (see the sketch at the end of this slide)
• test: one of the acclaimed features of TFRT, like MLIR, is its use of LLVM FileCheck
• with my hacks it builds, but the shape-related (ssize_t) tests are not fixed yet
• it’s not tested on non-x86 platforms, such as aarch64, either
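• a minimal sketch of the mismatch (SetShape is a hypothetical helper, not TFRT code):
#include <cstdint>      // int64_t
#include <sys/types.h>  // ssize_t
#include <vector>

// Hypothetical shape helper that stores dimensions as int64_t.
void SetShape(const std::vector<int64_t>& dims);

void Example(const std::vector<ssize_t>& dims) {
  // On x86_64 Linux both int64_t and ssize_t are long, so the two vector
  // types are identical and this compiles. On Mac OS X int64_t is long long
  // while ssize_t is long, so std::vector<ssize_t> and std::vector<int64_t>
  // are unrelated types and this call fails to compile.
  SetShape(dims);
}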
6. • The three key directories under the TFRT root directory are
• lib: Contains core TFRT infrastructure code
• backends: Contains device specific infrastructure and op/kernel implementations
• include: Contains public header files for core TFRT infrastructure
7. Walking thru the tutorial
• unfortunately, it seems it’s not easy to jump directly into the source code without some background knowledge
• so we’ll walk thru the tutorial [1]
• What’s in the tutorial
• print hello world
• print integer
• adding kernels
[1] https://github.com/tensorflow/runtime/blob/master/documents/tutorial.md
8. using tfrt and tfrt_test
hello.mlir
func @hello() {
  %chain = tfrt.new.chain
  // Create a string containing "hello world" and store it in %hello.
  %hello = "tfrt_test.get_string"() { string_attr = "hello world" } : () -> !tfrt.string
  // Print the string in %hello.
  "tfrt_test.print_string"(%hello, %chain) : (!tfrt.string, !tfrt.chain) -> !tfrt.chain
  tfrt.return
}
The @hello function above shows how to create and print a string. The text after each ‘:’ specifies the types involved:
• () -> !tfrt.string means that tfrt_test.get_string takes no arguments and returns a !tfrt.string; tfrt is an MLIR dialect prefix (or namespace) for TFRT
• (!tfrt.string, !tfrt.chain) -> !tfrt.chain means that tfrt_test.print_string takes two arguments (!tfrt.string and !tfrt.chain) and returns a !tfrt.chain. A chain [1] is a TFRT abstraction to manage dependencies
[1] https://github.com/tensorflow/runtime/blob/master/documents/explicit_dependency.md
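For flavor, a sketch of what the C++ kernels behind tfrt_test.get_string and tfrt_test.print_string could look like (assumed, not the actual test-kernel source; StringAttribute, Chain, and tfrt::outs() come from TFRT’s kernel utilities):
// Materializes the kernel's string attribute as its result.
static std::string TestGetString(StringAttribute value) {
  return value.str();
}

// Prints the string; takes and returns a chain so the side effect can be
// ordered explicitly, mirroring (!tfrt.string, !tfrt.chain) -> !tfrt.chain.
static Chain TestPrintString(const std::string& str, const Chain& in) {
  tfrt::outs() << str << "\n";
  return Chain();
}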
9. hello world in MLIR
llvm.mlir.global internal constant @hello("Hello world!\00") : !llvm<"[13 x i8]">

llvm.func @puts(!llvm<"i8*">) -> !llvm.i32

llvm.func @main() {
  %0 = llvm.mlir.constant(0 : i64) : !llvm.i64
  %1 = llvm.mlir.addressof @hello : !llvm<"[13 x i8]*">
  // Decay the [13 x i8] array to an i8* pointing at its first element.
  %2 = llvm.getelementptr %1[%0, %0] : (!llvm<"[13 x i8]*">, !llvm.i64, !llvm.i64) -> !llvm<"i8*">
  %3 = llvm.call @puts(%2) : (!llvm<"i8*">) -> !llvm.i32
  llvm.return
}
• the MLIR “standard dialect” doesn’t have I/O operations
• there is the LLVM dialect, though, so of course we can use it to call a standard libc function such as puts
10. Hello integer
func @hello_integers() {
  %chain = tfrt.new.chain
  // Create an integer containing 42.
  %forty_two = tfrt.constant.i32 42
  // Print 42.
  tfrt.print.i32 %forty_two, %chain
  tfrt.return
}
• as stated in the tutorial, we can run other functions in the same module
• we can turn to more basic types, such as integers or floating point numbers
• @hello_integers shows how to create and print integers
• this example does not have the verbose type information we saw in @hello, because there are custom parsers for the tfrt.constant.i32 and tfrt.print.i32 kernels in basic_kernels.td
11. basic_kernels.td
• .td files are descriptions for LLVM TableGen [1]
[1] TableGen, https://llvm.org/docs/TableGen/
class ConstantOp<string suffix, Type baseType, Attr attr>
    : TFRT_Op<"constant." # suffix, [NoSideEffect]> {
  let summary = "host executor constant value constructor";
  let arguments = (ins attr:$value);
  let results = (outs baseType);
}

class PrintOp<string suffix, Type type> : TFRT_Op<"print." # suffix> {
  let summary = "tfrt.print operation";
  let description = [{
    An operation takes a number input and a chain input.
    It prints the number to stdout and returns a chain output.
    The chain input must be the second operand.

    Example:
      %2 = tfrt.print.i32 %0, %1
  }];
  let arguments = (ins type, TFRT_ChainType);
  let results = (outs TFRT_ChainType);
  let assemblyFormat = "operands attr-dict";
  let verifier = ?;
}
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L376-L390
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L58-L64
13. user defined kernels
func @print_coordinate() {
  %chain = tfrt.new.chain
  %two = tfrt.constant.i32 2
  %four = tfrt.constant.i32 4
  %coordinate = "my.create_coordinate"(%two, %four) : (i32, i32) -> !my.coordinate
  "my.print_coordinate"(%coordinate, %chain) : (!my.coordinate, !tfrt.chain) -> !tfrt.chain
  tfrt.return
}
coordinate.mlir shows several TFRT features:
• MLIR types that begin with an exclamation mark (!) are user-defined types like !my.coordinate, as opposed to built-in types like i32
• Kernels are just C++ functions with a name in MLIR: my.print_coordinate is the MLIR name for the C++ PrintCoordinate function
• Kernels may pass arbitrary user-defined types: my.create_coordinate passes a custom Coordinate struct to my.print_coordinate
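The C++ side, sketched from the tutorial (the tutorial’s actual code may differ slightly; Chain, KernelRegistry, and TFRT_KERNEL are TFRT utilities):
// Plain C++ struct passed between kernels as the user-defined type
// behind !my.coordinate.
struct Coordinate {
  int32_t x = 0;
  int32_t y = 0;
};

static Coordinate CreateCoordinate(int32_t x, int32_t y) {
  return Coordinate{x, y};
}

// Takes and returns a chain so printing can be ordered explicitly.
static Chain PrintCoordinate(Coordinate coordinate) {
  tfrt::outs() << "(" << coordinate.x << ", " << coordinate.y << ")\n";
  return Chain();
}

// Registration binds the MLIR kernel names to the C++ functions.
void RegisterCoordinateKernels(KernelRegistry* registry) {
  registry->AddKernel("my.create_coordinate", TFRT_KERNEL(CreateCoordinate));
  registry->AddKernel("my.print_coordinate", TFRT_KERNEL(PrintCoordinate));
}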
14. To dig into some code, we need more system information
16. • A TensorFlow user passes into TFRT a TensorFlow graph created via high-level TensorFlow APIs, and
• TFRT then calls the MLIR-based graph compiler to optimize and lower the graph into BEF, a Binary Executable Format for TFRT graph execution (MLIR is the compiler infrastructure that we use to represent TFRT host programs).
• The blue arrows in the simplified TensorFlow training stack diagram show this flow.
17. • In the README.md we are told to build two binaries: tfrt_translate and bef_executor
• tfrt_translate
• The tfrt_translate program does round-trip translation between MLIR and BEF, similar to an assembler and disassembler.
• bef_executor
• The bef_executor program is the execution driver for BEF files. It reads in a BEF file, sets up the runtime, and asynchronously executes function(s) in that file.
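• typical usage (flag spellings as I remember them from the README; check the tools’ -help output): tfrt_translate -mlir-to-bef hello.mlir > hello.bef to assemble, then bef_executor hello.bef to run the functions in it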
18. TFRT Host Runtime
• Foundation of TFRT: schedules work on the host and devices
• Clean separation between host and device runtimes:
• Host runtime does not know anything about devices, just their runtimes (sets of kernels)
• Key design points:
• Fully asynchronous - kernel executions cannot block
• Excellent error propagation in the presence of asynchrony
• Performance as a first-class concern, for graph and eager
• Outline:
• Common runtime infrastructure
• Graph execution
• Op-by-op execution (“eager”)
19. Key Abstraction: AsyncValue
• Container for data or resources
• Not Tensor specific
• A “future” type, fulfilled with exactly one value, or an error
• Lock-free, low memory overhead, type erased, reference counted
• Helper class AsyncValueRef<T> provides type safety when the contained type is known
• AsyncValues enable efficient asynchronous compute
• Asynchronous functions return unavailable AsyncValues
• Caller can schedule dependent computations with AsyncValue::AndThen() (see the sketch below)
• Caller need not block until the AsyncValue becomes available
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/async_value.h
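A minimal sketch of non-blocking consumption via AndThen(); RunAsyncComputation and UseValue are hypothetical, while the AsyncValueRef<T> calls are from TFRT’s host_context headers:
AsyncValueRef<int> RunAsyncComputation();  // hypothetical async producer
void UseValue(int v);                      // hypothetical consumer

void Demo() {
  AsyncValueRef<int> result = RunAsyncComputation();
  // Register a continuation instead of blocking: it runs once the producer
  // fulfills the value (or sets an error), on whichever thread did so.
  result.AndThen([result = result.CopyRef()] {
    if (result.IsError()) return;  // errors flow on the same edge as values
    UseValue(result.get());        // the value is guaranteed available here
  });
}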
20. Kernels
• Kernel: unit of computation scheduled by the runtime
• Similar to kernel concept in current TensorFlow
• Kernels accept AsyncValue inputs and produce AsyncValue outputs
• Runtime coordinates dataflow of AsyncValues between kernels
• Outputs may not be immediately available, unlike current TensorFlow
• Runtime generally does not understand kernel semantics
// Kernel that adds two integers.
// AsyncKernelFrame holds the kernel’s arguments and results.
static void TFRTAdd(AsyncKernelFrame* frame) {
  // Fetch the kernel’s 0th argument.
  AsyncValue* arg1 = frame->GetArgAt(0);
  // Fetch the kernel’s 1st argument.
  AsyncValue* arg2 = frame->GetArgAt(1);
  int v1 = arg1->get<int>();
  int v2 = arg2->get<int>();
  // Set the kernel’s 0th result.
  frame->EmplaceResultAt<int>(0, v1 + v2);
}
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md
https://github.com/tensorflow/runtime/blob/master/lib/basic_kernels/integer_kernels.cc#L39-L45
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/kernel_utils.h#L61-L149
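Most kernels are not written against AsyncKernelFrame directly; kernel_utils.h (third link above) provides adapters so a kernel can be a plain typed function. A sketch in the style of integer_kernels.cc (names assumed; the actual source may differ):
// Sugared form: Argument<T> unwraps the underlying AsyncValue.
static int32_t TFRTAddI32(Argument<int32_t> arg0, Argument<int32_t> arg1) {
  return *arg0 + *arg1;
}

// TFRT_KERNEL adapts the typed function to the frame-based calling
// convention; registration binds it to its MLIR kernel name.
void RegisterIntegerKernels(KernelRegistry* registry) {
  registry->AddKernel("tfrt.add.i32", TFRT_KERNEL(TFRTAddI32));
}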
21. Host Program
• Host programs encode a dataflow graph
• Similar to GraphDef in current TensorFlow
• Expressed in MLIR. Typically compiler generated
• Designed for low-level dispatch efficiency
• Designed for compiler transformations and analysis, e.g.,
• Use dataflow analysis for buffer reuse
func @sample_function() -> i32 {
  %one = tfrt.constant.i32 1        // Make AsyncValue with value 1
  %two = tfrt.constant.i32 2        // Make AsyncValue with value 2
  %three = tfrt.add.i32 %one, %two  // Make AsyncValue with value 3 (1+2)
  %ch0 = tfrt.new.chain
  tfrt.print.i32 %three, %ch0       // Print AsyncValue %three
  tfrt.return %three : i32          // Return AsyncValue %three
}
22. TFRT Binary Executable Format (BEF)
• BEF encodes a hardware-specific lowered graph function
• Primary interface between compiler and runtime
• Designed for efficient execution
• Low overhead: execute the program by reading an mmap’d byte array
• Persistent and stable: compile once offline, run many times online. Great for inference use cases
• Composed of sections, similar to ELF. Each section has its own format
• Extensible: BEF is versioned, the reader ignores unknown sections, and new versions may define new sections (see the sketch below)
https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md
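A hypothetical sketch (not the real BEF encoding; see binary_executable_format.md above) of why “reader ignores unknown sections” buys forward compatibility - the classic sectioned-format reading pattern:
#include <cstdint>
#include <cstring>

// Hypothetical predicate: does this reader version understand section `id`?
bool IsKnownSection(uint8_t id);

// One (id, length, payload) record inside the mmap'd byte array.
struct Section {
  uint8_t id;           // section kind
  uint32_t length;      // payload size in bytes
  const uint8_t* data;  // points into the mmap'd bytes, no copy
};

void ReadSections(const uint8_t* bytes, size_t size,
                  void (*handle)(const Section&)) {
  size_t offset = 0;
  while (offset + 5 <= size) {  // 1 byte id + 4 bytes length
    Section s;
    s.id = bytes[offset];
    std::memcpy(&s.length, bytes + offset + 1, sizeof(uint32_t));
    s.data = bytes + offset + 5;
    offset += 5 + s.length;
    if (offset > size) break;    // truncated section: stop
    if (IsKnownSection(s.id)) {  // unknown ids are skipped, not errors
      handle(s);
    }
  }
}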
23. BEF Executor
• BEF Executor evaluates a BEF dataflow graph “executor” style:
• Not a bytecode-like interpreter: no concept of program counter
• “Strict” execution by default: run a kernel only when all its inputs are available
• Executor features:
• Lock-free: atomics instead of mutexes
• Non-blocking: defer dependent work with AsyncValue::AndThen
• Supports “non-strict” execution: may run a kernel when some of its inputs are available
• Good for efficiently forwarding unavailable inputs to outputs
• Key concepts:
• BEF: dataflow graph
• Kernel: dataflow node
• AsyncValue: dataflow edge
https://github.com/tensorflow/runtime/blob/master/lib/bef_executor/bef_interpreter.cc#L223-L254
25. How about Core Runtime?
• Surely, we can do a similar walkthrough, but that would take more time
• Two things:
• Op Execution API, Execute()
• BEF Executor can handle it too
void CoreRuntime::Impl::Execute(const ExecutionContext& exec_ctx,
                                string_view op_name, OpHandler* op_handler,
                                MutableArrayRef<TensorHandle> arguments,
                                const OpAttrsRef& attrs,
                                MutableArrayRef<TensorHandle> results,
                                AsyncValueRef<Chain>* chain) {
  // Ask the op_handler to execute the op. If successful, we're done.
  auto op_handle = op_handler->MakeOp(op_name);
  if (op_handle) {
    op_handle.get()(exec_ctx, arguments, attrs, results, chain);
    return;
  }
  // Otherwise, we fail with an 'unknown op' error.
  auto err =
      EmitErrorAsync(exec_ctx, "op '" + op_name.str() + "' is not supported");
  for (auto& result : results) result = TensorHandle(err.CopyRef());
  if (chain) *chain = std::move(err);
}
https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/core_runtime.cc#L124-L143
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_op_by_op_execution_design.md
27. Device Runtime
CPU
//===----------------------------------------------------------------------===//
// CPU Relu kernels
//===----------------------------------------------------------------------===//

// Computes B = Relu(A).
template <typename T>
static AsyncValueRef<Chain> Relu(const DenseHostTensor& A, DenseHostTensor* B,
                                 const ExecutionContext& exec_ctx) {
  auto fn = [](auto& a, auto& b) { return a.cwiseMax(static_cast<T>(0)); };
  return ::tfrt::compat::UnaryEigenKernelAsync<T, T>(A, B, std::move(fn),
                                                     exec_ctx);
}

//===----------------------------------------------------------------------===//
// CPU BiasAdd kernels
//===----------------------------------------------------------------------===//

// A special case of tf.add where bias is restricted to be 1-D.
// Currently only support NHWC data format.
template <typename T, size_t RANK>
static AsyncValueRef<Chain> BiasAdd(const DenseHostTensor& input,
                                    const DenseHostTensor& bias,
                                    DenseHostTensor* output,
                                    const ExecutionContext& exec_ctx) {
  DHTIndexableView<T, RANK> input_view(&input);
  MutableDHTIndexableView<T, RANK> output_view(output);
  DHTIndexableView<T, 1> bias_view(&bias);
  const auto& shape_input = input_view.FixedShape();
  const auto& shape_bias = bias_view.FixedShape();
  const auto& shape_output = output_view.FixedShape();
  if (shape_input != shape_output) {
    return EmitErrorAsync(exec_ctx, "unexpected output shape");
  }
  if (shape_bias[0] != shape_input[RANK - 1]) {
    return EmitErrorAsync(exec_ctx, "bias shape does not match input shape");
  }
  // Reshape bias to the shape of input. Broadcast along the last axis of
  // input.
  Eigen::array<Eigen::Index, RANK> reshape_dims;
  Eigen::array<Eigen::Index, RANK> broadcast_dims;
  for (size_t i = 0; i < RANK - 1; ++i) {
    reshape_dims[i] = static_cast<Eigen::Index>(1);
    broadcast_dims[i] = static_cast<Eigen::Index>(shape_input[i]);
  }
  reshape_dims[RANK - 1] = static_cast<Eigen::Index>(shape_bias[0]);
  broadcast_dims[RANK - 1] = static_cast<Eigen::Index>(1);
  auto input_t = AsEigenConstTensor(input_view);
  auto bias_t = AsEigenConstTensor(bias_view);
  auto output_t = AsEigenTensor(output_view);
  auto expr = input_t + bias_t.reshape(reshape_dims).broadcast(broadcast_dims);
  return AsyncAssign(
      exec_ctx.host()->GetOrCreateSharedContext<EigenHostContext>(),
      std::move(output_t), std::move(expr),
      KeepBuffers::alive(&input, &bias, output));
}
https://github.com/tensorflow/runtime/blob/master/backends/cpu/lib/kernels/cpu_kernels.h
28. Dialects we can see now
• tfrt: we know what this is for
• tfrt_test: to test tfrt
• tfrt_data: tf.data, to deal with input pipeline
• tfrt_dht: dense host tensor
• corert: Core Runtime, eager execution
• ts: tensor shape
• coo: COOrdinate list sparse tensor
• eigen: wrapper around the eigen library
• btf: binary tensor format
• cuda: you know what cuda means :-)
29. Concluding Remarks
• MLIR related talks and publications, https://mlir.llvm.org/talks/
• We scratched the surface of the TFRT host runtime and core runtime. There are more details:
• threading model: thread pool / work queue,
• memory allocation: tcmalloc for servers, other small allocators for embedded systems,
• non-strict execution, and
• registers: the BEF executor is a register machine
• we didn’t touch other important components, such as the device runtimes (esp. the GPU part) and the distributed environment
31. Device Runtime Design Principles
• A thin wrapper of low-level (driver) APIs, exposing device capabilities to graph compiler
• Memory Allocation
• Async host <-> device transfer, and kernel execution
• Dependency management
• Focus on mechanism instead of policy
• E.g., no built-in special-purpose streams for GPU support:
• For pure eager execution, can default to one stream for everything
• For tf.function execution, compiler can pick streams