Commit 391360e

gharibian authored and Edd Wilder-James committed
RFC: String Tensor Unification (#91)
* RFC: String Tensor Unification
* Updated rfcs/20190411-string-unification.md. Updated TFLite sections to address feedback from @jdduke. Marked as Accepted.
1 parent 320a18a commit 391360e

2 files changed

+300
-0
lines changed

rfcs/20190411-string-unification.md

Lines changed: 300 additions & 0 deletions
@@ -0,0 +1,300 @@
# ABI Stable Unified String Tensors

| Status        | Accepted                                             |
|:------------- |:---------------------------------------------------- |
| **Author(s)** | Dero Gharibian (dero@google.com)                     |
| **Sponsor**   | Gunhan Gulsoy (gunan@google.com)                     |
| **Updated**   | 2019-04-11                                           |
## Objective

To unify and define the byte interface of a string tensor across TensorFlow’s C
API (`TF_STRING`), TF Lite (`kTfLiteString`), and TF-Core/C++ (`DT_STRING`) with
the purpose of enabling
[modular TensorFlow](https://github.com/tensorflow/community/pull/77)
and mitigating the performance overhead of string tensor conversions.
## Background

[C++ string tensors](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/core/framework/types.h#L392)
([`DT_STRING`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/core/framework/types.proto?#L24))
in TensorFlow are defined as a
[contiguous array](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/core/framework/allocator.h?#L126)
of `std::string`s.

In contrast, C
([`TF_STRING`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.h?#L106))
and TFLite
([`kTfLiteString`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/lite/c/c_api_internal.h?#L178))
string tensors have a different public byte layout. In C, string tensors
[are defined as](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.h?#L205)
a list of uint64 offsets to varint-prefixed char strings (where the varint
defines the length of the string). Unlike C++ string tensors, which can
allocate larger strings on the heap, C string tensors are defined in a single
block of contiguous memory. Similarly, in TFLite, string tensors
[are defined as](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/lite/string_util.h?#L16)
a list of integer offsets to character strings. The offset list is prefixed by
the total string count and suffixed by the total buffer size. TFLite strings do
not explicitly specify the length of each string; instead, the lengths are
inferred from the offset table. Unlike C strings, TFLite string tensors contain
explicit string counts and the total buffer size in the buffer. Furthermore,
since the endianness of the TFLite string tensor description is explicit, TFLite
strings are self-describing, exportable, and effectively mmap-able.
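As a rough illustration of the C layout described above, the following sketch packs strings into an offset table followed by varint-prefixed character data and reads one back. `AppendVarint`, `EncodeStrings`, and `DecodeString` are hypothetical helpers, not part of the TensorFlow C API; the sketch also assumes that offsets are measured from the end of the offset table and stored in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Appends `value` as a base-128 varint (7 bits per byte, low bits first).
void AppendVarint(uint64_t value, std::vector<uint8_t>* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<uint8_t>((value & 0x7f) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<uint8_t>(value));
}

// Hypothetical encoder: [N uint64 offsets][varint length + chars]...
// Assumption: each offset is relative to the end of the offset table.
std::vector<uint8_t> EncodeStrings(const std::vector<std::string>& strings) {
  std::vector<uint64_t> offsets;
  std::vector<uint8_t> data;
  for (const std::string& s : strings) {
    offsets.push_back(data.size());
    AppendVarint(s.size(), &data);
    data.insert(data.end(), s.begin(), s.end());
  }
  std::vector<uint8_t> buffer(offsets.size() * sizeof(uint64_t));
  if (!offsets.empty()) {
    std::memcpy(buffer.data(), offsets.data(), buffer.size());
  }
  buffer.insert(buffer.end(), data.begin(), data.end());
  return buffer;
}

// Hypothetical decoder: returns string `index` from a buffer of `count` strings.
std::string DecodeString(const std::vector<uint8_t>& buffer, size_t count,
                         size_t index) {
  assert(index < count);
  uint64_t offset = 0;
  std::memcpy(&offset, buffer.data() + index * sizeof(uint64_t),
              sizeof(offset));
  const uint8_t* p = buffer.data() + count * sizeof(uint64_t) + offset;
  uint64_t length = 0;
  int shift = 0;
  while (*p & 0x80) {  // continuation bit set: more length bytes follow
    length |= static_cast<uint64_t>(*p++ & 0x7f) << shift;
    shift += 7;
  }
  length |= static_cast<uint64_t>(*p++) << shift;
  return std::string(reinterpret_cast<const char*>(p), length);
}
```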

When string tensors are marshalled across the C API, an expensive conversion
process, via
[`TF_TensorToTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L417)
and
[`TF_TensorFromTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L489),
is done to convert a
`TF_STRING` to a `DT_STRING` and vice versa. This results in a performance hit
at external language binding boundaries for string tensors. Furthermore, the
current implementation of the C API does not provide setters/getters or other
ancillary methods for constructing a `TF_STRING`. As a result, downstream
language bindings to
[Java](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/java/src/main/native/tensor_jni.cc?#L270),
[golang](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/go/tensor.go?#L395),
etc., modify a raw buffer in order to build
the index offset list of strings. A similar conversion is done when TFLite
strings are passed to C++ kernels.

Our aim with modular TensorFlow is to facilitate the creation of externally
built and dynamically loaded kernels. With modular TensorFlow, we plan to
provide a thin header-only C++ API that depends on the C API. If we do not
update our approach to string tensors for modular TensorFlow, we will incur a
heavy cost when processing string tensors in kernels due to the constant
conversion between `DT_STRING` and `TF_STRING` across the C API. Currently, the
marshalling of string tensors across TFLite and TFCore incurs a similar cost.

In order to mitigate unnecessary performance degradation, we need to have a
single definition of a string tensor which is ABI compatible across TFLite, C
and C++. Furthermore, the string implementation needs to be ABI compatible
across various compilers in order to enable modular TensorFlow.

STL containers/strings are not ABI stable, and can vary across compilers,
compiler versions, and even compiler flags. To mitigate these issues, we
propose a lightweight ABI stable replacement for the underlying objects
representing `DT_STRING`/`TF_STRING`/`kTfLiteString`s.
## Overview

We are proposing two sets of changes in order to (1) unify the definition of
string tensors across C++/Core, C-API, and TFLite; and (2) achieve ABI
stability for string tensors across compilers on a single architecture.

In order to unify the definition of string tensors in C, we propose the addition
of new methods for creating and ingesting tensors in C. We are also proposing
that the original method of creating tensors in the C API be marked as
deprecated. In order to support the transition, we will include a flag in the
`TF_Tensor` struct to track which API the tensor was created with. Furthermore,
we plan to provide accessors and mutators for string tensors in C, in order to
simplify language bindings, and ease potential future changes to the byte layout
of string tensors.

For TFLite, we propose an additional enum for the new string tensor type,
allowing for backwards compatibility with existing `kTfLiteString` tensors. The
prototypes for string creation and string accessors in `string_util.h` and
`strings.h` do not need to change.

For ABI stability, we propose a new string type that can handle four string
variants: local “small strings” (`SmallType`), longer heap allocated strings
(`LargeType`), exportable/`mmap`-able offset based strings (`OffsetType`), and
preallocated strings (with capacity defined at string tensor initialization)
stored as a part of a contiguous buffer (`PreallocType`).
## Requirements

To achieve our aim of having universal ABI stable string tensors, we must adhere
to the following requirements:

1. Our approach must be ABI stable.
2. Our approach must work with the Eigen C++ header library.
3. For TFLite adoption, our approach must support direct memory mapping of
   string tensors.
4. For TFLite adoption, our approach must allow for the packed representation of
   a string tensor.
5. Our approach must be performant relative to the current use of `std::string`.
6. Our approach must allow for lvalue assignment of string values.
7. Our approach must allow for piecewise deployment externally. In other words,
   during the migration period, downstream users must be able to opt out of our
   new string tensor implementation.
## Detailed Design

We propose a new header-only ABI-stable `tstring` class. Our aim with `tstring`
is to provide a simplified container for strings which fits our narrow set of
requirements stipulated above. `tstring` is not meant as a replacement for
`std::string`, and will not implement the full interface of `std::string`. (Note
that `tensorflow::string` is currently aliased to `std::string`.)

Our proposed string implementation will be similar to the canonical 'Small
String Optimization' (SSO) used to implement `std::string` in C++ libraries, but
will feature two additional underlying string container types. In particular,
in addition to having a small local definition, and a heap allocated variant
for longer strings, we propose two additional types: an mmap-able/exportable
offset based string tensor (`OffsetType`), and a preallocated string tensor that
allocates a user-specified minimum capacity for each string as a part of a
contiguous buffer (`PreallocType`).

`OffsetType` strings will be used to replace the current TFLite string tensor
layout. `PreallocType` strings can be used in the future for performance
improvements, where N strings with M capacity can be pre-allocated from a
contiguous block of memory at tensor initialization with a single malloc
(instead of incurring N mallocs for large strings with the current `std::string`
implementation). In the scenario where a `PreallocType` string’s capacity is
exceeded, the `PreallocType` would be converted to a `LargeType`.
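To make the single-malloc idea concrete, here is a minimal sketch that carves N fixed-capacity slots out of one allocation. `PreallocSlot`, `AllocateSlots`, and `Assign` are hypothetical names chosen for illustration, and the choice to place the slot headers in the same block as the character storage is an assumption, not something the proposal specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a preallocated string slot.
struct PreallocSlot {
  uint32_t size;  // bytes currently in use
  uint32_t cap;   // fixed capacity chosen at tensor initialization
  char* ptr;      // points into the shared block
};

// One malloc holds n slot headers followed by n * m bytes of character storage.
// The caller releases everything with a single std::free on the returned pointer.
PreallocSlot* AllocateSlots(uint32_t n, uint32_t m) {
  const size_t header_bytes = static_cast<size_t>(n) * sizeof(PreallocSlot);
  char* block = static_cast<char*>(
      std::malloc(header_bytes + static_cast<size_t>(n) * m));
  assert(block != nullptr);
  PreallocSlot* slots = reinterpret_cast<PreallocSlot*>(block);
  for (uint32_t i = 0; i < n; ++i) {
    slots[i].size = 0;
    slots[i].cap = m;
    slots[i].ptr = block + header_bytes + static_cast<size_t>(i) * m;
  }
  return slots;
}

// Copies `len` bytes into the slot. Returns false when the value does not fit;
// in the proposal this is the point where the string would become a LargeType.
bool Assign(PreallocSlot* slot, const char* data, uint32_t len) {
  if (len > slot->cap) return false;
  std::memcpy(slot->ptr, data, len);
  slot->size = len;
  return true;
}
```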

The following is a layout overview of the proposed new string container type:

```cpp
#include <cstddef>
#include <cstdint>

namespace tensorflow {
class tstring {
 public:
  // Type tags, stored in the two lowest-order bits of the first byte.
  static const uint8_t kSmallType = 0x00;
  static const uint8_t kLargeType = 0x01;
  static const uint8_t kOffsetType = 0x02;
  static const uint8_t kPreallocType = 0x03;

  static const uint8_t kTypeMask = 0x03;

  // Heap allocated string.
  struct LargeType {
    size_t size_;
    char* ptr_;

    // ...
  };

  // Fixed-capacity string stored inside a contiguous, preallocated buffer.
  struct PreallocType {
    uint32_t size_;
    uint32_t cap_;
    char* ptr_;
    // See “Capacity member variable” section below.

    // ...
  };

  // Position-independent string addressed relative to `this`.
  struct OffsetType {
    uint32_t size_;
    uint32_t offset_;  // `this` pointer + offset_ points to char string
    uint32_t count_;

    // ...
  };

  struct RawType {
    uint8_t raw_[16];
  };

  // Used only to compute the capacity available to SmallType.
  union UnionedType {
    LargeType p;
    OffsetType o;
    PreallocType f;
    RawType r;
  };

  enum {
    SmallTypeCapacity = (sizeof(UnionedType) - sizeof(uint8_t)) / sizeof(char),
  };

  // Short string stored inline.
  struct SmallType {
    uint8_t size_;
    char str_[SmallTypeCapacity];

    // ...
  };

  union {
    LargeType p;
    OffsetType o;
    PreallocType f;
    SmallType s;
    RawType r;
  };

  uint8_t type() const { return r.raw_[0] & kTypeMask; }

  // ...
};
}  // namespace tensorflow
```

![tstring layout](20190411-string-unification/tstring_layout.png)

Independent of endianness, the two lowest-order bits of the first byte
will denote the string type (i.e. `r.raw_[0] & 0x03` above). For all string
types except `kOffsetType`, values will be stored in host byte order. Values
for `OffsetType` strings will be explicitly defined as little endian.
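The sketch below illustrates one way the type tag could share the first byte with a size field. The proposal does not spell out how sizes are packed around the tag, so storing the size shifted left by two bits is an assumption here, and `Raw16`, `MakeSmall`, `MakeOffsetHeader`, and the accessors are hypothetical names. The offset-type header is written byte by byte in little-endian order, so the tag stays in the first byte regardless of host endianness.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr uint8_t kSmallType = 0x00;
constexpr uint8_t kOffsetType = 0x02;
constexpr uint8_t kTypeMask = 0x03;

// Hypothetical 16-byte raw view of a string scalar.
struct Raw16 {
  uint8_t raw[16];
};

// The tag is always recovered from the low two bits of the first byte.
uint8_t TypeOf(const Raw16& r) { return r.raw[0] & kTypeMask; }

// Assumption: a small string keeps (size << 2) | tag in its first byte and
// up to 15 characters in the remaining bytes.
Raw16 MakeSmall(const char* chars, uint8_t len) {
  assert(len <= 15);
  Raw16 r{};
  r.raw[0] = static_cast<uint8_t>((len << 2) | kSmallType);
  std::memcpy(r.raw + 1, chars, len);
  return r;
}

uint8_t SmallSize(const Raw16& r) { return r.raw[0] >> 2; }

// Assumption: an offset-type header stores (size << 2) | tag as an explicit
// little-endian uint32, independent of the host byte order.
Raw16 MakeOffsetHeader(uint32_t size) {
  assert(size < (1u << 30));
  Raw16 r{};
  const uint32_t tagged = (size << 2) | kOffsetType;
  for (int i = 0; i < 4; ++i) {
    r.raw[i] = static_cast<uint8_t>(tagged >> (8 * i));
  }
  return r;
}

uint32_t OffsetSize(const Raw16& r) {
  uint32_t tagged = 0;
  for (int i = 0; i < 4; ++i) {
    tagged |= static_cast<uint32_t>(r.raw[i]) << (8 * i);
  }
  return tagged >> 2;
}
```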
### Eigen Compatibility

Of the four string types defined above, the only type with potential Eigen
compatibility issues is the `OffsetType`. Since the `OffsetType` relies on an
offset value instead of a char pointer to point to the character string, and
since Eigen is restrictive on how
[values are indexed](https://github.com/eigenteam/eigen-git-mirror/blob/branches/3.3/unsupported/Eigen/CXX11/src/Tensor/TensorMap.h#L147),
the simplest approach for
providing Eigen compatibility is to define the 'offset' value as an offset from
the `this` pointer of a `tstring` scalar, and not as an offset from the start of
the tensor buffer. More concretely, accessing the character string for an
`OffsetType` would be analogous to:

```cpp
const char* data() const {
  return reinterpret_cast<const char*>(this) + offset_;
}
```
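A self-contained sketch of this addressing scheme follows. `OffsetString` and `Pack` are hypothetical, and the buffer arrangement (all headers first, then all characters) is an assumption made for illustration. Because each offset is relative to its own header, the packed buffer can be copied or mapped at a different address and every string still resolves.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical offset-based string header.
struct OffsetString {
  uint32_t size;
  uint32_t offset;  // bytes from `this` to the first character

  const char* data() const {
    return reinterpret_cast<const char*>(this) + offset;
  }
};

// Packs the strings as [header 0..n-1][characters...] in one contiguous buffer.
// Trailing elements of the vector are used purely as raw character storage.
std::vector<OffsetString> Pack(const std::vector<std::string>& strings) {
  const size_t n = strings.size();
  size_t chars = 0;
  for (const std::string& s : strings) chars += s.size();
  const size_t extra = (chars + sizeof(OffsetString) - 1) / sizeof(OffsetString);

  std::vector<OffsetString> buf(n + extra);
  char* base = reinterpret_cast<char*>(buf.data());
  size_t cursor = n * sizeof(OffsetString);  // first byte after the headers
  for (size_t i = 0; i < n; ++i) {
    buf[i].size = static_cast<uint32_t>(strings[i].size());
    // Distance from header i to its characters, not from the buffer start.
    buf[i].offset = static_cast<uint32_t>(cursor - i * sizeof(OffsetString));
    std::memcpy(base + cursor, strings[i].data(), strings[i].size());
    cursor += strings[i].size();
  }
  return buf;
}
```

Copying the returned vector relocates the whole buffer, and `data()` on each header in the copy still points at the right characters, which is the property that makes this layout suitable for memory mapping.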
### Lvalue assignment

In the scenario where an assignment exceeds the capacity of an `OffsetType`,
`SmallType`, or `PreallocType` string, the string is converted to a `LargeType`
string. The original string is copied as a prefix to the `LargeType` so that it
can be restored when a subsequent assignment fits within the original capacity.
This feature will allow for lvalue assignment of `OffsetType` strings, which, as
a corollary, will allow for the assignment of TFLite string tensors.
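The promote-and-revert behaviour can be sketched as follows. This is a simplified model, not the proposed implementation: `Str`, `Assign`, and `Data` are hypothetical, a `bool` stands in for the two type bits, and the fixed-capacity storage is owned by the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical simplified string: fixed-capacity storage or a heap block.
struct Str {
  bool is_large;  // stands in for the type tag
  uint32_t size;
  uint32_t cap;   // capacity of the original fixed storage
  char* ptr;      // fixed storage, or heap block when is_large
};

// When large, the heap block is [saved original Str][characters].
const char* Data(const Str& s) {
  return s.is_large ? s.ptr + sizeof(Str) : s.ptr;
}

void Assign(Str* s, const char* data, uint32_t len) {
  if (s->is_large) {
    Str saved;
    std::memcpy(&saved, s->ptr, sizeof(Str));
    if (len > saved.cap) {
      // Still too big for the original storage: grow the heap block.
      char* grown = static_cast<char*>(std::realloc(s->ptr, sizeof(Str) + len));
      assert(grown != nullptr);
      std::memcpy(grown + sizeof(Str), data, len);
      s->ptr = grown;
      s->size = len;
      return;
    }
    // Fits again: free the heap block and restore the saved original.
    std::free(s->ptr);
    *s = saved;
  }
  if (len <= s->cap) {
    std::memcpy(s->ptr, data, len);
    s->size = len;
    return;
  }
  // Promote: keep a copy of the original as a prefix of the heap block.
  char* block = static_cast<char*>(std::malloc(sizeof(Str) + len));
  assert(block != nullptr);
  std::memcpy(block, s, sizeof(Str));
  std::memcpy(block + sizeof(Str), data, len);
  s->is_large = true;
  s->size = len;
  s->ptr = block;
}
```

In this model a string backed by a four-byte buffer becomes large when assigned eight bytes, and returns to its original buffer on a later two-byte assignment.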
### Comparison of SSO implementations in GCC, LLVM, and MSVC

| Implementation                             | Size in Bytes | Small String Capacity |
|------------------------------------------- | ------------- | --------------------- |
| tensorflow::string                         | 16            | 15                    |
| tensorflow::string w/ capacity (see below) | 24            | 23                    |
| GCC                                        | 32            | 15                    |
| MSVC                                       | 32            | 15                    |
| LLVM                                       | 24            | 22                    |
### Capacity member variable

Currently, TFLite has an ~8 byte overhead per string entry, which is used to
describe the offset. The `tstring` class described above has an overhead of 16
bytes, and does not include the capacity field normally found in SSO
implementations for `LargeType` strings. This comes with downsides. Without a
capacity field, `LargeType` strings are forced to always call realloc on
assignment. To reduce the potential number of calls to realloc, we can add a
capacity field, at the cost of increasing the per-string overhead to 24 bytes.
This would put us at parity with LLVM strings, but would result in a 3x
overhead compared to current TFLite strings.
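The trade-off can be illustrated by counting reallocations for a sequence of assignments. Both functions below are hypothetical models rather than measurements: the first follows the statement above that every assignment reallocates, and the second assumes a capacity that grows to exactly the requested length and never shrinks.

```cpp
#include <cstddef>
#include <vector>

// Model without a capacity field: every assignment calls realloc.
int ReallocsWithoutCapacity(const std::vector<size_t>& lengths) {
  return static_cast<int>(lengths.size());
}

// Model with a capacity field: realloc only when the new length exceeds
// the capacity seen so far (assumed growth policy: grow to exactly `len`).
int ReallocsWithCapacity(const std::vector<size_t>& lengths) {
  size_t cap = 0;
  int reallocs = 0;
  for (size_t len : lengths) {
    if (len > cap) {
      cap = len;
      ++reallocs;
    }
  }
  return reallocs;
}
```

For assignments of lengths 64, 32, 48, 128, and 100, the first model reallocates five times and the second twice, at the cost of the extra 8 bytes per string discussed above.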
### Updates to C API

Currently, downstream language bindings to
[Java](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/java/src/main/native/tensor_jni.cc?#L270),
[golang](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/go/tensor.go?#L395),
etc., are expected to
modify a raw buffer in order to construct and pass a string tensor. We propose
new functions to abstract the creation of C string tensors, and a new flag in
`TF_Tensor` which tracks the method with which a C string tensor was created.
Using the new methods will effectively create C++ string tensors underneath,
and, when passed back and forth, will mitigate the conversion of C `TF_STRING`
tensors to C++ `DT_STRING` tensors and vice versa via
[`TF_TensorToTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L417)
and
[`TF_TensorFromTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L489).
### Updates to TFLite

Since TFLite provides generators and accessors for TFLite string tensors, the
requisite changes needed to have TFLite conform to the `OffsetType` defined above
are on the order of a ~20 line CL. Backwards compatibility can be maintained by
creating a new TFLite enum for `tstring`, separate from the existing
`kTfLiteString` enum.