| 1 | +# ABI Stable Unified String Tensors |
| 2 | + |
| 3 | +| Status | Accepted | |
| 4 | +:-------------- |:---------------------------------------------------- | |
| 5 | +| **Author(s)** | Dero Gharibian (dero@google.com) | |
| 6 | +| **Sponsor** | Gunhan Gulsoy (gunan@google.com) | |
| 7 | +| **Updated** | 2019-04-11 | |
| 8 | + |
| 9 | +## Objective |
| 10 | + |
| 11 | +To unify and define the byte interface of a string tensor across TensorFlow’s C |
| 12 | +API (`TF_STRING`), TF Lite (`kTfLiteString`), and TF-Core/C++ (`DT_STRING`) with |
| 13 | +the purpose of enabling |
| 14 | +[modular TensorFlow](https://github.com/tensorflow/community/pull/77) |
| 15 | +and mitigating the performance overhead of string tensor conversions. |
| 16 | + |
| 17 | +## Background |
| 18 | + |
| 19 | +[C++ string tensors](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/core/framework/types.h#L392) |
| 20 | +([`DT_STRING`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/core/framework/types.proto?#L24)) |
| 21 | +in TensorFlow are defined as a |
| 22 | +[contiguous array](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/core/framework/allocator.h?#L126) |
| 23 | +of `std::string`s. |
| 24 | + |
| 25 | +In contrast, C string |
| 26 | +([`TF_STRING`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.h?#L106)) |
| 27 | +and TFLite |
| 28 | +([`kTfLiteString`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/lite/c/c_api_internal.h?#L178)) |
| 29 | +string tensors have a different public byte layout. In C, string tensors |
| 30 | +[are defined as](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.h?#L205) |
| 31 | +a list of uint64 offsets to varint prefixed char strings (where the varint |
| 32 | +defines the length of the string). Unlike C++ tensor strings, which can |
| 33 | +allocate larger strings on the heap, C string tensors are defined in a single |
| 34 | +block of contiguous memory. Similarly, in TFLite, string tensors |
| 35 | +[are defined as](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/lite/string_util.h?#L16) |
| 36 | +a list of integer offsets to character strings. The offset list is prefixed by |
| 37 | +total string count and suffixed by total buffer size. TFLite strings do not |
| 38 | +explicitly specify the length of each string; instead, lengths are inferred |
| 39 | +from the offset table. Unlike C strings, TFLite string tensors contain explicit |
| 40 | +string counts and the total buffer size in the buffer. Furthermore, since the |
| 41 | +endianness of TFLite string tensor description is explicit, TFLite strings are |
| 42 | +self-describing, exportable, and effectively mmap-able. |
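As a rough illustration of the TFLite layout described above, the following sketch reads the strings back out of such a buffer. It assumes 32-bit little-endian integers and offsets measured from the start of the buffer; `ReadLE32` and `ReadTfLiteStrings` are hypothetical helper names used only for this example, not TFLite API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reads a 32-bit little-endian integer regardless of host byte order.
inline uint32_t ReadLE32(const unsigned char* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Assumed layout: [count][offset_0]...[offset_{count-1}][total_size][chars...]
// String i spans [offset_i, offset_{i+1}); the trailing total_size doubles as
// the end offset of the last string, so no per-string length is stored.
inline std::vector<std::string> ReadTfLiteStrings(
    const std::vector<unsigned char>& buf) {
  std::vector<std::string> out;
  if (buf.size() < 8) return out;  // Too small for count + total_size.
  const uint32_t count = ReadLE32(buf.data());
  // The header holds count + 2 integers; reject truncated buffers.
  if (static_cast<uint64_t>(count) + 2 > buf.size() / 4) return out;
  for (uint32_t i = 0; i < count; ++i) {
    const uint32_t begin = ReadLE32(buf.data() + 4 * (i + 1));
    const uint32_t end = ReadLE32(buf.data() + 4 * (i + 2));
    if (begin > end || end > buf.size()) return {};  // Malformed offsets.
    out.emplace_back(reinterpret_cast<const char*>(buf.data()) + begin,
                     end - begin);
  }
  assert(out.size() == count);
  return out;
}
```

Because every length is derived from neighbouring offsets, the buffer can be read in place without any per-string allocation on the producer side, which is what makes this layout attractive for memory mapping.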
| 43 | + |
| 44 | +When string tensors are marshalled across the C API, an expensive conversion |
| 45 | +process, via |
| 46 | +[`TF_TensorToTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L417) |
| 47 | +and |
| 48 | +[`TF_TensorFromTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L489), |
| 49 | +is done to convert a |
| 50 | +`TF_STRING` to a `DT_STRING` and vice-versa. This results in a performance hit |
| 51 | +at external language binding boundaries for string tensors. Furthermore, the |
| 52 | +current implementation of the C API does not provide setters/getters or other |
| 53 | +ancillary methods for constructing a `TF_STRING`. As a result, downstream |
| 54 | +language bindings to |
| 55 | +[Java](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/java/src/main/native/tensor_jni.cc?#L270), |
| 56 | +[golang](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/go/tensor.go?#L395), |
| 57 | +etc., modify a raw buffer in order to build |
| 58 | +the index offset list of strings. A similar conversion is done when TFLite |
| 59 | +strings are passed to C++ kernels. |
| 60 | + |
| 61 | +Our aim with modular TensorFlow is to facilitate the creation of externally |
| 62 | +built and dynamically loaded kernels. With modular TensorFlow, we plan to |
| 63 | +provide a thin header-only C++ API that depends on the C API. If we do not |
| 64 | +update our approach to string tensors for modular TensorFlow, we will incur a |
| 65 | +heavy cost when processing string tensors in kernels due to the constant |
| 66 | +conversion between `DT_STRING` and `TF_STRING` across the C API. Currently, the |
| 67 | +marshalling of string tensors across TFLite and TFCore incurs a similar cost. |
| 68 | + |
| 69 | +In order to mitigate unnecessary performance degradation, we need to have a |
| 70 | +single definition of a string tensor which is ABI compatible across TFLite, C |
| 71 | +and C++. Furthermore, the string implementation needs to be ABI compatible |
| 72 | +across various compilers in order to enable modular TensorFlow. |
| 73 | + |
| 74 | +STL containers/strings are not ABI stable, and can vary across compilers, |
| 75 | +compiler versions, and even compiler flags. To mitigate these issues, we |
| 76 | +propose a lightweight ABI stable replacement for the underlying objects |
| 77 | +representing `DT_STRING`/`TF_STRING`/`kTfLiteString`s. |
| 78 | + |
| 79 | +## Overview |
| 80 | + |
| 81 | +We are proposing two sets of changes in order to (1) unify the definition of |
| 82 | +string tensors across C++/Core, C-API, and TFLite; and (2) to achieve ABI |
| 83 | +stability for string tensors across compilers on a single architecture. |
| 84 | + |
| 85 | +In order to unify the definition of string tensors in C, we propose the addition |
| 86 | +of new methods for creating and ingesting tensors in C. We are also proposing |
| 87 | +for the original method of creating tensors in the C API to be marked as |
| 88 | +deprecated. In order to support the transition, we will include a flag in the |
| 89 | +`TF_Tensor` struct to track which API the tensor was created with. Furthermore, |
| 90 | +we plan to provide accessors and mutators for string tensors in C, in order to |
| 91 | +simplify language bindings, and ease potential future changes to the byte layout |
| 92 | +of string tensors. |
| 93 | + |
| 94 | +For TFLite, we propose an additional enum for the new string tensor type, |
| 95 | +allowing for backwards compatibility with existing kTfLiteString tensors. The |
| 96 | +prototypes for string creation and string accessors in `string_util.h` and |
| 97 | +`strings.h` do not need to change. |
| 98 | + |
| 99 | +For ABI stability, we propose a new string type that can handle four string |
| 100 | +variants: local “small strings” (`SmallType`), longer heap allocated strings |
| 101 | +(`LargeType`), exportable/`mmap`-able offset based strings (`OffsetType`), and |
| 102 | +preallocated strings---with capacity defined at string tensor |
| 103 | +initialization---as a part of a contiguous buffer (`PreallocType`). |
| 104 | + |
| 105 | +## Requirements |
| 106 | + |
| 107 | +To achieve our aim of having universal ABI stable string tensors, we must adhere |
| 108 | +to the following requirements: |
| 109 | + |
| 110 | +1. Our approach must be ABI stable. |
| 111 | +2. Our approach must work with the Eigen C++ header library. |
| 112 | +3. For TFLite adoption, our approach must support direct memory mapping of |
| 113 | + string tensors. |
| 114 | +4. For TFLite adoption, our approach must allow for the packed representation of |
| 115 | + a string tensor. |
| 116 | +5. Our approach must be performant relative to the current use of std::string. |
| 117 | +6. Our approach must allow for lvalue assignment of string values. |
| 118 | +7. Our approach must allow for piecewise deployment externally. In other words, |
| 119 | + during the migration period, downstream users must be able to opt out of our |
| 120 | + new string tensor implementation. |
| 121 | + |
| 122 | +## Detailed Design |
| 123 | + |
| 124 | +We propose a new header-only ABI-stable tstring class. Our aim with tstring is |
| 125 | +to provide a simplified container for strings which fits our narrow set of |
| 126 | +requirements stipulated above. tstring is not meant as a replacement for |
| 127 | +`std::string`, and will not implement the full interface for std::string. (Note |
| 128 | +that `tensorflow::string` is currently aliased to `std::string`) |
| 129 | + |
| 130 | +Our proposed string implementation will be similar to the canonical 'Small |
| 131 | +String Optimization' (SSO) used to implement `std::string` in C++ libraries, but |
| 132 | +will feature two additional underlying string container types. In particular, |
| 133 | +in addition to having a small local definition, and a heap allocated variant |
| 134 | +for longer strings, we propose two additional types: an mmap-able/exportable |
| 135 | +offset based string tensor (`OffsetType`), and a preallocated string tensor that |
| 136 | +allocates a user-specified minimum capacity for each string as a part of a |
| 137 | +contiguous buffer (`PreallocType`). |
| 138 | + |
| 139 | +OffsetType strings will be used to replace the current TFLite string tensor |
| 140 | +layout. PreallocType strings can be used in the future for performance |
| 141 | +improvements, where N strings with M capacity can be pre-allocated from a |
| 142 | +contiguous block of memory at tensor initialization with a single malloc |
| 143 | +(instead of incurring N mallocs for large strings with the current `std::string` |
| 144 | +implementation). In the scenario where a `PreallocType` string’s capacity is |
| 145 | +exceeded, the `PreallocType` would be converted to a `LargeType`. |
| 146 | + |
| 147 | +The following is a layout overview of the proposed new string container type: |
| 148 | + |
| 149 | +```cpp |
| 150 | +namespace tensorflow { |
| 151 | +class tstring { |
| 152 | + public: |
| 153 | + static const uint8_t kSmallType = 0x00; |
| 154 | + static const uint8_t kLargeType = 0x01; |
| 155 | + static const uint8_t kOffsetType = 0x02; |
| 156 | + static const uint8_t kPreallocType = 0x03; |
| 157 | + |
| 158 | + static const uint8_t kTypeMask = 0x03; |
| 159 | + |
| 160 | + struct LargeType { |
| 161 | + size_t size_; |
| 162 | + char* ptr_; |
| 163 | + |
| 164 | + // ... |
| 165 | + }; |
| 166 | + |
| 167 | + struct PreallocType { |
| 168 | + uint32_t size_; |
| 169 | + uint32_t cap_; |
| 170 | + char* ptr_; |
| 171 | + // See “Capacity member variable” section below. |
| 172 | + |
| 173 | + // ... |
| 174 | + }; |
| 175 | + |
| 176 | + struct OffsetType { |
| 177 | + uint32_t size_; |
| 178 | + uint32_t offset_; // `this` pointer + offset_ points to char string |
| 179 | + uint32_t count_; |
| 180 | + |
| 181 | + // ... |
| 182 | + }; |
| 183 | + |
| 184 | + struct RawType { |
| 185 | + uint8_t raw_[16]; |
| 186 | + }; |
| 187 | + |
| 188 | + union UnionedType { |
| 189 | + LargeType p; |
| 190 | + OffsetType o; |
| 191 | + PreallocType f; |
| 192 | + RawType r; |
| 193 | + }; |
| 194 | + |
| 195 | + enum { |
| 196 | + SmallTypeCapacity = (sizeof(UnionedType) - sizeof(uint8_t)) / sizeof(char), |
| 197 | + }; |
| 198 | + |
| 199 | + struct SmallType { |
| 200 | + uint8_t size_; |
| 201 | + char str_[SmallTypeCapacity]; |
| 202 | + |
| 203 | + // ... |
| 204 | + }; |
| 205 | + |
| 206 | + union { |
| 207 | + LargeType p; |
| 208 | + OffsetType o; |
| 209 | + PreallocType f; |
| 210 | + SmallType s; |
| 211 | + RawType r; |
| 212 | + }; |
| 213 | + |
| 214 | + uint8_t type() const { return r.raw_[0] & kTypeMask; } |
| 215 | + |
| 216 | + // ... |
| 217 | +}; |
| 218 | +} // namespace tensorflow |
| 219 | +``` |
| 220 | + |
| 221 | + |
| 222 | + |
| 223 | +Independent of endian-ness, the first two bits (lowest order) of the first byte |
| 224 | +will denote the string type (i.e. `r.raw_[0] & 0x03` above). For all string |
| 225 | +types except `kOffsetType`, values will be stored in host byte order. Values |
| 226 | +for `OffsetType` strings will be explicitly defined as little endian. |
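One way a two-bit tag in the first byte could coexist with a size field is sketched below. This is only an assumption for illustration: the size is stored shifted left by two bits and written out least significant byte first, so the tag lands in byte 0 on any host. The names `EncodeSizeAndTag`, `TagOf`, and `DecodeSize` are hypothetical, and the proposal itself does not prescribe this encoding.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kTypeMask = 0x03;  // Low two bits of the first byte.

// Hypothetical helper: stores (size << 2 | tag) byte by byte, least
// significant byte first, so the tag always sits in raw[0].
inline void EncodeSizeAndTag(uint32_t size, uint8_t tag, uint8_t raw[4]) {
  assert(size < (1u << 30));  // Two bits are given up to the tag.
  const uint32_t packed = (size << 2) | (tag & kTypeMask);
  for (int i = 0; i < 4; ++i) {
    raw[i] = static_cast<uint8_t>(packed >> (8 * i));
  }
}

// Reads the tag from the first byte only, independent of host endianness.
inline uint8_t TagOf(const uint8_t raw[4]) { return raw[0] & kTypeMask; }

// Reassembles the packed value and drops the two tag bits.
inline uint32_t DecodeSize(const uint8_t raw[4]) {
  uint32_t packed = 0;
  for (int i = 0; i < 4; ++i) {
    packed |= static_cast<uint32_t>(raw[i]) << (8 * i);
  }
  return packed >> 2;
}
```

Under this assumption a 32-bit size field can describe strings of up to 2^30 - 1 bytes.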
| 227 | + |
| 228 | +### Eigen Compatibility |
| 229 | + |
| 230 | +Of the four string types defined above, the only type with potential Eigen |
| 231 | +compatibility issues is the `OffsetType`. Since the `OffsetType` relies on an |
| 232 | +offset value instead of a char pointer to point to the character string, and |
| 233 | +since Eigen is restrictive on how |
| 234 | +[values are indexed](https://github.com/eigenteam/eigen-git-mirror/blob/branches/3.3/unsupported/Eigen/CXX11/src/Tensor/TensorMap.h#L147), |
| 235 | +the simplest approach for |
| 236 | +providing Eigen compatibility is to define the 'offset' value as an offset from |
| 237 | +the `this` pointer of a tstring scalar, and not as an offset from the start of |
| 238 | +the tensor buffer. More concretely, accessing the character string for an |
| 239 | +`OffsetType` would be analogous to: |
| 240 | + |
| 241 | +```cpp |
| 242 | + const char* data() const { |
| 243 | + return reinterpret_cast<const char*>(this) + offset_; |
| 244 | + } |
| 245 | +``` |
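To make the `this`-relative addressing concrete, here is a minimal sketch that packs several strings into one buffer of fixed-size entries followed by the character data. `OffsetString`, `PackOffsetStrings`, and `ReadOffsetString` are hypothetical names for this example; the struct is a simplified stand-in for the proposed `OffsetType` (it omits `count_` and the type tag), and the values are kept in host byte order for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <string>
#include <vector>

// Simplified stand-in for the proposed OffsetType.
struct OffsetString {
  uint32_t size_;
  uint32_t offset_;  // Byte distance from `this` to the first character.

  const char* data() const {
    return reinterpret_cast<const char*>(this) + offset_;
  }
};

// Packs `values` as [entry_0 ... entry_{n-1}][chars...]. The storage is a
// vector of uint32_t so that every entry is suitably aligned.
inline std::vector<uint32_t> PackOffsetStrings(
    const std::vector<std::string>& values) {
  const size_t header_bytes = values.size() * sizeof(OffsetString);
  size_t total = header_bytes;
  for (const std::string& v : values) total += v.size();
  std::vector<uint32_t> buf((total + 3) / 4, 0);
  unsigned char* base = reinterpret_cast<unsigned char*>(buf.data());
  size_t cursor = header_bytes;  // Where the next character run starts.
  for (size_t i = 0; i < values.size(); ++i) {
    const size_t entry_pos = i * sizeof(OffsetString);
    new (base + entry_pos)
        OffsetString{static_cast<uint32_t>(values[i].size()),
                     static_cast<uint32_t>(cursor - entry_pos)};
    std::memcpy(base + cursor, values[i].data(), values[i].size());
    cursor += values[i].size();
  }
  assert(cursor == total);
  return buf;
}

// Indexes the buffer as a plain array of scalars, as Eigen would.
inline std::string ReadOffsetString(const std::vector<uint32_t>& buf,
                                    size_t i) {
  const OffsetString* entries =
      reinterpret_cast<const OffsetString*>(buf.data());
  return std::string(entries[i].data(), entries[i].size_);
}
```

Since each offset is relative to its own entry rather than to an absolute address, a byte-for-byte copy of the buffer remains valid at its new location, which is the property that makes this representation exportable.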
| 246 | + |
| 247 | +### Lvalue assignment |
| 248 | + |
| 249 | +In the scenario where an assignment exceeds the capacity of an `OffsetType`, `SmallType`, |
| 250 | +or `PreallocType`, the string is converted to a `LargeType` string. The original |
| 251 | +string type is copied as a prefix to the `LargeType` so that it can be reverted |
| 252 | +when a subsequent assignment fits within the original capacity. This |
| 253 | +feature will allow for lvalue assignment of OffsetType types, which, as a |
| 254 | +corollary, will allow for the assignment of TFLite string tensors. |
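The promote-and-revert behaviour can be pictured with a deliberately simplified model. The sketch below only distinguishes an inline buffer from a heap buffer; it does not model the preserved prefix, `OffsetType`, or `PreallocType`, and `MiniString` is a hypothetical class written for this example rather than the proposed `tstring`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

class MiniString {
 public:
  static constexpr size_t kInlineCapacity = 15;

  void assign(const char* s, size_t n) {
    assert(s != nullptr || n == 0);
    if (n <= kInlineCapacity) {
      // Fits inline: copy first (the source may alias our own storage),
      // then drop any heap buffer to revert to the small representation.
      if (n > 0) std::memmove(inline_, s, n);
      heap_.reset();
    } else {
      // Exceeds the inline capacity: promote to a heap allocation.
      std::unique_ptr<char[]> fresh(new char[n]);
      std::memcpy(fresh.get(), s, n);
      heap_ = std::move(fresh);
    }
    size_ = n;
  }

  bool is_large() const { return heap_ != nullptr; }
  const char* data() const { return is_large() ? heap_.get() : inline_; }
  size_t size() const { return size_; }

 private:
  char inline_[kInlineCapacity] = {};
  std::unique_ptr<char[]> heap_;
  size_t size_ = 0;
};
```

Assigning a long value switches the object to the heap representation, and a later short assignment switches it back, so callers can treat every element as an ordinary assignable lvalue.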
| 255 | + |
| 256 | +### Comparison of SSO implementations in GCC, LLVM, and MSVC |
| 257 | + |
| 258 | +| | Size in Bytes | Small String Capacity |
| 259 | +|------------------------------------------ | -- | -- |
| 260 | +|tensorflow::tstring | 16 | 15 |
| 261 | +|tensorflow::tstring w/ capacity (see below) | 24 | 23 |
| 262 | +|GCC | 32 | 15 |
| 263 | +|MSVC | 32 | 15 |
| 264 | +|LLVM | 24 | 22 |
| 265 | + |
| 266 | +### Capacity member variable |
| 267 | + |
| 268 | +Currently, TFLite has an ~8 byte overhead per string entry, which is used to |
| 269 | +describe the offset. The tstring class described above has an overhead of 16 |
| 270 | +bytes, and does not include a capacity field normally found in SSO |
| 271 | +implementations for LargeType strings. This comes with downsides. Without a |
| 272 | +capacity field, `LargeType` strings are forced to always call `realloc` on |
| 273 | +assignment. To reduce the potential number of calls to realloc, we can add a |
| 274 | +capacity field, at the cost of increasing the per string overhead to 24 bytes. |
| 275 | +This would put us at parity with LLVM strings, but would result in a 3x |
| 276 | +overhead compared to current TFLite strings. |
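The figures above can be checked with a little arithmetic. The sketch below assumes a typical 64-bit target, where `size_t` and pointers are 8 bytes each, and assumes one byte of the object is reserved for the small-string size and tag; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Large-string layouts without and with a capacity field. On a typical
// 64-bit target these occupy 16 and 24 bytes respectively.
struct LargeNoCapacity {
  size_t size_;
  char* ptr_;
};

struct LargeWithCapacity {
  size_t size_;
  size_t cap_;
  char* ptr_;
};

// Small-string capacity: the whole object minus one byte for size and tag.
constexpr size_t SmallCapacity(size_t object_bytes) {
  return object_bytes - 1;
}

// Per-string overhead relative to a baseline, e.g. TFLite's ~8 bytes.
inline double OverheadRatio(size_t per_string_bytes, size_t baseline_bytes) {
  assert(baseline_bytes > 0);
  return static_cast<double>(per_string_bytes) / baseline_bytes;
}
```

With these assumptions, `SmallCapacity(16)` is 15 and `SmallCapacity(24)` is 23, matching the table, and `OverheadRatio(24, 8)` gives the 3x figure quoted for the capacity variant.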
| 277 | + |
| 278 | +### Updates to C API |
| 279 | + |
| 280 | +Currently, downstream language bindings to |
| 281 | +[Java](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/java/src/main/native/tensor_jni.cc?#L270), |
| 282 | +[golang](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/go/tensor.go?#L395), |
| 283 | +etc., are expected to |
| 284 | +modify a raw buffer in order to construct and pass a string tensor. We propose |
| 285 | +new functions to abstract the creation of C string tensors, and a new flag in |
| 286 | +`TF_Tensor` which tracks the method with which a C string tensor was created. |
| 287 | +Using the new methods will effectively create C++ string tensors underneath, |
| 288 | +and, when passed back and forth, will mitigate the conversion of C `TF_STRING` |
| 289 | +tensors to C++ `DT_STRING` tensors and vice-versa via |
| 290 | +[`TF_TensorToTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L417) |
| 291 | +and |
| 292 | +[`TF_TensorFromTensor`](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/c/c_api.cc?#L489). |
| 293 | + |
| 294 | +### Updates to TFLite |
| 295 | + |
| 296 | +Since TFLite provides generators and accessors for TFLite string tensors, the |
| 297 | +changes needed to have TFLite conform to the `OffsetType` defined above |
| 298 | +are on the order of a ~20 line CL. Backwards compatibility can be maintained by |
| 299 | +creating a new TFLite enum for tstring separate from the existing |
| 300 | +kTfLiteString enum. |