Commit b857dd3
Implemented a more space efficient string<->integer map. (#9113)
Summary:
While investigating memory consumption, I noticed that the tiktoken loader was allocating 16 MB of memory, mainly distributed over two std::unordered_maps. These two maps (the Encoder and Decoder types) in Tiktoken are inverses of each other, and the std::string objects they contain are copies of each other.
The allocation of a node in each map is 40 bytes (on aarch64 Android):
* 2x doubly linked list pointers at 8 bytes each
* 1 std::uint64_t (8 bytes)
* 1 std::string (12 bytes, std::strings contain an internal buffer for small strings).
Each node actually allocates 48 bytes of usable memory, as the allocator aligns the allocations to 16 byte boundaries.
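The rounding step above can be checked with a few lines of C++. This is only a sketch of the arithmetic: `alignUp`, `kNodeBytes` and `kAllocAlignment` are hypothetical names, and the 40 byte and 16 byte figures are taken from the text rather than measured.

```cpp
#include <cstddef>

// Rounds `size` up to the next multiple of `alignment`.
// `alignment` must be a power of two.
constexpr std::size_t alignUp(std::size_t size, std::size_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}

// Figures quoted in the commit message (assumed, not measured here).
constexpr std::size_t kNodeBytes = 40;       // size of one map node
constexpr std::size_t kAllocAlignment = 16;  // allocator granularity

static_assert(alignUp(kNodeBytes, kAllocAlignment) == 48,
              "a 40 byte node occupies a 48 byte allocation");
```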
This implementation of the string/integer map has several features:
* Sharing of the data payload between two hash indices.
* Variable sized integers, variable sized string length fields and best fit allocation of string data. That is to say, the data payload elements are variable sized.
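To illustrate the two features listed above, here is a minimal sketch of a map that stores each record once and indexes it twice. It is not the StringIntegerMap from this commit: `ToyStringIntegerMap` and all of its members are hypothetical, it assumes C++17, unique keys, a fixed capacity, LEB128 style varints and linear probing, and it packs records back to back instead of using best fit allocation.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy class, for illustration only.
class ToyStringIntegerMap {
 public:
  // Each index gets 2 * capacity + 1 slots, so a probe always finds a free slot.
  explicit ToyStringIntegerMap(std::size_t capacity)
      : by_string_(capacity * 2 + 1, kEmpty),
        by_integer_(capacity * 2 + 1, kEmpty) {}

  // Appends one record [varint value][varint length][bytes] to the shared
  // payload and stores its offset in both indices.
  // Assumes unique keys and at most `capacity` insertions.
  void insert(std::string_view s, std::uint64_t v) {
    const auto offset = static_cast<std::uint32_t>(payload_.size());
    putVarint(v);
    putVarint(s.size());
    payload_.append(s);
    place(by_string_, std::hash<std::string_view>{}(s), offset);
    place(by_integer_, std::hash<std::uint64_t>{}(v), offset);
  }

  std::optional<std::uint64_t> findString(std::string_view s) const {
    const auto& t = by_string_;
    for (std::size_t i = std::hash<std::string_view>{}(s) % t.size();
         t[i] != kEmpty; i = (i + 1) % t.size()) {
      const Record r = read(t[i]);
      if (r.str == s) return r.value;
    }
    return std::nullopt;
  }

  // The returned view points into the payload and is invalidated by insert().
  std::optional<std::string_view> findInteger(std::uint64_t v) const {
    const auto& t = by_integer_;
    for (std::size_t i = std::hash<std::uint64_t>{}(v) % t.size();
         t[i] != kEmpty; i = (i + 1) % t.size()) {
      const Record r = read(t[i]);
      if (r.value == v) return r.str;
    }
    return std::nullopt;
  }

  std::size_t payloadBytes() const { return payload_.size(); }

 private:
  struct Record {
    std::uint64_t value;
    std::string_view str;
  };
  static constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;

  // Linear probing: take the first free slot at or after the hashed position.
  static void place(std::vector<std::uint32_t>& t, std::size_t h,
                    std::uint32_t offset) {
    std::size_t i = h % t.size();
    while (t[i] != kEmpty) i = (i + 1) % t.size();
    t[i] = offset;
  }

  // Writes 7 bits per byte, low bits first; the high bit marks continuation.
  void putVarint(std::uint64_t v) {
    while (v >= 0x80) {
      payload_.push_back(static_cast<char>((v & 0x7F) | 0x80));
      v >>= 7;
    }
    payload_.push_back(static_cast<char>(v));
  }

  std::uint64_t getVarint(std::size_t& pos) const {
    std::uint64_t v = 0;
    for (int shift = 0;; shift += 7) {
      const auto b = static_cast<std::uint8_t>(payload_[pos++]);
      v |= static_cast<std::uint64_t>(b & 0x7F) << shift;
      if ((b & 0x80) == 0) return v;
    }
  }

  Record read(std::uint32_t offset) const {
    std::size_t pos = offset;
    const std::uint64_t value = getVarint(pos);
    const std::uint64_t len = getVarint(pos);
    return {value, std::string_view(payload_).substr(pos, len)};
  }

  std::string payload_;                   // all records, packed back to back
  std::vector<std::uint32_t> by_string_;  // offsets, hashed by string
  std::vector<std::uint32_t> by_integer_; // offsets, hashed by integer
};
```

In this sketch each string is stored exactly once, and each index slot costs 4 bytes, which is where the saving over two node based maps with duplicated std::string objects would come from. Small integers and short strings also take fewer payload bytes because both the value and the length field are variable sized.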
The unit tests track the memory allocated by the old std::unordered_map approach and by the new StringIntegerMap, showing a ~6x reduction in allocated memory:
```
string integer map size = 2623343
unordered map size = 16078928
```
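The commit message does not say how the test measures these sizes. One common way to obtain such a figure for a standard container, shown here purely as an assumption, is a counting allocator. `g_allocated_bytes`, `CountingAllocator`, `Entry`, `CountedDecoder` and `bytesForEntries` are hypothetical names, and the sketch assumes C++17.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical global tally of bytes requested through the allocator.
inline std::size_t g_allocated_bytes = 0;

// Minimal allocator that records every request and forwards to std::allocator.
template <typename T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template <typename U>
  CountingAllocator(const CountingAllocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    g_allocated_bytes += n * sizeof(T);
    return std::allocator<T>{}.allocate(n);
  }
  void deallocate(T* p, std::size_t n) noexcept {
    std::allocator<T>{}.deallocate(p, n);
  }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return false;
}

using Entry = std::pair<const std::uint64_t, std::string>;
using CountedDecoder =
    std::unordered_map<std::uint64_t, std::string, std::hash<std::uint64_t>,
                       std::equal_to<std::uint64_t>, CountingAllocator<Entry>>;

// Fills a map with `n` entries and returns the bytes requested for nodes and
// bucket arrays. Heap buffers owned by the std::string values are not counted,
// because std::string still uses its default allocator here.
std::size_t bytesForEntries(std::size_t n) {
  g_allocated_bytes = 0;
  CountedDecoder map;
  for (std::size_t i = 0; i < n; ++i) {
    map[i] = "token" + std::to_string(i);
  }
  return g_allocated_bytes;
}
```

The tally counts requested bytes, so it does not include the rounding that the underlying allocator applies to each node.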
String lookups were about 1.5x faster than with std::unordered_map (4722 us vs. 7128 us), while integer lookups took about the same time (529 us vs. 536 us):
```
------------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU     Iterations
------------------------------------------------------------------------------------------------
BM_FindStringIntegerMapString/iterations:100              4722 us         4722 us            100
BM_FindStringIntegerMapInteger/iterations:100              529 us          529 us            100
BM_FindStringIntegerMapStringOptional/iterations:100      4714 us         4713 us            100
BM_FindStringIntegerMapIntegerOptional/iterations:100      537 us          536 us            100
BM_FindStdUnorderedMapString/iterations:100               7128 us         7127 us            100
BM_FindStdUnorderedMapInteger/iterations:100               536 us          536 us            100
```
Reviewed By: swolchok, larryliu0820
Differential Revision: D69472841
1 parent 77056c0 commit b857dd3
7 files changed: +957 -64 lines changed
- extension/llm/tokenizer
- test