DRIVERS-2926 BSON Binary Vector Subtype Support #1658
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,231 @@ | ||
# BSON Binary Subtype 9 - Vector | ||
|
||
- Status: Pending | ||
- Minimum Server Version: N/A | ||
|
||
______________________________________________________________________ | ||
|
||
## Abstract | ||
|
||
This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors | ||
here refer to densely packed arrays of numbers, all of the same type. | ||
|
||
## Motivation | ||
|
||
The dtypes defined in this specification correspond to the numeric types supported by popular numerical libraries for vector processing, | ||
such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed | ||
format used by these libraries can result in significant memory savings and processing efficiency. | ||
|
||
### META | ||
|
||
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and | ||
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). | ||
|
||
## Specification | ||
|
||
This specification introduces a new BSON binary subtype, the vector, with value `9`. | ||
|
||
Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. | ||
|
||
### Data Types (dtypes) | ||
|
||
Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. | ||
|
||
| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | | ||
| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- | | ||
| `0x03` | INT8 | 8 | INT8 | | ||
| `0x27` | FLOAT32 | 32 | FLOAT | | ||
| `0x10` | PACKED_BIT | 1 `*` | BOOL | | ||
Review comment: The previous documents I saw had other data types too, why is this spec limited to these 3?
Reply: This was the defined work. The other document is not up-to-date or correct.
Reply: Yesterday I gained some clarity on the initial requirements. They were for implementation in Python and MongoT. This specification is under the same time pressure, so perhaps we discuss the full design and whether this should be expanded to include those that are not yet implemented. |
||
|
||
`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of | ||
integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16-bit vector | ||
`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Some | ||
languages, Python for one, do not have a uint8 type, so each value is held as an int in memory but is still stored as a single byte on disk. | ||
|
||
### Byte padding | ||
|
||
Not all data types have a bit length equal to a multiple of 8, so a vector does not always fit squarely into a whole | ||
number of bytes. A second piece of metadata, the "padding", is therefore included. It tells the driver how many bits in | ||
the final byte are to be ignored. The least-significant bits are ignored. | ||
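
As a rough illustration of how padding interacts with PACKED_BIT data, the following Python sketch expands packed bytes into individual bits and drops the ignored low-order bits of the final byte. The helper name `unpack_bits` is hypothetical and not part of this specification, and the most-significant-bit-first ordering is inferred from the worked example in this document.

```python
def unpack_bits(packed: bytes, padding: int = 0) -> list[int]:
    """Expand packed uint8 values into bits, most-significant bit first."""
    if not 0 <= padding <= 7:
        raise ValueError("padding must be in [0, 7]")
    bits = [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]
    # The ignored bits are the least-significant bits of the final byte.
    return bits[: len(bits) - padding]


print(unpack_bits(bytes([0, 255])))
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(unpack_bits(bytes([238, 224]), padding=4))
# [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]
```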
|
||
### Binary structure | ||
|
||
Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers. | ||
|
||
- The first byte (dtype) describes the data type of the vector elements. The table above shows those that MUST be | ||
  implemented. This table may grow. dtype is an unsigned integer. | ||
|
||
- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative | ||
  integer. It must always be present; where padding is not applicable, it must be set to zero. | ||
|
||
- The remainder contains the actual vector elements packed according to dtype. | ||
|
||
All values use the little-endian format. | ||
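
To make the layout concrete, here is a minimal Python sketch that splits a subtype 9 payload into its two metadata bytes and its packed elements, using only the standard `struct` module. The function name `split_vector_payload` is hypothetical; the sample bytes are the payload of the "Simple Vector FLOAT32" test case included in this pull request.

```python
import struct


def split_vector_payload(payload: bytes) -> tuple[int, int, bytes]:
    """Split a subtype 9 payload into (dtype, padding, packed elements)."""
    if len(payload) < 2:
        raise ValueError("payload must start with the dtype and padding bytes")
    # "<BB": two unsigned bytes; endianness is irrelevant for single bytes.
    dtype, padding = struct.unpack_from("<BB", payload, 0)
    return dtype, padding, payload[2:]


payload = bytes.fromhex("27000000FE420000E040")
dtype, padding, packed = split_vector_payload(payload)
# FLOAT32 elements occupy 4 bytes each and are little-endian ("<f").
values = list(struct.unpack("<%df" % (len(packed) // 4), packed))
print(hex(dtype), padding, values)  # 0x27 0 [127.0, 7.0]
```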
|
||
#### Example | ||
|
||
Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`. | ||
|
||
In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8. | ||
|
||
We can visualize the binary representation like so: | ||
|
||
<table border="1" cellspacing="0" cellpadding="5"> | ||
<tr> | ||
<td colspan="8">1st byte: dtype (from list in previous table) </td> | ||
<td colspan="8">2nd byte: padding (values in [0,7])</td> | ||
<td colspan="8">1st uint8: 238</td> | ||
<td colspan="8">2nd uint8: 224</td> | ||
</tr> | ||
<tr> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>1</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>1</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>1</td> | ||
<td>1</td> | ||
<td>1</td> | ||
<td>0</td> | ||
<td>1</td> | ||
<td>1</td> | ||
<td>1</td> | ||
<td>0</td> | ||
<td>1</td> | ||
<td>1</td> | ||
<td>1</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
<td>0</td> | ||
</tr> | ||
</table> | ||
|
||
Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this: | ||
|
||
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | | ||
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | ||
|
||
## API Guidance | ||
|
||
Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while | ||
following idioms of the language of the driver. | ||
|
||
Review comment: It seems that the server currently does not validate the metadata of vectors when they are inserted into the database. As a result, users can insert vectors with invalid metadata if they construct binary manually, such as setting the padding to -1 or applying padding to FLOAT32. We do have tests in the specification that enforce validation in drivers' API to ensure that vectors with corrupted metadata never reach the server. However, the specification does not require drivers to validate vector metadata when the data is being read back from the database. This means that if the metadata becomes corrupted (e.g., a FLOAT32 vector with padding greater than 0), drivers could return a Vector object/structure/class from their API's to the user that contains invalid data and violates expected invariants. Should we mention this in the spec API guidance and enforce validation when reading vectors from the database, ensuring that drivers check for invariant violations in the binary data? This would prevent returning invalid Vector structures to users.
Reply: @vbabanin This is a good idea. Would you like to propose some prose for this? How about the following? Implementing drivers MUST validate metadata. Padding must be 0 for all dtypes where padding does not apply. When padding is applicable, it must be in [0, 7]. If implementing
Reply: That sounds good! I’d suggest making it just a bit more explicit to ensure clarity on where the validation should happen. How about something like this:
I didn’t include What do you think?
Reply: Validation section was added in this commit: 0ccc39944 |
||
### Encoding | ||
|
||
``` | ||
Function from_vector(vector: Iterable<Number>, dtype: DtypeEnum, padding: Integer = 0) -> Binary | ||
    # Review comment: Should | ||
    # Reply: @vbabanin. I completely agree. That was an oversight. What about this? The above is suggestive only. If a | ||
    # driver chooses to implement a Vector type (or numerous) | ||
    # Converts a numeric vector into a binary representation based on the specified dtype and padding. | ||
|
||
    # :param vector: A sequence or iterable of numbers (either float or int) | ||
    # :param dtype: Data type for binary conversion (from DtypeEnum) | ||
    # :param padding: Optional integer specifying how many bits to ignore in the final byte | ||
    # :return: A binary representation of the vector | ||
|
||
    Declare binary_data as Binary | ||
|
||
    # Process each number in vector and convert according to dtype | ||
    For each number in vector | ||
        binary_element = convert_to_binary(number, dtype) | ||
        binary_data.append(binary_element) | ||
    End For | ||
|
||
    # Apply padding to the binary data if needed | ||
    If padding > 0 | ||
        apply_padding(binary_data, padding) | ||
    End If | ||
|
||
    Return binary_data | ||
End Function | ||
``` | ||
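
One possible Python rendering of the encoding pattern is sketched below. It is illustrative only, not the reference implementation: the constant names and the `_FORMATS` table are assumptions, the function returns just the subtype 9 payload (dtype byte, padding byte, packed elements) rather than a driver `Binary` object, and it interprets "applying padding" as recording the padding value in the second metadata byte.

```python
import struct

# dtype byte values from the table in this specification.
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10
# struct format characters: signed byte, float32, unsigned byte.
_FORMATS = {INT8: "b", FLOAT32: "f", PACKED_BIT: "B"}


def from_vector(vector, dtype: int, padding: int = 0) -> bytes:
    """Return the subtype 9 payload: dtype byte, padding byte, packed elements."""
    if dtype not in _FORMATS:
        raise ValueError("unsupported dtype: %r" % (dtype,))
    if not 0 <= padding <= 7:
        raise ValueError("padding must be in [0, 7]")
    if dtype != PACKED_BIT and padding != 0:
        raise ValueError("padding must be 0 for dtypes that fill whole bytes")
    values = list(vector)
    try:
        # "<" selects little-endian packing with no alignment bytes.
        packed = struct.pack("<%d%s" % (len(values), _FORMATS[dtype]), *values)
    except (struct.error, OverflowError) as exc:
        raise ValueError("element not representable in this dtype") from exc
    return bytes([dtype, padding]) + packed


print(from_vector([238, 224], PACKED_BIT, padding=4))  # b'\x10\x04\xee\xe0'
```

Out-of-range or non-integer INT8 inputs are rejected here because `struct.pack` refuses them, which lines up with the invalid cases in the INT8 test file.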
|
||
### Decoding | ||
|
||
``` | ||
Function as_vector() -> Vector | ||
    # Unpacks binary data (BSON or similar) into a Vector structure. | ||
    # This process involves extracting numeric values, the data type, and padding information. | ||
|
||
    # :return: A Vector containing the unpacked numeric values, dtype, and padding. | ||
|
||
    Declare binary_vector as Vector # Struct to hold the unpacked data | ||
|
||
    # Extract dtype (data type) from the binary data | ||
    binary_vector.dtype = extract_dtype_from_binary() | ||
|
||
    # Extract padding from the binary data | ||
    binary_vector.padding = extract_padding_from_binary() | ||
|
||
    # Unpack the actual numeric values from the binary data according to the dtype | ||
    binary_vector.data = unpack_numeric_values(binary_vector.dtype) | ||
|
||
    Return binary_vector | ||
End Function | ||
``` | ||
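
A hedged Python sketch of the decoding side follows. The `Vector` dataclass and the metadata checks are assumptions modeled on the review discussion about validating vectors when they are read back; the exact error behavior is left to each driver.

```python
import struct
from dataclasses import dataclass

INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10


@dataclass
class Vector:
    data: list
    dtype: int
    padding: int = 0


def as_vector(payload: bytes) -> Vector:
    """Decode a subtype 9 payload, rejecting inconsistent metadata."""
    if len(payload) < 2:
        raise ValueError("payload must start with the dtype and padding bytes")
    dtype, padding, packed = payload[0], payload[1], payload[2:]
    if dtype == PACKED_BIT:
        if padding > 7:
            raise ValueError("padding must be in [0, 7]")
        data = list(packed)  # each byte is one uint8 holding 8 bits
    elif dtype in (INT8, FLOAT32):
        if padding != 0:
            raise ValueError("padding must be 0 for this dtype")
        size, code = (1, "b") if dtype == INT8 else (4, "f")
        if len(packed) % size:
            raise ValueError("data length is not a multiple of the element size")
        data = list(struct.unpack("<%d%s" % (len(packed) // size, code), packed))
    else:
        raise ValueError("unsupported dtype: %#04x" % dtype)
    return Vector(data, dtype, padding)


print(as_vector(b"\x10\x04\xee\xe0"))
# Vector(data=[238, 224], dtype=16, padding=4)
```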
|
||
#### Data Structures | ||
|
||
Drivers MAY find the following structures useful for representing the dtype and the vector. | ||
|
||
``` | ||
Enum DtypeEnum | ||
    # Enum for data types (dtype) | ||
|
||
    # FLOAT32: Represents packing of list of floats as float32 | ||
    # Value: 0x27 (hexadecimal byte value) | ||
|
||
    # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8 | ||
    # Value: 0x03 (hexadecimal byte value) | ||
|
||
    # PACKED_BIT: Special case where vector values are 0 or 1, packed as uint8 in range [0, 255] | ||
    # Packed into groups of 8 (a byte) | ||
    # Value: 0x10 (hexadecimal byte value) | ||
|
||
    # Documentation: | ||
    # Each value is a byte (length of one), a convenient choice for decoding. | ||
End Enum | ||
|
||
Struct Vector | ||
    # Numeric vector with metadata for binary interoperability | ||
|
||
    # Fields: | ||
    # data: Sequence of numeric values (either float or int) | ||
    # dtype: Data type of vector (from enum DtypeEnum) | ||
    # padding: Number of bits to ignore in the final byte for alignment | ||
|
||
    data # Sequence of float or int | ||
    dtype # Type: DtypeEnum | ||
    padding # Integer: Number of padding bits | ||
End Struct | ||
``` | ||
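
For drivers written in Python, these structures might look roughly like the sketch below. The class names mirror the pseudocode but are otherwise an assumption; storing each enum value as a one-byte `bytes` object is one way to make the "each value is a byte" remark concrete, since the dtype can then be looked up directly from a one-byte slice of the payload.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Union


class Dtype(Enum):
    """dtype markers; each value is a single byte."""
    INT8 = b"\x03"
    FLOAT32 = b"\x27"
    PACKED_BIT = b"\x10"


@dataclass
class Vector:
    data: Sequence[Union[int, float]]
    dtype: Dtype
    padding: int = 0  # number of bits to ignore in the final byte


# Decoding the dtype is a direct enum lookup on the first payload byte.
header = b"\x10\x04"
vec = Vector(data=[238, 224], dtype=Dtype(header[0:1]), padding=header[1])
print(vec.dtype.name, vec.padding)  # PACKED_BIT 4
```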
|
||
## Reference Implementation | ||
|
||
- PYTHON (PYTHON-4577) | ||
|
||
## Test Plan | ||
|
||
See the [README](tests/README.md) for tests. | ||
|
||
## FAQ | ||
|
||
- What MongoDB Server version does this apply to? | ||
  - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. | ||
- In PACKED_BIT, why would one choose to use integers in \[0, 256)? | ||
  - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits). This technique is | ||
    widely used across different fields, such as data compression, communication protocols, and file formats, where you | ||
    want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example | ||
    in Python, see | ||
    [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Testing Binary subtype 9: Vector | ||
|
||
The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to | ||
the specification. | ||
|
||
These tests focus on the roundtrip of the list of numbers as input/output, along with their data type and byte padding. | ||
|
||
Additional tests exist in `bson_corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector | ||
to BSON. For this reason, drivers must create a bespoke test runner for the vector subtype. | ||
|
||
## Format | ||
|
||
The test data corpus consists of a JSON file for each data type (dtype). Each file contains a number of test cases, | ||
under the top-level key "tests". Each test case pertains to a single vector. The keys provide the specification of the | ||
vector. Valid cases also include the Canonical BSON format of a document {test_key: binary}. The "test_key" is common, | ||
and specified at the top level. | ||
|
||
#### Top level keys | ||
|
||
Each JSON file contains three top-level keys. | ||
|
||
- `description`: human-readable description of what is in the file | ||
- `test_key`: name used for key when encoding/decoding a BSON document containing the single BSON Binary for the test | ||
case. Applies to *every* case. | ||
- `tests`: array of test case objects, each of which has the keys described below. Valid cases will also contain additional | ||
  binary and json encoding values. | ||
|
||
#### Keys of individual test cases | ||
|
||
- `description`: string describing the test. | ||
- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input. | ||
- `vector`: list of numbers | ||
- `dtype_hex`: string defining the data type in hex (e.g. "0x10", "0x27") | ||
- `dtype_alias`: (optional) string naming the dtype, perhaps as an Enum member. | ||
- `padding`: (optional) integer for byte padding. Defaults to 0. | ||
- `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string. | ||
|
||
## Required tests | ||
|
||
To prove correctness in a valid case (`valid: true`), one MUST | ||
|
||
- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match | ||
those provided in the JSON. | ||
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the | ||
canonical_bson string. | ||
- For floating point number types, numerical values need not match exactly. | ||
Review comment: What can actually be asserted about floats? Can we place any sort of mathematical limit on the non-exact match?
Reply: When converting a float32 which is already stored as an approximation, like 127.699..... for 127.7, in Java for example, to a 4-byte array and back, no additional approximation error should occur since it's a direct encoding and decoding of the binary representation. So precise matching in tests seems important for drivers that support float32 to catch any mistakes in the Should we also specify a small margin of error for comparing floating-point values for drivers supporting only float64 where precision loss could happen?
Reply: Something like this? "For drivers that natively support types like float32, they SHOULD implement additional testing. For example, precise matching for drivers that support float32 to catch any mistakes in the float to bytes & bytes to float algorithm itself."
Reply: Sounds good! I think drivers that support
What do you think?
Reply: I like it. I made an update. Please take a look. |
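
As a sketch of what a float comparison could look like in a language without a native float32 type, the Python snippet below rounds the JSON literals to float32 before comparing, so the match can be exact rather than tolerance-based. The helper `to_float32` is hypothetical; the packed bytes are the FLOAT32 elements from the "Vector with decimals and negative value FLOAT32" test case.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


expected = [127.7, -7.7]                    # values as written in the JSON test file
packed = bytes.fromhex("6666FF426666F6C0")  # float32 elements from canonical_bson
decoded = list(struct.unpack("<2f", packed))

# The decoded values differ from the JSON literals, but match them exactly
# once the literals are themselves rounded to float32.
assert decoded != expected
assert decoded == [to_float32(v) for v in expected]
```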
||
|
||
To prove correctness in an invalid case (`valid: false`), one MUST | ||
|
||
- raise an exception when attempting to encode a document from the numeric values, dtype, and padding. | ||
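
The decoding half of a valid-case check can be sketched in Python without a BSON library, assuming each `canonical_bson` document holds exactly one Binary element under `test_key`. The function name `extract_vector_payload` is hypothetical; a real test runner would normally use the driver's own BSON decoder and Vector API, and would additionally assert that encoding raises for `valid: false` cases.

```python
import struct


def extract_vector_payload(canonical_bson: str, test_key: str) -> bytes:
    """Return the subtype 9 payload from a single-element BSON document."""
    doc = bytes.fromhex(canonical_bson)
    (doc_len,) = struct.unpack_from("<i", doc, 0)
    if doc_len != len(doc) or doc[-1] != 0x00:
        raise ValueError("malformed BSON document")
    if doc[4] != 0x05:  # element type 0x05 is BSON Binary
        raise ValueError("expected a Binary element")
    key_end = doc.index(b"\x00", 5)  # element names are NUL-terminated
    if doc[5:key_end].decode("utf-8") != test_key:
        raise ValueError("unexpected key")
    (bin_len,) = struct.unpack_from("<i", doc, key_end + 1)
    if doc[key_end + 5] != 0x09:
        raise ValueError("expected binary subtype 9")
    start = key_end + 6
    return doc[start:start + bin_len]


# The "Simple Vector INT8" case from the test corpus in this pull request.
case = {
    "vector": [127, 7],
    "dtype_hex": "0x03",
    "padding": 0,
    "canonical_bson": "1600000005766563746F7200040000000903007F0700",
}
payload = extract_vector_payload(case["canonical_bson"], "vector")
assert payload[0] == int(case["dtype_hex"], 16)
assert payload[1] == case.get("padding", 0)
# INT8 elements are signed bytes.
assert list(struct.unpack("<%db" % (len(payload) - 2), payload[2:])) == case["vector"]
```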
|
||
## FAQ | ||
|
||
- What MongoDB Server version does this apply to? | ||
  - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
{ | ||
"description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32", | ||
"test_key": "vector", | ||
"tests": [ | ||
{ | ||
"description": "Simple Vector FLOAT32", | ||
"valid": true, | ||
"vector": [127.0, 7.0], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 0, | ||
"canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000" | ||
}, | ||
{ | ||
"description": "Vector with decimals and negative value FLOAT32", | ||
"valid": true, | ||
"vector": [127.7, -7.7], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 0, | ||
"canonical_bson": "1C00000005766563746F72000A0000000927006666FF426666F6C000" | ||
}, | ||
{ | ||
"description": "Empty Vector FLOAT32", | ||
"valid": true, | ||
"vector": [], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 0, | ||
"canonical_bson": "1400000005766563746F72000200000009270000" | ||
}, | ||
{ | ||
"description": "Infinity Vector FLOAT32", | ||
"valid": true, | ||
"vector": ["-inf", 0.0, "inf"], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 0, | ||
"canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00" | ||
}, | ||
{ | ||
"description": "FLOAT32 with padding", | ||
"valid": false, | ||
"vector": [127.0, 7.0], | ||
"dtype_hex": "0x27", | ||
"dtype_alias": "FLOAT32", | ||
"padding": 3 | ||
} | ||
] | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
{ | ||
"description": "Tests of Binary subtype 9, Vectors, with dtype INT8", | ||
"test_key": "vector", | ||
"tests": [ | ||
{ | ||
"description": "Simple Vector INT8", | ||
"valid": true, | ||
"vector": [127, 7], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0, | ||
"canonical_bson": "1600000005766563746F7200040000000903007F0700" | ||
}, | ||
{ | ||
"description": "Empty Vector INT8", | ||
"valid": true, | ||
"vector": [], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0, | ||
"canonical_bson": "1400000005766563746F72000200000009030000" | ||
}, | ||
{ | ||
"description": "Overflow Vector INT8", | ||
"valid": false, | ||
"vector": [128], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0 | ||
}, | ||
{ | ||
"description": "Underflow Vector INT8", | ||
"valid": false, | ||
"vector": [-129], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0 | ||
}, | ||
{ | ||
"description": "INT8 with padding", | ||
"valid": false, | ||
"vector": [127, 7], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 3 | ||
}, | ||
{ | ||
"description": "INT8 with float inputs", | ||
"valid": false, | ||
"vector": [127.77, 7.77], | ||
"dtype_hex": "0x03", | ||
"dtype_alias": "INT8", | ||
"padding": 0 | ||
} | ||
] | ||
} | ||
|
This sentence seems to be hiding a lot of complexity. What should the driver APIs look like? What happens if the "padding" field is non-zero but the dtype is a multiple of 8? How does the padding field change the output? Are we planning to add a NONPACKED_BIT type which represents the data as uint1 (or bool), e.g., the user would actually give [1, 0, 0, 1] for a 4-bit vector?
It is my understanding that it is up to the driver to implement the API as they please. Is that not true?
If the padding is non-zero for a dtype where it does not make sense, then the tests will fail.