DRIVERS-2926 BSON Binary Vector Subtype Support #1658

Merged
Changes from 28 commits (31 commits in total)
ec64aa9
Added bson_corpus test new binary subtype 9: vectors
caseyclements Sep 13, 2024
d5ab5f1
Added first draft of Binary Vector subtype spec markdown
caseyclements Sep 16, 2024
91212ca
Move bson-binary-vector.md from bson-corpus to its own dir
caseyclements Sep 17, 2024
8757836
Updates based on feedback.
caseyclements Sep 17, 2024
830632a
Added README.md for Binary Vector tests
caseyclements Sep 20, 2024
67b410d
Added tests for binary vector subtype
caseyclements Sep 20, 2024
07667a1
Broke tests into 3 files by dtype
caseyclements Sep 20, 2024
80f19fa
Added github link in Reference Implementation
caseyclements Sep 20, 2024
7255b6c
PyArrow -> Arrow
caseyclements Sep 20, 2024
0ff289b
Added reference to jira ticket
caseyclements Sep 21, 2024
b3d6ea0
Added example for Binary structure
caseyclements Sep 23, 2024
8cfc15a
Added table visualization of binary structure
caseyclements Sep 23, 2024
a6ee71b
typo
caseyclements Sep 23, 2024
0d10725
Updates from Anna's comments
caseyclements Sep 26, 2024
5935ce0
Correction from Shane's comment
caseyclements Sep 26, 2024
a8b464e
Fix typo in binary structure html table
caseyclements Sep 26, 2024
f50677b
Moved editorial comments about PACKED_BIT ambiguity to an FAQ
caseyclements Sep 27, 2024
2d4ea72
Added Required Tests section to README. Removed JSON from tests.
caseyclements Sep 27, 2024
f50b1cc
Further improvements to PACKED_BIT with padding example.
caseyclements Sep 27, 2024
d6f160b
Fixed consistency for subtype reference. Follows Tech Design Doc
caseyclements Sep 28, 2024
60088d9
Addressed Neal's comments.
caseyclements Oct 7, 2024
a1b87f7
Made clear that it is the least significant bit that is ignored.
caseyclements Oct 7, 2024
d267b2a
Additional invalid test cases for PACKED_BIT vectors
caseyclements Oct 7, 2024
2cd0b4a
Merge branch 'master' into DRIVERS-2926-BSON-Binary-Vectors
caseyclements Oct 7, 2024
0b888fb
Additional float32 binary vector test cases
caseyclements Oct 7, 2024
c823174
Added API Guidance section
caseyclements Oct 8, 2024
fcc1be5
Change mention of binary subtype from x09 to 9
caseyclements Oct 16, 2024
d00541a
Remove github link to pymongo.binary. Reference implementation now si…
caseyclements Oct 16, 2024
b30ed35
Clarification of test requirements for drivers that natively support …
caseyclements Oct 22, 2024
0ccc399
Adds Validation subsection
caseyclements Oct 22, 2024
ae32422
Add note about signature of from_vector, that it be implemented as fr…
caseyclements Oct 22, 2024
231 changes: 231 additions & 0 deletions source/bson-binary-vector/bson-binary-vector.md
@@ -0,0 +1,231 @@
# BSON Binary Subtype 9 - Vector

- Status: Pending
- Minimum Server Version: N/A

______________________________________________________________________

## Abstract

This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors
here refer to densely packed arrays of numbers, all of the same type.

## Motivation

These representations correspond to the numeric types supported by popular numerical libraries for vector processing,
such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed
format used by these libraries can result in significant memory savings and processing efficiency.

### META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Specification

This specification introduces a new BSON binary subtype, the vector, with value `9`.

Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification.
Member

This sentence seems to be hiding a lot of complexity. What should the driver APIs look like? What happens if the "padding" field is non-zero but the dtype is a multiple of 8? How does the padding field change the output? Are we planning to add a NONPACKED_BIT type which represents the data as uint1 (or bool), e.g. the user would actually give [1, 0, 0, 1] for a 4-bit vector?

Contributor Author

It is my understanding that it is up to the driver to implement the API as they please. Is that not true?

If the padding is non-zero for a dtype where it does not make sense, then the tests will fail.


### Data Types (dtypes)

Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented.

| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) |
| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- |
| `0x03` | INT8 | 8 | INT8 |
| `0x27` | FLOAT32 | 32 | FLOAT |
| `0x10` | PACKED_BIT | 1 `*` | BOOL |
Member

The previous documents I saw had other data types too, why is this spec limited to these 3?

Contributor Author

This was the defined work. The other document is not up-to-date or correct.

Contributor Author

Yesterday I gained some clarity on the initial requirements. They were for implementation in Python and MongoT. This specification is under the same time pressure, so perhaps we discuss the full design and whether this should be expanded to include those that are not yet implemented.


`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of
integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16-bit vector
`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Of course,
some languages, Python for one, do not have a uint8 type, so each value must be represented as an int in memory, though not on disk.
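
To make the footnote concrete, here is a minimal Python sketch of packing a list of bits into uint8 values. The helper name `pack_bits` is hypothetical and not part of the specification; it assumes most-significant-bit-first ordering, which is consistent with the rule that the least-significant bits of the final byte are the ones ignored.

```python
def pack_bits(bits):
    """Pack 0/1 values into uint8 values, most-significant bit first.

    Returns (packed, padding), where padding is the number of unused
    least-significant bits in the final byte. Hypothetical helper.
    """
    if any(b not in (0, 1) for b in bits):
        raise ValueError("bits must be 0 or 1")
    padding = (-len(bits)) % 8            # bits needed to fill the last byte
    padded = list(bits) + [0] * padding   # unused bits are written as zeros
    packed = []
    for i in range(0, len(padded), 8):
        byte = 0
        for bit in padded[i:i + 8]:
            byte = (byte << 1) | bit      # shift earlier bits toward the MSB
        packed.append(byte)
    return packed, padding


print(pack_bits([0] * 8 + [1] * 8))  # ([0, 255], 0)
```

A 12-bit input would yield two bytes and a padding of 4, since the last four bits of the second byte carry no data.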

### Byte padding

As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a whole number of
bytes, a second piece of metadata, the "padding", is included. This tells the driver how many bits in the
final byte are to be ignored. The least-significant bits are ignored.

### Binary structure

Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers.

- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may
grow. dtype is an unsigned integer.

- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative
integer. It must be present, even in cases where it is not applicable, and set to zero.

- The remainder contains the actual vector elements packed according to dtype.

All values use the little-endian format.
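
As an illustration of this layout, the following Python sketch builds the bytes that follow the subtype marker for INT8 and FLOAT32 vectors using the standard `struct` module. The function names are hypothetical; the dtype values come from the table above, and the printed bytes match the payload portion of the `canonical_bson` strings in the test files.

```python
import struct

FLOAT32 = 0x27
INT8 = 0x03


def payload_float32(values):
    # dtype byte, padding byte (always 0 for FLOAT32), then little-endian float32s
    return bytes([FLOAT32, 0]) + struct.pack(f"<{len(values)}f", *values)


def payload_int8(values):
    # dtype byte, padding byte (always 0 for INT8), then signed 8-bit integers
    return bytes([INT8, 0]) + struct.pack(f"<{len(values)}b", *values)


print(payload_float32([127.0, 7.0]).hex())  # 27000000fe420000e040
print(payload_int8([127, 7]).hex())         # 03007f07
```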

#### Example

Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`.

In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8.

We can visualize the binary representation like so:

<table border="1" cellspacing="0" cellpadding="5">
<tr>
<td colspan="8">1st byte: dtype (from list in previous table) </td>
<td colspan="8">2nd byte: padding (values in [0,7])</td>
<td colspan="8">1st uint8: 238</td>
<td colspan="8">2nd uint8: 224</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</table>

Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this:

| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |

## API Guidance

Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while
following idioms of the language of the driver.

Member

@vbabanin vbabanin Oct 17, 2024

It seems that the server currently does not validate the metadata of vectors when they are inserted into the database. As a result, users can insert vectors with invalid metadata if they construct binary manually, such as setting the padding to -1 or applying padding to FLOAT32.

We do have tests in the specification that enforce validation in drivers' API to ensure that vectors with corrupted metadata never reach the server. However, the specification does not require drivers to validate vector metadata when the data is being read back from the database. This means that if the metadata becomes corrupted (e.g., a FLOAT32 vector with padding greater than 0), drivers could return a Vector object/structure/class from their API's to the user that contains invalid data and violates expected invariants.

Should we mention this in the spec API guidance and enforce validation when reading vectors from the database, ensuring that drivers check for invariant violations in the binary data? This would prevent returning invalid Vector structures to users.

Contributor Author

@vbabanin This is a good idea. Would you like to propose some prose for this? How about the following?

Implementing drivers MUST validate metadata. Padding must be 0 for all dtypes where padding does not apply. When padding is applicable, it must be in [0, 7]. If implementing INT4, valid values would only be {0, 4}.

Member

That sounds good! I’d suggest making it just a bit more explicit to ensure clarity on where the validation should happen. How about something like this:

Drivers MUST validate vector metadata and raise an error if any invariant is violated:

  • Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within [0, 7] for PACKED_BIT.
  • A PACKED_BIT vector MUST NOT be empty if padding is in the range [1, 7].

Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking binary data (BSON or similar) into a Vector structure.

I didn’t include INT4 in validation requirements as it could be added later once it’s supported in the table of types.

What do you think?

Member

Validation section was added in this commit: 0ccc39944

### Encoding

```
Function from_vector(vector: Iterable<Number>, dtype: DtypeEnum, padding: Integer = 0) -> Binary
Member

from_vector currently only accepts raw vector parameters, while as_vector returns a Vector structure. To keep the API consistent, it would make sense for from_vector to accept a Vector structure directly, similar to how as_vector returns one. This would also reduce the need for manually extracting individual components when encoding a Vector.

Should from_vector also accept a Vector type as input? Perhaps this could be offered as an alternative method for drivers to implement.

Contributor Author

@caseyclements caseyclements Oct 22, 2024

@vbabanin. I completely agree. That was an oversight. What about this?

The above is suggestive only. If a driver chooses to implement a Vector type (or several),
they MAY decide that from_vector has a single argument, a Vector.

    # Converts a numeric vector into a binary representation based on the specified dtype and padding.

    # :param vector: A sequence or iterable of numbers (either float or int)
    # :param dtype: Data type for binary conversion (from DtypeEnum)
    # :param padding: Optional integer specifying how many bits to ignore in the final byte
    # :return: A binary representation of the vector

    Declare binary_data as Binary

    # Process each number in vector and convert according to dtype
    For each number in vector
        binary_element = convert_to_binary(number, dtype)
        binary_data.append(binary_element)
    End For

    # Apply padding to the binary data if needed
    If padding > 0
        apply_padding(binary_data, padding)
    End If

    Return binary_data
End Function
```
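
For readers who prefer something executable, here is one possible Python rendering of the encoding pseudocode. It is a sketch rather than the reference implementation: the constant names and error messages are assumptions, the checks follow the validation rules proposed in the review discussion, and a PACKED_BIT vector is assumed to arrive as uint8 values, so the padding is recorded as metadata rather than applied to the data.

```python
import struct

# Hypothetical constant names; the byte values come from the dtype table.
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10


def from_vector(vector, dtype, padding=0):
    """Encode numbers as a subtype-9 payload: dtype byte, padding byte, data."""
    values = list(vector)
    if dtype == PACKED_BIT:
        if not 0 <= padding <= 7:
            raise ValueError("padding must be in [0, 7] for PACKED_BIT")
        if padding and not values:
            raise ValueError("an empty PACKED_BIT vector cannot have padding")
        fmt, low, high = "B", 0, 255
    elif dtype == INT8:
        fmt, low, high = "b", -128, 127
    elif dtype == FLOAT32:
        fmt, low, high = "f", None, None
    else:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    if dtype != PACKED_BIT and padding != 0:
        raise ValueError("padding must be 0 for this dtype")
    if fmt != "f":
        # Integer dtypes reject floats and out-of-range values.
        for v in values:
            if not isinstance(v, int) or not low <= v <= high:
                raise ValueError(f"{v!r} is not a valid element for this dtype")
    return bytes([dtype, padding]) + struct.pack(f"<{len(values)}{fmt}", *values)


print(from_vector([238, 224], PACKED_BIT, padding=4))  # b'\x10\x04\xee\xe0'
```

Raising on invalid input matches the invalid cases in the test files, such as INT8 with a non-zero padding or with float inputs.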

### Decoding

```
Function as_vector() -> Vector
    # Unpacks binary data (BSON or similar) into a Vector structure.
    # This process involves extracting numeric values, the data type, and padding information.

    # :return: A BinaryVector containing the unpacked numeric values, dtype, and padding.

    Declare binary_vector as BinaryVector # Struct to hold the unpacked data

    # Extract dtype (data type) from the binary data
    binary_vector.dtype = extract_dtype_from_binary()

    # Extract padding from the binary data
    binary_vector.padding = extract_padding_from_binary()

    # Unpack the actual numeric values from the binary data according to the dtype
    binary_vector.data = unpack_numeric_values(binary_vector.dtype)

    Return binary_vector
End Function
```
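
A possible Python counterpart to the decoding pseudocode is sketched below. It takes the payload bytes as an explicit argument and returns a plain tuple instead of a Vector structure; both choices, along with the error messages, are assumptions made for brevity. It also validates the metadata on read, as suggested in the review discussion.

```python
import struct


def as_vector(payload):
    """Decode a subtype-9 payload into (data, dtype, padding)."""
    if len(payload) < 2:
        raise ValueError("payload must contain dtype and padding bytes")
    dtype, padding, body = payload[0], payload[1], payload[2:]
    if dtype == 0x10:  # PACKED_BIT: data stays as uint8 values
        if padding > 7 or (padding and not body):
            raise ValueError("invalid padding for PACKED_BIT")
        data = list(body)
    elif dtype == 0x03:  # INT8
        if padding:
            raise ValueError("padding must be 0 for INT8")
        data = list(struct.unpack(f"<{len(body)}b", body))
    elif dtype == 0x27:  # FLOAT32
        if padding or len(body) % 4:
            raise ValueError("invalid FLOAT32 payload")
        data = list(struct.unpack(f"<{len(body) // 4}f", body))
    else:
        raise ValueError(f"unsupported dtype: {dtype:#04x}")
    return data, dtype, padding


print(as_vector(bytes.fromhex("03007f07")))  # ([127, 7], 3, 0)
```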

#### Data Structures

Drivers MAY find the following structures to represent the dtype and vector structure useful.

```
Enum Dtype
    # Enum for data types (dtype)

    # FLOAT32: Represents packing of list of floats as float32
    # Value: 0x27 (hexadecimal byte value)

    # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8
    # Value: 0x03 (hexadecimal byte value)

    # PACKED_BIT: Special case where vector values are 0 or 1, packed as uint8 in range [0, 255]
    # Packed into groups of 8 (a byte)
    # Value: 0x10 (hexadecimal byte value)

    # Documentation:
    # Each value is a byte (length of one), a convenient choice for decoding.
End Enum

Struct Vector
    # Numeric vector with metadata for binary interoperability

    # Fields:
    # data: Sequence of numeric values (either float or int)
    # dtype: Data type of vector (from enum Dtype)
    # padding: Number of bits to ignore in the final byte for alignment

    data # Sequence of float or int
    dtype # Type: Dtype
    padding # Integer: Number of padding bits
End Struct
```
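
In Python, these structures could be expressed with `enum.IntEnum` and a frozen dataclass, as in the sketch below. The class layout and the checks in `__post_init__` are illustrative assumptions; drivers remain free to model the types differently.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Sequence, Union


class Dtype(IntEnum):
    """dtype byte values from the specification's table."""
    INT8 = 0x03
    FLOAT32 = 0x27
    PACKED_BIT = 0x10


@dataclass(frozen=True)
class Vector:
    """Numeric vector plus the metadata needed to encode it."""
    data: Sequence[Union[int, float]]
    dtype: Dtype
    padding: int = 0

    def __post_init__(self):
        # Enforce the metadata invariants at construction time.
        if not 0 <= self.padding <= 7:
            raise ValueError("padding must be in [0, 7]")
        if self.dtype is not Dtype.PACKED_BIT and self.padding != 0:
            raise ValueError("padding must be 0 unless dtype is PACKED_BIT")


v = Vector([238, 224], Dtype.PACKED_BIT, padding=4)
print(bytes([v.dtype, v.padding]))  # b'\x10\x04'
```

Using `IntEnum` means each member can be written directly as the dtype byte, which keeps the encoder and decoder free of lookup tables.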

## Reference Implementation

- PYTHON (PYTHON-4577)

## Test Plan

See the [README](tests/README.md) for tests.

## FAQ

- What MongoDB Server version does this apply to?
- Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
- In PACKED_BIT, why would one choose to use integers in \[0, 256)?
- This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits). This technique is
widely used across different fields, such as data compression, communication protocols, and file formats, where you
want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example
in Python, see
[numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits).
55 changes: 55 additions & 0 deletions source/bson-binary-vector/tests/README.md
@@ -0,0 +1,55 @@
# Testing Binary subtype 9: Vector

The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to
the specification.

These tests focus on the roundtrip of the list of numbers as input/output, along with their data type and byte padding.

Additional tests exist in `bson_corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector
to BSON. For this reason, drivers must create a bespoke test runner for the vector subtype.

## Format

The test data corpus consists of a JSON file for each data type (dtype). Each file contains a number of test cases,
under the top-level key "tests". Each test case pertains to a single vector. The keys provide the specification of the
vector. Valid cases also include the Canonical BSON format of a document {test_key: binary}. The "test_key" is common,
and specified at the top level.

#### Top level keys

Each JSON file contains three top-level keys.

- `description`: human-readable description of what is in the file
- `test_key`: name used for key when encoding/decoding a BSON document containing the single BSON Binary for the test
case. Applies to *every* case.
- `tests`: array of test case objects, each of which has the following keys. Valid cases will also contain additional
binary and json encoding values.

#### Keys of individual test cases

- `description`: string describing the test.
- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input.
- `vector`: list of numbers
- `dtype_hex`: string defining the data type in hex (e.g. "0x10", "0x27")
- `dtype_alias`: (optional) string defining the dtype, perhaps as Enum.
- `padding`: (optional) integer for byte padding. Defaults to 0.
- `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string.

## Required tests

To prove correctness in a valid case (`valid: true`), one MUST

- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match
those provided in the JSON.
- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the
canonical_bson string.
- For floating point number types, numerical values need not match exactly.
Contributor

What can actually be asserted about floats? Can we place any sort of mathematical limit on the non-exact match?

Member

When converting a float32 which is already stored as an approximation, like 127.699..... for 127.7, in Java for example, to a 4-byte array and back, no additional approximation error should occur since it's a direct encoding and decoding of the binary representation. So precise matching in tests seems important for drivers that support float32 to catch any mistakes in the float to bytes & bytes to float algorithm itself. Should we consider mentioning this in the test guidance for drivers that natively support float32?

Should we also specify a small margin of error for comparing floating-point values for drivers supporting only float64 where precision loss could happen?

Contributor Author

Something like this?

"For drivers that natively support types like float32, they SHOULD implement additional testing. For example, precise matching for drivers that support float32 to catch any mistakes in the float to bytes & bytes to float algorithm itself."

Member

Sounds good! I think drivers that support float32 natively have to ensure precise conversion. How about extending it with the following wording?

Drivers that natively support the floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32), MUST assert that the input float array is the same after encoding and decoding.

What do you think?

Contributor Author

I like it. I made an update. Please take a look.


To prove correctness in an invalid case (`valid:false`), one MUST

- raise an exception when attempting to encode a document from the numeric values, dtype, and padding.
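
The encode check for a valid case can be sketched without a BSON library, since the test document has a single binary element. The helper `encode_document` below is hypothetical and hand-builds the document bytes (int32 length, element type `0x05`, key as a cstring, int32 payload length, subtype `9`, payload, terminating `0x00`); a real test runner would presumably use the driver's own BSON encoder instead.

```python
import struct


def encode_document(test_key, payload):
    """Wrap a subtype-9 payload in a one-element BSON document (hypothetical helper)."""
    element = (
        b"\x05"                               # BSON element type: binary
        + test_key.encode("utf-8") + b"\x00"  # key as a cstring
        + struct.pack("<i", len(payload))     # int32 length of the binary payload
        + b"\x09"                             # binary subtype 9 (vector)
        + payload
    )
    total = 4 + len(element) + 1              # length prefix + element + terminator
    return struct.pack("<i", total) + element + b"\x00"


# "Simple Vector INT8" case from int8.json
case = {
    "vector": [127, 7],
    "dtype_hex": "0x03",
    "padding": 0,
    "canonical_bson": "1600000005766563746F7200040000000903007F0700",
}
payload = bytes([int(case["dtype_hex"], 16), case["padding"]])
payload += struct.pack(f"<{len(case['vector'])}b", *case["vector"])
assert encode_document("vector", payload).hex().upper() == case["canonical_bson"]
```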

## FAQ

- What MongoDB Server version does this apply to?
- Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
51 changes: 51 additions & 0 deletions source/bson-binary-vector/tests/float32.json
@@ -0,0 +1,51 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector FLOAT32",
"valid": true,
"vector": [127.0, 7.0],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000"
},
{
"description": "Vector with decimals and negative value FLOAT32",
"valid": true,
"vector": [127.7, -7.7],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1C00000005766563746F72000A0000000927006666FF426666F6C000"
},
{
"description": "Empty Vector FLOAT32",
"valid": true,
"vector": [],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009270000"
},
{
"description": "Infinity Vector FLOAT32",
"valid": true,
"vector": ["-inf", 0.0, "inf"],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00"
},
{
"description": "FLOAT32 with padding",
"valid": false,
"vector": [127.0, 7.0],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 3
}
]
}
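
The point raised in review about exact float32 matching can be checked against the second test case. In this Python sketch (offsets computed by hand for the key `vector`, an assumption tied to these files), the decoded values differ slightly from the decimal literals in the JSON because Python floats are float64, yet re-encoding them reproduces the original bytes exactly.

```python
import struct

canonical = bytes.fromhex("1C00000005766563746F72000A0000000927006666FF426666F6C000")

# Layout: 4 (doc length) + 1 (type) + 7 ("vector\0") + 4 (binary length) + 1 (subtype)
payload = canonical[17:27]        # dtype byte, padding byte, then two float32 values
raw_floats = payload[2:]
decoded = struct.unpack("<2f", raw_floats)

# Decoding and re-encoding is lossless at the byte level.
assert struct.pack("<2f", *decoded) == raw_floats
# Encoding the JSON literals rounds them to the same float32 values.
assert struct.pack("<2f", 127.7, -7.7) == raw_floats
# As float64, the decoded value is close to, but not equal to, the literal.
assert decoded[0] != 127.7 and abs(decoded[0] - 127.7) < 1e-5
```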

57 changes: 57 additions & 0 deletions source/bson-binary-vector/tests/int8.json
@@ -0,0 +1,57 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype INT8",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector INT8",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1600000005766563746F7200040000000903007F0700"
},
{
"description": "Empty Vector INT8",
"valid": true,
"vector": [],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009030000"
},
{
"description": "Overflow Vector INT8",
"valid": false,
"vector": [128],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
},
{
"description": "Underflow Vector INT8",
"valid": false,
"vector": [-129],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
},
{
"description": "INT8 with padding",
"valid": false,
"vector": [127, 7],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 3
},
{
"description": "INT8 with float inputs",
"valid": false,
"vector": [127.77, 7.77],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
}
]
}
