mongodb · caseyclements · Oct 23, 2024 · Sep 13, 2024 · Sep 16, 2024 · Sep 17, 2024
@@ -0,0 +1,231 @@
+# BSON Binary Subtype 9 - Vector
+
+- Status: Pending
+- Minimum Server Version: N/A
+
+______________________________________________________________________
+
+## Abstract
+
+This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors
+here refer to densely packed arrays of numbers, all of the same type.
+
+## Motivation
+
+The vector data types defined here correspond to numeric types supported by popular numerical libraries for vector processing,
+such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed
+format used by these libraries can result in significant memory savings and processing efficiency.
+
+### META
+
+The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
+"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
+
+## Specification
+
+This specification introduces a new BSON binary subtype, the vector, with value `9`.
+
+Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification.
+
+### Data Types (dtypes)
+
+Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented.
+
+| Vector data type | Alias      | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) |
+| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- |
+| `0x03`           | INT8       | 8                       | INT8                                                                                      |
+| `0x27`           | FLOAT32    | 32                      | FLOAT                                                                                     |
+| `0x10`           | PACKED_BIT | 1     `*`               | BOOL                                                                                      |
+
+`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of
+integers in \[0, 255\]. So, for example, the vector `[0, 255]` would be shorthand for the 16-bit vector
+`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. The idea is that each number (a uint8) can be stored as a single byte. Some
+languages, Python for one, have no uint8 type, so each value must be held as a wider int in memory, but it is still
+stored as a single byte on disk.
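
As a non-normative illustration of the shorthand described above, the following Python sketch expands a list of uint8 values into individual bits. The helper name `expand_packed_bits` is hypothetical, and the most-significant-bit-first ordering is an assumption taken from the worked example in the Binary structure section.

```python
def expand_packed_bits(packed):
    """Expand uint8 values into a flat list of bits, most-significant bit first."""
    bits = []
    for byte in packed:
        if not 0 <= byte <= 255:
            raise ValueError(f"{byte} does not fit in a uint8")
        # Walk from bit 7 (most significant) down to bit 0.
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    return bits


print(expand_packed_bits([0, 255]))
```

Each input integer contributes exactly eight bits, so `[0, 255]` yields eight 0s followed by eight 1s.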
+
+### Byte padding
+
+Not all data types have a bit length that is a multiple of 8, so a vector does not always fill a whole number of
+bytes. A second piece of metadata, the "padding", is therefore included. It tells the driver how many bits in the
+final byte are to be ignored. The least-significant bits are the ones ignored.
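
A minimal Python sketch of how padding could be derived when packing a bit sequence whose length is not a multiple of 8. The function name `pack_bits` is hypothetical, and filling the ignored positions with zeros is an assumption: this document only states that those bits are ignored.

```python
def pack_bits(bits):
    """Pack 0/1 values into bytes and return (packed_bytes, padding)."""
    bits = list(bits)
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("PACKED_BIT elements must be 0 or 1")
    padding = (-len(bits)) % 8        # number of unused bits in the final byte
    padded = bits + [0] * padding     # assumption: unused low-order bits are zero
    packed = bytearray()
    for start in range(0, len(padded), 8):
        byte = 0
        for bit in padded[start:start + 8]:
            byte = (byte << 1) | bit  # the first bit becomes the most significant
        packed.append(byte)
    return bytes(packed), padding


packed, padding = pack_bits([1, 1, 1, 0] * 3)  # a 12-bit vector
print(list(packed), padding)
```

Twelve bits occupy two bytes, leaving the four least-significant bits of the second byte as padding.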
+
+### Binary structure
+
+Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers.
+
+- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may
+  grow. dtype is an unsigned integer.
+
+- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative
+  integer. It must always be present, and it is set to zero in cases where it is not applicable.
+
+- The remainder contains the actual vector elements packed according to dtype.
+
+All values use the little-endian format.
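
To make the layout concrete, here is a small Python sketch that builds the bytes following the subtype for a FLOAT32 vector using the standard `struct` module. The function name `float32_payload` is hypothetical. The expected bytes can be compared against the FLOAT32 test corpus in this change.

```python
import struct

FLOAT32 = 0x27  # dtype byte from the table above


def float32_payload(values):
    """Build dtype byte, padding byte, then the little-endian packed floats."""
    header = bytes([FLOAT32, 0])  # padding is 0 because 32 bits is byte-aligned
    # "<" selects little-endian; "f" is an IEEE 754 binary32 value.
    body = struct.pack(f"<{len(values)}f", *values)
    return header + body


print(float32_payload([127.0, 7.0]).hex())  # 27000000fe420000e040
```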
+
+#### Example
+
+Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`.
+
+In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8.
+
+We can visualize the binary representation like so:
+
+<table border="1" cellspacing="0" cellpadding="5">
+  <tr>
+    <td colspan="8">1st byte: dtype (from list in previous table) </td>
+    <td colspan="8">2nd byte: padding (values in [0,7])</td>
+    <td colspan="8">1st uint8: 238</td>
+    <td colspan="8">2nd uint8: 224</td>
+  </tr>
+  <tr>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>1</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>1</td>
+    <td>0</td>
+    <td>0</td>
+    <td>1</td>
+    <td>1</td>
+    <td>1</td>
+    <td>0</td>
+    <td>1</td>
+    <td>1</td>
+    <td>1</td>
+    <td>0</td>
+    <td>1</td>
+    <td>1</td>
+    <td>1</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+    <td>0</td>
+  </tr>
+</table>
+
+Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this:
+
+| 1   | 1   | 1   | 0   | 1   | 1   | 1   | 0   | 1   | 1   | 1   | 0   |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+
+## API Guidance
+
+Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while
+following idioms of the language of the driver.
+
+### Encoding
+
+```
+Function from_vector(vector: Iterable<Number>, dtype: Dtype, padding: Integer = 0) -> Binary
+    # Converts a numeric vector into a binary representation based on the specified dtype and padding.
+
+    # :param vector: A sequence or iterable of numbers (either float or int)
+    # :param dtype: Data type for binary conversion (from enum Dtype)
+    # :param padding: Optional integer specifying how many bits to ignore in the final byte
+    # :return: A binary representation of the vector
+
+    Declare binary_data as Binary
+
+    # Write the two metadata bytes that precede the packed numbers
+    binary_data.append(dtype)
+    binary_data.append(padding)
+
+    # Process each number in vector and convert according to dtype
+    For each number in vector
+        binary_element = convert_to_binary(number, dtype)
+        binary_data.append(binary_element)
+    End For
+
+    Return binary_data
+End Function
+```
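
One possible Python rendering of the encoding pattern is sketched below. It is illustrative only and is not the reference implementation. The validation rules (rejecting padding for byte-aligned dtypes, rejecting out-of-range or non-integer INT8 values) are assumptions drawn from the invalid cases in the accompanying test corpus, and the module-level constant names are hypothetical.

```python
import struct

# dtype bytes taken from the table in this document
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10


def from_vector(vector, dtype, padding=0):
    """Return the bytes stored under binary subtype 9: dtype, padding, packed elements."""
    values = list(vector)
    if not 0 <= padding <= 7:
        raise ValueError("padding must be in [0, 7]")
    if padding and dtype != PACKED_BIT:
        # Assumption from the test corpus: padding is rejected for byte-aligned dtypes.
        raise ValueError("padding is only meaningful for PACKED_BIT")

    if dtype == INT8:
        if any(not isinstance(v, int) or not -128 <= v <= 127 for v in values):
            raise ValueError("INT8 elements must be integers in [-128, 127]")
        body = struct.pack(f"<{len(values)}b", *values)
    elif dtype == FLOAT32:
        body = struct.pack(f"<{len(values)}f", *values)  # little-endian binary32
    elif dtype == PACKED_BIT:
        if any(not isinstance(v, int) or not 0 <= v <= 255 for v in values):
            raise ValueError("PACKED_BIT elements must be integers in [0, 255]")
        body = bytes(values)
    else:
        raise ValueError(f"unsupported dtype: {dtype!r}")

    return bytes([dtype, padding]) + body


print(from_vector([238, 224], PACKED_BIT, padding=4))
```

The last line reproduces the four bytes of the PACKED_BIT example given in the Binary structure section.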
+
+### Decoding
+
+```
+Function as_vector() -> Vector
+    # Unpacks binary data (BSON or similar) into a Vector structure.
+    # This process involves extracting numeric values, the data type, and padding information.
+
+    # :return: A Vector containing the unpacked numeric values, dtype, and padding.
+
+    Declare binary_vector as Vector  # Struct to hold the unpacked data
+
+    # Extract dtype (data type) from the binary data
+    binary_vector.dtype = extract_dtype_from_binary()
+
+    # Extract padding from the binary data
+    binary_vector.padding = extract_padding_from_binary()
+
+    # Unpack the actual numeric values from the binary data according to the dtype
+    binary_vector.data = unpack_numeric_values(binary_vector.dtype)
+
+    Return binary_vector
+End Function
+```
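
The decoding direction could look like the following Python sketch, again illustrative rather than normative. The `Vector` tuple and the constant names are hypothetical stand-ins, and PACKED_BIT data is returned as uint8 values, matching the in-memory representation described in the Data Types section.

```python
import struct
from typing import NamedTuple

# dtype bytes taken from the table in this document
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10


class Vector(NamedTuple):
    data: list
    dtype: int
    padding: int


def as_vector(payload):
    """Split subtype-9 bytes into numeric values, dtype, and padding."""
    if len(payload) < 2:
        raise ValueError("payload must contain the dtype and padding bytes")
    dtype, padding = payload[0], payload[1]
    body = payload[2:]

    if dtype == INT8:
        data = list(struct.unpack(f"<{len(body)}b", body))
    elif dtype == FLOAT32:
        if len(body) % 4:
            raise ValueError("FLOAT32 data must be a multiple of 4 bytes")
        data = list(struct.unpack(f"<{len(body) // 4}f", body))
    elif dtype == PACKED_BIT:
        data = list(body)  # one uint8 per byte; padding applies to the last one
    else:
        raise ValueError(f"unsupported dtype: {dtype!r}")

    return Vector(data, dtype, padding)


print(as_vector(b"\x10\x04\xee\xe0"))
```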
+
+#### Data Structures
+
+Drivers MAY find the following structures useful for representing the dtype and the vector.
+
+```
+Enum Dtype
+    # Enum for data types (dtype)
+
+    # FLOAT32: Represents packing of list of floats as float32
+    FLOAT32 = 0x27
+
+    # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8
+    INT8 = 0x03
+
+    # PACKED_BIT: Special case where vector values are 0 or 1, packed in groups of 8 (a byte)
+    # and represented as uint8 values in the range [0, 255]
+    PACKED_BIT = 0x10
+
+    # Documentation:
+    # Each value is a byte (length of one), a convenient choice for decoding.
+End Enum
+
+Struct Vector
+    # Numeric vector with metadata for binary interoperability
+
+    # Fields:
+    # data: Sequence of numeric values (either float or int)
+    # dtype: Data type of vector (from enum Dtype)
+    # padding: Number of bits to ignore in the final byte for alignment
+
+    data     # Sequence of float or int
+    dtype    # Type: Dtype
+    padding  # Integer: Number of padding bits
+End Struct
+```
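
In Python, these structures might be expressed with the standard `enum` and `dataclasses` modules, as in the sketch below. The class names mirror the pseudocode and are not taken from any driver.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Union


class Dtype(IntEnum):
    """dtype bytes from the table in this document."""
    INT8 = 0x03
    FLOAT32 = 0x27
    PACKED_BIT = 0x10


@dataclass
class Vector:
    data: List[Union[int, float]]
    dtype: Dtype
    padding: int = 0


v = Vector(data=[238, 224], dtype=Dtype.PACKED_BIT, padding=4)
# IntEnum members are integers, so they can be written directly as a byte.
print(bytes([v.dtype, v.padding]).hex())  # 1004
```

Using `IntEnum` keeps each member a single-byte integer, which makes it straightforward to look up a dtype from the first byte of the binary, for example `Dtype(0x27)`.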
+
+## Reference Implementation
+
+- PYTHON (PYTHON-4577)
+
+## Test Plan
+
+See the [README](tests/README.md) for tests.
+
+## FAQ
+
+- What MongoDB Server version does this apply to?
+  - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
+- In PACKED_BIT, why would one choose to use integers in \[0, 256)?
+  - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits). This technique is
+    widely used across different fields, such as data compression, communication protocols, and file formats, where you
+    want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example
+    in Python, see
+    [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits).
@@ -0,0 +1,55 @@
+# Testing Binary subtype 9: Vector
+
+The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to
+the specification.
+
+These tests focus on the roundtrip of the list of numbers as input/output, along with their data type and byte padding.
+
+Additional tests exist in `bson_corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector
+to BSON. For this reason, drivers must create a bespoke test runner for the vector subtype.
+
+## Format
+
+The test data corpus consists of a JSON file for each data type (dtype). Each file contains a number of test cases,
+under the top-level key "tests". Each test case pertains to a single vector. The keys provide the specification of the
+vector. Valid cases also include the Canonical BSON format of a document {test_key: binary}. The "test_key" is common,
+and specified at the top level.
+
+#### Top-level keys
+
+Each JSON file contains three top-level keys.
+
+- `description`: human-readable description of what is in the file
+- `test_key`: name used for key when encoding/decoding a BSON document containing the single BSON Binary for the test
+  case. Applies to *every* case.
+- `tests`: array of test case objects, each of which has the following keys. Valid cases will also contain additional
+  binary and json encoding values.
+
+#### Keys of individual test cases
+
+- `description`: string describing the test.
+- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input.
+- `vector`: list of numbers. Infinite values appear as the strings "inf" and "-inf".
+- `dtype_hex`: string defining the data type in hex (e.g. "0x10", "0x27").
+- `dtype_alias`: (optional) string naming the dtype, for example as an enum member.
+- `padding`: (optional) integer for byte padding. Defaults to 0.
+- `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string.
+
+## Required tests
+
+To prove correctness in a valid case (`valid: true`), one MUST
+
+- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match
+  those provided in the JSON.
+- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the
+  canonical_bson string.
+- For floating point number types, numerical values need not match exactly.
+
+To prove correctness in an invalid case (`valid: false`), one MUST
+
+- raise an exception when attempting to encode a document from the numeric values, dtype, and padding.
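
The round trip for a valid case can be sketched in Python without a BSON library, since each test document holds a single binary element. The helper names `build_document` and `extract_payload` are hypothetical, and a real test runner would normally use the driver's own BSON encoder and decoder instead.

```python
import struct


def build_document(test_key, payload):
    """Encode {test_key: Binary(payload, subtype 9)} as an uppercase hex string."""
    element = (
        b"\x05"                               # BSON element type: binary
        + test_key.encode("utf-8") + b"\x00"  # key as a cstring
        + struct.pack("<i", len(payload))     # int32 length of the binary payload
        + b"\x09"                             # binary subtype 9 (vector)
        + payload
    )
    body = element + b"\x00"                  # document terminator
    return (struct.pack("<i", len(body) + 4) + body).hex().upper()


def extract_payload(canonical_bson, test_key):
    """Return the subtype-9 payload stored under test_key in a one-element document."""
    raw = bytes.fromhex(canonical_bson)
    key = test_key.encode("utf-8") + b"\x00"
    (doc_len,) = struct.unpack_from("<i", raw, 0)
    if doc_len != len(raw) or raw[4] != 0x05 or raw[5:5 + len(key)] != key:
        raise ValueError("unexpected document layout")
    offset = 5 + len(key)
    (bin_len,) = struct.unpack_from("<i", raw, offset)
    if raw[offset + 4] != 0x09:
        raise ValueError("not a subtype-9 binary")
    return raw[offset + 5:offset + 5 + bin_len]


# "Simple Vector INT8" case from the corpus in this change
canonical = "1600000005766563746F7200040000000903007F0700"
payload = extract_payload(canonical, "vector")
print(payload.hex())                                   # 03007f07
print(build_document("vector", payload) == canonical)  # True
```

The extracted payload is what a driver's decoding method would turn into numeric values, dtype, and padding for comparison against the JSON fields.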
+
+## FAQ
+
+- What MongoDB Server version does this apply to?
+  - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
@@ -0,0 +1,51 @@
+{
+  "description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32",
+  "test_key": "vector",
+  "tests": [
+    {
+      "description": "Simple Vector FLOAT32",
+      "valid": true,
+      "vector": [127.0, 7.0],
+      "dtype_hex": "0x27",
+      "dtype_alias": "FLOAT32",
+      "padding": 0,
+      "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000"
+    },
+    {
+      "description": "Vector with decimals and negative value FLOAT32",
+      "valid": true,
+      "vector": [127.7, -7.7],
+      "dtype_hex": "0x27",
+      "dtype_alias": "FLOAT32",
+      "padding": 0,
+      "canonical_bson": "1C00000005766563746F72000A0000000927006666FF426666F6C000"
+    },
+    {
+      "description": "Empty Vector FLOAT32",
+      "valid": true,
+      "vector": [],
+      "dtype_hex": "0x27",
+      "dtype_alias": "FLOAT32",
+      "padding": 0,
+      "canonical_bson": "1400000005766563746F72000200000009270000"
+    },
+    {
+      "description": "Infinity Vector FLOAT32",
+      "valid": true,
+      "vector": ["-inf", 0.0, "inf"],
+      "dtype_hex": "0x27",
+      "dtype_alias": "FLOAT32",
+      "padding": 0,
+      "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00"
+    },
+    {
+      "description": "FLOAT32 with padding",
+      "valid": false,
+      "vector": [127.0, 7.0],
+      "dtype_hex": "0x27",
+      "dtype_alias": "FLOAT32",
+      "padding": 3
+    }
+  ]
+}
+
@@ -0,0 +1,57 @@
+{
+  "description": "Tests of Binary subtype 9, Vectors, with dtype INT8",
+  "test_key": "vector",
+  "tests": [
+    {
+      "description": "Simple Vector INT8",
+      "valid": true,
+      "vector": [127, 7],
+      "dtype_hex": "0x03",
+      "dtype_alias": "INT8",
+      "padding": 0,
+      "canonical_bson": "1600000005766563746F7200040000000903007F0700"
+    },
+    {
+      "description": "Empty Vector INT8",
+      "valid": true,
+      "vector": [],
+      "dtype_hex": "0x03",
+      "dtype_alias": "INT8",
+      "padding": 0,
+      "canonical_bson": "1400000005766563746F72000200000009030000"
+    },
+    {
+      "description": "Overflow Vector INT8",
+      "valid": false,
+      "vector": [128],
+      "dtype_hex": "0x03",
+      "dtype_alias": "INT8",
+      "padding": 0
+    },
+    {
+      "description": "Underflow Vector INT8",
+      "valid": false,
+      "vector": [-129],
+      "dtype_hex": "0x03",
+      "dtype_alias": "INT8",
+      "padding": 0
+    },
+    {
+      "description": "INT8 with padding",
+      "valid": false,
+      "vector": [127, 7],
+      "dtype_hex": "0x03",
+      "dtype_alias": "INT8",
+      "padding": 3
+    },
+    {
+      "description": "INT8 with float inputs",
+      "valid": false,
+      "vector": [127.77, 7.77],
+      "dtype_hex": "0x03",
+      "dtype_alias": "INT8",
+      "padding": 0
+    }
+  ]
+}
+