-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813
[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813
Conversation
…t.test_bson.TestBSON.test_encode_type_marker to pass
…t.test_bson.TestBSON.test_encode_type_marker to pass
…ector, from_vector
e32d1b8
to
26b8398
Compare
bson/binary.py
Outdated
elif dtype == BinaryVectorDtype.FLOAT32: | ||
n_bytes = len(self) - position | ||
n_values = n_bytes // 4 | ||
assert n_bytes % 4 == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not use assert
for validation data. If the argument type is incorrect we raise a TypeError, if the type is correct but the value is invalid we raise a ValueError.
bson/binary.py
Outdated
""" | ||
if dtype == BinaryVectorDtype.INT8: # pack ints in [-128, 127] as signed int8 | ||
format_str = "b" | ||
assert not padding, f"padding does not apply to {dtype=}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about assert.
bson/binary.py
Outdated
format_str = "B" | ||
elif dtype == BinaryVectorDtype.FLOAT32: # pack floats as float32 | ||
format_str = "f" | ||
assert not padding, f"padding does not apply to {dtype=}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about assert.
|
||
.. versionadded:: 4.9 | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to validate self.subtype == 9
here before attempting to decode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShaneHarvey How do I do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
if self.subtype != VECTOR_SUBTYPE:
raise ValueError(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at as_uuid
it should be the same as that validation:
if self.subtype not in ALL_UUID_SUBTYPES:
raise ValueError(f"cannot decode subtype {self.subtype} as a uuid")
.. versionadded:: 4.9 | ||
""" | ||
|
||
INT8 = b"\x03" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An enum of bytes is a bit unusual. Can we change it to use ints? eg INT8 = 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming convention that Geert used has meaning in both the first and last 4 bits, so I'd prefer it to stay as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user doesn't see the value locally either. Given that the bson spec itself always uses integers I think we should use the integer form.
bson/binary.py
Outdated
.. versionadded:: 4.9 | ||
""" | ||
|
||
data: Sequence[float | int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add __slots__ = ("data", "dtype", "padding")
to reduce the memory overhead of using this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. How do I do type annotation for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None needed, e.g.
mongo-python-driver/bson/timestamp.py
Line 31 in 821811e
__slots__ = ("__time", "__inc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could pass slots=True
instead: https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange. And you're cool with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to add them manually instead of using slots=True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class BinaryVector:
"""**(BETA)** Vector of numbers along with metadata for binary interoperability.
:param data (Sequence[float | int]): Sequence of numbers representing the mathematical vector.
:param dtype (:class:`bson.Binary.BinaryVectorDtype`): The data type stored in binary
:param padding (Optional[int] = 0): The number of bits in the final byte that are to be ignored
when a vector element's size is less than a byte
and the length of the vector is not a multiple of 8. Default is 0.
.. versionadded:: 4.10
"""
__slots__ = ("data", "dtype", "padding")
def __init__(self, data, dtype, padding=0):
self.data = data
self.dtype = dtype
self.padding = padding
Is this right? @blink1073
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't realize __slots__
with dataclass was problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be sorted now
.. versionadded:: 4.9 | ||
""" | ||
|
||
INT8 = b"\x03" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?
bson/binary.py
Outdated
|
||
data: Sequence[float | int] | ||
dtype: BinaryVectorDtype | ||
padding: Optional[int] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[int]
-> int
, unless padding=None is considered valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
padding=None is expected when it is not a packed type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it? I see the code using padding=0 in that case.
bson/binary.py
Outdated
cls: Type[Binary], | ||
vector: list[int, float], | ||
dtype: BinaryVectorDtype, | ||
padding: Optional[int] = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[int]
-> int
bson/binary.py
Outdated
data = struct.pack(f"{len(vector)}{format_str}", *vector) | ||
return cls(metadata + data, subtype=VECTOR_SUBTYPE) | ||
|
||
def as_vector(self, uncompressed: Optional[bool] = False) -> BinaryVector: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional[bool]
-> bool
unless uncompressed=None
is valid.
bson/binary.py
Outdated
else: | ||
bits = [] | ||
for uint8 in unpacked_uint8s: | ||
bits.extend([int(bit) for bit in f"{uint8:08b}"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add an uncompressed
option here at all? It looks like this option is irreversible because from_vector does not support the same option. Even if it did, is this option useful outside of test code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we should leave out uncompressed
, it was a quality of life addition, but is not likely to be standardized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's irreversible. We don't yet provide an API to go from a full vector of zeros and ones to a packed bit vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going in, you'd be using something like numpy.packbits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's omit it since it adds unneeded complexity at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Fixing the test suite and the Enum issue are my last remaining comments. |
We're pressing ahead with the hex values in the Enum because they simplify the implementation of encode and decode. |
Again, the user will only ever see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Adds support for new subtype of Binary BSON subtype to be used for densely-packed 1d arrays (vectors).
Corresponding changes are in specifications repo PR #1658