Description
Severity
P1 - Major feature malfunctioning
Current Behavior
Assigning `bytes` to `Dataset.Metadata` or `Column.Metadata` stores the byte sequence as a `str`, not as `bytes`. For `bytes` consisting of ASCII characters the sequence should at least be legible, but it isn't: the returned `str` appears to be base64-encoded, and it is not straightforward to recover the original `bytes` from the `str` object.
Steps to Reproduce
```python
import deeplake

ds = deeplake.create("mem://temp")
byte_sequence = b"This should be legible"
ds.metadata["bytes"] = byte_sequence
eq_or_ne = "=" if ds.metadata["bytes"] == byte_sequence else "!"
print(f"byte_sequence '{byte_sequence}' {eq_or_ne}= ds.metadata['bytes'] '{ds.metadata['bytes']}'")
print(f"byte_sequence '{type(byte_sequence)}' {eq_or_ne}= ds.metadata['bytes'] '{type(ds.metadata['bytes'])}'")
```
Save this as issue.py, run it, and observe the result:
```
$ python issue.py
byte_sequence 'b'This should be legible'' != ds.metadata['bytes'] 'VGhpcyBzaG91bGQgYmUgbGVnaWJsZQ=='
byte_sequence '<class 'bytes'>' != ds.metadata['bytes'] '<class 'str'>'
```
Expected/Desired Behavior
The value returned from a metadata property should retain the type and content that was assigned to it.
The output should read:
```
$ python issue.py
byte_sequence 'b'This should be legible'' == ds.metadata['bytes'] 'b'This should be legible''
byte_sequence '<class 'bytes'>' == ds.metadata['bytes'] '<class 'bytes'>'
```
This is particularly relevant given that [BUG] #3061 has a workaround in which `sequence[text]` is replaced by `sequence[bytes]` (using `str.encode`). It would now be handy to store, in the metadata of the column containing the `sequence[text]`, the list of tokens unique to the collection of all `sequence[text]` values across records in a dataset. While it is possible to assign a `list[str]` to the metadata, a `list[bytes]` will be garbled.
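Until the `bytes` type is preserved, one possible interim workaround (a sketch, not deeplake-specific) is to base64-encode each token explicitly and store the result as a `list[str]`, which metadata does handle correctly:

```python
import base64

# Hypothetical token list; note the second token is not valid ASCII/UTF-8.
tokens = [b"foo", b"bar\xff"]

# Encode each token to a plain ASCII str before assigning to metadata.
stored = [base64.b64encode(t).decode("ascii") for t in tokens]

# Later, decode the list[str] read back from metadata.
recovered = [base64.b64decode(s) for s in stored]
assert recovered == tokens
```

The drawback is that every consumer of the metadata must know which fields carry base64-encoded `bytes`, which is exactly the bookkeeping a type-preserving metadata store would make unnecessary.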
Python Version
python 3.12.0 hab00c5b_0_cpython conda-forge
OS
Ubuntu 24.04.2 LTS
IDE
VS-Code
Packages
deeplake==4.2.14 numpy==2.3.1 pip==25.1.1 setuptools==80.9.0 wheel==0.45.1
Additional Context
No response
Possible Solution
No response
Are you willing to submit a PR?
- I'm willing to submit a PR (Thank you!)