[BUG] Dataset.Metadata mangles bytes as str with unknown encoding #3062

Open
@saamc

Description

Severity

P1 - Major feature malfunctioning

Current Behavior

Assigning a bytes value to Dataset.Metadata or Column.Metadata stores it as a str rather than as bytes. For bytes consisting only of ASCII characters the stored string should at least be legible, but it is not, and recovering the original bytes from the str object is not straightforward.

Steps to Reproduce

import deeplake

ds = deeplake.create("mem://temp")

byte_sequence = b"This should be legible"
ds.metadata["bytes"] = byte_sequence
eq_or_ne = "=" if ds.metadata["bytes"] == byte_sequence else "!"
print(f"byte_sequence '{byte_sequence}' {eq_or_ne}= ds.metadata['bytes'] '{ds.metadata["bytes"]}'")
print(f"byte_sequence '{type(byte_sequence)}' {eq_or_ne}= ds.metadata['bytes'] '{type(ds.metadata["bytes"])}'")

Save this as issue.py and run it:

$ python issue.py 
byte_sequence 'b'This should be legible'' != ds.metadata['bytes'] 'VGhpcyBzaG91bGQgYmUgbGVnaWJsZQ=='
byte_sequence '<class 'bytes'>' != ds.metadata['bytes'] '<class 'str'>'
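
For what it is worth, the string returned in the output above is the standard Base64 encoding of the original bytes, which can be checked with the standard library alone. The sketch below assumes that the same holds for other values; that is inferred from this single example and is not confirmed by the deeplake documentation.

```python
import base64

byte_sequence = b"This should be legible"
# String returned by ds.metadata["bytes"] in the output above.
stored = "VGhpcyBzaG91bGQgYmUgbGVnaWJsZQ=="

# Assumption: bytes are Base64-encoded on assignment (inferred, not documented).
recovered = base64.b64decode(stored, validate=True)

print(recovered == byte_sequence)
print(base64.b64encode(byte_sequence).decode("ascii") == stored)
```

Even if this holds in general, the caller still has to know which metadata keys were originally bytes, because the type information is lost.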

Expected/Desired Behavior

The value returned from a metadata property should retain the type and content that was assigned to it.
The output should read:

$ python issue.py
byte_sequence 'b'This should be legible'' == ds.metadata['bytes'] 'b'This should be legible''
byte_sequence '<class 'bytes'>' == ds.metadata['bytes'] '<class 'bytes'>'
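
The expectation can also be phrased as a round-trip check that could serve as a regression test. This is a minimal sketch: `roundtrips` and `ManglingStore` are hypothetical names, and a plain dict stands in for the metadata object, which is only assumed to support item assignment and lookup as in the reproduction script.

```python
import base64
from collections.abc import MutableMapping


def roundtrips(store: MutableMapping, key: str, value: object) -> bool:
    """Return True if the value read back has the same type and content."""
    store[key] = value
    returned = store[key]
    return type(returned) is type(value) and returned == value


class ManglingStore(dict):
    """Hypothetical stand-in that mimics the behavior reported in this issue."""

    def __setitem__(self, key, value):
        if isinstance(value, bytes):
            # Mimic the observed result: bytes come back as a Base64 str.
            value = base64.b64encode(value).decode("ascii")
        super().__setitem__(key, value)


print(roundtrips({}, "bytes", b"This should be legible"))
print(roundtrips(ManglingStore(), "bytes", b"This should be legible"))
```

The first call models the desired behavior and the second models the current one, so the check passes only for a store that preserves bytes.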

This is particularly relevant because [BUG] #3061 has a workaround in which sequence[text] is replaced by sequence[bytes] (using str.encode). With that workaround it would be handy to store the list of unique tokens, collected from all sequence[text] values across the records of a dataset, in the metadata of the column that holds them. A list[str] can be assigned to the metadata, but a list[bytes] comes back garbled.
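
Until bytes survive the round trip, one option is to make the encoding explicit and store the tokens as a list[str], which is reported above to work. The sketch below uses hypothetical helpers `encode_tokens` and `decode_tokens`, and a plain dict as a stand-in for the column metadata; it has not been run against deeplake itself.

```python
import base64


def encode_tokens(tokens: list[bytes]) -> list[str]:
    """Encode each bytes token as an ASCII Base64 string."""
    return [base64.b64encode(token).decode("ascii") for token in tokens]


def decode_tokens(encoded: list[str]) -> list[bytes]:
    """Invert encode_tokens and return the original bytes tokens."""
    return [base64.b64decode(item, validate=True) for item in encoded]


# Stand-in for a column's metadata (assumption: it accepts list[str]).
metadata = {}

tokens = [b"hello", "w\u00f6rld".encode("utf-8"), b"\xff\x00"]
metadata["tokens"] = encode_tokens(tokens)

restored = decode_tokens(metadata["tokens"])
print(restored == tokens)
```

Base64 is used here instead of bytes.decode because it also handles tokens that are not valid UTF-8, at the cost of the stored strings not being human-readable.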

Python Version

python 3.12.0 hab00c5b_0_cpython conda-forge

OS

Ubuntu 24.04.2 LTS

IDE

VS-Code

Packages

deeplake==4.2.14 numpy==2.3.1 pip==25.1.1 setuptools==80.9.0 wheel==0.45.1

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR (Thank you!)

Labels

bug (Something isn't working)
