-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Concatenate str support for IterableDataset
#3686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
47cca6c
to
1a73970
Compare
9d49356
to
032c7d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment !
src/accelerate/utils/operations.py
Outdated
first_inner = data[0][0] if len(data[0]) > 0 else None | ||
|
||
if isinstance(first_inner, str): | ||
return honor_type(data[0], [item for sublist in data for item in sublist]) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't modify that, we should check elif isinstance(data[0], str) and do the respective changes there. Try to add for test case also involving tuple and dictionaries. Also we should check that no tensors is being concatenated with str
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Instead of checking for str directly, I check whether it's a list of strings, this made the logic simpler.
Let me know if that works for you, and I'm happy to explore another approach if needed.
…rate into concatenate-str-support
What does this PR do?
This PR changes concatenate so it doesn't crash when batches contain non-tensor values like strings.
Right now it seems to throw a TypeError if you had anything other than tensors, for example, string labels.
This is pretty useful for
IterableDataset
. I also added a new test.Fixes #3624 #1878
Proposal
When concatenating lists, check if the first element is astr
, if so, concat as strings as a flat python list instead of tensor (I could also check all elements if you would like to avoid cases like["test", 0] + ["test", "test] = ["test", 0, "test", "test"]
).Changed the logic to check all elements when concatenating lists. If all elements are strings, they are concatenated as a flat Python list.
If this is accepted, this can easily be adapted to support more types if needed.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@BenjaminBossan @SunMarc @zach-huggingface
Minimal example script showing the problem
Example with
IterableDataset
Discussion
Could be extended to support constants? Like