@JackTemaki
Collaborator

  • Changes the torch data pipeline to allow for non-tensor data to be passed.
  • Extended typing

albertz added a commit that referenced this pull request May 30, 2023
Follow-up to #1330.
@albertz
Member

albertz commented May 30, 2023

The recent commits added seq_tag to extern_data. It simply uses a NumPy ndarray for that, which supports strings. An ndarray can also hold any custom Python object, so the approach is generic enough to cover other cases you might want later. So I added seq_tag as a NumPy array. The only real remaining changes w.r.t. handling NumPy arrays are:

  • create_tensor: Return the NumPy array as-is for certain types. Currently this is only the string type, but it could be extended to the Python object type, or to other types that have no meaningful PyTorch equivalent.
  • collate_batch: Also handle Numpy arrays.
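
The two changes above can be sketched roughly as follows. This is a minimal illustration and not the code from the commits: the function bodies, the use of `dtype.kind` to detect string arrays, and the choice of `pad_sequence` for tensor data are all assumptions made for the example.

```python
import numpy as np
import torch


def create_tensor(array):
    """Convert a NumPy array to a torch tensor, unless its dtype has no torch equivalent."""
    array = np.asarray(array)
    # Assumption: string dtypes ("U" = unicode, "S" = bytes) are passed through unchanged.
    # The object dtype ("O") could be added to this check later.
    if array.dtype.kind in "US":
        return array
    return torch.tensor(array)


def collate_batch(batch):
    """Merge a list of per-sequence dicts into one batch dict."""
    result = {}
    for key in batch[0]:
        values = [create_tensor(seq[key]) for seq in batch]
        if isinstance(values[0], np.ndarray):
            # Non-tensor data: stack into one NumPy array, no padding.
            result[key] = np.stack(values)
        else:
            # Tensor data: pad along the time axis to the longest sequence.
            result[key] = torch.nn.utils.rnn.pad_sequence(values, batch_first=True)
    return result


batch = [
    {"data": np.array([1, 2, 3]), "seq_tag": np.array("seq-0")},
    {"data": np.array([4, 5]), "seq_tag": np.array("seq-1")},
]
out = collate_batch(batch)
```

Here `out["data"]` is a padded torch tensor of shape `(2, 3)`, while `out["seq_tag"]` stays a NumPy string array with one entry per sequence.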

Further, seq_tag is handled as somewhat of a special case: it is automatically added to extern_data if it is not specified.
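
As a rough sketch of that special case, assuming extern_data is a plain dict keyed by data name: the helper name `with_default_seq_tag` and the contents of the default entry are hypothetical and only serve the illustration.

```python
def with_default_seq_tag(extern_data):
    """Return a copy of extern_data that always contains a seq_tag entry."""
    result = dict(extern_data)
    # Hypothetical default entry; the actual template used by the library may differ.
    # setdefault leaves an entry the user specified untouched.
    result.setdefault("seq_tag", {"dtype": "string", "shape": ()})
    return result


user_config = {"data": {"dim": 40}}
extern_data = with_default_seq_tag(user_config)
```

An explicitly configured seq_tag entry is kept as-is, and the original config dict is not modified.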

I did not add seq_idx because the index might not be meaningful: it depends on the current ordering of the dataset. Using get_corpus_seq_idx might be more meaningful. But I noticed that we also have this wrong in the TF backend anyway, and it was probably never really used (as it was incorrect), so I left it out for now.
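
To illustrate why the plain index depends on the ordering, here is a toy example. `ToyDataset` is a hypothetical stand-in and not the library's dataset class; only the method name `get_corpus_seq_idx` is taken from the comment above, and its behavior here is an assumption.

```python
import random


class ToyDataset:
    """Hypothetical dataset whose sequence order is reshuffled every epoch."""

    def __init__(self, tags):
        self.tags = list(tags)
        self.order = list(range(len(self.tags)))

    def init_seq_order(self, epoch):
        # Shuffle deterministically per epoch, so the order differs between epochs.
        self.order = list(range(len(self.tags)))
        random.Random(epoch).shuffle(self.order)

    def get_corpus_seq_idx(self, seq_idx):
        # Map a position in the current order back to the fixed corpus index.
        return self.order[seq_idx]

    def get_tag(self, seq_idx):
        return self.tags[self.order[seq_idx]]


dataset = ToyDataset(["seq-a", "seq-b", "seq-c", "seq-d"])
tags_by_corpus_idx = {}
for epoch in (1, 2):
    dataset.init_seq_order(epoch)
    for seq_idx in range(len(dataset.tags)):
        corpus_idx = dataset.get_corpus_seq_idx(seq_idx)
        tags_by_corpus_idx.setdefault(corpus_idx, set()).add(dataset.get_tag(seq_idx))
```

Each corpus index maps to exactly one tag across epochs, whereas the same seq_idx can refer to different sequences once the order is reshuffled.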

I guess this PR can be closed.

@albertz albertz closed this May 30, 2023
@albertz albertz deleted the nick-torch-seq-info branch May 30, 2023 21:21