Description
Per discussion with @bpopeters and @vince62s on #1196
In the long run, by the time [data is] batched it shouldn't be necessary to know the data type. It's just a batch of data that the model should be able to handle. It only comes up right now because our way of handling feature embeddings (a feature almost no one uses) is super hacky.
The question is just when, where, and how they get numericalized. I think the best way would be some kind of multilevel field that takes an input consisting of words plus arbitrarily many features and, when numericalized, produces the sort of stacked tensor that the embeddings module expects. The `NestedField` in torchtext almost, but not quite, fits the bill.
It would save the trainer and translator from having to reason about the types of the source and target data.
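To make the idea concrete, here is a minimal sketch of what such a multilevel field could look like. The class name `MultiLevelField`, the `│`-style separator, and the method names are all hypothetical, not the actual OpenNMT-py or torchtext API; it uses plain nested lists where the real implementation would return a stacked tensor.

```python
# Hypothetical sketch of a multilevel field: each token carries a word
# plus arbitrarily many features joined by a separator (e.g. "cat|NN"),
# and numericalization yields one index per level per position, i.e. a
# (seq_len, n_levels) stack that an embeddings module could consume.

class MultiLevelField:
    def __init__(self, sep="|", n_levels=2):
        self.sep = sep
        self.n_levels = n_levels
        # one vocabulary per level: level 0 is the word, levels 1.. are features
        self.stoi = [{"<unk>": 0} for _ in range(n_levels)]

    def build_vocab(self, sentences):
        for sent in sentences:
            for tok in sent:
                for lvl, part in enumerate(tok.split(self.sep)):
                    self.stoi[lvl].setdefault(part, len(self.stoi[lvl]))

    def numericalize(self, sent):
        # Returns a (seq_len, n_levels) nested list of indices; with a
        # tensor library this would be the stacked tensor the model sees,
        # so nothing downstream needs to know about feature types.
        return [
            [self.stoi[lvl].get(part, 0)
             for lvl, part in enumerate(tok.split(self.sep))]
            for tok in sent
        ]


field = MultiLevelField(n_levels=2)
field.build_vocab([["the|DT", "cat|NN"], ["a|DT", "dog|NN"]])
print(field.numericalize(["the|DT", "dog|NN"]))  # [[1, 1], [4, 2]]
```

The point of the sketch is that batching and numericalizing features happens entirely inside the field, so the trainer and translator only ever see an opaque stack of indices.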
I made a first pass at designing a multilevel field that handles the batching of features so that it's not in `inputters.make_features`. I opened a PR from that branch against my branch for #1196 so that you can review the diff and guide the design further before I try to open one here (also because #1196 isn't merged). It's here. I checked the PR script and all seems well.