
Improve feature embeddings implementation #1200

Closed
@flauted

Description

Per discussion with @bpopeters and @vince62s on #1196

In the long run, by the time [data is] batched it shouldn't be necessary to know the data type. It's just a batch of data that the model should be able to handle. It only comes up right now because our way of handling feature embeddings (a feature almost no one uses) is super hacky.

The question is just when, where, and how the features get numericalized. I think the best approach would be some kind of multilevel field whose input is words plus arbitrarily many features and which, when numericalized, produces the sort of stacked tensor that the embeddings module expects. The NestedField in torchtext almost, but not quite, fits the bill.

It would save the trainer and translator from having to reason about the types of the source and target data.
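For concreteness, here is a rough sketch of what such a multilevel field could look like, assuming the legacy torchtext `Field` API; the name `MultiLevelField`, its interface, and the `<blank>` pad token are assumptions for illustration, not the final design:

```python
import torch
from torchtext.data import Field


class MultiLevelField(object):
    """Hypothetical multilevel field (sketch, not the final design).

    Examples are sequences of (word, feat_1, ..., feat_n) tuples.
    Each level gets its own sub-Field and vocab; process() pads and
    numericalizes the levels separately and stacks them along a
    trailing dimension, producing the (seq_len, batch, n_feats + 1)
    tensor the embeddings module expects, so the stacking no longer
    has to live in inputters.make_features.
    """

    def __init__(self, n_feats, pad_token="<blank>"):
        # level 0 is the words, levels 1..n_feats are the features
        self.fields = [Field(pad_token=pad_token) for _ in range(n_feats + 1)]

    def build_vocab(self, examples, **kwargs):
        # examples: list of token-tuple sequences; each sub-field
        # builds its vocab from its own level only
        for i, field in enumerate(self.fields):
            field.build_vocab([[tok[i] for tok in ex] for ex in examples],
                              **kwargs)

    def process(self, batch, device=None):
        # batch: list of examples, each a list of token tuples
        levels = []
        for i, field in enumerate(self.fields):
            level = [[tok[i] for tok in ex] for ex in batch]
            # Field.process pads and numericalizes -> (seq_len, batch)
            levels.append(field.process(level, device=device))
        # every level is padded to the same length, so stacking gives
        # (seq_len, batch, n_feats + 1)
        return torch.stack(levels, dim=2)
```

A batch built this way comes out in the same shape that make_features currently produces, so the embeddings module itself wouldn't need to change.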

I made a first pass at designing a multilevel field that handles the batching of features so that it no longer happens in inputters.make_features. I opened a PR from that branch against my branch for #1196 so that you can review the diff and guide the design further before I open one here (also because #1196 isn't merged yet). It's here. I checked the PR script and all seems well.
