Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add magic method to our TF models to convert datasets with column inference #17160

Merged
merged 35 commits into from
Jun 6, 2022

Conversation

Rocketknight1
Copy link
Member

@Rocketknight1 Rocketknight1 commented May 10, 2022

Left to do:

  • Add docstring
  • Figure out what type to set for dataset (since datasets isn't imported)
  • Set default values for batch_size / shuffle?
  • Do we need to document this anywhere besides the docstring?
  • Anything else I forgot? (Reviewers please yell at me)

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 10, 2022

The documentation is not available anymore as the PR was closed or merged.

@Rocketknight1 Rocketknight1 force-pushed the magic_tf_dataset_method branch from f76056b to 037e848 Compare May 12, 2022 15:12
@Rocketknight1 Rocketknight1 marked this pull request as ready for review May 12, 2022 17:34
@Rocketknight1
Copy link
Member Author

Rocketknight1 commented May 17, 2022

Hey all, this is the method in transformers that I moved all the column inference code to! I also need some advice:

  • How do we handle an input type of Dataset when datasets may not be installed?
  • Should this be moved to a utility function that takes model as an argument rather than a method on TFPreTrainedModel?
  • Should I set default values for any arguments to reduce the amount of typing users have to do?
  • How should I make sure users find out about this (besides revamping the examples and notebooks once it's merged)?

Copy link
Member

@gante gante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good -- the biggest question to settle is the import requirement of datasets

src/transformers/modeling_tf_utils.py Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
tests/test_modeling_tf_common.py Outdated Show resolved Hide resolved
@gante
Copy link
Member

gante commented May 19, 2022

Should this be moved to a utility function that takes model as an argument rather than a method on TFPreTrainedModel?

All our models inherit from TFPreTrainedModel, so it should be fine

Should I set default values for any arguments to reduce the amount of typing users have to do?

I'd keep the same defaults as in datasets, for a consistent experience. But I'm not the most experienced in this domain :D

How should I make sure users find out about this (besides revamping the examples and notebooks once it's merged)?

We are getting to the point where we have a lot of content to announce (these changes, the metrics working correctly, generate and its updates, new models, ...), maybe we can start a once-a-week TF communication of some sort!

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Let's polish the type annotations/imports and it should be good to go.

src/transformers/dependency_versions_table.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
Rocketknight1 and others added 9 commits May 19, 2022 14:31
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@Rocketknight1 Rocketknight1 force-pushed the magic_tf_dataset_method branch from e3661b4 to e381daf Compare May 19, 2022 13:32
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite cool!

@Rocketknight1
Copy link
Member Author

Quick update: I think this is ready to merge, but I've only really tested it with the updated to_tf_dataset() method in datasets, which hasn't been merged yet (but is due very soon!). As such, I don't want to merge it until that's in, because there could be edge case issues with the old method that I haven't seen.

@Rocketknight1
Copy link
Member Author

I've merged the to_tf_dataset update so I'm going to merge this one too - though I think it will be a silent 'soft launch' until there's a new release of datasets, to avoid any unforeseen problems. Since this code only adds the new method, it shouldn't disrupt any existing workflows before it's ready to be used.

@Rocketknight1 Rocketknight1 merged commit 19a8a30 into main Jun 6, 2022
@Rocketknight1 Rocketknight1 deleted the magic_tf_dataset_method branch June 6, 2022 14:53
Narsil pushed a commit to Narsil/transformers that referenced this pull request Jun 7, 2022
…erence (huggingface#17160)

* Add method to call to_tf_dataset() with column inference

* Add test for dataset creation

* Add a default arg for data collator

* Fix test

* Fix call with non-dev version of datasets

* Test correct column removal too

* make fixup

* More tests to make sure we remove unwanted columns

* Fix test to avoid predicting on unbuilt models

* Fix test to avoid predicting on unbuilt models

* Fix test to remove unwanted head mask columns from inputs

* Stop pushing your debug breakpoints to the main repo of the $2bn company you work for

* Skip the test in convnext because no grouped conv support

* Drop bools from the dataset dict

* Make style

* Skip the training test for models whose input dicts don't give us labels

* Skip transformerXL in the test because it doesn't return a simple loss

* Skip TFTapas because of some odd NaN losses

* make style

* make fixup

* Add docstring

* fixup

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Remove breakpoint from tests

* Fix assert, add requires_backends

* Protect tokenizer import with if TYPE_CHECKING

* make fixup

* Add noqa, more fixup

* More rearranging for ~* aesthetics *~

* Adding defaults for shuffle and batch_size to match to_tf_dataset()

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…erence (huggingface#17160)

* Add method to call to_tf_dataset() with column inference

* Add test for dataset creation

* Add a default arg for data collator

* Fix test

* Fix call with non-dev version of datasets

* Test correct column removal too

* make fixup

* More tests to make sure we remove unwanted columns

* Fix test to avoid predicting on unbuilt models

* Fix test to avoid predicting on unbuilt models

* Fix test to remove unwanted head mask columns from inputs

* Stop pushing your debug breakpoints to the main repo of the $2bn company you work for

* Skip the test in convnext because no grouped conv support

* Drop bools from the dataset dict

* Make style

* Skip the training test for models whose input dicts don't give us labels

* Skip transformerXL in the test because it doesn't return a simple loss

* Skip TFTapas because of some odd NaN losses

* make style

* make fixup

* Add docstring

* fixup

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Remove breakpoint from tests

* Fix assert, add requires_backends

* Protect tokenizer import with if TYPE_CHECKING

* make fixup

* Add noqa, more fixup

* More rearranging for ~* aesthetics *~

* Adding defaults for shuffle and batch_size to match to_tf_dataset()

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jun 16, 2022
…erence (huggingface#17160)

* Add method to call to_tf_dataset() with column inference

* Add test for dataset creation

* Add a default arg for data collator

* Fix test

* Fix call with non-dev version of datasets

* Test correct column removal too

* make fixup

* More tests to make sure we remove unwanted columns

* Fix test to avoid predicting on unbuilt models

* Fix test to avoid predicting on unbuilt models

* Fix test to remove unwanted head mask columns from inputs

* Stop pushing your debug breakpoints to the main repo of the $2bn company you work for

* Skip the test in convnext because no grouped conv support

* Drop bools from the dataset dict

* Make style

* Skip the training test for models whose input dicts don't give us labels

* Skip transformerXL in the test because it doesn't return a simple loss

* Skip TFTapas because of some odd NaN losses

* make style

* make fixup

* Add docstring

* fixup

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Remove breakpoint from tests

* Fix assert, add requires_backends

* Protect tokenizer import with if TYPE_CHECKING

* make fixup

* Add noqa, more fixup

* More rearranging for ~* aesthetics *~

* Adding defaults for shuffle and batch_size to match to_tf_dataset()

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants