
Conversation

Contributor

@sanchit-gandhi sanchit-gandhi commented Dec 5, 2022

What does this PR do?

Moves the method get_decoder_prompt_ids from the processor to the tokenizer. The primary reason for this change is that the ASR pipeline class does not load the processor object, but rather the feature extractor and tokenizer separately (see docs). Therefore, as things currently stand, pipeline does not have access to the processor method get_decoder_prompt_ids. Moving the method to the tokenizer makes it callable from pipeline.

Note that this is not a breaking change: we retain a get_decoder_prompt_ids method on the processor. This method simply calls get_decoder_prompt_ids on the tokenizer:

def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
    return self.tokenizer.get_decoder_prompt_ids(task=task, language=language, no_timestamps=no_timestamps)
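A minimal, self-contained sketch of this delegation pattern (the class names and return value here are illustrative stand-ins, not the actual WhisperProcessor/WhisperTokenizer implementations):

```python
# Illustrative sketch of the backward-compatible delegation described above.
# These tiny classes stand in for the real tokenizer/processor classes.

class Tokenizer:
    def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
        # The real method builds forced decoder token ids; here we just echo
        # the arguments so the delegation is easy to see.
        return {"task": task, "language": language, "no_timestamps": no_timestamps}


class Processor:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
        # Retained for backward compatibility: delegate to the tokenizer method.
        return self.tokenizer.get_decoder_prompt_ids(
            task=task, language=language, no_timestamps=no_timestamps
        )
```

Since the processor method is a pure pass-through, existing code that calls `processor.get_decoder_prompt_ids(...)` keeps working unchanged, while pipeline users can call the same method on the tokenizer directly.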

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Comment on lines +402 to +403
elif self.language in TO_LANGUAGE_CODE.values():
language_id = self.language
Contributor Author

@sanchit-gandhi sanchit-gandhi Dec 5, 2022


The processor's get_decoder_prompt_ids expected a language code id (e.g. "es"), whereas the tokenizer's set_prefix_tokens expected a language name (e.g. "Spanish"). This PR amends the tokenizer method to handle either.
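A minimal sketch of the normalization this diff hunk implements. The helper name `to_language_id` is hypothetical, and `TO_LANGUAGE_CODE` is shown here as a small illustrative subset of the tokenizer's full name-to-code mapping:

```python
# Illustrative subset of the Whisper tokenizer's language-name -> code mapping.
TO_LANGUAGE_CODE = {"english": "en", "spanish": "es", "french": "fr"}


def to_language_id(language):
    """Accept either a language name ("Spanish") or a code id ("es")."""
    language = language.lower()
    if language in TO_LANGUAGE_CODE:
        # A full language name, e.g. "spanish" -> "es".
        return TO_LANGUAGE_CODE[language]
    elif language in TO_LANGUAGE_CODE.values():
        # Already a language code id, e.g. "es" -> pass through unchanged.
        return language
    raise ValueError(f"Unsupported language: {language}")
```

The `elif` branch quoted in the diff is the pass-through case: if the string is already a valid code id, it is used as-is instead of being looked up in the mapping.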

Collaborator

@ArthurZucker ArthurZucker left a comment


Very nice thanks a lot! I remember it was temporary so that's a nice follow up!

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 5, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment


Nice fix and thanks for making it fully backward compatible!

@sanchit-gandhi sanchit-gandhi merged commit e7e6d18 into huggingface:main Dec 5, 2022
mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022
4 participants