
Conversation

Contributor

@sanchit-gandhi sanchit-gandhi commented Dec 5, 2022

What does this PR do?

Moves the method get_decoder_prompt_ids from the processor to the tokenizer. The primary reason for this change is that the ASR pipeline class does not load the processor object, but rather the feature extractor and tokenizer separately (see docs). Therefore, as things currently stand, pipeline does not have access to the processor method get_decoder_prompt_ids. Moving the method to the tokenizer makes it callable from pipeline.

Note that this is not a breaking change: we retain a get_decoder_prompt_ids method on the processor. This method simply calls get_decoder_prompt_ids on the tokenizer:

def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
    return self.tokenizer.get_decoder_prompt_ids(task=task, language=language, no_timestamps=no_timestamps)
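A minimal, self-contained sketch of this delegation pattern (the class names and return value here are illustrative stand-ins, not the actual WhisperProcessor/WhisperTokenizer implementations):

```python
# Illustrative sketch of the backward-compatible delegation described above.
# These tiny classes stand in for the real tokenizer/processor classes.

class Tokenizer:
    def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
        # The real method builds forced decoder token ids; here we just echo
        # the arguments so the delegation is easy to see.
        return {"task": task, "language": language, "no_timestamps": no_timestamps}


class Processor:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
        # Retained for backward compatibility: delegate to the tokenizer method.
        return self.tokenizer.get_decoder_prompt_ids(
            task=task, language=language, no_timestamps=no_timestamps
        )
```

Since the processor method is a pure pass-through, existing code that calls `processor.get_decoder_prompt_ids(...)` keeps working unchanged, while pipeline users can call the same method on the tokenizer directly.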

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Comment on lines +402 to +403
elif self.language in TO_LANGUAGE_CODE.values():
language_id = self.language
Contributor Author

@sanchit-gandhi sanchit-gandhi Dec 5, 2022


The processor's get_decoder_prompt_ids expected a language code id (e.g. "es"), whereas the tokenizer's set_prefix_tokens expected a language name (e.g. "Spanish"). This PR amends the tokenizer method to handle either.
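A minimal sketch of the normalization this diff hunk implements. The helper name `to_language_id` is hypothetical, and `TO_LANGUAGE_CODE` is shown here as a small illustrative subset of the tokenizer's full name-to-code mapping:

```python
# Illustrative subset of the Whisper tokenizer's language-name -> code mapping.
TO_LANGUAGE_CODE = {"english": "en", "spanish": "es", "french": "fr"}


def to_language_id(language):
    """Accept either a language name ("Spanish") or a code id ("es")."""
    language = language.lower()
    if language in TO_LANGUAGE_CODE:
        # A full language name, e.g. "spanish" -> "es".
        return TO_LANGUAGE_CODE[language]
    elif language in TO_LANGUAGE_CODE.values():
        # Already a language code id, e.g. "es" -> pass through unchanged.
        return language
    raise ValueError(f"Unsupported language: {language}")
```

The `elif` branch quoted in the diff is the pass-through case: if the string is already a valid code id, it is used as-is instead of being looked up in the mapping.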

Collaborator

@ArthurZucker ArthurZucker left a comment


Very nice thanks a lot! I remember it was temporary so that's a nice follow up!

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 5, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment


Nice fix and thanks for making it fully backward compatible!

@sanchit-gandhi sanchit-gandhi merged commit e7e6d18 into huggingface:main Dec 5, 2022
mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022
4 participants