
Add UDOP #22940

Merged (159 commits) on Mar 4, 2024

Conversation

@NielsRogge (Contributor) commented Apr 22, 2023

What does this PR do?

This PR adds UDOP as described in Unifying Vision, Text, and Layout for Universal Document Processing.

The model can be seen as an encoder-decoder Transformer with LayoutLMv3 as encoder and a T5 text decoder.

Fixes #20650

To do:

  • fix tests/models/udop/test_processor_udop.py::UdopProcessorTest::test_save_load_pretrained_default
  • include pytesseract decodings in processor test
  • check forward signature of the model as we can't change this afterwards
  • update organization to microsoft, replace ArthurZ/udop everywhere with an official UDOP checkpoint

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@plamb-viso

Hi @NielsRogge, thank you for pushing this PR. I haven't had the chance to try it yet, but I'm curious whether you have an example of, or have tried, a torch.jit.trace or ONNX conversion of UDOP? I know that's where I got stuck with the previous PR.

@dtiarks commented May 3, 2023

@plamb-viso My impression was always that tracing encoder-decoder models (e.g. BART) works fine, but exporting to ONNX via jit.trace is challenging. There's a research example for BART on how to do that: Bart + Beam Search to ONNX.

I think this is part of the reason the ONNX export has been offloaded to Optimum: #14222 (comment)

@plamb-viso commented May 8, 2023

Just want to confirm that with UdopProcessor we need to manually add the task prompt to each input. E.g. for document classification, I need to add the text document classification. and the box [0,0,0,0] to my words and bboxes before they go through the processor.

For example:

        prompt_text = ['document', 'classification.']
        prompt_boxes = [[0,0,0,0],[0,0,0,0]]
        processor.tokenizer(text=prompt_text, boxes=prompt_boxes)

And then prepend these input_ids/boxes to the ones that come out of the processor.

(Note that I am using apply_ocr=False.)
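As a toy illustration of that prepending step (all ids below are hypothetical placeholders, not real UDOP vocabulary ids; in practice they would come from processor.tokenizer):

```python
# Sketch: prepend a task prompt's ids/boxes to the processor's outputs,
# keeping the id list and the box list aligned at every position.
prompt_ids = [1845, 4501, 5]                     # hypothetical ids for "document classification."
prompt_boxes = [[0, 0, 0, 0]] * len(prompt_ids)  # one dummy box per prompt token

doc_ids = [100, 200, 300]                        # stand-ins for ids from UdopProcessor
doc_boxes = [[10, 10, 50, 20], [55, 10, 90, 20], [10, 25, 40, 35]]

input_ids = prompt_ids + doc_ids                 # prompt tokens come first
bbox = prompt_boxes + doc_boxes                  # boxes stay aligned with ids
assert len(input_ids) == len(bbox)
```

The key invariant is simply that the two lists are extended together, so position i of bbox always belongs to position i of input_ids.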

@plamb-viso

Also curious how we should encode the label of a training example. Is it part of the inputs to UdopProcessor?

The I-Code example appears to do it like this

@plamb-viso commented May 12, 2023

Thanks @dtiarks, it looks like a key component of that script is the BartBeamSearchGenerator, which allows you to convert the model to TorchScript. Will UDOP have something like this?

I tried some of the naive steps from this comment for tracing this new UDOP PR. It looks like the same issues remain. Curious whether we'll get a test/example of tracing/compiling/ONNX-exporting the model either here or in Optimum?

EDIT: just a naive try at ONNX export in Optimum:
KeyError: "udop is not supported yet."

And just for completeness, a torch.onnx.export gives:

RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":621, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case.  Argument types: Tensor, bool, int, int, Device, bool, NoneType,

Candidates:
	aten::full_like(Tensor self, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
	aten::full_like.out(Tensor self, Scalar fill_value, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)

@regisss (Contributor) commented May 12, 2023

@plamb-viso Here is the guide to add ONNX export support for a new architecture in Optimum: https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute
Feel free to open a PR there and we'll help you if you encounter any issue 🙂

@ArthurZucker (Collaborator) left a comment

I just looked at the tokenization; I didn't check the original codebase, but I think we should not touch the convert_to_token_id function. My reasoning is that this function should usually be called after tokenize, and only on tokens that are part of the original vocab.
Thus I also think that the addition of special tokens when initialising a model should not rely on this function (special tokens are not part of the vocab, but the additional_special_tokens setter uses it). I'll see if this is something that can be easily adapted.

Review threads on src/transformers/models/udop/tokenization_udop.py (outdated, resolved)
@ArthurZucker (Collaborator) left a comment

Some of the failing tests are also due to typos and a wrong slow-to-fast converter.

Review thread on src/transformers/convert_slow_tokenizer.py (outdated, resolved)
@Jordy-VL commented Jun 6, 2023

Highly anticipating this release! :) Keep up the great work

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@plamb-viso

Definitely still highly interested in this work

@dtiarks commented Jun 30, 2023

@ArthurZucker does #24565 fix the remaining issues of this PR?

@ArthurZucker (Collaborator)

Not sure it does, no! The added tokens were the issue, if I remember correctly.

@dtiarks commented Jul 2, 2023

OK. The question is how we can move this PR forward. @plamb-viso, @Jordy-VL, and I (and probably others) are definitely still interested in this.

@NielsRogge are you aware of other issues blocking this PR or do you have other priorities at the moment?

@ArthurZucker (Collaborator)

My current priority is #24629, then the tokenizer PR, which seems to be the last blocking factor. In the meantime, I think it would be good to get all the tests green and ask for a review to make it ready for a final one! The tokenizer can be updated afterwards 🤗 Sorry for the wait 😓

@dtiarks commented Jul 3, 2023

No worries @ArthurZucker ☺️. My comment was not meant to push anyone; I was just wondering whether I could contribute to speed up the process.

@NielsRogge (Contributor, Author) commented Jul 3, 2023

@ArthurZucker the tokenizer is the only thing left to make all tests green; the PR is ready other than that. The only remaining issue is the sentinel tokens that the UDOP authors defined (T5 has 100 of them, UDOP a lot more). Those are actually only relevant during pre-training, not during fine-tuning, hence the model is already perfectly usable.

I can only assign core maintainers for review when the CI is more or less green, so will do that once the tokenizer issue is fixed.

@AleRosae commented Jul 4, 2023

Hi @NielsRogge, are you planning to do one of your wonderful notebook tutorials once this PR is closed? I'm rather curious how we can approach a token-classification task with an encoder-decoder architecture such as UDOP :)

@Jordy-VL commented Jul 4, 2023
You can already check pix2struct ;)

@ArthurZucker (Collaborator)

Ok! Let me have a second look at the tokenizer then! There are currently quite a few issues with spm and AddedToken being taken care of.

@ArthurZucker (Collaborator) left a comment

I'm pushing the changes needed and will let you fix the tests. Overall, the extra-id tokens should be handled with extra care, see #24622.
Otherwise I pushed a tokenizer here:

Review threads on src/transformers/models/udop/tokenization_udop.py and src/transformers/convert_slow_tokenizer.py (outdated, resolved)
@ArthurZucker (Collaborator)

You have to manually add the tokens, and that can't be done in the init with the current API, but this allows us to remove the crazy regex in encoding.

@sromoam commented Jul 17, 2023

Eagerly anticipating this PR being merged. Is there any information on the priority of this work and rough timelines? Thank you @ArthurZucker and @NielsRogge for your great work.

@ArthurZucker self-requested a review on February 27, 2024, 06:54
@ArthurZucker (Collaborator) left a comment

Thanks!

Review threads on .circleci/create_circleci_config.py and src/transformers/models/udop/modeling_udop.py (resolved)
@NielsRogge NielsRogge merged commit 836921f into huggingface:main Mar 4, 2024
23 checks passed
@NielsRogge (Contributor, Author)

It's finally here folks!

3 checkpoints are on the hub:

@plamb-viso

Incredible, thank you. Hope you're able to do one of your amazing tutorials at some point as well.

@NielsRogge (Contributor, Author) commented Mar 4, 2024

Oh yes I'll upload the notebooks now ;) https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP

@ageorgiev97

Thanks a lot for the amazing work! Is there any chance you could upload a question-answering fine-tuning example, or give some ideas on how it should be done?

@felixvor commented Mar 6, 2024

Thank you very much for your work! We are very excited to experiment with UDOP!

@plamb-viso

For anyone following this thread, I looked today and saw that @NielsRogge added some UDOP tutorials: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP

damithsenanayake pushed a commit to damithsenanayake/transformers that referenced this pull request Mar 7, 2024
* First draft

* More improvements

* More improvements

* More fixes

* Fix copies

* More improvements

* More fixes

* More improvements

* Convert checkpoint

* More improvements, set up tests

* Fix more tests

* Add UdopModel

* More improvements

* Fix equivalence test

* More fixes

* Redesign model

* Extend conversion script

* Use real inputs for conversion script

* Add image processor

* Improve conversion script

* Add UdopTokenizer

* Add fast tokenizer

* Add converter

* Update README's

* Add processor

* Add fully fledged tokenizer

* Add fast tokenizer

* Use processor in conversion script

* Add tokenizer tests

* Fix one more test

* Fix more tests

* Fix tokenizer tests

* Enable fast tokenizer tests

* Fix more tests

* Fix additional_special_tokens of fast tokenizer

* Fix tokenizer tests

* Fix more tests

* Fix equivalence test

* Rename image to pixel_values

* Rename seg_data to bbox

* More renamings

* Remove vis_special_token

* More improvements

* Add docs

* Fix copied from

* Update slow tokenizer

* Update fast tokenizer design

* Make text input optional

* Add first draft of processor tests

* Fix more processor tests

* Fix decoder_start_token_id

* Fix test_initialization

* Add integration test

* More improvements

* Improve processor, add test

* Add more copied from

* Add more copied from

* Add more copied from

* Add more copied from

* Remove print statement

* Update README and auto mapping

* Delete files

* Delete another file

* Remove code

* Fix test

* Fix docs

* Remove asserts

* Add doc tests

* Include UDOP in exotic model tests

* Add expected tesseract decodings

* Add sentencepiece

* Use same design as T5

* Add UdopEncoderModel

* Add UdopEncoderModel to tests

* More fixes

* Fix fast tokenizer

* Fix one more test

* Remove parallelisable attribute

* Fix copies

* Remove legacy file

* Copy from T5Tokenizer

* Fix rebase

* More fixes, copy from T5

* More fixes

* Fix init

* Use ArthurZ/udop for tests

* Make all model tests pass

* Remove UdopForConditionalGeneration from auto mapping

* Fix more tests

* fixups

* more fixups

* fix the tokenizers

* remove un-necessary changes

* nits

* nits

* replace truncate_sequences_boxes with truncate_sequences for fix-copies

* nit current path

* add a test for input ids

* ids that we should get taken from c9f7a32

* nits converting

* nits

* apply ruff

* nits

* nits

* style

* fix slow order of addition

* fix udop fast range as well

* fixup

* nits

* Add docstrings

* Fix gradient checkpointing

* Update code examples

* Skip tests

* Update integration test

* Address comment

* Make fixup

* Remove extra ids from tokenizer

* Skip test

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update year

* Address comment

* Address more comments

* Address comments

* Add copied from

* Update CI

* Rename script

* Update model id

* Add AddedToken, skip tests

* Update CI

* Fix doc tests

* Do not use Tesseract for the doc tests

* Remove kwargs

* Add original inputs

* Update casting

* Fix doc test

* Update question

* Update question

* Use LayoutLMv3ImageProcessor

* Update organization

* Improve docs

* Update forward signature

* Make images optional

* Remove deprecated device argument

* Add comment, add add_prefix_space

* More improvements

* Remove kwargs

---------

Co-authored-by: ArthurZucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@plamb-viso commented Mar 12, 2024

Noting here that if you try to load the non-fast version of the tokenizer this way:

    proc = UdopProcessor.from_pretrained(MODEL_KEY, use_fast=False)

then when you go to tokenize labels, it calls base tokenization methods that miss the UDOP box-based overrides and produces this error:

    TypeError: UdopTokenizer.truncate_sequences() missing 1 required positional argument: 'token_boxes'

The same thing happens if you attempt this:

    tok = UdopTokenizer.from_pretrained(MODEL_KEY)
    img = LayoutLMv3ImageProcessor.from_pretrained(MODEL_KEY)
    proc = UdopProcessor(img, tok)

For anyone else who encounters this, you can get the effect of the non-fast tokenizer by setting the env var TOKENIZERS_PARALLELISM=false (I believe).
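For intuition, the error above arises because token ids and their bounding boxes must be truncated in lockstep. A minimal sketch of that idea (a hypothetical helper for illustration, not the actual UdopTokenizer method):

```python
def truncate_with_boxes(ids, token_boxes, num_tokens_to_remove):
    # Illustrative only: mirrors why UDOP's truncate_sequences needs a
    # token_boxes argument -- every removed id must drop its box too,
    # otherwise ids and boxes fall out of alignment.
    if num_tokens_to_remove <= 0:
        return ids, token_boxes
    return ids[:-num_tokens_to_remove], token_boxes[:-num_tokens_to_remove]

ids, boxes = truncate_with_boxes([1, 2, 3, 4], [[0, 0, 0, 0]] * 4, num_tokens_to_remove=2)
assert len(ids) == len(boxes) == 2
```

A base-class truncate_sequences that only sees ids has no way to keep this pairing, which is what the missing positional argument error is signalling.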

@ArthurZucker (Collaborator)

Could you open a separate issue, @plamb-viso?
To get a slow tokenizer you just need to use UdopTokenizer, but setting use_fast=False in the UdopProcessor.from_pretrained call might also do the trick.

cc @NielsRogge

itazap pushed a commit that referenced this pull request May 14, 2024

Successfully merging this pull request may close these issues.

[New Model] UDOP: Unifying Vision, Text, and Layout for Universal Document Processing