PretrainedTokenizer::truncateHelper: prevent array_slice() error for flawed text input (summarization)#36
Conversation
with the if-clause in PretrainedTokenizer::truncateHelper
certain input may result in the following error:
array_slice(): Argument CodeWithKyrian#1 ($array) must be of type array,
null given
two tests were added to prove that the fix is working:
1. SummarizationPipelineTest:
Integration test which checks behavior using a real model
and some extracted text from a PDF. I think there is a better
way to accomplish the same test result, because this one test
runs 10+ sec. locally.
2. PretrainedTokenizerTest:
Unit test to check PretrainedTokenizer::truncateHelper itself.
The input is flawed by design, which would trigger the error
without the fix.
|
Thank you so much for your contribution, @k00ni. I'm happy with the changes for the If you notice, there aren't many tests in the library, which I'm not proud of. This is because I want to take my time to decide on the best structure for the tests. Classes like Tensor and Image can be easily tested, and I included tests for the basic tokenizer because the config file sizes are relatively small. However, the overall testing structure of the library is still largely undecided. For now, let's leave out the tests. Since you seem to have a keen interest in testing, I would really appreciate your suggestions on how best to structure tests for the project. |
|
I will reduce the PR to the fix and will write my feedback about the tests in a separate issue. |
|
That's great. Looking forward to your suggestions |
What:
Description:
When running a summarization task, such as
certain input may result in the following error in
PretrainedTokenizer::truncateHelper:My tests suggest that texts with lines, containing (almost?) no text, cause this problem (Example). Its not model-dependent.
Two tests were added to prove that the fix is working:
SummarizationPipelineTest: Integration test which checks behavior using a real model and some extracted text from a PDF (original source: https://arxiv.org/abs/2309.06888). I think there is a better way to accomplish the same test result, because the test runs 10+ secondsPretrainedTokenizerTest: Unit test to checkPretrainedTokenizer::truncateHelperitself. The input is flawed by design, which would trigger the error without the fix.I marked the PR as draft because I would like to hear your opinion first.