Word delimiter support, fixes #2637, #2556, #2553, #2522, #2661

Closed
wants to merge 11 commits into from

JorjMcKie
Collaborator

For text extraction with get_text("words"), or extractWORDS, words are defined as strings that contain no white space. This change allows specifying up to 64 additional characters that also function as delimiters. This makes it possible, for instance, to separate words from punctuation or to decompose an e-mail address into its components.
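To illustrate the idea, here is a minimal pure-Python sketch of splitting on white space plus extra delimiter characters. It is not the code of this change: `split_words`, its `delimiters` parameter and `MAX_DELIMITERS` are hypothetical names used only for illustration, and the real extraction also has to keep track of word bounding boxes, which this sketch ignores.

```python
import re

MAX_DELIMITERS = 64  # limit taken from the description above


def split_words(text, delimiters=""):
    """Split text on white space plus any extra delimiter characters.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    if len(delimiters) > MAX_DELIMITERS:
        raise ValueError("at most 64 extra delimiter characters are allowed")
    if not delimiters:
        # Default behaviour: a word is a run of non-white-space characters.
        return text.split()
    # Character class matching white space or any of the extra delimiters.
    pattern = r"[\s" + re.escape(delimiters) + r"]+"
    # Drop empty strings produced by leading or trailing delimiters.
    return [word for word in re.split(pattern, text) if word]


line = "Contact: john.doe@example.com, thanks!"
print(split_words(line))
print(split_words(line, delimiters="@.,:!"))
```

With no extra delimiters, the e-mail address stays one word together with its trailing comma; with `"@.,:!"` it is decomposed into its components and the punctuation is dropped.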

Other changes:

Fixing #2522: correcting the typo

Remove some unnecessary setting of flags when creating annotations.

Fixing #2553:
Adjust plain text extraction to use the same approach as the other variants. This entails using Unicode escape strings on output instead of the output of fz_chartorune. As a further consequence, standard text output is now directed to a fz_buffer instead of a fz_output.

Fixing #2556: Add checks for the existence of path dictionaries at every possible place. Includes an additional test function.

Add functions JM_ignore_rect / JM_ignore_irect, which return a bool: True if the rectangle should be ignored. This is the case for infinite and empty rectangles, and also for any rectangle that shares an edge with the infinite rectangle.
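The rule can be sketched in Python as follows. This is an illustration only, not the C code of JM_ignore_rect: the function name `ignore_rect` is hypothetical, the `INF_MIN` / `INF_MAX` values are assumed coordinates of the infinite rectangle, and "empty" is assumed to mean non-positive width or height.

```python
# Assumed coordinates of the infinite rectangle (placeholders for illustration).
INF_MIN = -2147483648
INF_MAX = 2147483520


def ignore_rect(rect):
    """Return True if the rectangle (x0, y0, x1, y1) should be ignored."""
    x0, y0, x1, y1 = rect
    # Empty rectangle: assumed here to mean non-positive width or height.
    if x0 >= x1 or y0 >= y1:
        return True
    # Any edge in common with the infinite rectangle. This also covers
    # the infinite rectangle itself, which shares all four edges.
    return x0 == INF_MIN or y0 == INF_MIN or x1 == INF_MAX or y1 == INF_MAX


print(ignore_rect((0, 0, 100, 100)))                      # ordinary rectangle
print(ignore_rect((10, 10, 10, 50)))                      # zero width
print(ignore_rect((INF_MIN, INF_MIN, INF_MAX, INF_MAX)))  # infinite rectangle
```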

Support a variable character border width for insert_text() / insert_textbox(). The value is a factor that is multiplied by the font size. The default is 0.05 (read: 5% of the font size). This value is relevant for text rendering modes 1 and 2 only.
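As a quick illustration of the arithmetic, the sketch below computes the resulting border width. The function name `border_width` and its parameter names are hypothetical and not the names used by insert_text() / insert_textbox(); returning `None` for other rendering modes is just this sketch's way of saying the value is not relevant there.

```python
def border_width(fontsize, factor=0.05, render_mode=0):
    """Return the character border width, or None if it is not relevant.

    Hypothetical helper: the width is the factor multiplied by the font
    size and only matters for text rendering modes 1 and 2.
    """
    if render_mode not in (1, 2):
        return None
    return factor * fontsize


print(border_width(12, render_mode=1))              # default factor, 5% of 12
print(border_width(20, factor=0.1, render_mode=2))  # custom factor
print(border_width(12))                             # mode 0: not relevant
```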

Fixing #2637:
In Page.insert_textbox, when the last word of a line did not fit in the line buffer, the line position was not increased. This is now handled correctly.

@JorjMcKie
Collaborator Author

@julian-smith-artifex-com - I also made the corresponding "rebased" fixes. I hope I spotted them all ...

@JorjMcKie JorjMcKie closed this Sep 19, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Sep 19, 2023
@JorjMcKie JorjMcKie deleted the word-delimiters branch September 28, 2023 12:07