fix: fix and improve PDF highlighting#313
Conversation
- story for user-provided PDF and document JSON - fix cmap URL - make storybook provide cmap files
- to fix highlight boxes - for more efficient text box mapping
- give priority to longer match - add min match length to match functions - support `rewind` in cell provider
| if (shape?.boxes.length === 0) { | ||
| return null; | ||
| } |
| if (horizontalGap < 0) { | ||
| // gap < 0 means that the boxes are overlapped | ||
| const overlap = -horizontalGap; | ||
| // consider two horizontally aligned overlapped boxes are next to each other when | ||
| // the overlap is smaller than half of the box height. | ||
| return overlap < avgHeight / 2; | ||
| } |
There was a problem hiding this comment.
the previous code of disallowing any overlapping was too strict.
if (bboxesIntersect(boxA, boxB)) {
return false;
}
Practically, PDF sometimes includes bboxes overlapped each other for a text. It caused an issue of fragmented highlights like the following screenshot.

So, this code is to allow that kind of overlapping. It will fix the issue of split highlights.
There was a problem hiding this comment.
I don't fully understand why you are using height to determine if boxes are horizontally aligned. That seems odd. What's the reasoning behind that?
There was a problem hiding this comment.
Not sure, but my guess is that this is for picking a small size that is relative to the size of the characters. (Width wouldn't work due to differences in word length.) My initial thought would be to pick a specific small size, but this solution better accounts for potential issues with zoom.
There was a problem hiding this comment.
This is to to adapt to various font sizes. For font height of 72px, 30px is small enough. But for height of 12px, 30px overlapping means that some characters are most likely overlapped. In the case, given boxes cannot be considered as 'next to each other'.
| markSourceAsMapped: (text: string) => { | ||
| const mappedSource = matchedSourceProvider.getMatch(text); | ||
| if (mappedSource?.span) { | ||
| matchedSourceProvider.consume(mappedSource.span); | ||
| } | ||
| debug('>> target cell %o to source %o', text, mappedSource); | ||
| return mappedSource?.span; | ||
| }, | ||
| markSourceMappedBySpan: (span: TextSpan) => { | ||
| if (spanLen(span) > 0) { | ||
| matchedSourceProvider.consume(span); | ||
| } | ||
| markAsMapped: () => { | ||
| matchedSourceProvider.consume(matchedSourceSpan); | ||
| } |
There was a problem hiding this comment.
These changes are to consume text in source cell properly.
Without the first one, a single character in a source could be mapped to multiple characters in target (i.e. bbox in PDF text item). Without the second one, many characters in target couldn't be mapped to source text.
There was a problem hiding this comment.
Add comments to explain the code.
| const mappedItems = doMapping(items, textToPdfTextItemMappings, this.textMappingsLayout); | ||
| if (mappedItems.length > 0) { | ||
| items = mappedItems; | ||
| } |
There was a problem hiding this comment.
When the length of the result of doMapping is zero, it means that mapping fails to find bboxes on PDF using PDF text items.
In the case, mapping from bboxes on HTML fields should be used as a fallback. But the previous code didn't do that.
So, this is to fix the behavior. In the updated code, when mapping returns bboxes, use them. Fall back to use bbox form HTML fields otherwise.
There was a problem hiding this comment.
Please add a comment in the code about this, since the intent isn't evident just from the code itself.
| for (const minMatchLength of [27, 9, 3, 1]) { | ||
| debug('getTextBoxMapping: processText with minMatchLength: %d', minMatchLength); |
|
Not sure if related to this PR or if it's a separate issue, but in Storybook > DocumentPreview > table highlight, when selecting the HTML option in Knobs, it fails to render and shows this error: |
jhpedemonte
left a comment
There was a problem hiding this comment.
Code looks good. Have some minor suggestions to add code comments, plus a few questions. With those fixed, approved.
| const mappedItems = doMapping(items, textToPdfTextItemMappings, this.textMappingsLayout); | ||
| if (mappedItems.length > 0) { | ||
| items = mappedItems; | ||
| } |
There was a problem hiding this comment.
Please add a comment in the code about this, since the intent isn't evident just from the code itself.
| if (horizontalGap < 0) { | ||
| // gap < 0 means that the boxes are overlapped | ||
| const overlap = -horizontalGap; | ||
| // consider two horizontally aligned overlapped boxes are next to each other when | ||
| // the overlap is smaller than half of the box height. | ||
| return overlap < avgHeight / 2; | ||
| } |
There was a problem hiding this comment.
I don't fully understand why you are using height to determine if boxes are horizontally aligned. That seems odd. What's the reasoning behind that?
| @@ -137,12 +138,12 @@ class Source<SourceCell extends TextLayoutCell, TargetCell extends TextLayoutCel | |||
| * @param text | |||
There was a problem hiding this comment.
Add mention of param minLength and what it does.
| markSourceAsMapped: (text: string) => { | ||
| const mappedSource = matchedSourceProvider.getMatch(text); | ||
| if (mappedSource?.span) { | ||
| matchedSourceProvider.consume(mappedSource.span); | ||
| } | ||
| debug('>> target cell %o to source %o', text, mappedSource); | ||
| return mappedSource?.span; | ||
| }, | ||
| markSourceMappedBySpan: (span: TextSpan) => { | ||
| if (spanLen(span) > 0) { | ||
| matchedSourceProvider.consume(span); | ||
| } | ||
| markAsMapped: () => { | ||
| matchedSourceProvider.consume(matchedSourceSpan); | ||
| } |
There was a problem hiding this comment.
Add comments to explain the code.



What do these changes do/fix?
minLengthparameter to many of text match functions. fix: fix and improve PDF highlighting #313 (comment)How do you test/verify these changes?
Please try the storybook and see if PDF highlights works as before.
Have you documented your changes (if necessary)?
n/a
Are there any breaking changes included in this pull request?
no