Skip to content

fix: fix and improve PDF highlighting#313

Merged
fukudasjp merged 10 commits intomasterfrom
fix/pdf-highlight
Mar 4, 2022
Merged

fix: fix and improve PDF highlighting#313
fukudasjp merged 10 commits intomasterfrom
fix/pdf-highlight

Conversation

@fukudasjp
Copy link
Contributor

@fukudasjp fukudasjp commented Feb 27, 2022

What do these changes do/fix?

How do you test/verify these changes?

Please try the storybook and see if PDF highlights works as before.

Have you documented your changes (if necessary)?

n/a

Are there any breaking changes included in this pull request?

no

- story for user-provided PDF and document JSON
- fix cmap URL
- make storybook provide cmap files
- to fix highlight boxes
- for more efficient text box mapping
- give priority to longer match
- add min match length to match functions
- support `rewind` in cell provider
Comment on lines +111 to +113
if (shape?.boxes.length === 0) {
return null;
}
Copy link
Contributor Author

@fukudasjp fukudasjp Feb 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to avoid useless empty <div/> containers being generated. The divs are generated for all the highlights in a document regardless of the page where the highlight exists. With this fix, divs are generated for highlights on the current page.

image

Comment on lines +55 to +61
if (horizontalGap < 0) {
// gap < 0 means that the boxes are overlapped
const overlap = -horizontalGap;
// consider two horizontally aligned overlapped boxes are next to each other when
// the overlap is smaller than half of the box height.
return overlap < avgHeight / 2;
}
Copy link
Contributor Author

@fukudasjp fukudasjp Feb 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous code of disallowing any overlapping was too strict.

  if (bboxesIntersect(boxA, boxB)) {
    return false;
  }

Practically, PDF sometimes includes bboxes overlapped each other for a text. It caused an issue of fragmented highlights like the following screenshot.
image
So, this code is to allow that kind of overlapping. It will fix the issue of split highlights.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why you are using height to determine if boxes are horizontally aligned. That seems odd. What's the reasoning behind that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but my guess is that this is for picking a small size that is relative to the size of the characters. (Width wouldn't work due to differences in word length.) My initial thought would be to pick a specific small size, but this solution better accounts for potential issues with zoom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to to adapt to various font sizes. For font height of 72px, 30px is small enough. But for height of 12px, 30px overlapping means that some characters are most likely overlapped. In the case, given boxes cannot be considered as 'next to each other'.

Comment on lines 157 to 167
markSourceAsMapped: (text: string) => {
const mappedSource = matchedSourceProvider.getMatch(text);
if (mappedSource?.span) {
matchedSourceProvider.consume(mappedSource.span);
}
debug('>> target cell %o to source %o', text, mappedSource);
return mappedSource?.span;
},
markSourceMappedBySpan: (span: TextSpan) => {
if (spanLen(span) > 0) {
matchedSourceProvider.consume(span);
}
markAsMapped: () => {
matchedSourceProvider.consume(matchedSourceSpan);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are to consume text in source cell properly.

Without the first one, a single character in a source could be mapped to multiple characters in target (i.e. bbox in PDF text item). Without the second one, many characters in target couldn't be mapped to source text.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments to explain the code.

Comment on lines 183 to 186
const mappedItems = doMapping(items, textToPdfTextItemMappings, this.textMappingsLayout);
if (mappedItems.length > 0) {
items = mappedItems;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the length of the result of doMapping is zero, it means that mapping fails to find bboxes on PDF using PDF text items.

In the case, mapping from bboxes on HTML fields should be used as a fallback. But the previous code didn't do that.

So, this is to fix the behavior. In the updated code, when mapping returns bboxes, use them. Fall back to use bbox form HTML fields otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment in the code about this, since the intent isn't evident just from the code itself.

Comment on lines +32 to +33
for (const minMatchLength of [27, 9, 3, 1]) {
debug('getTextBoxMapping: processText with minMatchLength: %d', minMatchLength);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match longer chunks first to fix scattered highlights
image

@fukudasjp fukudasjp requested a review from wataken-ibm March 2, 2022 10:58
@fukudasjp fukudasjp marked this pull request as ready for review March 2, 2022 10:59
@jhpedemonte
Copy link
Member

Not sure if related to this PR or if it's a separate issue, but in Storybook > DocumentPreview > table highlight, when selecting the HTML option in Knobs, it fails to render and shows this error:

Uncaught Error: Failed to find text node. Node: Information Security., offset: 274502
    at getTextNodeAndOffset (documentUtils.ts:108:1)
    at findOffsetInDOM (documentUtils.ts:53:1)
    at HtmlView.tsx:153:1
    at Array.forEach (<anonymous>)
    at HtmlView.tsx:147:1
    at invokePassiveEffectCreate (react-dom.development.js:23487:1)
    at HTMLUnknownElement.callCallback (react-dom.development.js:3945:1)
    at Object.invokeGuardedCallbackDev (react-dom.development.js:3994:1)
    at invokeGuardedCallback (react-dom.development.js:4056:1)
    at flushPassiveEffectsImpl (react-dom.development.js:23574:1)

Copy link
Member

@jhpedemonte jhpedemonte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. Have some minor suggestions to add code comments, plus a few questions. With those fixed, approved.

Comment on lines 183 to 186
const mappedItems = doMapping(items, textToPdfTextItemMappings, this.textMappingsLayout);
if (mappedItems.length > 0) {
items = mappedItems;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment in the code about this, since the intent isn't evident just from the code itself.

Comment on lines +55 to +61
if (horizontalGap < 0) {
// gap < 0 means that the boxes are overlapped
const overlap = -horizontalGap;
// consider two horizontally aligned overlapped boxes are next to each other when
// the overlap is smaller than half of the box height.
return overlap < avgHeight / 2;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why you are using height to determine if boxes are horizontally aligned. That seems odd. What's the reasoning behind that?

@@ -137,12 +138,12 @@ class Source<SourceCell extends TextLayoutCell, TargetCell extends TextLayoutCel
* @param text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add mention of param minLength and what it does.

Comment on lines 157 to 167
markSourceAsMapped: (text: string) => {
const mappedSource = matchedSourceProvider.getMatch(text);
if (mappedSource?.span) {
matchedSourceProvider.consume(mappedSource.span);
}
debug('>> target cell %o to source %o', text, mappedSource);
return mappedSource?.span;
},
markSourceMappedBySpan: (span: TextSpan) => {
if (spanLen(span) > 0) {
matchedSourceProvider.consume(span);
}
markAsMapped: () => {
matchedSourceProvider.consume(matchedSourceSpan);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments to explain the code.

@fukudasjp
Copy link
Contributor Author

In most cases, this improves highlight accuracy, but there are some sideeffects by changing the algorithm. I'll further update the algorithm in upcoming PR.

image

(left: highlight is shown, right: highlight is missing - this reproduces when same text are repeated (in title and body in this case))

@fukudasjp fukudasjp merged commit cc53e51 into master Mar 4, 2022
@fukudasjp fukudasjp deleted the fix/pdf-highlight branch March 4, 2022 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants