Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,10 @@ const Highlight: FC<{
scale: number;
active?: boolean;
}> = ({ className, activeClassName, shape, scale, active }) => {
if (shape?.boxes.length === 0) {
return null;
}
Comment on lines +111 to +113
Copy link
Contributor Author

@fukudasjp fukudasjp Feb 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to avoid useless empty <div/> containers being generated. The divs are generated for all the highlights in a document regardless of the page where the highlight exists. With this fix, divs are generated for highlights on the current page.

image


return (
<div data-highlight-id={shape.highlightId}>
{shape?.boxes.map(item => {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -159,13 +159,17 @@ export class Highlighter {
const doMapping = (
items: TextBoxMappingResult,
textBoxMapping: TextBoxMapping,
parent: TextLayout<TextLayoutCell>
parent: TextLayout<TextLayoutCell>,
options?: { retainUnmapped: boolean }
) =>
flatMap(items, item => {
if (item.cell) {
const { cell: baseCell } = item.cell.getNormalized();
if (baseCell.parent === parent) {
const newItems = textBoxMapping.apply(item.cell);
if (newItems.length === 0 && options?.retainUnmapped) {
return [item];
}
return newItems.map(({ cell, sourceSpan }) => {
return {
cell,
Expand All @@ -180,7 +184,11 @@ export class Highlighter {

const { textToPdfTextItemMappings, textToHtmlBboxMappings } = this;
if (textToPdfTextItemMappings) {
items = doMapping(items, textToPdfTextItemMappings, this.textMappingsLayout);
items = doMapping(items, textToPdfTextItemMappings, this.textMappingsLayout, {
// keep the highlight items which are not mapped to PDF text content.
// Those items can be handled by the following HTML bbox mapping as a fallback
retainUnmapped: true
});
}
if (textToHtmlBboxMappings) {
items = doMapping(items, textToHtmlBboxMappings, this.textMappingsLayout);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,9 @@ describe('isSideBySideOnLine', () => {
it('should return true for side-by-side boxes', () => {
expect(isNextToEachOther([0, 0, 5, 2], [5, 0, 10, 2])).toBeTruthy();
});
it('should return true for horizontally overlapped boxes', () => {
expect(isNextToEachOther([0, 0, 5, 3], [4, 0, 10, 3])).toBeTruthy();
});
it('should return false when boxes are not vertically aligned', () => {
expect(isNextToEachOther([0, 0, 5, 2], [5, 1, 10, 3])).toBeFalsy();
});
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
import { Bbox, TextSpan } from '../../types';
import { bboxesIntersect } from '../../../../utils/box';
import { spanIntersection, spanLen } from '../../../../utils/textSpan';

/**
Expand Down Expand Up @@ -35,10 +34,6 @@ export function isNextToEachOther(boxA: Bbox, boxB: Bbox): boolean {
//
const OVERLAP_RATIO = 0.8;

if (bboxesIntersect(boxA, boxB)) {
return false;
}

const [leftA, topA, rightA, bottomA] = boxA;
const [leftB, topB, rightB, bottomB] = boxB;
const heightA = bottomA - topA;
Expand All @@ -56,6 +51,14 @@ export function isNextToEachOther(boxA: Bbox, boxB: Bbox): boolean {
}

// see if boxes can be neighborhoods
const verticalGap = Math.max(0, leftB - rightA, leftA - rightB);
return verticalGap < avgHeight;
const horizontalGap = Math.max(leftB - rightA, leftA - rightB);
if (horizontalGap < 0) {
// gap < 0 means that the boxes are overlapped
const overlap = -horizontalGap;
// consider two horizontally aligned overlapped boxes are next to each other when
// the overlap is smaller than half of the box height.
return overlap < avgHeight / 2;
}
Comment on lines +55 to +61
Copy link
Contributor Author

@fukudasjp fukudasjp Feb 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous code of disallowing any overlapping was too strict.

  if (bboxesIntersect(boxA, boxB)) {
    return false;
  }

Practically, PDF sometimes includes bboxes overlapped each other for a text. It caused an issue of fragmented highlights like the following screenshot.
image
So, this code is to allow that kind of overlapping. It will fix the issue of split highlights.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why you are using height to determine if boxes are horizontally aligned. That seems odd. What's the reasoning behind that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but my guess is that this is for picking a small size that is relative to the size of the characters. (Width wouldn't work due to differences in word length.) My initial thought would be to pick a specific small size, but this solution better accounts for potential issues with zoom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to to adapt to various font sizes. For font height of 72px, 30px is small enough. But for height of 12px, 30px overlapping means that some characters are most likely overlapped. In the case, given boxes cannot be considered as 'next to each other'.


return horizontalGap < avgHeight;
}
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@ import { isNextToEachOther } from '../common/bboxUtils';
import { TextLayoutCell, TextLayoutCellBase } from '../textLayout/types';

export class CellProvider {
private readonly skippedCells: TextLayoutCellBase[] = [];
private cells: readonly TextLayoutCellBase[];
private cursor: number = 0;

Expand Down Expand Up @@ -77,9 +76,13 @@ export class CellProvider {
*/
consume(length: number): TextLayoutCellBase[] {
const result: TextLayoutCellBase[] = [];
if (length <= 0) {
return result;
}

let lengthToConsume = length;
while (lengthToConsume > 0) {
const newCells = [...this.cells];
while (lengthToConsume > 0 && this.cursor < this.cells.length) {
const current = this.cells[this.cursor];
const bboxTextLength = current.text.length;

Expand All @@ -90,24 +93,32 @@ export class CellProvider {
result.push(consumed);

const remaining = current.getPartial([lengthToConsume, bboxTextLength]);
const newCells = [...this.cells];
newCells[this.cursor] = remaining;
this.cells = Object.freeze(newCells);
break;
}

result.push(current);
newCells[this.cursor] = current.getPartial([0, 0]);
lengthToConsume -= bboxTextLength;
this.cursor += 1;
}

this.cells = Object.freeze(newCells);
return result;
}

/**
* skip the current cell
*/
skip() {
this.skippedCells.push(this.cells[this.cursor]);
this.cursor += 1;
}

/**
* move the cursor to the top to reprocess remaining cells
*/
rewind() {
this.cursor = 0;
this.getNextCellsCache = null;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -28,10 +28,10 @@ export class MappingSourceTextProvider {
/**
* Find the best span where the give text matches to the rest of the text
*/
getMatch(text: string) {
getMatch(text: string, minLength = 1) {
const normalizedText = this.normalizer.normalize(text);
debug('getMatch "%s", normalized "%s"', text, normalizedText);
const normalizedMatches = this.provider.getMatches(normalizedText);
debug('getMatch "%s", normalized "%s", minLength = ', text, normalizedText, minLength);
const normalizedMatches = this.provider.getMatches(normalizedText, minLength);
debug('normalized matches: %o', normalizedMatches);

// find best
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -71,4 +71,12 @@ export class MappingTargetBoxProvider {
this.current = null;
this.cellProvider.skip();
}

/**
* move the cursor to the top to reprocess remaining target cells
*/
rewind() {
this.current = null;
this.cellProvider.rewind();
}
}
Original file line number Diff line number Diff line change
@@ -1,8 +1,7 @@
import minBy from 'lodash/minBy';
import { nonEmpty } from 'utils/nonEmpty';
import { TextSpan } from '../../types';
import { bboxesIntersect } from '../../../../utils/box';
import { spanLen, spanMerge } from '../../../../utils/textSpan';
import { spanLen } from '../../../../utils/textSpan';
import { TextLayout, TextLayoutCell, TextLayoutCellBase } from '../textLayout/types';
import { MappingSourceTextProvider } from './MappingSourceTextProvider';
import { MappingTargetBoxProvider } from './MappingTargetCellProvider';
Expand Down Expand Up @@ -30,28 +29,29 @@ export function getTextBoxMappings<
const source = new Source(sourceLayout, targetLayout);
const builder = new TextBoxMappingBuilder();

target.processText((targetCellId, targetText, markTargetAsMapped) => {
const matchInSource = source.findMatch(targetCellId, targetText);
if (matchInSource) {
const mappedTargetCells = markTargetAsMapped(matchInSource.matchLength);

let mappedSourceFullSpan: TextSpan = [0, 0];
mappedTargetCells.forEach(targetCell => {
const mappedSourceSpan = matchInSource.markSourceAsMapped(targetCell.text);
if (mappedSourceSpan) {
builder.addMapping(
{ cell: matchInSource.cell, span: mappedSourceSpan },
{ cell: targetCell }
);
mappedSourceFullSpan = spanMerge(mappedSourceFullSpan, mappedSourceSpan);
}
});
if (spanLen(mappedSourceFullSpan) > 0) {
matchInSource.markSourceMappedBySpan(mappedSourceFullSpan);
for (const minMatchLength of [27, 9, 3, 1]) {
debug('getTextBoxMapping: processText with minMatchLength: %d', minMatchLength);
Comment on lines +32 to +33
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match longer chunks first to fix scattered highlights
image

target.processText((targetCellId, targetText, markTargetAsMapped) => {
if (targetText.length < minMatchLength) {
return;
}
}
});

const matchInSource = source.findMatch(targetCellId, targetText, minMatchLength);
if (matchInSource) {
const mappedTargetCells = markTargetAsMapped(matchInSource.matchLength);

mappedTargetCells.forEach(targetCell => {
const mappedSourceSpan = matchInSource.markSourceAsMapped(targetCell.text);
if (mappedSourceSpan) {
builder.addMapping(
{ cell: matchInSource.cell, span: mappedSourceSpan },
{ cell: targetCell }
);
}
});
matchInSource.markAsMapped();
}
});
}
return builder.toTextBoxMapping();
}

Expand Down Expand Up @@ -98,6 +98,7 @@ class Target<TargetCell extends TextLayoutCell> {
this.targetProvider.skip();
}
}
this.targetProvider.rewind();
}
}

Expand Down Expand Up @@ -135,14 +136,17 @@ class Source<SourceCell extends TextLayoutCell, TargetCell extends TextLayoutCel
* with the target cell of given `targetCellId`
* @param targetCellId
* @param text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add mention of param minLength and what it does.

* @param minLength the minimal length of matched text. this function returns the result
* only when a match longer than `minLength` is found. Otherwise `null`.
* In case the `text` is shorter than `minLength`, this always return `null`.
* @return matched source information and functions to mark the matched span as mapped
*/
findMatch(targetCellId: TargetCell['id'], text: string) {
findMatch(targetCellId: TargetCell['id'], text: string, minLength = 1) {
const candidateSources = this.targetIndexToSources[targetCellId];
const bestMatch = Source.findBestMatch(candidateSources, text);
const bestMatch = Source.findBestMatch(candidateSources, text, minLength);
debug('> source cell(s) matched: %o', bestMatch);

if (!bestMatch?.match || spanLen(bestMatch.match.span) === 0) {
if (!bestMatch?.match || spanLen(bestMatch.match.span) < minLength) {
return null;
}

Expand All @@ -155,13 +159,19 @@ class Source<SourceCell extends TextLayoutCell, TargetCell extends TextLayoutCel
matchLength: spanLen(matchedSourceSpan),
markSourceAsMapped: (text: string) => {
const mappedSource = matchedSourceProvider.getMatch(text);
if (mappedSource?.span) {
// mark the span in the source provider as 'used' so that other texts from target
// are not mapped to the same source span
matchedSourceProvider.consume(mappedSource.span);
}
debug('>> target cell %o to source %o', text, mappedSource);
return mappedSource?.span;
},
markSourceMappedBySpan: (span: TextSpan) => {
if (spanLen(span) > 0) {
matchedSourceProvider.consume(span);
}
markAsMapped: () => {
// mark entire the match in the source provider as 'used'.
// this need to called after mapping to target text are built using `markSourceAsMapped`.
// the matched span in the source is mapped to target.
matchedSourceProvider.consume(matchedSourceSpan);
}
};
}
Expand All @@ -177,11 +187,12 @@ class Source<SourceCell extends TextLayoutCell, TargetCell extends TextLayoutCel
cell: TextLayoutCell;
provider: MappingSourceTextProvider;
}[],
textToMatch: string
textToMatch: string,
minLength: number
) {
// find matches
const matches = sources.map(source => {
const match = source.provider.getMatch(textToMatch);
const match = source.provider.getMatch(textToMatch, minLength);
return { ...source, match };
});

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,5 +24,7 @@ $highlight: #d0e2ff;
.highlight {
opacity: 0.3;
background: darken($highlight, 30%);
border: 2px solid #000000;
box-sizing: content-box;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,6 @@ import document from 'components/DocumentPreview/__fixtures__/Art Effects Koya C
import { document as docJa } from 'components/DocumentPreview/__fixtures__/DiscoComponent-ja.pdf';
import documentJa from 'components/DocumentPreview/__fixtures__/DiscoComponents-ja_document.json';

import PDFJS from 'pdfjs-dist';
(PDFJS as any).cMapUrl = './node_modules/pdfjs-dist/cmaps/';
(PDFJS as any).cMapPacked = true;

const pageKnob = {
label: 'Page',
options: {
Expand Down
4 changes: 2 additions & 2 deletions yarn.lock
Original file line number Diff line number Diff line change
Expand Up @@ -2163,7 +2163,7 @@ __metadata:
languageName: node
linkType: hard

"@ibm-watson/discovery-react-components@^1.5.0-beta.24, @ibm-watson/discovery-react-components@workspace:packages/discovery-react-components":
"@ibm-watson/discovery-react-components@^1.5.0-beta.25, @ibm-watson/discovery-react-components@workspace:packages/discovery-react-components":
version: 0.0.0-use.local
resolution: "@ibm-watson/discovery-react-components@workspace:packages/discovery-react-components"
dependencies:
Expand Down Expand Up @@ -10664,7 +10664,7 @@ __metadata:
dependencies:
"@carbon/icons": ^10.5.0
"@cypress/webpack-preprocessor": ^5.11.0
"@ibm-watson/discovery-react-components": ^1.5.0-beta.24
"@ibm-watson/discovery-react-components": ^1.5.0-beta.25
"@ibm-watson/discovery-styles": ^1.5.0-beta.24
"@testing-library/cypress": ^7.0.7
"@types/proper-url-join": ^2
Expand Down