Use ActualText when getting the text for the text layer #20014
calixteman wants to merge 1 commit into mozilla:master
/botio test
From: Bot.io (Linux m4)
Received
Command cmd_test from @calixteman received. Current queue size: 0
Live output at: http://54.241.84.105:8877/b97b27b230d6a57/output.txt
From: Bot.io (Windows)
Received
Command cmd_test from @calixteman received. Current queue size: 0
Live output at: http://54.193.163.58:8877/1161b6cc8eacaff/output.txt
From: Bot.io (Linux m4)
Failed
Full output at http://54.241.84.105:8877/b97b27b230d6a57/output.txt
Total script time: 32.07 mins
Image differences available at: http://54.241.84.105:8877/b97b27b230d6a57/reftest-analyzer.html#web=eq.log
From: Bot.io (Windows)
Failed
Full output at http://54.193.163.58:8877/1161b6cc8eacaff/output.txt
Total script time: 58.70 mins
Image differences available at: http://54.193.163.58:8877/1161b6cc8eacaff/reftest-analyzer.html#web=eq.log
Can the args[1]?.get("ActualText") be exposed in the getOperatorList result as well?
e.g. something like this
    args = [
      args[0].name,
      args[1] instanceof Dict ? args[1].get("MCID") : null,
      args[1] instanceof Dict ? args[1].get("ActualText") : null, // <--- extra arg
    ];
in lines 2300 to 2303 in d2a6638
Not sure whether it's a breaking change, but it's crucial for reconstructing content (e.g. svg) from the results of getOperatorList() when not using getTextContent().
Could you file a bug and explain why it'd be useful to have such a feature?
Could it help to fix an existing issue in the current viewer?
@calixteman Ok, I'll open another ticket for it. I don't think it's related to the current issue with the viewer.
Actually, I opened the original ticket because I got wrong text from getOperatorList(). The viewer is also affected, so I used it to open the ticket because it's easier to reproduce than a code snippet.
I was building a pdf -> svg conversion tool with getOperatorList(). I found getTextContent() not useful for this: it only extracts text, the shape info can only be obtained from getOperatorList(), and there's no easy way to interleave the text and shapes back into the correct order from the results of both functions. So I ditched getTextContent() and use only getOperatorList(), including to obtain the text.
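To make the request concrete, here is a minimal sketch of how a consumer could rebuild text in drawing order from an operator list if the proposed extra argument existed. It is not how PDF.js behaves today: the args[2] ActualText slot is the hypothetical addition suggested above, the OPS values are stand-ins (real code would read them from pdfjsLib.OPS), and the glyph shape (objects with a unicode field, mixed with numeric spacing adjustments) is an assumption made for the example.

```javascript
// Stand-in opcode values for this sketch only; real code would use pdfjsLib.OPS.
const OPS = {
  showText: 1,
  beginMarkedContentProps: 2,
  beginMarkedContent: 3,
  endMarkedContent: 4,
};

// Walks an operator list shaped like { fnArray, argsArray } and returns the
// text in drawing order, substituting ActualText for the glyphs it replaces.
function extractText(opList) {
  const out = [];
  const stack = []; // one entry per open marked-content sequence
  for (let i = 0; i < opList.fnArray.length; i++) {
    const fn = opList.fnArray[i];
    const args = opList.argsArray[i];
    if (fn === OPS.beginMarkedContentProps) {
      // HYPOTHETICAL: args[2] is the proposed extra ActualText slot.
      const actualText = args[0] === "Span" ? (args[2] ?? null) : null;
      if (actualText !== null) {
        out.push(actualText);
      }
      stack.push(actualText);
    } else if (fn === OPS.beginMarkedContent) {
      stack.push(null); // a bare tag carries no replacement text
    } else if (fn === OPS.endMarkedContent) {
      stack.pop();
    } else if (fn === OPS.showText) {
      // Skip glyphs when any enclosing span supplied replacement text.
      if (stack.some(t => t !== null)) {
        continue;
      }
      // ASSUMPTION: glyph entries are objects with a `unicode` string;
      // plain numbers are spacing adjustments and are ignored here.
      const glyphs = args[0].filter(g => typeof g === "object" && g !== null);
      out.push(glyphs.map(g => g.unicode).join(""));
    }
  }
  return out.join("");
}
```

Because shapes and text come from the same list, the same loop could also emit path operators in order, which is the interleaving that is hard to recover from getTextContent() alone.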
/botio-linux preview
From: Bot.io (Linux m4)
Received
Command cmd_preview from @timvandermeij received. Current queue size: 0
Live output at: http://54.241.84.105:8877/ab0c7336a1331e6/output.txt
From: Bot.io (Linux m4)
Success
Full output at http://54.241.84.105:8877/ab0c7336a1331e6/output.txt
Total script time: 1.14 mins
Published
timvandermeij left a comment
Looks good to me, with the comments addressed and passing tests.
    flushTextContentItem();
    if (args[0]?.name === "Span") {
      textContentItem.span = stringToPDFString(
        args[1]?.get("ActualText") || ""
This would be a start at fixing that issue: the first step is getting this /ActualText into the text content. That issue asks for #processItems in src/display/text_layer.js to draw spans containing this actual text in the right places. That will also mean accumulating the text drawing that would otherwise have been done, to know the bounds of the glyphs that would be drawn, so that the bounds of the span can be calculated.
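The bounds calculation mentioned here could look roughly like the sketch below. It is only an illustration of the idea, not code from text_layer.js: the function name unionBounds and the { x, y, width, height } box shape are hypothetical, and it assumes the glyph boxes have already been collected in a common coordinate space.

```javascript
// Returns the smallest rectangle covering every glyph box, or null if the
// span drew no glyphs. Boxes are hypothetical { x, y, width, height } records.
function unionBounds(boxes) {
  if (boxes.length === 0) {
    return null;
  }
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const { x, y, width, height } of boxes) {
    minX = Math.min(minX, x);
    minY = Math.min(minY, y);
    maxX = Math.max(maxX, x + width);
    maxY = Math.max(maxY, y + height);
  }
  return { x: minX, y: minY, width: maxX - minX, height: maxY - minY };
}
```

A span carrying ActualText would then be positioned with the returned rectangle instead of the boxes of the individual glyphs it replaces.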
    });

    it("get the text a content stream containing some ActualText", async function () {
      const loadingTask = getDocument(buildGetDocumentParams("issue20007.pdf"));
I don't really know why, but the unit test failure suggests that this can't be loaded:
TEST-UNEXPECTED-FAIL | get the text a content stream containing some ActualText | in firefox | ResponseException: Unexpected server response (404) while retrieving PDF "http://127.0.0.1:38175/test/pdfs/issue20007.pdf". in http://127.0.0.1:38175/src/shared/util.js (line 501)
Moreover, is the movement in the reference test expected?
    if (args[0]?.name === "Span") {
      textContentItem.span = stringToPDFString(
        args[1]?.get("ActualText") || ""
      );
    }
I'm not sure what this addition does. This is for BMC, which takes just a tag, so there is never an args[1], is there? beginMarkedContentProps (below) is for BDC, which takes a tag and a dictionary.
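The distinction can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's actual code: the helper name actualTextFor is hypothetical, operators are passed as plain strings, and a Map stands in for the PDF.js Dict (only its get method is used).

```javascript
// Returns the ActualText for a marked-content operator, or null if it has none.
// BMC carries only a tag; BDC carries a tag plus a property list, which may be
// a dictionary (modelled here by anything with a get() method) or a name.
function actualTextFor(op, args) {
  if (op !== "BDC") {
    return null; // BMC has no second operand to read ActualText from
  }
  const [tag, props] = args;
  if (tag !== "Span" || typeof props?.get !== "function") {
    return null;
  }
  return props.get("ActualText") ?? null;
}
```

Under this reading, an ActualText lookup in the BMC handler would always come back empty, which is the point the comment raises.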
No description provided.