The scripts/code used to match the PDF miner outputs on documents to the XML representations #20

abirami005 · 2020-02-26T03:40:42Z

Do you provide the scripts/code that you developed to match the PDFMiner outputs on the documents to the XML representation of the PDF page itself? Thanks

zhxgj · 2020-02-27T21:42:17Z

We cannot open source the code at the moment as it is related to our IP protection.

bertsky · 2020-03-02T08:26:21Z

> We cannot open source the code at the moment as it is related to our IP protection.

Then how about publishing the alignment data themselves in some form?

zhxgj · 2020-03-02T21:42:53Z

> > We cannot open source the code at the moment as it is related to our IP protection.

> Then how about publishing the alignment data themselves in some form?

Hmm, I had not thought of that before. Let me check with our legal approval chain.

pollyMath · 2020-03-05T11:48:22Z

I assume this means that providing only the code for extracting annotations from XML representation is also not possible at the moment?

zhxgj · 2020-03-05T23:48:59Z

@pollyMath Unfortunately that is what our IP lawyer told us.

bertsky · 2021-01-11T16:48:44Z

> > > We cannot open source the code at the moment as it is related to our IP protection.

> > Then how about publishing the alignment data themselves in some form?

> Hmm, I had not thought of that before. Let me check with our legal approval chain.

@zhxgj Did your lawyers reach a verdict regarding the publication of PDF/XML alignment data?

Note: This is relevant to a number of potential applications of this corpus, for which some choices made in the COCO format would be incompatible or suboptimal, e.g.

- definition/granularity of region classes
- not annotating headers and footers
- not including reading order of regions
- not including text lines (contours / baselines)
- not including text content (plain) and text style (formatting)
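
To make the list above concrete, here is a minimal sketch of what a generic COCO-style detection annotation holds, next to the kinds of information mentioned above that have no slot in it. All values are invented, and the names in `WANTED_FIELDS` are hypothetical placeholders chosen for illustration; they are not fields of this corpus or of any published schema.

```python
# A generic COCO-style annotation record: one region, one class, one box.
# Every value here is made up for illustration.
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 1,                   # index into a fixed list of region classes
    "bbox": [50.0, 60.0, 300.0, 40.0],  # [x, y, width, height]
    "area": 12000.0,                    # width * height for a rectangular region
    "iscrowd": 0,
    "segmentation": [[50.0, 60.0, 350.0, 60.0, 350.0, 100.0, 50.0, 100.0]],
}

# Hypothetical keys standing in for the information listed above.
WANTED_FIELDS = {"reading_order", "text_lines", "text", "style"}


def missing_fields(record, wanted):
    """Return the wanted keys that the record does not carry, sorted by name."""
    return sorted(key for key in wanted if key not in record)


print(missing_fields(annotation, WANTED_FIELDS))
```

With the record above, all four hypothetical keys are reported as missing, which is why the underlying PDF/XML alignment data would be needed to recover them.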

ajjimeno · 2021-01-12T23:05:46Z

Unfortunately not yet. I understand the benefits, but we cannot release it yet. Thanks for your understanding.


bertsky mentioned this issue on Jan 12, 2021: Rewrite OCR-D/ocrd_kraken#33 (merged)
