Implementation of advanced cmap encodings #2356

stefan6419846 · 2023-12-22T10:44:45Z

Currently, I am trying to extract text from PDF files which partially report some warnings like

/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_cmap.py:183: PdfReadWarning: Advanced encoding /GBK2K-H not implemented yet
  warnings.warn(
/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_cmap.py:183: PdfReadWarning: Advanced encoding /GBK2K-V not implemented yet
  warnings.warn(

I have seen this for both encodings mentioned above and for /StandardEncoding.

Digging through the available resources related to the GBK2K cmaps, I found some Adobe resources as well as the implementation from pdfminer.six, which ships some custom pickled files derived from the Adobe open source components to handle such cases.

Is there any guidance available on how to tackle this or how we would like to see this added to pypdf?

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.3, crypt_provider=('pycryptodome', '3.18.0'), PIL=10.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
page = reader.pages[0]
print(page.extract_text())

For now, I have no non-confidential file that I could share here. Looking at the example file, it seems to be a scan of a document (from a Canon device?) containing Latin characters, with a wrongly configured or otherwise strange OCR that yields a mix of Latin and Chinese characters inside the text layer.

Traceback

warnings.warn, as currently used, only prints the pypdf code line where the warning occurred, so there is not much of a traceback.
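A fuller traceback can usually be obtained by escalating the warning to an exception with the standard `warnings` filters. The sketch below is only an illustration: `PdfReadWarning` is a local stand-in for `pypdf.errors.PdfReadWarning`, and `extract_text_stub` is a hypothetical placeholder for `page.extract_text()`, so that the example runs without pypdf or a PDF file.

```python
import traceback
import warnings


class PdfReadWarning(UserWarning):
    """Local stand-in for pypdf.errors.PdfReadWarning (assumption for this sketch)."""


def extract_text_stub() -> str:
    # Hypothetical placeholder for page.extract_text(); it only emits the warning.
    warnings.warn("Advanced encoding /GBK2K-H not implemented yet", PdfReadWarning)
    return ""


def run_with_traceback() -> str:
    """Turn the warning into an exception and return the formatted traceback."""
    with warnings.catch_warnings():
        # "error" makes matching warnings raise instead of print.
        warnings.simplefilter("error", PdfReadWarning)
        try:
            extract_text_stub()
        except PdfReadWarning:
            return traceback.format_exc()
    return ""


tb = run_with_traceback()
print(tb)
```

With the real library, the same filter would be applied around the `extract_text()` call, or the script could be started with `python -W error`, which escalates all warnings.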


MartinThoma · 2023-12-25T11:05:55Z

Is there any guidance available on how to tackle this or how we would like to see this added to pypdf?

No, there is none. I guess only @pubpub-zz can help you with that.

pubpub-zz · 2023-12-27T13:28:57Z

@stefan6419846
try to modify _cmap.py with

_predefined_cmap: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",  # <- new
    "/GBK2K-V": "gb18030", # <- new
    # UCS2 in code
}
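As a rough illustration of what such a table entry implies: the values are names of Python codecs, so the bytes of a string drawn with one of these cmaps would be decoded with the matching codec. The helper `decode_with_cmap` and the excerpt dictionary below are hypothetical names for this sketch and do not show how pypdf applies the table internally.

```python
from typing import Dict

# Excerpt of the suggested table; only the two new entries are repeated here.
predefined_cmap_excerpt: Dict[str, str] = {
    "/GBK2K-H": "gb18030",
    "/GBK2K-V": "gb18030",
}


def decode_with_cmap(raw: bytes, cmap_name: str) -> str:
    """Decode raw string bytes with the Python codec mapped to cmap_name."""
    codec = predefined_cmap_excerpt[cmap_name]
    return raw.decode(codec)


# Round trip: encode sample text the way a GBK2K font would store it,
# then decode it again through the table.
sample = "中文".encode("gb18030")
decoded = decode_with_cmap(sample, "/GBK2K-H")
print(decoded)
```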

stefan6419846 · 2023-12-29T10:49:26Z

@pubpub-zz Thanks for pointing this out. It seems to indeed work.

When looking at this, two questions arose for me:

1. Why do we not declare the complete mapping already if this seems to be easy enough to do? https://github.com/adobe-type-tools/cmap-resources lists quite a few more possible character maps.
2. Is there an easy way to generate corresponding test data? Assuming that cmaps are rather essential, I would have expected some sample files to exist, but after a quick search, I could not really find any.
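Regarding both questions, a very small sanity check is possible with the standard library alone. The sketch below assumes a candidate table of cmap names and Python codec names (the entries are taken from the table above; the function names are hypothetical). It verifies only that Python knows each codec and can round-trip sample text, not that the codec really matches the corresponding Adobe CMap, and it does not produce PDF test files.

```python
import codecs
from typing import Dict

# Candidate entries copied from the table discussed above.
candidate_cmaps: Dict[str, str] = {
    "/GBK2K-H": "gb18030",
    "/GBK-EUC-H": "gbk",
    "/GBpc-EUC-H": "gb2312",
}


def unknown_codecs(table: Dict[str, str]) -> Dict[str, str]:
    """Return the entries whose codec name is not known to Python."""
    bad: Dict[str, str] = {}
    for cmap_name, codec in table.items():
        try:
            codecs.lookup(codec)
        except LookupError:
            bad[cmap_name] = codec
    return bad


def make_sample(codec: str, text: str = "中文") -> bytes:
    """Encode sample text to get raw bytes usable as decoding test input."""
    return text.encode(codec)


print(unknown_codecs(candidate_cmaps))
for name, codec in candidate_cmaps.items():
    raw = make_sample(codec)
    print(name, raw, raw.decode(codec))
```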

actuary-chen · 2024-06-19T10:01:49Z

@stefan6419846 try to modify _cmap.py with

_predefined_cmap: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",  # <- new
    "/GBK2K-V": "gb18030", # <- new
    # UCS2 in code
}

Similar issues occur for "/UniCNS-UTF16-H", "/ETen-B5-H", "/ETen-B5-V" and "/ETenms-B5-H". How should _cmap.py be modified for these?

pubpub-zz · 2024-06-19T11:26:29Z

@actuary-chen can you please share your pdf for analysis?

actuary-chen · 2024-06-20T04:24:33Z

Hi, it may be related to these two files. Benjamin


pubpub-zz · 2024-06-20T04:41:46Z

@actuary-chen the files are not attached. Please attach them directly in the thread

actuary-chen · 2024-06-20T07:01:03Z

FBL01-1.pdf
FBL01-2.pdf

The issues may come from files such as the attached ones.

pubpub-zz · 2024-06-21T18:51:54Z

@actuary-chen
This is the updated table. Your files did not contain UniCNS-UTF16-H. Can you check that they work with the new table?

_predefined_cmap: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",
    "/GBK2K-V": "gb18030",
    "/ETen-B5-H": "cp950",
    "/ETen-B5-V": "cp950",
    "/ETenms-B5-H": "cp950",
    "/ETenms-B5-V": "cp950",
    "/UniCNS-UTF16-H": "utf-16-be", # TBC
    "/UniCNS-UTF16-V": "utf-16-be", # TBC
    # UCS2 in code
}
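To make the intent of the new entries concrete, here is a minimal sketch of a lookup that falls back to a warning for names missing from the table. `codec_for` and `decode_bytes` are hypothetical helpers written for this illustration, and the warning text merely mirrors the message from the original report; this is not pypdf's actual code path.

```python
import warnings
from typing import Dict, Optional

# Excerpt of the updated table with the newly added entries.
cmap_excerpt: Dict[str, str] = {
    "/ETen-B5-H": "cp950",
    "/ETen-B5-V": "cp950",
    "/ETenms-B5-H": "cp950",
    "/ETenms-B5-V": "cp950",
    "/UniCNS-UTF16-H": "utf-16-be",  # TBC in the thread
    "/UniCNS-UTF16-V": "utf-16-be",  # TBC in the thread
}


def codec_for(encoding_name: str) -> Optional[str]:
    """Return the Python codec for a cmap name, warning if it is unknown."""
    codec = cmap_excerpt.get(encoding_name)
    if codec is None:
        warnings.warn(f"Advanced encoding {encoding_name} not implemented yet")
    return codec


def decode_bytes(raw: bytes, encoding_name: str) -> Optional[str]:
    """Decode raw bytes if the cmap name is known, else return None."""
    codec = codec_for(encoding_name)
    return raw.decode(codec) if codec is not None else None


big5_text = decode_bytes("中文".encode("cp950"), "/ETen-B5-H")
utf16_text = decode_bytes("中文".encode("utf-16-be"), "/UniCNS-UTF16-H")
print(big5_text, utf16_text)
```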


pubpub-zz · 2024-08-16T09:18:38Z

This issue seems to be solved. I don't know why it has not been closed automatically.

stefan6419846 · 2024-08-16T09:47:38Z

This has not been closed before as I was looking for a generic solution for implementing all possible encodings in one step instead of opening a new issue for each one.

pubpub-zz · 2024-08-16T12:50:17Z

We need to check the encodings. I cannot see a global solution.

MartinThoma added the workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow) and is-feature (a feature request) labels Dec 25, 2023

MartinThoma mentioned this issue Dec 25, 2023

Inconsistent usage of warnings #2354

Closed

stefan6419846 mentioned this issue Jan 1, 2024

BUG: Add support for GBK2K cmaps #2385

Merged

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jun 21, 2024

ENH: accepts ETen-B5 and UniCNS-UTF16 encodings

f667fbb

closes py-pdf#2356

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jun 21, 2024

ENH: accepts ETen-B5 and UniCNS-UTF16 encodings

54fbcd7

closes py-pdf#2356

pubpub-zz mentioned this issue Jun 21, 2024

ENH: accepts ETen-B5 and UniCNS-UTF16 encodings #2721

Merged

stefan6419846 pushed a commit that referenced this issue Jun 23, 2024

ENH: Accept ETen-B5 and UniCNS-UTF16 encodings (#2721)

81f35f9

Related to #2356.

pubpub-zz closed this as completed Aug 16, 2024
