Characters are broken for Korean language PDF with /UniKS-UTF16-H encoding #3785

mk-docenty · 2024-08-16T07:01:49Z

Description of the bug

Hi,

I am testing a PDF file whose text is encoded with /UniKS-UTF16-H. When I extract the text using pymupdf/fitz, the characters come out broken.
For example, the text shown in this image is extracted as follows:

Input :

output : 5356㱊ኂ⮮ᦂ# ⯆♮ⴖ# ⛯ኺ⊲ኞ⛚

Here is my code

`

try:
    import pymupdf as fitz  # available with v1.24.3
except ImportError:
    import fitz
import pathlib

# Open the PDF document
doc = fitz.open("2023_결산서_제9장_성과보고서.pdf")
output_dir = "output_markdown"

# Create the output directory if it does not exist
pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

# Get the text from each page
for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text("text")

    # Convert the text to UTF-8 (or keep as-is if already in desired encoding)
    utf8_text = text.encode('utf-8')  # encode('utf-8').decode('utf-16')  # text.decode('utf-16').encode("utf-8")

    # Define the output .md file name
    outname = pathlib.Path(output_dir) / f"page_{page_num + 1}.md"

    # Save the text to the .md file
    pathlib.Path(outname).write_bytes(utf8_text)

print("Text has been successfully extracted and saved as .md files.")

`
Is there any solution for this?

How to reproduce the bug

My PyMuPDF version is 1.24.5, on macOS with Python 3.10.

python test.py

PyMuPDF version

1.24.5

Operating system

MacOS

Python version

3.10


JorjMcKie · 2024-08-16T07:16:06Z

You only attached some image. We need the reproducing document.

mk-docenty · 2024-08-16T08:38:04Z

Hi,

here is example Pdf
2023_결산서_제9장_성과보고서-7-12.pdf

JorjMcKie · 2024-08-16T09:00:17Z

All PDF viewers / readers show garbled text output. So this is a problem with the file - not (Py-) MuPDF.
All you can do is use OCR to make it readable.
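
As an illustration of the OCR suggestion, here is a minimal sketch that keeps the embedded text when it looks like Korean and otherwise falls back to OCR. It assumes `page` is a PyMuPDF `Page`, that Tesseract with Korean (`kor`) language data is installed, and that `Page.get_textpage_ocr` is available in the installed version. The helper names `hangul_ratio` and `extract_page_text` and the `min_ratio` cutoff are hypothetical and not part of PyMuPDF.

```python
def hangul_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hangul syllables (U+AC00..U+D7A3)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hangul = sum(1 for ch in letters if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(letters)


def extract_page_text(page, min_ratio: float = 0.3) -> str:
    """Return the embedded text if it looks Korean, otherwise OCR the page.

    Assumption: `page` behaves like a pymupdf.Page. The 0.3 cutoff is an
    arbitrary illustrative value, not a recommended setting.
    """
    text = page.get_text("text")
    if hangul_ratio(text) >= min_ratio:
        return text
    # full=True renders the whole page as an image and OCRs it, so the
    # unreliable embedded text layer is ignored.
    textpage = page.get_textpage_ocr(language="kor", dpi=300, full=True)
    return page.get_text("text", textpage=textpage)
```

With a document opened as in the script above, this could be called as `extract_page_text(doc[page_num])` in place of `page.get_text("text")`. OCR quality depends on the scan resolution and the installed language data, so the result should be checked against the rendered page.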

mk-docenty · 2024-08-16T09:01:26Z

@JorjMcKie, may I ask whether I can apply a CMap with PyMuPDF? https://github.com/adobe-type-tools/cmap-resources/tree/master
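
Before looking into CMaps, it can help to confirm which fonts in the file actually declare this encoding. The sketch below assumes that `Page.get_fonts()` returns tuples of the form `(xref, ext, type, basefont, name, encoding, ...)`; the helper names `fonts_with_encoding` and `report` are hypothetical. It only lists what the file declares and does not change how text is decoded.

```python
def fonts_with_encoding(font_entries, encoding="UniKS-UTF16-H"):
    """Return the base font names whose declared encoding equals `encoding`.

    `font_entries` is assumed to be shaped like the result of Page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    """
    return sorted({entry[3] for entry in font_entries if entry[5] == encoding})


def report(path):
    """Print, per page, the fonts that declare the UniKS-UTF16-H encoding."""
    try:
        import pymupdf  # available with v1.24.3
    except ImportError:
        import fitz as pymupdf
    with pymupdf.open(path) as doc:
        for page in doc:
            print(page.number + 1, fonts_with_encoding(page.get_fonts()))
```

For example, `report("2023_결산서_제9장_성과보고서.pdf")` would print one line per page with the matching font names.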

mk-docenty · 2024-08-16T09:01:45Z

> All PDF viewers / readers show garbled text output. So this is a problem with the file - not (Py-) MuPDF. All you could do is using OCR to make it readable.

Thank you for quick response

JorjMcKie added example required Waiting for information labels Aug 16, 2024

JorjMcKie added not a bug not a bug / user error / unable to reproduce and removed example required Waiting for information labels Aug 16, 2024

JorjMcKie closed this as completed Aug 16, 2024
