Characters are broken for Korean-language PDF with /UniKS-UTF16-H encoding #3785

Closed
mk-docenty opened this issue Aug 16, 2024 · 5 comments
Labels
not a bug (not a bug / user error / unable to reproduce)

Comments

@mk-docenty

Description of the bug

Hi,

I am testing a PDF file, and when I extract its text with PyMuPDF (fitz) the characters come out broken. The PDF is encoded with /UniKS-UTF16-H.
For example, the text shown in this image is extracted as follows:

Input : image

output : 5356㱊ኂ⮮ᦂ# ⯆♮ⴖ# ⛯ኺ⊲ኞ⛚

Here is my code

```python
try:
    import pymupdf as fitz  # available with v1.24.3
except ImportError:
    import fitz
import pathlib

# Open the PDF document
doc = fitz.open("2023_결산서_제9장_성과보고서.pdf")
output_dir = "output_markdown"

# Create the output directory if it does not exist
pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

# Get the text from each page
for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text("text")

    # Convert the text to UTF-8 (or keep as-is if already in desired encoding)
    utf8_text = text.encode('utf-8')  # encode('utf-8').decode('utf-16') #text.decode('utf-16').encode("utf-8")

    # Define the output .md file name
    outname = pathlib.Path(output_dir) / f"page_{page_num + 1}.md"

    # Save the text to the .md file
    pathlib.Path(outname).write_bytes(utf8_text)

print("Text has been successfully extracted and saved as .md files.")
```
Is there any solution for this?

How to reproduce the bug

My PyMuPDF version is 1.24.5, on macOS with Python 3.10.

python test.py

PyMuPDF version

1.24.5

Operating system

MacOS

Python version

3.10

@JorjMcKie
Collaborator

You only attached an image. We need the document that reproduces the problem.

@mk-docenty
Author

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) and removed the "example required" label (Waiting for information) on Aug 16, 2024
@JorjMcKie
Collaborator

All PDF viewers / readers show garbled text output for this file. So this is a problem with the file, not with (Py)MuPDF.
All you can do is use OCR to make it readable.
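
A minimal sketch of the OCR route, assuming Tesseract and its Korean language data (`kor`) are installed where PyMuPDF can find them. The helper names `looks_garbled` and `page_text_with_ocr_fallback` and the 0.5 threshold are hypothetical, chosen for illustration. The Hangul check is only a rough heuristic for Korean documents: it treats a page as garbled when most of its non-ASCII characters fall outside the Hangul Syllables block.

```python
def looks_garbled(text, threshold=0.5):
    """Heuristic: True if most non-ASCII characters are not Hangul syllables."""
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return False  # nothing to judge (e.g. digits or Latin text only)
    hangul = sum(1 for ch in non_ascii if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(non_ascii) < threshold


def page_text_with_ocr_fallback(page, language="kor", dpi=300):
    """Return the page's text, re-reading it via OCR if it looks garbled.

    `page` is expected to be a PyMuPDF Page object.
    """
    text = page.get_text("text")
    if not looks_garbled(text):
        return text
    # full=True OCRs the whole rendered page rather than only its images,
    # so the broken text layer is not used.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text("text", textpage=textpage)
```

In the extraction loop from the report, `text = page.get_text("text")` could be replaced by `text = page_text_with_ocr_fallback(page)`. OCR quality depends on the rendering resolution and the installed language data, so the result should be checked against the visible page.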

@mk-docenty
Author

@JorjMcKie Can I apply a CMap in PyMuPDF? https://github.com/adobe-type-tools/cmap-resources/tree/master
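
Before looking at CMap resources, it may help to confirm which encoding each font in the file actually declares. A minimal sketch, assuming PyMuPDF's `Document.get_page_fonts`, whose entries are tuples of `(xref, ext, type, basefont, name, encoding)`. The helper name `fonts_by_encoding` is hypothetical. It only reports what the file declares and does not change how text is decoded.

```python
from collections import defaultdict


def fonts_by_encoding(doc):
    """Group base font names by declared encoding across all pages.

    `doc` is expected to be an opened PyMuPDF Document.
    """
    groups = defaultdict(set)
    for pno in range(doc.page_count):
        for font in doc.get_page_fonts(pno):
            basefont, encoding = font[3], font[5]
            groups[encoding or "(none)"].add(basefont)
    return {enc: sorted(names) for enc, names in groups.items()}


# Usage (file name taken from the report above):
# import pymupdf
# doc = pymupdf.open("2023_결산서_제9장_성과보고서.pdf")
# print(fonts_by_encoding(doc))
```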

@mk-docenty
Author

> All PDF viewers / readers show garbled text output for this file. So this is a problem with the file, not with (Py)MuPDF. All you can do is use OCR to make it readable.

Thank you for the quick response.
