I am testing a PDF file, and when I extract text from it with pymupdf/fitz the characters come out broken. The PDF is encoded with /UniKS-UTF16-H.
For example, the text shown in this image comes out as follows:
Input :
output : 5356㱊ኂ⮮ᦂ# ⯆♮ⴖ# ⛯ኺ⊲ኞ⛚
Here is my code
```python
try:
    import pymupdf as fitz  # available with v1.24.3
except ImportError:
    import fitz
import pathlib

# Open the PDF document
doc = fitz.open("2023_결산서_제9장_성과보고서.pdf")
output_dir = "output_markdown"
# Create the output directory if it does not exist
pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

# Get the text from each page
for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text("text")
    # Convert the text to UTF-8 (or keep as-is if already in desired encoding)
    # Also tried: encode('utf-8').decode('utf-16') and decode('utf-16').encode("utf-8")
    utf8_text = text.encode('utf-8')
    # Define the output .md file name
    outname = pathlib.Path(output_dir) / f"page_{page_num + 1}.md"
    # Save the text to the .md file
    outname.write_bytes(utf8_text)

print("Text has been successfully extracted and saved as .md files.")
```
Is there any solution for this?
How to reproduce the bug
My PyMuPDF version is 1.24.5 on macOS with Python 3.10.
python test.py
PyMuPDF version
1.24.5
Operating system
MacOS
Python version
3.10
All PDF viewers / readers show garbled text output, so this is a problem with the file, not with (Py)MuPDF.
All you can do is use OCR to make it readable.
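The OCR suggestion could look roughly like the sketch below. It assumes Tesseract is installed with Korean language data (`kor`) and that PyMuPDF can find the tessdata folder. The helper names (`hangul_ratio`, `ocr_page_text`, `extract_with_ocr_fallback`) and the 0.5 threshold are hypothetical choices for illustration, not part of PyMuPDF; only `Page.get_textpage_ocr` and `Page.get_text` are library calls.

```python
import unicodedata


def hangul_ratio(text: str) -> float:
    """Fraction of non-whitespace characters whose Unicode name mentions HANGUL."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if "HANGUL" in unicodedata.name(c, ""))
    return hangul / len(chars)


def ocr_page_text(page, language="kor", dpi=300):
    """OCR the whole rendered page with Tesseract and return plain text."""
    # full=True ignores the (broken) embedded text and OCRs the page image.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text("text", textpage=textpage)


def extract_with_ocr_fallback(pdf_path, min_ratio=0.5):
    """Return one string per page, using OCR where extraction looks garbled."""
    # Imported lazily so the helpers above work without PyMuPDF installed.
    # On versions before 1.24.3, use `import fitz as pymupdf` instead.
    import pymupdf

    pages = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text("text")
            # Assumption: a Korean page should be mostly Hangul. Pages that
            # are mostly digits or Latin text will also trigger OCR here.
            if hangul_ratio(text) < min_ratio:
                text = ocr_page_text(page)
            pages.append(text)
    return pages
```

Calling `extract_with_ocr_fallback("2023_결산서_제9장_성과보고서.pdf")` would then give one string per page that can be written out as before. OCR is slow at 300 dpi, so the threshold check is only there to avoid running it on pages that already extract cleanly.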