Does it support inserting Chinese text? #329

Closed
seven1122 opened this issue Jul 22, 2019 · 29 comments

Comments

@seven1122

If it does, how should I set the 'encoding' when I use the 'page.insertText()' method? Or is there another method you would recommend? Thank you!

@JorjMcKie
Collaborator

Yes it does! And it works for insertText() as well as for insertTextbox().
For a recent discussion (an example using a Thai font) see #319.
PyMuPDF comes with built-in fonts for traditional and simplified Chinese. Use:

  • fontname="china-s" or fontname="china-ss" for simplified Chinese
  • fontname="china-t" or fontname="china-ts" for traditional Chinese

Using these means your PDF will not need or contain extra fonts or font files.
If you want to use a special font, however, you can do that too. You must then choose a fontname different from all of the above and also specify the filename of a font file on your system.

@seven1122
Author

seven1122 commented Jul 23, 2019 via email

@JorjMcKie
Collaborator

There was no picture ... Please let me see your script. Here is my example:

Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import fitz
>>> doc = fitz.open()
>>> page = doc.newPage()
>>> text = "你好!hello!Hallo! 我很喜欢德国!德国是个好地方!"
>>> page.insertText((100,100), text, fontname="china-ss")
1
>>> doc.save("test.pdf")
>>> 

It leads to this PDF:
test.pdf

@seven1122
Author

seven1122 commented Jul 24, 2019 via email

@JorjMcKie
Collaborator

You forgot to attach the PDF again.

The code looks okay so far.
What you can try:

  • use a different viewer to show the PDF: not all viewers are capable of displaying all Chinese fonts. For example, SumatraPDF cannot display "china-ss", but Adobe Acrobat can, etc.
  • choose a different fontname: apart from "china-ss", you can also try "china-t", "china-ts" and "china-s".

@seven1122
Author

I am sure that it is not related to the PDF viewer, as the PDF file is a Chinese file. I have also tried all of the above fontnames, but the results are the same.

@seven1122
Author

image

@JorjMcKie
Collaborator

The file you sent was just an image, not a PDF, so I am unable to track down what happened.

I hope you have tried the script I sent you. What were the results?
Which PDF viewer are you using?
If it does not work on your system, please send me details of your installation: operating system, Python version, PyMuPDF version, method of installation (wheel? generation via source?)

@seven1122
Author

seven1122 commented Jul 25, 2019

I have tried your script, as follows:
Python 2.7.15+ (default, Nov 27 2018, 23:36:35)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fitz
>>> doc = fitz.open()
>>> page = doc.newPage()
>>> text = "你好! hello!"
>>> page.insertText((100, 100), text, fontname="china-ss")
1
>>> doc.save("test01.pdf")
>>>
The resulting file:
test01.pdf
When I run my own code, the resulting PDF file is:
test.pdf
I am using Document Viewer, the PDF viewer that comes with the operating system.
Other details: Ubuntu 18, Python 2.7.15, PyMuPDF 1.14.11, wheel installation.

@JorjMcKie
Collaborator

Finally we are making progress!
You are using Python 2, which has poor support for Unicode. I cannot imagine that you were able to enter Chinese text in a Python 2 IDLE session! And even without using u"..."!
Please try a Python 3 version to see if there is a difference.
Anyway, here is a script which I would ask you to run in batch mode under Python 2. I did this using the same Python 2.7.15+ version under Ubuntu:

# -*- coding: utf-8 -*-
import fitz

doc = fitz.open()  # new PDF
text = u"你好!hello!Hallo! 我很喜欢德国!德国是个好地方"
pnt = fitz.Point(50, 72)  # start point of text insertion
page = doc.newPage()  # create a page
page.insertText(pnt, text, fontname="china-t")
doc.save(__file__ + ".pdf")

The resulting PDF was correct.
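As background on why the u"..." prefix matters: a plain "..." literal in Python 2 holds encoded bytes, not characters. The difference can be illustrated from Python 3, as in the small sketch below; it assumes UTF-8 as the source encoding, which is what the coding line in the script above declares.

```python
text = "你好"

# In Python 3, str is a sequence of Unicode code points: two characters here.
assert len(text) == 2

# A plain "..." literal in Python 2 held the encoded bytes instead.
# In UTF-8, each of these two characters takes three bytes.
raw = text.encode("utf-8")
assert len(raw) == 6

# Decoding restores the text, which is what u"..." provided directly in Python 2.
assert raw.decode("utf-8") == text
```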

@seven1122
Author

Oh my god, I feel as if I had made an outrageous howler. Thank you very much. However, it doesn't seem very friendly for mixed text (numbers, Chinese, English): the spacing between digits and letters is not standard. But it doesn't matter.
image

@JorjMcKie
Collaborator

good to see it works now.
Close the issue?

@JorjMcKie
Collaborator

JorjMcKie commented Jul 25, 2019

Oh my god, I feel as if I had made an outrageous howler. Thank you very much. However, it doesn't seem very friendly for mixed text (numbers, Chinese, English): the spacing between digits and letters is not standard. But it doesn't matter.

This is a characteristic of the built-in fonts (they are mono-spaced: all the characters have the same width, whether Chinese or Latin). If you choose a nicer one, you will also see nicer results. Look at the difference between the following pictures (identical code, made with page.insertTextbox() and rotated via morphing):
Built-in font:
grafik

Nicer font:
grafik

@HeroadZ

HeroadZ commented Jun 26, 2021

Could you give some hints about a nicer font for mixed text?
The font documentation says:

If you know you have a mixture of CJK and Latin text, consider just using Font("cjk") because this supports everything

but there is no 'cjk' font in the latest version of PyMuPDF now.

Which font file did you use for mixed text, for example Japanese mixed with English?

@JorjMcKie
Collaborator

but there is no 'cjk' font in the latest version of PyMuPDF now.

Why do you say that? Of course there is! It is

>>> import fitz
>>> font=fitz.Font("cjk")
>>> font.name
'Droid Sans Fallback Regular'
>>> 

@HeroadZ

HeroadZ commented Jun 26, 2021

Oh, sorry about that.
I tested this font with the Page.insert_text function, but an error occurred, even with Droid Sans Fallback Regular.
Can I get a nicer font for mixed text such as 2021年6月26日 using insert_text with the cjk font?

@JorjMcKie
Collaborator

The methods insert_text and insert_textbox require using the buffer of Font("cjk"):

font=fitz.Font("cjk")
page.insert_font(fontname="F0", fontbuffer=font.buffer)
page.insert_text(..., fontname="F0",...)

@HeroadZ

HeroadZ commented Jun 26, 2021

Thank you very much! This saved my day!
Sorry if I didn't read the docs carefully, but there seems to be no documentation for this usage with mixed text.
I think it would be better to add some explanation to the insert_text and insert_textbox functions.

@JorjMcKie
Collaborator

ok, I'll see what I can do.

@maiiabocharova

maiiabocharova commented Feb 25, 2022

Hello! Is there a way to insert Chinese text using page.insert_textbox and a custom font?
This example

doc = fitz.open()
page = doc.new_page()
text = '为什么烦恼'
page.insert_text((100,100), text, fontname="china-s")
doc.save("test.pdf")

worked and the file was OK, but I want to use a textbox to ensure the right positioning.

So I did:

page.insert_textbox(rect, text,
                        fontsize=fontsize_to_use,
                        fontfile='zcool.ttf', 
                        align=1)

(I downloaded a fontfile from here: https://fonts.google.com/specimen/ZCOOL+XiaoWei?subset=chinese-simplified#standard-styles)

I also tried
page.insert_textbox(rect, text, fontname="china-s") and it doesn't work either (blank output).

It doesn't give me any errors, but the output file is blank.

@JorjMcKie
Collaborator

If you try insert_textbox() with "china-s", it should work.
Other fonts are always a bit less reliable.
I recommend using the new TextWriter feature together with the Font class, along the lines of this code snippet:

font = fitz.Font(fontfile="...")
page=...
tw = fitz.TextWriter(page.rect)
tw.fill_textbox(...)
tw.write_text(page)

For a TTF font, this should deliver the best results.
You can also use font = fitz.Font("cjk"), the universal built-in font. This allows using a mixture of Chinese, Japanese, Korean and Latin text at once.

@maiiabocharova

font = fitz.Font(fontfile="...")
page=...
tw = fitz.TextWriter(page.rect)
tw.fill_textbox(...)
tw.write_text(page)

Thank you for the snippet. I followed your advice and had a problem with this line

rect = (rect_x1, rect_y1, rect_x2, rect_y2)
tw.fill_textbox(fitz.Rect(rect), text, font=font, fontsize=fontsize)

ValueError: Text must start in rectangle.

@maiiabocharova

Actually, the mistake was that I tried to insert text with a fontsize too big for the rect.
So

page.insert_textbox(rect, text,
                            fontsize=fontsize,
                            fontname="china-t",
                            align=1)

works perfectly when I provide the smaller fontsize

@JorjMcKie
Collaborator

My bad, should have looked up the right call pattern again first:
fill_textbox(rect, text, pos=None, font=None, fontsize=11, align=0, right_to_left=False, warn=None, small_caps=0)

@JorjMcKie
Collaborator

This method insert_textbox() returns a float, which should be checked to determine success.
If negative, there was not enough room and nothing has been written!

@buptyyf

buptyyf commented Oct 25, 2022

I have another question, about extracting a font buffer such as example.ttf.
I have read about analyzing fonts on the website, and I have extracted a font family, SimSun, from a PDF file; it is a Chinese font family. But how can I get the font's TTF file? Thanks.

@JorjMcKie
Collaborator

I have another question, about extracting a font buffer such as example.ttf. I have read about analyzing fonts on the website, and I have extracted a font family, SimSun, from a PDF file; it is a Chinese font family. But how can I get the font's TTF file? Thanks.

The page.get_fonts() output is a list of fonts used by the page. Each item of the list looks like (xref, ext, ...).
The ext sub-item is a string with the extension suitable for that font, so something like "ttf".
If however that string equals "n/a", then this is a built-in font, for which there exists no binary font file in the PDF. Examples are the Helvetica and Courier fonts.
In other cases you can do buffer = doc.extract_font(xref)[-1] to extract the binary font file content. You can then store the result as a font file, for example like this:

# ext and buffer come from page.get_fonts() and doc.extract_font(xref)
with open(f"myfont.{ext}", "wb") as fontfile:
    fontfile.write(buffer)

But please note, that in many (if not most) PDFs not the complete font is embedded, but only those characters of a font, which are actually used inside the PDF.
So with the extraction above, you do get a valid font, but it may only contain a handful of characters.

@buptyyf

buptyyf commented Oct 26, 2022

But please note, that in many (if not most) PDFs not the complete font is embedded, but only those characters of a font, which are actually used inside the PDF.
So with the extraction above, you do get a valid font, but it may only contain a handful of characters.

@JorjMcKie Thanks a lot.

I used doc.extract_font to get all font families of a PDF, as follows:
font family BCDEEE+SimSun ttf
font family BCDFEE+Calibri ttf
font family BCDGEE+Calibri ttf
font family BCDHEE+SimSun ttf
font family BCDIEE+Calibri-Bold ttf
font family SimSun n/a
font family ArialMT n/a

But I don't know the meaning of the prefixes 'BCDEEE', 'BCDFEE', 'BCDGEE', 'BCDHEE' and 'BCDIEE'. The font info I get from the span dictionaries via page.get_text('dict') only shows SimSun, ArialMT and Calibri.

@JorjMcKie
Collaborator

The prefix of six uppercase ASCII letters followed by a "+" indicates a font subset: this is not the complete "SimSun.ttf", for example.
If you want that prefix to be included in text extractions as well, enable the global setting: fitz.TOOLS.set_subset_fontnames(True).
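To match the prefixed names from the font list against the plain names from text extraction, the subset tag can be split off with a regular expression. This is a plain-Python sketch that does not use PyMuPDF; the helper name is made up, and the pattern simply encodes the rule stated above (six uppercase ASCII letters, then "+").

```python
import re

# Six uppercase ASCII letters followed by "+" mark a font subset.
_SUBSET_PREFIX = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(fontname):
    """Return (prefix, basename); prefix is None if there is no subset tag."""
    match = _SUBSET_PREFIX.match(fontname)
    if match:
        return match.group(1), match.group(2)
    return None, fontname
```

For the list above, "BCDEEE+SimSun" and "BCDHEE+SimSun" both reduce to the base name "SimSun", which is the name the span dictionaries report by default.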
