Does it support inserting Chinese text? #329

seven1122 · 2019-07-22T11:32:50Z

If it does, how should I set the 'encoding' when I use the 'page.insertText()' method? Or is there another method you would recommend? Thank you!

JorjMcKie · 2019-07-22T12:05:02Z

Yes it does! And it works for insertText() as well as for insertTextbox().
For a recent discussion (an example using a Thai font) see #319.
PyMuPDF comes with built-in fonts for traditional and simplified Chinese fonts. Use:

fontname="china-s" or fontname="china-ss" for simplified Chinese
fontname="china-t" or fontname="china-ts" for traditional Chinese

Using these means your PDF will not need or contain extra fonts or font files.
If you want to use a special font, however, you can do that too. You must then choose a fontname different from all of the above and also specify the filename of a font file on your system.

seven1122 · 2019-07-23T12:44:47Z

Hi, first, thank you for your reply! But it does not work. As the following picture shows, my Chinese text should appear at the red circle, but it is missing.

…


JorjMcKie · 2019-07-23T12:53:19Z

There was no picture ... Please let me see your script. Here is my example:

Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import fitz
>>> doc = fitz.open()
>>> page = doc.newPage()
>>> text = "你好！hello！Hallo！ 我很喜欢德国！德国是个好地方！"
>>> page.insertText((100,100), text, fontname="china-ss")
1
>>> doc.save("test.pdf")
>>>

It leads to this PDF:
test.pdf

seven1122 · 2019-07-24T01:59:13Z

My code is as follows:

def write_text_to_pdf(pdf_key, position_x, position_y, page_num, insert_text=""):
    pdf_url = get_qiniu_url(pdf_key)
    pdf_content = requests.get(pdf_url).content
    doc = fitz.open("type", pdf_content)
    page = doc[page_num-1]
    page_height = page.rect.height
    p = fitz.Point(position_x, page_height-position_y)  # start point of 1st line
    if not insert_text:
        insert_text = datetime.date.today().strftime("%Y-%m-%d")
    page.insertText(p, insert_text, fontname="china-ss", fontsize=14, rotate=0, )
    tem_name = str(uuid.uuid4()).replace('-', '') + ".pdf"
    pdf_path = TEMP_DIR + tem_name
    doc.save(pdf_path)

When insert_text="2018年11月12日", the Chinese characters (年月日) are missing, as the attachment shows.

JorjMcKie · 2019-07-24T17:54:29Z

You forgot to attach the PDF again.

The code looks okay so far.
What you can try:

use a different viewer to show the PDF: not all viewers are capable of displaying all Chinese fonts. For example, SumatraPDF cannot display "china-ss", but Adobe Acrobat can, etc.
choose a different fontname: apart from "china-ss", you can also try "china-t", "china-ts" and "china-s".

seven1122 · 2019-07-25T03:15:51Z

I am sure that it is not related to the PDF viewer, as the PDF file is a Chinese file. I have also tried all of the above fontnames, but the results are the same.

seven1122 · 2019-07-25T03:17:42Z

JorjMcKie · 2019-07-25T06:28:28Z

The file you sent was just an image, not a PDF, so I am unable to track down what happened.

I hope you have tried the script I sent you. What were the results?
Which PDF viewer are you using?
If it does not work on your system, please send me details of your installation: operating system, Python version, PyMuPDF version, method of installation (wheel? generation via source?)

seven1122 · 2019-07-25T07:04:42Z

I have tried your script, as follows:
Python 2.7.15+ (default, Nov 27 2018, 23:36:35)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fitz
>>> doc = fitz.open()
>>> page = doc.newPage()
>>> text = "你好! hello!"
>>> page.insertText((100, 100), text, fontname="china-ss")
1
>>> doc.save("test01.pdf")
>>>
The resulting file:
test01.pdf
When I run my own code, the resulting PDF file is:
test.pdf
I am using Document Viewer, the PDF viewer that comes with the operating system.
Some other details: Ubuntu 18, Python 2.7.15, PyMuPDF 1.14.11, wheel installation.

JorjMcKie · 2019-07-25T07:40:58Z

Finally we are making progress!
You are using Python 2, which has poor Unicode support. I cannot imagine that you were able to enter Chinese text in a Python 2 IDLE session, and even without using u"..."!
Please try a Python 3 version to see if there is a difference.
Anyway, here is a script which I would ask you to run in batch under Python2. I did this using the same Python 2.7.15+ version under Ubuntu:

# -*- coding: utf-8 -*-
import fitz

doc = fitz.open()  # new PDF
text = u"你好！hello！Hallo！ 我很喜欢德国！德国是个好地方"
pnt = fitz.Point(50, 72)  # start point of text insertion
page = doc.newPage()  # create a page
page.insertText(pnt, text, fontname="china-t")
doc.save(__file__ + ".pdf")

The resulting PDF was correct.
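Why the u"..." prefix matters can be sketched in Python 3, where bytes plays the role of Python 2's plain "..." string. In a UTF-8 source file, a Python 2 literal without the u prefix holds the encoded bytes rather than the characters, so each Chinese character arrives as three separate byte values. The sample date is the one from the earlier report; how exactly the old PyMuPDF version treated such byte strings is not shown here.

```python
# Python 3 sketch: bytes stand in for a Python 2 "..." literal without the u prefix.
text = u"2018年11月12日"     # a real text string
raw = text.encode("utf-8")  # what Python 2 would hold for the same literal without u

n_chars = len(text)  # number of characters
n_bytes = len(raw)   # number of bytes: each Chinese character takes 3 in UTF-8

# Decoding restores the characters; this is what u"..." gives you directly in Python 2.
roundtrip = raw.decode("utf-8")
```

Here the 8 ASCII characters take one byte each and the 3 Chinese characters take three bytes each, which is why the counts differ.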

seven1122 · 2019-07-25T09:06:53Z

Oh my god, I feel as if I made an outrageous howler. Thank you very much. However, mixed text (numbers, Chinese, English) does not seem to be handled very nicely: the spacing between digits and letters is not standard. But it doesn't matter.

JorjMcKie · 2019-07-25T09:53:53Z

Good to see it works now.
Close the issue?

JorjMcKie · 2019-07-25T10:14:34Z

Oh my god, I feel as if I made an outrageous howler. Thank you very much. However, mixed text (numbers, Chinese, English) does not seem to be handled very nicely: the spacing between digits and letters is not standard. But it doesn't matter.

This is a characteristic of the built-in fonts (they are mono-spaced: all the characters have the same width, whether Chinese or Latin). If you choose a nicer one, you will also see nicer results. Look at the difference between the following pictures (identical code, made with page.insertTextbox() and rotated via morphing):
Built-in font:

Nicer font:

HeroadZ · 2021-06-26T06:39:31Z

Could you give some hints about a nicer font for mixed text?
In the explanation of font, it said that

If you know you have a mixture of CJK and Latin text, consider just using Font("cjk") because this supports everything

but there is no 'cjk' font in the latest version of PyMuPdf now.

What kind of font file did you use for mixed text, for example a mix of Japanese and English?

JorjMcKie · 2021-06-26T07:08:42Z

but there is no 'cjk' font in the latest version of PyMuPdf now.

Why do you say that? Of course there is! It is

>>> import fitz
>>> font=fitz.Font("cjk")
>>> font.name
'Droid Sans Fallback Regular'
>>>

HeroadZ · 2021-06-26T07:33:33Z

Oh, sorry about that.
I tested this font with the Page.insert_text function, but an error occurred, even with Droid Sans Fallback Regular.
Could I get a nicer font for mixed text such as 2021年6月26日 using insert_text with the cjk font?

JorjMcKie · 2021-06-26T07:43:45Z

The methods insert_text and insert_textbox require using the buffer of Font("cjk"):

font=fitz.Font("cjk")
page.insert_font(fontname="F0", fontbuffer=font.buffer)
page.insert_text(..., fontname="F0",...)

HeroadZ · 2021-06-26T07:50:11Z

Thank you very much! This saved my day!
Sorry if I didn't read the docs carefully; it seems there is no documentation for this usage with mixed text.
I think it would be better to add some explanation to the insert_text and insert_textbox functions.

JorjMcKie · 2021-06-26T07:51:25Z

ok, I'll see what I can do.

maiiabocharova · 2022-02-25T13:41:00Z

Hello! Is there a way to insert Chinese text using page.insert_textbox and custom font?
This example

doc = fitz.open()
page = doc.new_page()
text = '为什么烦恼'
page.insert_text((100,100), text, fontname="china-s")
doc.save("test.pdf")

worked, file was Ok, but I want to use textbox to ensure the right positioning:

So I did:

page.insert_textbox(rect, text,
                        fontsize=fontsize_to_use,
                        fontfile='zcool.ttf', 
                        align=1)

(I downloaded a fontfile from here: https://fonts.google.com/specimen/ZCOOL+XiaoWei?subset=chinese-simplified#standard-styles)

I also tried
page.insert_textbox(rect, text, fontname="china-s") and it also doesn't work (blank output).

It doesn't give me any errors, but the output file is blank.

JorjMcKie · 2022-02-25T14:12:50Z

If you try insert_textbox() with "china-s", it should work.
Other fonts are always a bit less reliable.
I recommend using the new TextWriter feature together with the Font class, along the lines of this code snippet:

font = fitz.Font(fontfile="...")
page=...
tw = fitz.TextWriter(page.rect)
tw.fill_textbox(...)
tw.write_text(page)

For a TTF font, this should deliver the best results.
You can also use font = fitz.Font("cjk"), the universal built-in font. This allows using a mixture of Chinese, Korean, Japanese, and Latin text at once.

maiiabocharova · 2022-02-26T14:57:11Z

font = fitz.Font(fontfile="...")
page=...
tw = fitz.TextWriter(page.rect)
tw.fill_textbox(...)
tw.write_text(page)

Thank you for the snippet. I followed your advice and had a problem with this line

rect = (rect_x1, rect_y1, rect_x2, rect_y2)
tw.fill_textbox(fitz.Rect(rect), text, font=font, fontsize=fontsize)

ValueError: Text must start in rectangle.

maiiabocharova · 2022-02-26T16:03:06Z

Actually, the mistake was that I tried to insert text with a fontsize too big for the rect.
So

page.insert_textbox(rect, text,
                            fontsize=fontsize,
                            fontname="china-t",
                            align=1)

works perfectly when I provide the smaller fontsize

JorjMcKie · 2022-02-26T17:06:05Z

My bad, should have looked up the right call pattern again first:
fill_textbox(rect, text, pos=None, font=None, fontsize=11, align=0, right_to_left=False, warn=None, small_caps=0)

JorjMcKie · 2022-02-26T17:14:37Z

The insert_textbox() method returns a float, which should be checked to determine success.
If it is negative, there was not enough room and nothing has been written!

buptyyf · 2022-10-25T07:43:43Z

I have another question, about extracting a font file buffer such as example.ttf.
I have read about analyzing fonts on the website, and I have extracted a font family, SimSun, from a PDF file; that's a Chinese font family. But how can I get the font's TTF file? Thanks.

JorjMcKie · 2022-10-25T08:25:27Z

I have another question, about extracting a font file buffer such as example.ttf. I have read about analyzing fonts on the website, and I have extracted a font family, SimSun, from a PDF file; that's a Chinese font family. But how can I get the font's TTF file? Thanks.

The page.get_fonts() output is a list of fonts used by the page. Each item of the list looks like (xref, ext, ...).
The ext sub-item is a string with the extension suitable for that font, so something like "ttf".
If however that string equals "n/a", then this is a built-in font, for which there exists no binary font file in the PDF. Examples are the Helvetica and Courier fonts.
In other cases you can do buffer = doc.extract_font(xref)[-1] to extract the binary font file content. After that you can store the result as a font file, for example like this:

fontfile = open(f"myfont.{ext}", "wb")
fontfile.write(buffer)
fontfile.close()

But please note, that in many (if not most) PDFs not the complete font is embedded, but only those characters of a font, which are actually used inside the PDF.
So with the extraction above, you do get a valid font, but it may only contain a handful of characters.

buptyyf · 2022-10-26T11:43:58Z

But please note, that in many (if not most) PDFs not the complete font is embedded, but only those characters of a font, which are actually used inside the PDF.
So with the extraction above, you do get a valid font, but it may only contain a handful of characters.

@JorjMcKie Thanks a lot.

I used doc.extract_font to get all font families of a PDF, as follows:
font family BCDEEE+SimSun ttf
font family BCDFEE+Calibri ttf
font family BCDGEE+Calibri ttf
font family BCDHEE+SimSun ttf
font family BCDIEE+Calibri-Bold ttf
font family SimSun n/a
font family ArialMT n/a

But I don't know the meaning of 'BCDEEE', 'BCDFEE', 'BCDGEE', 'BCDHEE' and 'BCDIEE'. So far I only get font info from the span dictionary via page.get_text('dict'), which only shows SimSun, ArialMT and Calibri.

JorjMcKie · 2022-10-26T12:02:06Z

A prefix of 6 upper-case ASCII letters followed by a "+" indicates a font subset: this is not the complete "SimSun.ttf", for example.
If you want that prefix to also be contained in text extractions, set a global variable to True: fitz.TOOLS.set_subset_fontnames(True).
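When font names do carry the subset prefix, it can be separated from the base name with a small pure-Python helper. This is a sketch: split_subset_prefix is a hypothetical name, not a PyMuPDF function.

```python
import re

# Six upper-case ASCII letters, a "+", then the base font name.
SUBSET_RE = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_prefix(fontname):
    """Return (prefix, basename); prefix is None if the name has no subset tag."""
    match = SUBSET_RE.match(fontname)
    if match:
        return match.group(1), match.group(2)
    return None, fontname


names = ["BCDEEE+SimSun", "BCDIEE+Calibri-Bold", "ArialMT"]
parsed = [split_subset_prefix(name) for name in names]
```

Two subset entries with the same base name, such as BCDEEE+SimSun and BCDHEE+SimSun in the list above, are separate subsets of the same font.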

JorjMcKie added the example required label Jul 23, 2019

JorjMcKie closed this as completed Jul 26, 2019

cges30901 mentioned this issue Sep 13, 2020

How to use PyMuPDF to generate vertical text？ #653

Closed

goldengrape mentioned this issue Mar 11, 2023

improve search_embeddings() mukulpatnaik/researchgpt#39

Closed

wenbopeng mentioned this issue May 24, 2023

[Bug]-add_text: Some characters will be hidden if using Simplified Chinese, but it is normal to use Traditional Chinese ahrm/sioyek-python-extensions#5

Open

l1t1 mentioned this issue Oct 12, 2024

keep the table grid lines align when use Chinese fonts #3938

Closed
