pymupdf.Page.replace_image() will save an image stream file 2 times #4328

Muolae · 2025-02-25T09:51:11Z

Muolae
Feb 25, 2025

Description of the bug

new_xref = page.insert_image(
    page.rect, filename=filename, stream=stream, pixmap=pixmap
)
doc.xref_copy(new_xref, xref)  # copy over new to old

The call doc.xref_copy(new_xref, xref) copies the stream of the new image to the old xref, but it does not delete the stream of the new image. As a result, the old xref and new_xref point to the same data, and that data is stored twice in the PDF file.

I want to replace the original image with a smaller one. However, I found that the file size after replacement is larger than the size obtained by deleting the image outright plus the size of the replacement image. The excess is exactly equal to the size of the replacement image file, so I guess the image is actually stored twice. I looked at the source code and came to the conclusion above.

Adding
page.delete_image(new_xref)
after it fixes this, but I think there is a better solution.

How to reproduce the bug

Use pymupdf.Page.replace_image to replace an image with a smaller one, then look at how the size of the PDF file changes.
Sorry, my English is bad.

PyMuPDF version

1.25.2

Operating system

Windows

Python version

3.9

JorjMcKie · 2025-02-25T09:58:50Z

JorjMcKie
Feb 25, 2025
Maintainer

I don't understand what you are actually doing: please provide your PDF and your script.
We cannot deal with issues that do not follow the requirements as stated in the bug issue form.

One intermediate comment though:
When saving the PDF after replacing / deleting an image make sure to use garbage collection and compression options so that the old data is removed! Recommended is using doc.ez_save().

Muolae · 2025-02-25T10:12:07Z

Muolae
Feb 25, 2025
Author

I want to replace the original images in the PDF file with smaller ones. My program extracts each image in the file, uses a compression function to reduce its size, and then replaces the original image. But I found that, for example, if I delete all the images directly, the resulting file size is 1MB, and if I replace the original images with compressed ones totalling 2MB, I get a file of 5MB. I think the reasonable size is 3MB. This means the PDF file stores two copies of each of these images. Thank you for your reply.
for page in doc.pages():
    imageList = page.get_images()

    for imginfo in imageList:
        xref = imginfo[0]  # get the image's xref
        image_info = doc.extract_image(xref)

        # save the image to a file
        new_name = f"./src/temp/image_{xref}.{image_info['ext']}"
        with open(new_name, "wb") as img_file:
            img_file.write(image_info["image"])
        img = imgopen(new_name)
        pix_size += img.height * img.width * 3 // 1024
        img_size += os.path.getsize(new_name) / 1024

        loc = page.get_image_rects(imginfo)
        # doc._deleteObject(imginfo[0])
        page.delete_image(xref)
        compressed_image_name = new_name.replace(new_name.split(".")[-1], 'jpg')
        picCompress.compress_by_quality(new_name, compressed_image_name, quality=15)
        zimg_size += os.path.getsize(compressed_image_name) / 1024
        # page.insert_image(loc[0], filename=compressed_image_name, overlay=False)
        page.replace_image(xref, filename=compressed_image_name)
        # my_replace(page, xref, filename=compressed_image_name)
    now_page += 1
    # ws.send("reg:" + str(int(100 * now_page / page_number)))
# print("Image pixel-area size:", pix_size, "KB")
# print("Actual image size:", img_size, "KB")
# print("Compressed image size:", zimg_size, "KB")
# compute the size of the file without images
doc.ez_save(filename='./out/out.pdf')

JorjMcKie · 2025-02-25T10:21:25Z

JorjMcKie
Feb 25, 2025
Maintainer

I see no issue here, but rather a request for help. Moving this to "Discussions".

JorjMcKie · 2025-02-25T10:31:27Z

JorjMcKie
Feb 25, 2025
Maintainer

There is no information about what that image compression function is.
Please note that image compression is ignored by PDF! There is a handful of compression algorithms and image formats that PDF supports. For example "WEBP" is not among them.
If you pass an image to PDF it will compress it according to its supported algorithms.

The only thing you can do is accepting a loss of image quality: for example use grayscale or change resolution.
When you then use .replace_image(), do not delete the old image before that! This is done by the method automatically.

Muolae · 2025-02-25T12:10:35Z

Muolae
Feb 25, 2025
Author

I wrote a test program to prove what I said and you can run it to give it a try. If I go the other way, delete the image and then insert the compressed image, the file size will be correct.


import os

import fitz
import pymupdf
from PIL import Image
from PIL.Image import open as imgopen


def compress_by_quality(pic_path, out_path,quality):
    img = Image.open(pic_path)
    rgbimg=img.convert('RGB')
    rgbimg.save(out_path, format="JPEG", quality=quality, subsampling=0)


def parse_pdf_file(ws, filename):
    if not os.path.exists('./out/'):
        os.makedirs('./out/')
    if not os.path.exists('./temp/'):
        os.makedirs('./temp/')
    # Parse the file
    doc1 = fitz.open(filename)
    # The file size after all the pictures are deleted
    for page in doc1.pages():
        imageList = page.get_images()
        for imginfo in imageList:
            xref = imginfo[0]  # get the image's xref
            page.delete_image(xref)
    doc1.ez_save(filename='./out/out.pdf')
    no_image_pdf_size=os.path.getsize('./out/out.pdf')/1024
    print("The file size after all the pictures are deleted:",no_image_pdf_size,"KB")

    doc2 = fitz.open(filename)
    compress_image_size = 0
    for page in doc2.pages():
        imageList = page.get_images()
        for imginfo in imageList:
            xref = imginfo[0]  # get the image's xref
            image_info = doc2.extract_image(xref)
            # save the image to a file
            new_name = f"./temp/image_{xref}.{image_info['ext']}"
            with open(new_name, "wb") as img_file:
                img_file.write(image_info["image"])
            img = imgopen(new_name)
            compressed_image_name = new_name.replace(new_name.split(".")[-1], 'jpg')
            compress_by_quality(new_name, compressed_image_name, quality=15)
            compress_image_size+=os.path.getsize(compressed_image_name)/1024
            page.replace_image(xref, filename=compressed_image_name)
    doc2.ez_save(filename='./out/out.pdf')
    replaced_image_pdf_size = os.path.getsize('./out/out.pdf') / 1024
    print("The size of the file after replacing with a small image:",replaced_image_pdf_size,"KB")
    print("The total size of the smaller image:",compress_image_size,'KB')
    print("2*smaller image+no image pdf size",2*compress_image_size+no_image_pdf_size,"KB")
    print("compress pdf size",replaced_image_pdf_size)

    # 3
    doc3 = fitz.open(filename)
    for page in doc3.pages():
        imageList = page.get_images()
        for imginfo in imageList:
            xref = imginfo[0]  # get the image's xref
            image_info = doc3.extract_image(xref)
            new_name = f"./temp/image_{xref}.{image_info['ext']}"
            with open(new_name, "wb") as img_file:
                img_file.write(image_info["image"])
            loc = page.get_image_rects(imginfo)
            # doc3._deleteObject(imginfo[0])
            page.delete_image(xref)
            compressed_image_name = new_name.replace(new_name.split(".")[-1], 'jpg')
            compress_by_quality(new_name, compressed_image_name, quality=15)
            page.insert_image(loc[0], filename=compressed_image_name)
    doc3.ez_save(filename='./out/out.pdf')
    insert_pdf_size=os.path.getsize('./out/out.pdf') / 1024
    print("compress pdf size by del and insert:", insert_pdf_size, 'KB')


parse_pdf_file(0,"./src/"+'PAMI 2024.pdf')

The result I get is this:

The file size after all the pictures are deleted: 813.1796875 KB
The size of the file after replacing with a small image: 6275.0625 KB
The total size of the smaller image: 2724.9033203125 KB
2*smaller image+no image pdf size: 6262.986328125 KB
compress pdf size: 6275.0625KB
compress pdf size by del and insert: 3581.892578125 KB
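
The figures above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers reported in this post; the variable names are made up for the illustration and nothing here touches PyMuPDF.

```python
# Sizes in KB, copied from the output above.
no_image_pdf = 813.1796875             # PDF with all images deleted
compressed_images = 2724.9033203125    # total size of the compressed JPEGs
after_replace = 6275.0625              # PDF after page.replace_image()
after_delete_insert = 3581.892578125   # PDF after delete_image() + insert_image()

# Expected size if every compressed image were stored once or twice.
stored_once = no_image_pdf + compressed_images
stored_twice = no_image_pdf + 2 * compressed_images

print(f"stored once:  {stored_once:.1f} KB")
print(f"stored twice: {stored_twice:.1f} KB")
print(f"replace_image vs. stored twice: {after_replace - stored_twice:+.1f} KB")
print(f"delete+insert vs. stored once:  {after_delete_insert - stored_once:+.1f} KB")
```

Both measured sizes land within about 50 KB of the corresponding estimate, which is consistent with replace_image keeping each compressed image twice, although it does not prove it.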

Or the other way around: every time you replace an image with the exact same image, you'll notice that the file size gets bigger and bigger.

# Core function for PDF compression
import os

import fitz
import pymupdf
from PIL import Image
from PIL.Image import open as imgopen


def parse_pdf_file2(i, filename):
    doc2 = fitz.open(filename)
    for page in doc2.pages():
        imageList = page.get_images()
        for imginfo in imageList:
            xref = imginfo[0]  # get the image's xref
            image_info = doc2.extract_image(xref)
            # save the image to a file
            new_name = f"./temp/image_{xref}.{image_info['ext']}"
            with open(new_name, "wb") as img_file:
                img_file.write(image_info["image"])
            page.replace_image(xref, filename=new_name)
    doc2.ez_save(filename=f'./out/PAMI 2024_{i+1}.pdf')
    replaced_image_pdf_size = os.path.getsize(f'./out/PAMI 2024_{i+1}.pdf') / 1024
    print("compress pdf size",replaced_image_pdf_size)


for i in range(0,10):
    print(f"run {i} times,",end='')
    parse_pdf_file2(i,f"./out/"+f'PAMI 2024_{i}.pdf')

The result I get is this:

run 0 times,compress pdf size 24639.5048828125
run 1 times,compress pdf size 36589.765625
run 2 times,compress pdf size 48578.318359375
run 3 times,compress pdf size 60630.802734375
run 4 times,

When i is greater than four, my computer cannot continue to execute the program
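
To make the growth easier to see, the differences between consecutive runs can be computed from the sizes printed above. This is just arithmetic on the reported output; the list name is made up for the illustration.

```python
# File sizes in KB after each run, copied from the output above.
sizes = [24639.5048828125, 36589.765625, 48578.318359375, 60630.802734375]

# Growth between consecutive runs.
increments = [later - earlier for earlier, later in zip(sizes, sizes[1:])]

for run, inc in enumerate(increments, start=1):
    print(f"run {run - 1} -> run {run}: +{inc:.1f} KB")
```

The increments are nearly constant, roughly 12,000 KB per run, so the file keeps growing steadily even though the image content never changes.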

I think it's the doc.xref_copy(new_xref, xref) statement in the replace_image() function that causes this: the function adds a stream of the new image to the PDF, which is then copied back to the original image xref.

def replace_image(page: pymupdf.Page, xref: int, *, filename=None, pixmap=None, stream=None):
    """Replace the image referred to by xref.

    Replace the image by changing the object definition stored under xref. This
    will leave the pages appearance instructions intact, so the new image is
    being displayed with the same bbox, rotation etc.
    By providing a small fully transparent image, an effect as if the image had
    been deleted can be achieved.
    A typical use may include replacing large images by a smaller version,
    e.g. with a lower resolution or graylevel instead of colored.

    Args:
        xref: the xref of the image to replace.
        filename, pixmap, stream: exactly one of these must be provided. The
            meaning being the same as in Page.insert_image.
    """
    doc = page.parent  # the owning document
    if not doc.xref_is_image(xref):
        raise ValueError("xref not an image")  # insert new image anywhere in page
    if bool(filename) + bool(stream) + bool(pixmap) != 1:
        raise ValueError("Exactly one of filename/stream/pixmap must be given")
    new_xref = page.insert_image(
        page.rect, filename=filename, stream=stream, pixmap=pixmap
    )
    doc.xref_copy(new_xref, xref)  # copy over new to old
    last_contents_xref = page.get_contents()[-1]
    # new image insertion has created a new /Contents source,
    # which we will set to spaces now
    doc.update_stream(last_contents_xref, b" ")

If I add page.delete_image(new_xref) after doc.xref_copy(new_xref, xref), I can fix this bug.
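
The effect described here can be illustrated with a toy model. This is not PyMuPDF code: the "document" is a plain dict from xref numbers to stream bytes, the "page" is a list of xrefs it references, and toy_replace_image, stored_bytes and drop_new are hypothetical names. It also assumes, as the PyMuPDF documentation describes, that delete_image replaces an image with a small transparent pixmap rather than removing the object, and it ignores whatever garbage collection does at save time.

```python
# Toy model only: xref number -> stream bytes, plus the xrefs a page references.

def toy_replace_image(objects, page_refs, xref, new_stream, drop_new=False):
    """Mimic the insert-then-copy sequence of the function quoted above."""
    new_xref = max(objects) + 1
    objects[new_xref] = new_stream      # like page.insert_image(...)
    page_refs.append(new_xref)          # the page now references the new object
    objects[xref] = objects[new_xref]   # like doc.xref_copy(new_xref, xref)
    if drop_new:
        # Like the proposed page.delete_image(new_xref): shrink the extra
        # object to a tiny placeholder so its data is not kept twice.
        objects[new_xref] = b"\x00"
    return new_xref


def stored_bytes(objects, page_refs):
    """Total stream bytes of all objects still referenced by the page."""
    return sum(len(objects[x]) for x in page_refs)


big, small = b"B" * 1000, b"s" * 200

# Without the extra step: the small image ends up under two xrefs.
objects, page_refs = {1: big}, [1]
toy_replace_image(objects, page_refs, 1, small)
size_without_fix = stored_bytes(objects, page_refs)

# With the extra step: one copy plus a one-byte placeholder.
objects, page_refs = {1: big}, [1]
toy_replace_image(objects, page_refs, 1, small, drop_new=True)
size_with_fix = stored_bytes(objects, page_refs)

print(size_without_fix, size_with_fix)  # 400 201
```

In this model the replacement data is counted twice unless the extra object is shrunk afterwards, which mirrors the file sizes reported above. Whether the real library behaves this way depends on its internals and on the save options.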

Hopefully you can try to run my code; you just need to change the path of the test PDF file. Thanks.
