insertPDF to merge multiple PDFs into one PDF, but tests show a memory leak #669

Closed

zhou3968322 opened this issue Sep 28, 2020 · 6 comments

@zhou3968322

Describe the bug

insertPDF causes a memory leak.

To Reproduce (mandatory)

I tested the following code in a Docker container:

import time

import fitz

pdf_path_list = ["input_1.pdf", "input_2.pdf", ...., "input_33.pdf"]

for i in range(1000):
    out_pdf_path = "test_{}.pdf".format(i)
    bt = time.time()
    doc = fitz.Document()

    for j in range(len(pdf_path_list)):
        sub_doc = fitz.open(pdf_path_list[j])
        doc.insertPDF(sub_doc)
        sub_doc.close()
    print("insert pdf cost:{}".format(time.time() - bt))
    bt = time.time()
    # I have tested garbage=0,1,2,3,4; with option 4 the output PDF is the smallest
    doc.save(out_pdf_path, garbage=4)
    doc.close()
    print("save to pdf cost:{}".format(time.time() - bt))

When I ran this, the Docker container's memory usage increased from 96 MB to 300 MB, and then to 400 MB.

Is there a memory leak?

Your configuration (mandatory)

Python version and platform are as follows:

3.7.4 (default, Aug 14 2019, 12:19:22)
[GCC 8.3.0]
 linux

PyMuPDF 1.17.6: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-08-26 14:54:32.
Built for Python 3.7 on linux (64-bit).
@zhou3968322 (Author)

I used memory-pix.py from #389; here are the results:

Memory deltas (MB) for insertPDF
0 stop delta 49.1
1 stop delta 13.0
2 stop delta 2.4
3 stop delta 1.1
4 stop delta 1.2
5 stop delta 1.1
6 stop delta 3.7
7 stop delta -0.3
8 stop delta 1.1
9 stop delta 1.2
10 stop delta 3.7
11 stop delta -1.4
12 stop delta 1.1
13 stop delta 1.1
14 stop delta 1.2
15 stop delta 5.2
16 stop delta -1.1
17 stop delta 1.1
18 stop delta 3.7
19 stop delta -0.3
20 stop delta 3.7
21 stop delta -1.4
22 stop delta 2.2
23 stop delta 2.3
24 stop delta 1.2
25 stop delta 1.1
26 stop delta 1.1
27 stop delta 2.2
28 stop delta 1.8
29 stop delta 1.1
30 stop delta 0.5
31 stop delta 1.1
32 stop delta 1.1
33 stop delta 3.6
34 stop delta 2.3
35 stop delta -1.3
36 stop delta 3.3
37 stop delta 1.4
38 stop delta -0.2
39 stop delta 1.1
40 stop delta 1.1
41 stop delta 3.6
42 stop delta -0.2
43 stop delta 1.2
44 stop delta 1.1
45 stop delta 1.1
46 stop delta 2.2
47 stop delta 3.7
48 stop delta 0.9
49 stop delta 1.4
================================================================================
Total pages processed: 1650
Duration: 87 sec

@JorjMcKie (Collaborator)

Thanks for reporting this.
I have located the major source of this. It will be fixed in the next version.

@zhou3968322 (Author)

zhou3968322 commented Sep 30, 2020

I have reviewed the latest MuPDF source code, pdfmerge.c (mupdf/source/tools/pdfmerge.c) and pdf-graft.c (mupdf/source/pdf/pdf-graft.c), and suggest a change in the page_merge method.

PyMuPDF/fitz/helper-other.i, lines 510 to 513 at 9b07a4b (current code):

    fz_always(ctx) {
        pdf_drop_obj(ctx, obj);
        pdf_drop_obj(ctx, ref);
    }

Proposed change (also drop page_dict):

    fz_always(ctx) {
        pdf_drop_obj(ctx, obj);
        pdf_drop_obj(ctx, page_dict);
        pdf_drop_obj(ctx, ref);
    }

And also drop pdf_obj *o (in the copy_annots part).
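
For context, a minimal sketch of the ownership rule at play, assuming standard MuPDF refcounting (ctx and doc_des stand in for the real context and destination document; this is illustrative, not the actual page_merge body):

/* A pdf_obj returned by pdf_new_dict() is owned by the caller and must
 * be dropped exactly once, even when an exception unwinds via fz_try. */
pdf_obj *page_dict = NULL;

fz_var(page_dict);                /* keep the pointer valid across fz_try */
fz_try(ctx)
{
    page_dict = pdf_new_dict(ctx, doc_des, 4);  /* caller owns one reference */
    /* ... fill page_dict and graft it into the destination document ... */
}
fz_always(ctx)
{
    pdf_drop_obj(ctx, page_dict); /* the missing drop: without it, one
                                     dictionary leaks per merged page */
}
fz_catch(ctx)
    fz_rethrow(ctx);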

I rebuilt my PyMuPDF with SWIG, and the insertPDF memory leak decreased.

Then I tested with code like this (with the insertPDF call removed):

import gc
import os
import time

import fitz
import psutil

try:
    TOOLS = fitz.TOOLS
except AttributeError:
    TOOLS = fitz.Tools()

gc.set_debug(gc.DEBUG_UNCOLLECTABLE)
pdf_path_list = ["input_1.pdf", "input_2.pdf", ...., "input_33.pdf"]

for i in range(1000):
    process = psutil.Process(os.getpid())
    for j in range(len(pdf_path_list)):
        page_start_memory = process.memory_info().rss / 2 ** 10
        sub_doc = fitz.open(pdf_path_list[j])
        sub_doc.close()
        page_end_memory = process.memory_info().rss / 2 ** 10
        increase = int(page_end_memory - page_start_memory)
        TOOLS.store_shrink(100)  # empty MuPDF's global object store
        gc.collect()
        print("i:{},j:{},increase memory:{}kb".format(i, j, increase))
    time.sleep(1)

After several loop iterations, memory indeed still increases (with no decrease while the program is running):

...
i:34,j:5,increase memory:0kb
i:34,j:6,increase memory:0kb
i:34,j:7,increase memory:380kb
i:34,j:8,increase memory:0kb
i:34,j:9,increase memory:0kb
...
i:73,j:24,increase memory:0kb
i:73,j:25,increase memory:0kb
i:73,j:26,increase memory:0kb
i:73,j:27,increase memory:260kb
i:73,j:28,increase memory:0kb
i:73,j:29,increase memory:0kb
...
i:123,j:19,increase memory:0kb
i:123,j:20,increase memory:0kb
i:123,j:21,increase memory:76kb
i:123,j:22,increase memory:0kb
i:123,j:23,increase memory:0kb
...
i:246,j:19,increase memory:0kb
i:246,j:20,increase memory:0kb
i:246,j:21,increase memory:260kb
i:246,j:22,increase memory:0kb
i:246,j:23,increase memory:0kb
...
i:287,j:25,increase memory:0kb
i:287,j:26,increase memory:0kb
i:287,j:27,increase memory:264kb
i:287,j:28,increase memory:0kb
i:287,j:29,increase memory:0kb

Is this normal? Maybe there are some global variables that are not released.

@JorjMcKie (Collaborator)

You put your finger exactly in the wound, chapeau!
I indeed forgot to drop page_dict.
After doing this, the memory increase seems reduced by ca. 50% and I even saw occasional decreases (the data below are in KB as well). Without this fix, the final memory landed at about 2.5 MB.
I can't detect other places where I might have forgotten to drop things.
But it seems there are problematic places in MuPDF code itself:

  • pdf_graft.c: not all "new indirect" objects seem to be dropped somewhere
  • pdf_page.c: pdf_flatten_inheritable_page_item localizes all inheritable objects of a page, but should use pdf_dict_put_drop (see the sketch below).
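
For illustration, a hedged sketch of the put vs. put_drop difference, assuming standard MuPDF dictionary semantics (ctx and dict stand in for a real context and dictionary):

/* pdf_dict_put() takes its own reference to val, so the caller still
 * owns one and must drop it explicitly: */
pdf_obj *val = pdf_new_int(ctx, 90);
pdf_dict_put(ctx, dict, PDF_NAME(Rotate), val);
pdf_drop_obj(ctx, val);   /* required, or val leaks */

/* pdf_dict_put_drop() consumes the caller's reference instead,
 * so no separate drop is needed: */
pdf_dict_put_drop(ctx, dict, PDF_NAME(Rotate), pdf_new_int(ctx, 90));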

So I guess we have to leave it like that, unless we find other spots in PyMuPDF.

PyMuPDF 1.17.8: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-09-28 06:31:45.
Built for Python 3.9 on win32 (64-bit).

Run 0: total 892, delta 892
Run 1: total 900, delta 8
Run 2: total 928, delta 28
Run 3: total 984, delta 56
Run 4: total 1048, delta 64
Run 5: total 1036, delta -12
Run 6: total 1036, delta 0
Run 7: total 1040, delta 4
Run 8: total 1044, delta 4
Run 9: total 1060, delta 16
Run 10: total 1036, delta -24
Run 11: total 1036, delta 0
Run 12: total 1040, delta 4
Run 13: total 1040, delta 0
Run 14: total 1044, delta 4
Run 15: total 1044, delta 0
Run 16: total 1116, delta 72
Run 17: total 1140, delta 24
Run 18: total 1144, delta 4
Run 19: total 1412, delta 268
Run 20: total 1212, delta -200
Run 21: total 1208, delta -4
Run 22: total 1208, delta 0
Run 23: total 1208, delta 0
Run 24: total 1212, delta 4
Run 25: total 1212, delta 0
Run 26: total 1244, delta 32
Run 27: total 1268, delta 24
Run 28: total 1280, delta 12
Run 29: total 1248, delta -32
Run 30: total 1240, delta -8
Run 31: total 1252, delta 12
Run 32: total 1636, delta 384
Run 33: total 1324, delta -312
Run 34: total 1328, delta 4
Run 35: total 1340, delta 0
Run 36: total 1344, delta 4
Run 37: total 1412, delta 68
Run 38: total 1420, delta 8
Run 39: total 1400, delta -20
Run 40: total 1396, delta -4
Run 41: total 1416, delta 20
Run 42: total 1416, delta 0
Run 43: total 1424, delta 8
Run 44: total 1436, delta 12
Run 45: total 1452, delta 16
Run 46: total 1452, delta 0
Run 47: total 1476, delta 24
Run 48: total 1468, delta -8
Run 49: total 1464, delta -4

@zhou3968322 (Author)

OK, thanks for your input, I will try it.

@JorjMcKie (Collaborator)

This issue is fixed with v1.18.0 - at least for the major part.
