Order of text extraction from PDF differs when the PDF files are generated on different platforms (Windows and Linux) #3816

kvrameshreddy · 2024-08-28T12:23:27Z


I am trying to extract text from two PDF files that were generated on different platforms (Windows and Linux) and compare them to identify the differences.
In this use case, even though the two files are similar, I see a few words come out in a different order, and the comparison fails.

I am attaching the PDF files and the comparison logic. Can you help me solve this case?

Customers95.pdf
Customers92.pdf
Customers95.pdf_vs_Customers92.pdf_text.pdf

```python
import difflib
import os

import fitz  # PyMuPDF

pdf1 = "D:\\Projects\\DataOps\\pdfcompare\\Customers92.pdf"
pdf2 = "D:\\Projects\\DataOps\\pdfcompare\\Customers95.pdf"


def compare_text(pdf1, pdf2):
    opacity = 0.4
    color = (1.0, 0.3, 0.3)
    path = os.path.dirname(os.path.abspath(pdf1))
    file1 = fitz.open(pdf1)
    file2 = fitz.open(pdf2)
    page1 = file1[0]
    page1.clean_contents()  # cleaning contents to handle geometry changes
    shape1 = page1.new_shape()
    page2 = file2[0]
    page2.clean_contents()  # cleaning contents to handle geometry changes
    shape2 = page2.new_shape()
    d = difflib.Differ()
    # each word item is (x0, y0, x1, y1, "word", block_no, line_no, word_no)
    text1 = page1.get_text("words", sort=True)
    text2 = page2.get_text("words", sort=True)
    line1 = list(d.compare([x[4] for x in text1], [x[4] for x in text2]))
    c1 = 0  # index of the current word in text1
    c2 = 0  # index of the current word in text2
    for i in line1:
        if i[0] == " ":  # word present in both files
            c1 += 1
            c2 += 1
        elif i[0] == "-":  # word only in file 1: highlight it there
            shape1.draw_rect(text1[c1][0:4])
            shape1.finish(fill=color, fill_opacity=opacity)
            shape1.commit()
            c1 += 1
        elif i[0] == "+":  # word only in file 2: highlight it there
            shape2.draw_rect(text2[c2][0:4])
            shape2.finish(fill=color, fill_opacity=opacity)
            shape2.commit()
            c2 += 1

    w, h = fitz.paper_size(
        "a3-l"
    )  # width, height of output page (format ISO A3 landscape)
    rleft = fitz.paper_rect(
        "a4"
    )  # rectangle of left output page half (A4 portrait)
    rright = rleft + (rleft.width, 0, rleft.width, 0)
    doc = fitz.open()  # output file
    dpi = 175  # choose quality of the page images
    mat = fitz.Matrix(dpi / 72, dpi / 72)  # make a zoom matrix

    page = doc.new_page(width=w, height=h)
    pageleft = file1[0]  # page for left image
    pageleft.clean_contents()  # cleaning contents to handle geometry changes
    pageright = file2[0]  # page for right image
    pageright.clean_contents()  # cleaning contents to handle geometry changes
    pixleft = pageleft.get_pixmap(matrix=mat)  # make images
    pixright = pageright.get_pixmap(matrix=mat)
    page.insert_image(rleft, pixmap=pixleft)
    p = fitz.Rect(40, 30, 555, 700)
    page.insert_textbox(
        p,  # rectangle that receives the caption
        pdf1 + " page: 1",  # the text (honors '\n')
        fontname="helv",  # the default font
        fontsize=11,  # the default font size
        rotate=0,  # also available: 90, 180, 270
        color=(1, 0.6, 0.6),
        align=1,  # centered
    )
    page.insert_image(rright, pixmap=pixright)
    p = fitz.Rect(635, 30, 1150, 700)
    page.insert_textbox(
        p,  # rectangle that receives the caption
        pdf2 + " page: 1",  # the text (honors '\n')
        fontname="helv",  # the default font
        fontsize=11,  # the default font size
        rotate=0,  # also available: 90, 180, 270
        color=(1, 0.6, 0.6),
        align=1,  # centered
    )
    # vertical separator between the two page images
    page.draw_line((595, 0), (595, 842), color=(1, 0, 0), width=1)
    doc.save(
        os.path.join(
            path,
            os.path.basename(pdf1)
            + "_vs_"
            + os.path.basename(pdf2)
            + "_text.pdf",
        ),
        garbage=3,
        deflate=True,
    )
    return os.path.abspath(path)


compare_text(pdf1, pdf2)
```


Note: if the PDFs are generated on the same platform, everything works fine.

JorjMcKie · 2024-08-28T14:55:18Z

Maintainer

I believe you that the files are different - without looking at your comparison script.

What I am actually interested in is why you believe they should be identical. Have they been created by the same program, running once on each of the platforms? With identical input data each time?

8 replies

JorjMcKie Aug 28, 2024
Maintainer

To be more precise:
If that software is made with PyMuPDF, and uses the same version of Python and PyMuPDF, then we have a case for investigating things.

Otherwise, you just have to adjust your code so it can deal with whatever potential differences.

JorjMcKie Aug 28, 2024
Maintainer

Please be especially aware that sorting by text coordinates, as it is, does not forgive the slightest differences: even a deviation of bottom coordinates by 1E-7 (depending on the machine) can change the sequence of, e.g., two words ...
So you may want to introduce some generosity (rounding to integer, for example, hoping to be lucky that this will round to the same neighboring integer ...).
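A minimal sketch of the rounding idea, not a tested fix for the attached files. It assumes the word tuples have the layout returned by `page.get_text("words")`, i.e. `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, and that you extract without `sort=True` and sort yourself. The helper name `sort_words_rounded` and its `ndigits` parameter are made up for illustration, and the sample coordinates are invented.

```python
def sort_words_rounded(words, ndigits=0):
    """Sort word tuples by rounded bottom (y1), then by rounded left edge (x0)."""
    return sorted(
        words,
        key=lambda w: (round(w[3], ndigits), round(w[0], ndigits)),
    )


# Invented data: the same two words, with bottoms differing by 1e-7
# in opposite directions on the two "platforms".
win_words = [
    (100.0, 50.0, 130.0, 62.0000001, "World", 0, 0, 1),
    (60.0, 50.0, 95.0, 62.0, "Hello", 0, 0, 0),
]
linux_words = [
    (100.0, 50.0, 130.0, 62.0, "World", 0, 0, 1),
    (60.0, 50.0, 95.0, 62.0000001, "Hello", 0, 0, 0),
]

# Sorting on the exact coordinates gives a different word order per platform ...
exact_win = [w[4] for w in sorted(win_words, key=lambda w: (w[3], w[0]))]
exact_linux = [w[4] for w in sorted(linux_words, key=lambda w: (w[3], w[0]))]

# ... while sorting on rounded coordinates gives the same order for both.
rounded_win = [w[4] for w in sort_words_rounded(win_words)]
rounded_linux = [w[4] for w in sort_words_rounded(linux_words)]
print(exact_win, exact_linux)
print(rounded_win, rounded_linux)
```

With real pages you would call something like `sort_words_rounded(page.get_text("words"))`. Two bottoms that straddle a rounding boundary (for example 62.49 and 62.51) still end up in different rows, which is the "hoping to be lucky" part.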

kvrameshreddy Aug 28, 2024
Author

The software is the same version, but it is not developed with Python and PyMuPDF.

I will try rounding the coordinates.

The problem I see in the extraction is this: in one report the words come in one order, while in the other a few words come either at the start or at the end of the sentence, even though they are in the same position on the report. I guess this is due to the coordinate differences. Is there any way I can extract text in the same order irrespective of the platform on which the PDF files are generated?

JorjMcKie Aug 29, 2024
Maintainer

See my previous post: it is up to your wits to find a solution.
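One possible direction, sketched under assumptions rather than offered as the answer: instead of rounding, group words into lines whenever their bottoms lie within a tolerance of each other, then order each line left to right. The function name `words_in_reading_order` and the `y_tol` parameter are hypothetical, the word tuples are assumed to follow the `get_text("words")` layout `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, and the sample data is invented.

```python
def words_in_reading_order(words, y_tol=2.0):
    """Group word tuples into lines by bottom coordinate, then sort each line by x0."""
    lines = []  # each entry is [reference_bottom, [word tuples on that line]]
    for w in sorted(words, key=lambda w: w[3]):
        # A word joins the current line if its bottom is within y_tol
        # of the bottom of the first word placed on that line.
        if lines and abs(w[3] - lines[-1][0]) <= y_tol:
            lines[-1][1].append(w)
        else:
            lines.append([w[3], [w]])
    ordered = []
    for _, line_words in lines:
        ordered.extend(sorted(line_words, key=lambda w: w[0]))
    return ordered


# Invented data: bottoms 62.4 and 62.6 would round to different integers,
# but they fall on the same line with a tolerance of 2 points.
file_a = [
    (60.0, 50.0, 95.0, 62.4, "Hello", 0, 0, 0),
    (100.0, 50.0, 130.0, 62.6, "World", 0, 0, 1),
    (60.0, 70.0, 95.0, 82.0, "Next", 0, 1, 0),
]
file_b = [
    (60.0, 50.0, 95.0, 62.6, "Hello", 0, 0, 0),
    (100.0, 50.0, 130.0, 62.4, "World", 0, 0, 1),
    (60.0, 70.0, 95.0, 82.0, "Next", 0, 1, 0),
]

order_a = [w[4] for w in words_in_reading_order(file_a)]
order_b = [w[4] for w in words_in_reading_order(file_b)]
print(order_a)
print(order_b)
```

The tolerance has to stay smaller than the line spacing of the report, otherwise neighboring lines merge. A suitable value depends on the font size and layout of the files being compared.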

kvrameshreddy Aug 29, 2024
Author

Sure, thank you.
