list index out of range in to_pandas() #2979

Closed
poetaster opened this issue Jan 4, 2024 · 10 comments

@poetaster

Description of the bug

I am able to read the included pdf and extract tables but the to_pandas function produces:

---> 52 df = tab.to_pandas()

~/.local/lib/python3.10/site-packages/fitz/table.py in to_pandas(self)
   1281             value = []
   1282             for j in range(len(extract)):
-> 1283                 value.append(extract[j][i])
   1284             pd_dict[key] = value
   1285 

118.pdf

How to reproduce the bug

Python 3.10.12

Using the uploaded file:

import pandas as pd  # import pandas
import fitz  # import PyMuPDF
if not hasattr(fitz.Page, "find_tables"):
    raise RuntimeError("This PyMuPDF version does not support the table feature")
    
doc = fitz.open("118.pdf")  # open example file
page = doc[0]  # read first page to demo the layout

dataframes = []  # list of DataFrames per table fragment

tabs = page.find_tables()  # locate tables on page
tab = tabs[0]  # assume fragment to be 1st table

extract = tab.extract()
print(extract, '\n')

print(tab.to_pandas())
#dataframes.append(tab.to_pandas())  # append this DataFrame

outputs

[['Severity class', 'Exposure class', 'Controllability class'], ['C1', 'C2', 'C3'], ['S1', 'E1', 'QM', 'QM', 'QM'], ['E2', 'QM', 'QM', 'QM'], ['E3', 'QM', 'QM', 'A'], ['E4', 'QM', 'A', 'B'], ['S2', 'E1', 'QM', 'QM', 'QM'], ['E2', 'QM', 'QM', 'A'], ['E3', 'QM', 'A', 'B'], ['E4', 'A', 'B', 'C'], ['S3', 'E1', 'QM', 'QM', 'A\na'], ['E2', 'QM', 'A', 'B'], ['E3', 'A', 'B', 'C'], ['E4', 'B', 'C', 'D'], ['a\n \nSee \n6.4.3.11\n.']] 

Table 0 column names: ['Severity class', 'Exposure class', 'Controllability class'], external: False
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_494121/3727834182.py in <module>
     50 
     51 show_image(page, f"Table & Header BBoxes")
---> 52 df = tab.to_pandas()

~/.local/lib/python3.10/site-packages/fitz/table.py in to_pandas(self)
   1281             value = []
   1282             for j in range(len(extract)):
-> 1283                 value.append(extract[j][i])
   1284             pd_dict[key] = value
   1285 

IndexError: list index out of range

PyMuPDF version

1.23.8

Operating system

Linux

Python version

3.10

@poetaster
Author

poetaster commented Jan 4, 2024

For reference, since I think it's an issue with the multi-row headers, here is the output from camelot as a screenshot. As the drawing below the dataframe output shows, camelot gets the layout correct but shifts the rows in the controllability part.

Screenshot from 2024-01-04 20-56-57

@poetaster
Author

And finally what data can be extracted:

names0 = None  # column names for comparison purposes
all_extracts = []  # all table rows go here

for page in doc:  # iterate over the pages
    tabs = page.find_tables()  # find tables on page
    if not tabs.tables:  # a page without table: stop processing
        break
    tab = tabs[0]  # access first table
    header = tab.header  # get its header
    external = header.external  # header outside table body?
    names = header.names  # column names
    if page.number == 0:  # on first page, store away column names
        names0 = names
    elif names != names0:  # not our table anymore
        break
    extract = tab.extract()  # get text for all table cells
    if not external:  # if header contained in table body 
        extract = extract[1:]  # omit repeating header row
    all_extracts.extend(extract)  # append to total list

print(f"The joined table has {len(all_extracts)} rows and {len(names0)} columns.\n")
print(names0)
for i, r in enumerate(all_extracts):
    print(r)
    if i >= 50:
        print("...")
        break

outputs

The joined table has 28 rows and 3 columns.

['Severity class', 'Exposure class', 'Controllability class']
['C1', 'C2', 'C3']
['S1', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'QM']
['E3', 'QM', 'QM', 'A']
['E4', 'QM', 'A', 'B']
['S2', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'A']
['E3', 'QM', 'A', 'B']
['E4', 'A', 'B', 'C']
['S3', 'E1', 'QM', 'QM', 'A\na']
['E2', 'QM', 'A', 'B']
['E3', 'A', 'B', 'C']
['E4', 'B', 'C', 'D']
['a\n \nSee \n6.4.3.11\n.']
['C1', 'C2', 'C3']
['S1', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'QM']
['E3', 'QM', 'QM', 'A']
['E4', 'QM', 'A', 'B']
['S2', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'A']
['E3', 'QM', 'A', 'B']
['E4', 'A', 'B', 'C']
['S3', 'E1', 'QM', 'QM', 'A\na']
['E2', 'QM', 'A', 'B']
['E3', 'A', 'B', 'C']
['E4', 'B', 'C', 'D']
['a\n \nSee \n6.4.3.11\n.']

Close, but no cigar: the rows contain between 1 and 5 cells, while the header has only 3 column names.
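As a stopgap for ragged rows like the ones printed above, the rows can be padded to a common width before building a DataFrame. The following is a minimal sketch, not the PyMuPDF fix: `pad_rows` and `to_column_dict` are hypothetical helper names, and the `Col3`, `Col4` naming for columns without a header cell is an assumption. Padding on the right does not restore the true column alignment (a short row such as `['E2', 'QM', 'QM', 'QM']` still starts in the first column); it only makes the data rectangular.

```python
def pad_rows(rows, names):
    """Pad every row with None to a common width and extend the names to match."""
    width = max([len(names)] + [len(row) for row in rows])
    # Assumed naming scheme for columns that have no header cell.
    full_names = list(names) + [f"Col{i}" for i in range(len(names), width)]
    padded = [list(row) + [None] * (width - len(row)) for row in rows]
    return full_names, padded


def to_column_dict(rows, names):
    """Build a {column name: list of values} mapping from padded rows."""
    full_names, padded = pad_rows(rows, names)
    return {name: [row[i] for row in padded] for i, name in enumerate(full_names)}


names = ['Severity class', 'Exposure class', 'Controllability class']
rows = [  # a few of the rows from the extract above
    ['C1', 'C2', 'C3'],
    ['S1', 'E1', 'QM', 'QM', 'QM'],
    ['E2', 'QM', 'QM', 'QM'],
    ['a\n \nSee \n6.4.3.11\n.'],
]
columns = to_column_dict(rows, names)
print(list(columns))
```

The resulting dictionary has lists of equal length, so it can be passed to `pandas.DataFrame(columns)` without raising.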

@poetaster
Author

poetaster commented Jan 4, 2024

I just tested my theory by adding a print statement at line 1283 of table.py:

1283 print(extract[j][i])

which produces:

C1
S1
E2
E3
E4
S2
E2
E3
E4
S3
E2
E3
E4
a
 
See 
6.4.3.11
.
C2
E1
QM
QM
QM
E1
QM
QM
A
E1
QM
A
B

Then it fails as before.
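The printed values suggest where the loop breaks. The sketch below replays a column-by-column loop like the one in the traceback against the extracted rows and reports the first cell access that raises `IndexError`. It assumes the header row has already been dropped (the printed output starts at `C1`), and `first_failing_cell` is a hypothetical helper, not part of PyMuPDF.

```python
def first_failing_cell(rows, n_columns):
    """Return (column, row) of the first access that raises IndexError, else None."""
    for i in range(n_columns):        # one pass per column name
        for j in range(len(rows)):    # walk down the rows, as in the traceback
            try:
                rows[j][i]
            except IndexError:
                return i, j
    return None


rows = [  # extract from the issue, without the header row
    ['C1', 'C2', 'C3'],
    ['S1', 'E1', 'QM', 'QM', 'QM'],
    ['E2', 'QM', 'QM', 'QM'],
    ['E3', 'QM', 'QM', 'A'],
    ['E4', 'QM', 'A', 'B'],
    ['S2', 'E1', 'QM', 'QM', 'QM'],
    ['E2', 'QM', 'QM', 'A'],
    ['E3', 'QM', 'A', 'B'],
    ['E4', 'A', 'B', 'C'],
    ['S3', 'E1', 'QM', 'QM', 'A\na'],
    ['E2', 'QM', 'A', 'B'],
    ['E3', 'A', 'B', 'C'],
    ['E4', 'B', 'C', 'D'],
    ['a\n \nSee \n6.4.3.11\n.'],
]
print(first_failing_cell(rows, 3))  # (1, 13)
```

Column 0 succeeds for all 14 rows, and column 1 fails at row 13, the single-cell footnote row. That matches the print output above, which stops right after `B`.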

@JorjMcKie JorjMcKie added the bug label Jan 7, 2024
@JorjMcKie JorjMcKie self-assigned this Jan 7, 2024
@JorjMcKie JorjMcKie added the fix developed release schedule to be determined label Jan 13, 2024
@JorjMcKie
Collaborator

This is what the fix achieves:

doc=fitz.open("118.pdf")
page=doc[0]
tab=page.find_tables()[0]
tab.to_pandas()
     Severity class Exposure class Controllability class  Col3  Col4
0              None           None                    C1    C2    C3
1                S1             E1                    QM    QM    QM
2              None             E2                    QM    QM    QM
3              None             E3                    QM    QM     A
4              None             E4                    QM     A     B
5                S2             E1                    QM    QM    QM
6              None             E2                    QM    QM     A
7              None             E3                    QM     A     B
8              None             E4                     A     B     C
9                S3             E1                    QM    QM    Aa
10             None             E2                    QM     A     B
11             None             E3                     A     B     C
12             None             E4                     B     C     D
13  a See 6.4.3.11.           None                  None  None  None

The label "fix developed" means that a rollout schedule still needs to be decided.
However, all changes of the fix are inside the file table.py, so in essence only this file needs to be replaced.

@poetaster
Author

Heh! Cool. Which branch is the fix in? I'll test it if I can find it :) But it can wait :) Thanks!

@JorjMcKie
Collaborator

Thanks for your willingness!
You can take the table.py file and simply copy it into your Python PyMuPDF installation folder. If you are using a recent version (1.23.9 and up), that file is ... /Python3.xx/Lib/site-packages/fitz/table.py (Windows) or ... /lib/python3.xx/site-packages/fitz/table.py (Linux).
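If the installation folder is not obvious, Python can report it. This is a small sketch using only the standard library; `find_package_file` is a hypothetical helper name, and it assumes the package is installed as a regular folder containing `table.py`, as described above. It is worth keeping a backup of the original file before overwriting it.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_file(package: str, filename: str) -> Optional[Path]:
    """Return the path of `filename` inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed, or not a folder-based package
    for folder in spec.submodule_search_locations:
        candidate = Path(folder) / filename
        if candidate.is_file():
            return candidate
    return None


target = find_package_file("fitz", "table.py")
print(target if target else "fitz/table.py not found in this environment")
```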

@JorjMcKie
Collaborator

The branch is named "Fix-table-issues".

@poetaster
Author

I dropped in the file and presto! It's now a complete representation of the table! Yeah!

Thanks! I'll try to take some time in the coming days to look at your changes. I think they could benefit a number of other projects, too :)

I'm working with a lot of industry docs. Should I report this kind of thing as a matter of course? I'll try to determine fixes myself if I grok what's happened in this case.

@JorjMcKie
Collaborator

Thank you for testing and confirmation!

I'm working with a lot of industry docs. Should I report this kind of thing as a matter of course? I'll try to determine fixes myself if I grok what's happened in this case.

That would be great 😉!

I did indeed make other changes. For example, I am now identifying areas that contain connected vector graphics elements and treating their rectangular hull as additional "virtual" lines. This makes more tables detectable, such as this one:

image

Many table detectors fail because of missing left and right cell borders.
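To illustrate the "virtual lines" idea, here is a rough sketch and not the code in table.py. It merges rectangles whose bounding boxes touch into one hull and returns the hull's four edges as line segments. All names (`touches`, `hull`, `cluster_hulls`, `hull_edges`) and the tolerance value are assumptions made for this example, and connectivity is approximated by bounding-box contact only.

```python
def touches(a, b, tol=1.0):
    """True if rectangles (x0, y0, x1, y1) overlap or lie within `tol` of each other."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol and
            a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def hull(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_hulls(rects, tol=1.0):
    """Repeatedly merge touching rectangles until no further merge is possible."""
    hulls = [tuple(r) for r in rects]
    changed = True
    while changed:
        changed = False
        result = []
        for rect in hulls:
            for k, other in enumerate(result):
                if touches(rect, other, tol):
                    result[k] = hull(rect, other)
                    changed = True
                    break
            else:
                result.append(rect)
        hulls = result
    return hulls


def hull_edges(r):
    """The four border segments of a hull: left, right, top, bottom."""
    x0, y0, x1, y1 = r
    return [((x0, y0), (x0, y1)), ((x1, y0), (x1, y1)),
            ((x0, y0), (x1, y0)), ((x0, y1), (x1, y1))]


# Three stacked row backgrounds without side borders, plus one unrelated shape.
rects = [(50, 100, 300, 120), (50, 120, 300, 140), (50, 140, 300, 160),
         (400, 400, 450, 420)]
hulls = cluster_hulls(rects)
print(hulls)
print(hull_edges(hulls[0])[:2])  # the "virtual" left and right borders
```

In this toy input the three row rectangles collapse into one hull, whose left and right edges stand in for the cell borders the drawing itself lacks.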

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.13.
