list index out of range in to_pandas() #2979

poetaster · 2024-01-04T19:47:23Z

Description of the bug

I am able to read the included pdf and extract tables but the to_pandas function produces:

---> 52 df = tab.to_pandas()

~/.local/lib/python3.10/site-packages/fitz/table.py in to_pandas(self)
   1281             value = []
   1282             for j in range(len(extract)):
-> 1283                 value.append(extract[j][i])
   1284             pd_dict[key] = value
   1285

118.pdf

How to reproduce the bug

Python 3.10.12

Using the uploaded file:

import pandas as pd  # import pandas
import fitz  # import PyMuPDF
if not hasattr(fitz.Page, "find_tables"):
    raise RuntimeError("This PyMuPDF version does not support the table feature")
    
doc = fitz.open("118.pdf")  # open example file
page = doc[0]  # read first page to demo the layout

dataframes = []  # list of DataFrames per table fragment

tabs = page.find_tables()  # locate tables on page
tab = tabs[0]  # assume fragment to be 1st table

extract = tab.extract()
print(extract, '\n')

print(tab.to_pandas())
#dataframes.append(tab.to_pandas())  # append this DataFrame

outputs

[['Severity class', 'Exposure class', 'Controllability class'], ['C1', 'C2', 'C3'], ['S1', 'E1', 'QM', 'QM', 'QM'], ['E2', 'QM', 'QM', 'QM'], ['E3', 'QM', 'QM', 'A'], ['E4', 'QM', 'A', 'B'], ['S2', 'E1', 'QM', 'QM', 'QM'], ['E2', 'QM', 'QM', 'A'], ['E3', 'QM', 'A', 'B'], ['E4', 'A', 'B', 'C'], ['S3', 'E1', 'QM', 'QM', 'A\na'], ['E2', 'QM', 'A', 'B'], ['E3', 'A', 'B', 'C'], ['E4', 'B', 'C', 'D'], ['a\n \nSee \n6.4.3.11\n.']] 

Table 0 column names: ['Severity class', 'Exposure class', 'Controllability class'], external: False
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_494121/3727834182.py in <module>
     50 
     51 show_image(page, f"Table & Header BBoxes")
---> 52 df = tab.to_pandas()

~/.local/lib/python3.10/site-packages/fitz/table.py in to_pandas(self)
   1281             value = []
   1282             for j in range(len(extract)):
-> 1283                 value.append(extract[j][i])
   1284             pd_dict[key] = value
   1285 

IndexError: list index out of range

PyMuPDF version

1.23.8

Operating system

Linux

Python version

3.10


poetaster · 2024-01-04T19:59:07Z

For reference, since I think it's an issue with the multiple row headers, here is the output from camelot as a screenshot. As one can see in the drawing below the dataframe output, camelot gets the layout correct but shifts the rows in the controllability part.
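
One quick way to check that theory is to compare the length of each extracted row with the number of header names. A minimal sketch, using a few rows copied from the extract printed above; `ragged_rows` is a hypothetical helper, not part of PyMuPDF:

```python
def ragged_rows(names, rows):
    """Return (row index, row length) for rows whose length differs from the header's."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != len(names)]


# Header names as reported by the table header
names = ["Severity class", "Exposure class", "Controllability class"]

# A few rows copied from the extract() output in this issue
rows = [
    ["C1", "C2", "C3"],
    ["S1", "E1", "QM", "QM", "QM"],
    ["E2", "QM", "QM", "QM"],
    ["a\n \nSee \n6.4.3.11\n."],
]

mismatches = ragged_rows(names, rows)
for index, length in mismatches:
    print(f"row {index} has {length} cells, header has {len(names)}")
```

Rows shorter than the header are the ones that can trigger the IndexError in the loop shown in the traceback; longer rows would only lose their trailing cells.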

poetaster · 2024-01-04T20:05:24Z

And finally what data can be extracted:

names0 = None  # column names for comparison purposes
all_extracts = []  # all table rows go here

for page in doc:  # iterate over the pages
    tabs = page.find_tables()  # find tables on page
    if tabs.tables == []:  # a page without table: stop processing
        break
    tab = tabs[0]  # access first table
    header = tab.header  # get its header
    external = header.external  # header outside table body?
    names = header.names  # column names
    if page.number == 0:  # on first page, store away column names
        names0 = names
    elif names != names0:  # not our table anymore
        break
    extract = tab.extract()  # get text for all table cells
    if not external:  # if header contained in table body 
        extract = extract[1:]  # omit repeating header row
    all_extracts.extend(extract)  # append to total list

print(f"The joined table has {len(all_extracts)} rows and {len(names0)} columns.\n")
print(names0)
for i, r in enumerate(all_extracts):
    print(r)
    if i >= 50:
        print("...")
        break

outputs

The joined table has 28 rows and 3 columns.

['Severity class', 'Exposure class', 'Controllability class']
['C1', 'C2', 'C3']
['S1', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'QM']
['E3', 'QM', 'QM', 'A']
['E4', 'QM', 'A', 'B']
['S2', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'A']
['E3', 'QM', 'A', 'B']
['E4', 'A', 'B', 'C']
['S3', 'E1', 'QM', 'QM', 'A\na']
['E2', 'QM', 'A', 'B']
['E3', 'A', 'B', 'C']
['E4', 'B', 'C', 'D']
['a\n \nSee \n6.4.3.11\n.']
['C1', 'C2', 'C3']
['S1', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'QM']
['E3', 'QM', 'QM', 'A']
['E4', 'QM', 'A', 'B']
['S2', 'E1', 'QM', 'QM', 'QM']
['E2', 'QM', 'QM', 'A']
['E3', 'QM', 'A', 'B']
['E4', 'A', 'B', 'C']
['S3', 'E1', 'QM', 'QM', 'A\na']
['E2', 'QM', 'A', 'B']
['E3', 'A', 'B', 'C']
['E4', 'B', 'C', 'D']
['a\n \nSee \n6.4.3.11\n.']

close, but no cigar.

poetaster · 2024-01-04T20:12:29Z

I just tested my theory adding:

1283 print(extract[j][i])

which produces:

C1
S1
E2
E3
E4
S2
E2
E3
E4
S3
E2
E3
E4
a
 
See 
6.4.3.11
.
C2
E1
QM
QM
QM
E1
QM
QM
A
E1
QM
A
B

All of this is printed before it fails as before.
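
Reading the printed values against the extract above: the first pass (i = 0) gets through every row, including the one-cell footnote row, and the second pass (i = 1) then runs off the end of that footnote row. As a stopgap, rows can be padded to a common width before building columns. A minimal sketch; `to_columns` is a hypothetical helper and not the change made in table.py, and right-padding does not move cells under the correct headers:

```python
def to_columns(names, rows, fill=None):
    """Build a dict of column lists from ragged rows.

    Rows are right-padded with `fill`; extra columns get invented names
    Col3, Col4, ... (an assumption made here for illustration).
    """
    width = max([len(names)] + [len(row) for row in rows])
    keys = list(names) + [f"Col{i}" for i in range(len(names), width)]
    padded = [list(row) + [fill] * (width - len(row)) for row in rows]
    return {key: [row[i] for row in padded] for i, key in enumerate(keys)}


names = ["Severity class", "Exposure class", "Controllability class"]
rows = [
    ["C1", "C2", "C3"],
    ["S1", "E1", "QM", "QM", "QM"],
    ["E2", "QM", "QM", "QM"],
    ["a\n \nSee \n6.4.3.11\n."],
]

columns = to_columns(names, rows)
for key, values in columns.items():
    print(key, values)
```

The resulting dict has equal-length lists, so it could be handed to pandas.DataFrame(columns) without an index error, although the cells of the shorter rows still sit in the wrong columns.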

JorjMcKie · 2024-01-13T14:29:57Z

This is what the fix achieves:

doc=fitz.open("118.pdf")
page=doc[0]
tab=page.find_tables()[0]
tab.to_pandas()
     Severity class Exposure class Controllability class  Col3  Col4
0              None           None                    C1    C2    C3
1                S1             E1                    QM    QM    QM
2              None             E2                    QM    QM    QM
3              None             E3                    QM    QM     A
4              None             E4                    QM     A     B
5                S2             E1                    QM    QM    QM
6              None             E2                    QM    QM     A
7              None             E3                    QM     A     B
8              None             E4                     A     B     C
9                S3             E1                    QM    QM    Aa
10             None             E2                    QM     A     B
11             None             E3                     A     B     C
12             None             E4                     B     C     D
13  a See 6.4.3.11.           None                  None  None  None

The label "fix developed" means that a rollout schedule still needs to be decided.
However, all changes of the fix are inside the file table.py, so in essence only this file needs to be replaced.

poetaster · 2024-01-13T16:44:29Z

Heh! Cool. Which branch is the fix in? I'll test it if I can find it :) But it can wait :) Thanks!

JorjMcKie · 2024-01-13T16:54:06Z

> Heh! Cool. Which branch is the fix in? I'll test it if I can find it :) But it can wait :) Thanks!

Thanks for your willingness!
You can have the table.py file and simply copy it into your Python pymupdf installation folder. If you are using a most recent version (1.23.9 and up), that folder/file is one of ... /Python3.xx/Lib/site-packages/fitz/table.py (Windows) or ... /lib/python3.xx/site-packages/fitz/table.py (Linux).
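
If it is unclear where that folder lives on a given machine, Python can report it. A minimal sketch using only the standard library; `package_dir` is a hypothetical helper name, and the lookup simply returns None when the package is not installed:

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the folder of an importable package, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


folder = package_dir("fitz")  # None if PyMuPDF is not installed
if folder is not None:
    print(folder / "table.py")  # the file to replace
```

Keeping a backup copy of the original table.py before overwriting it makes it easy to go back.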

JorjMcKie · 2024-01-13T16:59:31Z

The branch is named "Fix-table-issues".

poetaster · 2024-01-13T17:59:47Z

> > Heh! Cool. Which branch is the fix in? I'll test it if I can find it :) But it can wait :) Thanks!
>
> Thanks for your willingness! You can have the table.py file and simply copy it into your Python pymupdf installation folder. If you are using a most recent version (1.23.9 and up), that folder/file is one of ... /Python3.xx/Lib/site-packages/fitz/table.py (Windows) or ... /lib/python3.xx/site-packages/fitz/table.py (Linux).

I dropped in the file and presto! It's now a complete representation of the table! Yeah!

Thanks! I'll try to take some time in the coming days to look at your changes. I think they could benefit a number of other projects, too :)

I'm working with a lot of industry docs. Should I report this kind of thing as a matter of course? I'll try to determine fixes myself if I grok what's happened in this case.

JorjMcKie · 2024-01-14T09:25:40Z

Thank you for testing and confirmation!

> I'm working with a lot of industry docs. Should I report this kind of thing as a matter of course? I'll try to determine fixes myself if I grok what's happened in this case.

That would be great 😉!

I did indeed make other changes. For example, I now identify areas that contain connected vector graphics elements and treat the rectangle hull of each area as additional, "virtual" lines. This makes more tables detectable, like this kind of thing:

Many table detectors fail because of missing left and right cell borders.
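
The idea of hull rectangles as "virtual" lines can be illustrated roughly as follows. This is only a sketch of the general technique under simple assumptions (rectangles as (x0, y0, x1, y1) tuples, touching counts as connected); the function names are hypothetical and this is not the code in table.py:

```python
def intersects(a, b):
    """True if rectangles a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def hull(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_hulls(rects):
    """Merge connected rectangles into hull rectangles."""
    hulls = []
    for rect in rects:
        merged = tuple(rect)
        changed = True
        while changed:  # repeat, because a grown hull may reach further hulls
            changed = False
            rest = []
            for h in hulls:
                if intersects(merged, h):
                    merged = hull(merged, h)
                    changed = True
                else:
                    rest.append(h)
            hulls = rest
        hulls.append(merged)
    return hulls


def virtual_lines(h):
    """The four border segments of a hull: top, bottom, left, right."""
    x0, y0, x1, y1 = h
    return [((x0, y0), (x1, y0)), ((x0, y1), (x1, y1)),
            ((x0, y0), (x0, y1)), ((x1, y0), (x1, y1))]


# Two horizontal rules joined by a middle divider, plus an unrelated element
rects = [(0, 0, 100, 0), (0, 20, 100, 20), (50, 0, 50, 20), (200, 200, 210, 210)]
hulls = cluster_hulls(rects)
print(hulls)
print(virtual_lines(hulls[0]))
```

In this example the table has no drawn left or right border, but the hull of its connected elements supplies both as virtual lines.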

julian-smith-artifex-com · 2024-01-15T12:06:15Z

Fixed in 1.23.13.

JorjMcKie added the bug label Jan 7, 2024

JorjMcKie self-assigned this Jan 7, 2024

JorjMcKie added the fix developed release schedule to be determined label Jan 13, 2024

julian-smith-artifex-com closed this as completed Jan 15, 2024
