Ignoring Tables in Find_Text Option #2908

Closed
XariZaru opened this issue Dec 18, 2023 · 14 comments
Labels: enhancement, wontfix (no intention to resolve)

Comments

@XariZaru

Is your feature request related to a problem? Please describe.
Ever since the find_tables method was added, I have been wondering whether there is a way to exclude tables from the find_text function, so that we can choose whether to include table content in the raw text.

Describe the solution you'd like
A method that grabs text while excluding table information.

Describe alternatives you've considered
I've tried using the find_text option in PyMuPDF and then taking the parsed tables from the Adobe Extract API, since that API seems to capture multi-cell tables better. The issue is that the Adobe Extract API extracts raw text very poorly.

@JorjMcKie
Collaborator

  1. If you have the bbox of the only table on the page, you can easily determine the (up to four) rectangles that make up the rest of the page area.
    Then execute four searches, using each of these rectangles as the clip, and join the results.

  2. If multiple table bboxes are present, execute one search and post-process the resulting list of the search hits to exclude those intersecting any table.
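
A minimal sketch of option 1, assuming rectangles are plain `(x0, y0, x1, y1)` tuples and that the table lies inside the page. The helper name `complement_rects` is hypothetical and not part of PyMuPDF; it only computes the strips above, below, left and right of a single table.

```python
def complement_rects(page_rect, table_rect):
    """Return up to four (x0, y0, x1, y1) rectangles covering the page outside table_rect."""
    px0, py0, px1, py1 = page_rect
    tx0, ty0, tx1, ty1 = table_rect
    candidates = [
        (px0, py0, px1, ty0),  # full-width strip above the table
        (px0, ty1, px1, py1),  # full-width strip below the table
        (px0, ty0, tx0, ty1),  # strip to the left of the table
        (tx1, ty0, px1, ty1),  # strip to the right of the table
    ]
    # Drop empty strips, e.g. when the table touches a page edge.
    return [r for r in candidates if r[2] > r[0] and r[3] > r[1]]
```

With PyMuPDF, each returned rectangle `r` could then be passed as `page.search_for(needle, clip=fitz.Rect(r))` and the hit lists concatenated. A hit that straddles two strips may be cut or missed, so treat this as a starting point only.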

The easiest (but less accurate) way is probably option 2. We might consider supporting something like an ignore argument that accepts a list of rectangles to ignore.
But this is easy to do with a few lines of your own code.
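
A minimal sketch of option 2, assuming search hits and table bboxes are rectangle-like sequences `(x0, y0, x1, y1)`. The names `intersects` and `drop_table_hits` are hypothetical helpers, not PyMuPDF functions, and `ignore` only mimics the argument floated above.

```python
def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def drop_table_hits(hits, ignore):
    """Keep only the hits that intersect none of the rectangles in `ignore`."""
    return [h for h in hits if not any(intersects(h, t) for t in ignore)]
```

In PyMuPDF this might be fed with `hits = page.search_for(needle)` and `ignore = [tab.bbox for tab in page.find_tables().tables]`. A hit that only partly overlaps a table is dropped as well, which is one reason this approach is less accurate.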

JorjMcKie added the wontfix (no intention to resolve) label on Dec 23, 2023
@JorjMcKie
Collaborator

Here is another, much simpler approach:

  1. Find / process tables on the page
  2. Erase the text of all tables using redaction annotations
  3. Do all your text searches and extractions. Table text will no longer be included.

for tab in page.find_tables():
    # process the content of table 'tab'
    page.add_redact_annot(tab.bbox)  # wrap table in a redaction annotation

page.apply_redactions()  # erase all table text

# do text searches and text extractions here

@CaiSamuelsFSA

How can you extract text and tables in the order that they show?

I basically want to extract text up to the table, then extract the table, and then extract the text after the table.

@JorjMcKie
Collaborator

@CaiSamuelsFSA

This sounds like we can assume that there is only one table and it covers the full page width: no text to the left or right side of the table.

Well, this is a simple situation. First do tab = page.find_tables()[0] to identify the table.
Then extract the text above the table's y0 coordinate, then the table content, then the text below the table's y1 coordinate.

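import fitz  # PyMuPDF

# doc is an already opened fitz.Document, pno a valid page number
page = doc[pno]  # load the page
tab = page.find_tables()[0]  # find table
tab_bbox = fitz.Rect(tab.bbox)  # the table's bbox as a rectangle
top_bbox = fitz.Rect(page.rect)  # copy of the page rectangle
top_bbox.y1 = tab_bbox.y0  # shorten from below
btm_bbox = fitz.Rect(page.rect)  # second, independent copy
btm_bbox.y0 = tab_bbox.y1  # shorten from above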

# we now have all 3 rectangles
top_text = page.get_text(clip=top_bbox)  # plain text above table
tab_content = tab.extract()  # list of table cell contents
btm_text = page.get_text(clip=btm_bbox)  # plain text below table

@CaiSamuelsFSA

@JorjMcKie Sorry, I was just using one table as an example. There will be an unknown number of tables in the documents. I am basically creating a PDF to HTML converter which will have a range of different PDF files inputted into it.

@JorjMcKie
Collaborator

I suspected as much.

In a more general situation, things are of course a lot more complex:

  • there may be multiple tables per page
  • text outside tables may occur to the left or right or above or below any table

To cover this, you must use the same type of information as mentioned above, but you have to develop the logic yourself:

  • For each table tab0, ... tabn you have the respective table bbox bbox0, ... bboxn
  • You can extract text above / below all tables by looking at the minimum of all table y0 values and the maximum of all table y1 values.
  • To extract non-table text in between, you must look at page sub-rectangles to the left and the right of table bboxes and make sense of what a proper relative sequence of extracted text may be.

@CaiSamuelsFSA

@JorjMcKie There are likely to be multiple tables on one page, but there won't be any text to the left or right of them; text will only appear above or below. Do you have an example code snippet for this to get started? Thank you.

@JorjMcKie
Collaborator

> Do you have an example code snippet for this to get started?

No, but it should be obvious how to do that.

  • As mentioned, find all tables on page first.
  • You will then have a corresponding list of table bboxes, each with a top ("y0") and a bottom ("y1") coordinate.
  • Make a corresponding list of table-free rectangles on the page. The first such rectangle has top coordinate 0 and, as its bottom coordinate, y0 of the first table. The second such text rectangle receives y1 of table 1 as its top and y0 of table 2 as its bottom coordinate. And so on.
  • The last text rectangle has y1 of the last table as its top and page.rect.height as its bottom coordinate.
  • Then proceed analogously to my earlier post to extract the text portions: plain text, table 1, plain text, table 2, ..., plain text.
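
The bullet points above can be sketched roughly as follows. This assumes table bboxes are `(x0, y0, x1, y1)` tuples, that tables do not overlap vertically, and that no text sits beside a table. The function name `reading_order` is hypothetical and not part of PyMuPDF.

```python
def reading_order(table_bboxes, page_width, page_height):
    """Return ("text", rect) and ("table", index) items from top to bottom."""
    # Visit tables in vertical order while remembering their original index.
    order = sorted(range(len(table_bboxes)), key=lambda i: table_bboxes[i][1])
    items = []
    top = 0.0  # top edge of the next table-free band
    for i in order:
        x0, y0, x1, y1 = table_bboxes[i]
        if y0 > top:  # skip empty bands
            items.append(("text", (0.0, top, page_width, y0)))
        items.append(("table", i))
        top = max(top, y1)
    if page_height > top:
        items.append(("text", (0.0, top, page_width, page_height)))
    return items
```

For each `("text", rect)` item one would then call something like `page.get_text(clip=fitz.Rect(rect))`, and for each `("table", i)` item `tabs[i].extract()`, where `tabs = page.find_tables()`.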

@CaiSamuelsFSA

@JorjMcKie How do I also extract the font sizes of the text while doing this?

@JorjMcKie
Collaborator

> @JorjMcKie How do I also extract the font sizes of the text while doing this?

You need to replace the plain text extraction page.get_text("text", ...) (the default) with page.get_text("dict", clip=...), which returns a dictionary of nested dictionaries (blocks, lines, spans), described in the PyMuPDF documentation.
Each "span" dictionary contains the font size under the key "size".

If you also need the font sizes used inside the tables, you must do the same for each table cell, using the cell bbox as the clip.
The standard table text output is delivered via tab.extract(), which contains the plain text only, with no text metadata.
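
A minimal sketch of walking the "dict" output to collect font sizes. The helper name `spans_with_sizes` is hypothetical; it expects the dictionary returned by `page.get_text("dict", clip=...)`. The sample dictionary below is made up to illustrate the block / line / span nesting and is not real extraction output.

```python
def spans_with_sizes(page_dict):
    """Return (text, font size) pairs from a get_text("dict") result."""
    pairs = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                pairs.append((span["text"], span["size"]))
    return pairs


# Made-up sample with one text block and one image block.
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Heading", "size": 18.0},
                                          {"text": "body", "size": 11.0}]}]},
        {"type": 1},
    ]
}
pairs = spans_with_sizes(sample)
```

The same function could be applied per table cell by passing the cell bbox as the clip.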

@CaiSamuelsFSA

@JorjMcKie Thank you for your help. Do you know what the best method is for identifying headings in the PDF so that I can parse them into HTML tags (h1, h2 etc)? All I can think of is estimating headings based on the font size.

@JorjMcKie
Collaborator

> @JorjMcKie Thank you for your help. Do you know what the best method is for identifying headings in the PDF so that I can parse them into HTML tags (h1, h2 etc)? All I can think of is estimating headings based on the font size.

Yes - you are on the generally accepted track. There is no information inside the PDF itself about these things.
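
One possible font-size heuristic, as a rough sketch only. It assumes the body text is the size that carries the most characters, and that every larger size is a heading level. The function name `tag_by_font_size` and the `max_levels` parameter are hypothetical; real documents may need extra signals such as bold fonts or position.

```python
from collections import Counter


def tag_by_font_size(spans, max_levels=3):
    """Map (text, size) pairs to (tag, text) pairs, with tag in h1..hN or p."""
    if not spans:
        return []
    chars = Counter()
    for text, size in spans:
        chars[round(size, 1)] += len(text)  # weight each size by text length
    body = chars.most_common(1)[0][0]  # most used size is taken as body text
    bigger = sorted((s for s in chars if s > body), reverse=True)[:max_levels]
    level = {s: f"h{i}" for i, s in enumerate(bigger, start=1)}
    return [(level.get(round(size, 1), "p"), text) for text, size in spans]
```

Sizes are rounded to one decimal so that tiny differences do not create spurious heading levels.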

@CaiSamuelsFSA

@JorjMcKie With regard to extracting the tables, is it possible to use a similar technique to extract images in the order in which they appear, by extracting the text above and below each image?

@JorjMcKie
Collaborator

Sure:
There is Page.get_image_info, which returns each image's position on the page plus some further metadata.
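
A small sketch of putting tables and images into one top-to-bottom sequence, assuming all bboxes are `(x0, y0, x1, y1)` sequences. The helper name `page_order` is hypothetical. In PyMuPDF, the image boxes would presumably come from the "bbox" entry of each dictionary returned by `page.get_image_info()`.

```python
def page_order(table_bboxes, image_bboxes):
    """Return ("table" | "image", index, bbox) items sorted top to bottom."""
    items = [("table", i, tuple(b)) for i, b in enumerate(table_bboxes)]
    items += [("image", i, tuple(b)) for i, b in enumerate(image_bboxes)]
    # Sort by top edge first, then by left edge for items on the same row.
    return sorted(items, key=lambda item: (item[2][1], item[2][0]))
```

The sorted boxes can then be treated like the table bboxes discussed earlier in the thread: extract the text above the first item, between consecutive items, and below the last one.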

3 participants