Ignoring Tables in Find_Text Option #2908
Comments
The easiest (but less accurate) way is probably option 2. We might consider supporting something like an |
Here is another, much simpler approach:
for tab in page.find_tables().tables:
    # process the content of table 'tab'
    page.add_redact_annot(tab.bbox)  # wrap table in a redaction annotation
page.apply_redactions()  # erase all table text
# do text searches and text extractions here |
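If modifying the page through redactions is not desirable, a non-destructive variant is to filter text blocks by position instead. The following is only a sketch of that idea, not something proposed in the thread: the helper name `blocks_outside_tables` and the sample data are hypothetical, and the block tuples are assumed to have the shape returned by `page.get_text("blocks")`.

```python
def blocks_outside_tables(blocks, table_bboxes):
    """Return the text of all text blocks that do not overlap any table.

    blocks: tuples shaped like page.get_text("blocks") output,
            (x0, y0, x1, y1, text, block_no, block_type).
    table_bboxes: (x0, y0, x1, y1) tuples, e.g. tab.bbox of each table.
    """
    def overlaps(a, b):
        # two rectangles overlap if they intersect on both axes
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    kept = []
    for block in blocks:
        if block[6] != 0:  # block_type 0 is text, skip everything else
            continue
        if any(overlaps(block[:4], bbox) for bbox in table_bboxes):
            continue  # block lies (partly) inside a table
        kept.append(block[4])
    return kept


# Made-up sample data standing in for real extraction results
sample_blocks = [
    (50, 50, 500, 70, "Intro paragraph\n", 0, 0),
    (60, 110, 200, 130, "cell A1\n", 1, 0),
    (50, 320, 500, 340, "Closing paragraph\n", 2, 0),
]
sample_tables = [(50, 100, 500, 300)]
outside = blocks_outside_tables(sample_blocks, sample_tables)
```

With a real page, the inputs would presumably be `page.get_text("blocks")` and `[tab.bbox for tab in page.find_tables().tables]`. Unlike the redaction approach, this leaves the page content untouched, but a block that straddles a table border is dropped as a whole.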
How can you extract text and tables in the order that they show? I basically want to extract text up to the table, then extract the table, and then extract the text after the table. |
This sounds like we can assume that there is only one table and it covers the full page width: no text to the left or right side of the table. Well, this is a simple situation. Something like this:
import fitz  # PyMuPDF
page = doc[pno]  # load the page
tab = page.find_tables().tables[0]  # find table
tab_bbox = fitz.Rect(tab.bbox)  # the table's bbox as a rectangle
top_bbox = fitz.Rect(page.rect)  # copy of the page rectangle
top_bbox.y1 = tab_bbox.y0  # shorten from below
btm_bbox = fitz.Rect(page.rect)  # another copy of the page rectangle
btm_bbox.y0 = tab_bbox.y1  # shorten from above
# we now have all 3 rectangles
top_text = page.get_text(clip=top_bbox)  # plain text above table
tab_content = tab.extract()  # list of table cell contents
btm_text = page.get_text(clip=btm_bbox)  # plain text below table |
@JorjMcKie Sorry, I was just using one table as an example. There will be an unknown number of tables in the documents. I am basically creating a PDF to HTML converter which will have a range of different PDF files inputted into it. |
I was suspecting this. In a more general situation, things are of course a lot more complex:
To cover this, you must use the same type of information that was mentioned above. But the logic is something that you must develop yourself:
|
@JorjMcKie There are likely to be multiple tables on one page, but there won't be any text to the left or the right of the tables; it will only be above or below. Do you have an example code snippet for this to get started? Thank you. |
No, but it should be obvious how to do that.
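For readers who want a starting point anyway, here is a minimal sketch of one way to generalize the single-table case. It is not code from the maintainer. It assumes, as stated in the question, that every table spans the full width and that tables do not overlap vertically. The helper name `vertical_bands` and the sample coordinates are hypothetical.

```python
def vertical_bands(page_rect, table_bboxes):
    """Split a page into alternating text and table bands, top to bottom.

    page_rect and each table bbox are (x0, y0, x1, y1) tuples.
    Returns a list of (kind, table_index, bbox), where kind is "text"
    or "table" and table_index is None for text bands.
    """
    x0, y0, x1, y1 = page_rect
    bands = []
    cursor = y0  # bottom edge of what has been covered so far
    # visit tables by their top edge, remembering the original index
    order = sorted(range(len(table_bboxes)), key=lambda i: table_bboxes[i][1])
    for i in order:
        tx0, ty0, tx1, ty1 = table_bboxes[i]
        if ty0 > cursor:  # there is room for text above this table
            bands.append(("text", None, (x0, cursor, x1, ty0)))
        bands.append(("table", i, (tx0, ty0, tx1, ty1)))
        cursor = max(cursor, ty1)
    if cursor < y1:  # remaining text below the last table
        bands.append(("text", None, (x0, cursor, x1, y1)))
    return bands


# Made-up coordinates: an A4-sized page with two tables, given out of order
bands = vertical_bands((0, 0, 595, 842),
                       [(50, 400, 545, 500), (50, 100, 545, 200)])
```

With a real page, one would presumably pass `tuple(page.rect)` and the `tab.bbox` values, then walk the bands in order: `page.get_text(clip=bbox)` for a text band and `tabs[i].extract()` for a table band, where `tabs = page.find_tables().tables`.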
|
@JorjMcKie How do I also extract the font sizes of the text while doing this? |
You need to replace plain text extraction with an extraction variant that also returns font details. If you also need the font size used inside the tables, you must do the same for each table cell, using the cell bbox as clip. |
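The exact extraction call named in this reply did not survive on this page. As an assumption, the sketch below works on the nested structure that `page.get_text("dict", clip=...)` returns (blocks, then lines, then spans, each span carrying `"text"`, `"size"` and `"font"`). The helper name `spans_with_sizes` and the sample dictionary are hypothetical.

```python
def spans_with_sizes(text_dict):
    """Flatten a get_text("dict")-style result into (text, size, font) tuples."""
    result = []
    for block in text_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is a text block, 1 an image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                result.append((span["text"], span["size"], span["font"]))
    return result


# Made-up sample mimicking the dict layout
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Title", "size": 18.0, "font": "Helvetica-Bold"}]}]},
        {"type": 1},  # an image block, ignored here
        {"type": 0, "lines": [{"spans": [
            {"text": "Body text", "size": 11.0, "font": "Helvetica"}]}]},
    ]
}
spans = spans_with_sizes(sample)
```

For text inside a table, the same helper could be applied to the dict extracted with each cell's bbox as the `clip` argument.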
@JorjMcKie Thank you for your help. Do you know what the best method is for identifying headings in the PDF so that I can parse them into HTML tags ( |
Yes - you are on the generally accepted track. There is no information inside the PDF itself about these things. |
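The heuristic proposed in the question is cut off on this page. A common approach, assumed here purely for illustration, is to treat the most frequent font size as body text and map larger sizes to heading levels. The helper name `heading_tags`, the `max_levels` parameter and the sample spans are hypothetical.

```python
from collections import Counter


def heading_tags(spans, max_levels=3):
    """Map font sizes larger than the body size to "h1", "h2", ...

    spans: iterable of (text, size) pairs.
    The body size is the size that covers the most characters.
    """
    weights = Counter()
    for text, size in spans:
        weights[round(size, 1)] += len(text)
    if not weights:
        return {}
    body_size = weights.most_common(1)[0][0]
    # larger sizes first, so the biggest font becomes h1
    larger = sorted((s for s in weights if s > body_size), reverse=True)
    return {size: f"h{level}"
            for level, size in enumerate(larger[:max_levels], start=1)}


# Made-up sample spans
sample_spans = [
    ("Report", 20.0),
    ("Overview", 14.0),
    ("Plain paragraph text " * 5, 11.0),
    ("Details", 14.0),
]
tags = heading_tags(sample_spans)
```

Since the PDF itself carries no heading information, any such mapping remains a guess. Bold font names or text position could be added as further signals.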
@JorjMcKie In regards to extracting the tables, is it possible to use a similar technique to extract images in the order that they show by extracting the text above and below the image? |
Sure: |
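The code that followed this reply is missing here. As a rough sketch only, one way to interleave images with text and tables is to collect a bounding box for every element and sort by position. The helper name `reading_order` and the sample boxes are hypothetical. With PyMuPDF, the image boxes would presumably come from the `"bbox"` entries of `page.get_image_info()`.

```python
def reading_order(elements):
    """Sort (kind, bbox) pairs top to bottom, then left to right.

    bbox is an (x0, y0, x1, y1) tuple. This assumes a single-column
    layout where elements do not sit side by side.
    """
    return sorted(elements, key=lambda e: (e[1][1], e[1][0]))


# Made-up sample boxes in arbitrary order
elements = [
    ("image", (50, 300, 300, 450)),
    ("text", (50, 50, 500, 280)),
    ("table", (50, 470, 500, 600)),
    ("text", (50, 620, 500, 700)),
]
ordered = [kind for kind, _ in reading_order(elements)]
```

Each element can then be emitted in turn: text via a clipped `page.get_text()`, tables via `tab.extract()`, and images via whatever image extraction the converter uses.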
Is your feature request related to a problem? Please describe.
Ever since the find_tables method was added, I have been wondering whether there is a way to exclude these tables from the find_text function, so that we could choose to include or exclude tables from the raw text.
Describe the solution you'd like
A method that grabs text while excluding table information.
Describe alternatives you've considered
I've tried to just use the find_text option in PyMuPDF and then grab parsed tables from the Adobe Extract API, since the API seems to capture multi-cell tables better. The issue is that the Adobe Extract API grabs raw text very poorly.