Commit 9df602b

Some fixes
* Extend the list of known bullet point Unicodes
* Fix typo for detecting a "quad" drawing
1 parent 4e0d6c6 commit 9df602b

4 files changed, with 34 additions and 18 deletions

docs/src/changes.rst

Lines changed: 15 additions & 0 deletions
@@ -4,6 +4,21 @@
 Change Log
 ===========================================================================

+Changes in version 0.0.11
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `90 <https://github.com/pymupdf/RAG/issues/90>`_ "'Quad' object has no attribute 'tl'"
+* `88 <https://github.com/pymupdf/RAG/issues/88>`_ "Bug in is_significant function"
+
+
+Improvements:
+~~~~~~~~~~~~~~
+* Extended the list of known bullet point characters.
+
+
 Changes in version 0.0.10
 --------------------------

pymupdf4llm/README.md

Lines changed: 4 additions & 4 deletions
@@ -33,15 +33,15 @@ pathlib.Path("output.md").write_bytes(md_text.encode())

 Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, the parameter `pages=[...]` can be used to provide a list of zero-based page numbers to consider.

-**New features as of v0.0.2:**
+**Feature Overview:**

 * Support for pages with **_multiple text columns_**.
 * Support for **_image and vector graphics extraction_**:

     1. Specify `pymupdf4llm.to_markdown("input.pdf", write_images=True)`. Default is `False`.
-    2. Each image or vector graphic on the page will be extracted and stored as a PNG image named `"input.pdf-pno-index.png"` in the folder of `"input.pdf"`. Where `pno` is the 0-based page number and `index` is some sequence number.
-    3. The image files will have width and height equal to the values on the page.
-    4. Any text contained in the images or graphics will not be extracted, but become visible as image parts.
+    2. Each image or vector graphic on the page will be extracted and stored as an image named `"input.pdf-pno-index.extension"` in a folder of your choice. The image `extension` can be chosen to represent a PyMuPDF-supported image format (for instance "png" or "jpg"), `pno` is the 0-based page number and `index` is some sequence number.
+    3. The image files will have width and height equal to the values on the page. The desired resolution can be chosen via parameter `dpi` (default: `dpi=150`).
+    4. Any text contained in the images or graphics will be extracted and **also become visible as part of the generated image**. This behavior can be changed via `force_text=False` (text only appears as part of the image).

 * Support for **page chunks**: Instead of returning one large string for the whole document, a list of dictionaries can be generated: one for each page. Specify `data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`. Then, for instance the first item, `data[0]` will contain a dictionary for the first page with the text and some metadata.
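
Note on the image naming pattern: the revised README item 2 describes names of the form `"input.pdf-pno-index.extension"`. The stand-alone sketch below only illustrates that pattern. `image_name` is a hypothetical helper written for this note, not a pymupdf4llm function, and the exact string the library produces may differ in detail.

```python
from pathlib import Path


def image_name(pdf_path, pno, index, extension="png"):
    """Build a name following the README pattern "input.pdf-pno-index.extension".

    Hypothetical helper for illustration only; not part of pymupdf4llm.
    pno is the 0-based page number, index a sequence number on that page.
    """
    return f"{Path(pdf_path).name}-{pno}-{index}.{extension}"


# Fourth graphic (index 3) on the first page (pno 0), stored as JPEG.
name = image_name("docs/input.pdf", 0, 3, "jpg")
```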

pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py

Lines changed: 11 additions & 10 deletions
@@ -69,7 +69,9 @@ def sanitize_spans(line):
         Returns:
             A list of sorted, and potentially cleaned-up spans
         """
-        line.sort(key=lambda s: s["bbox"].x0)  # sort left to right
+        # sort ascending horizontally
+        line.sort(key=lambda s: s["bbox"].x0)
+        # join spans, delete duplicates
         for i in range(len(line) - 1, 0, -1):  # iterate back to front
             s0 = line[i - 1]
             s1 = line[i]
@@ -78,13 +80,17 @@ def sanitize_spans(line):
             delta = s1["size"] * 0.1
             if s0["bbox"].x1 + delta < s1["bbox"].x0:
                 continue  # all good: no joining needed
+
+            # We need to join bbox and text of two consecutive spans
+            # On occasion, spans may also be duplicated.
+            if s0["text"] != s1["text"] or s0["bbox"] != s1["bbox"]:
+                s0["text"] += s1["text"]
             s0["bbox"] |= s1["bbox"]  # join boundary boxes
-            s0["text"] += s1["text"]  # join the text
             del line[i]  # delete the joined-in span
             line[i - 1] = s0  # update the span
         return line

-    if clip is None:  # use TextPage if not provided
+    if clip is None:  # use TextPage rect if not provided
         clip = textpage.rect
     # extract text blocks - if bbox is not empty
     blocks = [
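
Note on the duplicate-span handling: the change to `sanitize_spans` appends the text of the right-hand span only when it is not an exact duplicate (same text and same bbox) of its left neighbor. The sketch below replays that logic on plain data. It assumes bboxes are `(x0, y0, x1, y1)` tuples instead of PyMuPDF rectangles, and `union` and `join_spans` are hypothetical names introduced for this note.

```python
def union(a, b):
    """Smallest (x0, y0, x1, y1) rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def join_spans(line):
    """Sort spans left to right, then merge neighbors that touch or overlap."""
    line.sort(key=lambda s: s["bbox"][0])
    for i in range(len(line) - 1, 0, -1):  # back to front, so deleting is safe
        s0, s1 = line[i - 1], line[i]
        delta = s1["size"] * 0.1  # gap tolerance: 10% of the font size
        if s0["bbox"][2] + delta < s1["bbox"][0]:
            continue  # clearly separated: keep both spans
        # Append the text unless s1 is an exact duplicate of s0.
        if s0["text"] != s1["text"] or s0["bbox"] != s1["bbox"]:
            s0["text"] += s1["text"]
        s0["bbox"] = union(s0["bbox"], s1["bbox"])
        del line[i]  # s1 is now merged into s0
    return line


spans = [
    {"text": "lo", "bbox": (20, 0, 30, 10), "size": 10},
    {"text": "Hel", "bbox": (0, 0, 20, 10), "size": 10},
    {"text": "lo", "bbox": (20, 0, 30, 10), "size": 10},  # exact duplicate
    {"text": "world", "bbox": (40, 0, 70, 10), "size": 10},
]
result = join_spans(spans)
```

With the pre-commit logic, which always appended the text, the duplicated span in this input would have produced "Hellolo" for the first merged span.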
@@ -126,10 +132,7 @@ def sanitize_spans(line):
         sbbox = s["bbox"]  # this bbox
         sbbox0 = line[-1]["bbox"]  # previous bbox
         # if any of top or bottom coordinates are close enough, join...
-        if (
-            abs(sbbox.y1 - sbbox0.y1) <= y_delta
-            or abs(sbbox.y0 - sbbox0.y0) <= y_delta
-        ):
+        if abs(sbbox.y1 - sbbox0.y1) <= y_delta or abs(sbbox.y0 - sbbox0.y0) <= y_delta:
             line.append(s)  # append to this line
             lrect |= sbbox  # extend line rectangle
             continue
@@ -150,9 +153,7 @@ def sanitize_spans(line):
     return nlines


-def get_text_lines(
-    page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False
-):
+def get_text_lines(page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False):
     """Extract text by line keeping natural reading sequence.

     Notes:

pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py

Lines changed: 4 additions & 4 deletions
@@ -40,15 +40,15 @@
 if fitz.pymupdf_version_tuple < (1, 24, 2):
     raise NotImplementedError("PyMuPDF version 1.24.2 or later is needed.")

-bullet = (
+bullet = [
     "- ",
     "* ",
     chr(0xF0A7),
     chr(0xF0B7),
     chr(0xB7),
     chr(8226),
-    chr(9679),
-)
+] + list(map(chr, range(9642, 9680)))
+
 GRAPHICS_TEXT = "\n![](%s)\n"

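
Note on the extended bullet list: `range(9642, 9680)` adds the 38 code points U+25AA through U+25CF, which still include the previously listed `chr(9679)`. The sketch below rebuilds the list as in the added lines and inspects it. `starts_with_bullet` is a hypothetical helper for this note. It converts the list to a tuple because `str.startswith` accepts a string or a tuple of strings, not a list; how the library itself consumes `bullet` is not shown in this diff.

```python
import unicodedata

# Same construction as the added lines in the diff.
bullet = [
    "- ",
    "* ",
    chr(0xF0A7),
    chr(0xF0B7),
    chr(0xB7),
    chr(8226),
] + list(map(chr, range(9642, 9680)))

# Official Unicode names of the first and last character of the new range.
first_name = unicodedata.name(chr(9642))
last_name = unicodedata.name(chr(9679))


def starts_with_bullet(text, prefixes=tuple(bullet)):
    """Return True if text begins with one of the known bullet prefixes."""
    return text.startswith(prefixes)
```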

@@ -193,7 +193,7 @@ def is_significant(box, paths):
         for itm in p["items"]:
             if itm[0] in ("l", "c"):  # line or curve
                 points.extend(itm[1:])  # append all the points
-            elif itm[0] == "q":  # quad
+            elif itm[0] == "qu":  # quad
                 q = itm[1]
                 # follow corners anti-clockwise
                 points.extend([q.ul, q.ll, q.lr, q.ur, q.ul])
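
Note on the `"qu"` fix: PyMuPDF drawing items tag quads as `"qu"`, so the old comparison against `"q"` never matched and quad corners were never collected. The sketch below reproduces the point-gathering loop on stand-in data. `Point` and `Quad` are namedtuple substitutes for the PyMuPDF classes, and `collect_points` is a hypothetical name for this note.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")  # stand-in for the PyMuPDF point class
Quad = namedtuple("Quad", "ul ur ll lr")  # stand-in for the PyMuPDF quad class


def collect_points(items, quad_tag="qu"):
    """Gather the points of line, curve and quad items, as in the diff."""
    points = []
    for itm in items:
        if itm[0] in ("l", "c"):  # line or curve
            points.extend(itm[1:])  # append all the points
        elif itm[0] == quad_tag:  # quad
            q = itm[1]
            # follow corners anti-clockwise and close the outline
            points.extend([q.ul, q.ll, q.lr, q.ur, q.ul])
    return points


quad = Quad(Point(0, 0), Point(10, 0), Point(0, 10), Point(10, 10))
items = [("l", Point(0, 0), Point(5, 5)), ("qu", quad)]

points_fixed = collect_points(items)  # compares against "qu"
points_old = collect_points(items, quad_tag="q")  # pre-commit comparison
```

In this example the corrected comparison yields seven points (two from the line, five from the closed quad outline), while the old one yields only the two line points.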
