@ Added
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.
@ Changed
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.
@ Fixed
- Fixed typo bug when `.rect_edges` is called before `.edges`
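To illustrate the `.draw_line` input change listed above, here is a minimal sketch of how a caller might adapt an old-style bounding box to the new point-pair form. The helper name `bbox_to_points` is hypothetical and not part of `pdfplumber`, and the sketch assumes the 4-tuple is ordered `(x0, top, x1, bottom)`.

```python
def bbox_to_points(bbox):
    """Convert a 4-tuple into a 2-tuple of 2-tuples.

    Hypothetical helper; assumes bbox is ordered (x0, top, x1, bottom).
    """
    x0, top, x1, bottom = bbox
    return ((x0, top), (x1, bottom))


# Old-style input: one flat bounding box describing a horizontal line.
old_style = (10, 20, 110, 20)

# New-style input: two (x, y) points, matching the shape of curve["points"].
new_style = bbox_to_points(old_style)
print(new_style)
```

The point-pair form can then be passed wherever `((x, y), (x, y))` is expected.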
README.md (+28, -6)
@@ -1,4 +1,4 @@
-# PDFPlumber `v0.5.1`
+# PDFPlumber `v0.5.2`
 
 Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.
 
@@ -102,6 +102,7 @@ Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to four
 - `.annos`, each representing a single annotation-text character.
 - `.lines`, each representing a single 1-dimensional line.
 - `.rects`, each representing a single 2-dimensional rectangle.
+- `.curves`, each representing a series of connected points.
 
 Each object is represented as a simple Python `dict`, with the following properties:
 
@@ -130,7 +131,7 @@ Each object is represented as a simple Python `dict`, with the following propert
 
 | Property | Description |
 |----------|-------------|
-|`page_number`| Page number on which this character was found.|
+|`page_number`| Page number on which this line was found.|
 |`height`| Height of line.|
 |`width`| Width of line.|
 |`x0`| Distance of left-side extremity from left side of page.|
@@ -147,7 +148,7 @@ Each object is represented as a simple Python `dict`, with the following propert
 
 | Property | Description |
 |----------|-------------|
-|`page_number`| Page number on which this character was found.|
+|`page_number`| Page number on which this rectangle was found.|
 |`height`| Height of rectangle.|
 |`width`| Width of rectangle.|
 |`x0`| Distance of left side of rectangle from left side of page.|
@@ -160,6 +161,24 @@ Each object is represented as a simple Python `dict`, with the following propert
 |`linewidth`| Thickness of line.|
 |`object_type`| "rect"|
 
+#### `curve` properties
+
+| Property | Description |
+|----------|-------------|
+|`page_number`| Page number on which this curve was found.|
+|`points`| Points — as a list of `(x, top)` tuples — describing the curve.|
+|`height`| Height of curve's bounding box.|
+|`width`| Width of curve's bounding box.|
+|`x0`| Distance of curve's left-most point from left side of page.|
+|`x1`| Distance of curve's right-most point from left side of the page.|
+|`y0`| Distance of curve's lowest point from bottom of page.|
+|`y1`| Distance of curve's highest point from bottom of page.|
+|`top`| Distance of curve's highest point from top of page.|
+|`bottom`| Distance of curve's lowest point from top of page.|
+|`doctop`| Distance of curve's highest point from top of document.|
+|`linewidth`| Thickness of line.|
+|`object_type`| "curve"|
+
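The bounding-box properties in the `curve` table can be related to `points` with a short sketch. This is an illustration of the definitions in the table, not the library's implementation; the function name `curve_bbox` and the `page_height` parameter are assumptions introduced here.

```python
def curve_bbox(points, page_height):
    """Derive bounding-box properties from a list of (x, top) tuples.

    Hypothetical helper; page_height is assumed to be in the same
    units as the point coordinates.
    """
    xs = [x for x, _ in points]
    tops = [top for _, top in points]
    x0, x1 = min(xs), max(xs)
    top, bottom = min(tops), max(tops)
    return {
        "x0": x0,                      # left-most point, from left of page
        "x1": x1,                      # right-most point, from left of page
        "top": top,                    # highest point, from top of page
        "bottom": bottom,              # lowest point, from top of page
        "y0": page_height - bottom,    # lowest point, from bottom of page
        "y1": page_height - top,       # highest point, from bottom of page
        "width": x1 - x0,
        "height": bottom - top,
    }


# Three points forming a shallow "V" turned upside down on a 100-unit page.
points = [(10, 50), (40, 20), (70, 50)]
box = curve_bbox(points, page_height=100)
print(box)
```

Note that `top`/`bottom` are measured downward from the top of the page, while `y0`/`y1` are measured upward from the bottom, which is why the sketch needs the page height.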
 Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to two derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines) and `.edges` (which combines `.rect_edges` with `.lines`).
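The relationship between `.rect_edges`, `.lines`, and `.edges` described above can be sketched on plain dicts. This is a simplified illustration under stated assumptions, not the library's code: `rect_to_edges` is a hypothetical name, and the dicts carry only a subset of the documented properties.

```python
def rect_to_edges(rect):
    """Decompose a rect-like dict into four line-like dicts.

    Hypothetical helper; only position and size keys are produced.
    """
    x0, x1 = rect["x0"], rect["x1"]
    top, bottom = rect["top"], rect["bottom"]

    def edge(ex0, etop, ex1, ebottom):
        return {
            "x0": ex0, "top": etop, "x1": ex1, "bottom": ebottom,
            "width": ex1 - ex0, "height": ebottom - etop,
        }

    return [
        edge(x0, top, x1, top),        # top edge
        edge(x0, bottom, x1, bottom),  # bottom edge
        edge(x0, top, x0, bottom),     # left edge
        edge(x1, top, x1, bottom),     # right edge
    ]


rects = [{"x0": 0, "x1": 10, "top": 0, "bottom": 5}]
lines = [{"x0": 0, "x1": 10, "top": 8, "bottom": 8, "width": 10, "height": 0}]

# Each rectangle contributes four edges; edges is rect edges plus lines.
rect_edges = [e for r in rects for e in rect_to_edges(r)]
edges = rect_edges + lines
print(len(rect_edges), len(edges))
```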
 
 ## Visual debugging
@@ -191,7 +210,7 @@ You can pass explicit coordinates or any `pdfplumber` PDF object (e.g., char, li
-|`im.draw_line(line, stroke={color}, stroke_width=1)`|`im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`-like object, or a 4-tuple bounding box.|
+|`im.draw_line(line, stroke={color}, stroke_width=1)`|`im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`).|
 |`im.draw_vline(location, stroke={color}, stroke_width=1)`|`im.draw_vlines(list_of_locations, **kwargs)`| Draws a vertical line at the x-coordinate indicated by `location`.|
 |`im.draw_hline(location, stroke={color}, stroke_width=1)`|`im.draw_hlines(list_of_locations, **kwargs)`| Draws a horizontal line at the y-coordinate indicated by `location`.|
 |`im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)`|`im.draw_rects(list_of_rects, **kwargs)`| Draws a rectangle from a `rect`, `char`, etc., or 4-tuple bounding box.|
@@ -243,7 +262,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
 "snap_tolerance": 3,
 "join_tolerance": 3,
 "edge_min_length": 3,
-"text_word_threshold": 3,
+"min_words_vertical": 3,
+"min_words_horizontal": 1,
 "keep_blank_chars": False,
 "text_tolerance": 3,
 "text_x_tolerance": None,
@@ -263,7 +283,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
 |`"snap_tolerance"`| Parallel lines within `snap_tolerance` pixels will be "snapped" to the same horizontal or vertical position.|
 |`"join_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
 |`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
-|`"text_word_threshold"`| When using the `text` strategy, at least `text_word_threshold` words must share the same alignment.|
+|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
+|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
 |`"keep_blank_chars"`| When using the `text` strategy, consider `" "` chars to be *parts* of words and not word-separators.|
 |`"text_tolerance"`, `"text_x_tolerance"`, `"text_y_tolerance"`| When the `text` strategy searches for words, it will expect the individual letters in each word to be no more than `text_tolerance` pixels apart.|
 |`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`| When combining edges into cells, orthogonal edges must be within `intersection_tolerance` pixels to be considered intersecting.|
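For callers updating their settings after this change, a small pre-flight check might look like the sketch below. The function `check_table_settings` is hypothetical and not part of `pdfplumber`; the default values (3 and 1) are taken from the diff above, and everything else is an assumption made for illustration.

```python
# Defaults for the two new keys, as shown in the diff above.
NEW_DEFAULTS = {"min_words_vertical": 3, "min_words_horizontal": 1}

# Key removed in this change.
REMOVED_KEYS = {"text_word_threshold"}


def check_table_settings(settings):
    """Reject the removed key and fill in defaults for the new ones.

    Hypothetical helper for user code; returns a new dict.
    """
    stale = REMOVED_KEYS & set(settings)
    if stale:
        raise ValueError(
            "Removed setting(s): %s; use min_words_vertical "
            "and min_words_horizontal instead" % ", ".join(sorted(stale))
        )
    merged = dict(NEW_DEFAULTS)
    merged.update(settings)
    return merged


settings = check_table_settings({
    "vertical_strategy": "text",
    "min_words_vertical": 2,
})
print(settings)
```

Splitting the threshold lets the vertical and horizontal `text` strategies be tuned independently, which is the point of the two separate keys.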
@@ -290,6 +311,7 @@ Both `vertical_strategy` and `horizontal_strategy` accept the following options:
 
 - [Using `extract_table` on a California Worker Adjustment and Retraining Notification (WARN) report](examples/notebooks/extract-table-ca-warn-report.ipynb). Demonstrates basic visual debugging and table extraction.
 - [Using `extract_table` on the FBI's National Instant Criminal Background Check System PDFs](examples/notebooks/extract-table-nics.ipynb). Demonstrates how to use visual debugging to find optimal table extraction settings. Also demonstrates `Page.crop(...)` and `Page.extract_text(...)`
+- [Inspecting and visualizing `curve` objects](examples/notebooks/ag-energy-roundup-curves.ipynb).