Commit b44f2dc

v0.5.2
@ Added
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.

@ Changed
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.

@ Fixed
- Fixed typo bug when `.rect_edges` is called before `.edges`
1 parent 6d2d010 commit b44f2dc

11 files changed, +406 -44 lines changed

CHANGELOG.md

+14
@@ -4,6 +4,20 @@ All notable changes to this project will be documented in this file. Currently g

 The format is based on [Keep a Changelog](http://keepachangelog.com/).

+## [0.5.2] — 2017-02-27
+### Added
+- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
+- Ability for `.draw_line` to draw `curve` points.
+
+### Changed
+- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
+- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
+- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
+- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.
+
+### Fixed
+- Fixed typo bug when `.rect_edges` is called before `.edges`
+
 ## [0.5.1] — 2017-02-26
 ### Added
 - Quick-draw `PageImage` methods: `.draw_vline`, `.draw_vlines`, `.draw_hline`, and `.draw_hlines`.

README.md

+28-6
@@ -1,4 +1,4 @@
-# PDFPlumber `v0.5.1`
+# PDFPlumber `v0.5.2`

 Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

@@ -102,6 +102,7 @@ Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to four
 - `.annos`, each representing a single annotation-text character.
 - `.lines`, each representing a single 1-dimensional line.
 - `.rects`, each representing a single 2-dimensional rectangle.
+- `.curves`, each representing a series of connected points.

 Each object is represented as a simple Python `dict`, with the following properties:

@@ -130,7 +131,7 @@ Each object is represented as a simple Python `dict`, with the following propert

 | Property | Description |
 |----------|-------------|
-|`page_number`| Page number on which this character was found.|
+|`page_number`| Page number on which this line was found.|
 |`height`| Height of line.|
 |`width`| Width of line.|
 |`x0`| Distance of left-side extremity from left side of page.|
@@ -147,7 +148,7 @@ Each object is represented as a simple Python `dict`, with the following propert

 | Property | Description |
 |----------|-------------|
-|`page_number`| Page number on which this character was found.|
+|`page_number`| Page number on which this rectangle was found.|
 |`height`| Height of rectangle.|
 |`width`| Width of rectangle.|
 |`x0`| Distance of left side of rectangle from left side of page.|
@@ -160,6 +161,24 @@ Each object is represented as a simple Python `dict`, with the following propert
 |`linewidth`| Thickness of line.|
 |`object_type`| "rect"|

+#### `curve` properties
+
+| Property | Description |
+|----------|-------------|
+|`page_number`| Page number on which this curve was found.|
+|`points`| Points — as a list of `(x, top)` tuples — describing the curve.|
+|`height`| Height of curve's bounding box.|
+|`width`| Width of curve's bounding box.|
+|`x0`| Distance of curve's left-most point from left side of page.|
+|`x1`| Distance of curve's right-most point from left side of the page.|
+|`y0`| Distance of curve's lowest point from bottom of page.|
+|`y1`| Distance of curve's highest point from bottom of page.|
+|`top`| Distance of curve's highest point from top of page.|
+|`bottom`| Distance of curve's lowest point from top of page.|
+|`doctop`| Distance of curve's highest point from top of document.|
+|`linewidth`| Thickness of line.|
+|`object_type`| "curve"|
+
 Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to two derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines) and `.edges` (which combines `.rect_edges` with `.lines`).

 ## Visual debugging
@@ -191,7 +210,7 @@ You can pass explicit coordinates or any `pdfplumber` PDF object (e.g., char, li

 | Single-object method | Bulk method | Description |
 |----------------------|-------------|-------------|
-|`im.draw_line(line, stroke={color}, stroke_width=1)`| `im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`-like object, or a 4-tuple bounding box.|
+|`im.draw_line(line, stroke={color}, stroke_width=1)`| `im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`).|
 |`im.draw_vline(location, stroke={color}, stroke_width=1)`| `im.draw_vlines(list_of_locations, **kwargs)`| Draws a vertical line at the x-coordinate indicated by `location`.|
 |`im.draw_hline(location, stroke={color}, stroke_width=1)`| `im.draw_hlines(list_of_locations, **kwargs)`| Draws a horizontal line at the y-coordinate indicated by `location`.|
 |`im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)`| `im.draw_rects(list_of_rects, **kwargs)`| Draws a rectangle from a `rect`, `char`, etc., or 4-tuple bounding box.|
@@ -243,7 +262,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
     "snap_tolerance": 3,
     "join_tolerance": 3,
     "edge_min_length": 3,
-    "text_word_threshold": 3,
+    "min_words_vertical": 3,
+    "min_words_horizontal": 1,
     "keep_blank_chars": False,
     "text_tolerance": 3,
     "text_x_tolerance": None,
@@ -263,7 +283,8 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r
 |`"snap_tolerance"`| Parallel lines within `snap_tolerance` pixels will be "snapped" to the same horizontal or vertical position.|
 |`"join_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
 |`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
-|`"text_word_threshold"`| When using the `text` strategy, at least `text_word_threshold` words must share the same alignment.|
+|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
+|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
 |`"keep_blank_chars"`| When using the `text` strategy, consider `" "` chars to be *parts* of words and not word-separators.|
 |`"text_tolerance"`, `"text_x_tolerance"`, `"text_y_tolerance"`| When the `text` strategy searches for words, it will expect the individual letters in each word to be no more than `text_tolerance` pixels apart.|
 |`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`| When combining edges into cells, orthogonal edges most be within `intersection_tolerance` pixels to be considered intersecting.|
@@ -290,6 +311,7 @@ Both `vertical_strategy` and `horizontal_strategy` accept the following options:

 - [Using `extract_table` on a California Worker Adjustment and Retraining Notification (WARN) report](examples/notebooks/extract-table-ca-warn-report.ipynb). Demonstrates basic visual debugging and table extraction.
 - [Using `extract_table` on the FBI's National Instant Criminal Background Check System PDFs](examples/notebooks/extract-table-nics.ipynb). Demonstrates how to use visual debugging to find optimal table extraction settings. Also demonstrates `Page.crop(...)` and `Page.extract_text(...)`
+- [Inspecting and visualizing `curve` objects](examples/notebooks/ag-energy-roundup-curves.ipynb).

 ## Acknowledgments / Contributors
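The `curve` property table in this README diff relates a curve's `points` to its bounding-box fields. The following is a minimal sketch of that relationship, assuming `points` is a list of `(x, top)` tuples as documented. `curve_bbox_from_points` is a hypothetical helper written for illustration, not part of `pdfplumber`, and the sample points are invented.

```python
from decimal import Decimal


def curve_bbox_from_points(points):
    """Hypothetical helper: derive bounding-box fields from (x, top) tuples."""
    xs = [x for x, _ in points]
    tops = [top for _, top in points]
    x0, x1 = min(xs), max(xs)
    # `top` is measured from the top of the page, so the smallest value
    # is the curve's highest point and the largest is its lowest point.
    top, bottom = min(tops), max(tops)
    return {
        "x0": x0,
        "x1": x1,
        "top": top,
        "bottom": bottom,
        "width": x1 - x0,
        "height": bottom - top,
    }


# Invented sample data standing in for page.curves[0]["points"].
sample_points = [
    (Decimal("10"), Decimal("50")),
    (Decimal("30"), Decimal("20")),
    (Decimal("60"), Decimal("45")),
]
bbox = curve_bbox_from_points(sample_points)
```

With a real page, the same function could be fed `page.curves[0]["points"]`; whether the result matches the curve's stored fields exactly depends on how `pdfminer` reports the curve, which this sketch does not verify.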

examples/notebooks/ag-energy-roundup-curves.ipynb

+277
Large diffs are not rendered by default.
49.7 KB
Binary file not shown.

pdfplumber/_version.py

+1-1
@@ -1,2 +1,2 @@
-version_info = (0, 5, 1)
+version_info = (0, 5, 2)
 __version__ = '.'.join(map(str, version_info))

pdfplumber/container.py

+1-1
@@ -40,7 +40,7 @@ def annos(self):

     @property
     def rect_edges(self):
-        if hasattr(self, "_rect_edges"): return self._edges
+        if hasattr(self, "_rect_edges"): return self._rect_edges
         rect_edges_gen = (utils.rect_to_edges(r) for r in self.rects)
         self._rect_edges = list(chain(*rect_edges_gen))
         return self._rect_edges
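The one-line fix above concerns which cached attribute the property returns. As a rough illustration of why the old line failed, here is a toy reproduction; `Container` and `BuggyContainer` are stand-ins invented for this sketch, not the real `pdfplumber` classes, and the edge values are placeholders.

```python
class Container:
    """Toy stand-in that caches rect_edges the way the fixed code does."""

    def __init__(self, rects):
        self.rects = rects

    @property
    def rect_edges(self):
        # Fixed behaviour: return the same attribute that was cached.
        if hasattr(self, "_rect_edges"):
            return self._rect_edges
        self._rect_edges = [("edge-of", r) for r in self.rects]
        return self._rect_edges


class BuggyContainer(Container):
    """Mimics the pre-fix typo: checks one attribute, returns another."""

    @property
    def rect_edges(self):
        if hasattr(self, "_rect_edges"):
            return self._edges  # never set if .edges was not accessed first
        self._rect_edges = [("edge-of", r) for r in self.rects]
        return self._rect_edges


fixed = Container(["r1"])
first, second = fixed.rect_edges, fixed.rect_edges

buggy = BuggyContainer(["r1"])
buggy.rect_edges  # first access populates the cache
try:
    buggy.rect_edges  # second access reads the wrong attribute
    second_access_failed = False
except AttributeError:
    second_access_failed = True
```

In the toy version the second access raises `AttributeError`; in the real library the symptom presumably depended on whether `.edges` had already populated `_edges`, which matches the changelog's "called before `.edges`" wording.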

pdfplumber/display.py

+18-11
@@ -88,16 +88,18 @@ def reset(self):
     def copy(self):
         return self.__class__(self.page, self.original)

-    def draw_line(self, points_or_line,
+    def draw_line(self, points_or_obj,
         stroke=DEFAULT_STROKE,
         stroke_width=DEFAULT_STROKE_WIDTH):
-        if isinstance(points_or_line, (tuple, list)):
-            points = points_or_line
+        if isinstance(points_or_obj, (tuple, list)):
+            points = points_or_obj
+        elif type(points_or_obj) == dict and "points" in points_or_obj:
+            points = points_or_obj["points"]
         else:
-            obj = points_or_line
-            points = (obj["x0"], obj["top"], obj["x1"], obj["bottom"])
+            obj = points_or_obj
+            points = ((obj["x0"], obj["top"]), (obj["x1"], obj["bottom"]))
         self.draw.line(
-            self._reproject_bbox(points),
+            list(map(self._reproject, points)),
             fill=stroke,
             width=stroke_width
         )
@@ -165,10 +167,10 @@ def draw_rect(self, bbox_or_obj,

         if stroke_width > 0:
             segments = [
-                (x0, top, x1, top), # top
-                (x0, bottom, x1, bottom), # bottom
-                (x0, top, x0, bottom), # left
-                (x1, top, x1, bottom), # right
+                ((x0, top), (x1, top)), # top
+                ((x0, bottom), (x1, bottom)), # bottom
+                ((x0, top), (x0, bottom)), # left
+                ((x1, top), (x1, bottom)), # right
             ]
             self.draw_lines(
                 segments,
@@ -195,7 +197,12 @@ def draw_circle(self, center_or_obj,
                 (obj["top"] + obj["bottom"]) / 2
             )
         cx, cy = center
-        bbox = (cx - radius, cy - radius, cx + radius, cy + radius)
+        bbox = self.decimalize((
+            cx - radius,
+            cy - radius,
+            cx + radius,
+            cy + radius
+        ))
         self.draw.ellipse(
             self._reproject_bbox(bbox),
             fill,
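Because `.draw_line` now expects point pairs rather than a flat 4-tuple, callers that previously passed a bounding box need to reshape their input. A minimal sketch of that reshaping follows; `bbox_to_point_pair` and `bbox_to_segments` are hypothetical helpers invented for illustration and do not exist in `pdfplumber`.

```python
def bbox_to_point_pair(bbox):
    """Hypothetical helper: (x0, top, x1, bottom) -> ((x0, top), (x1, bottom))."""
    x0, top, x1, bottom = bbox
    return ((x0, top), (x1, bottom))


def bbox_to_segments(bbox):
    """Hypothetical helper: the four sides of a box, each as a point pair.

    Mirrors the segment layout used by draw_rect in the diff above.
    """
    x0, top, x1, bottom = bbox
    return [
        ((x0, top), (x1, top)),        # top
        ((x0, bottom), (x1, bottom)),  # bottom
        ((x0, top), (x0, bottom)),     # left
        ((x1, top), (x1, bottom)),     # right
    ]


old_style = (10, 20, 110, 70)
new_style = bbox_to_point_pair(old_style)
segments = bbox_to_segments(old_style)
```

Each element of `segments` has the `((x, y), (x, y))` shape the updated README describes, so a list like this could plausibly be handed to `im.draw_lines(...)`.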

pdfplumber/page.py

+36-11
@@ -21,22 +21,22 @@ def __init__(self, pdf, page_obj, page_number=None, initial_doctop=0):
         self.initial_doctop = self.decimalize(initial_doctop)

         cropbox = page_obj.attrs.get("CropBox", page_obj.attrs.get("MediaBox"))
-        self.cropbox = tuple(map(self.decimalize, cropbox))
+        self.cropbox = self.decimalize(cropbox)

         if self.rotation in [ 90, 270 ]:
-            self.bbox = tuple(map(self.decimalize, (
+            self.bbox = self.decimalize((
                 min(cropbox[1], cropbox[3]),
                 min(cropbox[0], cropbox[2]),
                 max(cropbox[1], cropbox[3]),
                 max(cropbox[0], cropbox[2]),
-            )))
+            ))
         else:
-            self.bbox = tuple(map(self.decimalize, (
+            self.bbox = self.decimalize((
                 min(cropbox[0], cropbox[2]),
                 min(cropbox[1], cropbox[3]),
                 max(cropbox[0], cropbox[2]),
                 max(cropbox[1], cropbox[3]),
-            )))
+            ))

     def decimalize(self, x):
         return utils.decimalize(x, self.pdf.precision)
@@ -69,11 +69,33 @@ def parse_objects(self):
         idc = self.initial_doctop
         pno = self.page_number

-        def process_object(obj):
+        def point2coord(pt):
+            x, y = pt
+            return (
+                d(x),
+                h - d(y)
+            )
+
+        IGNORE = [
+            "bbox",
+            "matrix",
+            "_text",
+            "_objs",
+            "groups",
+            "stream",
+            "colorspace",
+            "imagemask",
+            "pts",
+        ]
+
+        NON_DECIMALIZE = [
+            "fontname", "name", "upright",
+        ]

-            attr = dict((k, d(v)) for k, v in obj.__dict__.items()
-                if isinstance(v, (float, int, string_types))
-                and k[0] != "_")
+        def process_object(obj):
+            attr = dict((k, (v if k in NON_DECIMALIZE else d(v)))
+                for k, v in obj.__dict__.items()
+                    if k not in IGNORE)

             kind = re.sub(lt_pat, "", obj.__class__.__name__).lower()
             attr["object_type"] = kind
@@ -82,6 +104,9 @@ def process_object(obj):
             if hasattr(obj, "get_text"):
                 attr["text"] = obj.get_text()

+            if kind == "curve":
+                attr["points"] = list(map(point2coord, obj.pts))
+
             if attr.get("y0") != None:
                 attr["top"] = h - attr["y1"]
                 attr["bottom"] = h - attr["y0"]
@@ -145,7 +170,7 @@ def objects(self):
                 return self._objects

         cropped = CroppedPage(self)
-        cropped.bbox = tuple(map(self.decimalize, bbox))
+        cropped.bbox = self.decimalize(bbox)
         return cropped

     def within_bbox(self, bbox):
@@ -162,7 +187,7 @@ def objects(self):
                 return self._objects

         cropped = CroppedPage(self)
-        cropped.bbox = tuple(map(self.decimalize, bbox))
+        cropped.bbox = self.decimalize(bbox)
         return cropped

     def filter(self, test_function):
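The `point2coord` closure in this file converts `pdfminer`'s bottom-origin `(x, y)` points into the top-origin `(x, top)` tuples exposed as `curve["points"]`. Below is a standalone sketch of that flip, assuming a known page height. `make_point2coord` is a hypothetical wrapper invented here so the example runs outside the `Page` class, and the page height and points are made up.

```python
from decimal import Decimal


def make_point2coord(page_height):
    """Hypothetical factory: build a converter bound to one page height."""
    h = Decimal(repr(page_height))

    def point2coord(pt):
        x, y = pt
        # x is unchanged; y is flipped so that 0 is the top of the page.
        return (Decimal(repr(x)), h - Decimal(repr(y)))

    return point2coord


# Invented values: a 792-unit-tall page and two points measured from the bottom.
point2coord = make_point2coord(792)
raw_pts = [(72, 700.5), (144, 650)]
converted = list(map(point2coord, raw_pts))
```

A point near the top of the page in PDF coordinates (large `y`) ends up with a small `top`, which is the convention the rest of the object properties follow.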

pdfplumber/table.py

+9-8
@@ -4,6 +4,8 @@

 DEFAULT_SNAP_TOLERANCE = 3
 DEFAULT_JOIN_TOLERANCE = 3
+DEFAULT_MIN_WORDS_VERTICAL = 3
+DEFAULT_MIN_WORDS_HORIZONTAL = 1

 def move_to_avg(objs, orientation):
     """
@@ -87,7 +89,7 @@ def get_group(edge):
     return edges

 def words_to_edges_h(words,
-    word_threshold=3):
+    word_threshold=DEFAULT_MIN_WORDS_HORIZONTAL):
     """
     Find (imaginary) horizontal lines that connect the tops of at least `word_threshold` words.
     """
@@ -117,7 +119,7 @@ def words_to_edges_h(words,
     return edges

 def words_to_edges_v(words,
-    word_threshold=3):
+    word_threshold=DEFAULT_MIN_WORDS_VERTICAL):
     """
     Find (imaginary) vertical lines that connect the left, right, or center of at least `word_threshold` words.
     """
@@ -213,7 +215,7 @@ def intersections_to_cells(intersections):

     def edge_connects(p1, p2):
         def edges_to_set(edges):
-            return set(map(tuple, [ x.items() for x in edges ]))
+            return set(map(utils.obj_to_bbox, edges))

         if p1[0] == p2[0]:
             common = edges_to_set(intersections[p1]["v"])\
@@ -395,7 +397,8 @@ def char_in_bbox(char, bbox):
     "snap_tolerance": DEFAULT_SNAP_TOLERANCE,
     "join_tolerance": DEFAULT_JOIN_TOLERANCE,
     "edge_min_length": 3,
-    "text_word_threshold": 3,
+    "min_words_vertical": DEFAULT_MIN_WORDS_VERTICAL,
+    "min_words_horizontal": DEFAULT_MIN_WORDS_HORIZONTAL,
     "keep_blank_chars": False,
     "text_tolerance": 3,
     "text_x_tolerance": None,
@@ -505,7 +508,7 @@ def v_edge_desc_to_edge(desc):
                 edge_type="lines")
         elif v_strat == "text":
             v_base = words_to_edges_v(words,
-                word_threshold=settings["text_word_threshold"])
+                word_threshold=settings["min_words_vertical"])
         elif v_strat == "explicit":
             v_base = []

@@ -539,7 +542,7 @@ def h_edge_desc_to_edge(desc):
                 edge_type="lines")
         elif h_strat == "text":
             h_base = words_to_edges_h(words,
-                word_threshold=settings["text_word_threshold"])
+                word_threshold=settings["min_words_horizontal"])
         elif h_strat == "explicit":
             h_base = []

@@ -553,5 +556,3 @@ def h_edge_desc_to_edge(desc):
             )
         return utils.filter_edges(edges,
             min_length=settings["edge_min_length"])
-
-
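Since `"text_word_threshold"` is removed outright, settings dicts written for v0.5.1 would need their key renamed. One way a caller might adapt is sketched below, under the assumption that carrying the old value into both new keys best preserves earlier behaviour. `migrate_table_settings` is a hypothetical helper, not something this commit provides.

```python
def migrate_table_settings(old_settings):
    """Hypothetical helper: translate a v0.5.1-style settings dict.

    The old single threshold applied to both directions, so it is copied to
    both new keys unless the caller has already set them explicitly.
    """
    new_settings = dict(old_settings)  # leave the caller's dict untouched
    threshold = new_settings.pop("text_word_threshold", None)
    if threshold is not None:
        new_settings.setdefault("min_words_vertical", threshold)
        new_settings.setdefault("min_words_horizontal", threshold)
    return new_settings


legacy = {"vertical_strategy": "text", "text_word_threshold": 4}
migrated = migrate_table_settings(legacy)
```

Settings that never mentioned `"text_word_threshold"` pass through unchanged and pick up the library's new defaults (3 vertical, 1 horizontal) when used.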

pdfplumber/utils.py

+14-6
@@ -69,15 +69,24 @@ def decode_text(s):
     return ''.join(PDFDocEncoding[o] for o in ords)

 def decimalize(v, q=None):
-    if isinstance(v, numbers.Integral):
+    # If already a decimal, just return itself
+    if isinstance(v, Decimal):
+        return v
+    # If tuple/list passed, bulk-convert
+    elif isinstance(v, (tuple, list)):
+        return type(v)(decimalize(x, q) for x in v)
+    # Convert int-like
+    elif isinstance(v, numbers.Integral):
         return Decimal(int(v))
-    if isinstance(v, numbers.Real):
+    # Convert float-like
+    elif isinstance(v, numbers.Real):
         if q != None:
             return Decimal(repr(v)).quantize(Decimal(repr(q)),
                 rounding=ROUND_HALF_UP)
         else:
             return Decimal(repr(v))
-    return v
+    else:
+        raise ValueError("Cannot convert {0} to Decimal.".format(v))

 def is_dataframe(collection):
     cls = collection.__class__
@@ -117,8 +126,7 @@ def objects_to_bbox(objects):
         max(map(itemgetter("bottom"), objects)),
     )

-def rect_to_bbox(rect):
-    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])
+obj_to_bbox = itemgetter("x0", "top", "x1", "bottom")

 def bbox_to_rect(bbox):
     return {
@@ -267,7 +275,7 @@ def clip_obj(obj, bbox, score=None):
     return copy

 def n_points_intersecting_bbox(objs, bbox):
-    bbox = tuple(map(decimalize, bbox))
+    bbox = decimalize(bbox)
     objs = to_list(objs)
     scores = (obj_inside_bbox_score(obj, bbox) for obj in objs)
     return list(scores)
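To show how the reworked `decimalize` and the new `obj_to_bbox` behave, here is a self-contained sketch that restates both from the diff above so it can run without `pdfplumber` installed. It uses `is not None` in place of the diff's `!= None`, and the inputs are invented.

```python
import numbers
from decimal import Decimal, ROUND_HALF_UP
from operator import itemgetter


def decimalize(v, q=None):
    """Standalone restatement of the post-commit logic, for illustration."""
    if isinstance(v, Decimal):
        return v
    elif isinstance(v, (tuple, list)):
        # Recurse, preserving the container type (tuple stays tuple).
        return type(v)(decimalize(x, q) for x in v)
    elif isinstance(v, numbers.Integral):
        return Decimal(int(v))
    elif isinstance(v, numbers.Real):
        if q is not None:
            return Decimal(repr(v)).quantize(Decimal(repr(q)),
                                             rounding=ROUND_HALF_UP)
        return Decimal(repr(v))
    else:
        raise ValueError("Cannot convert {0} to Decimal.".format(v))


# itemgetter with several keys returns a tuple in the order requested.
obj_to_bbox = itemgetter("x0", "top", "x1", "bottom")

bulk = decimalize((1, 2.5))
rounded = decimalize([1.005], q=0.01)
bbox = obj_to_bbox({"x0": 1, "top": 2, "x1": 3, "bottom": 4, "text": "a"})

try:
    decimalize("not a number")
    rejected = False
except ValueError:
    rejected = True
```

The bulk branch is what lets call sites such as `self.decimalize(cropbox)` drop the `tuple(map(...))` wrapper, and the final `else` is the "throws errors on non-decimalizable items" behaviour from the changelog.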
