Commit c3dae7b

TST: Add test for layout_mode_font_height_weight of PageObject.extract_text()
1 parent dad1788 commit c3dae7b

3 files changed: +85 -0 lines changed

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+The Crazy Ones
+October 14, 1998
+
+Heres to the crazy ones. The misfits. The rebels. The troublemakers.
+The round pegs in the square holes.
+The ones who see things differently. Theyre not fond of rules. And
+they have no respect for the status quo. You can quote them,
+disagree with them, glorify or vilify them.
+About the only thing you cant do is ignore them. Because they change
+things. They invent. They imagine. They heal. They explore. They
+create. They inspire. They push the human race forward.
+Maybe they have to be crazy.
+How else can you stare at an empty canvas and see a work of art? Or
+sit in silence and hear a song thats never been written? Or gaze at
+a red planet and see a laboratory on wheels?
+We make tools for these kinds of people.
+While some see them as the crazy ones, we see genius. Because the
+people who are crazy enough to think they can change the world,
+are the ones who do.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+The Crazy Ones
+October 14, 1998
+
+Heres to the crazy ones. The misfits. The rebels. The troublemakers.
+The round pegs in the square holes.
+
+The ones who see things differently. Theyre not fond of rules. And
+they have no respect for the status quo. You can quote them,
+disagree with them, glorify or vilify them.
+
+About the only thing you cant do is ignore them. Because they change
+things. They invent. They imagine. They heal. They explore. They
+create. They inspire. They push the human race forward.
+
+Maybe they have to be crazy.
+
+How else can you stare at an empty canvas and see a work of art? Or
+sit in silence and hear a song thats never been written? Or gaze at
+a red planet and see a laboratory on wheels?
+
+We make tools for these kinds of people.
+
+While some see them as the crazy ones, we see genius. Because the
+people who are crazy enough to think they can change the world,
+are the ones who do.
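
The two fixtures above differ only in the blank lines between paragraphs. One way to picture why a smaller font height weight could produce those extra blank lines is the toy model below. It is an illustrative sketch, not pypdf's implementation: the function name `blank_lines_between`, the formula, and the sample numbers are all assumptions made for this example.

```python
def blank_lines_between(y_gap: float, font_height: float, weight: float = 1.0) -> int:
    """Hypothetical model: blank lines to emit between two consecutive text lines.

    y_gap       vertical distance between the two baselines
    font_height height of the font on the lower line
    weight      scaling factor applied to the font height (assumed semantics)
    """
    if font_height <= 0 or weight <= 0:
        return 0
    # Count how many weighted line heights fit in the gap; the first one
    # is the text line itself, so only the remainder becomes blank lines.
    return max(0, int(y_gap / (font_height * weight)) - 1)


# A paragraph gap of 22 units with a 12-unit font:
print(blank_lines_between(22, 12, weight=1.0))
print(blank_lines_between(22, 12, weight=0.85))
```

Under this model, shrinking the weight makes each line count as "shorter", so a gap that previously fit only one line height now fits two and yields a blank line.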

tests/test_text_extraction.py

Lines changed: 41 additions & 0 deletions
@@ -219,3 +219,44 @@ def test_text_leading_height_unit():
     page = reader.pages[0]
     extracted = page.extract_text()
     assert "Something[cited]\n" in extracted
+
+
+def test_layout_mode_space_vertically_font_height_weight():
+    """Tests layout mode with vertical space and font height weight (issue #2915)"""
+    with open(RESOURCE_ROOT / "crazyones.pdf", "rb") as inputfile:
+        # Load PDF file from file
+        reader = PdfReader(inputfile)
+        page = reader.pages[0]
+
+        # Normal behaviour
+        with open(RESOURCE_ROOT / "crazyones_layout_vertical_space.txt", "rb") as pdftext_file:
+            pdftext = pdftext_file.read()
+
+        text = page.extract_text(extraction_mode="layout", layout_mode_space_vertically=True).encode("utf-8")
+
+        # Compare the text of the PDF to a known source
+        for expected_line, actual_line in zip(text.splitlines(), pdftext.splitlines()):
+            assert expected_line == actual_line
+
+        pdftext = pdftext.replace(b"\r\n", b"\n")  # fix for windows
+        assert text == pdftext, (
+            "PDF extracted text differs from expected value.\n\n"
+            "Expected:\n\n%r\n\nExtracted:\n\n%r\n\n" % (pdftext, text)
+        )
+
+        # Blank lines are added to truly separate paragraphs
+        with open(RESOURCE_ROOT / "crazyones_layout_vertical_space_font_height_weight.txt", "rb") as pdftext_file:
+            pdftext = pdftext_file.read()
+
+        text = page.extract_text(extraction_mode="layout", layout_mode_space_vertically=True,
+                                 layout_mode_font_height_weight=0.85).encode("utf-8")
+
+        # Compare the text of the PDF to a known source
+        for expected_line, actual_line in zip(text.splitlines(), pdftext.splitlines()):
+            assert expected_line == actual_line
+
+        pdftext = pdftext.replace(b"\r\n", b"\n")  # fix for windows
+        assert text == pdftext, (
+            "PDF extracted text differs from expected value.\n\n"
+            "Expected:\n\n%r\n\nExtracted:\n\n%r\n\n" % (pdftext, text)
+        )
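
The test repeats the same compare-against-fixture steps for both fixtures. A possible way to factor them out is sketched below. The helper name `assert_same_text` is hypothetical and is not part of pypdf or of this commit; it uses only the standard library. Note that `zip` stops at the shorter sequence, so the sketch compares the complete line lists instead and reports a unified diff on mismatch.

```python
import difflib


def assert_same_text(actual: bytes, expected: bytes) -> None:
    """Hypothetical helper: compare extracted text with a fixture, ignoring CRLF/LF differences."""
    # Normalize Windows line endings on both sides before comparing.
    actual_lines = actual.replace(b"\r\n", b"\n").decode("utf-8").split("\n")
    expected_lines = expected.replace(b"\r\n", b"\n").decode("utf-8").split("\n")
    if actual_lines != expected_lines:
        # Show only the differing lines rather than two full byte strings.
        diff = "\n".join(
            difflib.unified_diff(
                expected_lines, actual_lines, "expected", "extracted", lineterm=""
            )
        )
        raise AssertionError("PDF extracted text differs from expected value.\n\n" + diff)
```

With such a helper, each half of the test could shrink to reading the fixture, calling `page.extract_text(...)` with the arguments shown in the diff, and passing both byte strings to `assert_same_text`.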
