AttributeError: 'int' object has no attribute 'isspace' #1983

Closed
michelcrypt4d4mus opened this issue Jul 19, 2023 · 9 comments · Fixed by #1994

Comments

@michelcrypt4d4mus

Tried to extract text from attached PDF.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.4.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.12.1

Code + PDF

The code is here

PDF is attached. It's public and can be used for tests etc.
New Jersey Coinbase staking securities charges 2023-0606_Coinbase-Penalty-and-C-D.pdf

Traceback

Traceback (most recent call last):
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/bin/sort_screenshots", line 6, in <module>
    sys.exit(sort_screenshots())
             ^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/workspace/clown_sort/clown_sort/__init__.py", line 43, in sort_screenshots
    file_to_sort.sort_file()
  File "/Users/uzor/workspace/clown_sort/clown_sort/files/sortable_file.py", line 60, in sort_file
    search_text = self.basename_without_ext + ' ' + (self.extracted_text() or '')
                                                     ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/workspace/clown_sort/clown_sort/files/pdf_file.py", line 50, in extracted_text
    for image_number, image in enumerate(page.images, start=1):
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/_page.py", line 2603, in __iter__
    for i in range(len(self)):
                   ^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/_page.py", line 2565, in __len__
    return len(self.ids_function())
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/_page.py", line 479, in _get_ids_image
    self.inline_images = self._get_inline_images()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/_page.py", line 662, in _get_inline_images
    extension, byte_stream, img = _xobj_to_image(ii["object"])
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/filters.py", line 814, in _xobj_to_image
    data = x_object_obj.get_data()  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 919, in get_data
    decoded._data = decode_stream_data(self)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/filters.py", line 613, in decode_stream_data
    data = ASCIIHexDecode.decode(data)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/site-packages/pypdf/filters.py", line 280, in decode
    elif char.isspace():
         ^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'isspace'
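A note on the last frame: in Python 3, iterating over (or indexing into) a `bytes` object yields `int` values, and `int` has no `isspace()` method. The traceback is consistent with the ASCIIHex data reaching the decoder as `bytes` rather than `str`, although that is an inference from the error message, not something confirmed in this report. The sketch below reproduces the error and shows one way to decode such data safely. `ascii_hex_decode` is a hypothetical helper written for illustration; it is not pypdf's implementation.

```python
data = b"48 65 6C 6C 6F>"  # ASCIIHex-encoded "Hello", terminated by '>'

# Indexing a bytes object gives an int, so data[0].isspace() raises
# AttributeError: 'int' object has no attribute 'isspace'
assert isinstance(data[0], int)


def ascii_hex_decode(data: bytes) -> bytes:
    """Illustrative ASCIIHex decoder (hypothetical, not pypdf's code)."""
    hex_digits = bytearray()
    for i in range(len(data)):
        char = data[i:i + 1]  # slicing keeps a length-1 bytes object
        if char == b">":  # end-of-data marker
            break
        if char.isspace():  # bytes objects do have isspace()
            continue
        hex_digits += char
    if len(hex_digits) % 2:
        # An odd number of digits is padded with a trailing 0
        hex_digits += b"0"
    return bytes.fromhex(hex_digits.decode("ascii"))
```

Slicing with `data[i:i + 1]` instead of indexing with `data[i]` is the design choice that matters here: the slice stays a `bytes` object, so string-like methods such as `isspace()` remain available.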
@MartinThoma added the labels is-bug (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) and Has MCVE (a minimal, complete and verifiable example helps a lot to debug / understand feature requests) on Jul 19, 2023
@pubpub-zz
Collaborator

@michelcrypt4d4mus
Please provide a simple code sample (focusing on the failing page / image) to ease the analysis.

@pubpub-zz added the label needs-example-code (the issue needs a minimal and complete, e.g. all imports, example showing the problem) and removed the labels Has MCVE and is-bug on Jul 20, 2023
@MartinThoma
Member

Hm, interesting. That works fine:

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf")

for page in reader.pages:
    print(page.extract_text())

@MartinThoma
Member

I think the missing feature #1989 might be the issue here. It's just a bit hidden.

@pubpub-zz
Collaborator

There is something else behind this too, on page 10.

@michelcrypt4d4mus
Author

michelcrypt4d4mus commented Jul 21, 2023

@michelcrypt4d4mus Please provide a simple code sample (focusing on the failing page / image) to ease the analysis.

I linked to this code in my package; let me know if that's not enough. It's simple: just iterate over all pages and all images and extract text.

Edit: the actual extraction is done with pytesseract, but the code doesn't get that far because it fails while iterating over the page's images in pypdf, as per my link in the original issue.

@pubpub-zz
Copy link
Collaborator

What I would like is standalone code that focuses directly on the page and image producing the error. That simplifies our analysis, since we don't have to check everything.
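A standalone script along these lines could narrow the problem down to the failing page. This is a minimal sketch, assuming only the `page.images` access shown in the traceback; `find_failing_pages` and `main` are hypothetical names introduced for illustration, and the PDF path is whatever file you are debugging.

```python
from typing import Iterable, List, Tuple


def find_failing_pages(pages: Iterable[object]) -> List[Tuple[int, str]]:
    """Return (1-based page number, error summary) for pages whose images fail to load."""
    failures = []
    for page_number, page in enumerate(pages, start=1):
        try:
            # Iterating page.images makes pypdf load each image on the page,
            # which is where the traceback in this issue fails.
            for _ in page.images:
                pass
        except Exception as exc:  # broad on purpose: we only want to locate the page
            failures.append((page_number, f"{type(exc).__name__}: {exc}"))
    return failures


def main(pdf_path: str) -> None:
    # Imported lazily so find_failing_pages can be exercised without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page_number, error in find_failing_pages(reader.pages):
        print(f"page {page_number}: {error}")
```

Calling `main` with the path to the attached PDF would print one line per page whose images cannot be read, which gives the maintainers a single page to look at instead of the whole document.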

@pubpub-zz
Collaborator

I've finally found the issue. A PR has been proposed.

@pubpub-zz removed the label needs-example-code (the issue needs a minimal and complete, e.g. all imports, example showing the problem) on Jul 21, 2023
MartinThoma pushed a commit that referenced this issue Jul 25, 2023
Please note that this is potentially backwards-incompatible!

This also fixes a bug.

Closes  #1983
@AfifaYousaf

if line.is_space():
AttributeError: 'int' object has no attribute 'is_space'

Please let me know how to fix it

@pubpub-zz
Collaborator

@AfifaYousaf
We cannot help you with such limited information. Please open a new issue and attach your code and PDF for analysis.
