NotImplementedError: unsupported filter /JBIG2Decode #1989

Open
MartinThoma opened this issue Jul 20, 2023 · 18 comments
Labels
is-feature A feature request workflow-images From a users perspective, image handling is the affected feature/workflow

Comments

@MartinThoma
Member

Explanation

I found an example for the /JBIG2Decode filter :-)

Code Example

PDF: https://github.com/py-pdf/pypdf/files/12090692/New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf")

page = reader.pages[0]
for img in page.images:
    print(img.name)

gives

pypdf==3.12.2
Traceback (most recent call last):
  File "/home/moose/Downloads/pyissue/main.py", line 8, in <module>
    for img in page.images:
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2604, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2600, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 522, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 844, in _xobj_to_image
    data = x_object_obj.get_data()  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/generic/_data_structures.py", line 919, in get_data
    decoded._data = decode_stream_data(self)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 634, in decode_stream_data
    raise NotImplementedError(f"unsupported filter {filter_
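
Until the filter is supported, one possible workaround is to fetch the images one at a time so that a single unsupported image does not abort the whole loop. This is only a sketch: `iter_decodable_images` is a hypothetical helper, and it assumes that `page.images` offers `keys()` and lookup by key, which should be checked against the installed pypdf version.

```python
def iter_decodable_images(images):
    """Yield images one by one, skipping those whose filter is unsupported.

    Assumption: `images` behaves like a mapping with keys() and
    item access by key (as `page.images` is assumed to do).
    """
    for key in images.keys():
        try:
            yield images[key]
        except NotImplementedError as exc:
            # For example: "unsupported filter /JBIG2Decode"
            print(f"skipping {key}: {exc}")
```

With the reader from the example above, this would be called as `for img in iter_decodable_images(page.images): print(img.name)`.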
@MartinThoma MartinThoma added workflow-images From a users perspective, image handling is the affected feature/workflow is-feature A feature request labels Jul 20, 2023
@MartinThoma MartinThoma self-assigned this Jul 20, 2023
@MartinThoma
Member Author

PDF found in #1983

@pubpub-zz
Collaborator

from #951

Here is pdfminer implementation:
https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/jbig2.py

ItDoesntWorkScan.pdf

@MartinThoma
Member Author

#2502 (comment) - here we have another example

@stefan6419846
Collaborator

stefan6419846 commented Apr 14, 2024

This might need a general design decision if I am not mistaken: Pillow does not seem to support JBIG2, while our implementation currently assumes that all images can be loaded as PIL.Image.Image (pdfminer.six does not use Pillow for saving images).

AFAIK there is only jbig2dec, which would have to be run in a subprocess to get a "good" image format from the JBIG2 image embedded inside the PDF file (after adding the missing bytes, specifically "the JBIG2 file header, end-of-page segments, and end-of-file segment", which are not part of the XObject according to section 7.4.7 of the PDF 2.0 spec), although this might cause issues with masks etc. jbig2dec itself is licensed under AGPL-3.0-or-later, and with its strong copyleft effect (including SaaS) it is rather unlikely to become part of Pillow. The alternative would be to parse the essential aspects like the pixel data from the JBIG2 image ourselves.

@mdecaro

mdecaro commented Jun 1, 2024

Many platforms support a standalone 'jbig2dec' functionality (e.g. on Mac can brew install jbig2dec) - can you farm that functionality out to that routine? I was going to try, but can't seem to get the raw unfiltered image bytes from the page object. Prob my ignorance... (will keep digging)

@pubpub-zz
Collaborator

> Many platforms support a standalone 'jbig2dec' functionality (e.g. on Mac can brew install jbig2dec) - can you farm that functionality out to that routine? I was going to try, but can't seem to get the raw unfiltered image bytes from the page object. Prob my ignorance... (will keep digging)

The XObject is a ContentStream. You should be able to access the data with .get_data()
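
A minimal sketch of how the still-encoded bytes could be collected, assuming the page behaves like a pypdf dictionary whose `/Resources` and `/XObject` entries resolve on item access. Note that the traceback at the top of this issue shows `get_data()` running the filter chain, so it raises for a `/JBIG2Decode` stream; the sketch therefore reads the private `_data` attribute, which is an implementation detail and may change. `iter_jbig2_streams` is a hypothetical helper name, not part of pypdf.

```python
def iter_jbig2_streams(page):
    """Yield (name, raw_bytes) for image XObjects that use /JBIG2Decode."""
    if "/Resources" not in page:
        return
    resources = page["/Resources"]
    if "/XObject" not in resources:
        return
    xobjects = resources["/XObject"]
    for name in xobjects:
        xobj = xobjects[name]
        if "/Subtype" not in xobj or xobj["/Subtype"] != "/Image":
            continue
        filters = xobj["/Filter"] if "/Filter" in xobj else []
        if isinstance(filters, str):
            # A single filter is stored as one name instead of an array.
            filters = [filters]
        if "/JBIG2Decode" not in filters:
            continue
        # Assumption: `_data` holds the undecoded stream bytes (private
        # attribute). If several filters are chained, these bytes still
        # carry the other encodings as well.
        yield name, xobj._data
```

Images nested inside form XObjects and the optional `/JBIG2Globals` stream in `/DecodeParms` are deliberately left out here.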

@stefan6419846
Collaborator

Given the data, you should still have a look at the PDF specification on the filter (for PDF 2.0/ISO 32000-2:2020, this is section 7.4.7), especially as jbig2dec probably expects the header and footer to be present, which are omitted within PDF files.
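
For a decoder that needs a standalone file, the omitted parts could be added roughly as sketched below. The byte layout follows my reading of the JBIG2 file format (sequential organisation, one page) and has not been validated against jbig2dec. `wrap_embedded_jbig2` and its parameters are hypothetical names; in particular, `next_segment_number` is an assumption: the caller has to supply a segment number that is not already used in the embedded stream.

```python
import struct

JBIG2_FILE_ID = b"\x97JB2\r\n\x1a\n"  # 8-byte JBIG2 file signature
END_OF_PAGE = 49  # segment type
END_OF_FILE = 51  # segment type


def _empty_segment(number, segment_type, page):
    # Segment number, flags (type only), no referred-to segments,
    # one-byte page association, zero-length data part.
    return struct.pack(">LBBBL", number, segment_type, 0, page, 0)


def wrap_embedded_jbig2(page_stream, globals_stream=b"", *, next_segment_number):
    """Wrap an embedded JBIG2 stream into a standalone single-page file."""
    header = JBIG2_FILE_ID
    header += struct.pack(">B", 0x01)  # sequential organisation, page count known
    header += struct.pack(">L", 1)     # exactly one page
    return (
        header
        + globals_stream               # content of /JBIG2Globals, if any
        + page_stream
        + _empty_segment(next_segment_number, END_OF_PAGE, 1)
        + _empty_segment(next_segment_number + 1, END_OF_FILE, 0)
    )
```

The result could then be written to a `.jbig2` file and passed to the decoder without any "embedded" option.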

@mdecaro

mdecaro commented Jun 2, 2024 via email

@mdecaro

mdecaro commented Jun 2, 2024 via email

@stefan6419846
Collaborator

If you intended to attach a file here, then this did not work because you answered by e-mail. Uploading files usually requires using the GitHub UI itself.

@mdecaro

mdecaro commented Jun 3, 2024 via email

@mdecaro

mdecaro commented Jun 3, 2024

This is a quick MacGyver fix for the /JBIG2Decode error. You need to install the 'jbig2dec' utility. On the Mac it is available through Homebrew. The fix assumes Apple silicon, where the path to the utility is /opt/homebrew/bin/jbig2dec. On an Intel Mac it is probably /usr/local/bin/jbig2dec?
filters.py.zip

@stefan6419846
Collaborator

Relevant class from the code:

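import os
import subprocess
from typing import Any, Optional

from pypdf.generic import DictionaryObject


class JBIG2Decode:
    @staticmethod
    def decode(
        data: bytes,
        decode_parms: Optional[DictionaryObject] = None,
        **kwargs: Any,
    ) -> bytes:
        # decode_parms is unused here
        # Fixed scratch files; the binary path is the Homebrew location on Apple silicon.
        pathin = "/var/tmp/tempin.jbig2"
        pathout = "/var/tmp/tempout.jbig2"
        with open(pathin, "wb") as fl:
            fl.write(data)
        # -e: input is an embedded stream without file header; -o: output file
        subprocess.run(
            ["/opt/homebrew/bin/jbig2dec", "-e", "-o", pathout, pathin],
            check=True,
        )
        with open(pathout, "rb") as fl:
            data = fl.read()
        os.unlink(pathin)
        os.unlink(pathout)
        return data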

@pubpub-zz
Collaborator

This solution does not work on Windows or Linux.
Can you try to port the code to native Python?

@stefan6419846
Collaborator

This is untested and may cause issues due to opening the same file twice on Windows, but should in general be OS-independent:

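import shutil
import subprocess
from tempfile import NamedTemporaryFile
from typing import Any, Optional

from pypdf.generic import DictionaryObject


class JBIG2Decode:
    @staticmethod
    def decode(
        data: bytes,
        decode_parms: Optional[DictionaryObject] = None,
        **kwargs: Any,
    ) -> bytes:
        # decode_parms is unused here
        jbig2dec = shutil.which("jbig2dec")
        if jbig2dec is None:
            raise RuntimeError("jbig2dec was not found on PATH")
        with NamedTemporaryFile(suffix=".jbig2") as infile:
            infile.write(data)
            infile.flush()  # make sure jbig2dec sees the complete stream
            result = subprocess.run(
                # Pass the file name, not the file object itself.
                [jbig2dec, "--embedded", "--output", "-", infile.name],
                stdout=subprocess.PIPE,
                check=True,
            )
        return result.stdout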

@mdecaro

mdecaro commented Jun 4, 2024 via email

@mdecaro

mdecaro commented Jun 5, 2024 via email

@mdecaro

mdecaro commented Jun 5, 2024

Liked your suggestions!!! I'll still try porting to python (at my pace unfortunately).

Here is the updated file.
filters.py.zip
