Insufficient handling of inline images containing EI sequences #3107

Open
stefan6419846 opened this issue Feb 6, 2025 · 2 comments
Labels
generic (The generic submodule is affected), is-bug (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF)

Comments

@stefan6419846
Collaborator

pypdf is currently unable to correctly handle inline images whose image data itself contains the sequence EI. The parser treats such a sequence as the end of the image, which breaks text extraction as well.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.33-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('cryptography', '41.0.7'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
reader.pages[1].extract_text()

I currently do not have a file that is free of personal data.

Excerpt of the relevant section (... marks redacted content):

...
BI
/IM true
/W 41
/H 41
/BPC 1
/D[1
0]
/F/CCF
/DP<</K -1
/Columns 41>>
ID >...EI E...
EI Q
q
...
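
To illustrate why such data is problematic, here is a minimal sketch of a naive terminator search. It is not pypdf's actual parser: `naive_image_end` and the synthetic `sample` bytes are hypothetical and only mimic the shape of the excerpt above.

```python
import re

# Hypothetical helper, not pypdf code: treat the first EI that is surrounded
# by whitespace as the end of the inline image data.
_NAIVE_EI = re.compile(rb"(?<=[ \t\r\n\x00\x0c])EI(?=[ \t\r\n\x00\x0c])")


def naive_image_end(stream: bytes, data_start: int) -> int:
    """Return the offset of the first whitespace-delimited EI after data_start."""
    match = _NAIVE_EI.search(stream, data_start)
    if match is None:
        raise ValueError("no EI operator found")
    return match.start()


# Synthetic content stream: the image data itself contains " EI ".
sample = b"BI /W 41 /H 41 ID \x01\x02 EI E\x0e\x1e\x8a\nEI Q\nq\n"
data_start = sample.index(b"ID ") + 3
naive_end = naive_image_end(sample, data_start)
real_end = sample.rindex(b"EI Q")

# The naive search stops inside the image data, so the remaining image bytes
# would be handed to the operand parser as if they were PDF objects.
print(naive_end, real_end)
```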

Traceback

This is the complete traceback I see (... marks redacted content):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2073, in _extract_text
    for operands, operator in content.operations:
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1423, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1496, in read_object
    raise PdfReadError(
pypdf.errors.PdfReadError: Invalid Elementary Object starting with b'\x0b' @1495: b'I E\x0e\x1e\x8a\...\xe0\xc7\x0b$;...'
@stefan6419846
Collaborator Author

stefan6419846 commented Feb 7, 2025

After further evaluation, it seems that handling this mostly involves guessing, as the PDF specification does not mention this aspect at all.

Looking at the implementation of iText for Java, they seem to work around this by probing the 10 bytes after the potential EI operator: https://github.com/itext/itext-java/blob/dfcc46546d23280484f6c391364b455879cadb5d/kernel/src/main/java/com/itextpdf/kernel/pdf/canvas/parser/util/InlineImageParsingUtils.java#L340-L372
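
A rough sketch of such a probing heuristic in Python could look as follows. The helper names are hypothetical, and the acceptance rule (all probed bytes are whitespace or printable ASCII) is an assumption for illustration; it is not necessarily what iText checks, and it is not existing pypdf code.

```python
WHITESPACE = b" \t\r\n\x00\x0c"
PROBE_LENGTH = 10  # number of bytes inspected after a candidate EI


def looks_like_content_stream(probe: bytes) -> bool:
    """Assumed rule: regular operators and operands are ASCII text."""
    return all(byte in WHITESPACE or 0x20 < byte < 0x7F for byte in probe)


def find_image_end(stream: bytes, data_start: int) -> int:
    """Return the offset of the EI that most plausibly terminates the image."""
    position = data_start
    while True:
        position = stream.find(b"EI", position)
        if position == -1:
            raise ValueError("no plausible EI operator found")
        before_ok = position > 0 and stream[position - 1] in WHITESPACE
        probe = stream[position + 2 : position + 2 + PROBE_LENGTH]
        # Reaching the end of the stream directly after EI is accepted as well.
        after_ok = not probe or (
            probe[0] in WHITESPACE and looks_like_content_stream(probe)
        )
        if before_ok and after_ok:
            return position
        position += 1  # candidate rejected, keep scanning


# Synthetic data: the first " EI " is followed by binary bytes and is skipped.
sample = b"BI /W 41 /H 41 ID \x01\x02 EI E\x0e\x1e\x8a\x0b$;\nEI Q\nq\n0 0 m\n"
image_end = find_image_end(sample, sample.index(b"ID ") + 3)
```

This remains a guess: image data that happens to contain EI followed by ASCII-only bytes would still be cut off too early.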

The PDF 2.0 specification accounts for this and introduces a mandatory length key (although recommended for another use case), but it will probably take many years until PDF 2.0 is adopted as widely as the older specifications.
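
For files that do provide the length, no guessing is needed. The following is a minimal sketch, assuming the key appears as /L or /Length in the inline image dictionary (this spelling should be verified against the specification) and that exactly one whitespace byte follows ID. `read_inline_image_data` is a hypothetical helper, and the dictionary handling is deliberately simplistic.

```python
import re
from typing import Tuple

# Assumed spelling of the PDF 2.0 length key: /L or /Length.
_LENGTH_KEY = re.compile(rb"/(?:L|Length)\s+(\d+)")


def read_inline_image_data(stream: bytes, bi_offset: int) -> Tuple[bytes, int]:
    """Return the image data and the offset directly after it."""
    id_offset = stream.index(b"ID", bi_offset)
    header = stream[bi_offset:id_offset]
    match = _LENGTH_KEY.search(header)
    if match is None:
        raise ValueError("inline image dictionary has no length key")
    length = int(match.group(1))
    data_start = id_offset + 3  # skip "ID" and the single whitespace byte
    data_end = data_start + length
    return stream[data_start:data_end], data_end


# Synthetic example: the data contains " EI ", but the length makes this harmless.
data = b"\x01\x02 EI E\x0e\x1e"
sample = b"BI /W 41 /H 41 /L %d ID " % len(data) + data + b"\nEI Q\n"
image_data, end_offset = read_inline_image_data(sample, 0)
```

A real parser would fall back to a heuristic whenever the key is missing, which is the case for most existing files.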


@stefan6419846
Collaborator Author

I have been a bit busier than expected, but I am actively working on a fix for this already and should be able to file a PR at the start of next week.
