Insufficient handling of inline images containing `EI` sequences #3107

stefan6419846 · 2025-02-06T14:19:36Z

pypdf is currently unable to correctly handle inline images whose actual content stream contains the sequence `EI`. This breaks text extraction as well.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.33-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('cryptography', '41.0.7'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
reader.pages[1].extract_text()

I currently do not have a file that is free of personal data.

Excerpt of the relevant section (... marks redacted content):

...
BI
/IM true
/W 41
/H 41
/BPC 1
/D[1
0]
/F/CCF
/DP<</K -1
/Columns 41>>
ID >...EI E...
EI Q
q
...

Traceback

This is the complete traceback I see (... marks redacted content):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2073, in _extract_text
    for operands, operator in content.operations:
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1423, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1496, in read_object
    raise PdfReadError(
pypdf.errors.PdfReadError: Invalid Elementary Object starting with b'\x0b' @1495: b'I E\x0e\x1e\x8a\...\xe0\xc7\x0b$;...'


stefan6419846 · 2025-02-07T08:11:35Z

After evaluating this further, it seems that handling this mostly involves guessing, as the PDF specification does not mention this aspect at all.

Looking at the implementation of iText for Java, they seem to work around this by probing the 10 bytes after the potential EI operator: https://github.com/itext/itext-java/blob/dfcc46546d23280484f6c391364b455879cadb5d/kernel/src/main/java/com/itextpdf/kernel/pdf/canvas/parser/util/InlineImageParsingUtils.java#L340-L372

The PDF 2.0 specification accounts for this and introduces a mandatory length key (although it is recommended for another use case). However, it will probably take many years until PDF 2.0 is adopted as widely as the older specifications.

stefan6419846 · 2025-02-21T14:15:08Z

I have been a bit busier than expected, but I am actively working on a fix for this already and should be able to file a PR at the start of next week.

stefan6419846 added the labels is-bug (From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF), PdfReader (The PdfReader component is affected), and generic (The generic submodule is affected), and removed the label PdfReader, on Feb 6, 2025
