pypdf is currently unable to correctly handle inline images whose image data contains the byte sequence EI, which is also the operator that terminates an inline image. This breaks text extraction as well.
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
I currently do not have a file that is free of personal data.
Excerpt of the relevant section (... marks redacted content):
...
BI
/IM true
/W 41
/H 41
/BPC 1
/D[1
0]
/F/CCF
/DP<</K -1
/Columns 41>>
ID >...EI E...
EI Q
q
...
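To illustrate why the excerpt above is hard to parse, here is a minimal sketch. The byte string is a synthetic stand-in for the redacted stream, and `naive_end`, `heuristic_end`, and `FOLLOWING_OPERATORS` are hypothetical names for illustration only, not pypdf's actual implementation. It compares stopping at the first EI with a stricter check that only accepts a whitespace-delimited EI followed by a plausible operator.

```python
import re

# Synthetic content stream (an assumption, not the redacted file): the image
# data after "ID " contains the bytes "EI E" before the real "EI" operator.
STREAM = b"q\nBI\n/W 41\n/H 41\n/BPC 1\nID >\x0bEI E\x0e\x1e\x8a\nEI Q\nq\n"

# Small, deliberately incomplete set of operators that may follow an inline image.
FOLLOWING_OPERATORS = (b"Q", b"q", b"BI", b"BT", b"cm", b"Do")


def image_data_start(stream: bytes) -> int:
    """Return the offset of the first image byte after the ID operator."""
    return stream.index(b"ID ") + len(b"ID ")


def naive_end(stream: bytes, start: int) -> int:
    """Stop at the first 'EI' followed by whitespace, even inside image data."""
    match = re.compile(rb"EI\s").search(stream, start)
    if match is None:
        raise ValueError("no EI found")
    return match.start()


def heuristic_end(stream: bytes, start: int) -> int:
    """Accept only a whitespace-delimited 'EI' followed by a known operator or EOF."""
    for match in re.compile(rb"\sEI(?=\s|$)").finditer(stream, start):
        # Look at the first token after this candidate EI.
        rest = stream[match.end():].split(None, 1)
        if not rest or rest[0] in FOLLOWING_OPERATORS:
            return match.start()
    raise ValueError("no plausible EI found")


start = image_data_start(STREAM)
print(STREAM[start:naive_end(STREAM, start)])      # truncated image data
print(STREAM[start:heuristic_end(STREAM, start)])  # image data up to the real EI
```

Such a heuristic is still only a guess: binary image data can by chance contain whitespace, EI, whitespace, and something that looks like an operator.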
Traceback
This is the complete traceback I see (... marks redacted content):
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
  File "/home/stefan/pdf/pypdf/pypdf/_page.py", line 2073, in _extract_text
    for operands, operator in content.operations:
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1423, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/stefan/pdf/pypdf/pypdf/generic/_data_structures.py", line 1496, in read_object
    raise PdfReadError(
pypdf.errors.PdfReadError: Invalid Elementary Object starting with b'\x0b' @1495: b'I E\x0e\x1e\x8a\...\xe0\xc7\x0b$;...'
stefan6419846 added the is-bug, PdfReader, and generic labels and removed the PdfReader label on Feb 6, 2025.
The PDF 2.0 specification accounts for this and introduces a mandatory length key (although recommended for another use case), but it will probably take many years until PDF 2.0 is adopted as widely as the older specifications.
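As a rough sketch of how such a length key removes the ambiguity: once the byte count is known, the parser can skip the image data without searching for EI at all. The key spelling (`/L` or `/Length`), the sample stream, and the helper name `read_inline_image` are assumptions made for illustration; this is not pypdf code.

```python
import re

# Assumed spellings of the length key in the inline image dictionary.
LENGTH_KEY = re.compile(rb"/(?:L|Length)\s+(\d+)")

# Synthetic stream (an assumption): 9 bytes of image data containing "EI E".
SAMPLE = b"BI\n/W 41\n/H 41\n/BPC 1\n/L 9\nID >\x0bEI E\x0e\x1e\x8a\nEI Q\n"


def read_inline_image(stream: bytes) -> bytes:
    """Return the image data of the first inline image, using its length key."""
    bi = stream.index(b"BI")
    id_match = re.compile(rb"\sID\s").search(stream, bi)
    if id_match is None:
        raise ValueError("no ID operator found")
    # The key/value pairs sit between BI and ID.
    length_match = LENGTH_KEY.search(stream[bi:id_match.start()])
    if length_match is None:
        raise ValueError("no length key; the caller has to fall back to scanning for EI")
    length = int(length_match.group(1))
    start = id_match.end()
    data = stream[start:start + length]
    if len(data) != length:
        raise ValueError("stream ends before the declared length")
    # Sanity check: the next token after the image data should be EI.
    if stream[start + length:].split(None, 1)[:1] != [b"EI"]:
        raise ValueError("declared length does not end at an EI operator")
    return data


print(read_inline_image(SAMPLE))
```

For files without the key, which is the case for the excerpt in this issue, a parser would still need some fallback for locating EI.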