Whitelist not working #4407

kylefoley76 · 2025-04-02T01:25:32Z

Current Behavior

I'm using Tesseract with Python because OCR is too difficult when a text mixes the Greek and Latin alphabets. Too often I get Cyrillic characters in the output. I was hoping that the whitelist feature would solve that problem, but it does not. When I pass the following whitelist,

αςερτυθιοπλκξηγφδσζχψωβνμΣΕΡΤΥΘΙΟΠΛΚΞΗΓΦΔΣΑΖΧΨΩΒΝΜΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890/?<>{}*&,;.:-+=|1234567890

I get reasonably good output for the Latin characters, but the Greek text is not very accurate. For example, here is some output:

Contracted nouns and adjectives in -ους from -οος 63
Adjectives of material in -ots from -εος 64
Nouns in ts, -εως and -υς/-υ, -εως 65

But the correct output should be -οῦς, not -ots.

However, even if the accuracy were 100%, that whitelist would not solve my problem because it does not include the diacritics. So when I use a whitelist with diacritics, such as

"ΆᾺΑἉἊἍἋἌᾍᾈᾌᾎᾉAΒΔΗΉἩἨἮἯἬἫἭἪῌᾞᾟᾜᾘᾙῊἜἚἝἛἘἙΈΕΓΙῚἾἿἽἻἺἼἹἸΊIΚΧΞΛΜΝὩὨῼὭὫὬὪὯὮΩΏὉὈὊὋὌὍΟΌῸῺᾨᾩᾯᾮᾪᾫᾬᾭΠΦΨῬΡΣΤΘὝὛὙΎΥὟΖᾅᾳᾇᾄᾂᾀᾷᾆᾴᾲἇἆἂἄἅἃάᾶὰαἁἀααᾁᾃβδέὲἕἓἒἔἑἐεἠῆᾖἧᾔᾐᾑἥἣᾕἡἦῄῂῇᾗηῃήὴἤἢᾒᾓγϊῖιἰἶἴἲἱΐῒὶίἷἵἳῗιικχλμνὁᾦὀοῷὧωὠᾡὦῳῶὡᾠᾧῴῲὢὤὥὣᾤᾢὅὃὄὂόὸώὼᾣᾥπφψῤῥρςστθὖϋὗῧὐὑυῦὔὒύὺὓὕῢΰυυϝξζΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890/?<>{}*&,;.:-+=|1234567890 "

I get the output:

ΝΕΗΟΓΑΑΠΚ
Α
ΑΟΗΠΓΠΟΠ
ΑΟΕΠΓ
ΑΕΠΓΟ
ΑΠ
ἸΑΓΝΠΑΟΕΕ
ΡΟΡΟΠ
ΑΙΟΓΠΊ
ΠΟΙΠΕΟΓΠΓΕΠΟΏΡΒΡ
ΑΓ Ι
ΙΠΠΠΠΊΒΠ

I've tried locating the characters that are causing the problem, but there are too many. It is certainly not any of these characters, though: /?<>{}*&,;.:-+=|

The image I'm trying to scan is uploaded. Here is the exact Python code I'm using:

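import pytesseract
from PIL import Image

# 'page.png' is a placeholder; the original post does not show how img1 was loaded.
img1 = Image.open('page.png')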
custom_oem_psm_config = '--oem 3 --psm 6 -c tessedit_char_whitelist="{}"'.format(
"ΆᾺΑἉἊἍἋἌᾍᾈᾌᾎᾉAΒΔΗΉἩἨἮἯἬἫἭἪῌᾞᾟᾜᾘᾙῊἜἚἝἛἘἙΈΕΓΙῚἾἿἽἻἺἼἹἸΊIΚΧΞΛΜΝὩὨῼὭὫὬὪὯὮΩΏὉὈὊὋὌὍΟΌῸῺᾨᾩᾯᾮᾪᾫᾬᾭΠΦΨῬΡΣΤΘὝὛὙΎΥὟΖᾅᾳᾇᾄᾂᾀᾷᾆᾴᾲἇἆἂἄἅἃάᾶὰαἁἀααᾁᾃβδέὲἕἓἒἔἑἐεἠῆᾖἧᾔᾐᾑἥἣᾕἡἦῄῂῇᾗηῃήὴἤἢᾒᾓγϊῖιἰἶἴἲἱΐῒὶίἷἵἳῗιικχλμνὁᾦὀοῷὧωὠᾡὦῳῶὡᾠᾧῴῲὢὤὥὣᾤᾢὅὃὄὂόὸώὼᾣᾥπφψῤῥρςστθὖϋὗῧὐὑυῦὔὒύὺὓὕῢΰυυϝξζΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\/?<>{}[]()*&,;.:-+=| "
)
str4 = pytesseract.image_to_string(img1, config=custom_oem_psm_config, lang='eng+ell')
print(str4)

I'm using pytesseract 0.3.13 and I have Tesseract 5.3.8 installed. Also, ChatGPT tells me that Tesseract sometimes cannot handle large whitelists. If that is the case, I think it would be very easy to solve that problem.
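Both whitelists above contain repeated runs (for example ΑΖΧΨΩΒΝΜ and the digits appear twice), and with this many look-alike glyphs it is hard to tell by eye whether a Latin letter has slipped in among the Greek ones. The following is a minimal sketch, using only the Python standard library, that removes duplicates and counts letters per script. The function name `audit_whitelist` is hypothetical, and taking the first word of the Unicode character name as the script is a heuristic, not something Tesseract does. Whether duplicates or the size of the whitelist actually affect Tesseract is not confirmed in this thread.

```python
import unicodedata
from collections import Counter


def audit_whitelist(whitelist):
    """Return (deduplicated whitelist, letter counts per script).

    Heuristic: the script of a letter is taken to be the first word of its
    Unicode name, e.g. 'GREEK SMALL LETTER ALPHA' -> 'GREEK'.
    Non-letters (digits, punctuation, spaces) are counted as 'OTHER'.
    """
    # dict.fromkeys keeps the first occurrence of each character, in order.
    unique = "".join(dict.fromkeys(whitelist))

    scripts = Counter()
    for ch in unique:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            scripts[name.split(" ")[0]] += 1
        else:
            scripts["OTHER"] += 1
    return unique, scripts


if __name__ == "__main__":
    # Greek capital alpha (U+0391) twice, Latin A and B, and a repeated digit.
    cleaned, counts = audit_whitelist("\u0391A\u0391B11")
    print(cleaned)
    print(dict(counts))
```

Running this over the full whitelist would show how many characters remain after deduplication and whether any script other than GREEK or LATIN is present.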

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

No response

Operating System

No response

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

No response


amitdo · 2025-04-03T10:41:45Z

As you found out yourself, the whitelist/blacklist feature currently does not work as expected with the LSTM OCR engine. Also, the lang1+lang2 combination does not work well.
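Since the whitelist cannot be relied on with the LSTM engine, one possible workaround for the stray Cyrillic characters is to check the OCR output after the fact. The sketch below is not a Tesseract feature; it is a standard-library post-processing step, and the names `ALLOWED_SCRIPTS`, `find_foreign_letters`, and `replace_foreign_letters` are hypothetical. It treats a letter as allowed when its Unicode name starts with GREEK or LATIN, and leaves digits, punctuation, whitespace, and combining marks untouched.

```python
import unicodedata

# Assumption: only Greek and Latin letters are expected in the OCR output.
ALLOWED_SCRIPTS = ("GREEK", "LATIN")


def is_allowed(ch):
    """True for non-letters and for letters whose Unicode name is Greek or Latin."""
    if not ch.isalpha():
        return True
    return unicodedata.name(ch, "").startswith(ALLOWED_SCRIPTS)


def find_foreign_letters(text):
    """Return (index, character) pairs for letters outside the allowed scripts."""
    return [(i, ch) for i, ch in enumerate(text) if not is_allowed(ch)]


def replace_foreign_letters(text, placeholder="\ufffd"):
    """Replace letters outside the allowed scripts with a visible placeholder."""
    return "".join(ch if is_allowed(ch) else placeholder for ch in text)


if __name__ == "__main__":
    # Greek alpha, Cyrillic be (U+0431), Latin c, a space and a digit.
    sample = "\u03b1\u0431c 1"
    print(find_foreign_letters(sample))
    print(replace_foreign_letters(sample))
```

This only flags or masks unexpected letters; it cannot recover the character that should have been recognized, so it is best used to locate lines that need manual review.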

amitdo · 2025-04-03T12:47:26Z

If you only need Greek and English, you can try the Greek script model with tesseract in.png out -l Greek (the model files are here):

https://github.com/tesseract-ocr/tessdata_best/tree/main/script
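For anyone calling Tesseract from Python, as in the original post, the suggested command can be wrapped roughly as follows. This is a sketch under assumptions: the `tesseract` executable is on the PATH, `Greek.traineddata` from the linked directory has been placed where Tesseract looks for models, and the helper names `build_tesseract_command` and `run_ocr` are hypothetical. Passing the arguments as a list avoids shell quoting problems.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, model="Greek"):
    """Build the argument list for: tesseract <image> <output_base> -l <model>."""
    return ["tesseract", str(image_path), str(output_base), "-l", model]


def run_ocr(image_path, output_base, model="Greek"):
    """Run Tesseract and return the recognized text.

    Tesseract writes its plain-text result to '<output_base>.txt'.
    Raises subprocess.CalledProcessError if Tesseract exits with an error.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, model), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only prints the command; call run_ocr("in.png", "out") to actually run it.
    print(build_tesseract_command("in.png", "out"))
```

With pytesseract, passing `lang='Greek'` instead of `lang='eng+ell'` to `image_to_string` should select the same model, assuming the traineddata file is installed.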

CanadianHusky · 2025-05-30T09:59:59Z

I have the same problem with German diacritics (umlauts) on the Windows command line.
I could not find any way to make this work:
-c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÄÖÜßäöü0123456789
Part of the problem is that the default Windows command line does not support UTF-8,
but even from a PowerShell terminal with chcp 65001, or with a separate config file, I could not get the whitelist or blacklist to work properly.
Maybe an expert can look into this. It would make a great library even better.

amitdo added the allowlist / denylist label Apr 3, 2025
