As you found out yourself, the whitelist/blacklist feature currently does not work as expected with the LSTM OCR engine. The lang1+lang2 combination does not work well either.
I have the same problem with German diacritics (umlauts) on the Windows command line.
I could not find any way to make this option work: -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÄÖÜßäöü0123456789
Part of the problem is that the default Windows command line does not support UTF-8,
but even from a PowerShell terminal with chcp 65001, or with a separate config file, I could not get the whitelist or blacklist to work properly.
Maybe an expert can look into this. It would make a great library even better.
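To rule out the console encoding as a factor, one option is to write the config file from a script so that it is guaranteed to be UTF-8 without a byte order mark. Below is a minimal sketch, assuming the usual Tesseract config file layout of one `name value` pair per line. The function name `write_whitelist_config` and the file name are made up for illustration, and dropping whitespace from the whitelist is my own assumption, not something Tesseract documents as required. This only removes the shell from the picture; it does not address the whitelist behaviour of the LSTM engine itself.

```python
from pathlib import Path


def write_whitelist_config(path, whitelist):
    """Write a Tesseract-style config file containing only a whitelist.

    Duplicate characters are removed (first occurrence wins) and whitespace
    is dropped, which is an assumption made here to keep the value on one
    unambiguous line. Returns the characters that were actually written.
    """
    unique = "".join(dict.fromkeys(whitelist))
    kept = "".join(ch for ch in unique if not ch.isspace())
    # Explicit UTF-8 (no BOM), so the console code page plays no role.
    Path(path).write_text(f"tessedit_char_whitelist {kept}\n", encoding="utf-8")
    return kept
```

The resulting file can then be passed as the trailing config file argument, for example `tesseract image.png out umlauts.cfg`, where `umlauts.cfg` is whatever path was written.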
Current Behavior
I'm using Tesseract with Python because OCR is too difficult when the text mixes the Greek and Latin alphabets. Too often I get Cyrillic characters in the output. I was hoping that the whitelist feature would solve that problem, but it does not. When I input the following whitelist,
αςερτυθιοπλκξηγφδσζχψωβνμΣΕΡΤΥΘΙΟΠΛΚΞΗΓΦΔΣΑΖΧΨΩΒΝΜΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890/?<>{}*&,;.:-+=|1234567890
I get reasonably good output for the Latin characters, but the Greek text is not very accurate. For example, here is some of the output:
Contracted nouns and adjectives in -ους from -οος 63
Adjectives of material in -ots from -εος 64
Nouns in ts, -εως and -υς/-υ, -εως 65
But the correct output should be -οῦς, not -ots.
However, even if the accuracy were 100%, that whitelist would not solve my problem because it does not include the diacritics. So when I use a whitelist with diacritics, such as
"ΆᾺΑἉἊἍἋἌᾍᾈᾌᾎᾉAΒΔΗΉἩἨἮἯἬἫἭἪῌᾞᾟᾜᾘᾙῊἜἚἝἛἘἙΈΕΓΙῚἾἿἽἻἺἼἹἸΊIΚΧΞΛΜΝὩὨῼὭὫὬὪὯὮΩΏὉὈὊὋὌὍΟΌῸῺᾨᾩᾯᾮᾪᾫᾬᾭΠΦΨῬΡΣΤΘὝὛὙΎΥὟΖᾅᾳᾇᾄᾂᾀᾷᾆᾴᾲἇἆἂἄἅἃάᾶὰαἁἀααᾁᾃβδέὲἕἓἒἔἑἐεἠῆᾖἧᾔᾐᾑἥἣᾕἡἦῄῂῇᾗηῃήὴἤἢᾒᾓγϊῖιἰἶἴἲἱΐῒὶίἷἵἳῗιικχλμνὁᾦὀοῷὧωὠᾡὦῳῶὡᾠᾧῴῲὢὤὥὣᾤᾢὅὃὄὂόὸώὼᾣᾥπφψῤῥρςστθὖϋὗῧὐὑυῦὔὒύὺὓὕῢΰυυϝξζΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890/?<>{}*&,;.:-+=|1234567890 "
I get the output:
ΝΕΗΟΓΑΑΠΚ
Α
ΑΟΗΠΓΠΟΠ
ΑΟΕΠΓ
ΑΕΠΓΟ
ΑΠ
ἸΑΓΝΠΑΟΕΕ
ΡΟΡΟΠ
ΑΙΟΓΠΊ
ΠΟΙΠΕΟΓΠΓΕΠΟΏΡΒΡ
ΑΓ Ι
ΙΠΠΠΠΊΒΠ
I've tried locating the characters that are causing the problem, but there are too many. It is certainly not any of these characters: /?<>{}*&,;.:-+=|
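Since checking the characters one by one is impractical, the search could be automated with a bisection over the extra characters. The following is only a sketch under a strong assumption: that a single character is responsible and that adding it to a working whitelist is enough to break the output. The names `find_offending_char` and `breaks_output` are hypothetical; `breaks_output` stands for whatever check you use (for example, running the OCR with the given whitelist and testing whether the result degenerates into capitals only).

```python
from typing import Callable, Optional


def find_offending_char(
    base: str,
    extras: str,
    breaks_output: Callable[[str], bool],
) -> Optional[str]:
    """Bisect `extras` for one character that breaks a working whitelist.

    `base` is a whitelist known to give usable output. `breaks_output`
    receives a candidate whitelist and returns True if the OCR result is
    broken. Assumes exactly one culprit; with several interacting
    characters the result is not reliable.
    """
    # Only consider characters that are new relative to the base whitelist.
    candidates = [ch for ch in dict.fromkeys(extras) if ch not in base]
    if not candidates or not breaks_output(base + "".join(candidates)):
        return None  # nothing in `extras` triggers the problem
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left = candidates[:mid]
        if breaks_output(base + "".join(left)):
            candidates = left
        else:
            # By the single-culprit assumption it must be in the other half.
            candidates = candidates[mid:]
    return candidates[0]
```

With a few hundred extra characters this needs on the order of ten OCR runs instead of one run per character.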
The image I'm trying to scan is uploaded. Here is the exact Python code I'm using:
I'm using pytesseract 0.3.13 and I have Tesseract 5.3.8 installed. Also, ChatGPT informs me that Tesseract sometimes cannot handle large whitelists. If that is the case, then I think it would be very easy to solve that problem.
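Independently of the whitelist, the stray Cyrillic characters mentioned at the start can at least be located after the fact. This is not the code referred to above, just a small standard-library sketch: it looks up each letter's Unicode name and reports those whose script prefix is not in an allowed set. The function name `foreign_letters` and the default allowed scripts are illustrative choices.

```python
import unicodedata


def foreign_letters(text, allowed=("GREEK", "LATIN")):
    """Return (index, char, unicode_name) for letters outside `allowed`.

    The script is approximated by the first word of the character's
    Unicode name, e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC".
    Digits, punctuation, spaces and combining marks are ignored.
    """
    hits = []
    for i, ch in enumerate(text):
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "UNKNOWN")
        if name.split()[0] not in allowed:
            hits.append((i, ch, name))
    return hits
```

Running this over the OCR text shows where lookalike letters such as Cyrillic "о" or "с" slipped in, which makes it easier to judge whether a language setting or whitelist change actually helped.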