Description
Environment
- Tesseract Version:
  tesseract --version
  tesseract 4.0.0-beta.1 leptonica-1.76.0 (Jun 26 2018, 18:21:40) [MSC v.1900 LIB Release x86] libgif 5.1.4 : libjpeg 9b : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0 Found AVX Found SSE
  (says beta.1, but beta.3 seems to be correct)
- Commit Number: AppVeyor: 4.0.0-beta.3.1776
- Platform: Windows 10 64bit (but tesseract running as 32bit)
- Tessdata: tessdata-fast
We've integrated the engine in our (closed-source) application using the C API, so I cannot share the actual code.
What I do is basically iterate over the result iterator at the RIL_WORD level, get the text, bounding box, and baseline for each word, and then create a PDF from them, with the recognized text as a red overlay and the bounding boxes drawn in green for better visibility.
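For reference, here is a minimal sketch of the kind of coordinate mapping such an overlay needs. It is not the code from our application. Tesseract reports word boxes in image pixels with the origin at the top-left, while PDF user space uses points (1/72 inch) with the origin at the bottom-left. The names PixelBox, PdfRect, and pixel_box_to_pdf are hypothetical, and the sketch assumes the PDF page covers the whole image at a known resolution.

```c
/* Word box in image pixels: origin top-left, y grows downward
 * (the convention used by Tesseract's bounding boxes). */
typedef struct {
    int left, top, right, bottom;
} PixelBox;

/* Rectangle in PDF user space: points (1/72 inch), origin bottom-left. */
typedef struct {
    double x, y, width, height;
} PdfRect;

/* Hypothetical helper, not part of the Tesseract API.
 * Assumes the page has the same extent as the image and that dpi > 0. */
PdfRect pixel_box_to_pdf(PixelBox box, int image_height_px, double dpi)
{
    const double scale = 72.0 / dpi;  /* pixels -> points */
    PdfRect r;
    r.x = box.left * scale;
    /* Flip the y axis: the bottom edge of the box becomes the lower-left y. */
    r.y = (image_height_px - box.bottom) * scale;
    r.width = (box.right - box.left) * scale;
    r.height = (box.bottom - box.top) * scale;
    return r;
}
```

For a 600 px tall image at 300 dpi, the box (100, 50)-(400, 150) maps to x = 24, y = 108, width = 72, height = 24 points. An error in this mapping would displace every box uniformly rather than only some of them.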
But the official PDF output config has the same flaw; it's just more difficult to spot:
tesseract andromeda.png andromeda.tess4cli pdf
produces the following PDF:
andromeda.tess4cli.pdf
Files:
Behavior can be reproduced using the following PNG (which is a part of a bigger file):
Current Behavior:
The coordinates are correct for most words.
But for some words, there seems to be an error in computing the boundaries between the words.
Curiously, the wrong boundary is always before the last character of the previous word.
It almost looks like some kind of off-by-one error.
Expected Behavior:
Result with tesseract 3: