Incorrect word coordinates #1712

Open
@troplin

Environment

  • Tesseract Version:

    tesseract --version
    tesseract 4.0.0-beta.1
    leptonica-1.76.0 (Jun 26 2018, 18:21:40) [MSC v.1900 LIB Release x86]
     libgif 5.1.4 : libjpeg 9b : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
    Found AVX
    Found SSE
    

    (The output says beta.1, but beta.3 seems to be the correct version.)

  • Commit Number: AppVeyor: 4.0.0-beta.3.1776

  • Platform: Windows 10 64-bit (but Tesseract is running as a 32-bit process)

  • Tessdata: tessdata-fast

We've integrated the engine in our (closed-source) application using the C API, so I cannot share the actual code.

Essentially, I iterate over the result iterator at the RIL_WORD level, get the text, bounding box, and baseline of each word, and create a PDF from them, with the recognized text as a red overlay and the bounding boxes drawn in green for better visibility.

However, the official PDF output config has the same flaw; it is just harder to spot:

tesseract andromeda.png andromeda.tess4cli pdf

produces the following PDF:
andromeda.tess4cli.pdf

Files:

The behavior can be reproduced with the following PNG (which is a crop of a larger file):

andromeda

Current Behavior:

The coordinates are correct for most words.
For some words, however, the boundary between adjacent words seems to be computed incorrectly.
Curiously, the wrong boundary is always placed before the last character of the previous word.
It almost looks like some kind of off-by-one error.

grafik

grafik

Expected Behavior:

Result with tesseract 3:

grafik

grafik
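
The difference between the two versions could also be checked programmatically rather than by eye. The following is only a minimal sketch, not code from the application: it assumes the word boxes of both runs have already been collected (for example from the RIL_WORD iteration described above) into arrays of equal length and in the same word order. The `WordBox` struct, the function name `find_shifted_words`, and the `tol` parameter are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one recognized word (names are illustrative). */
typedef struct {
    const char *text;               /* recognized text of the word */
    int left, top, right, bottom;   /* bounding box in image pixels */
} WordBox;

/*
 * Compares the word boxes of a reference run (e.g. Tesseract 3) with those
 * of the run under test (e.g. Tesseract 4) and flags every word whose left
 * or right edge differs by more than `tol` pixels.
 *
 * Assumption: both arrays hold `n` words in the same order. Words whose
 * text differs between the runs are skipped, since their boxes are not
 * directly comparable.
 *
 * Up to `max_out` indices of flagged words are written to `out`; the
 * return value is the total number of flagged words.
 */
size_t find_shifted_words(const WordBox *ref, const WordBox *cur, size_t n,
                          int tol, size_t *out, size_t max_out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(ref[i].text, cur[i].text) != 0)
            continue;
        if (abs(ref[i].left - cur[i].left) > tol ||
            abs(ref[i].right - cur[i].right) > tol) {
            if (count < max_out)
                out[count] = i;
            ++count;
        }
    }
    return count;
}
```

A small tolerance of a few pixels would absorb ordinary differences between the two engines, so that only boundaries shifted by roughly a character width are reported.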
