Plans for tesseract 5.x.y #3673

Open
amitdo opened this issue Dec 5, 2021 · 152 comments

Comments

@amitdo
Collaborator

amitdo commented Dec 5, 2021

I suggest we focus on 5.x, at least for 2022.

That means we should not break the API (and ABI?). Use C++17, not C++20/C++23.

@stweil
Member

stweil commented Dec 22, 2021

What about releasing a 5.0.1 after Christmas at the end of December? I think there are several fixes since 5.0.0 which would be good for a new release.

@amitdo
Collaborator Author

amitdo commented Dec 22, 2021

Mind reader :-)
I was about to suggest releasing 5.0.1 before year end. It would be nice if we can fix #3683 before releasing 5.0.1.

@amitdo
Collaborator Author

amitdo commented Dec 22, 2021

Right before tagging 5.0.1, you can update this sentence from the README:

The latest stable version is 5.0.0, released on November 30, 2021.

@amitdo amitdo mentioned this issue Dec 23, 2021
@egorpugin
Contributor

What should be added to v5?
5.x changes could be merged into a branch and cherry-picked into v6 main.

@stweil
Member

stweil commented Dec 23, 2021

We already have a wish list for improved training, a lot of issues with layout detection, a wish for improved logging, and much more. Maintaining two branches did not work well with 4.x, and I am afraid it would not work better with 5.x.

@egorpugin
Contributor

Maybe keep 5.0 as is? It is a good release with a number of changes.
Everything else will go straight into 6?

@amitdo
Collaborator Author

amitdo commented Dec 26, 2021

@amitdo
Collaborator Author

amitdo commented Jan 1, 2022

> What about releasing a 5.0.1 after Christmas at the end of December? I think there are several fixes since 5.0.0 which would be good for a new release.

Do you plan to release 5.0.1 next week?

@stweil
Member

stweil commented Jan 1, 2022

Yes, unless we discover that something very important is still missing.

@stweil
Member

stweil commented Jan 6, 2022

> It would be nice if we can fix #3683 before releasing 5.0.1.

There is still no fix, and I have no clang-cl, so I cannot look for a fix myself. Should we release 5.0.1 without a fix? Is anything else missing for 5.0.1 (besides updating the documentation)?

@egorpugin
Contributor

clang-cl is not worth it currently.

@amitdo
Collaborator Author

amitdo commented Jan 6, 2022

You can release 5.0.1 without the clang-cl fix.

@stweil
Member

stweil commented Jan 7, 2022

Release 5.0.1 is now online.

@stweil
Member

stweil commented Jan 7, 2022

The next release could be a new minor version 5.1.0 with new features, maybe at the end of January (unless there is an urgent need for a bug fix release 5.0.2). I especially want to have image information in ALTO and hOCR output (see PR #3710, which implements that for hOCR), and maybe more from the project list. The new minor release would also disable OpenMP by default for autoconf builds.

@stweil stweil pinned this issue Feb 10, 2022
@amitdo
Collaborator Author

amitdo commented Feb 14, 2022

https://packages.ubuntu.com/search?keywords=tesseract-ocr

@AlexanderP,

Are you going to update Ubuntu 22.04 to 5.0.1 soon? The feature freeze date is February 24.

@AlexanderP

@amitdo

I uploaded:

I hope @jbreiden will upload them to debian.

@amitdo
Collaborator Author

amitdo commented Feb 27, 2022

Hi @AlexanderP,

> I hope @jbreiden will upload them to debian.

From https://tracker.debian.org/pkg/tesseract :

maintainer: [Alexander Pozdnyakov]

So, why can't you directly push new versions of Tesseract to Debian?

@stweil
Member

stweil commented Feb 28, 2022

I'd like to create a new release Tesseract 5.1.0 soon. Originally I had planned it for end of January.

Are any contributions or important bug fixes that should be included still pending (then I'd wait), or can we release now?

@Shreeshrii
Collaborator

I suggest you go ahead with 5.1.0 now.

I would like to see improvements related to training and evaluation implemented, but they could go in a future release.

@stweil
Member

stweil commented Mar 1, 2022

Release 5.1.0 is now available.

@AlexanderP

@amitdo, I have no rights to upload to Debian.

@stweil
Member

stweil commented May 29, 2022

There are now several fixes and improvements in git master, so I think it's time for a new release 5.1.1.

@egorpugin, is it possible to fix the CI sw build which is currently failing?

Are any other contributions or important bug fixes that should be included still pending (then I'd wait), or can we release now? Ideally #3782 should also be included.

@egorpugin
Contributor

Yes, I'll check.

@zdenop
Contributor

zdenop commented Jun 1, 2022

Unfortunately, the Windows build does not work (for me): I tried Clang (14) and MS Visual Studio (2019). Here are the logs:
clang_build.zip
msvc_build.zip

@amitdo
Collaborator Author

amitdo commented Jun 1, 2022

The cmake-win64 action has been failing since March 29.

The cmake and vcpkg actions pass.

@egorpugin
Contributor

I fixed the sw build in CI.
Zdenko, does it fail only on VS2019? Can you check VS2022?

@zdenop
Contributor

zdenop commented Jun 1, 2022

The cmake-win64 action has a strange error: it already fails when unzipping zlib (or maybe even earlier, while setting up the shell?)

image

And vcpkg is IMO not building HEAD, but 5.1.0:

image

And I see this with HEAD:

image

@zdenop
Contributor

zdenop commented Jun 1, 2022

@egorpugin: VS2019 is quite heavily used. I would suggest supporting it with the next release...

@kloczek

kloczek commented Jun 11, 2024

5.4.1 has one issue. It uses a bundled googletest that is included in the source tree as a submodule.
Why not use the system-installed gtest? 🤔 All distros provide gtest.

@amitdo
Collaborator Author

amitdo commented Jun 11, 2024

@kloczek,

It's not the right place to discuss the gtest issue.

The gtest issue is not new and we discussed it in the past in #2838 and #3679.

@stweil
Member

stweil commented Oct 17, 2024

It's time for a new bug fix release. Is there anything urgent which should be included or fixed in the next release?

@zdenop
Contributor

zdenop commented Oct 17, 2024

I am in the process of creating cmake files with autotools (leptonica has it already). This is not critical, but it is taking more time than I expected...

@stweil
Member

stweil commented Oct 21, 2024

... and it currently breaks the autotools builds.

@zdenop
Contributor

zdenop commented Oct 22, 2024

This is an unrelated topic, as cmake generates tesseract.pc from another template (tesseract.pc.cmake). Maybe the two could be unified, but that is not a topic for now.

@amitdo
Collaborator Author

amitdo commented Oct 22, 2024

@stweil, please go ahead with a new release.

@stweil
Member

stweil commented Oct 22, 2024

I'll try to fix the CI failures before tagging a new release.

@egorpugin
Contributor

I've checked this issue
https://github.com/tesseract-ocr/tesseract/pull/4330/files

TessBaseAPI::GetIterator() and some other methods (like GetUTF8Text()) return raw memory.
It would be nice if they returned unique_ptr<T> instead.
That way, the return type itself states who owns the returned object, instead of leaving it to a mention in the documentation.

I propose to improve memory management of the public APIs in tess v6, because it is an API break.

In addition, the C API implementation would be updated from

TessResultIterator *TessBaseAPIGetIterator(TessBaseAPI *handle) {
  return handle->GetIterator();
}

to

TessResultIterator *TessBaseAPIGetIterator(TessBaseAPI *handle) {
  return handle->GetIterator().release();
}

So the C API will remain the same.
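As a rough illustration of the proposal above, here is a minimal, self-contained sketch. `Api`, `ResultIterator`, `ApiGetIterator`, and `ResultIteratorDelete` are hypothetical stand-ins invented for this example; they are not the real Tesseract declarations, and the actual v6 design may look different. The sketch only shows the pattern: the C++ method returns `std::unique_ptr`, and the C-style wrapper hands the raw pointer to the caller with `release()`.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an iterator type (NOT the real Tesseract class).
struct ResultIterator {
  std::string text;
};

// Hypothetical stand-in for the C++ API class.
class Api {
public:
  // The return type documents ownership: the caller owns the object.
  std::unique_ptr<ResultIterator> GetIterator() const {
    auto it = std::make_unique<ResultIterator>();
    it->text = "hello";
    return it;
  }
};

// C-style wrappers keep their raw-pointer signatures.
extern "C" {
ResultIterator *ApiGetIterator(Api *handle) {
  // release() gives up ownership, so the C caller must delete the object.
  return handle->GetIterator().release();
}

void ResultIteratorDelete(ResultIterator *it) {
  delete it;
}
}

// Usage: C++ callers get RAII, C-style callers free explicitly.
void usage_example() {
  Api api;
  auto owned = api.GetIterator();  // freed automatically at scope exit
  assert(owned != nullptr);

  ResultIterator *raw = ApiGetIterator(&api);
  assert(raw != nullptr);
  ResultIteratorDelete(raw);       // the caller is responsible for this
}
```

With this shape, forgetting to free the object is only possible on the C side, where the signature stays as it is today.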


So,

  1. How and when do we want API breaking changes?
  2. What other public API/ABI changes do we want? Do we need a tracking issue for it? Do we have one already?
  3. I think we have enough 5.x.x releases already, maybe switch master branch to v6 and create separate v5 branch for small fixes?

@stweil
Member

stweil commented Oct 28, 2024

I just added #4336, and we can discuss and track API changes there.

@amitdo
Collaborator Author

amitdo commented Nov 8, 2024

@stweil, when do you plan to make a new release?

@stweil
Member

stweil commented Nov 8, 2024

Soon (this weekend) unless there is something open or missing which requires more time. The new release will contain enough changes to justify the move to 5.5.0.

@stweil
Member

stweil commented Nov 12, 2024

Meanwhile release 5.5.0 is available. Thank you to everybody who contributed in any way.

@tesseract-ocr tesseract-ocr deleted a comment from alfredordgzs Mar 13, 2025
@stweil
Member

stweil commented Apr 27, 2025

I think it's time for a new release 5.5.1, maybe on 2025-05-01.
Are there important things which are still missing for the new release?

@AlexanderP

Hi.
Tesseract-OCR from Git does not compile with "-Werror=format-security".

tesseract_5.5.0+git6543-d6805c26-1_amd64.log

@egorpugin
Contributor

egorpugin commented Apr 29, 2025

@AlexanderP, what is your compiler and its version?

@AlexanderP

AlexanderP commented Apr 30, 2025

@egorpugin

Debian Sid

g++ -v 

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-linux-gnu/14/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 14.2.0-19' --with-bugurl=file:///usr/share/doc/gcc-14/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2,rust --prefix=/usr --with-gcc-major-version-only --program-suffix=-14 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-libstdcxx-backtrace --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/reproducible-path/gcc-14-14.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/reproducible-path/gcc-14-14.2.0/debian/tmp-gcn/usr --enable-offload-defaulted --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=3
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.2.0 (Debian 14.2.0-19)

@egorpugin
Contributor

egorpugin commented Apr 30, 2025

@AlexanderP

Can you try replacing this function with the following and see whether the error is still there, please?
https://github.com/tesseract-ocr/tesseract/blob/main/src/ccutil/tprintf.h#L35

template <typename ... Types> auto tprintf(const char *fmt, Types && ... args) {
  return fprintf(get_debugfp(), fmt, std::forward<Types>(args)...);
}

@egorpugin
Contributor

Reverted to va_args in 4a39a49.
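For readers unfamiliar with the va_args approach, the following is a minimal sketch of a printf-style debug function built on `va_list`. It is not the code from commit 4a39a49: `debug_printf` and the `PRINTF_LIKE` macro are hypothetical names made up for this example. The likely reason the template version trips `-Wformat-security` is that, with an empty argument pack, it expands to `fprintf(fp, fmt)` with a non-literal format and no arguments; that reading is an assumption, since the build log is not reproduced in the thread.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>

// On GCC and Clang, the format attribute lets the compiler check each
// call site's arguments against its format string. Other compilers get
// an empty macro.
#if defined(__GNUC__)
#  define PRINTF_LIKE(fmt_index, first_arg) \
     __attribute__((format(printf, fmt_index, first_arg)))
#else
#  define PRINTF_LIKE(fmt_index, first_arg)
#endif

// Hypothetical debug print function; the output stream is passed in
// explicitly to keep the sketch self-contained.
int debug_printf(FILE *out, const char *fmt, ...) PRINTF_LIKE(2, 3);

int debug_printf(FILE *out, const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  // vfprintf consumes the va_list; there is no fprintf call with a
  // non-literal format and zero arguments inside this function.
  int written = std::vfprintf(out, fmt, args);
  va_end(args);
  return written;
}

void usage_example() {
  const char *user_text = "100% done";
  // Pass variable text as an argument, never as the format string itself.
  int n = debug_printf(stderr, "%s\n", user_text);
  assert(n == 10);  // 9 characters plus the newline
}
```

Call sites that pass a string literal as the format are still checked by the compiler through the attribute, which is what the template version could not offer.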

@davidecavestro

Do you plan to provide an official RPM for 5.5.1?
I could be wrong, but it seems that RPMs from @AlexanderP are no longer available for previous versions.

PS: I just posted on the group yesterday, but I'm also writing here in the hope that Alexander will see it.

@AlexanderP

AlexanderP commented May 8, 2025

@davidecavestro

davidecavestro commented May 9, 2025

Thank you very much @AlexanderP!
Since I also need the RPMs for the old CentOS 8 (for use on UBI8), would you mind adding them, or even sharing the old CentOS 8 RPM sources so we can build them ourselves?
Alternatively, I would start from the sources of the rpms you just released.

@amitdo
Collaborator Author

amitdo commented May 16, 2025

@stweil, please make a new release.

@davidecavestro

> @davidecavestro I will build 5.5.1 for openSUSE and Fedora.
>
> Created two projects: https://build.opensuse.org/project/show/home:Alexander_Pozdnyakov:Fedora https://build.opensuse.org/project/show/home:Alexander_Pozdnyakov:SUSE
>
> The appimage will also be created. https://github.com/AlexanderP/tesseract-appimage/releases/tag/v5.5.0

@AlexanderP, I saw that in the meantime you published a repo for CentOS 8, but tesseract-common depends on tesseract-langpack-eng, which does not seem to be available.

Image

@stweil
Member

stweil commented May 23, 2025

> @stweil, please make a new release.

Yes, I want to do this next Sunday. To simplify the release process, I suggest that ChangeLog no longer track all changes but only link to the release notes on GitHub. Or should we remove this file?

@amitdo
Collaborator Author

amitdo commented May 25, 2025

I agree with the suggested change (link to the release notes or removal of the whole file).

@stweil
Member

stweil commented May 25, 2025

The new release 5.5.1 is now available. Thank you to everyone who contributed to it by reporting issues, providing pull requests, testing, offering advice, and participating in discussions.
