A simple Spring Boot application to convert image-based PDFs to text-embedded PDFs.

darkn3to/pdfocr

pdfocr - A Spring Boot App

This application converts image-based PDFs into text-embedded PDFs. It uses Tesseract OCR for text extraction and Apache PDFBox for PDF manipulation, making scanned documents searchable and selectable.

Features

• Utilizes Tesseract OCR to accurately extract text from images within PDF files.

• Ensures the output PDF retains the original layout and formatting while embedding the extracted text.

• Multi-threading support for processing multiple pages in parallel.
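
The parallel page processing mentioned in the feature list could be structured roughly as follows. This is a minimal sketch using only the JDK, not the project's actual code: `ParallelPages`, `processPages`, and the `ocr` function are hypothetical names, and the lambda in `main` is a stand-in for a real per-page OCR call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: not part of pdfocr, shown only to illustrate the idea.
class ParallelPages {

    // Runs `ocr` once per page index on a fixed-size thread pool and
    // returns the results in page order, regardless of completion order.
    static List<String> processPages(int pageCount, int threads, Function<Integer, String> ocr) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < pageCount; i++) {
                final int page = i;
                Callable<String> task = () -> ocr.apply(page);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    // Futures are read in submission order, so page order is preserved.
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for a page", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("Page task failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder for the real OCR step (assumption: one independent task per page).
        List<String> text = processPages(4, 2, page -> "text of page " + page);
        text.forEach(System.out::println);
    }
}
```

Collecting the futures in submission order keeps the output pages in their original sequence even when later pages finish first.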

Requirements / Dependencies

• JDK (Java Development Kit).

• Tess4J library for performing OCR.

• Apache PDFBox library for PDF manipulation.

• Gradle, a build automation tool.

Usage

  1. Clone the repository by using the command:

    git clone https://github.com/darkn3to/pdfocr.git

    or simply download the zip file from the code dropdown button above.

  2. Navigate to the cloned directory.

  3. Build the runnable JAR (the shadowJar task bundles the dependencies into it):

    ./gradlew clean shadowJar

  4. Run the JAR file using:

    java -jar app/build/libs/pdf_ocr-1.0-all.jar <source_file_path> <dest_file_path>

  5. (Optional) Pass 'm' as a third argument to enable the multi-threading functionality:

    java -jar app/build/libs/pdf_ocr-1.0-all.jar <source_file_path> <dest_file_path> m
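
The argument layout shown above (source path, destination path, optional 'm') could be parsed along these lines. This is an illustrative sketch, not the application's real entry point: the `CliArgs` class and its fields are hypothetical, and rejecting any third argument other than 'm' is an assumption about the desired behavior.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical holder for the three command-line values described in the usage steps.
class CliArgs {
    final Path source;
    final Path dest;
    final boolean multiThreaded;

    private CliArgs(Path source, Path dest, boolean multiThreaded) {
        this.source = source;
        this.dest = dest;
        this.multiThreaded = multiThreaded;
    }

    // Accepts: <source_file_path> <dest_file_path> [m]
    static CliArgs parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException(
                "Usage: <source_file_path> <dest_file_path> [m]");
        }
        boolean multi = false;
        if (args.length == 3) {
            // Assumption: anything other than "m" in third position is an error.
            if (!"m".equals(args[2])) {
                throw new IllegalArgumentException("Unknown flag: " + args[2]);
            }
            multi = true;
        }
        return new CliArgs(Paths.get(args[0]), Paths.get(args[1]), multi);
    }

    public static void main(String[] args) {
        CliArgs parsed = parse(new String[] {"scan.pdf", "scan-ocr.pdf", "m"});
        System.out.println(parsed.source + " -> " + parsed.dest
            + " (multi-threaded: " + parsed.multiThreaded + ")");
    }
}
```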

Packaged Binaries

You may download the application from the 'Releases' tab. pdfocr.exe is a command-line application. To run it, navigate to the directory containing the .exe file and run:

    pdfocr <source_file_path> <dest_file_path>

    or, to enable multi-threading:

    pdfocr <source_file_path> <dest_file_path> m

NOTE:

Please ensure that tessdata is installed on the C: drive, or place your own tessdata in the build/libs folder if you want to use additional languages. If you want OpenCV-based image processing, the OpenCV wrapper for Java must also be installed.
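
One way to check for a tessdata directory before running OCR is sketched below. It is a hedged illustration rather than the application's actual lookup logic: `TessdataLocator` and `firstExistingDir` are hypothetical names, and the two candidate paths in `main` are assumptions based on the note above, since the exact locations the application searches are not documented here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: picks the first candidate path that is an existing directory.
class TessdataLocator {

    static Optional<Path> firstExistingDir(List<Path> candidates) {
        return candidates.stream()
            .filter(Files::isDirectory)
            .findFirst();
    }

    public static void main(String[] args) {
        // Assumed locations; adjust to wherever tessdata actually lives.
        List<Path> candidates = List.of(
            Paths.get("C:/tessdata"),
            Paths.get("build/libs/tessdata"));

        Optional<Path> tessdata = firstExistingDir(candidates);
        if (tessdata.isPresent()) {
            System.out.println("Using tessdata at " + tessdata.get());
        } else {
            System.out.println("No tessdata directory found in " + candidates);
        }
    }
}
```

With Tess4J, a directory found this way could then be handed to `Tesseract.setDatapath`.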
