pdfocr - A Spring Boot App

This application converts image-based (scanned) PDFs into text-embedded PDFs. It uses Tesseract OCR for text extraction and Apache PDFBox for PDF manipulation, making scanned documents searchable and selectable.

Features

• Utilizes Tesseract OCR to accurately extract text from images within PDF files.

• Ensures the output PDF retains the original layout and formatting while embedding the extracted text.

• Multi-threading support for processing multiple pages in parallel.
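
The parallel page processing mentioned above could be organised roughly as in the following sketch. It is not taken from the pdfocr source: the class name `ParallelPagesSketch`, the method `processPages`, and the placeholder page task are all hypothetical, and the real application may schedule its work differently. The sketch submits one task per page to a fixed-size thread pool and collects the results in page order, which matters when the pages are later written back into a single PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

/** Hypothetical sketch: run a per-page task on a thread pool, keeping page order. */
final class ParallelPagesSketch {

    /**
     * Applies pageTask to the page indices 0..pageCount-1 in parallel and
     * returns the results in page order.
     */
    static <T> List<T> processPages(int pageCount, int threads, IntFunction<T> pageTask) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; the futures list preserves page order.
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < pageCount; i++) {
                final int page = i;
                Callable<T> task = () -> pageTask.apply(page);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order, regardless of completion order.
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until this page is finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a page task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder for OCR: a real task would render the page and run the OCR engine on it.
        List<String> texts = processPages(4, 2, page -> "text of page " + (page + 1));
        texts.forEach(System.out::println);
    }
}
```

Collecting the futures in submission order keeps the output deterministic even though pages may finish out of order. Whether the OCR engine itself can be shared between threads is a separate question that this sketch does not address.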

Requirements / Dependencies

• JDK - Java Development Kit.

• Tess4J library for performing OCR.

• Apache PDFBox library for PDF manipulation.

• Gradle - a build automation tool.

Usage

Clone the repository by using the command:
```
git clone https://github.com/darkn3to/pdfocr.git
```
Alternatively, download the ZIP file from the Code dropdown on the repository page.

Navigate to the cloned directory and run the command:
```
./gradlew clean shadowJar
```

Run the jar file using:

java -jar app/build/libs/pdf_ocr-1.0-all.jar <source_file_path> <dest_file_path>

(Optional) Pass the 'm' flag as a third argument to enable the multi-threading functionality:
```
java -jar app/build/libs/pdf_ocr-1.0-all.jar <source_file_path> <dest_file_path> m
```
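
The argument convention above (source path, destination path, optional 'm') can be handled with a few lines of Java. The following is a minimal sketch, not the application's actual code: the class `CliArgsSketch`, the `Options` holder and its field names are hypothetical, and rejecting an unrecognised third argument is an assumption, since the README does not say how the application treats one.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of parsing: <source_file_path> <dest_file_path> [m] */
final class CliArgsSketch {

    /** Parsed command-line options; the field names are illustrative. */
    static final class Options {
        final Path source;
        final Path dest;
        final boolean multiThreaded;

        Options(Path source, Path dest, boolean multiThreaded) {
            this.source = source;
            this.dest = dest;
            this.multiThreaded = multiThreaded;
        }
    }

    static Options parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException(
                    "usage: <source_file_path> <dest_file_path> [m]");
        }
        boolean multiThreaded = false;
        if (args.length == 3) {
            // Assumption: anything other than "m" is treated as an error.
            if (!"m".equals(args[2])) {
                throw new IllegalArgumentException("unknown flag: " + args[2]);
            }
            multiThreaded = true;
        }
        return new Options(Paths.get(args[0]), Paths.get(args[1]), multiThreaded);
    }

    public static void main(String[] args) {
        Options options = parse(args);
        System.out.println("source: " + options.source);
        System.out.println("dest: " + options.dest);
        System.out.println("multi-threaded: " + options.multiThreaded);
    }
}
```

Validating the argument count up front gives a usage message instead of an `ArrayIndexOutOfBoundsException` when a path is missing.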

Packaged Binaries

You may download the application from the 'Releases' tab. pdfocr.exe is a command-line application: navigate to the directory containing the .exe file and run pdfocr <source_file_path> <dest_file_path>, or pdfocr <source_file_path> <dest_file_path> m to use multi-threading.

NOTE:

Please ensure that tessdata is installed on the C: drive, or place it in the build/libs folder if you want to use your own tessdata with other languages included. If you want to use the OpenCV-based image processing, also ensure that the OpenCV wrapper for Java is installed.
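
Choosing between several possible tessdata locations, as described in the note above, might look like the sketch below. It is an illustration only: the class `TessdataLocatorSketch` and the method `firstExistingDir` are hypothetical, the candidate directories are taken from the command line rather than hard-coded, and the order in which the real application searches is not documented here. The chosen directory could then be handed to the OCR engine, for example through Tess4J's `setDatapath`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: pick the first candidate tessdata directory that exists. */
final class TessdataLocatorSketch {

    /** Returns the first candidate that exists and is a directory, if any. */
    static Optional<Path> firstExistingDir(List<Path> candidates) {
        for (Path candidate : candidates) {
            if (Files.isDirectory(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Candidate directories come from the command line, in priority order.
        List<Path> candidates = new ArrayList<>();
        for (String arg : args) {
            candidates.add(Paths.get(arg));
        }
        Optional<Path> found = firstExistingDir(candidates);
        if (found.isPresent()) {
            System.out.println("Using tessdata at: " + found.get());
        } else {
            System.out.println("No tessdata directory found among the candidates.");
        }
    }
}
```

Failing early with a clear message when no directory is found is usually easier to diagnose than an error raised later from inside the OCR engine.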

