This application converts image-based PDFs into text-embedded PDFs. It uses Tesseract OCR for text extraction and Apache PDFBox for PDF manipulation, making scanned documents searchable and selectable.
• Utilizes Tesseract OCR to accurately extract text from images within PDF files.
• Ensures the output PDF retains the original layout and formatting while embedding the extracted text.
• Supports multi-threading for processing multiple pages in parallel.
• JDK (Java Development Kit).
• Tess4J library for performing OCR.
• PDFBox library for PDF manipulation.
• Gradle, a build automation tool.
- Clone the repository by using the command:
git clone https://github.com/darkn3to/pdfocr.git
or simply download the zip file from the code dropdown button above.
- Navigate to the cloned directory.
- Run the command:
./gradlew clean shadowJar
- Run the jar file using:
java -jar app/build/libs/pdf_ocr-1.0-all.jar <source_file_path> <dest_file_path>
- (Optional) Provide the 'm' flag as a third parameter to use the multi-threading functionality:
java -jar app/build/libs/pdf_ocr-1.0-all.jar <source_file_path> <dest_file_path> m
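To give a rough idea of what processing pages in parallel can look like, here is a minimal Java sketch using only the JDK. It is not taken from the pdfocr source: the class name `ParallelPages`, the method `processAll`, and the `pageTask` callback (standing in for whatever OCR work is done per page) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical helper for illustration; not taken from the pdfocr source.
final class ParallelPages {

    // Runs pageTask for page indices 0..pageCount-1 on a fixed-size thread pool
    // and returns the results in page order.
    static List<String> processAll(int pageCount, int threads, IntFunction<String> pageTask) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < pageCount; i++) {
                final int page = i;
                // pageTask is a placeholder for the per-page OCR step (an assumption).
                Callable<String> job = () -> pageTask.apply(page);
                futures.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // waits for this page to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a page task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the output in page order even when pages finish out of order, which matters when the recognized text has to be written back to the matching page.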
You may also download the application from the 'Releases' tab. pdfocr.exe is a command-line application; navigate to the directory containing the .exe file and run:
pdfocr <source_file_path> <dest_file_path>
or, to enable multi-threading:
pdfocr <source_file_path> <dest_file_path> m
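Both the jar and the .exe take the same positional arguments: a source path, a destination path, and an optional 'm'. As a sketch of how such arguments might be handled in Java, the hypothetical `CliArgs` class below parses them; the class, its field names, and the choice to reject any third argument other than 'm' are assumptions, not the application's actual behavior.

```java
// Hypothetical argument holder for illustration; the real pdfocr entry point may differ.
final class CliArgs {
    final String source;
    final String dest;
    final boolean multiThreaded;

    private CliArgs(String source, String dest, boolean multiThreaded) {
        this.source = source;
        this.dest = dest;
        this.multiThreaded = multiThreaded;
    }

    // Expects: <source_file_path> <dest_file_path> [m]
    static CliArgs parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException(
                    "usage: pdfocr <source_file_path> <dest_file_path> [m]");
        }
        boolean multi = false;
        if (args.length == 3) {
            // Assumption: anything other than "m" is treated as an error.
            if (!"m".equals(args[2])) {
                throw new IllegalArgumentException("unknown flag: " + args[2]);
            }
            multi = true;
        }
        return new CliArgs(args[0], args[1], multi);
    }
}
```

Failing fast on a malformed command line gives the user a usage message before any OCR work starts.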
Ensure that tessdata is installed on the C: drive, or place it in the build/libs folder if you want to use your own tessdata with other languages included. To use OpenCV-based image processing, also ensure that the OpenCV wrapper for Java is installed.
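Since tessdata can live in more than one place, a lookup that tries several directories in order may be useful. The sketch below is one way to do that in plain Java; `TessdataLocator` and `firstExisting` are hypothetical names, and the candidate paths are supplied by the caller because the exact directories the application checks are not stated here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; not taken from the pdfocr source.
final class TessdataLocator {

    // Returns the first candidate that exists and is a directory, if any.
    static Optional<Path> firstExisting(List<Path> candidates) {
        for (Path candidate : candidates) {
            if (Files.isDirectory(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

A caller could pass, for example, a tessdata folder under build/libs followed by one on the C: drive (both paths being assumptions) and hand the result to Tess4J's `setDatapath`.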