Skip to content

Extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR – tesseract, tesseract4 or gvision (Google Cloud Vision). searches for regex in the result using a YAML-based template system saves results as CSV, JSON or XML or renames PDF files to match the content.

Notifications You must be signed in to change notification settings

Taher5253/Invoicepdf2Data

Repository files navigation

Data extractor for PDF invoices

extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR – tesseract, tesseract4 or gvision (Google Cloud Vision).
searches for regex in the result using a YAML-based template system
saves results as CSV, JSON or XML or renames PDF files to match the content.

INSTALLATION OF VIRTUAL ENVIRONMENT AND FLASK:

commands to install virtual environment:

sudo apt-get install python3-pip
sudo pip3 install virtualenv
virtualenv venv
source venv/bin/activate
pip install Flask
pip install pymongo
pip install invoice2data


AFTER INSTALLATION GO TO PROJECT DIRECTORY:

eg:-
(venv) taher@ubuntu:~/projects/invoice_reader_ai$


RUN THE FOLLOWING COMMAND FROM PROJECT DIRECTORY:
export FLASK_APP=pdfinvoice2data.py
flask run

GO TO YOUR BROWSER RUN THE PROJECT:
http://127.0.0.1:5000/


Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

About

Extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR – tesseract, tesseract4 or gvision (Google Cloud Vision). searches for regex in the result using a YAML-based template system saves results as CSV, JSON or XML or renames PDF files to match the content.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages