Deep Learning Based Document Layout Analysis

What is this?


This is a student project of ENSTA Paris in partnership with BNP Paribas focused on Document Layout Analysis using state-of-the-art AI tools.

This work aims to build an integrated pipeline around the task of classifying entities in documents using Deep Learning models. The pipeline covers extracting features from the datasets we use (DocLayNet, DocBank), pre-processing the data, training and evaluating the models, and finally running inference with the trained models. The same pipeline can be reused to fine-tune other models in the Document AI domain.

Additionally, we conducted a study on how best to classify entities that often add noise during document classification, such as headers and footers. To this end, we tested different data labeling techniques and compared them to determine the most effective methodology.


Contributions


The main activities carried out during this project include:

  • Development of feature extraction scripts from datasets such as DocLayNet and DocBank.
  • Implementation of Mask R-CNN, LayoutLM, LayoutLMv3 and DiT models.
  • Development of training, testing and evaluation scripts for the aforementioned models.
  • Construction of a pipeline that integrates the different fronts of this project, from data pre-processing to inferences with trained models.
  • A study on the ideal training approach for the optimal classification of noisy entities (such as headers and footers).

How do I run this code?


1. Data and Requirements

First, install all the requirements. Then download DocLayNet; both the core and extra datasets are needed. Place their contents inside /data/raw/DocLayNet as follows (a small layout check is sketched after the tree):


*
├── COCO
│   ├── test.json
│   ├── train.json
│   └── val.json
├── PNG
│   ├── <hash>.png
│   ├── ...
├── JSON
│   ├── <hash>.json
│   ├── ...
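
As a quick sanity check, here is a minimal Python sketch that verifies this layout. The paths and expected files are assumptions taken from the tree above, not from the project's own scripts.

# Minimal layout check for the DocLayNet download (paths assumed from the tree above).
from pathlib import Path

root = Path("data/raw/DocLayNet")
expected = {
    "COCO": ["train.json", "val.json", "test.json"],  # core annotations
    "PNG": [],   # page images, one <hash>.png per page
    "JSON": [],  # extra per-page annotations from the "extra" dataset
}

for folder, files in expected.items():
    folder_path = root / folder
    if not folder_path.is_dir():
        raise FileNotFoundError(f"Missing folder: {folder_path}")
    for name in files:
        if not (folder_path / name).is_file():
            raise FileNotFoundError(f"Missing file: {folder_path / name}")

print("DocLayNet layout looks as expected.")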

2. Run run.py

This is the project's main script. It lets you browse the tools and models implemented here without having to run the base scripts directly (which is, of course, also an option).

Running this script will give the user 3 options:

  • Feature Extraction: Prepares the dataset from its raw version into the structure best suited for model input. You can choose a dataset adapted for vision-only models (Mask R-CNN) or for multimodal models (of the LayoutLM type), and select among 3 different approaches for noise labeling (default, binary and all). The underlying tools can be found in /src/features, where the user has finer-grained control.
  • Training: Lets you train the 3 models implemented here and select parameters such as the percentage of training data to be used and the number of epochs. Additionally, for models of the LayoutLM type, you can push the trained model to the HuggingFace Hub (see the first sketch after this list). The original scripts can be found in /src/models, and the models are saved locally in /models.
  • Evaluation: Provides tools for analyzing your trained models (or similar models available on the HuggingFace Hub), for example by running inference on page images (see the second sketch after this list). You can select the percentage of the test data to be used. At the end, graphs and statistics on the models' scores are provided. These scripts can also be found in /src/models.
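
For the Training option, pushing a fine-tuned LayoutLM-type model to the HuggingFace Hub uses the standard transformers API. The sketch below is illustrative only: the local model directory and the Hub repository id are hypothetical, and you need to be logged in (for example via huggingface-cli login) beforehand.

# Illustrative only: the local directory and the repository id are hypothetical.
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

model = LayoutLMv3ForTokenClassification.from_pretrained("models/layoutlmv3-doclaynet")
processor = LayoutLMv3Processor.from_pretrained("models/layoutlmv3-doclaynet")

# Upload both the model weights and the processor configuration to the Hub.
model.push_to_hub("your-username/layoutlmv3-doclaynet")
processor.push_to_hub("your-username/layoutlmv3-doclaynet")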
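For the Evaluation option, inference with a fine-tuned LayoutLMv3 token-classification model follows the usual transformers pattern. Again, this is a sketch under assumptions: the checkpoint id and the image path are hypothetical, and the built-in OCR path (apply_ocr=True) requires Tesseract to be installed locally.

# Sketch only: the checkpoint id and the image path are hypothetical.
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

model_id = "your-username/layoutlmv3-doclaynet"
processor = LayoutLMv3Processor.from_pretrained(model_id, apply_ocr=True)  # OCR via Tesseract
model = LayoutLMv3ForTokenClassification.from_pretrained(model_id)

image = Image.open("data/raw/DocLayNet/PNG/example.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)

# One predicted class per token (including special tokens).
predicted_ids = outputs.logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[i] for i in predicted_ids]
print(labels[:20])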

Authors

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. 
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── train_scripts.py
│   │   └── evaluation_scripts.py
│   │
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience
