Zone2OCR is a tool for document layout analysis. It maps a set of zones generated by a segmentation algorithm (e.g., dhSegment) to the regions generated by an OCR engine.
- Clone this repository
- Install Anaconda or Miniconda (installation procedure)
- Create a virtual environment and activate it
conda create -n <ENV_NAME> python=3.6
conda activate <ENV_NAME>
(Optional) To run the segmentation algorithm (dhSegment) pretrained on ImageNet + Europeana historical Newspaper Project, first install TensorFlow 1.13 with
# For CPU
conda install -c conda-forge tensorflow=1.13
# For GPU
conda install tensorflow-gpu=1.13.1
and then install dhSegment dependencies with
pip install ./dhsegment/.
- Install Zone2OCR dependencies with
pip install .
- Prepare a valid file structure as shown below. (Note: every segmentation result XML file should have a matching OCR XML file.)
.root
├── zone_xmls # segmentation results
│ ├── image1.xml
│ ├── ...
│ └── image8.xml
├── ocr_xmls # OCR results
│ ├── image1.xml
│ ├── ...
│ └── image8.xml
└── images # (optional) images for visual inspection
    ├── image1.jpg
    ├── ...
    └── image8.jpg
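Before running the mapping, it can help to confirm that the two XML folders really do pair up. The following is a minimal sketch, assuming that "matching" means identical file names in `zone_xmls` and `ocr_xmls` (as in the tree above); `unmatched_xmls` is a hypothetical helper and is not part of Zone2OCR.

```python
from pathlib import Path


def unmatched_xmls(zone_dir, ocr_dir):
    """Return (zone-only, OCR-only) XML file names, each as a sorted list.

    Assumption: a segmentation file and an OCR file "match" when they
    share the same file name. This helper is illustrative only.
    """
    zone_names = {p.name for p in Path(zone_dir).glob("*.xml")}
    ocr_names = {p.name for p in Path(ocr_dir).glob("*.xml")}
    return sorted(zone_names - ocr_names), sorted(ocr_names - zone_names)
```

Calling `unmatched_xmls("root/zone_xmls", "root/ocr_xmls")` returns two empty lists when every file has a counterpart.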
(Optional) Run the pretrained dhSegment to produce the segmentation result XML files
python run_segmentation.py -i <IMAGE_DIR> -s <SAVE_DIR> [-t <SMALL_REGION_THRESHOLD>] [-v (True|False)]
-i : The path to the folder containing the images to be processed
-s : The path to the folder in which to store the output XML files
-t : (Optional) A threshold on the area(zone)/area(full_page) ratio, in [0,1], for ignoring small zones (default: 0.005)
-v : (Optional) Increase output verbosity (default: False)
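To illustrate what the `-t` ratio means, here is a minimal sketch of filtering zones by area(zone)/area(full_page). It is not the code in `run_segmentation.py`: the function names are hypothetical, zones are assumed to be lists of `[x, y]` corner points, and whether a zone exactly at the threshold is kept is also an assumption.

```python
def polygon_area(points):
    """Area of a simple polygon given as [[x, y], ...] (shoelace formula)."""
    pts = list(points)
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def keep_large_zones(zones, page_width, page_height, threshold=0.005):
    """Drop zones whose area ratio to the full page is below `threshold`.

    Assumption: a zone exactly at the threshold is kept.
    """
    page_area = float(page_width * page_height)
    return [z for z in zones if polygon_area(z) / page_area >= threshold]
```

With the default of 0.005, a zone must cover at least 0.5% of the page to survive; on a 1000 x 1000 page that is 5000 square pixels.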
- Run mapping
python zone2ocr.py -zx <ZONE_XML_DIR> -ox <OCR_XML_DIR> [-t <IOU_THRESHOLD>] -s <SAVE_DIR> [-v (True|False)]
-zx : The path to the folder containing the segmentation result XML files
-ox : The path to the folder containing the OCR XML files
-t : (Optional) An intersection-over-union threshold, in [0,1], for ignoring small zones (default: 0.1)
-s : The path to the folder in which to store the output JSON file
-v : (Optional) Increase output verbosity (default: False)
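The intersection-over-union (IoU) threshold can be pictured with the sketch below. It assumes each zone and OCR region is a list of four `[x, y]` points and, as a simplification, compares their axis-aligned bounding boxes; `zone2ocr.py` may compute the overlap differently (for example, on the polygons themselves). All function names here are hypothetical.

```python
def bounding_box(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a point list."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def box_iou(zone, region):
    """IoU of the bounding boxes of two point lists (0.0 if they are disjoint)."""
    ax1, ay1, ax2, ay2 = bounding_box(zone)
    bx1, by1, bx2, by2 = bounding_box(region)
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_regions(zone, ocr_regions, threshold=0.1):
    """Keep the OCR regions whose IoU with `zone` reaches the threshold (assumed inclusive)."""
    return [r for r in ocr_regions if box_iou(zone, r) >= threshold]
```

For two 10 x 10 boxes that overlap by half their width, the intersection is 50 and the union is 150, so the IoU is 1/3 and the pair passes the default threshold of 0.1.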
- Both the segmentation result and the OCR XML files have to follow the PAGE XML schema
- The output JSON file follows the structure below:
[
    {
        "zone_coord" : [
            [x1,y1],[x2,y2],[x3,y3],[x4,y4]    // Found zone 1
        ],
        "zone_texts": [
            "text1",    // Matched OCR zone 1's text contents within the zone 1
            "text2",    // Matched OCR zone 2's text contents within the zone 1
            ...
        ],
        "ocr_coord" : [
            [
                [x1,y1],[x2,y2],[x3,y3],[x4,y4]    // Matched OCR zone 1
            ],
            [
                [x1',y1'],[x2',y2'],[x3',y3'],[x4',y4']    // Matched OCR zone 2
            ],
            ...
        ],
        "ocr_texts" : [
            "text1",    // Matched OCR zone 1's text contents
            "text2",    // Matched OCR zone 2's text contents
            ...
        ]
    },
    {
        ...    // Found zone 2
    },
    ...
]
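As an illustration of how this output might be consumed, the sketch below loads the JSON and summarizes each found zone. It assumes the file on disk is plain JSON (without the `//` annotations shown above) and uses only the four keys listed in the structure; `summarize_zones` is a hypothetical helper, not part of Zone2OCR.

```python
import json


def summarize_zones(json_text):
    """Return one summary dict per found zone in the Zone2OCR output.

    Assumption: `json_text` is valid JSON with the keys shown in the
    structure above ("zone_coord", "ocr_coord", "ocr_texts").
    """
    summary = []
    for zone in json.loads(json_text):
        summary.append({
            "zone_coord": zone["zone_coord"],
            # Number of OCR regions matched to this zone.
            "n_matches": len(zone.get("ocr_coord", [])),
            # Full text of the matched OCR regions, joined in listed order.
            "text": " ".join(zone.get("ocr_texts", [])),
        })
    return summary
```

In practice the text would come from a file in the `-s` output folder, read with `Path(...).read_text()` or `open(...)`.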
- Chulwoo Pack - University of Nebraska-Lincoln - email - cpack@cse.unl.edu
Main parts of dhSegment code are adapted from the work by Benoit Seguin and Sofia Ares Oliveira - DHLAB, EPFL - git - https://github.com/dhlab-epfl/dhSegment
This project is licensed under the GPL License - see the LICENSE file for details