Zone2OCR

Zone2OCR is a tool for document layout analysis. It maps a set of zones generated by a segmentation algorithm (e.g., dhSegment) to the regions generated by an OCR engine.

Installation

Clone this repository
Install Anaconda or Miniconda (installation procedure)
Create a virtual environment and activate it

conda create -n <ENV_NAME> python=3.6
conda activate <ENV_NAME>

(Optional) To run the segmentation algorithm (dhSegment) pretrained on ImageNet + Europeana historical Newspaper Project, first install TensorFlow 1.13 with
# For cpu
conda install -c conda-forge tensorflow=1.13
# For gpu
conda install tensorflow-gpu=1.13.1
and then install dhSegment dependencies with
pip install ./dhsegment/.

Install Zone2OCR dependencies with

pip install .

Usage

Prepare a file structure like the one below. (Note: each segmentation result XML file should match an OCR XML file, as with the identical file names shown here.)

.root
├── zone_xmls     # segmentation results
│   ├── image1.xml  
│   ├── ...
│   └── image8.xml
├── ocr_xmls      # OCR results
│   ├── image1.xml  
│   ├── ...
│   └── image8.xml
└── images        # (optional) images for visual inspection
    ├── image1.jpg  
    ├── ...
    └── image8.jpg
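
Before running the mapping, it can help to confirm that the two XML folders really do pair up. The following is a minimal sketch, not part of Zone2OCR: the helper name `unmatched_xmls` is hypothetical, and it assumes that "matching" means identical base file names, as in the tree above.

```python
from pathlib import Path


def unmatched_xmls(zone_dir, ocr_dir):
    """Return (zone-only, ocr-only) XML base names, each as a sorted list.

    Assumption: a zone file and an OCR file match when they share a base name.
    """
    zone_names = {p.stem for p in Path(zone_dir).glob("*.xml")}
    ocr_names = {p.stem for p in Path(ocr_dir).glob("*.xml")}
    return sorted(zone_names - ocr_names), sorted(ocr_names - zone_names)
```

Calling `unmatched_xmls("root/zone_xmls", "root/ocr_xmls")` returns two empty lists when every file has a counterpart.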

(Optional) Run pretrained dhSegment to collect segmentation result XML files

python run_segmentation.py -i <IMAGE_DIR> -s <SAVE_DIR> [-t <SMALL_REGION_THRESHOLD>] [-v (True|False)]

-i: The path to the folder containing the images to be processed
-s: The path to the folder in which to store the output XML files
-t: (Optional) A threshold on the area(zone)/area(full_page) ratio, in [0,1], for ignoring small zones (default: 0.005)
-v: (Optional) Increase output verbosity (default: False)
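
To illustrate what the -t ratio means, here is a rough sketch of an area-ratio filter. It is an assumption about how such a filter could work, not the code in run_segmentation.py: the function names are hypothetical, the polygon area uses the shoelace formula, and whether a zone exactly at the threshold is kept is a guess.

```python
def polygon_area(points):
    """Area of a simple polygon given as [(x, y), ...], via the shoelace formula."""
    pts = list(points)
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def keep_zone(points, page_width, page_height, threshold=0.005):
    """True if area(zone) / area(full_page) is at least `threshold` (assumed rule)."""
    return polygon_area(points) / (page_width * page_height) >= threshold
```

With the default threshold, a 10 x 10 pixel zone on a 1000 x 1000 page (ratio 0.0001) would be dropped, while a 100 x 100 zone (ratio 0.01) would be kept.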

Run mapping

python zone2OCR.py -zx <ZONE_XML_DIR> -ox <OCR_XML_DIR> [-t <IOU_THRESHOLD>] -s <SAVE_DIR> [-v (True|False)]

-zx: The path to the folder containing segmentation result xml files
-ox: The path to the folder containing OCR xml files
-t: (Optional) An intersection-over-union (IoU) threshold, in [0,1], for ignoring small zones (default: 0.1)
-s: The path to the folder to store output JSON file
-v: (Optional) Increase output verbosity (default: False)
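
For readers unfamiliar with intersection over union, the sketch below computes it for two four-point regions. It is a simplification and an assumption, not the implementation in zone2OCR.py: each region is reduced to its axis-aligned bounding box, and the names `bounding_box` and `iou` are hypothetical.

```python
def bounding_box(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of [(x, y), ...]."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def iou(points_a, points_b):
    """Intersection over union of the bounding boxes of two regions."""
    ax1, ay1, ax2, ay2 = bounding_box(points_a)
    bx1, by1, bx2, by2 = bounding_box(points_b)
    # Overlap width and height are clamped at zero for disjoint boxes.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 10 x 10 boxes that overlap by half share an area of 50 out of a union of 150, giving an IoU of 1/3, which is above the default threshold of 0.1.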

Remark

Both the segmentation result and OCR XML files must follow the PAGE XML schema
The output JSON file has the following structure:

[
  {
    "zone_coord" : [
      [x1,y1],[x2,y2],[x3,y3],[x4,y4]              // Found zone 1
    ],
    "zone_texts": [                               
      "text1",                                     // Matched OCR zone 1's text contents within the zone 1
      "text2",                                     // Matched OCR zone 2's text contents within the zone 1
      ...,
    ],
    "ocr_coord" : [
      [
        [x1,y1],[x2,y2],[x3,y3],[x4,y4]            // Matched OCR zone 1
      ],
      [
        [x1',y1'],[x2',y2'],[x3',y3'],[x4',y4']    // Matched OCR zone 2
      ],
      ...,
    ],
    "ocr_texts" : [
      "text1",                                     // Matched OCR zone 1's text contents
      "text2",                                     // Matched OCR zone 2's text contents
      ...,
    ]
  },
  {
    ...                                            // Found zone 2
  },
  ...
]
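
As a usage illustration, the output can be read back with the standard json module. This is a minimal sketch under two assumptions: the real file is plain JSON without the explanatory comments shown above, and the keys are exactly those listed. The function name `summarize_zones` and any file path passed to it are hypothetical.

```python
import json


def summarize_zones(json_path):
    """Return (zone index, number of matched OCR regions, joined OCR text) per zone."""
    with open(json_path, encoding="utf-8") as f:
        zones = json.load(f)
    return [
        (i, len(zone.get("ocr_coord", [])), " ".join(zone.get("ocr_texts", [])))
        for i, zone in enumerate(zones)
    ]
```

Each tuple gives a quick view of how many OCR regions were mapped to a zone and what text they carry.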

Authors

Chulwoo Pack - University of Nebraska-Lincoln - email - cpack@cse.unl.edu

Acknowledgements

Main parts of dhSegment code are adapted from the work by Benoit Seguin and Sofia Ares Oliveira - DHLAB, EPFL - git - https://github.com/dhlab-epfl/dhSegment

License

This project is licensed under the GPL License - see the LICENSE file for details
