🌱 Parsing as Sequence Labeling

Hi 👋 This is a Python implementation for sequence labeling algorithms (and others!) for Dependency, Constituency and Semantic Parsing.

Installation

To run this code, Python >=3.8 is required, although we recommend using Python >=3.11 on a GPU system with NVIDIA drivers (>=535) and CUDA (>=12.4) installed. To use the official evaluation of the semantic and constituency parsers, we suggest installing:

It is possible to disable the evaluation with the official scripts by setting the SDP_SCRIPT and CON_SCRIPT variables to None in the separ/utils/common.py script. The evaluation will then be performed with Python code (slight variations might occur in some cases).

We provide the environment.yaml file to create an Anaconda environment as an alternative to the requirements.txt:

conda env create -n separ -f environment.yaml
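
Alternatively, if you prefer pip over Anaconda, a minimal setup (assuming a standard virtual-environment workflow; the environment name is arbitrary) is:

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt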

Warning

Some functions in the code use the native concurrent.futures package for asynchronous execution. CPU multithreading acceleration is guaranteed on Linux and macOS systems, but unexpected behaviors can occur on Windows, so we highly suggest that Windows users rely on WSL virtualization or disable CPU parallelism by setting the variable NUM_WORKERS to 1 in the separ/utils/common.py script.
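
The constants mentioned above live in separ/utils/common.py. A minimal sketch of the kind of edit (only the constants documented in this README are shown; the rest of the module is omitted):

# separ/utils/common.py (sketch: only the constants documented in this README)
NUM_WORKERS = 1    # 1 disables CPU parallelism (recommended on Windows)
SDP_SCRIPT = None  # None disables the official SDP evaluation script
CON_SCRIPT = None  # None disables the official constituency evaluation script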

Data preparation

To deploy and evaluate our models we relied on well-known treebanks in Dependency, Constituency and Semantic Parsing: the Penn Treebank (PTB), the SPMRL 2014 shared task, the SDP 2015 shared task and the IWPT 2021 shared task.

To pre-process all datasets, place the compressed files (penn_treebank.zip, SPMRL_SHARED_2014_NO_ARABIC.zip, sdp2014_2015_LDC2016T10.tgz and iwpt2021stdata.tgz) in the treebanks folder. Then run the following scripts from the treebanks folder (you might need sudo privileges and the zip library installed); a combined example is shown after the list:

  • parse-ptb.py: To split the PTB into three sets (train, development and test) in the raw bracketing format. We follow the recommended split: sections 2-21 for training, 22 for validation and 23 for test. The result should be something like this:
treebanks/
    ptb/
        train.ptb
        dev.ptb
        test.ptb 
  • parse-spmrl.py: To create a subfolder per language of the SPMRL multilingual dataset. The result should be something like this:
treebanks/
    spmrl-2014/
        de/ 
            train.ptb
            dev.ptb
            test.ptb
        ...
        sv/
  • parse-sdp.py: To create a subfolder per treebank of the SDP dataset. For the DM (English), PAS (English), PSD (English) and PSD (Czech) treebanks we used section 20 for validation (the evaluation sets are the in-distribution and out-of-distribution files). The result should be something like this:
treebanks/
    sdp-2015/
        dm/
            train.sdp
            dev.sdp
            id.sdp
            ood.sdp
        ...
        zh/
  • parse-iwpt.py: To create a subfolder per language of the IWPT dataset. We grouped the different treebanks per language and concatenated them at split level to obtain a single treebank per language. The result should be something like this:
treebanks/
    iwpt-2021/
        ar/
            train.conllu
            dev.conllu
            test.conllu
        ...
        uk/
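
For instance, assuming the four compressed files are already placed in treebanks/, a possible pre-processing run (exact options, if any, may differ; check each script) would be:

cd treebanks
python3 parse-ptb.py
python3 parse-spmrl.py
python3 parse-sdp.py
python3 parse-iwpt.py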

Usage

You can train, evaluate and predict with the different parsers from the terminal with run.py. Each parser has a string identifier that is passed as the first argument of the run.py script. The following table shows the available parsers with their corresponding papers and the specific arguments they accept:

| Identifier | Parser | Paper | Arguments |
|---|---|---|---|
| dep-idx | Absolute and relative indexing | Strzyz et al. (2019) | rel |
| dep-pos | PoS-tag relative indexing | Strzyz et al. (2019) | gold |
| dep-bracket | Bracketing encoding ($k$-planar) | Strzyz et al. (2020) | k |
| dep-bit4 | $4$-bit projective encoding | Gómez-Rodríguez et al. (2023) | proj |
| dep-bit7 | $7$-bit $2$-planar encoding | Gómez-Rodríguez et al. (2023) | |
| dep-eager | Arc-Eager system | Nivre and Fernández-González (2002) | stack, buffer, proj |
| dep-biaffine | Biaffine dependency parser | Dozat et al. (2016) | |
| dep-hexa | Hexa-Tagging | Amini et al. (2023) | proj |
| con-idx | Absolute and relative indexing | Gómez-Rodríguez and Vilares (2018) | rel |
| con-tetra | Tetra-Tagging | Kitaev and Klein (2020) | |
| sdp-idx | Absolute and relative indexing | Ezquerro et al. (2024) | rel |
| sdp-bracket | Bracketing encoding ($k$-planar) | Ezquerro et al. (2024) | k |
| sdp-bit4k | $4k$-bit encoding | Ezquerro et al. (2024) | k |
| sdp-bit6k | $6k$-bit encoding | Ezquerro et al. (2024) | k |

Training

To train a parser from scratch, the run.py script should follow this syntax:

python3 run.py <parser-identifier> <specific-args> \
    -p <path> -c <conf> -d <device> (--load --seed <seed> --proj <proj-mode>) \
    train --train <train-path> --dev <dev-path> --test <test-paths> (--num-workers <num-workers>)

where:

  • <parser-identifier> is the identifier specified in the table above (e.g. dep-idx),
  • <specific-args> are the specific arguments of each parser (e.g. --rel for dep-idx),
  • <path> is a folder to store the training results (including the parser.pt file),
  • <conf> is the model configuration file (see some examples in configs folder),
  • <device> is the CUDA integer device (e.g. 0),
  • <train-path>, <dev-path> and <test-paths> are the paths to the training, development and test sets (multiple test paths are possible).

And optionally:

  • --load: Whether to load the parser from an existing parser.pt file. If it is specified, the <path> argument should be a path to a file, not a folder.
  • --seed: Specify a different seed value. By default, this code uses the seed 123. The default value can be changed in the separ/utils/common.py script.
  • --num-workers: Number of threads used to parallelize workload on the CPU. By default it is set to 1.
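
For instance, a hypothetical end-to-end invocation that trains the relative-indexing dependency parser might look as follows (the configuration file, results folder and treebank paths below are placeholders; adjust them to your setup):

python3 run.py dep-idx --rel \
    -p results/dep-idx-en -c configs/my-config.ini -d 0 \
    train --train treebanks/iwpt-2021/en/train.conllu \
    --dev treebanks/iwpt-2021/en/dev.conllu \
    --test treebanks/iwpt-2021/en/test.conllu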

Evaluation

Evaluation with a trained parser is also performed with the run.py script.

python3 run.py <parser-identifier>  -p <path> -d <device> eval <input> \
    (--output <output> --batch-size <batch-size> --num-workers <num-workers>)

where:

  • <parser-identifier> is the identifier specified in the table above (e.g. dep-idx),
  • <specific-args> are the specific arguments of each parser (e.g. --rel for dep-idx),
  • <path> is the path where the parser has been stored (e.g. the parser.pt file created after training).
  • <conf> is the model configuration file (see some examples in configs folder),
  • <device> is the CUDA integer device (e.g. 0),
  • <input> is the annotated file on which the evaluation is performed.

And optionally:

  • <output>: Folder to store the resulting metrics.
  • <batch-size>: Inference batch size. By default it is set to 100.
  • <num-workers>: Number of threads used to parallelize workload on the CPU. By default it is set to 1.
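
Continuing the hypothetical training example above, the stored model could then be evaluated on an annotated test file like this (all paths are placeholders):

python3 run.py dep-idx -p results/dep-idx-en/parser.pt -d 0 eval treebanks/iwpt-2021/en/test.conllu \
    --output results/dep-idx-en --batch-size 100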

Prediction

Prediction with a trained parser is also conducted from the run.py script.

python3 run.py <parser-identifier> -p <path> -d <device> predict <input> <output> \
    (--batch-size <batch-size> --num-workers <num-workers>)

where:

  • <parser-identifier> is the identifier specified in the table above (e.g. dep-idx),
  • <specific-args> are the specific arguments of each parser (e.g. --rel for dep-idx),
  • <path> is the path where the parser has been stored (e.g. the parser.pt file created after training).
  • <conf> is the model configuration file (see some examples in configs folder),
  • <device> is the CUDA integer device (e.g. 0),
  • <input> is the input file to parse.
  • <output> is the file where the predictions will be stored.

And optionally:

  • <batch-size>: Inference batch size. By default it is set to 100.
  • <num-workers>: Number of threads used to parallelize workload on the CPU. By default it is set to 1.
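
With the same placeholder paths as in the previous examples, prediction reads an input file and writes the parsed output to a new file:

python3 run.py dep-idx -p results/dep-idx-en/parser.pt -d 0 \
    predict treebanks/iwpt-2021/en/test.conllu results/dep-idx-en/test.pred.conllu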

Examples

Check the docs folder for specific examples of running the different dependency (docs/dep.md), constituency (docs/con.md) and semantic (docs/sdp.md) parsers. The docs/examples.ipynb notebook includes some examples of how to use the implemented classes and methods to parse and linearize input graphs/trees.
