Hi 👋 This is a Python implementation of sequence-labeling algorithms (and others!) for Dependency, Constituency and Semantic Parsing.
- Dependency Parsing:
    - Absolute and relative indexing (Strzyz et al., 2019).
    - PoS-tag relative indexing (Strzyz et al., 2019).
    - Bracketing encoding ($k$-planar) (Strzyz et al., 2020).
    - Arc-Eager transition-based system (Nivre and Fernández-González, 2002).
    - $4$-bit projective encoding (Gómez-Rodríguez et al., 2023).
    - $7$-bit $2$-planar encoding (Gómez-Rodríguez et al., 2023).
    - Hexa-Tagging projective encoding (Amini et al., 2023).
    - Biaffine dependency parser (Dozat et al., 2016).
- Semantic Dependency Parsing:
    - Absolute and relative indexing (Ezquerro et al., 2024).
    - Bracketing encoding ($k$-planar) (Ezquerro et al., 2024).
    - $4k$-bit encoding (Ezquerro et al., 2024).
    - $6k$-bit encoding (Ezquerro et al., 2024).
    - Biaffine semantic parser (Dozat et al., 2017).
- Constituency Parsing:
    - Absolute and relative indexing (Gómez-Rodríguez and Vilares, 2018).
    - Tetra-Tagging (Kitaev and Klein, 2020).
Python >= 3.8 is required to run this code, although we recommend Python >= 3.11 on a GPU system with NVIDIA drivers (>= 535) and CUDA (>= 12.4) installed. To use the official evaluation of the semantic and constituency parsers, we suggest installing:
- The semantic-dependency-parsing/toolkit at `eval/sdp-eval`.
- EVALB, compiled at `eval/EVALB`.
It is possible to disable the evaluation with the official scripts by setting the variables `SDP_SCRIPT` and `CON_SCRIPT` to `None` in the `separ/utils/common.py` script. The evaluation will then be performed with Python code (slight variations might occur in some cases).
We provide the `environment.yaml` file to create an Anaconda environment as an alternative to `requirements.txt`:

```shell
conda env create -n separ -f environment.yaml
```
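If you prefer not to use Anaconda, installing the pinned dependencies directly with pip should also work. This is a minimal sketch that assumes a standard `pip` setup; the virtual environment step is optional:

```shell
# Optional: isolate the dependencies in a virtual environment.
python3 -m venv .venv && source .venv/bin/activate
# Install the dependencies pinned in requirements.txt.
pip install -r requirements.txt
```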
**Warning**

Some functions in the code use the native `concurrent.futures` package for asynchronous execution. CPU multithreading acceleration is guaranteed on Linux and macOS systems, but unexpected behaviors can occur on Windows, so we highly suggest that Windows users rely on WSL virtualization or disable CPU parallelism by setting the variable `NUM_WORKERS` to `1` in the `separ/utils/common.py` script.
To deploy and evaluate our models we relied on well-known treebanks in Dependency, Constituency and Semantic Parsing:
- Dependency Parsing: We used various multilingual treebanks in the CoNLL-U format publicly available in the Universal Dependencies website.
- Constituency Parsing: PTB (Marcus et al., 2004) and SPMRL corpus (Seddah et al., 2011).
- Semantic Parsing: SDP (Oepen et al., 2015) and IWPT (Bouma et al., 2021) treebanks.
To pre-process all datasets, place the compressed files (`penn_treebank.zip`, `SPMRL_SHARED_2014_NO_ARABIC.zip`, `sdp2014_2015_LDC2016T10.tgz` and `iwpt2021stdata.tgz`) in the treebanks folder. Then run the following scripts from the treebanks folder (you might need `sudo` privileges and the `zip` utility installed); see the sketch after this list for a quick reference:
- parse-ptb.py: To split the PTB into three sets (train, development and test) in the raw bracketing format. We follow the recommended split: sections 2-21 for training, 22 for validation and 23 for testing. The result should look like this:

```
treebanks/
    ptb/
        train.ptb
        dev.ptb
        test.ptb
```
- parse-spmrl.py: To create a subfolder per language of the SPMRL multilingual dataset. The result should look like this:

```
treebanks/
    spmrl-2014/
        de/
            train.ptb
            dev.ptb
            test.ptb
        ...
        sv/
```
- parse-sdp.py: To create a subfolder per treebank of the SDP dataset. For the DM (English), PAS (English), PSD (English) and PSD (Czech) treebanks we used section 20 for validation (the evaluation sets are the in-distribution and out-of-distribution files, `id.sdp` and `ood.sdp`):

```
treebanks/
    sdp-2015/
        dm/
            train.sdp
            dev.sdp
            id.sdp
            ood.sdp
        ...
        zh/
```
- parse-iwpt.py: To create a subfolder per language of the IWPT dataset. We grouped the different treebanks per language and concatenated them at split level to obtain a single treebank per language:

```
treebanks/
    iwpt-2021/
        ar/
            train.conllu
            dev.conllu
            test.conllu
        ...
        uk/
```
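As a quick reference, the whole pre-processing can be run in one go. This is a minimal sketch that assumes each script needs no additional arguments; check the scripts themselves for optional flags:

```shell
# Run from the treebanks folder, where the compressed files are located
# (prepend sudo if your setup requires it).
cd treebanks
python3 parse-ptb.py      # PTB -> treebanks/ptb/
python3 parse-spmrl.py    # SPMRL -> treebanks/spmrl-2014/
python3 parse-sdp.py      # SDP -> treebanks/sdp-2015/
python3 parse-iwpt.py     # IWPT -> treebanks/iwpt-2021/
```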
You can train, evaluate and predict with the different parsers from the terminal with run.py. Each parser has a string identifier that is passed as the first argument of the run.py script. The following table shows the available parsers with their corresponding paper and the specific arguments they accept:
| Identifier | Parser | Paper | Arguments |
|---|---|---|---|
| `dep-idx` | Absolute and relative indexing | Strzyz et al. (2019) | `rel` |
| `dep-pos` | PoS-tag relative indexing | Strzyz et al. (2019) | `gold` |
| `dep-bracket` | Bracketing encoding ($k$-planar) | Strzyz et al. (2020) | `k` |
| `dep-bit4` | $4$-bit projective encoding | Gómez-Rodríguez et al. (2023) | `proj` |
| `dep-bit7` | $7$-bit $2$-planar encoding | Gómez-Rodríguez et al. (2023) | |
| `dep-eager` | Arc-Eager system | Nivre and Fernández-González (2002) | `stack`, `buffer`, `proj` |
| `dep-biaffine` | Biaffine dependency parser | Dozat et al. (2016) | |
| `dep-hexa` | Hexa-Tagging | Amini et al. (2023) | `proj` |
| `con-idx` | Absolute and relative indexing | Gómez-Rodríguez and Vilares (2018) | `rel` |
| `con-tetra` | Tetra-Tagging | Kitaev and Klein (2020) | |
| `sdp-idx` | Absolute and relative indexing | Ezquerro et al. (2024) | `rel` |
| `sdp-bracket` | Bracketing encoding ($k$-planar) | Ezquerro et al. (2024) | `k` |
| `sdp-bit4k` | $4k$-bit encoding | Ezquerro et al. (2024) | `k` |
| `sdp-bit6k` | $6k$-bit encoding | Ezquerro et al. (2024) | `k` |
To train a parser from scratch, the run.py script follows this syntax:

```shell
python3 run.py <parser-identifier> <specific-args> \
    -p <path> -c <conf> -d <device> (--load --seed <seed> --proj <proj-mode>) \
    train --train <train-path> --dev <dev-path> --test <test-paths> (--num-workers <num-workers>)
```
where:

- `<parser-identifier>` is the identifier specified in the table above (e.g. `dep-idx`),
- `<specific-args>` are the specific arguments of each parser (e.g. `--rel` for `dep-idx`),
- `<path>` is a folder to store the training results (including the `parser.pt` file),
- `<conf>` is the model configuration file (see some examples in the configs folder),
- `<device>` is the CUDA integer device (e.g. `0`),
- `<train-path>`, `<dev-path>` and `<test-paths>` are the paths to the training, development and test sets (multiple test paths are possible).
And optionally:

- `--load`: Whether to load the parser from an existing `parser.pt` file. If it is specified, the `<path>` argument should be a path to a file, not a folder.
- `--seed`: Specify a different seed value. By default, this code always uses the seed `123`. The default value can be changed in the `separ/utils/common.py` script.
- `--num-workers`: Number of threads to also parallelize workload in CPU. By default it is set to 1.
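For instance, assuming the pre-processed PTB layout from above, training the constituency relative-indexing parser could look like the following sketch. The results folder and configuration file name are illustrative placeholders; pick an actual configuration from the configs folder:

```shell
# Train con-idx with relative indexing on GPU 0; results go to results/con-idx/.
python3 run.py con-idx --rel \
    -p results/con-idx -c configs/con-idx.yaml -d 0 \
    train --train treebanks/ptb/train.ptb --dev treebanks/ptb/dev.ptb --test treebanks/ptb/test.ptb
```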
Evaluation with a trained parser is also performed with the run.py script.
```shell
python3 run.py <parser-identifier> -p <path> -d <device> eval <input> \
    (--output <output> --batch-size <batch-size> --num-workers <num-workers>)
```
where:

- `<parser-identifier>` is the identifier specified in the table above (e.g. `dep-idx`),
- `<specific-args>` are the specific arguments of each parser (e.g. `--rel` for `dep-idx`),
- `<path>` is the path where the parser has been stored (e.g. the `parser.pt` file created after training),
- `<conf>` is the model configuration file (see some examples in the configs folder),
- `<device>` is the CUDA integer device (e.g. `0`),
- `<input>` is the annotated file used to perform the evaluation.
And optionally:

- `<output>`: Folder to store the resulting metrics.
- `<batch-size>`: Inference batch size. By default it is set to 100.
- `<num-workers>`: Number of threads to also parallelize workload in CPU. By default it is set to 1.
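Continuing the training sketch above, evaluating the stored parser on the PTB test set might look like this (the paths are the same illustrative placeholders):

```shell
# Evaluate the trained parser.pt on the test set and store the metrics.
python3 run.py con-idx -p results/con-idx/parser.pt -d 0 \
    eval treebanks/ptb/test.ptb --output results/con-idx --batch-size 100
```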
Prediction with a trained parser is also conducted from the run.py script.

```shell
python3 run.py <parser-identifier> -p <path> -d <device> predict <input> <output> \
    (--batch-size <batch-size> --num-workers <num-workers>)
```
where:

- `<parser-identifier>` is the identifier specified in the table above (e.g. `dep-idx`),
- `<specific-args>` are the specific arguments of each parser (e.g. `--rel` for `dep-idx`),
- `<path>` is the path where the parser has been stored (e.g. the `parser.pt` file created after training),
- `<conf>` is the model configuration file (see some examples in the configs folder),
- `<device>` is the CUDA integer device (e.g. `0`),
- `<input>` is the input file to parse,
- `<output>` is the file to store the predictions.
And optionally:

- `<batch-size>`: Inference batch size. By default it is set to 100.
- `<num-workers>`: Number of threads to also parallelize workload in CPU. By default it is set to 1.
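As a final sketch, predicting with the trained parser from the previous examples (again with placeholder paths, and an output file name chosen for illustration):

```shell
# Parse the test set and write the predicted trees to pred.ptb.
python3 run.py con-idx -p results/con-idx/parser.pt -d 0 \
    predict treebanks/ptb/test.ptb results/con-idx/pred.ptb --batch-size 100
```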
Check the docs folder for specific examples running different dependency (docs/dep.md), constituency (docs/con.md) and semantic (docs/sdp.md) parsers. The docs/examples.ipynb notebook includes some examples of how to use the implemented classes and methods to parse and linearize input graphs/trees.