Hi 👋 This is a Python implementation of sequence-labeling algorithms (and others!) for Dependency, Constituency and Semantic Parsing.
- Dependency Parsing:
    - Absolute and relative indexing (Strzyz et al., 2019).
    - PoS-tag relative indexing (Strzyz et al., 2019).
    - Bracketing encoding ($k$-planar) (Strzyz et al., 2020).
    - Arc-Eager transition-based system (Nivre and Fernández-González, 2002).
    - $4$-bit projective encoding (Gómez-Rodríguez et al., 2023).
    - $7$-bit $2$-planar encoding (Gómez-Rodríguez et al., 2023).
    - Hexa-Tagging projective encoding (Amini et al., 2023).
    - Biaffine dependency parser (Dozat et al., 2016).
- Semantic Dependency Parsing:
    - Absolute and relative indexing (Ezquerro et al., 2024).
    - Bracketing encoding ($k$-planar) (Ezquerro et al., 2024).
    - $4k$-bit encoding (Ezquerro et al., 2024).
    - $6k$-bit encoding (Ezquerro et al., 2024).
    - Biaffine semantic parser (Dozat et al., 2017).
- Constituency Parsing:
    - Absolute and relative indexing (Gómez-Rodríguez and Vilares, 2018).
    - Tetra-Tagging (Kitaev and Klein, 2020).
Python >= 3.8 is required to run this code, although we recommend Python >= 3.11 on a GPU system with NVIDIA drivers (>= 535) and CUDA (>= 12.4) installed. To use the official evaluation of the semantic and constituency parsers, we suggest installing:
- The semantic-dependency-parsing/toolkit at `eval/sdp-eval`.
- EVALB, compiled at `eval/EVALB`.
It is possible to disable the evaluation with the official scripts by setting the variables `SDP_SCRIPT` and `CON_SCRIPT` to `None` in the `separ/utils/common.py` script. The evaluation will then be performed with Python code (slight variations might occur in some cases).
We provide the `environment.yaml` file to create an Anaconda environment as an alternative to `requirements.txt`:

```shell
conda env create -n separ -f environment.yaml
```
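If you prefer not to use Anaconda, installing the pinned dependencies directly with pip should also work. This is a minimal sketch that assumes a standard `pip` setup; the virtual environment step is optional:

```shell
# Optional: isolate the dependencies in a virtual environment.
python3 -m venv .venv && source .venv/bin/activate
# Install the dependencies pinned in requirements.txt.
pip install -r requirements.txt
```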
**Warning**

Some functions in the code use the native `concurrent.futures` package for asynchronous execution. CPU multithreading acceleration is guaranteed on Linux and macOS systems, but unexpected behaviors can occur on Windows, so we highly suggest that Windows users rely on WSL virtualization or disable CPU parallelism by setting the variable `NUM_WORKERS` to `1` in the `separ/utils/common.py` script.
To deploy and evaluate our models we relied on well-known treebanks in Dependency, Constituency and Semantic Parsing:
- Dependency Parsing: We used various multilingual treebanks in the CoNLL-U format publicly available in the Universal Dependencies website.
- Constituency Parsing: PTB (Marcus et al., 2004) and SPMRL corpus (Seddah et al., 2011).
- Semantic Parsing: SDP (Oepen et al., 2015) and IWPT (Bouma et al., 2021) treebanks.
To pre-process all datasets, place the compressed files (`penn_treebank.zip`, `SPMRL_SHARED_2014_NO_ARABIC.zip`, `sdp2014_2015_LDC2016T10.tgz` and `iwpt2021stdata.tgz`) in the treebanks folder. Then run the following scripts from the treebanks folder (you might need `sudo` privileges and the `zip` utility installed); see the sketch after this list for a quick reference:
- parse-ptb.py: To split the PTB into three sets (train, development and test) in the raw bracketing format. We follow the recommended split: sections 2-21 for training, 22 for validation and 23 for testing. The result should look like this:

```
treebanks/
    ptb/
        train.ptb
        dev.ptb
        test.ptb
```
- parse-spmrl.py: To create a subfolder per language of the SPMRL multilingual dataset. The result should look like this:

```
treebanks/
    spmrl-2014/
        de/
            train.ptb
            dev.ptb
            test.ptb
        ...
        sv/
```
- parse-sdp.py: To create a subfolder per treebank of the SDP dataset. For the DM (English), PAS (English), PSD (English) and PSD (Czech) treebanks we used section 20 for validation (the evaluation sets are the in-distribution and out-of-distribution files, `id.sdp` and `ood.sdp`):

```
treebanks/
    sdp-2015/
        dm/
            train.sdp
            dev.sdp
            id.sdp
            ood.sdp
        ...
        zh/
```
- parse-iwpt.py: To create a subfolder per language of the IWPT dataset. We grouped the different treebanks per language and concatenated them at split level to obtain a single treebank per language:

```
treebanks/
    iwpt-2021/
        ar/
            train.conllu
            dev.conllu
            test.conllu
        ...
        uk/
```
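As a quick reference, the whole pre-processing can be run in one go. This is a minimal sketch that assumes each script needs no additional arguments; check the scripts themselves for optional flags:

```shell
# Run from the treebanks folder, where the compressed files are located
# (prepend sudo if your setup requires it).
cd treebanks
python3 parse-ptb.py      # PTB -> treebanks/ptb/
python3 parse-spmrl.py    # SPMRL -> treebanks/spmrl-2014/
python3 parse-sdp.py      # SDP -> treebanks/sdp-2015/
python3 parse-iwpt.py     # IWPT -> treebanks/iwpt-2021/
```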
You can train, evaluate and predict with the different parsers from the terminal with run.py. Each parser has a string identifier that is passed as the first argument of the run.py script. The following table shows the available parsers with their corresponding paper and the specific arguments they accept:
| Identifier | Parser | Paper | Arguments |
|---|---|---|---|
| `dep-idx` | Absolute and relative indexing | Strzyz et al. (2019) | `rel` |
| `dep-pos` | PoS-tag relative indexing | Strzyz et al. (2019) | `gold` |
| `dep-bracket` | Bracketing encoding ($k$-planar) | Strzyz et al. (2020) | `k` |
| `dep-bit4` | $4$-bit projective encoding | Gómez-Rodríguez et al. (2023) | `proj` |
| `dep-bit7` | $7$-bit $2$-planar encoding | Gómez-Rodríguez et al. (2023) | |
| `dep-eager` | Arc-Eager system | Nivre and Fernández-González (2002) | `stack`, `buffer`, `proj` |
| `dep-biaffine` | Biaffine dependency parser | Dozat et al. (2016) | |
| `dep-hexa` | Hexa-Tagging | Amini et al. (2023) | `proj` |
| `con-idx` | Absolute and relative indexing | Gómez-Rodríguez and Vilares (2018) | `rel` |
| `con-tetra` | Tetra-Tagging | Kitaev and Klein (2020) | |
| `sdp-idx` | Absolute and relative indexing | Ezquerro et al. (2024) | `rel` |
| `sdp-bracket` | Bracketing encoding ($k$-planar) | Ezquerro et al. (2024) | `k` |
| `sdp-bit4k` | $4k$-bit encoding | Ezquerro et al. (2024) | `k` |
| `sdp-bit6k` | $6k$-bit encoding | Ezquerro et al. (2024) | `k` |
To train a parser from scratch, the run.py script follows this syntax:

```shell
python3 run.py <parser-identifier> <specific-args> \
    -p <path> -c <conf> -d <device> (--load --seed <seed> --proj <proj-mode>) \
    train --train <train-path> --dev <dev-path> --test <test-paths> (--num-workers <num-workers>)
```
where:

- `<parser-identifier>` is the identifier specified in the table above (e.g. `dep-idx`),
- `<specific-args>` are the specific arguments of each parser (e.g. `--rel` for `dep-idx`),
- `<path>` is a folder to store the training results (including the `parser.pt` file),
- `<conf>` is the model configuration file (see some examples in the configs folder),
- `<device>` is the CUDA integer device (e.g. `0`),
- `<train-path>`, `<dev-path>` and `<test-paths>` are the paths to the training, development and test sets (multiple test paths are possible).
And optionally:

- `--load`: Whether to load the parser from an existing `parser.pt` file. If it is specified, the `<path>` argument should be a path to a file, not a folder.
- `--seed`: Specify a different seed value. By default, this code always uses the seed `123`. The default value can be changed in the `separ/utils/common.py` script.
- `--num-workers`: Number of threads to also parallelize workload in CPU. By default it is set to 1.
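For instance, assuming the pre-processed PTB layout from above, training the constituency relative-indexing parser could look like the following sketch. The results folder and configuration file name are illustrative placeholders; pick an actual configuration from the configs folder:

```shell
# Train con-idx with relative indexing on GPU 0; results go to results/con-idx/.
python3 run.py con-idx --rel \
    -p results/con-idx -c configs/con-idx.yaml -d 0 \
    train --train treebanks/ptb/train.ptb --dev treebanks/ptb/dev.ptb --test treebanks/ptb/test.ptb
```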
Evaluation with a trained parser is also performed with the run.py script.
```shell
python3 run.py <parser-identifier> -p <path> -d <device> eval <input> \
    (--output <output> --batch-size <batch-size> --num-workers <num-workers>)
```
where:

- `<parser-identifier>` is the identifier specified in the table above (e.g. `dep-idx`),
- `<specific-args>` are the specific arguments of each parser (e.g. `--rel` for `dep-idx`),
- `<path>` is the path where the parser has been stored (e.g. the `parser.pt` file created after training),
- `<conf>` is the model configuration file (see some examples in the configs folder),
- `<device>` is the CUDA integer device (e.g. `0`),
- `<input>` is the annotated file used to perform the evaluation.
And optionally:

- `<output>`: Folder to store the resulting metrics.
- `<batch-size>`: Inference batch size. By default it is set to 100.
- `<num-workers>`: Number of threads to also parallelize workload in CPU. By default it is set to 1.
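Continuing the training sketch above, evaluating the stored parser on the PTB test set might look like this (the paths are the same illustrative placeholders):

```shell
# Evaluate the trained parser.pt on the test set and store the metrics.
python3 run.py con-idx -p results/con-idx/parser.pt -d 0 \
    eval treebanks/ptb/test.ptb --output results/con-idx --batch-size 100
```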
Prediction with a trained parser is also conducted from the run.py script.

```shell
python3 run.py <parser-identifier> -p <path> -d <device> predict <input> <output> \
    (--batch-size <batch-size> --num-workers <num-workers>)
```
where:

- `<parser-identifier>` is the identifier specified in the table above (e.g. `dep-idx`),
- `<specific-args>` are the specific arguments of each parser (e.g. `--rel` for `dep-idx`),
- `<path>` is the path where the parser has been stored (e.g. the `parser.pt` file created after training),
- `<conf>` is the model configuration file (see some examples in the configs folder),
- `<device>` is the CUDA integer device (e.g. `0`),
- `<input>` is the input file to parse,
- `<output>` is the file to store the predictions.
And optionally:

- `<batch-size>`: Inference batch size. By default it is set to 100.
- `<num-workers>`: Number of threads to also parallelize workload in CPU. By default it is set to 1.
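As a final sketch, predicting with the trained parser from the previous examples (again with placeholder paths, and an output file name chosen for illustration):

```shell
# Parse the test set and write the predicted trees to pred.ptb.
python3 run.py con-idx -p results/con-idx/parser.pt -d 0 \
    predict treebanks/ptb/test.ptb results/con-idx/pred.ptb --batch-size 100
```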
Check the docs folder for specific examples running different dependency (docs/dep.md), constituency (docs/con.md) and semantic (docs/sdp.md) parsers. The docs/examples.ipynb notebook includes some examples of how to use the implemented classes and methods to parse and linearize input graphs/trees.