
Commit d899489

Merge pull request #10 from wellcometrust/feature/ivyleavedtoadflax/parsing
Add parsing model
2 parents 58e723b + b16a1a1

31 files changed: +1063 −408 lines

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Changelog
+
+## 2020.3.2 - Pre-release
+
+* Adds a parse command that can be called with `python -m deep_reference_parser parse`
+* Renames the predict command to `split`, which can be called with `python -m deep_reference_parser split`
+* Squashes most `tensorflow`, `keras_contrib`, and `numpy` warnings in `__init__.py` resulting from old versions and soon-to-be deprecated functions.
+* Reduces verbosity of logging, improving CLI clarity.
+
+## 2020.2.0 - Pre-release
+
+First release. Features train and predict functions tested mainly for the task of labelling reference spans (e.g. academic references) in policy documents (e.g. documents produced by government, NGOs, etc.).
+

Makefile

Lines changed: 3 additions & 3 deletions
@@ -81,9 +81,9 @@ models: $(artefacts)
 datasets = data/splitting/2019.12.0_splitting_train.tsv \
 	data/splitting/2019.12.0_splitting_test.tsv \
 	data/splitting/2019.12.0_splitting_valid.tsv \
-	data/splitting/2020.2.0_parsing_train.tsv \
-	data/splitting/2020.2.0_parsing_test.tsv \
-	data/splitting/2020.2.0_parsing_valid.tsv
+	data/parsing/2020.3.2_parsing_train.tsv \
+	data/parsing/2020.3.2_parsing_test.tsv \
+	data/parsing/2020.3.2_parsing_valid.tsv
 
 
 rodrigues_datasets = data/rodrigues/clean_train.txt \

README.md

Lines changed: 47 additions & 19 deletions
@@ -4,15 +4,15 @@
 
 Deep Reference Parser is a Bi-directional Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) tool to replace a number of existing machine learning models which find references and extract their constituent parts (e.g. author, year, publication, volume, etc.).
 
-The intention for this project, like Rodrigues et al. (2018) is to implement a MultiTask model which will complete three tasks simultaneously: reference span detection, reference component detection, and reference type classification.
+The BiLSTM model is based on Rodrigues et al. (2018), and like that project, the intention is to implement a MultiTask model which will complete three tasks simultaneously in a single neural network with a stacked CRF: reference span detection (splitting), reference component detection (parsing), and reference type classification (classification).
 
 ### Current status:
 
 |Component|Individual|MultiTask|
 |---|---|---|
-|Spans|✔️ Implemented|❌ Not Implemented|
-|Components|❌ Not Implemented|❌ Not Implemented|
-|Type|❌ Not Implemented|❌ Not Implemented|
+|Spans (splitting)|✔️ Implemented|❌ Not Implemented|
+|Components (parsing)|✔️ Implemented|❌ Not Implemented|
+|Type (classification)|❌ Not Implemented|❌ Not Implemented|
 
 ### The model
 
@@ -29,7 +29,9 @@ The model itself is based on the work of [Rodrigues et al. (2018)](https://githu
 
 ### Performance
 
-#### Span detection
+All scores are reported on the validation set.
+
+#### Span detection (splitting)
 
 |token|f1|support|
 |---|---|---|
@@ -39,13 +41,24 @@ The model itself is based on the work of [Rodrigues et al. (2018)](https://githu
 |o|0.9561|32666|
 |weighted avg|0.9746|129959|
 
+#### Components (parsing)
+
+|token|f1|support|
+|---|---|---|
+|author|0.9467|2818|
+|title|0.8994|4931|
+|year|0.8774|418|
+|o|0.9592|13685|
+|weighted avg|0.9425|21852|
+
 #### Computing requirements
 
 Models are trained on AWS instances using CPU only.
 
 |Model|Time Taken|Instance type|Instance cost (p/h)|Total cost|
 |---|---|---|---|---|
 |Span detection|16:02:00|m4.4xlarge|$0.88|$14.11|
+|Components|11:02:59|m4.4xlarge|$0.88|$9.72|
 
 ## tl;dr: Just get me to the references!
 
@@ -64,10 +77,15 @@ cat > references.txt <<EOF
 EOF
 
 
-# Run the model. This will take a little time while the weights and embeddings
-# are downloaded. The weights are about 300MB, and the embeddings 950MB.
+# Run the splitter model. This will take a little time while the weights and
+# embeddings are downloaded. The weights are about 300MB, and the embeddings
+# 950MB.
 
-python -m deep_reference_parser predict --verbose "$(cat references.txt)"
+python -m deep_reference_parser split "$(cat references.txt)"
+
+# For parsing:
+
+python -m deep_reference_parser parse "$(cat references.txt)"
 ```
 
 ## The longer guide
@@ -133,27 +151,37 @@ $ python -m deep_reference_parser
 Using TensorFlow backend.
 
 ℹ Available commands
-   train, predict
+   parse, split, train
 ```
 
 For additional help, you can pass a command with the `-h`/`--help` flag:
 
 ```
-$ python -m deep_reference_parser predict --help
-Using TensorFlow backend.
-usage: deep_reference_parser predict [-h]
-                                     [-c]
-                                     [-t] [-v]
-                                     text
+$ python -m deep_reference_parser split --help
+usage: deep_reference_parser split [-h]
+                                   [-c]
+                                   [-t] [-o None]
+                                   text
+
+Runs the default splitting model and pretty prints results to the console
+unless --outfile is passed with a path. Can output either tokens (with
+-t|--tokens) or split naively into references based on the b-r tag (default).
+
+NOTE: this function is provided as an example only and should not be used in
+production, as the model is instantiated each time the command is run. To
+use in a production setting, a more sensible approach would be to replicate
+the split or parse functions within your own logic.
+
 
 positional arguments:
   text               Plaintext from which to extract references
 
 optional arguments:
   -h, --help         show this help message and exit
-  -c --config-file   Path to config file
+  -c, --config-file  Path to config file
   -t, --tokens       Output tokens instead of complete references
-  -v, --verbose      Output more verbose results
+  -o, --outfile      Path to json file to which results will be written
+
 ```
 

@@ -192,13 +220,13 @@ February i-r
 If you wish to use the latest model that we have trained, you can simply run:
 
 ```
-python -m deep_reference_parser predict <input text>
+python -m deep_reference_parser split <input text>
 ```
 
 If you wish to use a custom model that you have trained, you must specify the config file which defines the hyperparameters for that model using the `-c` flag:
 
 ```
-python -m deep_reference_parser predict -c new_model.ini <input text>
+python -m deep_reference_parser split -c new_model.ini <input text>
 ```
 
 Use the `-t` flag to return the raw token predictions, and the `-o` flag to write the results to a json file.
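
The NOTE in the help text above advises replicating the split or parse functions in your own code for production use, so the model is built once rather than on every CLI invocation. Below is a minimal, hypothetical sketch of that advice; the loader is a stub, since the real split/parse internals are not part of this diff:

```python
# Hypothetical sketch of the production advice above: build the model once at
# startup and reuse it, rather than shelling out to the CLI, which reloads
# weights and embeddings on every invocation. load_split_model() is a stand-in;
# the real loading code lives in the package's split/parse modules.
from typing import Callable, List


def load_split_model() -> Callable[[str], List[str]]:
    """Stand-in for the expensive one-off model build (weights + embeddings)."""
    return lambda text: [line for line in text.split("\n") if line.strip()]


class ReferenceSplitter:
    def __init__(self):
        self._model = load_split_model()  # pay the load cost exactly once

    def split(self, text: str) -> List[str]:
        return self._model(text)


splitter = ReferenceSplitter()  # instantiate once, e.g. at service startup
for doc in ["ref one\nref two", "ref three"]:
    print(splitter.split(doc))  # cheap repeated calls against the same model
```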

deep_reference_parser/__init__.py

Lines changed: 29 additions & 5 deletions
@@ -1,15 +1,39 @@
+# Tensorflow and Keras emit a very large number of warnings that are very
+# distracting on the command line. These lines here (while undesirable)
+# reduce the level of verbosity.
+
 import sys
 import warnings
+import os
+
+os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
 
 if not sys.warnoptions:
     warnings.filterwarnings("ignore", category=FutureWarning)
+    warnings.filterwarnings("ignore", category=DeprecationWarning)
+    warnings.filterwarnings("ignore", category=UserWarning)
+
+import tensorflow as tf
 
+tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
+
+from .common import download_model_artefact
 from .deep_reference_parser import DeepReferenceParser
 from .logger import logger
 from .model_utils import get_config
-from .reference_utils import (break_into_chunks, labels_to_prodigy, load_data,
-                              load_tsv, prodigy_to_conll, prodigy_to_lists,
-                              read_jsonl, read_pickle, write_json, write_jsonl,
-                              write_pickle, write_to_csv, write_txt)
+from .reference_utils import (
+    break_into_chunks,
+    labels_to_prodigy,
+    load_data,
+    load_tsv,
+    prodigy_to_conll,
+    prodigy_to_lists,
+    read_jsonl,
+    read_pickle,
+    write_json,
+    write_jsonl,
+    write_pickle,
+    write_to_csv,
+    write_txt,
+)
 from .tokens_to_references import tokens_to_references
-from .common import download_model_artefact
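
Note that `TF_CPP_MIN_LOG_LEVEL` is read when the TensorFlow native library is loaded, which is why the hunk above sets the environment variable before `import tensorflow`; setting it after the import has no effect on the C++ log output.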

deep_reference_parser/__main__.py

Lines changed: 4 additions & 2 deletions
@@ -10,10 +10,12 @@
 import sys
 from wasabi import msg
 from .train import train
-from .predict import predict
+from .split import split
+from .parse import parse
 
 commands = {
-    "predict": predict,
+    "split": split,
+    "parse": parse,
     "train": train,
 }
 
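
For context, a commands table like the one above is typically consumed by a small dispatcher. The real dispatch code in `__main__.py` is outside this hunk, so the following is only an illustrative sketch, with stub functions standing in for the real `split`, `parse`, and `train`:

```python
# Illustrative sketch only: dispatching a subcommand from a commands dict.
# The stubs below stand in for the real split/parse/train callables.
import sys

from wasabi import msg  # wasabi is the messaging library the module imports


def split():
    print("splitting...")


def parse():
    print("parsing...")


def train():
    print("training...")


commands = {"split": split, "parse": parse, "train": train}

if __name__ == "__main__":
    if len(sys.argv) < 2 or sys.argv[1] not in commands:
        msg.info("Available commands")
        print(", ".join(sorted(commands)))
        sys.exit(1)
    commands[sys.argv.pop(1)]()  # pop so each command can parse its own args
```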

deep_reference_parser/__version__.py

Lines changed: 3 additions & 2 deletions
@@ -1,8 +1,9 @@
 __name__ = "deep_reference_parser"
-__version__ = "2020.2.0"
+__version__ = "2020.3.0"
 __description__ = "Deep learning model for finding and parsing references"
 __url__ = "https://github.com/wellcometrust/deep_reference_parser"
 __author__ = "Wellcome Trust DataLabs Team"
 __author_email__ = "Grp_datalabs-datascience@Wellcomecloud.onmicrosoft.com"
 __license__ = "MIT"
-__model_version__ = "2019.12.0_splitting"
+__splitter_model_version__ = "2019.12.0_splitting"
+__parser_model_version__ = "2020.3.2_parsing"

deep_reference_parser/common.py

Lines changed: 15 additions & 10 deletions
@@ -6,15 +6,16 @@
 from urllib import parse, request
 
 from .logger import logger
-from .__version__ import __model_version__
+from .__version__ import __splitter_model_version__, __parser_model_version__
+
 
 def get_path(path):
-    return os.path.join(
-        os.path.dirname(__file__),
-        path
-    )
+    return os.path.join(os.path.dirname(__file__), path)
+
+
+SPLITTER_CFG = get_path(f"configs/{__splitter_model_version__}.ini")
+PARSER_CFG = get_path(f"configs/{__parser_model_version__}.ini")
 
-LATEST_CFG = get_path(f'configs/{__model_version__}.ini')
 
 def download_model_artefact(artefact, s3_slug):
     """ Checks if model artefact exists and downloads if not
@@ -38,19 +39,23 @@ def download_model_artefact(artefact, s3_slug):
 
     request.urlretrieve(url, artefact)
 
+
 def download_model_artefacts(model_dir, s3_slug, artefacts=None):
     """
     """
 
     if not artefacts:
 
         artefacts = [
-            "char2ind.pickle", "ind2label.pickle", "ind2word.pickle",
-            "label2ind.pickle", "maxes.pickle", "weights.h5",
-            "word2ind.pickle"
+            "char2ind.pickle",
+            "ind2label.pickle",
+            "ind2word.pickle",
+            "label2ind.pickle",
+            "maxes.pickle",
+            "weights.h5",
+            "word2ind.pickle",
         ]
 
     for artefact in artefacts:
         artefact = os.path.join(model_dir, artefact)
         download_model_artefact(artefact, s3_slug)
-
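
A sketch of how the new `SPLITTER_CFG`/`PARSER_CFG` constants can drive artefact downloads. `configparser` is used directly here because the configs are plain `.ini` files; `get_config` from `model_utils` may wrap this, but its signature is not shown in this diff:

```python
# Sketch: read the parser config and fetch its model artefacts if missing.
# Both config keys used below appear in the 2020.3.2_parsing config that
# this commit adds.
import configparser

from deep_reference_parser.common import PARSER_CFG, download_model_artefacts

config = configparser.ConfigParser()
config.read(PARSER_CFG)

model_dir = config["build"]["output_path"]  # models/parsing/2020.3.2_parsing/
s3_slug = config["data"]["s3_slug"]

# Downloads weights.h5, the *2ind.pickle indices, etc. into model_dir,
# skipping any artefact that already exists locally.
download_model_artefacts(model_dir, s3_slug)
```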
deep_reference_parser/configs/2020.3.2_parsing.ini

Lines changed: 39 additions & 0 deletions

@@ -0,0 +1,39 @@
+[DEFAULT]
+version = 2020.3.2_parsing
+description = First experiment which includes Reach labelled data in the
+	training set. All annotated parsing data were combined, and then split using
+	a 50% (train), 25% (test), 25% (valid) split. The Rodrigues data is then
+	added to the training set to bulk it out.
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 250
+policy_train = data/parsing/2020.3.2_parsing_train.tsv
+policy_test = data/parsing/2020.3.2_parsing_test.tsv
+policy_valid = data/parsing/2020.3.2_parsing_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/parsing/2020.3.2_parsing/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 10
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
+
+[evaluate]
+out_file = evaluation_data.tsv
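
The `description` in `[DEFAULT]` specifies a 50% (train), 25% (test), 25% (valid) split of the combined annotated data. An illustrative sketch of such a split (not the project's actual preprocessing code):

```python
# Illustrative only: a 50/25/25 train/test/valid split as described in the
# config above. The real preprocessing pipeline is not part of this diff.
import random


def train_test_valid_split(examples, test_proportion=0.25,
                           valid_proportion=0.25, seed=42):
    """Shuffle and split annotated examples into train/test/valid lists."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_proportion)
    n_valid = int(len(examples) * valid_proportion)
    test = examples[:n_test]
    valid = examples[n_test:n_test + n_valid]
    train = examples[n_test + n_valid:]  # the remaining ~50%
    return train, test, valid


# 100 dummy examples -> 50 train, 25 test, 25 valid
train, test, valid = train_test_valid_split(range(100))
print(len(train), len(test), len(valid))  # 50 25 25
```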
