
Commit d899489

Merge pull request #10 from wellcometrust/feature/ivyleavedtoadflax/parsing
Add parsing model
2 parents 58e723b + b16a1a1

31 files changed: +1063 −408 lines

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Changelog
+
+## 2020.3.2 - Pre-release
+
+* Adds a parse command that can be called with `python -m deep_reference_parser parse`
+* Renames the predict command to `split`, which can be called with `python -m deep_reference_parser split`
+* Squashes most `tensorflow`, `keras_contrib`, and `numpy` warnings in `__init__.py` resulting from old versions and soon-to-be deprecated functions.
+* Reduces verbosity of logging, improving CLI clarity.
+
+## 2020.2.0 - Pre-release
+
+First release. Features train and predict functions tested mainly for the task of labelling reference spans (e.g. academic references) in policy documents (e.g. documents produced by government, NGOs, etc.).
+

Makefile

Lines changed: 3 additions & 3 deletions
@@ -81,9 +81,9 @@ models: $(artefacts)
 datasets = data/splitting/2019.12.0_splitting_train.tsv \
 	data/splitting/2019.12.0_splitting_test.tsv \
 	data/splitting/2019.12.0_splitting_valid.tsv \
-	data/splitting/2020.2.0_parsing_train.tsv \
-	data/splitting/2020.2.0_parsing_test.tsv \
-	data/splitting/2020.2.0_parsing_valid.tsv
+	data/parsing/2020.3.2_parsing_train.tsv \
+	data/parsing/2020.3.2_parsing_test.tsv \
+	data/parsing/2020.3.2_parsing_valid.tsv
 
 
 rodrigues_datasets = data/rodrigues/clean_train.txt \

README.md

Lines changed: 47 additions & 19 deletions
@@ -4,15 +4,15 @@
 
 Deep Reference Parser is a Bi-directional Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) tool to replace a number of existing machine learning models which find references and extract their constituent parts (e.g. author, year, publication, volume, etc.).
 
-The intention for this project, like Rodrigues et al. (2018) is to implement a MultiTask model which will complete three tasks simultaneously: reference span detection, reference component detection, and reference type classification.
+The BiLSTM model is based on Rodrigues et al. (2018), and like that project, the intention is to implement a MultiTask model which will complete three tasks simultaneously in a single neural network with a stacked CRF: reference span detection (splitting), reference component detection (parsing), and reference type classification (classification).
 
 ### Current status:
 
 |Component|Individual|MultiTask|
 |---|---|---|
-|Spans|✔️ Implemented|❌ Not Implemented|
-|Components|❌ Not Implemented|❌ Not Implemented|
-|Type|❌ Not Implemented|❌ Not Implemented|
+|Spans (splitting)|✔️ Implemented|❌ Not Implemented|
+|Components (parsing)|✔️ Implemented|❌ Not Implemented|
+|Type (classification)|❌ Not Implemented|❌ Not Implemented|
 
 ### The model
 
@@ -29,7 +29,9 @@ The model itself is based on the work of [Rodrigues et al. (2018)](https://githu
 
 ### Performance
 
-#### Span detection
+All scores are reported on the validation set.
+
+#### Span detection (splitting)
 
 |token|f1|support|
 |---|---|---|
@@ -39,13 +41,24 @@ The model itself is based on the work of [Rodrigues et al. (2018)](https://githu
 |o|0.9561|32666|
 |weighted avg|0.9746|129959|
 
+#### Components (parsing)
+
+|token|f1|support|
+|---|---|---|
+|author|0.9467|2818|
+|title|0.8994|4931|
+|year|0.8774|418|
+|o|0.9592|13685|
+|weighted avg|0.9425|21852|
+
 #### Computing requirements
 
 Models are trained on AWS instances using CPU only.
 
 |Model|Time Taken|Instance type|Instance cost (p/h)|Total cost|
 |---|---|---|---|---|
 |Span detection|16:02:00|m4.4xlarge|$0.88|$14.11|
+|Components|11:02:59|m4.4xlarge|$0.88|$9.72|
 
 ## tl;dr: Just get me to the references!
 
@@ -64,10 +77,15 @@ cat > references.txt <<EOF
 EOF
 
 
-# Run the model. This will take a little time while the weights and embeddings
-# are downloaded. The weights are about 300MB, and the embeddings 950MB.
+# Run the splitter model. This will take a little time while the weights and
+# embeddings are downloaded. The weights are about 300MB, and the embeddings
+# 950MB.
 
-python -m deep_reference_parser predict --verbose "$(cat references.txt)"
+python -m deep_reference_parser split "$(cat references.txt)"
+
+# For parsing:
+
+python -m deep_reference_parser parse "$(cat references.txt)"
 ```
 
 ## The longer guide
@@ -133,27 +151,37 @@ $ python -m deep_reference_parser
 Using TensorFlow backend.
 
 ℹ Available commands
-   train, predict
+   parse, split, train
 ```
 
 For additional help, you can pass a command with the `-h`/`--help` flag:
 
 ```
-$ python -m deep_reference_parser predict --help
-Using TensorFlow backend.
-usage: deep_reference_parser predict [-h]
-                                     [-c]
-                                     [-t] [-v]
-                                     text
+$ python -m deep_reference_parser split --help
+usage: deep_reference_parser split [-h]
+                                   [-c]
+                                   [-t] [-o None]
+                                   text
+
+Runs the default splitting model and pretty prints results to the console
+unless --outfile is passed with a path. Can output either tokens (with
+-t|--tokens) or split naively into references based on the b-r tag (default).
+
+NOTE: this function is provided as an example only and should not be used in
+production, as the model is instantiated each time the command is run. To
+use in a production setting, a more sensible approach would be to replicate
+the split or parse functions within your own logic.
+
 
 positional arguments:
   text               Plaintext from which to extract references
 
 optional arguments:
   -h, --help         show this help message and exit
-  -c --config-file   Path to config file
+  -c, --config-file  Path to config file
   -t, --tokens       Output tokens instead of complete references
-  -v, --verbose      Output more verbose results
+  -o, --outfile      Path to json file to which results will be written
+
 ```
 

@@ -192,13 +220,13 @@ February i-r
 If you wish to use the latest model that we have trained, you can simply run:
 
 ```
-python -m deep_reference_parser predict <input text>
+python -m deep_reference_parser split <input text>
 ```
 
 If you wish to use a custom model that you have trained, you must specify the config file which defines the hyperparameters for that model using the `-c` flag:
 
 ```
-python -m deep_reference_parser predict -c new_model.ini <input text>
+python -m deep_reference_parser split -c new_model.ini <input text>
 ```
 
 Use the `-t` flag to return the raw token predictions, and the `-o` flag to write the results to a json file.
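
The NOTE in the help text above advises replicating the split or parse functions in your own code for production use, so the model is built once rather than on every CLI invocation. Below is a minimal, hypothetical sketch of that advice; the loader is a stub, since the real split/parse internals are not part of this diff:

```python
# Hypothetical sketch of the production advice above: build the model once at
# startup and reuse it, rather than shelling out to the CLI, which reloads
# weights and embeddings on every invocation. load_split_model() is a stand-in;
# the real loading code lives in the package's split/parse modules.
from typing import Callable, List


def load_split_model() -> Callable[[str], List[str]]:
    """Stand-in for the expensive one-off model build (weights + embeddings)."""
    return lambda text: [line for line in text.split("\n") if line.strip()]


class ReferenceSplitter:
    def __init__(self):
        self._model = load_split_model()  # pay the load cost exactly once

    def split(self, text: str) -> List[str]:
        return self._model(text)


splitter = ReferenceSplitter()  # instantiate once, e.g. at service startup
for doc in ["ref one\nref two", "ref three"]:
    print(splitter.split(doc))  # cheap repeated calls against the same model
```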

deep_reference_parser/__init__.py

Lines changed: 29 additions & 5 deletions
@@ -1,15 +1,39 @@
+# Tensorflow and Keras emit a very large number of warnings that are very
+# distracting on the command line. These lines here (while undesirable)
+# reduce the level of verbosity.
+
 import sys
 import warnings
+import os
+
+os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
 
 if not sys.warnoptions:
     warnings.filterwarnings("ignore", category=FutureWarning)
+    warnings.filterwarnings("ignore", category=DeprecationWarning)
+    warnings.filterwarnings("ignore", category=UserWarning)
+
+import tensorflow as tf
 
+tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
+
+from .common import download_model_artefact
 from .deep_reference_parser import DeepReferenceParser
 from .logger import logger
 from .model_utils import get_config
-from .reference_utils import (break_into_chunks, labels_to_prodigy, load_data,
-                              load_tsv, prodigy_to_conll, prodigy_to_lists,
-                              read_jsonl, read_pickle, write_json, write_jsonl,
-                              write_pickle, write_to_csv, write_txt)
+from .reference_utils import (
+    break_into_chunks,
+    labels_to_prodigy,
+    load_data,
+    load_tsv,
+    prodigy_to_conll,
+    prodigy_to_lists,
+    read_jsonl,
+    read_pickle,
+    write_json,
+    write_jsonl,
+    write_pickle,
+    write_to_csv,
+    write_txt,
+)
 from .tokens_to_references import tokens_to_references
-from .common import download_model_artefact
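
Note that `TF_CPP_MIN_LOG_LEVEL` is read when the TensorFlow native library is loaded, which is why the hunk above sets the environment variable before `import tensorflow`; setting it after the import has no effect on the C++ log output.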

deep_reference_parser/__main__.py

Lines changed: 4 additions & 2 deletions
@@ -10,10 +10,12 @@
 import sys
 from wasabi import msg
 from .train import train
-from .predict import predict
+from .split import split
+from .parse import parse
 
 commands = {
-    "predict": predict,
+    "split": split,
+    "parse": parse,
     "train": train,
 }
 
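
For context, a commands table like the one above is typically consumed by a small dispatcher. The real dispatch code in `__main__.py` is outside this hunk, so the following is only an illustrative sketch, with stub functions standing in for the real `split`, `parse`, and `train`:

```python
# Illustrative sketch only: dispatching a subcommand from a commands dict.
# The stubs below stand in for the real split/parse/train callables.
import sys

from wasabi import msg  # wasabi is the messaging library the module imports


def split():
    print("splitting...")


def parse():
    print("parsing...")


def train():
    print("training...")


commands = {"split": split, "parse": parse, "train": train}

if __name__ == "__main__":
    if len(sys.argv) < 2 or sys.argv[1] not in commands:
        msg.info("Available commands")
        print(", ".join(sorted(commands)))
        sys.exit(1)
    commands[sys.argv.pop(1)]()  # pop so each command can parse its own args
```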

deep_reference_parser/__version__.py

Lines changed: 3 additions & 2 deletions
@@ -1,8 +1,9 @@
 __name__ = "deep_reference_parser"
-__version__ = "2020.2.0"
+__version__ = "2020.3.0"
 __description__ = "Deep learning model for finding and parsing references"
 __url__ = "https://github.com/wellcometrust/deep_reference_parser"
 __author__ = "Wellcome Trust DataLabs Team"
 __author_email__ = "Grp_datalabs-datascience@Wellcomecloud.onmicrosoft.com"
 __license__ = "MIT"
-__model_version__ = "2019.12.0_splitting"
+__splitter_model_version__ = "2019.12.0_splitting"
+__parser_model_version__ = "2020.3.2_parsing"

deep_reference_parser/common.py

Lines changed: 15 additions & 10 deletions
@@ -6,15 +6,16 @@
 from urllib import parse, request
 
 from .logger import logger
-from .__version__ import __model_version__
+from .__version__ import __splitter_model_version__, __parser_model_version__
+
 
 def get_path(path):
-    return os.path.join(
-        os.path.dirname(__file__),
-        path
-    )
+    return os.path.join(os.path.dirname(__file__), path)
+
+
+SPLITTER_CFG = get_path(f"configs/{__splitter_model_version__}.ini")
+PARSER_CFG = get_path(f"configs/{__parser_model_version__}.ini")
 
-LATEST_CFG = get_path(f'configs/{__model_version__}.ini')
 
 def download_model_artefact(artefact, s3_slug):
     """ Checks if model artefact exists and downloads if not
@@ -38,19 +39,23 @@ def download_model_artefact(artefact, s3_slug):
 
     request.urlretrieve(url, artefact)
 
+
 def download_model_artefacts(model_dir, s3_slug, artefacts=None):
     """
     """
 
     if not artefacts:
 
         artefacts = [
-            "char2ind.pickle", "ind2label.pickle", "ind2word.pickle",
-            "label2ind.pickle", "maxes.pickle", "weights.h5",
-            "word2ind.pickle"
+            "char2ind.pickle",
+            "ind2label.pickle",
+            "ind2word.pickle",
+            "label2ind.pickle",
+            "maxes.pickle",
+            "weights.h5",
+            "word2ind.pickle",
         ]
 
     for artefact in artefacts:
         artefact = os.path.join(model_dir, artefact)
         download_model_artefact(artefact, s3_slug)
-
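
A sketch of how the new `SPLITTER_CFG`/`PARSER_CFG` constants can drive artefact downloads. `configparser` is used directly here because the configs are plain `.ini` files; `get_config` from `model_utils` may wrap this, but its signature is not shown in this diff:

```python
# Sketch: read the parser config and fetch its model artefacts if missing.
# Both config keys used below appear in the 2020.3.2_parsing config that
# this commit adds.
import configparser

from deep_reference_parser.common import PARSER_CFG, download_model_artefacts

config = configparser.ConfigParser()
config.read(PARSER_CFG)

model_dir = config["build"]["output_path"]  # models/parsing/2020.3.2_parsing/
s3_slug = config["data"]["s3_slug"]

# Downloads weights.h5, the *2ind.pickle indices, etc. into model_dir,
# skipping any artefact that already exists locally.
download_model_artefacts(model_dir, s3_slug)
```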
deep_reference_parser/configs/2020.3.2_parsing.ini

Lines changed: 39 additions & 0 deletions

@@ -0,0 +1,39 @@
+[DEFAULT]
+version = 2020.3.2_parsing
+description = First experiment which includes Reach labelled data in the
+	training set. All annotated parsing data were combined, and then split using
+	a 50% (train), 25% (test), 25% (valid) split. The Rodrigues data is then
+	added to the training set to bulk it out.
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 250
+policy_train = data/parsing/2020.3.2_parsing_train.tsv
+policy_test = data/parsing/2020.3.2_parsing_test.tsv
+policy_valid = data/parsing/2020.3.2_parsing_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/parsing/2020.3.2_parsing/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 10
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
+
+[evaluate]
+out_file = evaluation_data.tsv
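
The `description` in `[DEFAULT]` specifies a 50% (train), 25% (test), 25% (valid) split of the combined annotated data. An illustrative sketch of such a split (not the project's actual preprocessing code):

```python
# Illustrative only: a 50/25/25 train/test/valid split as described in the
# config above. The real preprocessing pipeline is not part of this diff.
import random


def train_test_valid_split(examples, test_proportion=0.25,
                           valid_proportion=0.25, seed=42):
    """Shuffle and split annotated examples into train/test/valid lists."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_proportion)
    n_valid = int(len(examples) * valid_proportion)
    test = examples[:n_test]
    valid = examples[n_test:n_test + n_valid]
    train = examples[n_test + n_valid:]  # the remaining ~50%
    return train, test, valid


# 100 dummy examples -> 50 train, 25 test, 25 valid
train, test, valid = train_test_valid_split(range(100))
print(len(train), len(test), len(valid))  # 50 25 25
```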
