
ArboEL

(Entity Linking & Discovery via Arborescence-based Supervised Clustering)


Thanks to BLINK for the infrastructure of this project!

Overview

Our paper: https://arxiv.org/abs/2109.01242

ArboEL is an entity linking and discovery system. It trains a bi-encoder BERT model with a supervised clustering objective based on directed MSTs (arborescences), coupled with a transductive graph-partitioning inference routine that makes predictions by jointly considering links between mentions as well as between mentions and entities.
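To make the objective concrete, here is a minimal sketch of spanning-forest clustering over a joint mention/entity similarity graph. It is illustrative only: the function name, the SciPy-based routine, and the threshold are our assumptions, not the repository's API.

    # Illustrative sketch only -- not ArboEL's implementation. Clusters
    # mention and entity nodes via a maximum spanning forest over their
    # pairwise similarities.
    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    def mst_clusters(similarity: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """similarity: (n, n) pairwise scores over mention and entity nodes;
        zeros are treated as missing edges. `threshold` is a hypothetical
        cut-off. Returns one cluster label per node."""
        # SciPy computes a *minimum* spanning tree, so negate the scores
        # to obtain a maximum spanning tree of the similarity graph.
        mst = minimum_spanning_tree(csr_matrix(-similarity))
        # Cut edges whose similarity falls below the threshold, splitting
        # the tree into a forest; ArboEL's actual partitioning additionally
        # enforces constraints on how mentions attach to entities.
        mst.data[mst.data > -threshold] = 0.0
        mst.eliminate_zeros()
        _, labels = connected_components(mst, directed=False)
        return labels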

Citing

If you use ArboEL, please cite the following paper:

@misc{agarwal2021entity,
      title={Entity Linking and Discovery via Arborescence-based Supervised Clustering}, 
      author={Dhruv Agarwal and Rico Angell and Nicholas Monath and Andrew McCallum},
      year={2021},
      eprint={2109.01242},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Setting up

  • Install conda (we recommend miniconda)
  • Create an environment and install dependencies
conda create -n blink37 -y python=3.7
    conda activate blink37
    pip install -r requirements.txt
    conda install cython pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
  • Build the cluster-linking special_partition function (from Cython)
    cd blink/biencoder/special_partition; python setup.py build_ext --inplace
  • Our setup assumes GPU availability (a quick sanity-check sketch follows this list)
    • The code for our paper was run on two NVIDIA Quadro RTX 8000 GPUs
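After setup, a quick sanity check can confirm that GPUs are visible and that the Cython extension built. This snippet is hypothetical (not part of the repository), and the module import path is our guess based on the build directory above:

    # Hypothetical post-setup sanity check -- not part of the repository.
    import torch

    print("CUDA available:", torch.cuda.is_available())
    print("GPU count:", torch.cuda.device_count())

    # Import path is an assumption based on the build directory above.
    from blink.biencoder.special_partition import special_partition  # noqa: F401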

Datasets

  • MedMentions (full): The MedMentions corpus consists of 4,392 papers (titles and abstracts) randomly selected from those released on PubMed in 2016 that are in the biomedical field, published in English, and have both a title and an abstract.
  • ZeShEL: The Zero-Shot Entity Linking dataset was constructed from multiple sub-domains of Wikia (FANDOM), with labeled mentions extracted automatically via hyperlinks.

Pre-processing

  • For MedMentions
    # Create the entity dictionary
    python blink/preprocess/medmentions_dictionary.py
    # Pre-process the query mentions
    python blink/preprocess/medmentions_preprocess.py
  • For ZeShEL
    # Create the entity dictionary
    python blink/preprocess/zeshel_dictionary.py
    # Pre-process the query mentions
    python blink/preprocess/zeshel_preprocess.py

Bi-encoder Training

MST

Example command for MedMentions

python blink/biencoder/train_biencoder_mst.py --bert_model=models/biobert-base-cased-v1.1 --data_path=data/medmentions/processed --output_path=models/trained/medmentions_mst/pos_neg_loss/no_type --pickle_src_path=models/trained/medmentions --num_train_epochs=5 --train_batch_size=128 --gradient_accumulation_steps=4 --eval_interval=10000 --pos_neg_loss --force_exact_search --embed_batch_size=3500 --data_parallel

k-NN negatives

Example command for MedMentions

python blink/biencoder/train_biencoder_mult.py --bert_model=models/biobert-base-cased-v1.1 --data_path=data/medmentions/processed --output_path=models/trained/medmentions/pos_neg_loss/no_type --pickle_src_path=models/trained/medmentions --num_train_epochs=5 --train_batch_size=128 --gradient_accumulation_steps=4 --eval_interval=10000 --pos_neg_loss --force_exact_search --embed_batch_size=3500 --data_parallel

In-batch negatives

Example command for MedMentions

python blink/biencoder/train_biencoder.py --bert_model=models/biobert-base-cased-v1.1 --num_train_epochs=5 --data_path=data/medmentions/processed --output_path=models/trained/medmentions_blink --data_parallel --train_batch_size=128 --eval_batch_size=128 --eval_interval=10000

Bi-encoder Inference

Linking

Example command for MedMentions

python blink/biencoder/eval_cluster_linking.py --bert_model=models/biobert-base-cased-v1.1 --data_path=data/medmentions/processed --output_path=models/trained/medmentions_mst/eval/pos_neg_loss/no_type/wo_type --pickle_src_path=models/trained/medmentions/eval --path_to_model=models/trained/medmentions_mst/pos_neg_loss/no_type/epoch_best_5th/pytorch_model.bin --recall_k=64 --embed_batch_size=3500 --force_exact_search --data_parallel
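The --recall_k=64 flag sets how many candidate entities are retrieved per mention; recall@k then measures how often the gold entity appears among the top-k candidates. A minimal sketch of that metric (function and argument names are hypothetical):

    import numpy as np

    def recall_at_k(candidates: np.ndarray, gold: np.ndarray, k: int = 64) -> float:
        """candidates: (n_mentions, >= k) entity ids ranked by score;
        gold: (n_mentions,) gold entity id per mention."""
        hits = (candidates[:, :k] == gold[:, None]).any(axis=1)
        return float(hits.mean())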

Discovery

Example command for MedMentions

python blink/biencoder/eval_entity_discovery.py --bert_model=models/biobert-base-cased-v1.1 --data_path=data/medmentions/processed --output_path=models/trained/medmentions_mst/eval/pos_neg_loss/directed --pickle_src_path=models/trained/medmentions/eval --embed_data_path=models/trained/medmentions_mst/eval/pos_neg_loss --use_types --force_exact_search --graph_mode=directed --exact_threshold=127.87733985396665 --exact_knn=8 --data_parallel
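The --exact_knn=8 and --exact_threshold flags suggest that discovery builds a k-nearest-neighbour graph and drops edges that score below the threshold. A rough sketch under those assumptions (inner-product scoring and the function name are our guesses, not the repository's code):

    import numpy as np

    def thresholded_knn_edges(embeds: np.ndarray, k: int = 8,
                              threshold: float = 127.877) -> list:
        """Directed (i, j) edges from each node to its top-k neighbours
        whose score clears the threshold. Inner-product scoring is an
        assumption made for illustration."""
        scores = embeds @ embeds.T
        np.fill_diagonal(scores, -np.inf)  # forbid self-edges
        topk = np.argpartition(-scores, k, axis=1)[:, :k]
        return [(i, j) for i in range(len(embeds))
                for j in topk[i] if scores[i, j] >= threshold]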

Questions / Feedback

If you have any questions, comments, or feedback on our work, you can reach out at dagarwal@cs.umass.edu or open a GitHub issue.

License

ArboEL is MIT licensed. See the LICENSE file for details.
