# Integrating Multimodal Information in Large Pretrained Transformers

Open-source code for the ACL 2020 paper.
## Configure

`global_configs.py` defines global constants such as the dimension of each data modality (text, acoustic, visual) and CPU/GPU settings. It also defines at which layer the MAG is injected. The following is the default configuration:

```python
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["WANDB_PROGRAM"] = "multimodal_driver.py"

DEVICE = torch.device("cuda:0")

# contextualized text embedding from pre-trained BERT / XLNet
TEXT_DIM = 768

# acoustic / visual embedding dimensions from the MOSI dataset
ACOUSTIC_DIM = 74
VISUAL_DIM = 47

XLNET_INJECTION_INDEX = 1
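If you need to run on CPU only, one option is to edit the corresponding lines in `global_configs.py`. This is a sketch under the assumption that the rest of the code reads the shared `DEVICE` constant from this module, as in the defaults above:

```python
import os
import torch

# Hide all GPUs from CUDA and point the shared DEVICE constant at the CPU,
# so any code that does `tensor.to(DEVICE)` stays on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
DEVICE = torch.device("cpu")
```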
## Training MAG-BERT / MAG-XLNet on MOSI

First, install the Python dependencies:

```
pip install -r requirements.txt
```

Training scripts:

- MAG-BERT: `python multimodal_driver.py --model bert-base-uncased`
- MAG-XLNet: `python multimodal_driver.py --model xlnet-base-cased`
By default, `multimodal_driver.py` will attempt to create a Weights & Biases (W&B) project to log your runs and results. If you wish to disable W&B logging, set the environment variable `WANDB_MODE=dryrun`.
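For example, one way to do this (a sketch; it assumes `global_configs.py` is imported before W&B is initialized, as in the default setup above) is to set the variable next to the other `os.environ` entries in `global_configs.py`:

```python
import os

# Tell wandb to run in dry-run (offline) mode so no online project is created.
# Equivalently, export WANDB_MODE=dryrun in your shell before launching
# multimodal_driver.py.
os.environ["WANDB_MODE"] = "dryrun"
```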
## Model usage

We would like to thank huggingface for providing and open-sourcing the BERT / XLNet code used in developing our models. Note that `bert.py` / `xlnet.py` are based on huggingface's implementation.
### MAG

```python
from modeling import MAG

hidden_size, beta_shift, dropout_prob = 768, 1e-3, 0.5

multimodal_gate = MAG(hidden_size, beta_shift, dropout_prob)
fused_embedding = multimodal_gate(text_embedding, visual_embedding, acoustic_embedding)
```
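As a shape sanity check, the following sketch runs random tensors through the gate. The batch and sequence sizes are made up for illustration; the modality dimensions follow the MOSI defaults above. The fused output should keep the text embedding's shape:

```python
import torch
from modeling import MAG

batch_size, seq_len = 2, 20          # illustrative sizes, not from the repo
TEXT_DIM, VISUAL_DIM, ACOUSTIC_DIM = 768, 47, 74

text_embedding = torch.randn(batch_size, seq_len, TEXT_DIM)
visual_embedding = torch.randn(batch_size, seq_len, VISUAL_DIM)
acoustic_embedding = torch.randn(batch_size, seq_len, ACOUSTIC_DIM)

multimodal_gate = MAG(TEXT_DIM, 1e-3, 0.5)
fused_embedding = multimodal_gate(text_embedding, visual_embedding, acoustic_embedding)
print(fused_embedding.shape)         # expected: (batch_size, seq_len, TEXT_DIM)
```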
### MAG-BERT

```python
from bert import MAG_BertForSequenceClassification


class MultimodalConfig(object):
    def __init__(self, beta_shift, dropout_prob):
        self.beta_shift = beta_shift
        self.dropout_prob = dropout_prob


multimodal_config = MultimodalConfig(beta_shift=1e-3, dropout_prob=0.5)
model = MAG_BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", multimodal_config=multimodal_config, num_labels=1,
)

outputs = model(input_ids, visual, acoustic, attention_mask, position_ids)
logits = outputs[0]
```
### MAG-XLNet

```python
from xlnet import MAG_XLNetForSequenceClassification


class MultimodalConfig(object):
    def __init__(self, beta_shift, dropout_prob):
        self.beta_shift = beta_shift
        self.dropout_prob = dropout_prob


multimodal_config = MultimodalConfig(beta_shift=1e-3, dropout_prob=0.5)
model = MAG_XLNetForSequenceClassification.from_pretrained(
    "xlnet-base-cased", multimodal_config=multimodal_config, num_labels=1,
)

outputs = model(input_ids, visual, acoustic, attention_mask, position_ids)
logits = outputs[0]
```
For MAG-BERT / MAG-XLNet, `visual` and `acoustic` are `torch.FloatTensor`s of shape `(batch_size, sequence_length, modality_dim)`, while `input_ids`, `attention_mask`, and `position_ids` are `torch.LongTensor`s of shape `(batch_size, sequence_length)`. For more details on how these tensors should be formatted / generated, please refer to the `convert_to_features` method in `multimodal_driver.py` and huggingface's documentation.
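To make the expected shapes concrete, here is a minimal sketch with random placeholder inputs. The sizes are made up, `model` is the MAG-BERT instance built above, and real inputs should of course come from the `convert_to_features` pipeline:

```python
import torch

batch_size, seq_len = 2, 50                      # illustrative sizes only
ACOUSTIC_DIM, VISUAL_DIM = 74, 47                # MOSI defaults from global_configs.py

input_ids = torch.randint(0, 30522, (batch_size, seq_len))            # LongTensor of token ids (30522 = BERT-base vocab size)
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)    # 1 = real token, 0 = padding
position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch_size, -1)

visual = torch.randn(batch_size, seq_len, VISUAL_DIM)                 # FloatTensor of per-token visual features
acoustic = torch.randn(batch_size, seq_len, ACOUSTIC_DIM)             # FloatTensor of per-token acoustic features

outputs = model(input_ids, visual, acoustic, attention_mask, position_ids)
logits = outputs[0]                              # shape (batch_size, 1) since num_labels=1
```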
## Contacts

- Kamrul Hasan: mhasan8@cs.rochester.edu
- Wasifur Rahman: rahmanwasifur@gmail.com
- Sangwu Lee: sangwulee2@gmail.com