README.md: 5 additions & 0 deletions
```diff
@@ -13,6 +13,11 @@
 
 ---
 
+> [!NOTE]
+> We have pushed some breaking changes to the repository on June 4. To enable smarter packing, we refactored the way image and text embeddings are combined. To keep the transition as smooth as possible, we trained a new nanoVLM-450M with the new pipeline, while leaving the old nanoVLM-222M compatible with the old pipeline. If you clone this repository now or pull the updates to your local machine, the default will be the new 450M model. If you prefer a simpler codebase that is easier to follow, use the v0.1 release, which works out of the box with the old 222M model.
+
+---
+
 nanoVLM is the simplest repository for training/finetuning a small sized Vision-Language Model with a lightweight implementation in pure PyTorch. The code itself is very readable and approachable: the model consists of a Vision Backbone (`models/vision_transformer.py`, ~150 lines), a Language Decoder (`models/language_model.py`, ~250 lines), a Modality Projection (`models/modality_projection.py`, ~50 lines) and the VLM itself (`models/vision_language_model.py`, ~100 lines), plus a simple training loop (`train.py`, ~200 lines).
 
 Similar to Andrej Karpathy's nanoGPT, we wanted to equip the community with a very simple implementation and training script for Vision Language Models. We do not claim this to be a new SOTA model, rather an educational effort that packs quite a bit of punch if you have the right hardware! You should be able to tweak and play around with the code in no time.
```
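To make the composition described above concrete, here is a minimal, illustrative sketch of how the pieces fit together. The class and argument names (`TinyVLM`, `vision_backbone`, `modality_projection`, `language_decoder`) and the image-then-text concatenation order are placeholders and assumptions, not the repository's actual implementation; the real modules live in the files listed above.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative sketch only: mirrors the structure described above, not the repo's code."""

    def __init__(self, vision_backbone: nn.Module, modality_projection: nn.Module,
                 language_decoder: nn.Module):
        super().__init__()
        self.vision_backbone = vision_backbone          # cf. models/vision_transformer.py
        self.modality_projection = modality_projection  # cf. models/modality_projection.py
        self.language_decoder = language_decoder        # cf. models/language_model.py

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        img_feats = self.vision_backbone(images)          # image -> patch features
        img_embeds = self.modality_projection(img_feats)  # project into the LM embedding space
        combined = torch.cat([img_embeds, text_embeds], dim=1)  # assumed order: image tokens first
        return self.language_decoder(combined)            # decode next-token logits
```

The refactor mentioned in the note above changes how this combination step works, which is why the 222M and 450M checkpoints are tied to their respective pipelines.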
```diff
-# The tokenizer has different behavior for padding and truncation:
-# 1. If the full text (answer + question) is shorter than the max length, it gets padded on the left
-# 2. If the full text is longer than the max length, it gets truncated on the right
-# Therefore, I need to handle multiple cases; these are the different scenarios:
-# If the full text is longer than the max length, we need to set the labels to -100 for the whole sample (we want to ignore the whole sample)
-# If the full text is shorter than the max length, we need to set the labels to -100 only for the question part, and create causal language modeling labels for the answer part, taking into account the padding
+labels[:, :-1] = input_ids[:, 1:].clone()  # Shift labels for causal LM
+labels[:, -1] = -100  # Last token has no target
 
-# Determine if sequences were truncated
+# Determine original lengths before padding/truncation to handle truncation cases
```
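The effect of the two added lines is easiest to see on a toy batch. The snippet below is a standalone illustration of the shift-by-one causal labels with -100 as the ignore index; it is not the repository's collator, and the additional masking of padding and question tokens described in the removed comments is omitted.

```python
import torch

# Toy batch: one sequence of five token ids
input_ids = torch.tensor([[11, 12, 13, 14, 15]])
labels = torch.full_like(input_ids, -100)

labels[:, :-1] = input_ids[:, 1:].clone()  # position t is trained to predict token t+1
labels[:, -1] = -100                       # the last position has no next token to predict

print(labels)  # tensor([[  12,   13,   14,   15, -100]])
```

Positions labeled -100 are skipped by PyTorch's cross-entropy loss (its default `ignore_index`), so they contribute nothing to the gradient.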
```diff
+extra_token_amount: int = 1  # Number of extra tokens for the VLM (image start, image end, image token)
+lm_vocab_size: int = lm_base_vocab_size + extra_token_amount  # Not a great way to do this, but it works for now (vlm_extra_tokens cannot be a dict, since this is mutable, and a Field has no len() function)
 lm_n_heads: int = 9
 lm_n_kv_heads: int = 3
 lm_dropout: float = 0.0
 lm_n_blocks: int = 30
 lm_attn_scaling: float = 1.0
-IMAGE_TOKEN_LENGTH: int = 49
-TOTAL_SEQUENCE_LENGTH: int = 128
-lm_max_length: int = TOTAL_SEQUENCE_LENGTH - IMAGE_TOKEN_LENGTH  # Maximum length for the language model, derived from TOTAL_SEQUENCE_LENGTH and IMAGE_TOKEN_LENGTH
+lm_max_length: int = 512
 lm_use_tokens: bool = False  # Decide if the LM expects tokens or embeddings as input (if using as a backbone for the VLM, set to False)
 lm_tie_weights: bool = True  # Decide if you want to tie the LM Head weight to the token embedding weights
```
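For orientation, the sketch below isolates the two changes in this hunk: the derived vocabulary size and the move from a derived text budget to a fixed `lm_max_length`. The base vocabulary size of 49152 is a made-up placeholder and `VLMConfigSketch` is not the repository's config class; only the derivation pattern and the old 128 - 49 split come from the diff above.

```python
from dataclasses import dataclass

@dataclass
class VLMConfigSketch:
    # Placeholder values; the real numbers live in the repository's config
    lm_base_vocab_size: int = 49152
    extra_token_amount: int = 1                                   # extra special tokens for the VLM
    lm_vocab_size: int = lm_base_vocab_size + extra_token_amount  # evaluated once, when the class body runs
    lm_max_length: int = 512                                      # new: fixed text-token budget

# Old scheme: the text budget was derived from a fixed total sequence length
IMAGE_TOKEN_LENGTH = 49
TOTAL_SEQUENCE_LENGTH = 128
old_lm_max_length = TOTAL_SEQUENCE_LENGTH - IMAGE_TOKEN_LENGTH  # 128 - 49 = 79 text tokens

print(VLMConfigSketch().lm_vocab_size, old_lm_max_length)  # 49153 79
```

Because the sum is computed when the class body runs, overriding `lm_base_vocab_size` on an instance does not update `lm_vocab_size`; that appears to be the limitation the "not a great way to do this" comment alludes to.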