
Refactored Image and Text combination to use placeholder tokens instead of simple concatenation #82


Merged: 28 commits into main, Jun 4, 2025

Conversation

lusxvr (Member) commented May 27, 2025

No description provided.

lusxvr commented May 28, 2025

[Image: training curves, baseline (blue) vs. new embedding handling (red)]

First training runs with the new embedding handling are in!

Blue: baseline run with current state of the main branch
Red: new embedding handling, otherwise same config as the other run

We can see that the losses track each other very closely. Considering that the whole tokenization and image/text combination logic was reworked in this PR, I think this is exactly what we are looking for. There is still a small difference, but this could simply be due to the fact that we add additional <image_start> and <image_end> tokens to the input_ids, so the model does not see exactly the same inputs as it did before the refactor.
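For illustration, a minimal sketch of what the input sequence looks like with the extra markers (the token names and ids here are hypothetical placeholders, not the PR's actual values):

```python
# Hypothetical special-token ids; the real ones come from the tokenizer config.
IMAGE_START, IMAGE_TOKEN, IMAGE_END = 32001, 32000, 32002
NUM_PATCHES = 64  # one placeholder per vision patch embedding

def build_input_ids(prompt_ids: list[int]) -> list[int]:
    """Prepend an explicit image span to the text prompt."""
    image_span = [IMAGE_START] + [IMAGE_TOKEN] * NUM_PATCHES + [IMAGE_END]
    # The start/end markers are inputs the baseline run never saw,
    # which would explain a small, persistent loss offset.
    return image_span + prompt_ids
```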

I would be interested to hear your take on this, @andimarafioti: do you think this is something we have to investigate further, or will a complete match of the loss curves never be achieved with a refactor of this scope anyway, making the difference acceptable?

Additionally, we can see that our current accuracy measurement is really not a good metric: both the current main and the feature branch perform similarly poorly, possibly because I did not pay any attention to the actual value of the learning rate (as long as it was the same between the two runs). Nevertheless, I wanted to include it for completeness.

This new implementation is marginally slower (~2%) than the old one. I believe this is because you cannot beat the computational complexity of simply concatenating the embeddings; finding and replacing the right positions takes a bit more effort. I am trying to see if I can improve this further, but I think it would be acceptable, considering the amount of flexibility for better packing strategies we gain with it.
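To make the overhead concrete, here is a toy PyTorch sketch of the two approaches (reusing the hypothetical IMAGE_TOKEN from above; this is not the PR's actual code):

```python
import torch

IMAGE_TOKEN = 32000  # hypothetical placeholder id
B, P, T, D = 2, 4, 6, 8  # batch, image patches, text tokens, embed dim
image_embeds = torch.randn(B, P, D)
text_embeds = torch.randn(B, P + T, D)  # placeholder slots already allocated
input_ids = torch.ones(B, P + T, dtype=torch.long)
input_ids[:, :P] = IMAGE_TOKEN  # placeholders at the front in this toy case

# Old approach: a single concatenation, image position fixed by construction.
merged_old = torch.cat([image_embeds, text_embeds[:, P:]], dim=1)

# New approach: an extra comparison pass to locate the placeholders,
# then a scatter into those slots; slightly costlier, but the image
# span can now sit anywhere in the sequence.
mask = (input_ids == IMAGE_TOKEN).unsqueeze(-1).expand_as(text_embeds)
merged_new = text_embeds.masked_scatter(mask, image_embeds.reshape(-1, D))
```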


lusxvr commented Jun 3, 2025

The new embedding handling logic is ready! We trained nanoVLM-450M with it, and it achieves the same accuracy as the main branch with a slightly higher loss (which makes sense, since we now add indicative tokens marking the image position).

[Image: training curves for nanoVLM-450M, main branch vs. new embedding handling]

lusxvr marked this pull request as ready for review on June 3, 2025 at 10:01
andimarafioti (Member) left a comment


Good work! I think this is 90% there, and I just added a few comments for improvements. The main biggie is that it looks to me like you're predicting the image tokens :S
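(For context, a common way to stop the model from being trained to predict image tokens, sketched with the same hypothetical token ids as above, is to mask those positions out of the labels so the cross-entropy loss ignores them:)

```python
import torch

IMAGE_START, IMAGE_TOKEN, IMAGE_END = 32001, 32000, 32002  # hypothetical ids
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_image_labels(input_ids: torch.Tensor) -> torch.Tensor:
    """Build labels from input_ids, excluding image positions from the loss."""
    labels = input_ids.clone()
    image_positions = (
        (input_ids == IMAGE_TOKEN)
        | (input_ids == IMAGE_START)
        | (input_ids == IMAGE_END)
    )
    labels[image_positions] = IGNORE_INDEX
    return labels
```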

lusxvr merged commit 731d8c6 into main on Jun 4, 2025