
Refactored Image and Text combination to use placeholder tokens instead of simple concatenation #82


Merged: 28 commits into main, Jun 4, 2025

Conversation

lusxvr (Member) commented May 27, 2025

No description provided.

lusxvr commented May 28, 2025

[Image: training curves, baseline (blue) vs. new embedding handling (red)]

First training runs with the new embedding handling are in!

Blue: baseline run with current state of the main branch
Red: new embedding handling, otherwise same config as the other run

We can see that the losses track each other very closely. Considering that the whole tokenization and image/text combination logic was reworked in this PR, I think this is exactly what we are looking for. There is still a small difference, but this could simply be due to the fact that we add additional <image_start> and <image_end> tokens to the input_ids, so the model does not see exactly the same inputs as it did before the refactor.
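For illustration, a minimal sketch of what the input sequence looks like with the extra markers (the token names and ids here are hypothetical placeholders, not the PR's actual values):

```python
# Hypothetical special-token ids; the real ones come from the tokenizer config.
IMAGE_START, IMAGE_TOKEN, IMAGE_END = 32001, 32000, 32002
NUM_PATCHES = 64  # one placeholder per vision patch embedding

def build_input_ids(prompt_ids: list[int]) -> list[int]:
    """Prepend an explicit image span to the text prompt."""
    image_span = [IMAGE_START] + [IMAGE_TOKEN] * NUM_PATCHES + [IMAGE_END]
    # The start/end markers are inputs the baseline run never saw,
    # which would explain a small, persistent loss offset.
    return image_span + prompt_ids
```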

I would be interested to hear your take on this, @andimarafioti: do you think this is something we have to investigate further, or will a complete match of the loss curves never be achieved with a refactor of this scope anyway, making the difference acceptable?

Additionally, we can see that our current accuracy measurement is really not a good metric: both the current main and the feature branch perform similarly poorly, possibly because I did not pay any attention to the actual value of the learning rate (as long as it was the same between the two runs). Nevertheless, I wanted to include it for completeness.

This new implementation is marginally slower (~2%) than the old one. I believe this is because you cannot beat the computational complexity of simply concatenating the embeddings; finding and replacing the right positions takes a bit more effort. I am trying to see if I can improve this further, but I think it would be acceptable, considering the amount of flexibility for better packing strategies we gain with it.
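To make the overhead concrete, here is a toy PyTorch sketch of the two approaches (reusing the hypothetical IMAGE_TOKEN from above; this is not the PR's actual code):

```python
import torch

IMAGE_TOKEN = 32000  # hypothetical placeholder id
B, P, T, D = 2, 4, 6, 8  # batch, image patches, text tokens, embed dim
image_embeds = torch.randn(B, P, D)
text_embeds = torch.randn(B, P + T, D)  # placeholder slots already allocated
input_ids = torch.ones(B, P + T, dtype=torch.long)
input_ids[:, :P] = IMAGE_TOKEN  # placeholders at the front in this toy case

# Old approach: a single concatenation, image position fixed by construction.
merged_old = torch.cat([image_embeds, text_embeds[:, P:]], dim=1)

# New approach: an extra comparison pass to locate the placeholders,
# then a scatter into those slots; slightly costlier, but the image
# span can now sit anywhere in the sequence.
mask = (input_ids == IMAGE_TOKEN).unsqueeze(-1).expand_as(text_embeds)
merged_new = text_embeds.masked_scatter(mask, image_embeds.reshape(-1, D))
```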


lusxvr commented Jun 3, 2025

The new embedding handling logic is ready! We trained nanoVLM-450M with it, and it achieves the same accuracy as the main branch with a slightly higher loss (which makes sense, since we now add indicative tokens marking the image position).

[Image: training curves for nanoVLM-450M, main branch vs. new embedding handling]

lusxvr marked this pull request as ready for review on June 3, 2025 at 10:01
andimarafioti (Member) left a comment


Good work! I think this is 90% there, and I just added a few comments for improvements. The main biggie is that it looks to me like you're predicting the image tokens :S
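(For context, a common way to stop the model from being trained to predict image tokens, sketched with the same hypothetical token ids as above, is to mask those positions out of the labels so the cross-entropy loss ignores them:)

```python
import torch

IMAGE_START, IMAGE_TOKEN, IMAGE_END = 32001, 32000, 32002  # hypothetical ids
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_image_labels(input_ids: torch.Tensor) -> torch.Tensor:
    """Build labels from input_ids, excluding image positions from the loss."""
    labels = input_ids.clone()
    image_positions = (
        (input_ids == IMAGE_TOKEN)
        | (input_ids == IMAGE_START)
        | (input_ids == IMAGE_END)
    )
    labels[image_positions] = IGNORE_INDEX
    return labels
```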

lusxvr merged commit 731d8c6 into main on Jun 4, 2025