Refactored Image and Text combination to use placeholder tokens instead of simple concatenation #82
Conversation
First training runs with the new embedding handling are in!

Blue: baseline run with the current state of the main branch.

We can see that the losses follow each other very closely. Considering that the whole tokenization and image/text combination logic was reworked in this PR, I think this is exactly what we are looking for. While there is still a small difference, this could simply be due to the fact that we add additional tokens. I would be interested to hear your take on this @andimarafioti: do you think this is something we have to investigate further, or will a complete match of the loss curves never be achieved with a refactoring of this scope anyway, so that this is fine?

Additionally, we can see that our current accuracy measurement is really not a good metric: both the current main and the feature branch perform similarly badly, possibly because I did not pay any attention to the actual value of the learning rate (as long as it was the same between the two runs). Nevertheless, I wanted to include it for completeness.

This new implementation is marginally slower (~2%) than the old one. I believe this is because you cannot beat the computational complexity of simply concatenating the embeddings; finding and replacing the right tokens takes a bit more effort. I am trying to see if I can improve this further, but I think it would be acceptable, considering the amount of flexibility for better packing strategies we gain with this.
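For readers following along, the placeholder-token idea can be sketched roughly like this. Note this is a minimal, framework-agnostic illustration, not the PR's actual implementation; the function name, token id, and list-based "embeddings" are all hypothetical:

```python
def merge_image_embeddings(input_ids, text_embeds, image_embeds, image_token_id):
    """Replace the embedding at each placeholder-token position with the next
    image embedding, in order, instead of concatenating the two sequences.

    input_ids:    list of token ids for the full (text + placeholder) sequence
    text_embeds:  one embedding vector per position in input_ids
    image_embeds: one embedding vector per image patch / placeholder token
    """
    image_iter = iter(image_embeds)
    return [
        next(image_iter) if tok == image_token_id else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]

# Toy example: sequence of 4 tokens, placeholder id 9 at positions 1 and 2.
merged = merge_image_embeddings(
    input_ids=[5, 9, 9, 7],
    text_embeds=[[0, 0], [0, 0], [0, 0], [0, 0]],
    image_embeds=[[1, 1], [2, 2]],
    image_token_id=9,
)
# merged == [[0, 0], [1, 1], [2, 2], [0, 0]]
```

This is also where the small ~2% slowdown mentioned above comes from: plain concatenation is a single copy, whereas the replacement approach has to locate the placeholder positions first.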
Good work! I think this is 90% there, and I just added a few comments for improvements. The main biggie is that it looks to me like you're predicting the image tokens :S