-
Here are two attempts with flash attention enabled: the model's questions are irrelevant, and the conversation completely loses focus. And here's a run with flash attention disabled, with otherwise identical settings: it guesses the right country within two questions.
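If anyone wants to reproduce this, here's a minimal A/B sketch using llama-cpp-python, assuming a recent build that exposes the `flash_attn` constructor flag; the model path is a placeholder, and greedy decoding is used so the two runs differ only in the attention path.

```python
# Minimal A/B harness: run the same prompt with flash attention on and off.
from llama_cpp import Llama

PROMPT = "Let's play 20 questions. I'm thinking of a country. Ask your first question."

def run(flash_attn: bool) -> str:
    llm = Llama(
        model_path="./models/llama3-8b.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,
        n_gpu_layers=-1,       # offload all layers to the GPU
        flash_attn=flash_attn,
        seed=42,               # fixed seed for reproducibility
        verbose=False,
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,       # greedy decoding to make the runs comparable
        max_tokens=128,
    )
    return out["choices"][0]["message"]["content"]

print("FA on :", run(True))
print("FA off:", run(False))
```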
-
Tested on an NVIDIA RTX 4000 SFF Ada Generation, an RTX 3060, and a GTX 1080. All seem to be affected.
-
It seems the way they implemented it for GPUs older than Ampere makes it trade quality for speed.
-
Based on my own observations and user feedback for my application, I've noticed a significant degradation of output quality when flash attention is enabled. This shows up on both Llama3-8B and Falcon-7B base models, both running on GPU.
The application makes continuous use of context shifting. Is there any known interaction issue between context shifting and flash attention?
In particular, when flash attention is enabled, completions become extremely vague or superficially reactive, and the model randomly switches topic.
Any similar experiences, or any thoughts on the combination of context shifting and flash attention? A rough test harness I've been using is sketched below.
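The sketch forces frequent "shifts" by using a deliberately small window and a long multi-turn exchange, then compares flash attention on vs. off. Note the shifting here is a prompt-level approximation (drop the oldest turns, keep the system prompt), not llama.cpp's internal KV-cache shift; the model path and window sizes are placeholders.

```python
# Stress test: small context window + long conversation, flash attention on/off.
from llama_cpp import Llama

SYSTEM = {"role": "system", "content": "You are a focused assistant."}
TURNS = [f"Question {i}: name one fact about country number {i}." for i in range(1, 21)]

def run(flash_attn: bool) -> list[str]:
    llm = Llama(
        model_path="./models/llama3-8b.Q4_K_M.gguf",  # placeholder path
        n_ctx=1024,            # deliberately small so shifting triggers often
        n_gpu_layers=-1,
        flash_attn=flash_attn,
        seed=42,
        verbose=False,
    )
    history, replies = [SYSTEM], []
    for turn in TURNS:
        history.append({"role": "user", "content": turn})
        # Crude shift: drop the oldest non-system turns as history grows.
        while len(history) > 9:
            del history[1]
        out = llm.create_chat_completion(messages=history, temperature=0.0, max_tokens=64)
        reply = out["choices"][0]["message"]["content"]
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

for fa in (True, False):
    print(f"--- flash_attn={fa} ---")
    for r in run(fa):
        print(r[:80])
```

In my runs, watching where in the conversation the replies start drifting off-topic (relative to when shifting first kicks in) is what suggested the interaction in the first place.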