In the current inference architecture, the ViT and LLM parts preempt each other. How is TTFT guaranteed in this situation?
This part uses a best-effort strategy, so there is no guarantee in the absolute sense.
However, due to the nature of LLM workloads, for most inputs the generation (decode) phase takes the overwhelming majority of the time, while the ViT and GPT context (prefill) phases account for a relatively small share. The generation phase cannot use all of the SMs, so ViT execution during that phase has little effect on it. And since the ViT and LLM context phases run sequentially, TTFT is unaffected most of the time.
Of course, by that logic, inputs where the GPT part produces very few output tokens will see some impact.
Essentially this exploits the GPU's underutilization: the CUDA scheduler is left to fill the idle SMs with ViT work.
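For illustration only, a minimal PyTorch sketch of this overlap idea: launching ViT encoding and LLM decode kernels on separate CUDA streams so the hardware scheduler can interleave them on free SMs. The function and model names here are hypothetical placeholders, not the actual TensorRT-LLM runtime API.

```python
import torch

# Two independent CUDA streams: work launched on them has no implicit
# ordering, so the GPU scheduler may run their kernels concurrently.
vit_stream = torch.cuda.Stream()
llm_stream = torch.cuda.Stream()

def vit_encode(vit_model, images):
    # ViT prefill for the next request, launched on its own stream so its
    # kernels can occupy SMs left idle by the decode kernels.
    with torch.cuda.stream(vit_stream):
        return vit_model(images)

def llm_decode_step(llm_model, tokens):
    # A single decode step: typically memory-bound and unable to saturate
    # all SMs on its own, which is what leaves room for the ViT work.
    with torch.cuda.stream(llm_stream):
        return llm_model(tokens)

# Both launches above are asynchronous with respect to each other;
# synchronize only when the results are actually needed.
torch.cuda.synchronize()
```

This matches the best-effort nature described above: nothing reserves SMs for either side, so when the decode batch happens to be large enough to fill the GPU, the ViT kernels simply wait longer.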
@kzjeef Thanks for the explanation. Also, it seems the qwen2vl TRT backend doesn't support DP yet. Is that right?
By DP here do you mean multi-GPU data parallelism?