[Bug]: Inference failed using enable_prefix_caching=True in 0.7.3rc2 #447
Comments
When I reduce the prefix input length, an error occurs:
|
Update: when I further reduce the length of the prefix input, the script looks like:

```python
import time

from vllm import LLM, SamplingParams

# A short prompt used as the shared prefix (reduced from the original
# long markdown-table prompt).
LONG_PROMPT = "You are a helpful assistant, and my name is joe, i'm 18 years old, please answer me a question"

def get_generation_time(llm, sampling_params, prompts):
    # Time the generation.
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # Print the output and the generation time.
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")

# Set enable_prefix_caching=True to enable APC.
llm = LLM(
    model='lmsys/longchat-13b-16k',
    enable_prefix_caching=True
)
sampling_params = SamplingParams(temperature=0, max_tokens=100)

# First query: computes the KV cache of LONG_PROMPT.
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: How old am I ? ",
)

# Second query: should be faster, since vLLM avoids recomputing
# the KV cache of LONG_PROMPT.
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: what is my name ? ",
)
```

It works, but the acceleration effect of APC does not seem to take effect:
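One likely reason no speedup is visible here: vLLM's APC reuses the KV cache at whole-block granularity, and this reduced prompt is only a few dozen tokens, so little or none of it spans a full cache block that could be reused. Below is a minimal sketch (not from this thread; the filler prefix and timing helper are made up for illustration) of measuring APC with a shared prefix long enough to cover many blocks:

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical filler prefix: repeated text so the shared prefix spans
# many KV-cache blocks (APC only reuses complete blocks).
LONG_PREFIX = "The capital of France is Paris. " * 400

llm = LLM(model='lmsys/longchat-13b-16k', enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0, max_tokens=100)

def timed_generate(prompt):
    start = time.time()
    output = llm.generate(prompt, sampling_params=sampling_params)
    print(f"Generation time: {time.time() - start:.2f} seconds")
    return output

# First call computes and caches the KV blocks of LONG_PREFIX.
timed_generate(LONG_PREFIX + "Question: what is the capital of France? ")

# Second call shares the prefix; if APC is working, its prefill should
# be measurably faster than the first call's.
timed_generate(LONG_PREFIX + "Question: which city is mentioned above? ")
```

With a prefix this long, the second call should be noticeably faster; if the two timings stay the same, APC is genuinely not taking effect.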
|
@Potabk I tried your first long-sequence script with the Qwen2.5-7B model, and it generated results normally on an earlier commit. We will look into the root cause of the issue. |
This needs a new version of NNAL; we'll address it once it's released. |
Same here; it seems to come from nnal/atb `Mki::LogSinkFile::DeleteOldestFile()`. Any fixes? |
This will be fixed in the next release of NNAL. |
Since NNAL 8.1rc1 has been released and this bug is verified as fixed (see #644), we are closing this issue. |
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I am using the vllm-ascend v0.7.3rc2 image
`quay.io/ascend/vllm-ascend:v0.7.3rc2`
to test the Automatic Prefix Caching feature. My test script and the resulting error trace follow:
The device debug log: