-
I want to run 8-bit GGUF files with the llama-cpp-python library. Do I need to load two files at the same time? Can you share sample code?

Replies: 1 comment
-
GGUF packs everything into one file, so you only need a single file. A lot of providers offer ready-made GGUFs nowadays; use the Hugging Face search to find a repo for your model. Once you have the repo id, you can run the download with the built-in `Llama.from_pretrained`:

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # HF repo id where the model files are hosted
    filename="*q8_0.gguf",  # downloads the file matching this pattern (one ending in q8_0.gguf, i.e. an 8-bit quant)
    local_dir="./ai/llm_models/",  # optional dir to save the model file (otherwise it goes into the default cache dir)
    # verbose=True,
    # n_gpu_layers=-1,  # uncomment if you installed llama-cpp-python with GPU support, otherwise it runs on CPU
    chat_format="llama-3",  # chat template; usually detected automatically from the GGUF metadata, so this may be optional
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": ""},
        {
            "role": "user",
            "content": "Hi, I'm just testing if it works.",
        },
    ],
    max_tokens=256,  # max number of tokens to generate; check each model's supported limit online
)
print(output)
```
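`create_chat_completion` returns an OpenAI-style completion dict, so `print(output)` dumps the whole structure. A minimal sketch of pulling out just the reply text, assuming the usual `choices[0]["message"]["content"]` layout:

```python
# The result follows the OpenAI chat-completion schema:
# {"id": ..., "choices": [{"message": {"role": "assistant", "content": ...}, ...}], "usage": ...}
reply = output["choices"][0]["message"]["content"]
print(reply)

# Token accounting is reported under "usage"
print(output["usage"])
```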
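If you already have the single .gguf file on disk, you can also point the constructor straight at it instead of going through `from_pretrained`. A minimal sketch, assuming a hypothetical filename under the `local_dir` used above:

```python
from llama_cpp import Llama

# Hypothetical path: adjust to the actual file that from_pretrained (or a manual download) saved
llm = Llama(
    model_path="./ai/llm_models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    n_ctx=4096,         # context window to allocate
    # n_gpu_layers=-1,  # offload all layers if built with GPU support
)
```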