
Commit f88a30c

Merge branch 'main' into cuda

2 parents: d2612a4 + c032fc6
22 files changed: +1169 / -1142 lines

.github/workflows/build-and-release.yaml

Lines changed: 2 additions & 2 deletions
@@ -42,7 +42,7 @@ jobs:
         shell: cmd

       - name: Build wheels
-        uses: pypa/cibuildwheel@v2.19.2
+        uses: pypa/cibuildwheel@v2.20.0
         env:
           # disable repair
           CIBW_REPAIR_WHEEL_COMMAND: ""
@@ -69,7 +69,7 @@ jobs:
           platforms: linux/arm64

       - name: Build wheels
-        uses: pypa/cibuildwheel@v2.19.2
+        uses: pypa/cibuildwheel@v2.20.0
         env:
           CIBW_SKIP: "*musllinux* pp*"
           CIBW_REPAIR_WHEEL_COMMAND: ""

.github/workflows/build-wheels-metal.yaml

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ jobs:
         shell: cmd

       - name: Build wheels
-        uses: pypa/cibuildwheel@v2.19.2
+        uses: pypa/cibuildwheel@v2.20.0
         env:
           # disable repair
           CIBW_REPAIR_WHEEL_COMMAND: ""

.github/workflows/generate-index-from-release.yaml

Lines changed: 8 additions & 3 deletions
@@ -1,9 +1,11 @@
 name: Wheels Index

 on:
-  # Trigger on any new release
-  release:
-    types: [published]
+  # Trigger on new release
+  workflow_run:
+    workflows: ["Release", "Build Wheels (CUDA)", "Build Wheels (Metal)"]
+    types:
+      - completed

   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:
@@ -33,7 +35,10 @@ jobs:
       - name: Setup Pages
         uses: actions/configure-pages@v5
       - name: Build
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         run: |
+          ./scripts/get-releases.sh
           ./scripts/releases-to-pep-503.sh index/whl/cpu '^[v]?[0-9]+\.[0-9]+\.[0-9]+$'
           ./scripts/releases-to-pep-503.sh index/whl/cu121 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu121$'
           ./scripts/releases-to-pep-503.sh index/whl/cu122 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu122$'
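The rebuilt index step filters release tags with the regular expressions shown above. As a quick illustration (a sketch with made-up tag names, not part of this commit), this is how the cpu and cu121 patterns partition tags:

```python
import re

# Tag-filter patterns copied verbatim from the workflow step above.
CPU_PATTERN = r'^[v]?[0-9]+\.[0-9]+\.[0-9]+$'
CU121_PATTERN = r'^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu121$'

# Illustrative tag names, not taken from this commit.
for tag in ["v0.2.90", "v0.2.90-cu121", "v0.2.90-metal"]:
    print(tag, bool(re.match(CPU_PATTERN, tag)), bool(re.match(CU121_PATTERN, tag)))
# v0.2.90       -> matches the cpu pattern only
# v0.2.90-cu121 -> matches the cu121 pattern only
# v0.2.90-metal -> matches neither of these two patterns
```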

CHANGELOG.md

Lines changed: 46 additions & 0 deletions
@@ -7,6 +7,52 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.2.90]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@1d1ccce67613674c75c9c7e3fa4c1e24e428ba48
+- feat: Add support for `MiniCPMv26ChatHandler` and `minicpm-v-26` in server by @abetlen in f70df824985d875226793b94dacc0c302a4256b2
+
+## [0.2.89]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@cfac111e2b3953cdb6b0126e67a2487687646971
+- fix: Llama.close didn't free lora adapter by @jkawamoto in #1679
+- fix: missing dependencies for test by @jkawamoto in #1680
+
+## [0.2.88]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@fc4ca27b25464a11b3b86c9dbb5b6ed6065965c2
+- fix: only print 'cache saved' in verbose mode by @lsorber in #1668
+- fix: Added back from_file method to LlamaGrammar by @ExtReMLapin in #1673
+- fix: grammar prints on each call by @abetlen in 0998ea0deea076a547d54bd598d6b413b588ee2b
+- feat: Enable recursive search of HFFS.ls when using from_pretrained by @benHeidabetlen in #1656
+- feat: Add more detailed log for prefix-match by @xu-song in #1659
+
+## [0.2.87]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@be55695eff44784a141a863f273661a6bce63dfc
+- fix: Include all llama.cpp source files and subdirectories by @abetlen in 9cad5714ae6e7c250af8d0bbb179f631368c928b
+- feat(ci): Re-build wheel index automatically when releases are created by @abetlen in 198f47dc1bd202fd2b71b29e041a9f33fe40bfad
+
+## [0.2.86]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
+- feat: Ported back new grammar changes from C++ to Python implementation by @ExtReMLapin in (#1637)
+- fix: llama_grammar_accept_token arg order by @tc-wolf in (#1649)
+
+## [0.2.85]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
+- fix: Missing LoRA adapter after API change by @shamitv in #1630
+- fix(docker): Update Dockerfile BLAS options by @olivierdebauche in #1632
+- fix(docker): Fix GGML_CUDA param by @olivierdebauche in #1633
+- fix(docker): Update Dockerfile build options from `LLAMA_` to `GGML_` by @olivierdebauche in #1634
+- feat: FreeBSD compatibility by @yurivict in #1635
+
+## [0.2.84]
+
+- feat: Update llama.cpp to ggerganov/llama.cpp@4730faca618ff9cee0780580145e3cbe86f24876
+- fix: fix: Correcting run.sh filepath in Simple Docker implementation by @mashuk999 in #1626
+
 ## [0.2.83]

 - feat: Update llama.cpp to ggerganov/llama.cpp@081fe431aa8fb6307145c4feb3eed4f48cab19f8

README.md

Lines changed: 6 additions & 1 deletion
@@ -1,4 +1,8 @@
-# 🦙 Python Bindings for [`llama.cpp`](https://github.com/ggerganov/llama.cpp)
+<p align="center">
+  <img src="https://raw.githubusercontent.com/abetlen/llama-cpp-python/main/docs/icon.svg" style="height: 5rem; width: 5rem">
+</p>
+
+# Python Bindings for [`llama.cpp`](https://github.com/ggerganov/llama.cpp)

 [![Documentation Status](https://readthedocs.org/projects/llama-cpp-python/badge/?version=latest)](https://llama-cpp-python.readthedocs.io/en/latest/?badge=latest)
 [![Tests](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml/badge.svg?branch=main)](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml)
@@ -500,6 +504,7 @@ Below are the supported multi-modal models and their respective chat handlers (P
 | [moondream2](https://huggingface.co/vikhyatk/moondream2) | `MoondreamChatHandler` | `moondream2` |
 | [nanollava](https://huggingface.co/abetlen/nanollava-gguf) | `NanollavaChatHandler` | `nanollava` |
 | [llama-3-vision-alpha](https://huggingface.co/abetlen/llama-3-vision-alpha-gguf) | `Llama3VisionAlphaChatHandler` | `llama-3-vision-alpha` |
+| [minicpm-v-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) | `MiniCPMv26ChatHandler` | `minicpm-v-2.6` |

 Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.
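For readers of the updated table: a minimal sketch of wiring up the newly listed `MiniCPMv26ChatHandler`, following the same pattern the README documents for the other multi-modal handlers. The model path, clip-model path, and image URL are placeholders, not values from this commit.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# Placeholder paths: a MiniCPM-V-2.6 GGUF model and its mmproj (clip) file.
chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="ggml-model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embedding tokens
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```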

docker/cuda_simple/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -15,13 +15,13 @@ COPY . .

 # setting build related env vars
 ENV CUDA_DOCKER_ARCH=all
-ENV LLAMA_CUBLAS=1
+ENV GGML_CUDA=1

 # Install depencencies
 RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

 # Install llama-cpp-python (build with cuda)
-RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
+RUN CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

 # Run the server
 CMD python3 -m llama_cpp.server
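The CUDA image now builds against the renamed `GGML_CUDA` flag (upstream llama.cpp dropped `LLAMA_CUBLAS`). A small sketch, offered as an assumption rather than part of this commit, for checking from Python inside the container that the installed wheel was built with GPU offload support:

```python
import llama_cpp

# llama_supports_gpu_offload() is bound from llama.cpp; a False result in this
# image usually means CMAKE_ARGS="-DGGML_CUDA=on" did not take effect at build time.
print("GPU offload available:", llama_cpp.llama_supports_gpu_offload())
```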

docker/open_llama/Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -20,13 +20,13 @@ RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fa

 # Perform the conditional installations based on the image
 RUN echo "Image: ${IMAGE}" && \
-    if [ "${IMAGE}" = "python:3-slim-bullseye" ] ; then \
+    if [ "${IMAGE}" = "python:3-slim-bookworm" ] ; then \
     echo "OpenBLAS install:" && \
     apt-get install -y --no-install-recommends libopenblas-dev && \
-    LLAMA_OPENBLAS=1 pip install llama-cpp-python --verbose; \
+    CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --verbose; \
     else \
     echo "CuBLAS install:" && \
-    LLAMA_CUBLAS=1 pip install llama-cpp-python --verbose; \
+    CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --verbose; \
     fi

 # Clean up apt cache

docker/openblas_simple/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ RUN apt update && apt install -y libopenblas-dev ninja-build build-essential pkg

 RUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

-RUN CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama_cpp_python --verbose
+RUN CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama_cpp_python --verbose

 # Run the server
 CMD python3 -m llama_cpp.server

docker/simple/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -35,4 +35,4 @@ ENV PORT=8000
 EXPOSE 8000

 # Run the server start script
-CMD ["/bin/sh", "/app/run.sh"]
+CMD ["/bin/sh", "/app/docker/simple/run.sh"]

docs/icon.svg

Lines changed: 5 additions & 0 deletions
(new file; SVG source not rendered)

llama_cpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 from .llama_cpp import *
 from .llama import *

-__version__ = "0.2.83"
+__version__ = "0.2.90"

llama_cpp/_internals.py

Lines changed: 5 additions & 5 deletions
@@ -179,11 +179,11 @@ def token_eot(self) -> int:
         assert self.model is not None
         return llama_cpp.llama_token_eot(self.model)

-    def add_bos_token(self) -> int:
+    def add_bos_token(self) -> bool:
         assert self.model is not None
         return llama_cpp.llama_add_bos_token(self.model)

-    def add_eos_token(self) -> int:
+    def add_eos_token(self) -> bool:
         assert self.model is not None
         return llama_cpp.llama_add_eos_token(self.model)

@@ -511,7 +511,7 @@ def sample_token(self, candidates: "_LlamaTokenDataArray") -> int:
     def grammar_accept_token(self, grammar: LlamaGrammar, token: int):
         assert self.ctx is not None
         assert grammar.grammar is not None
-        llama_cpp.llama_grammar_accept_token(self.ctx, grammar.grammar, token)
+        llama_cpp.llama_grammar_accept_token(grammar.grammar, self.ctx, token)

     def reset_timings(self):
         assert self.ctx is not None
@@ -691,8 +691,8 @@ def _detokenize_bpe(model: _LlamaModel, tokens: List[int]) -> str:
 def _should_add_bos(model: _LlamaModel) -> bool:
     assert model.model is not None
     add_bos = llama_cpp.llama_add_bos_token(model.model)
-    if add_bos != -1:
-        return add_bos != 0
+    if add_bos:
+        return add_bos
     else:
         return llama_cpp.llama_vocab_type(model.model) == llama_cpp.LLAMA_VOCAB_TYPE_SPM
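Since upstream `llama_add_bos_token` now returns a plain boolean, the old `-1`/`0`/`1` tri-state handling disappears from `_should_add_bos`. A minimal sketch of observing the new return type through the internal `_LlamaModel` wrapper changed above; the model path is a placeholder and `_model` is a private attribute, so treat this as illustration only:

```python
from llama_cpp import Llama

# Placeholder path; vocab_only avoids loading weights just to inspect the vocab.
llm = Llama(model_path="model.gguf", vocab_only=True)

# These now return bool instead of the previous int tri-state (-1 / 0 / 1).
print(llm._model.add_bos_token())
print(llm._model.add_eos_token())
llm.close()
```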

llama_cpp/llama.py

Lines changed: 25 additions & 14 deletions
@@ -153,7 +153,7 @@ def __init__(
             model_path: Path to the model.
             n_gpu_layers: Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded.
             split_mode: How to split the model across GPUs. See llama_cpp.LLAMA_SPLIT_* for options.
-            main_gpu: main_gpu interpretation depends on split_mode: LLAMA_SPLIT_NONE: the GPU that is used for the entire model. LLAMA_SPLIT_ROW: the GPU that is used for small tensors and intermediate results. LLAMA_SPLIT_LAYER: ignored
+            main_gpu: main_gpu interpretation depends on split_mode: LLAMA_SPLIT_MODE_NONE: the GPU that is used for the entire model. LLAMA_SPLIT_MODE_ROW: the GPU that is used for small tensors and intermediate results. LLAMA_SPLIT_MODE_LAYER: ignored
             tensor_split: How split tensors should be distributed across GPUs. If None, the model is not split.
             rpc_servers: Comma separated list of RPC servers to use for offloading
             vocab_only: Only load the vocabulary no weights.
@@ -198,6 +198,7 @@ def __init__(
             A Llama instance.
         """
         self.verbose = verbose
+        self._stack = contextlib.ExitStack()

         set_verbose(verbose)
@@ -365,8 +366,6 @@ def __init__(
         if not os.path.exists(model_path):
             raise ValueError(f"Model path does not exist: {model_path}")

-        self._stack = contextlib.ExitStack()
-
         self._model = self._stack.enter_context(
             contextlib.closing(
                 _LlamaModel(
@@ -420,6 +419,15 @@ def __init__(
                 raise RuntimeError(
                     f"Failed to initialize LoRA adapter from lora path: {self.lora_path}"
                 )
+
+            def free_lora_adapter():
+                if self._lora_adapter is None:
+                    return
+                llama_cpp.llama_lora_adapter_free(self._lora_adapter)
+                self._lora_adapter = None
+
+            self._stack.callback(free_lora_adapter)
+
             assert self._ctx.ctx is not None
             if llama_cpp.llama_lora_adapter_set(
                 self._ctx.ctx, self._lora_adapter, self.lora_scale
@@ -570,6 +578,8 @@ def tokenize(

         Args:
             text: The utf-8 encoded string to tokenize.
+            add_bos: Whether to add a beginning of sequence token.
+            special: Whether to tokenize special tokens.

         Raises:
             RuntimeError: If the tokenization failed.
@@ -580,18 +590,19 @@ def tokenize(
         return self.tokenizer_.tokenize(text, add_bos, special)

     def detokenize(
-        self, tokens: List[int], prev_tokens: Optional[List[int]] = None
+        self, tokens: List[int], prev_tokens: Optional[List[int]] = None, special: bool = False
     ) -> bytes:
         """Detokenize a list of tokens.

         Args:
             tokens: The list of tokens to detokenize.
-            prev_tokens: The list of previous tokens. Offset mapping will be performed if provided
+            prev_tokens: The list of previous tokens. Offset mapping will be performed if provided.
+            special: Whether to detokenize special tokens.

         Returns:
             The detokenized string.
         """
-        return self.tokenizer_.detokenize(tokens, prev_tokens=prev_tokens)
+        return self.tokenizer_.detokenize(tokens, prev_tokens=prev_tokens, special=special)

     def set_cache(self, cache: Optional[BaseLlamaCache]):
         """Set the cache.
@@ -777,11 +788,12 @@ def generate(
             else:
                 break
         if longest_prefix > 0:
-            if self.verbose:
-                print("Llama.generate: prefix-match hit", file=sys.stderr)
             reset = False
             tokens = tokens[longest_prefix:]
             self.n_tokens = longest_prefix
+            if self.verbose:
+                print(f"Llama.generate: {longest_prefix} prefix-match hit, "
+                      f"remaining {len(tokens)} prompt tokens to eval", file=sys.stderr)

         # Reset the model state
         if reset:
@@ -1057,13 +1069,13 @@ def _create_completion(

         if (
             (isinstance(prompt, list) and suffix is None)
-            or self._model.add_bos_token() == 0
+            or not self._model.add_bos_token()
             or bos_tokens[:1] == [-1]
         ):
             bos_tokens = []

         if (isinstance(prompt, list) and suffix is None) or (
-            self._model.add_eos_token() != 1 and sep_token_id == -1
+            not self._model.add_eos_token() and sep_token_id == -1
         ):
             eos_tokens = []
@@ -1522,7 +1534,8 @@ def logit_bias_processor(
                 if self.verbose:
                     print("Llama._create_completion: cache save", file=sys.stderr)
                 self.cache[prompt_tokens + completion_tokens] = self.save_state()
-                print("Llama._create_completion: cache saved", file=sys.stderr)
+                if self.verbose:
+                    print("Llama._create_completion: cache saved", file=sys.stderr)
                 return

             if self.cache:
@@ -2086,8 +2099,6 @@ def close(self) -> None:
         self._stack.close()

     def __del__(self) -> None:
-        if self._lora_adapter is not None:
-            llama_cpp.llama_lora_adapter_free(self._lora_adapter)
         self.close()

     @staticmethod
@@ -2156,7 +2167,7 @@ def from_pretrained(

         files = [
             file["name"] if isinstance(file, dict) else file
-            for file in hffs.ls(repo_id)
+            for file in hffs.ls(repo_id, recursive=True)
         ]

         # split each file into repo_id, subfolder, filename
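Taken together, the `llama.py` changes add a `special` flag to `Llama.detokenize`, log richer prefix-match information, and move LoRA-adapter cleanup into the `ExitStack` consumed by `close()`. A minimal usage sketch under those assumptions (the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", vocab_only=True)  # placeholder path

# special=True round-trips control tokens (e.g. BOS) instead of dropping them.
tokens = llm.tokenize(b"Hello, world!", add_bos=True, special=True)
print(llm.detokenize(tokens, special=True))

# close() now also frees any loaded LoRA adapter via the ExitStack callback
# registered in __init__, so __del__ no longer needs its own free call.
llm.close()
```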
