
Commit c15d867

add more to embeddings section in README.md
1 parent e22f853 commit c15d867

1 file changed: +7 -1 lines changed

README.md

Lines changed: 7 additions & 1 deletion
@@ -575,7 +575,7 @@ llama = Llama(

### Embeddings

-To generate text embeddings use [`create_embedding`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_embedding).
+To generate text embeddings use [`create_embedding`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_embedding) or [`embed`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.embed). Note that you must pass `embedding=True` to the constructor upon model creation for these to work properly.

```python
import llama_cpp
@@ -589,6 +589,12 @@ embeddings = llm.create_embedding("Hello, world!")
embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
```

+There are two primary notions of embeddings in a Transformer-style model: *token-level* and *sequence-level*. Sequence-level embeddings are produced by "pooling" token-level embeddings together, usually by averaging them or by using the first token.
+
+Models that are explicitly geared towards embeddings will usually return sequence-level embeddings by default, one for each input string. Non-embedding models, such as those designed for text generation, will typically return only token-level embeddings, one for each token in each sequence. The returned value therefore has one more dimension for token-level embeddings.
+
+It is possible to control pooling behavior in some cases using the `pooling_type` flag on model creation. You can ensure token-level embeddings from any model using `LLAMA_POOLING_TYPE_NONE`. The reverse, getting a generation-oriented model to yield sequence-level embeddings, is currently not possible, but you can always do the pooling manually.
+
### Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.
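
As a rough illustration of the `embed` addition above, the sketch below shows how the two embedding calls differ. The model path is a placeholder and the exact result shapes depend on the model, so treat this as an assumption-laden sketch rather than canonical usage.

```python
import llama_cpp

# embedding=True is required at construction time for embedding calls to work.
# The model path is a placeholder for whatever GGUF model you have locally.
llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

# create_embedding returns an OpenAI-style response dict; the vectors live
# under response["data"][i]["embedding"].
response = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
vectors = [item["embedding"] for item in response["data"]]

# embed returns just the vector(s), without the response wrapper.
vector = llm.embed("Hello, world!")
```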

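
To make the pooling discussion concrete, here is a minimal sketch of forcing token-level output and then pooling manually, assuming that with pooling disabled `embed` yields one vector per token:

```python
import numpy as np
import llama_cpp

# Assumed setup: disable pooling so calls yield token-level embeddings.
# The model path is a placeholder.
llm = llama_cpp.Llama(
    model_path="path/to/model.gguf",
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_NONE,
)

# One vector per token (assumed shape: n_tokens x n_embd).
token_vectors = np.array(llm.embed("Hello, world!"))

# Manual mean pooling collapses the token axis into a single sequence vector.
sequence_vector = token_vectors.mean(axis=0)
```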
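
For the context-window note in the surrounding section, the adjustment is a single constructor argument. A brief sketch, assuming the `n_ctx` parameter and a placeholder model path:

```python
from llama_cpp import Llama

# Raise the context window from the 512-token default to 2048 tokens.
# Larger windows use more memory and must be supported by the model.
llm = Llama(model_path="path/to/model.gguf", n_ctx=2048)
```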