feat(model): use FLP model, update context window and overlap ratio #14

Merged 1 commit on Mar 6, 2025
8 changes: 4 additions & 4 deletions .env.example
@@ -1,8 +1,8 @@
# Model Settings
-TRANSFORMER_MODEL_NAME=nomic-ai/modernbert-embed-base
-TRANSFORMER_MODEL_VERSION=d556a88e332558790b210f7bdbe87da2fa94a8d8
-MAX_TOKENS=8192
-OVERLAP_RATIO=0.002
+TRANSFORMER_MODEL_NAME=Free-Law-Project/modernbert-embed-base_finetune_512
+TRANSFORMER_MODEL_VERSION=main
+MAX_TOKENS=512
+OVERLAP_RATIO=0.004
MIN_TEXT_LENGTH=1
MAX_TEXT_LENGTH = 10000000
MAX_QUERY_LENGTH = 100
8 changes: 4 additions & 4 deletions README.md
@@ -12,7 +12,7 @@ The service is optimized to handle two main use cases:

## Features

-- Specialized text embedding generation for legal documents using the `nomic-ai/modernbert-embed-base`
+- Specialized text embedding generation for legal documents using `Free-Law-Project/modernbert-embed-base_finetune_512`, a `sentence_transformer` model finetuned on top of `nomic-ai/modernbert-embed-base`
- Intelligent text chunking optimized for court opinions, based on sentence boundaries
- Dedicated CPU-based processing for search queries, ensuring fast response times
- GPU acceleration support for processing lengthy court opinions
@@ -33,7 +33,7 @@ cp .env.example .env
Model Settings:
- `TRANSFORMER_MODEL_NAME`

-Default: `nomic-ai/modernbert-embed-base`
+Default: `Free-Law-Project/modernbert-embed-base_finetune_512`

The name or path of the SentenceTransformer model to use for generating embeddings.

@@ -45,13 +45,13 @@ Model Settings:

- `MAX_TOKENS`

-Default: `8192` (Range: 512–10000)
+Default: `512` (Range: 256–10000)

Maximum number of tokens per chunk when splitting text. If the text exceeds this limit, it is split into multiple chunks based on sentence boundaries. If a sentence exceeds this limit, it is truncated. Sentences are defined by `nltk.tokenize.sent_tokenize`, which follows English heuristics to detect sentence boundaries.

- `OVERLAP_RATIO`

-Default: `0.002` (Range: 0-0.01)
+Default: `0.004` (Range: 0-0.01)

The ratio to calculate the number of sentences to overlap between chunks when splitting text. Sentences are defined by `nltk.tokenize.sent_tokenize`, which follows English heuristics to detect sentence boundaries. `num_overlap_sentences = int(MAX_TOKENS * OVERLAP_RATIO)`.
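
As a rough illustration of how `MAX_TOKENS` and `OVERLAP_RATIO` interact: the new defaults imply `int(512 * 0.004) = 2` overlapping sentences per chunk, versus `int(8192 * 0.002) = 16` under the previous defaults. The sketch below is illustrative only, not the service's actual chunker; `count_tokens` is a whitespace stand-in for the transformer tokenizer.

```python
# Illustrative sketch only -- not the repository's actual chunking code.
# Token counts are approximated with a whitespace split instead of the
# transformer tokenizer, so the numbers are rough.
from nltk.tokenize import sent_tokenize  # needs nltk's "punkt" data downloaded


def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for the real tokenizer's token count


def chunk_text(text: str, max_tokens: int = 512, overlap_ratio: float = 0.004) -> list[str]:
    num_overlap = int(max_tokens * overlap_ratio)  # int(512 * 0.004) == 2 sentences
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sent_tokenize(text):  # English-heuristic sentence boundaries
        tokens = count_tokens(sentence)
        if tokens > max_tokens:
            # A single sentence longer than the limit is truncated.
            sentence = " ".join(sentence.split()[:max_tokens])
            tokens = max_tokens
        if current and current_tokens + tokens > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last `num_overlap` sentences into the next chunk.
            current = current[-num_overlap:] if num_overlap else []
            current_tokens = sum(count_tokens(s) for s in current)
        current.append(sentence)
        current_tokens += tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```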

6 changes: 3 additions & 3 deletions inception/config.py
@@ -4,18 +4,18 @@

class Settings(BaseSettings):
    transformer_model_name: str = Field(
-        "nomic-ai/modernbert-embed-base",
+        "Free-Law-Project/modernbert-embed-base_finetune_512",
        description="Name of the transformer model to use",
    )
    transformer_model_version: str = Field(
        "main",
        description="Version of the transformer model to use",
    )
    max_tokens: int = Field(
-        8192, ge=512, le=10000, description="Maximum tokens per chunk"
+        512, ge=256, le=10000, description="Maximum tokens per chunk"
    )
    overlap_ratio: float = Field(
-        0.002,
+        0.004,
        ge=0,
        le=0.01,
        description="Ratio to calculate number of sentence overlap between chunks",
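
For reference, a hypothetical usage sketch of the updated settings. It assumes the module path `inception.config`, that `Settings` reads its values from the environment or `.env` in pydantic-settings fashion, and a `sentence-transformers` version recent enough to accept a `revision` argument; none of these details are confirmed by the diff itself.

```python
# Hypothetical usage sketch (assumptions noted in the text above).
from sentence_transformers import SentenceTransformer

from inception.config import Settings

settings = Settings()
print(settings.transformer_model_name)  # Free-Law-Project/modernbert-embed-base_finetune_512
print(settings.max_tokens)              # 512 unless MAX_TOKENS overrides it
print(settings.overlap_ratio)           # 0.004 unless OVERLAP_RATIO overrides it

# Load the fine-tuned model at the pinned revision and embed a short query.
model = SentenceTransformer(
    settings.transformer_model_name,
    revision=settings.transformer_model_version,
)
embedding = model.encode("Fourth Amendment search and seizure")
print(embedding.shape)  # e.g. (768,) for a ModernBERT-base sized embedding model
```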