Commit f53c6c8

Merge pull request #14 from freelawproject/FLP_model
feat(model): use FLP model, update context window and overlap ratio
2 parents 409bd45 + ea94d0a commit f53c6c8

3 files changed, +11 -11 lines changed

.env.example

+4 -4
@@ -1,8 +1,8 @@
 # Model Settings
-TRANSFORMER_MODEL_NAME=nomic-ai/modernbert-embed-base
-TRANSFORMER_MODEL_VERSION=d556a88e332558790b210f7bdbe87da2fa94a8d8
-MAX_TOKENS=8192
-OVERLAP_RATIO=0.002
+TRANSFORMER_MODEL_NAME=Free-Law-Project/modernbert-embed-base_finetune_512
+TRANSFORMER_MODEL_VERSION=main
+MAX_TOKENS=512
+OVERLAP_RATIO=0.004
 MIN_TEXT_LENGTH=1
 MAX_TEXT_LENGTH = 10000000
 MAX_QUERY_LENGTH = 100

README.md

+4 -4
@@ -12,7 +12,7 @@ The service is optimized to handle two main use cases:

 ## Features

-- Specialized text embedding generation for legal documents using the `nomic-ai/modernbert-embed-base`
+- Specialized text embedding generation for legal documents using `Free-Law-Project/modernbert-embed-base_finetune_512`, a `sentence_transformer` model finetuned on top of `nomic-ai/modernbert-embed-base`
 - Intelligent text chunking optimized for court opinions, based on sentence boundaries
 - Dedicated CPU-based processing for search queries, ensuring fast response times
 - GPU acceleration support for processing lengthy court opinions
@@ -33,7 +33,7 @@ cp .env.example .env
 Model Settings:
 - `TRANSFORMER_MODEL_NAME`

-Default: `nomic-ai/modernbert-embed-base`
+Default: `Free-Law-Project/modernbert-embed-base_finetune_512`

 The name or path of the SentenceTransformer model to use for generating embeddings.

@@ -45,13 +45,13 @@ Model Settings:

 - `MAX_TOKENS`

-Default: `8192` (Range: 512–10000)
+Default: `512` (Range: 256–10000)

 Maximum number of tokens per chunk when splitting text. If the text exceeds this limit, it is split into multiple chunks based on sentence boundaries. If a sentence exceeds this limit, it is truncated. Sentences are defined by `nltk.tokenize.sent_tokenize`, which follows English heuristics to detect sentence boundaries.

 - `OVERLAP_RATIO`

-Default: `0.002` (Range: 0-0.01)
+Default: `0.004` (Range: 0-0.01)

 The ratio to calculate the number of sentences to overlap between chunks when splitting text. Sentences are defined by `nltk.tokenize.sent_tokenize`, which follows English heuristics to detect sentence boundaries. `num_overlap_sentences = int(MAX_TOKENS * OVERLAP_RATIO)`.
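
As a quick check of the `num_overlap_sentences = int(MAX_TOKENS * OVERLAP_RATIO)` formula, the snippet below evaluates it for the old and new defaults in this commit. Wrapping the formula in a function is purely illustrative; the repository may compute it inline or under a different name.

```python
def num_overlap_sentences(max_tokens: int, overlap_ratio: float) -> int:
    """Sentences shared between consecutive chunks, per the README formula."""
    # int() truncates toward zero, so fractional sentences are dropped.
    return int(max_tokens * overlap_ratio)


old_default = num_overlap_sentences(8192, 0.002)     # 16.384 truncates to 16
new_default = num_overlap_sentences(512, 0.004)      # 2.048 truncates to 2
unchanged_ratio = num_overlap_sentences(512, 0.002)  # 1.024 truncates to 1

print(old_default, new_default, unchanged_ratio)
```

With the 512-token window, leaving the ratio at `0.002` would give a single overlapping sentence, while `0.004` gives two.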

inception/config.py

+3 -3
@@ -4,18 +4,18 @@

 class Settings(BaseSettings):
     transformer_model_name: str = Field(
-        "nomic-ai/modernbert-embed-base",
+        "Free-Law-Project/modernbert-embed-base_finetune_512",
         description="Name of the transformer model to use",
     )
     transformer_model_version: str = Field(
         "main",
         description="Version of the transformer model to use",
     )
     max_tokens: int = Field(
-        8192, ge=512, le=10000, description="Maximum tokens per chunk"
+        512, ge=256, le=10000, description="Maximum tokens per chunk"
     )
     overlap_ratio: float = Field(
-        0.002,
+        0.004,
         ge=0,
         le=0.01,
         description="Ratio to calculate number of sentence overlap between chunks",
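
To illustrate how `max_tokens` and `overlap_ratio` interact, here is a minimal sketch of sentence-based chunking with overlap, following the behaviour described in the README. It is not the service's implementation. It makes two loud assumptions: a naive regex splitter stands in for `nltk.tokenize.sent_tokenize`, and whitespace-separated words stand in for model tokens. The names `split_sentences`, `count_tokens`, and `chunk_text` are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Assumption: naive splitter standing in for nltk.tokenize.sent_tokenize.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def chunk_text(text: str, max_tokens: int = 512, overlap_ratio: float = 0.004) -> list[str]:
    overlap = int(max_tokens * overlap_ratio)  # README formula
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in split_sentences(text):
        # A sentence longer than the limit is truncated to max_tokens.
        sentence = " ".join(sentence.split()[:max_tokens])
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last `overlap` sentences into the next chunk.
            carried = current[-overlap:] if overlap > 0 else []
            # Drop carried sentences from the front until the new one fits.
            while carried and sum(map(count_tokens, carried)) + n > max_tokens:
                carried.pop(0)
            current = carried
            current_tokens = sum(map(count_tokens, current))
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Twelve synthetic 100-word sentences, chunked with the new defaults.
text = " ".join(f"s{i} " + "word " * 98 + "end." for i in range(1, 13))
chunks = chunk_text(text)
print(len(chunks), [c.split()[0] for c in chunks])
```

In this sketch each chunk after the first begins with the last two sentences of the previous one, which is the two-sentence overlap implied by `int(512 * 0.004)`.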
