Merge pull request #14 from freelawproject/FLP_model

mlissner · web-flow · commit f53c6c8e7071 · 2025-03-06T09:29:13.000-08:00
feat(model): use FLP model, update context window and overlap ratio
diff --git a/.env.example b/.env.example
@@ -1,8 +1,8 @@
 # Model Settings
-TRANSFORMER_MODEL_NAME=nomic-ai/modernbert-embed-base
-TRANSFORMER_MODEL_VERSION=d556a88e332558790b210f7bdbe87da2fa94a8d8
-MAX_TOKENS=8192
-OVERLAP_RATIO=0.002
+TRANSFORMER_MODEL_NAME=Free-Law-Project/modernbert-embed-base_finetune_512
+TRANSFORMER_MODEL_VERSION=main
+MAX_TOKENS=512
+OVERLAP_RATIO=0.004
 MIN_TEXT_LENGTH=1
 MAX_TEXT_LENGTH = 10000000
 MAX_QUERY_LENGTH = 100
diff --git a/README.md b/README.md
@@ -12,7 +12,7 @@ The service is optimized to handle two main use cases:
 
 ## Features
 
-- Specialized text embedding generation for legal documents using the `nomic-ai/modernbert-embed-base`
+- Specialized text embedding generation for legal documents using `Free-Law-Project/modernbert-embed-base_finetune_512`, a `sentence_transformer` model finetuned on top of `nomic-ai/modernbert-embed-base`
 - Intelligent text chunking optimized for court opinions, based on sentence boundaries
 - Dedicated CPU-based processing for search queries, ensuring fast response times
 - GPU acceleration support for processing lengthy court opinions
@@ -33,7 +33,7 @@ cp .env.example .env
 Model Settings:
 - `TRANSFORMER_MODEL_NAME`
 
-    Default: `nomic-ai/modernbert-embed-base`
+    Default: `Free-Law-Project/modernbert-embed-base_finetune_512`
 
     The name or path of the SentenceTransformer model to use for generating embeddings.
 
@@ -45,13 +45,13 @@ Model Settings:
 
 - `MAX_TOKENS`
 
-    Default: `8192` (Range: 512–10000)
+    Default: `512` (Range: 256–10000)
 
     Maximum number of tokens per chunk when splitting text. If the text exceeds this limit, it is split into multiple chunks based on sentence boundaries. If a sentence exceeds this limit, it is truncated. Sentences are defined by `nltk.tokenize.sent_tokenize`, which follows English heuristics to detect sentence boundaries.
 
 - `OVERLAP_RATIO`
 
-    Default: `0.002` (Range: 0-0.01)
+    Default: `0.004` (Range: 0-0.01)
 
     The ratio to calculate the number of sentences to overlap between chunks when splitting text. Sentences are defined by `nltk.tokenize.sent_tokenize`, which follows English heuristics to detect sentence boundaries. `num_overlap_sentences = int(MAX_TOKENS * OVERLAP_RATIO)`.
 
diff --git a/inception/config.py b/inception/config.py
@@ -4,18 +4,18 @@
 
 class Settings(BaseSettings):
     transformer_model_name: str = Field(
-        "nomic-ai/modernbert-embed-base",
+        "Free-Law-Project/modernbert-embed-base_finetune_512",
         description="Name of the transformer model to use",
     )
     transformer_model_version: str = Field(
         "main",
         description="Version of the transformer model to use",
     )
     max_tokens: int = Field(
-        8192, ge=512, le=10000, description="Maximum tokens per chunk"
+        512, ge=256, le=10000, description="Maximum tokens per chunk"
     )
     overlap_ratio: float = Field(
-        0.002,
+        0.004,
         ge=0,
         le=0.01,
         description="Ratio to calculate number of sentence overlap between chunks",