Commit 7ec7d62

Add MultilingualPolicyFilter
1 parent 76cecc3 commit 7ec7d62

1 file changed: +1 -1 lines changed

Diff for: src/datatrove/pipeline/filters/multilingual_policy_filter.py (+1 -1)
@@ -81,7 +81,7 @@ def filter(self, doc: Document) -> bool | tuple[bool, str]:
             if any(p in line_l for p in POLICY_SUBSTRINGS[self.language]):
                 self.stat_update("line-filter-policy")
                 continue
-            num_sentences += len(sent_tokenize(line, language=self.tokenizer_language)) if self.split_paragraph else 1
+            num_sentences += len(sent_tokenize(line, language=self.language)) if self.split_paragraph else 1
             kept_lines.append(line)
             self.stat_update("line-kept")
         if num_sentences < self.min_num_sentences:
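
Reviewer note: the loop touched by this hunk drops lines that contain a policy substring, counts sentences on the remaining lines, and rejects the document if too few sentences survive. The sketch below is a rough, self-contained approximation of that flow, not the filter's actual implementation. The `POLICY_SUBSTRINGS` entries, the `count_sentences` helper (a naive regex stand-in for `sent_tokenize`), and the `filter_lines` function are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical sample data; the real POLICY_SUBSTRINGS table is not shown in the diff.
POLICY_SUBSTRINGS = {
    "english": ["privacy policy", "terms of use"],
    "german": ["datenschutzerklärung", "nutzungsbedingungen"],
}


def count_sentences(line: str) -> int:
    """Naive stand-in for a language-aware sentence tokenizer (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", line.strip())
    return len([p for p in parts if p])


def filter_lines(text, language, min_num_sentences=2, split_paragraph=True):
    """Return (keep_document, kept_lines) following the loop shown in the hunk."""
    kept_lines = []
    num_sentences = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        line_l = line.lower()
        # Drop lines that mention a policy phrase for the selected language.
        if any(p in line_l for p in POLICY_SUBSTRINGS[language]):
            continue
        # Count sentences per line, or treat each kept line as one sentence.
        num_sentences += count_sentences(line) if split_paragraph else 1
        kept_lines.append(line)
    if num_sentences < min_num_sentences:
        return False, kept_lines
    return True, kept_lines
```

With `split_paragraph=False` every kept line counts as a single sentence, so a document with one long surviving line would fall below a `min_num_sentences` of 2 even if that line contains several sentences.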
