Commit ed9e88d

committed Apr 15, 2023
feat: add a way to count tokens without encoding the whole text
- adds highly performant `isWithinTokenLimit` to count tokens without encoding the whole text
- improve overall performance by removing transitive arrays
- include precomputed `bpeRanks`
- add type-checking
- fix a few minor bugs (thanks to type-checking)
- add generator versions of both decoder and encoder
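The idea behind counting tokens without encoding the whole text can be sketched as a generator-based encoder plus a counter that stops as soon as the limit is exceeded. This is a minimal illustration only, not the code from this commit: `encodeGenerator` here is a hypothetical toy tokenizer that yields one chunk per whitespace-separated word (a real encoder would yield BPE token ids), and the boolean return value of `isWithinTokenLimit` is an assumption.

```javascript
// Hypothetical toy encoder: lazily yields one "token" per whitespace-separated chunk.
// A real BPE encoder would yield token ids instead; this only illustrates laziness.
function* encodeGenerator(text) {
  for (const match of text.matchAll(/\S+\s*/g)) {
    yield match[0];
  }
}

// Assumed shape: returns true if the text needs at most `tokenLimit` tokens.
// Iteration stops at the first token past the limit, so the rest of the
// text is never tokenized.
function isWithinTokenLimit(text, tokenLimit) {
  let count = 0;
  for (const _token of encodeGenerator(text)) {
    count += 1;
    if (count > tokenLimit) {
      return false; // early exit: remaining text is left untouched
    }
  }
  return true;
}

console.log(isWithinTokenLimit("a b c", 3)); // true
console.log(isWithinTokenLimit("a b c d", 3)); // false
```

Because the generator produces tokens on demand, the cost of a failed check is proportional to the limit rather than to the length of the input.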
1 parent 9df47fc commit ed9e88d

17 files changed, +5496 -3759 lines changed

‎.gitignore

+2 -1
@@ -1,2 +1,3 @@
 node_modules
-.npmrc
+*.d.ts
+.npmrc

‎.npmignore

+2
@@ -0,0 +1,2 @@
+.npmrc
+tsconfig.json

‎Encoder.js

-178
This file was deleted.

‎Encoder.test.js

-44
This file was deleted.

‎data/bpe_ranks.json

+1
Large diffs are not rendered by default.

encoder.json → data/encoder.json

File renamed without changes.

‎data/getBpe.js

+20
@@ -0,0 +1,20 @@
+const path = require("path");
+const fs = require("fs");
+const { dictZip, range } = require("../utils");
+
+const bpe_file = fs.readFileSync(path.join(__dirname, "./vocab.bpe"), "utf-8");
+const lines = bpe_file.split("\n");
+
+const bpe_merges = lines.slice(1, lines.length - 1).map((x) =>
+  x
+    .split(/(\s+)/)
+    .filter((e) => e.trim().length > 0)
+    .join(","),
+);
+
+const bpe_ranks = dictZip(bpe_merges, range(0, bpe_merges.length));
+
+fs.writeFileSync(
+  path.join(__dirname, "./bpe_ranks.json"),
+  JSON.stringify(bpe_ranks),
+);
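The script above depends on `dictZip` and `range` from `../utils`, which this diff does not show. The following is a rough sketch of what they might look like, inferred only from how they are called; both implementations are assumptions, as are the helper name `toMergeKey` and the sample lines, which only imitate the shape of a `vocab.bpe` file (a header line, one merge pair per line, and a trailing newline).

```javascript
// Assumed behaviour: half-open integer range [start, end).
function range(start, end) {
  return Array.from({ length: end - start }, (_, i) => start + i);
}

// Assumed behaviour: build an object mapping keys[i] -> values[i].
function dictZip(keys, values) {
  const result = {};
  keys.forEach((key, i) => {
    result[key] = values[i];
  });
  return result;
}

// Same per-line transform as in getBpe.js: "Ġ t" becomes "Ġ,t".
function toMergeKey(line) {
  return line
    .split(/(\s+)/)
    .filter((e) => e.trim().length > 0)
    .join(",");
}

// Illustrative sample: header line, three merges, and the empty string
// produced by splitting a file that ends with a newline.
const sampleLines = ["#version: 0.2", "Ġ t", "Ġ a", "h e", ""];

// slice(1, length - 1) drops the header and the trailing empty entry.
const merges = sampleLines.slice(1, sampleLines.length - 1).map(toMergeKey);
const ranks = dictZip(merges, range(0, merges.length));

console.log(JSON.stringify(ranks)); // {"Ġ,t":0,"Ġ,a":1,"h,e":2}
```

Each merge pair is keyed by its comma-joined form and mapped to its position in the file, so a lower rank means an earlier, higher-priority merge. Precomputing this map once and shipping it as JSON avoids parsing `vocab.bpe` at load time.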

vocab.bpe → data/vocab.bpe

File renamed without changes.
