Commit f2480a1

authored Jun 26, 2024
improve Pre-Tokenized Dataset docs (axolotl-ai-cloud#1684) [skip ci]
Fixes axolotl-ai-cloud#1661
1 parent 559562d commit f2480a1

1 file changed: +18 -2 lines changed

‎docs/dataset-formats/tokenized.qmd

@@ -4,9 +4,25 @@ description: How to use a custom pre-tokenized dataset.
 order: 5
 ---
 
-- Do not pass a `type:` in your axolotl config.
+- Pass an empty `type:` in your axolotl config.
 - Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
+- To indicate that a token should be ignored during training, set its corresponding label to `-100`.
+- Do not add BOS/EOS. Axolotl will add them for you based on the default tokenizer for the model you're using.
+- For pretraining, do not truncate/pad documents to the context window length.
+- For instruction training, documents must be truncated/padded as desired.
+
+Sample config:
 
 ```{.yaml filename="config.yml"}
-- path: ...
+datasets:
+  - path: /path/to/your/file.jsonl
+    ds_type: json
+    type:
+```
+
+Sample jsonl:
+
+```jsonl
+{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}
+{"input_ids":[87,227,8383,12],"attention_mask":[1,1,1,1],"labels":[87,227,8383,12]}
 ```
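
The rules added in this hunk could be applied when producing such a file roughly as follows. This is a minimal sketch, not part of the commit and not Axolotl code: `make_record`, `validate_record` and `to_jsonl` are hypothetical helper names, the token ids are arbitrary, masking the whole prompt with `-100` is just one common choice, and the equal-length check is an assumption about what a well-formed row looks like rather than something the docs state.

```python
import json

IGNORE_INDEX = -100  # label value the docs use to mark a token as ignored during training
REQUIRED_COLUMNS = ("input_ids", "attention_mask", "labels")


def make_record(prompt_ids, completion_ids):
    """Build one pre-tokenized row, masking the prompt tokens out of the loss.

    Hypothetical helper. No BOS/EOS ids are added here, since the docs say
    Axolotl adds them itself.
    """
    input_ids = list(prompt_ids) + list(completion_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids),
    }


def validate_record(record):
    """Check the column names and (as an assumption) that all lists have equal length."""
    if set(record) != set(REQUIRED_COLUMNS):
        raise ValueError(f"columns must be exactly {REQUIRED_COLUMNS}, got {sorted(record)}")
    lengths = {len(record[name]) for name in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("input_ids, attention_mask and labels must have equal length")
    return record


def to_jsonl(records):
    """Serialize records as JSON Lines text, one compact object per line."""
    return "\n".join(
        json.dumps(validate_record(r), separators=(",", ":")) for r in records
    ) + "\n"


# Two rows: one with a masked two-token prompt, one trained on every token.
example = to_jsonl([make_record([271, 299], [99]), make_record([], [87, 227, 8383, 12])])
```

Writing `example` to a `.jsonl` file and pointing `path:` at it, with `ds_type: json` and an empty `type:`, would match the sample config in the diff.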
