diff --git a/README.md b/README.md
index 5ade3a0924..49b48bb8f3 100644
--- a/README.md
+++ b/README.md
@@ -44,6 +44,7 @@ Features:
 - Advanced Topics
   - [Multipack](./docs/multipack.qmd)
   - [RLHF & DPO](./docs/rlhf.qmd)
+  - [Dataset Pre-Processing](./docs/dataset_preprocessing.qmd)
 - [Common Errors](#common-errors-)
   - [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
 - [Debugging Axolotl](#debugging-axolotl)
diff --git a/docs/dataset_preprocessing.qmd b/docs/dataset_preprocessing.qmd
new file mode 100644
index 0000000000..c99fce444e
--- /dev/null
+++ b/docs/dataset_preprocessing.qmd
@@ -0,0 +1,35 @@
+---
+title: Dataset Preprocessing
+description: How datasets are processed
+---
+
+Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
+the [dataset format](../dataset-formats/) and prompt strategies to:
+ - parse the dataset based on the *dataset format*
+ - transform the dataset into the form you would use to interact with the model, based on the *prompt strategy*
+ - tokenize the dataset based on the configured model & tokenizer
+ - shuffle and merge multiple datasets together if using more than one
+
+The processing of the datasets can happen in one of two ways:
+
+1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
+2. When training is started
+
+What are the benefits of pre-processing? When training interactively or for sweeps
+(e.g. you are restarting the trainer often), processing the datasets can often be frustratingly
+slow. Pre-processing caches the tokenized/formatted datasets under a hash of the training
+parameters they depend on, so that later runs can pull from the cache when possible.
+
+The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
+YAMLs, as this is the more robust choice: it prevents unexpectedly reusing cached data.
+
+If `dataset_prepared_path:` is left empty, the dataset processed during training is cached in the
+default path `./last_run_prepared/`, but anything already cached there is ignored. If you explicitly
+set `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
+data is in the cache.
+
+What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
+prompt template. Because the trainer cannot readily detect these changes, the calculated hash
+value for the pre-processed dataset does not change. If you have `dataset_prepared_path: ...` set
+and change your prompt templating logic, the trainer may not pick up the changes you made and you
+will be training over the old prompt.
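
The edge case described in the new page can be illustrated with a small, self-contained sketch. This is not Axolotl's implementation: `TRACKED_KEYS`, `cache_key`, `prepare`, and the two template functions are hypothetical names, and the real set of hashed parameters is not shown in this diff. The sketch only assumes what the page states, namely that the cache is keyed by a hash of config values and that edits to prompt-templating code do not change that hash.

```python
import hashlib
import json

# Hypothetical: config keys that feed the cache hash (the real list may differ).
TRACKED_KEYS = ("datasets", "sequence_len", "tokenizer")


def cache_key(config):
    """Hash only the tracked settings; changes to templating code are invisible here."""
    tracked = {key: config.get(key) for key in TRACKED_KEYS}
    payload = json.dumps(tracked, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def prepare(config, rows, render, cache):
    """Return cached prompts when the key matches; otherwise render and store them."""
    key = cache_key(config)
    if key not in cache:
        cache[key] = [render(row) for row in rows]
    return cache[key]


def template_v1(row):
    return f"Q: {row['q']} A: {row['a']}"


def template_v2(row):
    return f"### Question: {row['q']} ### Answer: {row['a']}"


config = {"datasets": ["my_data.jsonl"], "sequence_len": 2048, "tokenizer": "my-tokenizer"}
rows = [{"q": "2+2?", "a": "4"}]
cache = {}

first = prepare(config, rows, template_v1, cache)
# The template changed but the config did not, so the key matches and stale data is reused.
second = prepare(config, rows, template_v2, cache)
# Changing a tracked setting yields a new key, so the new template is actually applied.
third = prepare({**config, "sequence_len": 4096}, rows, template_v2, cache)
```

In this sketch `second` still contains prompts rendered with `template_v1`, which mirrors the "training over the old prompt" outcome. Under the behavior the page describes, leaving `dataset_prepared_path:` empty avoids this, because anything already cached is then ignored.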