Training pipeline that fine-tunes an LLM on a proprietary Q&A dataset and stores the result in a model registry.
The best way to specialize an LLM for your task is to fine-tune it on a small dataset specific to your business use case.
In this case, we use the finance dataset generated with the q_and_a_dataset_generator
to specialize the LLM in answering investing questions.
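For orientation, the snippet below is a minimal sketch of what Q&A fine-tuning of this kind typically looks like with Hugging Face `transformers` and `peft` (LoRA). It is an assumption about the approach, not the pipeline's actual code: the base model name, dataset path, prompt format, and hyperparameters are all illustrative.

```python
# Minimal sketch of Q&A fine-tuning with LoRA (peft + transformers).
# Assumptions: the dataset is a JSON file of {"question", "answer"} records,
# and the base model name is a stand-in -- neither is the pipeline's real config.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

base_model = "mistralai/Mistral-7B-v0.1"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters so only a small fraction
# of the weights are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

def to_features(sample):
    # Concatenate question and answer into a single training sequence.
    text = f"Question: {sample['question']}\nAnswer: {sample['answer']}"
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = load_dataset("json", data_files="qa_dataset.json")["train"].map(to_features)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
).train()
```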
Main dependencies you have to install yourself:
- Python 3.10
- Poetry 1.5.1
- GNU Make 4.3
Installing all the other dependencies is as easy as running:
make install
For development, run:
make install_dev
Prepare your credentials:
cp .env.example .env
Then fill in the .env file with your credentials.
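To verify that the .env file is actually picked up, a common pattern is loading it with python-dotenv, as in the sketch below. The variable name is hypothetical; the real keys are the ones listed in .env.example.

```python
# Sanity-check that the .env file is being read (python-dotenv).
# SOME_API_KEY is a hypothetical name; the real keys live in .env.example.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("SOME_API_KEY set:", os.getenv("SOME_API_KEY") is not None)
```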
Optional: if you want to use Beam, create a Beam account and configure it.
Then upload the dataset to a Beam volume:
make upload_dataset_to_beam
For debugging or to test that everything is working fine, run the following to train the model on a small subset of the dataset:
make dev_train_local
For training on the whole dataset, run the following:
make train_local
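As a rough illustration of what the dev_* targets presumably do under the hood, a dev run trains on a small slice of the dataset so failures surface quickly. This is an assumption about the Makefile's behavior, not its actual logic; the path and subset size are illustrative.

```python
# Hypothetical illustration of a "dev" run: train on a small slice of the
# dataset to fail fast. An assumption, not the pipeline's actual code.
from datasets import load_dataset

dataset = load_dataset("json", data_files="qa_dataset.json")["train"]
dev_subset = dataset.select(range(min(32, len(dataset))))  # first 32 samples
print(f"Full: {len(dataset)} samples, dev subset: {len(dev_subset)}")
```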
Training on Beam mirrors the local workflow. For debugging or testing on a small subset of the dataset, run:
make dev_train_beam
For training on the whole dataset, run:
make train_beam
To run inference locally, for testing or debugging:
make dev_infer_local
For the full inference run:
make infer_local
To run inference on Beam, for testing or debugging:
make dev_infer_beam
For the full inference run:
make infer_beam
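For reference, loading the fine-tuned LoRA adapter on top of the base model and generating an answer typically looks like the sketch below. The model name, adapter directory, and prompt format are illustrative assumptions, not the pipeline's actual inference code.

```python
# Minimal inference sketch: load the LoRA adapter over the base model and
# answer one question. Paths and the prompt format are illustrative.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "mistralai/Mistral-7B-v0.1"           # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, "out")    # fine-tuned adapter dir
model.eval()

prompt = "Question: Is dollar-cost averaging a sound strategy?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```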
Check the code for linting issues:
make lint_check
Fix linting issues (note that some issues cannot be fixed automatically, so you may need to resolve them manually):
make lint_fix
Check the code for formatting issues:
make format_check
Fix the code for formatting issues:
make format_fix