This toy Keras project (MNIST classification with CNNs, duh) showcases how Weights & Biases lets you easily reach full MLOps maturity for experimentation and model development in your ML projects. The wandb UI is a powerful tool that you'll use while iterating over your model development lifecycle; the most important passages are shown below in videos.
In this project you have 3 Python scripts living inside the `app/pipelines` folder: `data_prep.py`, `train.py` and `evaluate.py`, which allow you to:
- prepare your training, validation and test datasets
- train a baseline model
- perform hyperparameter tuning (directly in wandb using sweeps)
- retrain your candidate model
- evaluate your final model before moving it to production
These steps should be executed in this order, at least the first time.
Thanks to wandb, you and your team have:
- version control and lineage of your datasets and models
- experiment tracking by logging almost everything your ❤ desires; thanks to the model registry, you can effortlessly transition models through their lifecycle and easily hand them off to your teammates
- a very powerful, framework-agnostic and easy-to-use tool for hyperparameter tuning (goodbye keras-tuner, my old friend)
- tables to log almost any type of data which allow you to:
- study input distributions and avoid data leakage
- perform error analysis
- the capability to write and share markdown reports directly linked to your tables or other artifacts you logged to wandb
- You need to have a wandb account. It's free for personal use and includes unlimited tracking and 100 GB of storage for artifacts.
- create a private or open project and give it a name, e.g. "MNIST"
- clone the repository and create a virtual environment from the given YAML file; all the required packages will be installed as well:
conda env create -f environment.yaml
- activate the environment:
conda activate MLOps-with-wandb
- Create a `.env` file (which is git-ignored) in the `app/` folder containing your wandb API key and other information depending on the script you're using (more information later). The file would look like this (a sketch of how the scripts can pick these variables up follows the example below):
WANDB_API_KEY="********************************"
WANDB_ENTITY="fratambot"
WANDB_PROJECT="MNIST"
WANDB_MODEL_RETRAIN="MNIST_CNN:v0"
WANDB_MODEL_EVAL="MNIST_CNN:v1"
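How the scripts pick up these variables is not shown here; a minimal, hypothetical sketch (assuming `python-dotenv`, which is an assumption rather than a documented dependency of this project) could look like this:

```python
# Hypothetical sketch: load app/.env and authenticate with wandb.
import os

import wandb
from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Read the WANDB_* variables from app/.env into the process environment
load_dotenv("app/.env")

# Log in with the API key and open a run in your entity/project
wandb.login(key=os.environ["WANDB_API_KEY"])
run = wandb.init(
    entity=os.environ["WANDB_ENTITY"],
    project=os.environ["WANDB_PROJECT"],
    job_type="smoke-test",  # illustrative job type
)
run.finish()
```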
To prepare your datasets you can run the `app/pipelines/data_prep.py` script: it will sample 10% of the Keras MNIST dataset and split it into 70% training / 20% validation / 10% test with stratification over the classes.
You can change these default values by passing them as arguments; a minimal sketch of this step is shown after the list below. For more info, consult:
python app/pipelines/data_prep.py --help
- requirements: `WANDB_API_KEY`, `WANDB_PROJECT`
- inputs: None
- outputs:
  - artifact: a training / validation / test dataset collection in a file called `split-data.npz`
  - media: a histogram showing the label distribution for the 3 datasets (rescaled with respect to the relative split proportions)
  - table: a 2-column table with the label and the stage (training / validation / test set)
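The script's internals aren't reproduced here, but a minimal sketch of the sampling, the stratified split and the artifact logging could look like the following (the `split-data` artifact name and the use of scikit-learn are assumptions; the histogram and table outputs listed above are omitted for brevity):

```python
# Minimal sketch of the data preparation step (assumed implementation, not the actual script).
import numpy as np
import wandb
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Load the full MNIST dataset (train + test) and sample 10% of it, stratified over the classes
(x1, y1), (x2, y2) = keras.datasets.mnist.load_data()
x, y = np.concatenate([x1, x2]), np.concatenate([y1, y2])
_, x_sub, _, y_sub = train_test_split(x, y, test_size=0.1, stratify=y, random_state=42)

# 70% training / 20% validation / 10% test, again with stratification
x_train, x_tmp, y_train, y_tmp = train_test_split(
    x_sub, y_sub, test_size=0.3, stratify=y_sub, random_state=42
)
x_val, x_test, y_val, y_test = train_test_split(
    x_tmp, y_tmp, test_size=1 / 3, stratify=y_tmp, random_state=42
)

# Save the collection and version it as a wandb artifact
np.savez("split-data.npz", x_train=x_train, y_train=y_train,
         x_val=x_val, y_val=y_val, x_test=x_test, y_test=y_test)
run = wandb.init(project="MNIST", job_type="data_prep")
artifact = wandb.Artifact("split-data", type="dataset")
artifact.add_file("split-data.npz")
run.log_artifact(artifact)
run.finish()
```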
This is what you'll find on wandb and how to interact with it through the UI:
data_prep.mov
To train a baseline model you can run the `app/pipelines/train.py` script: it will train a CNN (defined in `app/models/models.py`) for 5 epochs with default hyperparameters, which you can change by passing them as arguments; a minimal sketch is shown after the list below. For more info, consult:
python app/pipelines/train.py --help
- requirements: `WANDB_API_KEY`, `WANDB_PROJECT`, `WANDB_ENTITY`
- inputs:
  - artifact: the latest version of `split-data.npz`
- outputs:
  - artifact: a trained Keras model in a file called `CNN_model.h5`
  - metrics: automagically logged using the wandb Keras callback
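As a rough illustration of this step, the sketch below trains a small CNN on the dataset artifact and versions the resulting `CNN_model.h5` (the architecture, hyperparameters and artifact names are assumptions, not the actual `train.py`; the real model is defined in `app/models/models.py`):

```python
# Minimal sketch of the baseline training step (assumed architecture and names).
import numpy as np
import wandb
from tensorflow import keras
from wandb.keras import WandbCallback  # the wandb Keras callback mentioned above

run = wandb.init(project="MNIST", job_type="train", config={"epochs": 5, "learning_rate": 1e-3})

# Pull the latest version of the dataset artifact logged by the data preparation step
data_dir = run.use_artifact("split-data:latest").download()
data = np.load(f"{data_dir}/split-data.npz")
x_train = data["x_train"][..., None] / 255.0  # add the channel dimension and rescale
x_val = data["x_val"][..., None] / 255.0

# Illustrative CNN; the real one is defined in app/models/models.py
model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(run.config.learning_rate),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(
    x_train, data["y_train"],
    validation_data=(x_val, data["y_val"]),
    epochs=run.config.epochs,
    callbacks=[WandbCallback(save_model=False)],  # logs the training metrics to wandb
)

# Version the trained model as a wandb artifact
model.save("CNN_model.h5")
artifact = wandb.Artifact("MNIST_CNN", type="model")
artifact.add_file("CNN_model.h5")
run.log_artifact(artifact)
run.finish()
```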
This is what you'll find on wandb and how to interact with it through the UI:
training.mov
At the end you can create and share a nice report with your findings and insights for your team.
You can perform hyperparameter tuning using wandb sweeps.
For that you will run the training script with the boolean `--tune` flag:
python app/pipelines/train.py --tune
The script will create a wandb agent performing a Bayesian search, with at most `--max_sweep` runs (30 by default), over the set of hyperparameter choices defined in the `sweep.yaml` file living in the `app/pipelines` folder; a minimal sketch is shown after the list below. You can change the sweep configuration according to your preferences and adjust `--max_sweep`, `--epochs` and other performance parameters according to your infrastructure, time and resources.
- requirements: `WANDB_API_KEY`, `WANDB_PROJECT`, `WANDB_ENTITY`
- inputs:
  - artifact: the latest version of `split-data.npz`
- outputs:
  - artifact: a trained Keras model in a file called `CNN_model.h5` for each sweep run
  - metrics: automagically logged using the wandb Keras callback
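The sweep itself relies on the standard wandb sweep API; the sketch below shows the general pattern (the `train_run` function body and the exact keys inside `sweep.yaml` are assumptions, not the actual script):

```python
# Minimal sketch of launching a Bayesian sweep from the sweep.yaml configuration (assumed structure).
import yaml
import wandb

# sweep.yaml is expected to define method: bayes, the metric to optimize and the hyperparameter choices
with open("app/pipelines/sweep.yaml") as f:
    sweep_config = yaml.safe_load(f)

def train_run():
    # The sweep controller populates wandb.config with one hyperparameter combination per run
    run = wandb.init()
    # ... build the CNN from run.config, train it and log the metrics here ...
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="MNIST")
wandb.agent(sweep_id, function=train_run, count=30)  # count plays the role of --max_sweep
```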
The sweep visualization in wandb is probably the most impressive one, allowing you and your team to easily compare models trained with different hyperparameters, pick the best candidate and put it in the model registry to move it further along the model development lifecycle. Moreover, you can evaluate which parameters have the most impact by looking at the auto-generated parameter importance plot and, everybody's favourite, the parallel coordinates plot.
This is how it looks in the UI:
tuning.mov
Your observations and insights on the hyperparameter tuning step can be shared with your team in a report linked to your sweep runs for further investigation.
Once you have found a candidate model with the best combination of hyperparameter values, you should probably retrain it for more epochs before evaluating it. For this task you'll use the `train.py` script with the boolean `--retrain` flag. The model and its hyperparameter values will be loaded from wandb and you should NOT change them! A minimal sketch is shown after the list below.
python app/pipelines/train.py --retrain --epochs=15
- requirements: `WANDB_API_KEY`, `WANDB_PROJECT`, `WANDB_ENTITY`, `WANDB_MODEL_RETRAIN` (you can find the "candidate" model id in the model registry; see the end of the previous video)
- inputs:
  - artifact: the latest version of `split-data.npz`
  - artifact: the model you tagged as "candidate" in your model registry (e.g. `WANDB_MODEL_RETRAIN="MNIST_CNN:v0"`)
- outputs:
  - artifact: a retrained version of your candidate model in a file called `CNN_model.h5`
  - metrics: automagically logged using the wandb Keras callback
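In terms of wandb calls, the retraining step could look roughly like the sketch below (file and artifact names are carried over from the earlier sketches as assumptions, not taken from the actual script):

```python
# Minimal sketch of the retraining step (assumed names; the real logic lives in train.py --retrain).
import os

import numpy as np
import wandb
from tensorflow import keras

run = wandb.init(project=os.environ["WANDB_PROJECT"], job_type="retrain")

# Pull the candidate model referenced by WANDB_MODEL_RETRAIN, e.g. "MNIST_CNN:v0"
model_dir = run.use_artifact(os.environ["WANDB_MODEL_RETRAIN"]).download()
model = keras.models.load_model(os.path.join(model_dir, "CNN_model.h5"))

data_dir = run.use_artifact("split-data:latest").download()
data = np.load(os.path.join(data_dir, "split-data.npz"))

# Keep the hyperparameters untouched: only the number of epochs changes
model.fit(
    data["x_train"][..., None] / 255.0, data["y_train"],
    validation_data=(data["x_val"][..., None] / 255.0, data["y_val"]),
    epochs=15,
)
run.finish()
```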
If your model doesn't overfit, you can tag it as "to_evaluate" in your model registry for the next step:
retraining.mov
Now that you have retrained your candidate model for more epochs, it's time to evaluate it. For this task you'll use the `app/pipelines/evaluate.py` script.
The evaluation will be performed on the validation set (again) and on the test set, which your model has never seen before. The comparison allows you to estimate the degree of overfitting to the validation set, if present.
For this task a table with images generated from the numpy array examples (`X_val` and `X_test`) will be built. This operation might take a while depending on the size of your validation and test sets (if you kept the default parameters in `data_prep.py`, there will be 1401 validation examples and 700 test examples to convert into images, taking up ~17 MB of hard disk and wandb storage space). You can decide not to generate the example images by using the boolean `--no-generate_images` flag:
python app/pipelines/evaluate.py --no-generate_images
But it is strongly suggested to run the script without flags, grab a cup of ☕ and generate the images, because they can be really useful for error analysis. A minimal sketch of the evaluation step is shown after the list below.
- requirements: `WANDB_API_KEY`, `WANDB_PROJECT`, `WANDB_ENTITY`, `WANDB_MODEL_EVAL` (you can find the "to_evaluate" model id in the model registry; see the end of the previous video)
- inputs:
  - artifact: the latest version of `split-data.npz`
  - artifact: the model you tagged as "to_evaluate" in your model registry (e.g. `WANDB_MODEL_EVAL="MNIST_CNN:v1"`)
- outputs:
  - metrics: loss and categorical accuracy for both the validation and test sets
  - media: confusion matrices for both the validation and test sets
  - table: a 3-column table with the MNIST image of the example, the true label and the predicted label, for both the validation and test sets
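A minimal sketch of how such a table and confusion matrix could be logged, under the same naming assumptions as the earlier sketches (not the actual `evaluate.py`, and showing only the test set for brevity):

```python
# Minimal sketch of the evaluation step (assumed names, not the actual script).
import numpy as np
import wandb
from tensorflow import keras

run = wandb.init(project="MNIST", job_type="evaluate")

# Pull the model tagged "to_evaluate" (e.g. "MNIST_CNN:v1") and the latest dataset collection
model_dir = run.use_artifact("MNIST_CNN:v1").download()
model = keras.models.load_model(f"{model_dir}/CNN_model.h5")
data = np.load(f"{run.use_artifact('split-data:latest').download()}/split-data.npz")

x_test, y_test = data["x_test"], data["y_test"]
preds = np.argmax(model.predict(x_test[..., None] / 255.0), axis=1)

# One row per example: the rendered image, the true label and the predicted label
table = wandb.Table(columns=["image", "label", "prediction"])
for img, label, pred in zip(x_test, y_test, preds):
    table.add_data(wandb.Image(img), int(label), int(pred))

run.log({
    "test_predictions": table,
    "test_confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_test.tolist(), preds=preds.tolist(),
        class_names=[str(i) for i in range(10)],
    ),
})
run.finish()
```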
This is how it looks in the UI:
evaluate.mov
And, as usual, you can write and share a detailed report for the production team.
If you want to learn more on Weights & Biases, here are some extra resources:
- the wandb documentation, which is very rich and points you to examples on GitHub and Colab
- the free "Effective MLOps: model development" course by wandb
- the wandb blog, "Fully Connected"
- the wandb white paper "MLOps: a holistic approach"