Feature/add fedavg metric optrimization controller #3506
Conversation
@rbagan
Lydia Anette Schönpflug (USZ) and Ruben Bagan Benavides (Roche)
thank you so much for this contribution. We always love to see the community contribute.
There are a couple of issues with this PR:
- The files need to be formatted in a certain way to pass the unit tests.
What I usually do is run
./runtest.sh -f
which fixes most of the formatting for me.
Then I run
./runtest.sh -s
to check if anything else needs to be fixed.
Behind the scenes this basically calls black-check, isort-check, etc.
Or you can simply run
./runtest.sh
which runs all the unit tests to make sure everything passes (the format is checked first).
- The proposed PR is very similar to the existing controller fedavg_early_stopping.py,
which stops based on a condition expression such as "accuracy > 0.8":
https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_opt/pt/fedavg_early_stopping.py
The client script for this controller is in the example: https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-fedavg/pt_fedavg_early_stopping_script.py
Please check whether your work offers improvements beyond the fedavg_early_stopping controller.
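The early-stopping controller referenced above takes a stop condition expressed as a string like "accuracy > 0.8". As an illustration only, here is a minimal sketch of how such a condition string could be parsed and checked against a round's metrics; the helper names (`parse_stop_condition`, `should_stop`) are hypothetical and this is not NVFlare's actual parsing code.

```python
import operator

# Map comparison tokens to comparison functions.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def parse_stop_condition(cond: str):
    """Split a condition like 'accuracy > 0.8' into (metric, compare_fn, target)."""
    key, op_token, target = cond.split()
    return key, _OPS[op_token], float(target)


def should_stop(metrics: dict, cond: str) -> bool:
    """Return True if the tracked metric satisfies the stop condition."""
    key, compare_fn, target = parse_stop_condition(cond)
    value = metrics.get(key)
    return value is not None and compare_fn(value, target)
```

For example, `should_stop({"accuracy": 0.85}, "accuracy > 0.8")` evaluates to True, while a missing or below-threshold metric yields False.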
- `task_validation_name`: specifies the name of the validation task
- `task_to_optimize`: indicates whether to apply metric optimization to the training or validation task
- `patience`: defines the number of FL rounds to wait without improvement before stopping the training
* Model Selection: As and alternative to using a IntimeModelSelector componenet for model selection, we instead compare the metrics of the models in the workflow to select the best model each round.
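To make the `patience` and model-selection behavior described above concrete, here is a minimal, self-contained sketch of patience-based stopping combined with best-metric tracking. The `MetricTracker` class and its method names are hypothetical illustrations, not the PR's actual API.

```python
class MetricTracker:
    """Track a metric across FL rounds; stop after `patience` rounds without improvement."""

    def __init__(self, patience: int, maximize: bool = True):
        self.patience = patience
        self.maximize = maximize  # True: higher is better (e.g. F-score); False: lower (e.g. loss)
        self.best = None
        self.rounds_without_improvement = 0

    def update(self, value: float) -> bool:
        """Record one round's metric; return True when training should stop."""
        improved = (
            self.best is None
            or (self.maximize and value > self.best)
            or (not self.maximize and value < self.best)
        )
        if improved:
            # Best model so far: a real controller would persist it here.
            self.best = value
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience
```

With `patience=2`, for instance, two consecutive rounds without improvement over the best recorded metric trigger the stop signal.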
typo: componenet
-> component
@@ -0,0 +1,284 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Please use 2025 in the license files.
from src.net import Net

from nvflare import FedJob
from fedavg_metric_optimization import PTFedAvgMetricOptimization
This should be `from nvflare.app_opt.pt.fedavg_metric_optimization import PTFedAvgMetricOptimization`
# (optional) set a fix place so we don't need to download everytime
CIFAR10_ROOT = "data/cifar10"
# (optional) We change to use GPU to speed things up.
# if you want to use CPU, change DEVICE="cpu"
We typically use this so the code automatically works with CPU or GPU:
# If available, we use GPU to speed things up.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
Fixes # .
Description
Hi all,
This pull request adds a new controller: FedAvg Metric Optimization. This controller is the result of a joint effort between Roche and Universitätsspital Zurich (USZ) to train a model in a federated manner.
Purpose of the controller:
The goal is to obtain the best possible model by optimizing a specific metric (e.g., minimizing the loss or maximizing the F-score) and to stop training if the tracked metric does not improve after a certain number of FL rounds, as defined by the researcher. This saves computation time during FL training, especially when the model is large and requires a significant amount of data. The controller was developed as part of a paper that is currently under peer review.
Additionally, we wanted to provide the option to choose whether to optimize the metric during training or validation.
I would like to highlight that this contribution is thanks to Roche and the Universitätsspital Zurich (USZ).
Best,
Lydia Anette Schönpflug (USZ) and Ruben Bagan Benavides (Roche)
Types of changes
./runtest.sh