
Feature/add fedavg metric optrimization controller #3506


Open · wants to merge 3 commits into base: main

Conversation


@rbagan rbagan commented May 22, 2025

Fixes # .

Description

Hi all,

This pull request adds a new controller: FedAvg Metric Optimization. This controller is the result of a joint effort between Roche and Universitätsspital Zurich (USZ) to train a model in a federated manner.

Purpose of the controller:
The goal is to obtain the best possible model by optimizing a specific metric (e.g., minimizing the loss or maximizing the F-score) and to stop the training if the tracked metric does not improve after a certain number of FL rounds, as defined by the researcher. This approach saves computation time during FL training, especially when the model is large and requires a significant amount of data. This controller was developed as part of a paper that is currently under peer review.

Additionally, we wanted to provide the option to choose whether to optimize the metric during training or validation.
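The patience-based stopping described above can be sketched as follows. This is a minimal, self-contained illustration of the idea, not the actual NVFlare controller code; the class and attribute names are hypothetical.

```python
# Hypothetical sketch of patience-based early stopping on a tracked metric.
# Not the actual PTFedAvgMetricOptimization implementation.
class MetricTracker:
    def __init__(self, patience: int, maximize: bool = True):
        self.patience = patience        # FL rounds to wait without improvement
        self.maximize = maximize        # True for e.g. F-score, False for loss
        self.best = None                # best metric value seen so far
        self.rounds_without_improvement = 0

    def update(self, value: float) -> bool:
        """Record one round's metric; return True if training should stop."""
        improved = (
            self.best is None
            or (self.maximize and value > self.best)
            or (not self.maximize and value < self.best)
        )
        if improved:
            self.best = value
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience
```

For example, with `MetricTracker(patience=3, maximize=False)` tracking a loss, the workflow would stop after three consecutive rounds in which the loss fails to drop below the best value seen so far.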

We would like to highlight that this contribution is a joint effort by Roche and the Universitätsspital Zurich (USZ).

Best,
Lydia Anette Schönpflug (USZ) and Ruben Bagan Benavides (Roche)

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@chesterxgchen chesterxgchen (Collaborator) left a comment

@rbagan
Lydia Anette Schönpflug (USZ) and Ruben Bagan Benavides (Roche),
thank you so much for this contribution. We always love to see contributions from the community.

There are a couple of issues with this PR:

  1. The files need to be formatted in a certain way to pass the unit tests.
    What I usually do is run

./runtest.sh -f

which fixes most of the formatting for me.

then I run

./runtest.sh -s

to check if anything else needs to be fixed.

Behind the scenes it basically calls black-check, isort-check, etc.

or you can simply run

./runtest.sh

which will run all the unit tests to make sure everything passes (the format is checked first).

  2. The proposed PR is very similar to the fedavg_early_stopping.py controller,
    which stops based on a condition such as "accuracy > 0.8":
    https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_opt/pt/fedavg_early_stopping.py
    The client script for this controller is in the example: https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-fedavg/pt_fedavg_early_stopping_script.py

Please see if your work has additional improvement beyond the fedavg_early_stopping controller.
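The condition-string style of stopping that the reviewer refers to (e.g. "accuracy > 0.8") could be evaluated roughly like the sketch below. This is a hypothetical, minimal evaluator for illustration only; the actual fedavg_early_stopping controller implements its own logic.

```python
import operator

# Map comparison symbols to operator functions; the condition string is
# assumed to have the form "<metric> <op> <threshold>", e.g. "accuracy > 0.8".
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def should_stop(condition: str, metrics: dict) -> bool:
    """Return True if the metrics dict satisfies the stop condition."""
    key, op, threshold = condition.split()
    return OPS[op](metrics[key], float(threshold))
```

For example, `should_stop("accuracy > 0.8", {"accuracy": 0.85})` evaluates the condition against the aggregated metrics for the round.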

- `task_validation_name`: specifies the name of the validation task
- `task_to_optimize`: indicates whether to apply metric optimization to the training or validation task
- `patience`: defines the number of FL rounds to wait without improvement before stopping the training
* Model Selection: As an alternative to using an IntimeModelSelector componenet for model selection, we instead compare the metrics of the models in the workflow to select the best model each round.
Collaborator:

typo: componenet -> component

@@ -0,0 +1,284 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator:

Please use 2025 in the license files.

from src.net import Net

from nvflare import FedJob
from fedavg_metric_optimization import PTFedAvgMetricOptimization
Collaborator:

This should be: from nvflare.app_opt.pt.fedavg_metric_optimization import PTFedAvgMetricOptimization

# (optional) set a fixed place so we don't need to download every time
CIFAR10_ROOT = "data/cifar10"
# (optional) We change to use GPU to speed things up.
# if you want to use CPU, change DEVICE="cpu"
Collaborator:

We typically use this so the code automatically works with CPU or GPU:

# If available, we use GPU to speed things up.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

3 participants