Add async aggregator #3455

yanchengnv · 2025-04-28T18:56:22Z

Fixes # .

Description

This PR implements a wrapper aggregator that runs another aggregator's "accept" method in a separate thread. When "accept" processing is time-consuming, as some customers have reported, this can drastically reduce the time the SAG (ScatterAndGather) workflow is blocked while processing client submissions.

To use this aggregator, configure the original aggregator as a component, pass its ID to the wrapper aggregator (AsyncAggregator) through "aggregator_id", and point the workflow's "aggregator_id" at the wrapper:

"components": [
  {
    "id": "aggregator",
    "path": "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator",
    "args": {
      "expected_data_kind": "WEIGHTS"
    }
  },
  {
    "id": "aggr_wrapper",
    "path": "nvflare.app_common.aggregators.async.AsyncAggregator",
    "args": {
      "aggregator_id": "aggregator"
    }
  },
  ...
],
"workflows": [
  {
    "id": "scatter_and_gather",
    "path": "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather",
    "args": {
      "min_clients": "{min_clients}",
      "num_rounds": 2,
      "start_round": 0,
      "wait_time_after_min_received": "{wait_time}",
      "aggregator_id": "aggr_wrapper",
      "persistor_id": "persistor",
      "shareable_generator_id": "shareable_generator",
      "train_task_name": "train",
      "train_timeout": 6000,
      "ignore_result_error": true
    }
  }
]
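
For readers unfamiliar with the pattern, the idea described above (hand each submission to a background thread so the caller is not blocked by a slow "accept") can be sketched in plain Python roughly as follows. This is only an illustration and not the code in this PR: AsyncAcceptWrapper, SumAggregator, and the simplified accept(item) / aggregate() signatures are hypothetical and do not use the NVFlare Aggregator API.

```python
import queue
import threading
import time


class AsyncAcceptWrapper:
    """Hypothetical wrapper: queues submissions and runs inner.accept() in a worker thread."""

    def __init__(self, inner):
        self.inner = inner  # any object with accept(item) and aggregate()
        self.errors = []  # exceptions raised by inner.accept(), if any
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Worker loop: process queued submissions one at a time.
        while True:
            item = self._queue.get()
            try:
                self.inner.accept(item)
            except Exception as ex:  # keep the worker alive and record the failure
                self.errors.append(ex)
            finally:
                self._queue.task_done()

    def accept(self, item):
        # Returns immediately; the slow work happens in the worker thread.
        self._queue.put(item)
        return True

    def aggregate(self):
        # Wait until every queued submission has been processed.
        self._queue.join()
        if self.errors:
            raise RuntimeError(f"{len(self.errors)} submission(s) failed in accept")
        return self.inner.aggregate()


class SumAggregator:
    """Hypothetical slow aggregator used only for this demonstration."""

    def __init__(self):
        self.total = 0

    def accept(self, item):
        time.sleep(0.01)  # simulate time-consuming processing
        self.total += item
        return True

    def aggregate(self):
        return self.total


if __name__ == "__main__":
    wrapper = AsyncAcceptWrapper(SumAggregator())
    for value in range(1, 6):
        wrapper.accept(value)  # does not block on the 10 ms sleep
    print(wrapper.aggregate())
```

The sketch uses a single worker so that submissions reach the inner aggregator one at a time, which avoids requiring the inner "accept" to be thread-safe. Whether the PR makes the same choice is not stated in the description.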

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running ./runtest.sh.
In-line docstrings updated.
Documentation updated.

YuanTingHsieh

Mostly LGTM

YuanTingHsieh · 2025-04-30T00:38:53Z

nvflare/apis/fl_context.py

+        """
+        with _update_lock:
+            new_ctx = FLContext()
+            new_ctx.model = self.model


Not related to this PR, but are we using this "self.model" anywhere?

YuanTingHsieh · 2025-04-30T01:05:00Z

nvflare/app_common/aggregators/async.py

+        if rc != _AcceptWaitRC.IS_SET:
+            self.log_warning(fl_ctx, f"abnormal result {rc} waiting for accept thread")


If the accept thread has an "abnormal" RC, should we just return here, or raise an exception?

This can only happen when the "accept" thread times out (e.g. the original aggregator's accept method gets stuck). I think you are right that we should raise an exception, since there is no good way to recover.
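
A minimal sketch of the behavior agreed on here (raise instead of only logging when the wait for the accept thread times out) might look like the following. It is a generic illustration built on threading.Event, not the PR's code: wait_for_accept and AcceptTimeoutError are hypothetical names, and the real implementation compares against _AcceptWaitRC values whose full definition is not shown in this thread.

```python
import threading


class AcceptTimeoutError(RuntimeError):
    """Hypothetical exception raised when the accept thread does not finish in time."""


def wait_for_accept(done: threading.Event, timeout: float) -> None:
    """Block until `done` is set; raise if it is still unset after `timeout` seconds."""
    # Event.wait() returns True if the event was set and False on timeout.
    if not done.wait(timeout):
        raise AcceptTimeoutError(
            f"accept thread did not finish within {timeout} seconds"
        )


if __name__ == "__main__":
    finished = threading.Event()
    finished.set()  # simulate an accept thread that completed normally
    wait_for_accept(finished, timeout=0.1)  # returns without raising

    stuck = threading.Event()  # never set: simulates a stuck accept call
    try:
        wait_for_accept(stuck, timeout=0.05)
    except AcceptTimeoutError as ex:
        print(f"aborting aggregation: {ex}")
```

Raising makes the failure visible to the caller, which matches the reasoning above that a stuck accept leaves no good way to recover.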

add async aggregator

4d88384

yanchengnv requested review from YuanTingHsieh, holgerroth and ZiyueXu77 April 28, 2025 19:02

YuanTingHsieh reviewed Apr 30, 2025

Merge branch 'main' into async_aggr

5a5128f
