feat: inference with external graph. (#216)

dietervdb-meteo · gmertes · frazane · web-flow · commit 6765316f3f09 · 2025-06-03T14:57:47.000+01:00
## Description

We add functionality so that in inference the model can use another graph than the one it was trained with. In this first implementation the graph has to be provided as a file on disk.

This PR adds a new runner `runner: external_graph` that is an extension of the default runner. The code is based on a similar feature in bris-inference: https://github.com/metno/bris-inference/blob/main/bris/checkpoint.py#L185

The runner can be selected and set in the config as follows:

```yaml
runner:
  external_graph:
    graph: path/to/graph.pt
```

For further options for the runner please consult the documentation.

## Type of Change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] Documentation update

## Issue Number

Closes #215.

## Code Compatibility

- [x] I have performed a self-review of my code

### Code Performance and Testing

- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I ran the [complete Pytest test](https://anemoi.readthedocs.io/projects/training/en/latest/dev/testing.html) suite locally, and they pass

### Dependencies

- [x] I have ensured that the code is still pip-installable after the changes and runs
- [ ] I have tested that new dependencies themselves are pip-installable.

### Documentation

- [ ] My code follows the style guidelines of this project
- [x] I have updated the documentation and docstrings to reflect the changes
- [x] I have added comments to my code, particularly in hard-to-understand areas

## Additional Notes

----

📚 Documentation preview 📚: https://anemoi-inference--216.org.readthedocs.build/en/216/

---------

Co-authored-by: Gert Mertes <gert.mertes@ecmwf.int>
Co-authored-by: Francesco Zanetta <zanetta.francesco@gmail.com>
diff --git a/docs/index.rst b/docs/index.rst
@@ -121,6 +121,7 @@ You may also have to install pandoc on MacOS:
    :caption: Recipe Examples
 
    usage/getting-started
+   usage/external-graph
 
 .. toctree::
    :maxdepth: 1
diff --git a/docs/usage/external-graph.rst b/docs/usage/external-graph.rst
@@ -0,0 +1,65 @@
+.. _usage-external-graph:
+
+###################################
+ Inference using an external graph
+###################################
+
+Anemoi is a framework for building and running machine learning models
+based on graph neural networks (GNNs). One of the key features of such
+GNNs is that they can operate on arbitrary graphs. In particular, this
+means one can train the model on one graph but use it in inference on
+another graph. This way one can transfer the model to a different domain
+or dataset without any fine-tuning, or even change the scope of a
+model, for example by using a model trained as a stretched-grid model
+as a limited area model (LAM) with boundary forcings in inference.
+
+We should caution that such transfer of the model from one graph to
+another is not guaranteed to lead to good results. Still, it is a
+powerful tool to explore generalizability of the model or to test
+performance before starting fine tuning through transfer learning.
+
+The ability to do inference with an alternative graph, or more precisely
+one 'external' to the checkpoint created in training, is supported by
+anemoi-inference through the ``external_graph`` runner.
+
+This runner, and the graph it will use, can be specified in the config
+file as follows:
+
+.. literalinclude:: yaml/external-graph1.yaml
+   :language: yaml
+
+In case one wants to run a model trained on a global dataset on a graph
+supported only on a limited area, one needs to specify the
+``output_mask`` to be used. This mask selects the region on which the
+model will forecast and triggers boundary forcings to be applied when
+forecasting autoregressively towards later lead times. In inference, as
+in training, the output mask originates from an attribute of the
+output nodes of the graph. It can be specified in the config file as
+follows:
+
+.. literalinclude:: yaml/external-graph2.yaml
+   :language: yaml
+
+For LAM models the limited area among the input nodes of a larger
+dataset is often specified by the ``indices_connected_nodes`` attribute
+of the input nodes. Anemoi-inference will automatically update the
+dataloader to load only data in the limited area in case the external
+graph contains this attribute and was built using the same dataset as
+the one in the checkpoint.
+
+In case one wants to work with a graph that was built on another dataset
+than that used in training, one should specify this in the config file as
+well:
+
+.. literalinclude:: yaml/external-graph3.yaml
+   :language: yaml
+
+It should be emphasized that by using this runner the model will be
+rebuilt and for this reason will differ from the model stored in the
+checkpoint. To avoid unexpected results, there is a default check that
+ensures the model used in inference has the same weights, biases and
+normalizer values as that stored in the checkpoint. In case of a more
+adventurous use-case this check can be disabled through the config as:
+
+.. literalinclude:: yaml/external-graph4.yaml
+   :language: yaml
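
The documentation above describes a default check that the rebuilt model keeps the checkpoint's weights, biases and normalizer values. In this diff only the `check_state_dict` flag is visible, not the comparison itself, so the following is a minimal sketch of what such a comparison could look like. It assumes plain dictionaries of lists in place of real tensors; `changed_entries` and the example keys are hypothetical, and only the keyword list is taken from the runner code.

```python
def changed_entries(before, after, keywords=("weight", "bias", "processors.normalizer")):
    """Hypothetical check: list keyword-matching entries that changed or vanished."""
    return sorted(
        key
        for key, value in before.items()
        # Only entries whose name contains one of the keywords are compared
        if any(keyword in key for keyword in keywords)
        and (key not in after or after[key] != value)
    )


# Made-up entries: the bias differs, the graph-dependent entry is ignored
before = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0], "layer.edge_attr": [1.0]}
after = {"layer.weight": [0.1, 0.2], "layer.bias": [0.5], "layer.edge_attr": [2.0, 3.0]}

print(changed_entries(before, after))
```

An empty result would mean that every compared entry survived the rebuild unchanged.
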
diff --git a/docs/usage/yaml/external-graph1.yaml b/docs/usage/yaml/external-graph1.yaml
@@ -0,0 +1,3 @@
+runner:
+  external_graph:
+    graph: path/to/graph.pt
diff --git a/docs/usage/yaml/external-graph2.yaml b/docs/usage/yaml/external-graph2.yaml
@@ -0,0 +1,6 @@
+runner:
+  external_graph:
+    graph: path/to/graph.pt
+    output_mask:
+      nodes_name: data # name of the output nodes of the graph
+      attribute_name: cutout_mask # mask specifying the limited area among the output nodes
diff --git a/docs/usage/yaml/external-graph3.yaml b/docs/usage/yaml/external-graph3.yaml
@@ -0,0 +1,6 @@
+runner:
+  external_graph:
+    graph: path/to/graph.pt
+    graph_dataset: path/to/graph_dataset.zarr
+    # the above can be an anemoi-datasets.open_dataset argument as well,
+    # rather than simply a path
diff --git a/docs/usage/yaml/external-graph4.yaml b/docs/usage/yaml/external-graph4.yaml
@@ -0,0 +1,4 @@
+runner:
+  external_graph:
+    graph: path/to/graph.pt
+    check_state_dict: False
diff --git a/src/anemoi/inference/config/run.py b/src/anemoi/inference/config/run.py
@@ -31,7 +31,7 @@ class RunConfiguration(Configuration):
     checkpoint: Union[str, Dict[Literal["huggingface"], Union[Dict[str, Any], str]]]
     """A path to an Anemoi checkpoint file."""
 
-    runner: str = "default"
+    runner: Union[str, Dict[str, Any]] = "default"
     """The runner to use."""
 
     date: Union[str, int, datetime.datetime, None] = None
diff --git a/src/anemoi/inference/runner.py b/src/anemoi/inference/runner.py
@@ -401,7 +401,7 @@ def prepare_input_tensor(self, input_state: State, dtype: DTypeLike = np.float32
             shape=(
                 self.checkpoint.multi_step_input,
                 self.checkpoint.number_of_input_features,
-                self.checkpoint.number_of_grid_points,
+                input_state["latitudes"].size,
             ),
             fill_value=np.nan,
             dtype=dtype,
diff --git a/src/anemoi/inference/runners/__init__.py b/src/anemoi/inference/runners/__init__.py
@@ -30,4 +30,4 @@ def create_runner(config: Configuration, **kwargs: Any) -> Any:
     Any
         The created runner instance.
     """
-    return runner_registry.create(config.runner, config, **kwargs)
+    return runner_registry.from_config(config.runner, config, **kwargs)
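
With `runner` now typed as `Union[str, Dict[str, Any]]`, the config value is either a bare runner name or a mapping from the runner name to its keyword arguments. The sketch below illustrates one way such a value could be split into a name and keyword arguments. `split_runner_config` is a hypothetical helper written for this note; it is not the implementation of `runner_registry.from_config`, whose exact behaviour is assumed here.

```python
from typing import Any, Dict, Tuple, Union


def split_runner_config(runner: Union[str, Dict[str, Any]]) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: split a runner config value into (name, kwargs)."""
    if isinstance(runner, str):
        # Plain form, e.g. `runner: default`
        return runner, {}
    if isinstance(runner, dict) and len(runner) == 1:
        # Mapping form, e.g. `runner: {external_graph: {graph: path/to/graph.pt}}`
        ((name, kwargs),) = runner.items()
        return name, dict(kwargs or {})
    raise ValueError(f"Cannot interpret runner configuration: {runner!r}")


name, kwargs = split_runner_config({"external_graph": {"graph": "path/to/graph.pt"}})
print(name, kwargs)
```

Under this assumption, the keys below `external_graph` in the YAML examples would arrive as keyword arguments of `ExternalGraphRunner.__init__`.
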
diff --git a/src/anemoi/inference/runners/external_graph.py b/src/anemoi/inference/runners/external_graph.py
@@ -0,0 +1,188 @@
+import logging
+import os
+from copy import deepcopy
+from functools import cached_property
+from typing import Any
+
+import torch
+from anemoi.datasets import open_dataset
+
+from ..runners.default import DefaultRunner
+from . import runner_registry
+
+LOG = logging.getLogger(__name__)
+
+# Possibly move the function(s) below to anemoi-models or anemoi-utils since it could be used in transfer learning.
+
+
+def contains_any(key, specifications):
+    contained = False
+    for specification in specifications:
+        if specification in key:
+            contained = True
+            break
+    return contained
+
+
+def update_state_dict(
+    model, external_state_dict, keywords="", ignore_mismatched_layers=False, ignore_additional_layers=False
+):
+    """Update the model's state_dict with entries from an external state_dict. Only entries whose keys contain the specified keywords are considered."""
+
+    LOG.info("Updating model state dictionary.")
+
+    if isinstance(keywords, str):
+        keywords = [keywords]
+
+    # select relevant part of external_state_dict
+    reduced_state_dict = {k: v for k, v in external_state_dict.items() if contains_any(k, keywords)}
+    model_state_dict = model.state_dict()
+
+    # check layers and their shapes
+    for key in list(reduced_state_dict):
+        if key not in model_state_dict:
+            if ignore_additional_layers:
+                LOG.info("Skipping injection of %s, which is not in the model.", key)
+                del reduced_state_dict[key]
+            else:
+                raise AssertionError(f"Layer {key} not in model. Consider setting 'ignore_additional_layers = True'.")
+        elif reduced_state_dict[key].shape != model_state_dict[key].shape:
+            if ignore_mismatched_layers:
+                LOG.info("Skipping injection of %s due to shape mismatch.", key)
+                LOG.info("Model shape: %s", model_state_dict[key].shape)
+                LOG.info("Provided shape: %s", reduced_state_dict[key].shape)
+                del reduced_state_dict[key]
+            else:
+                raise AssertionError(
+                    f"Mismatch in shape of {key}. Consider setting 'ignore_mismatched_layers = True'."
+                )
+
+    # update
+    model.load_state_dict(reduced_state_dict, strict=False)
+    return model
+
+
+@runner_registry.register("external_graph")
+class ExternalGraphRunner(DefaultRunner):
+    """Runner where the graph saved in the checkpoint is replaced by an externally provided one.
+    Currently only supported as an extension of the default runner.
+    """
+
+    def __init__(
+        self,
+        config: dict,
+        graph: str,
+        output_mask: dict | None = None,
+        graph_dataset: Any | None = None,
+        check_state_dict: bool | None = True,
+    ) -> None:
+        """Initialize the ExternalGraphRunner.
+
+        Parameters
+        ----------
+        config : Configuration
+            Configuration for the runner.
+        graph : str
+            Path to the external graph.
+        output_mask : dict | None
+            Dictionary specifying the output mask.
+        graph_dataset : Any | None
+            Argument to open_dataset of anemoi-datasets that recreates the dataset used to build the data nodes of the graph.
+        check_state_dict : bool | None
+            Whether to check that the state dict is reconstructed as expected.
+        """
+        super().__init__(config)
+        self.check_state_dict = check_state_dict
+        self.graph_path = graph
+
+        # If the graph was built on another dataset, we need to adapt the dataloader
+        if graph_dataset is not None:
+            graph_ds = open_dataset(graph_dataset)
+            LOG.info(
+                "The external graph was built using a different anemoi-dataset than that in the checkpoint. "
+                "Patching metadata to ensure correct data loading."
+            )
+            self.checkpoint._metadata.patch(
+                {
+                    "config": {"dataloader": {"dataset": graph_dataset}},
+                    "dataset": {"shape": graph_ds.shape},
+                }
+            )
+
+            # had to use private attributes because cached properties cause problems
+            self.checkpoint._metadata._supporting_arrays = graph_ds.supporting_arrays()
+            if "grid_indices" in self.checkpoint._metadata._supporting_arrays:
+                num_grid_points = len(self.checkpoint._metadata._supporting_arrays["grid_indices"])
+            else:
+                num_grid_points = graph_ds.shape[-1]
+            self.checkpoint._metadata.number_of_grid_points = num_grid_points
+
+        # Check if the external graph has the 'indices_connected_nodes' attribute
+        # If so adapt dataloader and add supporting array
+        data = self.checkpoint._metadata._config.graph.data
+        assert data in self.graph.node_types, f"Node type {data} not found in external graph."
+        if "indices_connected_nodes" in self.graph[data]:
+            LOG.info(
+                "The external graph has the 'indices_connected_nodes' attribute. "
+                "Patching metadata with MaskedGrid 'grid_indices' to ensure correct data loading."
+            )
+            self.checkpoint._metadata.patch(
+                {
+                    "config": {
+                        "dataloader": {
+                            "grid_indices": {
+                                "_target_": "anemoi.training.data.grid_indices.MaskedGrid",
+                                "nodes_name": data,
+                                "node_attribute_name": "indices_connected_nodes",
+                            }
+                        }
+                    }
+                }
+            )
+            LOG.info("Moving 'indices_connected_nodes' from external graph to supporting arrays as 'grid_indices'.")
+            indices_connected_nodes = self.graph[data]["indices_connected_nodes"].numpy()
+            self.checkpoint._supporting_arrays["grid_indices"] = indices_connected_nodes.squeeze()
+
+        if output_mask:
+            nodes = output_mask["nodes_name"]
+            attribute = output_mask["attribute_name"]
+            self.checkpoint._supporting_arrays["output_mask"] = self.graph[nodes][attribute].numpy().squeeze()
+            LOG.info(
+                "Moving attribute '%s' of nodes '%s' from external graph to supporting arrays as 'output_mask'.",
+                attribute,
+                nodes,
+            )
+
+    @cached_property
+    def graph(self):
+        graph_path = self.graph_path
+        assert os.path.isfile(
+            graph_path
+        ), f"No graph found at {graph_path}. An external graph needs to be specified in the config file for this runner."
+        LOG.info("Loading external graph from path %s.", graph_path)
+        return torch.load(graph_path, map_location="cpu", weights_only=False)
+
+    @cached_property
+    def model(self):
+        # load the model from the checkpoint
+        device = self.device
+        self.device = "cpu"
+        model_instance = super().model
+        state_dict_ckpt = deepcopy(model_instance.state_dict())
+
+        # rebuild the model with the new graph
+        model_instance.graph_data = self.graph
+        model_instance.config = self.checkpoint._metadata._config
+        model_instance._build_model()
+
+        # reinstate the weights, biases and normalizer from the checkpoint
+        # reinstating the normalizer is necessary for checkpoints that were created
+        # using transfer learning, where the statistics as stored in the checkpoint
+        # do not match the statistics used to build the normalizer in the checkpoint.
+        model_instance = update_state_dict(
+            model_instance, state_dict_ckpt, keywords=["bias", "weight", "processors.normalizer"]
+        )
+
+        LOG.info("Successfully built model with external graph and reassigned model weights!")
+        self.device = device
+        return model_instance.to(self.device)
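
The key selection and shape check in `update_state_dict` can be tried out without PyTorch. The following is a minimal sketch, assuming plain dictionaries that map parameter names to shape tuples in place of real tensors. `select_entries` and the example keys are hypothetical and only mirror the logic of the function in this diff.

```python
def select_entries(model_shapes, external_shapes, keywords, ignore_mismatched=False, ignore_additional=False):
    """Hypothetical stand-in for the filtering step of update_state_dict.

    Both mappings go from parameter name to shape tuple instead of tensors.
    """
    # Keep only entries whose key contains at least one keyword
    selected = {k: v for k, v in external_shapes.items() if any(kw in k for kw in keywords)}

    for key in list(selected):
        if key not in model_shapes:
            if not ignore_additional:
                raise AssertionError(f"Layer {key} not in model.")
            del selected[key]
        elif selected[key] != model_shapes[key]:
            if not ignore_mismatched:
                raise AssertionError(f"Mismatch in shape of {key}.")
            del selected[key]
    return selected


# Made-up keys: the graph-dependent entry changes shape with the new graph
checkpoint = {"encoder.weight": (4, 8), "encoder.bias": (4,), "encoder.edge_index": (2, 100)}
rebuilt = {"encoder.weight": (4, 8), "encoder.bias": (4,), "encoder.edge_index": (2, 250)}

print(select_entries(rebuilt, checkpoint, keywords=["weight", "bias"]))
```

In this toy case the entry whose shape depends on the graph matches no keyword, so it is never compared or copied, while the weight and bias entries pass the shape check.
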
