🙋 Nodes as mirrors for data redundancy #39

Open

mishaschwartz opened this issue Mar 4, 2025 · 4 comments
Labels
meeting-topic Proposed topic for a future meeting

Comments

@mishaschwartz (Collaborator)

Topic category

Select which category your topic relates to:

  • software architecture
  • potential risks
  • federation decisions
  • opportunities for growth
  • other

Topic summary

Should the data hosted on nodes in the network be available from at least two nodes, to ensure data availability and redundancy across the network?

This would likely require that:

  1. an exact copy of a data file hosted at node A also be available on one of the other nodes in the network
  2. catalogs should refer to file assets on all nodes where a copy is hosted
  3. access permissions be synchronized between nodes for copies of the same data file
  4. file copies can be easily verified as being identical

Some possible solutions for the above:

  1. we have to decide whether all data or only some data should be copied; node administrators can coordinate this when adding new data
  2. possible updates to the stac-populator to handle this
  3. could be handled by creating accounts using Magpie's network mode
  4. checksums, possibly stored in the catalog (see the sketch after this list)
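
For point 4, a minimal sketch of how a file copy could be verified against the catalog, assuming a plain SHA-256 hex digest is published for each asset (the STAC file extension actually stores checksums as multihashes, so a real comparison would first strip that prefix):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches_catalog(local_copy: Path, catalog_checksum: str) -> bool:
    """Return True if the local copy matches the checksum published in the catalog."""
    return sha256_of(local_copy) == catalog_checksum
```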

To decide:

  • Is this something we want to encourage?
  • If yes, do we want to require copies for all data, most data, or some data?

Supporting documentation

Additional information

@mishaschwartz added the meeting-topic label Mar 4, 2025
@fmigneault (Member)

  1. Copy: Sure.
    We need to consider, however, that the data storage/nesting might differ between instances depending on their specific configurations.
  2. Refs: Yes.
    We can use rel: alternate links to point to the corresponding STAC Items, and https://github.com/stac-extensions/alternate-assets to cross-reference the specific Asset data files (see the sketch after this list).
  3. Access: We can indicate the auth requirements using https://github.com/stac-extensions/authentication. However, it could be hard to guarantee access unless all nodes use network mode.
    That being said, even without that mode, replication would still be possible; the only prerequisite for access would be authenticating on the respective nodes with separate users.
  4. Verify: Yes.
    Annotate using the checksum/size fields from https://github.com/stac-extensions/file.
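
To illustrate points 2 and 4, here is a hypothetical STAC Item fragment, written as a Python dict: a rel: alternate link pointing at the mirrored Item, an alternate entry on the asset from the alternate-assets extension, and file:checksum/file:size from the file extension. All node URLs, identifiers, checksum values, and extension versions are placeholders; the authentication extension from point 3 could be layered onto the same asset.

```python
# Illustrative only: node URLs, IDs, checksum value, and extension versions are made up.
mirrored_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-dataset",
    "stac_extensions": [
        "https://stac-extensions.github.io/alternate-assets/v1.1.0/schema.json",
        "https://stac-extensions.github.io/file/v2.1.0/schema.json",
    ],
    "links": [
        # The same Item as published in the catalog of the node hosting the copy.
        {
            "rel": "alternate",
            "type": "application/geo+json",
            "href": "https://node-b.example.org/stac/collections/example/items/example-dataset",
        },
    ],
    "assets": {
        "data": {
            "href": "https://node-a.example.org/data/example-dataset.nc",
            "file:checksum": "1220deadbeef...",  # multihash; identical for every verified copy
            "file:size": 123456789,
            "alternate": {
                # Same file as served by the mirror node (alternate-assets extension).
                "node-b": {"href": "https://node-b.example.org/data/example-dataset.nc"},
            },
        },
    },
    # geometry, bbox, properties, collection, etc. omitted for brevity
}
```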

@tlvu commented Mar 6, 2025

@huard @tlogan2000 your input on this?

Ouranos has a lot of data. I am guessing you do not want full replication of all our data?!

I think everyone needs to agree on the quantity of data, because it could involve buying disks.

@tlogan2000

I agree it would be difficult to have a full copy of all datasets, at least in the short to mid term. PAVICS has disk-space issues on the horizon, so it is hard to imagine large volumes being transferred from the UofT (or other) nodes right now ... I think a strategy of identifying the most sought-after datasets (and perhaps a subset of variables) and mirroring those would be more realistic. It is probably more complicated to do it this way, but I think it's the reality.

@fmigneault (Member)

I agree with both comments.

The selection of collections whose data gets replicated should be made carefully, and they might not be replicated across all nodes. However, it is still possible to synchronize the STAC metadata while leaving the actual data hosted on the original node when the data is too large to copy.
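
As a concrete example of metadata-only synchronization, a minimal sketch assuming both nodes expose standard STAC APIs and the mirror accepts Transaction-extension POSTs (URLs and the collection name are placeholders): the Items are copied into the mirror's catalog unchanged, so their asset hrefs keep resolving to the original node and no data files are transferred.

```python
import requests

ORIGIN_ITEMS = "https://node-a.example.org/stac/collections/big-collection/items"
MIRROR_ITEMS = "https://node-b.example.org/stac/collections/big-collection/items"

# Fetch one page of Items from the origin node (pagination omitted for brevity).
resp = requests.get(ORIGIN_ITEMS, params={"limit": 100}, timeout=30)
resp.raise_for_status()

for item in resp.json()["features"]:
    # Asset hrefs are left untouched, so the data stays hosted on node A.
    requests.post(MIRROR_ITEMS, json=item, timeout=30).raise_for_status()
```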

@mishaschwartz Maybe you can share which ones are considered more critical for full meta + data sync?

On CRIM's end, there are not really any "critical" ones. The data we use varies a lot depending on our project opportunities and use cases at OGC, but I wouldn't mind replicating common ones that are not "massive" to demonstrate the mirroring capability.

Something else to consider is that it is possible that some STAC collections will be used to publish processing results from Weaver in the future. Those should also not be automatically synced between nodes.

Another aspect I am considering is syncing other nodes that are not "DACCS" per se but that use similar services/data under the hood, such as https://climatedata.ca/, so that its hosted variables, stations, etc. are exposed through properly defined STAC collections rather than through its current poorly documented custom API.

Therefore, we should consider some kind of configuration file where we can easily "plug" data-providers to sync, and the implementation that consumes this config (leveraging stac-populator or pypgstac?) can manage how the items are synced.
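
A hypothetical sketch of such a configuration, written here as a Python dict for illustration: every provider name, URL, and the mode field are made up, and the real implementation could live in stac-populator or build on pypgstac.

```python
# Illustrative provider-sync configuration; nothing here is an existing
# stac-populator or pypgstac structure.
SYNC_CONFIG = {
    "mirror_api": "https://node-b.example.org/stac",
    "providers": [
        {
            "name": "node-a",
            "stac_api": "https://node-a.example.org/stac",
            "collections": ["critical-collection"],
            "mode": "metadata+data",   # replicate the files as well as the STAC records
        },
        {
            "name": "climatedata-ca",
            "stac_api": "https://climatedata.example.org/stac",  # placeholder; no such API exists today
            "collections": ["stations"],
            "mode": "metadata-only",   # keep the data hosted at the source
        },
    ],
}

# Dry-run driver: print what a sync implementation would do for each provider.
for provider in SYNC_CONFIG["providers"]:
    for collection in provider["collections"]:
        print(
            f"sync {provider['name']}/{collection} "
            f"into {SYNC_CONFIG['mirror_api']} ({provider['mode']})"
        )
```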
