🙋 Nodes as mirrors for data redundancy #39

Open

mishaschwartz opened this issue Mar 4, 2025 · 4 comments
Labels
meeting-topic Proposed topic for a future meeting

Comments

@mishaschwartz (Collaborator)

Topic category

Select which category your topic relates to:

  • software architecture
  • potential risks
  • federation decisions
  • opportunities for growth
  • other

Topic summary

Should the data hosted on nodes in the network be available from at least two nodes, to ensure data availability and redundancy across the network?

This would likely require that:

  1. an exact copy of a data file hosted at node A also be available on one of the other nodes in the network
  2. catalogs should refer to file assets on all nodes where a copy is hosted
  3. access permissions be synchronized between nodes for copies of the same data file
  4. file copies can be easily verified as being identical

Some possible solutions for the above:

  1. we have to decide whether all data or only some data should be copied; node administrators can coordinate this when adding new data
  2. possible updates to the stac-populator to handle this
  3. could be handled by creating accounts using Magpie's network mode
  4. checksums, possibly stored in the catalog (see the sketch after this list)
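
For point 4, a minimal sketch of how a file copy could be verified against the catalog, assuming a plain SHA-256 hex digest is published for each asset (the STAC file extension actually stores checksums as multihashes, so a real comparison would first strip that prefix):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches_catalog(local_copy: Path, catalog_checksum: str) -> bool:
    """Return True if the local copy matches the checksum published in the catalog."""
    return sha256_of(local_copy) == catalog_checksum
```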

To decide:

  • Is this something we want to encourage?
  • If yes, do we want to require copies for all data, most data, or some data?

Supporting documentation

Additional information

@mishaschwartz added the meeting-topic label Mar 4, 2025
@fmigneault (Member)

  1. Copy: Sure.
    We need to consider, however, that the data storage/nesting might differ between instances depending on their specific configurations.
  2. Refs: Yes.
    We can use rel: alternate links to point to the corresponding STAC Items, and https://github.com/stac-extensions/alternate-assets to cross-reference the specific Asset data files (see the sketch after this list).
  3. Access: We can indicate the auth requirements using https://github.com/stac-extensions/authentication. However, it could be hard to guarantee access unless all nodes use network mode.
    That being said, even without that mode, replication would still be possible; the only prerequisite for access would be authenticating on the respective nodes with separate users.
  4. Verify: Yes.
    Annotate using the checksum/size fields from https://github.com/stac-extensions/file.
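
To illustrate points 2 and 4, here is a hypothetical STAC Item fragment, written as a Python dict: a rel: alternate link pointing at the mirrored Item, an alternate entry on the asset from the alternate-assets extension, and file:checksum/file:size from the file extension. All node URLs, identifiers, checksum values, and extension versions are placeholders; the authentication extension from point 3 could be layered onto the same asset.

```python
# Illustrative only: node URLs, IDs, checksum value, and extension versions are made up.
mirrored_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-dataset",
    "stac_extensions": [
        "https://stac-extensions.github.io/alternate-assets/v1.1.0/schema.json",
        "https://stac-extensions.github.io/file/v2.1.0/schema.json",
    ],
    "links": [
        # The same Item as published in the catalog of the node hosting the copy.
        {
            "rel": "alternate",
            "type": "application/geo+json",
            "href": "https://node-b.example.org/stac/collections/example/items/example-dataset",
        },
    ],
    "assets": {
        "data": {
            "href": "https://node-a.example.org/data/example-dataset.nc",
            "file:checksum": "1220deadbeef...",  # multihash; identical for every verified copy
            "file:size": 123456789,
            "alternate": {
                # Same file as served by the mirror node (alternate-assets extension).
                "node-b": {"href": "https://node-b.example.org/data/example-dataset.nc"},
            },
        },
    },
    # geometry, bbox, properties, collection, etc. omitted for brevity
}
```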

@tlvu commented Mar 6, 2025

@huard @tlogan2000 your input on this?

Ouranos has a lot of data. I am guessing you do not want full replication of all our data?!

I think everyone needs to agree on the quantity of data, because it could involve buying disks.

@tlogan2000

I agree it would be difficult to have a full copy of all datasets, at least in the short to mid term. PAVICS has disk-space issues on the horizon, so it is hard to imagine large volumes being transferred from the UofT (or other) nodes right now ... I think a strategy of identifying the most sought-after datasets (and perhaps a subset of variables) and mirroring those would be more realistic. It is probably more complicated to do it this way, but I think it's the reality.

@fmigneault (Member)

I agree with both comments.

The selection of collections whose data gets replicated should be made carefully, and they might not be replicated across all nodes. However, it is still possible to synchronize the STAC metadata while leaving the actual data hosted on the original node when the data is too large to copy.
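
As a concrete example of metadata-only synchronization, a minimal sketch assuming both nodes expose standard STAC APIs and the mirror accepts Transaction-extension POSTs (URLs and the collection name are placeholders): the Items are copied into the mirror's catalog unchanged, so their asset hrefs keep resolving to the original node and no data files are transferred.

```python
import requests

ORIGIN_ITEMS = "https://node-a.example.org/stac/collections/big-collection/items"
MIRROR_ITEMS = "https://node-b.example.org/stac/collections/big-collection/items"

# Fetch one page of Items from the origin node (pagination omitted for brevity).
resp = requests.get(ORIGIN_ITEMS, params={"limit": 100}, timeout=30)
resp.raise_for_status()

for item in resp.json()["features"]:
    # Asset hrefs are left untouched, so the data stays hosted on node A.
    requests.post(MIRROR_ITEMS, json=item, timeout=30).raise_for_status()
```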

@mishaschwartz Maybe you can share which ones are considered more critical for full meta + data sync?

On CRIM's end, there are not really any "critical" ones. The data we use varies a lot depending on our project opportunities and use cases at OGC, but I wouldn't mind replicating common ones that are not "massive" to demonstrate the mirroring capability.

Something else to consider is that it is possible that some STAC collections will be used to publish processing results from Weaver in the future. Those should also not be automatically synced between nodes.

Another aspect I am considering is syncing other nodes that are not "DACCS" per se but that use similar services/data under the hood, such as https://climatedata.ca/, so that its hosted variables, stations, etc. are exposed through properly defined STAC collections rather than through its current poorly documented custom API.

Therefore, we should consider some kind of configuration file where we can easily "plug" data-providers to sync, and the implementation that consumes this config (leveraging stac-populator or pypgstac?) can manage how the items are synced.
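
A hypothetical sketch of such a configuration, written here as a Python dict for illustration: every provider name, URL, and the mode field are made up, and the real implementation could live in stac-populator or build on pypgstac.

```python
# Illustrative provider-sync configuration; nothing here is an existing
# stac-populator or pypgstac structure.
SYNC_CONFIG = {
    "mirror_api": "https://node-b.example.org/stac",
    "providers": [
        {
            "name": "node-a",
            "stac_api": "https://node-a.example.org/stac",
            "collections": ["critical-collection"],
            "mode": "metadata+data",   # replicate the files as well as the STAC records
        },
        {
            "name": "climatedata-ca",
            "stac_api": "https://climatedata.example.org/stac",  # placeholder; no such API exists today
            "collections": ["stations"],
            "mode": "metadata-only",   # keep the data hosted at the source
        },
    ],
}

# Dry-run driver: print what a sync implementation would do for each provider.
for provider in SYNC_CONFIG["providers"]:
    for collection in provider["collections"]:
        print(
            f"sync {provider['name']}/{collection} "
            f"into {SYNC_CONFIG['mirror_api']} ({provider['mode']})"
        )
```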
