🙋 Nodes as mirrors for data redundancy #39
Comments
@huard @tlogan2000 your input on this? Ouranos has a lot of data; I am guessing you do not want full replication of all of it. I think we all need to agree on the quantity of data involved, because it could mean buying disks.
I agree it would be difficult to have a full copy of all datasets, at least in the short to mid-term. PAVICS has disk-space issues on the horizon, so it would be hard to imagine large volumes being transferred from the UofT (or other) nodes right now. I think a strategy of identifying the most sought-after datasets (and perhaps a subset of variables) and mirroring those would be more realistic. It is probably more complicated to do it this way, but I think it is the reality.
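The "mirror only the most sought-after datasets" strategy above could be sketched as a greedy selection under a disk budget. This is only an illustration: the dataset names, sizes, and access counts below are made up, and real selection would use actual usage metrics from the nodes.

```python
# Hypothetical sketch: pick the most-requested datasets to mirror,
# subject to a disk budget on the receiving node. Dataset IDs, sizes,
# and access counts are invented for illustration.

def select_mirror_candidates(datasets, budget_gb):
    """Greedily choose the most-accessed datasets that fit within budget_gb."""
    chosen, used = [], 0.0
    for ds in sorted(datasets, key=lambda d: d["accesses"], reverse=True):
        if used + ds["size_gb"] <= budget_gb:
            chosen.append(ds["id"])
            used += ds["size_gb"]
    return chosen

catalog = [
    {"id": "cmip6-subset", "size_gb": 800.0, "accesses": 5200},
    {"id": "era5-monthly", "size_gb": 300.0, "accesses": 4100},
    {"id": "station-obs", "size_gb": 40.0, "accesses": 900},
]

# With a 400 GB budget, the largest dataset is skipped even though it is
# the most requested, and the next two fit.
print(select_mirror_candidates(catalog, budget_gb=400.0))
# → ['era5-monthly', 'station-obs']
```

A subset-of-variables strategy would work the same way, just with one entry per (dataset, variable) pair instead of per dataset.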
I agree with both comments. The selection of collections for data replication should be smart, and they might not be replicated across all nodes. However, it is still possible to synchronize STAC metadata while leaving the actual data hosted on the original node in "too large data" cases. @mishaschwartz, maybe you can share which ones are considered more critical for full metadata + data sync? On CRIM's end, there are not really any "critical" ones. The data we employ is very sporadic, depending on our project opportunities and use cases at OGC, but I wouldn't mind replicating common instances that are not "massive" to demonstrate the mirroring capability.

Something else to consider: some STAC collections may be used to publish processing results from Weaver in the future. Those should also not be automatically synced between nodes.

Another aspect I am considering is to work on syncing other nodes that are not "DACCS" per se, but that use similar services/data under the hood, such as https://climatedata.ca/, so that they have properly defined STAC collections about their hosted variables, stations, etc., rather than their current poorly-documented custom API. Therefore, we should consider some kind of configuration file where we can easily "plug" data providers to sync, and the implementation using this config (leveraging
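The pluggable configuration suggested above might look something like the following. To be clear, the schema, field names, endpoint URL, and collection IDs here are all assumptions for the sake of illustration, not an agreed format; the point is only that each provider entry can declare, per collection, whether to do a full sync or a metadata-only sync, and which collections (e.g. Weaver results) to exclude entirely.

```python
# Hypothetical per-provider sync configuration. The schema, the example
# endpoint, and the collection IDs are illustrative only.

SYNC_PROVIDERS = [
    {
        "name": "uoft-node",
        "stac_api": "https://example-node/stac",  # hypothetical URL
        "collections": {
            # Full replication: copy STAC metadata *and* the assets.
            "station-obs": {"mode": "full"},
            # Metadata only: assets stay hosted on the origin node.
            "cmip6-subset": {"mode": "metadata-only"},
        },
        # Collections holding Weaver processing results are excluded
        # from automatic sync, per the discussion above.
        "exclude": ["weaver-outputs"],
    },
]

def sync_mode(provider, collection_id):
    """Return 'full', 'metadata-only', or None (skip) for a collection.

    Unlisted collections default to metadata-only sync, so new collections
    never trigger a large data transfer by accident.
    """
    if collection_id in provider["exclude"]:
        return None
    entry = provider["collections"].get(collection_id)
    return entry["mode"] if entry else "metadata-only"
```

Non-DACCS providers such as climatedata.ca would then just be additional entries in `SYNC_PROVIDERS`, with an adapter translating their custom API into STAC collections.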
Topic category
Select which category your topic relates to:
Topic summary
Should the data hosted on nodes in the network be available from at least two nodes, to ensure data availability and redundancy across the network?
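The "available from at least two nodes" requirement can be checked mechanically from a node-to-datasets inventory. The node names and dataset IDs below are hypothetical; a real check would query each node's catalog.

```python
# Hypothetical sketch: given a mapping of node -> hosted dataset IDs,
# report which datasets fall below the desired replication factor.

def under_replicated(inventory, min_copies=2):
    """Return dataset IDs hosted on fewer than min_copies nodes."""
    counts = {}
    for datasets in inventory.values():
        for ds in datasets:
            counts[ds] = counts.get(ds, 0) + 1
    return sorted(ds for ds, n in counts.items() if n < min_copies)

inventory = {
    "node-a": {"cmip6-subset", "era5-monthly"},
    "node-b": {"era5-monthly", "station-obs"},
    "node-c": {"station-obs"},
}

# Only "cmip6-subset" exists on a single node here.
print(under_replicated(inventory))
# → ['cmip6-subset']
```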
This would likely require that:
Some possible solutions for the above:
To decide:
Supporting documentation
Additional information