-
-
Notifications
You must be signed in to change notification settings - Fork 339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: Cache GRASS sample data for tests #5124
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
Do we know if the extracted folder can be changed when running tests (or a bad test that does so, that might end up on main too)? If so, wouldn't it be better to cache the compressed file that is downloaded (downloading it ourselves), and use that when available? Can g.download.project use a local URL, like a |
Yes and yes. Bad test can definitely change the dataset, plus I assume it runs in the project, so the project is modified. What should not be modified is the PERMANENT mapset. (It would be nice to test for that actually.) The "download" can be from a local file. |
If you can instead cache the downloaded file (not extracted), then take a look at cross OS cache. Knowing that this test data rarely (doesn't) change, and tests from all OSes uses the same data, it is worth it. See enableCrossOsArchive in https://github.com/actions/cache/blob/main/tips-and-workarounds.md#cross-os-cache You might want to test what is faster on restores and takes less of the 10GB allowance: extract the .gz and but keeping the .tar, or simply caching the .tar.gz. The cross OS cache will use zstd, so compressing an already compressed file might not give good compression ratios (currently, the cache entry in this PR takes 160 MB). |
Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
Comparison of Caching FormatsTested in forked repo NishantBansal2003#4:
|
Wow great analysis here @NishantBansal2003! In that case, just keep the simplest of the choices! |
I think that as a first step, it is perfectly fine like this. But the step to do right after would be to apply to macOS tests, as they work the same. But that means that the So, what do think about changing the - name: Run tests
run: .github/workflows/test_thorough.sh --config .gunittest.extra.cfg
env:
SOME_VAR: gfhjdlkfjhlfgkdj
ANOTHER_VAR: rdhdstruyytu I agree that for the batch script on Windows, it might be a bit harder, so it can be in a next iteration without problem. |
I think g.download.project can do it without file://. I hope I put it into the doc. Check the source code in the worst case. |
For me, its needed. Even if I used $HOME too. Working: $ grass --tmp-project XY --exec g.download.location url=file:///home/vscode/nc_spm_full_v2alpha2.tar.gz path=$HOME
Downloading and extracting...
Path to the project now </home/vscode/nc_spm_full_v2alpha2>
$ not working: $ grass --tmp-project XY --exec g.download.location url=/home/vscode/nc_spm_full_v2alpha2.tar.gz path=$HOME
Downloading and extracting...
Traceback (most recent call last):
File "/usr/local/grass85/scripts/g.download.project", line 153, in <module>
main(*gs.parser())
File "/usr/local/grass85/scripts/g.download.project", line 112, in main
directory = download_and_extract(url)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/grass85/etc/python/grass/utils/download.py", line 195, in download_and_extract
urlretrieve(source, archive_name, reporthook)
File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 240, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
^^^^^^^^^^^^^^^^^^
File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 499, in open
req = Request(fullurl, data)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 318, in __init__
self.full_url = url
^^^^^^^^^^^^^
File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 344, in full_url
self._parse()
File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 373, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '/home/vscode/nc_spm_full_v2alpha2.tar.gz'
Traceback (most recent call last):
File "/usr/local/grass85/scripts/g.download.location", line 54, in <module>
main(*gs.parser())
File "/usr/local/grass85/scripts/g.download.location", line 50, in main
gs.run_command("g.download.project", **options)
File "/usr/local/grass85/etc/python/grass/script/core.py", line 493, in run_command
return handle_errors(returncode, result=None, args=args, kwargs=kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/grass85/etc/python/grass/script/core.py", line 372, in handle_errors
raise CalledModuleError(module=module, code=code, returncode=returncode)
grass.exceptions.CalledModuleError: Module run `g.download.project url=/home/vscode/nc_spm_full_v2alpha2.tar.gz name= path=/home/vscode` ended with an error.
The subprocess ended with a non-zero return code: 1. See errors above the traceback or in the error output.
$ |
I may not have fully understood diff --git a/.github/workflows/macos.yml b/.github/workflows/macos.yml
index 0bf2158a7e..566307b00a 100644
--- a/.github/workflows/macos.yml
+++ b/.github/workflows/macos.yml
@@ -108,18 +108,30 @@ jobs:
-ra . \
-m 'needs_solo_run'
+ - name: Cache GRASS Sample Dataset
+ id: cached-data
+ uses: actions/cache@v4
+ with:
+ path: sample-data/nc_spm_full_v2alpha2.tar.gz
+ key: nc_spm_full_v2alpha2.tar.gz
+ enableCrossOsArchive: true
+
+ - name: Download GRASS Sample Dataset
+ if: steps.cached-data.outputs.cache-hit != 'true'
+ run: |
+ mkdir -p sample-data
+ curl -L "$SAMPLE_DATA" -o sample-data/nc_spm_full_v2alpha2.tar.gz
+ env:
+ SAMPLE_DATA: "https://grass.osgeo.org/sampledata/north_carolina/\
+ nc_spm_full_v2alpha2.tar.gz"
+
- name: Run gunittest tests
shell: micromamba-shell {0}
- run: |
- grass --tmp-project XY --exec \
- g.download.project url=${{ env.SampleData }} path=$HOME
- grass --tmp-project XY --exec \
- python3 -m grass.gunittest.main \
- --grassdata $HOME --location nc_spm_full_v2alpha2 --location-type nc \
- --min-success 100 --config .github/workflows/macos_gunittest.cfg
+ run: .github/workflows/test_thorough.sh --config .github/workflows/macos_gunittest.cfg
env:
- SampleData: "https://grass.osgeo.org/sampledata/north_carolina/\
+ SAMPLE_DATA_URL: "file://${{ github.workspace }}/sample-data/\
nc_spm_full_v2alpha2.tar.gz"
+
- name: Make HTML test report available
if: ${{ !cancelled() }}
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
diff --git a/.github/workflows/test_thorough.sh b/.github/workflows/test_thorough.sh
index 6ed7d22078..57c545133d 100755
--- a/.github/workflows/test_thorough.sh
+++ b/.github/workflows/test_thorough.sh
@@ -3,8 +3,10 @@
# fail on non-zero return code from a subprocess
set -e
+SAMPLE_DATA_URL=${SAMPLE_DATA_URL:-"https://grass.osgeo.org/sampledata/north_carolina/nc_spm_full_v2alpha2.tar.gz"}
+
grass --tmp-project XY --exec \
- g.download.project url=https://grass.osgeo.org/sampledata/north_carolina/nc_spm_full_v2alpha2.tar.gz path=$HOME
+ g.download.project url=$SAMPLE_DATA_URL path=$HOME
grass --tmp-project XY --exec \
python3 -m grass.gunittest.main \
diff --git a/.github/workflows/ubuntu.yml b/.github/workflows/ubuntu.yml
index 5f130a0081..3936682bb3 100644
--- a/.github/workflows/ubuntu.yml
+++ b/.github/workflows/ubuntu.yml
@@ -146,8 +146,28 @@ jobs:
- name: Test executing of the grass command
run: .github/workflows/test_simple.sh
+ - name: Cache GRASS Sample Dataset
+ id: cached-data
+ uses: actions/cache@v4
+ with:
+ path: sample-data/nc_spm_full_v2alpha2.tar.gz
+ key: nc_spm_full_v2alpha2.tar.gz
+ enableCrossOsArchive: true
+
+ - name: Download GRASS Sample Dataset
+ if: steps.cached-data.outputs.cache-hit != 'true'
+ run: |
+ mkdir -p sample-data
+ curl -L "$SAMPLE_DATA" -o sample-data/nc_spm_full_v2alpha2.tar.gz
+ env:
+ SAMPLE_DATA: "https://grass.osgeo.org/sampledata/north_carolina/\
+ nc_spm_full_v2alpha2.tar.gz"
+
- name: Run tests
run: .github/workflows/test_thorough.sh --config .gunittest.extra.cfg
+ env:
+ SAMPLE_DATA_URL: "file://${{ github.workspace }}/sample-data/\
+ nc_spm_full_v2alpha2.tar.gz"
- name: Make HTML test report available
if: ${{ always() }}
But the thing is, the absolute location will be different across OSes, so the same cache will not be retrieved. So, the only thing that happens is that now macOS and Ubuntu maintain their own separate caches. Am I understanding this correctly, or am I wrong? |
grass/scripts/g.extension/g.extension.py Line 2717 in d8d8d84
There is no such condition, hence |
@NishantBansal2003 Sorry for not answering before, I didn't see the message with the question dated two days ago, was it edited since? |
No worries, I thought the same, which is why I mentioned you now. |
@NishantBansal2003 Ok, I took a fresh look, and a did a fresh read of the actions docs, in case something changed since last time or I missed something. First of all, your diff shown was spot on! Maybe for the final version to use the same kind of version pinning of actions like the other actions, otherwise it would be ready as is. Returning to using the bash script on macOS should now be fine, except if there was some additional problems with passing env vars to inside the process, but should be ok now. Now we use the custom micromamba shell, our test script passes the extra arguments (to change the config file on the fly), all things that weren't possible at the time that needed to have a copy of the script inline. Then for your worry of the absolute paths not being the same: I think it's going to work as is shown on your diff. If you tried that and didn't work, I have another idea with the tilde ( |
Lastly, if you want to take more time to try it out and do a second PR, you could add the changes you shown to the test script+env var in Ubuntu, and we merge only Ubuntu. But as said before, we should end up with the exact same steps, character by character (except the Great work! |
Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
I tried using different paths for caching across macOS and Ubuntu:
However, none of these approaches worked; both macOS and Ubuntu maintain their own separate caches and use them. You can view the caches here. |
When you state and show that they maintain their own cache, did you clear them before doing the tests? Also, if no cache exists, and both jobs would run, both would try to save it later. If you want to really go deep, there's a list cache api as noted in that section: https://github.com/actions/cache?tab=readme-ov-file#cache-version Otherwise, delete one of the caches, then rerun one of them at the time, I expect it to work as you think. Later this afternoon or tomorrow I'll be able to check the individual workflow links. |
Consider that everything is fine for now, and I'll approve when I can, but even if it's possible to be duplicated , it wouldn't be a problem for now. Great investigation and clear to follow! |
Yes, before every new run, I cleared all the previous cache.
No. If you check the Ubuntu workflow, you can see that if there is no cache hit in two jobs, both will download the data but only one of them will be allowed to save it in cache (provided that the cache version is the same). We get a warning like this: Warning: Failed to save: Failed to CreateCacheEntry: Received non-retryable error: Failed request: (409) Conflict: cache entry with the same key, version, and scope already exists
Also, I listed the cache metadata using the link you shared and got the following: {
"id": 15848788,
"ref": "refs/pull/4/merge",
"key": "nc_spm_full_v2alpha2.tar.gz",
"version": "56102a7d1ea9f02ee8c8e4b1f612aa18b41652546b460cb2131c9da990db291e",
"last_accessed_at": "2025-02-20T05:46:11.611697000Z",
"created_at": "2025-02-20T05:46:11.611697000Z",
"size_in_bytes": 164898380
},
{
"id": 15848090,
"ref": "refs/pull/4/merge",
"key": "nc_spm_full_v2alpha2.tar.gz",
"version": "2eab4ad73500ce1a42538573f40755f0e1a69e81964e5d5b67bb4b4a9224acbd",
"last_accessed_at": "2025-02-20T05:45:38.409387000Z",
"created_at": "2025-02-20T05:45:19.404486000Z",
"size_in_bytes": 165704195
} So the versions for macOS and Ubuntu are different, which is likely the reason for the cache miss.
Understood! |
Did you try clearing only one of them at a time, and know if the cached file from one OS is used for another OS? I'll try this this weekend, to understand the best thing to do. |
Fixes: #2304
This PR introduces caching for the GRASS sample dataset used in Ubuntu tests, reducing redundant downloads by using GitHub Actions caching.