Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Cache GRASS sample data for tests #5124

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

NishantBansal2003
Copy link
Contributor

Fixes: #2304

This PR introduces caching for the GRASS sample dataset used in Ubuntu tests, reducing redundant downloads by using GitHub Actions caching.

Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
@github-actions github-actions bot added the CI Continuous integration label Feb 15, 2025
@echoix
Copy link
Member

echoix commented Feb 15, 2025

Do we know if the extracted folder can be changed when running tests (or a bad test that does so, that might end up on main too)? If so, wouldn't it be better to cache the compressed file that is downloaded (downloading it ourselves), and use that when available?

Can g.download.project use a local URL, like a file:// URL?

@wenzeslaus
Copy link
Member

Yes and yes.

Bad test can definitely change the dataset, plus I assume it runs in the project, so the project is modified. What should not be modified is the PERMANENT mapset. (It would be nice to test for that actually.)

The "download" can be from a local file.

@echoix
Copy link
Member

echoix commented Feb 16, 2025

If you can instead cache the downloaded file (not extracted), then take a look at cross OS cache. Knowing that this test data rarely (doesn't) change, and tests from all OSes uses the same data, it is worth it.

See enableCrossOsArchive in https://github.com/actions/cache/blob/main/tips-and-workarounds.md#cross-os-cache

You might want to test what is faster on restores and takes less of the 10GB allowance: extract the .gz and but keeping the .tar, or simply caching the .tar.gz. The cross OS cache will use zstd, so compressing an already compressed file might not give good compression ratios (currently, the cache entry in this PR takes 160 MB).

Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
@NishantBansal2003
Copy link
Contributor Author

  • Use a cached archive (.tar.gz or .tar), not an extracted folder, to ensure each run starts with a fresh dataset.
  • Update g.download.project command to use a file:// URL.
  • Use enableCrossOsArchive: true since test data rarely changes and is shared across OSes.

Comparison of Caching Formats

Tested in forked repo NishantBansal2003#4:

Metric .tar.gz .tar
Storage Size ~158 MB ~162 MB
Restore Time ~3s ~3s

@echoix
Copy link
Member

echoix commented Feb 16, 2025

Wow great analysis here @NishantBansal2003! In that case, just keep the simplest of the choices!

@echoix
Copy link
Member

echoix commented Feb 16, 2025

I think that as a first step, it is perfectly fine like this. But the step to do right after would be to apply to macOS tests, as they work the same. But that means that the test_thorough.sh script wouldn't be used in both, as they got inlined here. I think having one place to change the test script has a value.

So, what do think about changing the test_thorough.sh file to use one or two values coming from an env var if available (set/not empty), otherwise use its defaults? (Like for the URL and name)
The test step in both Ubuntu and macOS would remain the same, with the addition of env vars for the step. Something like:

- name: Run tests
  run: .github/workflows/test_thorough.sh --config .gunittest.extra.cfg
  env:
    SOME_VAR: gfhjdlkfjhlfgkdj
    ANOTHER_VAR: rdhdstruyytu

I agree that for the batch script on Windows, it might be a bit harder, so it can be in a next iteration without problem.

@wenzeslaus
Copy link
Member

I think g.download.project can do it without file://. I hope I put it into the doc. Check the source code in the worst case.

@echoix
Copy link
Member

echoix commented Feb 16, 2025

I think g.download.project can do it without file://. I hope I put it into the doc. Check the source code in the worst case.

For me, its needed. Even if I used $HOME too.

Working:

$ grass --tmp-project XY --exec g.download.location url=file:///home/vscode/nc_spm_full_v2alpha2.tar.gz path=$HOME
Downloading and extracting...
Path to the project now </home/vscode/nc_spm_full_v2alpha2>
$ 

not working:

$ grass --tmp-project XY --exec     g.download.location url=/home/vscode/nc_spm_full_v2alpha2.tar.gz path=$HOME
Downloading and extracting...
Traceback (most recent call last):
  File "/usr/local/grass85/scripts/g.download.project", line 153, in <module>
    main(*gs.parser())
  File "/usr/local/grass85/scripts/g.download.project", line 112, in main
    directory = download_and_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/grass85/etc/python/grass/utils/download.py", line 195, in download_and_extract
    urlretrieve(source, archive_name, reporthook)
  File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 240, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 499, in open
    req = Request(fullurl, data)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 318, in __init__
    self.full_url = url
    ^^^^^^^^^^^^^
  File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 344, in full_url
    self._parse()
  File "/usr/local/python/current/lib/python3.12/urllib/request.py", line 373, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '/home/vscode/nc_spm_full_v2alpha2.tar.gz'
Traceback (most recent call last):
  File "/usr/local/grass85/scripts/g.download.location", line 54, in <module>
    main(*gs.parser())
  File "/usr/local/grass85/scripts/g.download.location", line 50, in main
    gs.run_command("g.download.project", **options)
  File "/usr/local/grass85/etc/python/grass/script/core.py", line 493, in run_command
    return handle_errors(returncode, result=None, args=args, kwargs=kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/grass85/etc/python/grass/script/core.py", line 372, in handle_errors
    raise CalledModuleError(module=module, code=code, returncode=returncode)
grass.exceptions.CalledModuleError: Module run `g.download.project url=/home/vscode/nc_spm_full_v2alpha2.tar.gz name= path=/home/vscode` ended with an error.
The subprocess ended with a non-zero return code: 1. See errors above the traceback or in the error output.
$

@NishantBansal2003
Copy link
Contributor Author

NishantBansal2003 commented Feb 17, 2025

I think that as a first step, it is perfectly fine like this. But the step to do right after would be to apply to macOS tests, as they work the same. But that means that the test_thorough.sh script wouldn't be used in both, as they got inlined here. I think having one place to change the test script has a value.

So, what do think about changing the test_thorough.sh file to use one or two values coming from an env var if available (set/not empty), otherwise use its defaults? (Like for the URL and name) The test step in both Ubuntu and macOS would remain the same, with the addition of env vars for the step. Something like:

- name: Run tests
  run: .github/workflows/test_thorough.sh --config .gunittest.extra.cfg
  env:
    SOME_VAR: gfhjdlkfjhlfgkdj
    ANOTHER_VAR: rdhdstruyytu

I agree that for the batch script on Windows, it might be a bit harder, so it can be in a next iteration without problem.

I may not have fully understood enableCrossOsArchive, but what I understand is that you're referring to doing something like this:

diff --git a/.github/workflows/macos.yml b/.github/workflows/macos.yml
index 0bf2158a7e..566307b00a 100644
--- a/.github/workflows/macos.yml
+++ b/.github/workflows/macos.yml
@@ -108,18 +108,30 @@ jobs:
             -ra . \
             -m 'needs_solo_run'
 
+      - name: Cache GRASS Sample Dataset
+        id: cached-data
+        uses: actions/cache@v4
+        with:
+          path: sample-data/nc_spm_full_v2alpha2.tar.gz
+          key: nc_spm_full_v2alpha2.tar.gz
+          enableCrossOsArchive: true
+
+      - name: Download GRASS Sample Dataset
+        if: steps.cached-data.outputs.cache-hit != 'true'
+        run: |
+          mkdir -p sample-data
+          curl -L "$SAMPLE_DATA" -o sample-data/nc_spm_full_v2alpha2.tar.gz
+        env:
+          SAMPLE_DATA: "https://grass.osgeo.org/sampledata/north_carolina/\
+            nc_spm_full_v2alpha2.tar.gz"
+
       - name: Run gunittest tests
         shell: micromamba-shell {0}
-        run: |
-          grass --tmp-project XY --exec \
-              g.download.project url=${{ env.SampleData }} path=$HOME
-          grass --tmp-project XY --exec \
-              python3 -m grass.gunittest.main \
-              --grassdata $HOME --location nc_spm_full_v2alpha2 --location-type nc \
-              --min-success 100 --config .github/workflows/macos_gunittest.cfg
+        run: .github/workflows/test_thorough.sh --config .github/workflows/macos_gunittest.cfg
         env:
-          SampleData: "https://grass.osgeo.org/sampledata/north_carolina/\
+          SAMPLE_DATA_URL: "file://${{ github.workspace }}/sample-data/\
             nc_spm_full_v2alpha2.tar.gz"
+
       - name: Make HTML test report available
         if: ${{ !cancelled() }}
         uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
diff --git a/.github/workflows/test_thorough.sh b/.github/workflows/test_thorough.sh
index 6ed7d22078..57c545133d 100755
--- a/.github/workflows/test_thorough.sh
+++ b/.github/workflows/test_thorough.sh
@@ -3,8 +3,10 @@
 # fail on non-zero return code from a subprocess
 set -e
 
+SAMPLE_DATA_URL=${SAMPLE_DATA_URL:-"https://grass.osgeo.org/sampledata/north_carolina/nc_spm_full_v2alpha2.tar.gz"}
+
 grass --tmp-project XY --exec \
-    g.download.project url=https://grass.osgeo.org/sampledata/north_carolina/nc_spm_full_v2alpha2.tar.gz path=$HOME
+    g.download.project url=$SAMPLE_DATA_URL path=$HOME
 
 grass --tmp-project XY --exec \
     python3 -m grass.gunittest.main \
diff --git a/.github/workflows/ubuntu.yml b/.github/workflows/ubuntu.yml
index 5f130a0081..3936682bb3 100644
--- a/.github/workflows/ubuntu.yml
+++ b/.github/workflows/ubuntu.yml
@@ -146,8 +146,28 @@ jobs:
       - name: Test executing of the grass command
         run: .github/workflows/test_simple.sh
 
+      - name: Cache GRASS Sample Dataset
+        id: cached-data
+        uses: actions/cache@v4
+        with:
+          path: sample-data/nc_spm_full_v2alpha2.tar.gz
+          key: nc_spm_full_v2alpha2.tar.gz
+          enableCrossOsArchive: true
+
+      - name: Download GRASS Sample Dataset
+        if: steps.cached-data.outputs.cache-hit != 'true'
+        run: |
+          mkdir -p sample-data
+          curl -L "$SAMPLE_DATA" -o sample-data/nc_spm_full_v2alpha2.tar.gz
+        env:
+          SAMPLE_DATA: "https://grass.osgeo.org/sampledata/north_carolina/\
+            nc_spm_full_v2alpha2.tar.gz"
+
       - name: Run tests
         run: .github/workflows/test_thorough.sh --config .gunittest.extra.cfg
+        env:
+          SAMPLE_DATA_URL: "file://${{ github.workspace }}/sample-data/\
+            nc_spm_full_v2alpha2.tar.gz"
 
       - name: Make HTML test report available
         if: ${{ always() }}

But the thing is, the absolute location will be different across OSes, so the same cache will not be retrieved. So, the only thing that happens is that now macOS and Ubuntu maintain their own separate caches. Am I understanding this correctly, or am I wrong?
Is this what you are suggesting I do?
cc: @echoix

@neteler
Copy link
Member

neteler commented Feb 17, 2025

I think g.download.project can do it without file://. I hope I put it into the doc. Check the source code in the worst case.

g.extension:

# Catch corner case if local URL is given starting with file://

g.download.project:

There is no such condition, hence file:// is needed (streamlining would be desired, though).

@echoix
Copy link
Member

echoix commented Feb 19, 2025

@NishantBansal2003 Sorry for not answering before, I didn't see the message with the question dated two days ago, was it edited since?
I'll get back to you tomorrow

@NishantBansal2003
Copy link
Contributor Author

@NishantBansal2003 Sorry for not answering before, I didn't see the message with the question dated two days ago, was it edited since?
I'll get back to you tomorrow

No worries, I thought the same, which is why I mentioned you now.

@echoix
Copy link
Member

echoix commented Feb 19, 2025

@NishantBansal2003 Ok, I took a fresh look, and a did a fresh read of the actions docs, in case something changed since last time or I missed something.

First of all, your diff shown was spot on! Maybe for the final version to use the same kind of version pinning of actions like the other actions, otherwise it would be ready as is.

Returning to using the bash script on macOS should now be fine, except if there was some additional problems with passing env vars to inside the process, but should be ok now. Now we use the custom micromamba shell, our test script passes the extra arguments (to change the config file on the fly), all things that weren't possible at the time that needed to have a copy of the script inline.

Then for your worry of the absolute paths not being the same: I think it's going to work as is shown on your diff. If you tried that and didn't work, I have another idea with the tilde (~). But for the restore part, it shouldn't be a problem, the paths are relative to the working dir. If you look at the third bullet point of this section: https://github.com/actions/cache/blob/main/tips-and-workarounds.md#cross-os-cache, your worry might be true if we were trying to store files outside of the workspace, at arbitrary locations on the runners. The choice of inside our source tree is reasonable and portable.

@echoix
Copy link
Member

echoix commented Feb 19, 2025

Lastly, if you want to take more time to try it out and do a second PR, you could add the changes you shown to the test script+env var in Ubuntu, and we merge only Ubuntu. But as said before, we should end up with the exact same steps, character by character (except the shell: micromamba {0} on macOS).

Great work!

Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
@NishantBansal2003
Copy link
Contributor Author

Then for your worry of the absolute paths not being the same: I think it's going to work as is shown on your diff. If you tried that and didn't work, I have another idea with the tilde (~). But for the restore part, it shouldn't be a problem, the paths are relative to the working dir. If you look at the third bullet point of this section: https://github.com/actions/cache/blob/main/tips-and-workarounds.md#cross-os-cache, your worry might be true if we were trying to store files outside of the workspace, at arbitrary locations on the runners. The choice of inside our source tree is reasonable and portable.

I tried using different paths for caching across macOS and Ubuntu:

  • Caching directory (sample-data) instead of a file:
    • macOS Workflow: link
    • Ubuntu Workflow: link
  • Using ~/sample-data:
    • macOS Workflow: link
    • Ubuntu Workflow: link
  • Using ${{ github.workspace }}/sample-data:
    • macOS Workflow: link
    • Ubuntu Workflow: link
  • Using ./sample-data:
    • macOS Workflow: link
    • Ubuntu Workflow: link
  • Using sample-data/nc_spm_full_v2alpha2.tar.gz: (currently running inside the PR)

However, none of these approaches worked; both macOS and Ubuntu maintain their own separate caches and use them. You can view the caches here.

@echoix
Copy link
Member

echoix commented Feb 20, 2025

When you state and show that they maintain their own cache, did you clear them before doing the tests? Also, if no cache exists, and both jobs would run, both would try to save it later.

If you want to really go deep, there's a list cache api as noted in that section: https://github.com/actions/cache?tab=readme-ov-file#cache-version

Otherwise, delete one of the caches, then rerun one of them at the time, I expect it to work as you think.

Later this afternoon or tomorrow I'll be able to check the individual workflow links.

@echoix
Copy link
Member

echoix commented Feb 20, 2025

Consider that everything is fine for now, and I'll approve when I can, but even if it's possible to be duplicated , it wouldn't be a problem for now.

Great investigation and clear to follow!

@NishantBansal2003
Copy link
Contributor Author

When you state and show that they maintain their own cache, did you clear them before doing the tests?

Yes, before every new run, I cleared all the previous cache.

Also, if no cache exists, and both jobs would run, both would try to save it later.

No. If you check the Ubuntu workflow, you can see that if there is no cache hit in two jobs, both will download the data but only one of them will be allowed to save it in cache (provided that the cache version is the same). We get a warning like this:

Warning: Failed to save: Failed to CreateCacheEntry: Received non-retryable error: Failed request: (409) Conflict: cache entry with the same key, version, and scope already exists

If you want to really go deep, there's a list cache api as noted in that section: https://github.com/actions/cache?tab=readme-ov-file#cache-version

Also, I listed the cache metadata using the link you shared and got the following:

{
  "id": 15848788,
  "ref": "refs/pull/4/merge",
  "key": "nc_spm_full_v2alpha2.tar.gz",
  "version": "56102a7d1ea9f02ee8c8e4b1f612aa18b41652546b460cb2131c9da990db291e",
  "last_accessed_at": "2025-02-20T05:46:11.611697000Z",
  "created_at": "2025-02-20T05:46:11.611697000Z",
  "size_in_bytes": 164898380
},
{
  "id": 15848090,
  "ref": "refs/pull/4/merge",
  "key": "nc_spm_full_v2alpha2.tar.gz",
  "version": "2eab4ad73500ce1a42538573f40755f0e1a69e81964e5d5b67bb4b4a9224acbd",
  "last_accessed_at": "2025-02-20T05:45:38.409387000Z",
  "created_at": "2025-02-20T05:45:19.404486000Z",
  "size_in_bytes": 165704195
}

So the versions for macOS and Ubuntu are different, which is likely the reason for the cache miss.

Consider that everything is fine for now, and I'll approve when I can, but even if it's possible to be duplicated , it wouldn't be a problem for now.

Understood!

@echoix
Copy link
Member

echoix commented Feb 21, 2025

Did you try clearing only one of them at a time, and know if the cached file from one OS is used for another OS? I'll try this this weekend, to understand the best thing to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feat] Cache NC sample data for tests in CI
4 participants