Weird issues trying to run on GPU with v1.1.0 #246

Open
tbenavi1 opened this issue Mar 22, 2025 · 1 comment

@tbenavi1
I am using Snakemake version 8.25.5 and the SLURM executor plugin version 1.1.0.

I am trying to figure out how to edit my Snakemake rule so that it matches this sbatch command (which works correctly when submitted directly):

sbatch -A tgen-332000 -t 96:00:00 --nodes=1 -p gpu-a100 --ntasks=1 --gres=gpu:A100:2 --cpus-per-gpu 16 --mem 384000 dorado1.sh

The rule I made has these resources:

  resources:
    mem_mb=384000,
    gpu=2,
    gpu_model="a100",
    slurm_partition="gpu-a100",
    runtime=4320,
    cpus_per_gpu=16
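
For reference, the sbatch flags map roughly onto those resources (-A → slurm_account, -t → runtime, -p → slurm_partition, --gres=gpu:A100:2 → gpu/gpu_model, --cpus-per-gpu → cpus_per_gpu). If the dedicated GPU resources are not being forwarded correctly, one hedged workaround is to pass the raw flags through the plugin's slurm_extra resource instead; the sketch below assumes slurm_extra is available in this plugin version, reuses the rule name, files, and account from the log further down, and uses a placeholder dorado command:

rule herro_all_gpu:
  input:
    "BJ/ONT/BJ.all.ONT.fastq",
    "BJ/ONT/BJ.all.ONT.overlaps.paf"
  output:
    "BJ/ONT/BJ.all.ONT.corrected.fasta"
  threads: 32
  resources:
    mem_mb=384000,
    runtime=4320,
    slurm_account="tgen-332000",
    slurm_partition="gpu-a100",
    # raw sbatch flags passed straight through, mirroring the working command above
    slurm_extra="'--gres=gpu:A100:2 --cpus-per-gpu=16'"
  shell:
    # placeholder command; the actual rule runs dorado correct, per the log below
    "dorado correct {input[0]} --from-paf {input[1]} > {output}"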

However, whenever I ran "snakemake --profile profile", the jobs started on the "compute" partition even though I had requested the "gpu-a100" partition. Another oddity I noticed in the log file is that it seems to be running everything twice:

host: g-h-1-8-07
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Select jobs to execute...
Execute 1 jobs...

[Sat Mar 22 05:57:01 2025]
rule herro_all_gpu:
    input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
    output: BJ/ONT/BJ.all.ONT.corrected.fasta
    jobid: 0
    reason: Forced execution
    wildcards: sample=BJ
    threads: 32
    resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=<TBD>, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16

host: g-h-1-8-07
host: g-h-1-8-07
Building DAG of jobs...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Select jobs to execute...
Select jobs to execute...
Execute 1 jobs...

[Sat Mar 22 05:57:03 2025]
Execute 1 jobs...
localrule herro_all_gpu:
    input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
    output: BJ/ONT/BJ.all.ONT.corrected.fasta
    jobid: 0
    reason: Forced execution
    wildcards: sample=BJ
    threads: 16
    resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=/tmp, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16


[Sat Mar 22 05:57:03 2025]
localrule herro_all_gpu:
    input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
    output: BJ/ONT/BJ.all.ONT.corrected.fasta
    jobid: 0
    reason: Forced execution
    wildcards: sample=BJ
    threads: 16
    resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=/tmp, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16

[2025-03-22 05:57:15.892] [info] Running: "correct" "BJ/ONT/BJ.all.ONT.fastq" "--from-paf" "BJ/ONT/BJ.all.ONT.overlaps.paf"
[2025-03-22 05:57:15.892] [info] Running: "correct" "BJ/ONT/BJ.all.ONT.fastq" "--from-paf" "BJ/ONT/BJ.all.ONT.overlaps.paf"
[2025-03-22 05:57:15.944] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2025-03-22 05:57:15.945] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2025-03-22 05:57:16.000] [info]  - downloading herro-v1 with httplib
[2025-03-22 05:57:16.000] [info]  - downloading herro-v1 with httplib
[2025-03-22 05:57:16.110] [error] Failed to download herro-v1: SSL server verification failed
[2025-03-22 05:57:16.110] [info]  - downloading herro-v1 with curl
[2025-03-22 05:57:16.110] [error] Failed to download herro-v1: SSL server verification failed
[2025-03-22 05:57:16.110] [info]  - downloading herro-v1 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.3M  100 22.3M    0     0  52.6M      0 --:--:-- --:--:-- --:--:-- 52.6M
100 22.3M  100 22.3M    0     0  52.5M      0 --:--:-- --:--:-- --:--:-- 52.6M
[2025-03-22 05:57:17.454] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2025-03-22 05:57:17.499] [info] Starting
[2025-03-22 05:57:17.506] [info] Starting
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 18555518 ON g-h-1-8-07 CANCELLED AT 2025-03-22T06:01:31 ***
slurmstepd: error: *** STEP 18555518.0 ON g-h-1-8-07 CANCELLED AT 2025-03-22T06:01:31 ***
Will exit after finishing currently running jobs (scheduler).
Will exit after finishing currently running jobs (scheduler).

Perhaps because I requested 2 GPUs there is a bug that makes it try to run the rule twice? Please let me know what advice you have. Thank you.

@cmeesters
Member

Thank you for reporting this issue.

Snakemake submits itself. Hence, the job log appears to contain a double execution.

As for the issue itself:

Please show your global Snakemake configuration (if any) and the command line. If you have a minimal workflow to test, that would be appreciated, too.
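
For orientation, a minimal sketch of what such a profile (profile/config.yaml) typically looks like for this plugin; the values below are illustrative assumptions, not the reporter's actual configuration:

# profile/config.yaml (illustrative example only)
executor: slurm
jobs: 10
default-resources:
  slurm_account: "tgen-332000"
  mem_mb: 8000
  runtime: 60
# note: resources set on a rule (e.g. slurm_partition) override these defaults
# invoked as: snakemake --profile profile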
