Add telemetry for payu run #558

Draft
wants to merge 9 commits into master

Conversation

jo-basevi (Collaborator)

This PR follows on from @charles-turner-1's pull request ACCESS-NRI#1, which adds telemetry for payu runs.

Configuring telemetry

As telemetry should only be enabled for ACCESS-NRI deployed versions of payu, the plan is to store an external configuration file for telemetry containing fields such as server_url. The path to this file will be stored in an environment variable set by released environments, e.g. PAYU_TELEMETRY_CONFIG_PATH. When this environment variable is set and the configuration file is present, payu will attempt to post run job information using the access-py-telemetry module. I've also added a payu config.yaml override option to disable telemetry, e.g.

telemetry:
  enable: false # Default is true

An example telemetry file could look something like the following:

{
  "server_url": "http://tracking-services.jb4202.tm70.ps.gadi.nci.org.au:8000",
  "hostname": "gadi"
}

The server URL is currently a persistent session hostname that is accessible from PBS jobs that do not have internet access.
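
To make the enable/disable behaviour concrete, here is a minimal sketch of the check described above (get_telemetry_config is a hypothetical helper name for illustration, not the actual payu implementation):

import json
import os
from pathlib import Path

TELEMETRY_CONFIG_ENV_VAR = "PAYU_TELEMETRY_CONFIG_PATH"

def get_telemetry_config(userconfig):
    """Return the external telemetry config, or None if telemetry is disabled."""
    # config.yaml override: telemetry.enable defaults to true
    if not userconfig.get("telemetry", {}).get("enable", True):
        return None

    # Only ACCESS-NRI deployed environments set this environment variable
    config_path = os.environ.get(TELEMETRY_CONFIG_ENV_VAR)
    if config_path is None or not Path(config_path).exists():
        return None

    # e.g. {"server_url": "...", "hostname": "gadi"}
    with open(config_path) as f:
        return json.load(f)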

Adding scheduler job information

Payu queries the scheduler (using qstat for PBS) to obtain job information such as resource usage. As this job information only gets updated periodically, I've moved posting the job information to later in the payu run job: after model archive has run, or just before the program exits if there is a model run error. This should hopefully pick up resource usage information closer to its final value.
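
For illustration, a PBS query along these lines could look like the sketch below (get_pbs_job_info is a hypothetical name; payu's actual scheduler interface may differ):

import json
import os
import subprocess

def get_pbs_job_info(job_id=None):
    """Query qstat for the given (or current) job and return its status dictionary."""
    job_id = job_id or os.environ.get("PBS_JOBID")
    if job_id is None:
        return None  # Not running inside a PBS job
    try:
        output = subprocess.check_output(
            ["qstat", "-f", "-F", "json", job_id], text=True
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    # qstat -F json keys jobs by their full ID, e.g. "134857143.gadi-pbs"
    jobs = json.loads(output).get("Jobs", {})
    return next(iter(jobs.values()), None)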

Telemetry class instance

Added a Telemetry class in telemetry.py to keep track of payu run state information (e.g. model runtime, counters, n_runs), which is updated just after the model is run in Experiment.run. Once the experiment run has finished (e.g. after archive), the scheduler information is queried, the extra fields are added, and an API request is sent using access_py_telemetry. The extra fields - the payu run state, metadata and scheduler fields - are also always logged to a job.json file. This file ends up in the error logs directory if the model exited with an error, in the work directory if the archive step is not enabled, or otherwise in the archive directory.
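
As a rough outline of the shape this class might take (attribute and method names are illustrative rather than the exact payu implementation):

import datetime
import json
from pathlib import Path

class Telemetry:
    """Tracks payu run state and writes it to job.json alongside scheduler fields."""

    def __init__(self):
        self.run_info = {}
        self.start_time = None

    def start_run(self):
        self.start_time = datetime.datetime.now()

    def end_run(self, run_id, current_run, n_runs, job_status):
        finish_time = datetime.datetime.now()
        self.run_info.update({
            "payu_run_id": run_id,
            "payu_current_run": current_run,
            "payu_n_runs": n_runs,
            "payu_job_status": job_status,
            "payu_start_time": self.start_time.isoformat(),
            "payu_finish_time": finish_time.isoformat(),
            "payu_walltime_seconds": (finish_time - self.start_time).total_seconds(),
        })
        # Reset so a subsequent run in the same submit job gets its own timing
        self.start_time = finish_time

    def write_job_file(self, directory, scheduler_info):
        """Write run state plus scheduler fields to job.json."""
        record = {**self.run_info, "scheduler_job_info": scheduler_info}
        with open(Path(directory) / "job.json", "w") as f:
            json.dump(record, f, indent=4)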

An example record follows:

{
    "id": 10,
    "timestamp": "2025-02-10T03:15:52.409758Z",
    "name": "jb4202",
    "function": "payu.subcommands.run_cmd.runscript",
    "args": {},
    "kwargs": {},
    // Below are fields from metadata.yaml
    "experiment_uuid": "1d00c5dd-b222-4f8a-8c00-1ef53bee6d66",
    "experiment_created": "2025-01-21", 
    "experiment_name": "mom6-double-grye-base-test-payu-tracking-object-1d00c5dd", 
    "model": "MOM6",
    // Fields filtered from the scheduler:
    "scheduler_job_info": {
        "resources_used_cpupercent": "0",
        "resources_used_cput": "00:00:00",
        "resources_used_mem": "0b",
        "resources_used_ncpus": "4",
        "resources_used_vmem": "0b",
        "resources_used_walltime": "00:00:00",
        "job_state": "R",
        "queue": "express-exec",
        "mtime": "Mon Feb 10 14:15:34 2025",
        "qtime": "Mon Feb 10 14:15:18 2025",
        "resource_list_jobfs": "104857600b",
        "resource_list_mem": "8589934592b",
        "resource_list_mpiprocs": "4",
        "resource_list_ncpus": "4",
        "resource_list_nodect": "1",
        "resource_list_select": "1:ncpus=4:mpiprocs=4:mem=8589934592:job_tags=express:jobfs=104857600",
        "resource_list_storage": "scratch/tm70+gdata/tm70",
        "resource_list_walltime": "00:30:00",
        "stime": "Mon Feb 10 14:15:31 2025",
        "project": "tm70",
        "job_id": "134856068"
    },
    "scheduler_job_info_version": "1.0",
    "scheduler_type": "pbs",
    "hostname": "gadi", // From the external telemetry config file
    // Payu run state fields:
    "payu_run_id": "93b1ac2ebc66255a8a784851097d9a7149eeac72",
    "payu_current_run": 88,
    "payu_n_runs": 1,
    "payu_job_status": 0,
    "payu_start_time": "2025-02-10T14:15:34.216724Z",
    "payu_finish_time": "2025-02-10T14:15:38.238044Z",
    "payu_walltime_seconds": 4.02132,
    "payu_version": "1.1.6+12.g11bad42.dirty",
    "payu_path": "/scratch/tm70/jb4202/payu-telemetry-venv/bin",
    "payu_control_dir": "/home/189/jb4202/test-payu/mom6-double-grye-base",
    "payu_archive_dir": "/scratch/tm70/jb4202/mom6/archive/mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "payu_remote_archive_dir": ""
}

Questions/TODOs

  • Should telemetry for failed jobs be logged? Currently telemetry is posted if the model exits with an error, but not if post-run userscripts/archive exit with an error.
  • Should the full scheduler job information be logged to file (e.g. job.json), with only a filtered version posted for telemetry? I'm currently saving to job.json the same extra fields that are being added to telemetry.
Example job.json files

job.json with filtered scheduler fields:

{
    "experiment_uuid": "1d00c5dd-b222-4f8a-8c00-1ef53bee6d66",
    "experiment_created": "2025-01-21",
    "experiment_name": "mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "model": "MOM6",
    "payu_run_id": "93b1ac2ebc66255a8a784851097d9a7149eeac72",
    "payu_current_run": 88,
    "payu_n_runs": 1,
    "payu_job_status": 0,
    "payu_start_time": "2025-02-10T14:15:34.216724",
    "payu_finish_time": "2025-02-10T14:15:38.238044",
    "payu_walltime_seconds": 4.02132,
    "payu_version": "1.1.6+12.g11bad42.dirty",
    "payu_path": "/scratch/tm70/jb4202/payu-telemetry-venv/bin",
    "payu_control_dir": "/home/189/jb4202/test-payu/mom6-double-grye-base",
    "payu_archive_dir": "/scratch/tm70/jb4202/mom6/archive/mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "scheduler_job_info": {
        "resources_used_cpupercent": "0",
        "resources_used_cput": "00:00:00",
        "resources_used_mem": "0b",
        "resources_used_ncpus": "4",
        "resources_used_vmem": "0b",
        "resources_used_walltime": "00:00:00",
        "job_state": "R",
        "queue": "express-exec",
        "mtime": "Mon Feb 10 14:15:34 2025",
        "qtime": "Mon Feb 10 14:15:18 2025",
        "resource_list_jobfs": "104857600b",
        "resource_list_mem": "8589934592b",
        "resource_list_mpiprocs": "4",
        "resource_list_ncpus": "4",
        "resource_list_nodect": "1",
        "resource_list_select": "1:ncpus=4:mpiprocs=4:mem=8589934592:job_tags=express:jobfs=104857600",
        "resource_list_storage": "scratch/tm70+gdata/tm70",
        "resource_list_walltime": "00:30:00",
        "stime": "Mon Feb 10 14:15:31 2025",
        "project": "tm70",
        "job_id": "134856068"
    },
    "scheduler_job_info_version": "1.0",
    "scheduler_type": "pbs",
    "scheduler_job_id": "134856068.gadi-pbs"
}

job.json with all the scheduler fields:

{
    "experiment_uuid": "1d00c5dd-b222-4f8a-8c00-1ef53bee6d66",
    "experiment_created": "2025-01-21",
    "experiment_name": "mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "model": "MOM6",
    "payu_run_id": "9b06e58faeace704392f3a1231e5abd0446bd817",
    "payu_current_run": 89,
    "payu_n_runs": 1,
    "payu_job_status": 0,
    "payu_start_time": "2025-02-10T14:24:28.591909",
    "payu_finish_time": "2025-02-10T14:24:32.081931",
    "payu_walltime_seconds": 3.490022,
    "payu_version": "1.1.6+12.g11bad42.dirty",
    "payu_path": "/scratch/tm70/jb4202/payu-telemetry-venv/bin",
    "payu_control_dir": "/home/189/jb4202/test-payu/mom6-double-grye-base",
    "payu_archive_dir": "/scratch/tm70/jb4202/mom6/archive/mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "scheduler_job_info": {
        "job_name": "double_gyre",
        "job_owner": "jb4202@gadi-login-02.gadi.nci.org.au",
        "resources_used_cpupercent": "0",
        "resources_used_cput": "00:00:00",
        "resources_used_mem": "0b",
        "resources_used_ncpus": "4",
        "resources_used_vmem": "0b",
        "resources_used_walltime": "00:00:00",
        "job_state": "R",
        "queue": "express-exec",
        "server": "gadi-pbs-01.gadi.nci.org.au",
        "checkpoint": "u",
        "ctime": "Mon Feb 10 14:23:49 2025",
        "error_path": "gadi.nci.org.au:/home/189/jb4202/test-payu/mom6-double-grye-base/double_gyre.e134857143",
        "exec_host": "gadi-cpu-clx-0773/19*4",
        "exec_vnode": "(gadi-cpu-clx-0773:ncpus=4:mem=8388608kb:jobfs=102400kb)",
        "group_list": "tm70",
        "hold_types": "n",
        "join_path": "n",
        "keep_files": "n",
        "mail_points": "a",
        "mtime": "Mon Feb 10 14:24:29 2025",
        "output_path": "gadi.nci.org.au:/home/189/jb4202/test-payu/mom6-double-grye-base/double_gyre.o134857143",
        "priority": "0",
        "qtime": "Mon Feb 10 14:23:49 2025",
        "rerunable": "False",
        "resource_list_jobfs": "104857600b",
        "resource_list_mem": "8589934592b",
        "resource_list_mpiprocs": "4",
        "resource_list_ncpus": "4",
        "resource_list_nodect": "1",
        "resource_list_place": "free",
        "resource_list_select": "1:ncpus=4:mpiprocs=4:mem=8589934592:job_tags=express:jobfs=104857600",
        "resource_list_storage": "scratch/tm70+gdata/tm70",
        "resource_list_walltime": "00:30:00",
        "resource_list_wd": "1",
        "stime": "Mon Feb 10 14:24:23 2025",
        "session_id": "361071",
        "jobdir": "/home/189/jb4202",
        "substate": "42",
        "variable_list": "PBS_O_HOME=/home/189/jb4202,PBS_O_LANG=en_AU.UTF-8,PBS_O_LOGNAME=jb4202,PBS_O_PATH=/scratch/tm70/jb4202/payu-telemetry-venv/bin:/home/189/jb4202/.local/bin:/home/189/jb4202/bin:/opt/pbs/default/bin:/opt/nci/bin:/opt/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin,PBS_O_MAIL=/var/spool/mail/jb4202,PBS_O_SHELL=/bin/bash,PBS_O_TZ=:/etc/localtime,PBS_O_INTERACTIVE_AUTH_METHOD=resvport,PBS_O_HOST=gadi-login-02.gadi.nci.org.au,PBS_O_WORKDIR=/home/189/jb4202/test-payu/mom6-double-grye-base,PBS_O_SYSTEM=Linux,LD_LIBRARY_PATH=/usr/lib64,PAYU_PATH=/scratch/tm70/jb4202/payu-telemetry-venv/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles,PAYU_TELEMETRY_CONFIG_PATH=/home/189/jb4202/test-telemetry/payu_telemetry_config.json,PBS_NCI_HT=0,PBS_NCI_STORAGE=scratch/tm70+gdata/tm70,PBS_NCI_IMAGE=,PBS_NCPUS=4,PBS_NGPUS=0,PBS_NNODES=1,PBS_NCI_NCPUS_PER_NODE=48,PBS_NCI_NUMA_PER_NODE=4,PBS_NCI_NCPUS_PER_NUMA=12,PROJECT=tm70,PBS_VMEM=8589934592,PBS_NCI_WD=1,PBS_NCI_JOBFS=104857600b,PBS_NCI_LAUNCH_COMPATIBILITY=0,PBS_NCI_FS_GDATA1=0,PBS_NCI_FS_GDATA1A=0,PBS_NCI_FS_GDATA1B=0,PBS_NCI_FS_GDATA2=0,PBS_NCI_FS_GDATA3=0,PBS_NCI_FS_GDATA4=0,PBS_O_QUEUE=express,PBS_JOBFS=/jobfs/134857143.gadi-pbs",
        "comment": "Job run at Mon Feb 10 at 14:24 on (gadi-cpu-clx-0773:ncpus=4:mem=8388608kb:jobfs=102400kb)",
        "etime": "Mon Feb 10 14:23:49 2025",
        "run_count": "1",
        "submit_arguments": "-q express -P tm70 -l walltime=0:30:00 -l ncpus=4 -l mem=8GB -N double_gyre -l wd -j n -v LD_LIBRARY_PATH=/usr/lib64,PAYU_PATH=/scratch/tm70/jb4202/payu-telemetry-venv/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles,PAYU_TELEMETRY_CONFIG_PATH=/home/189/jb4202/test-telemetry/payu_telemetry_config.json -l storage=gdata/tm70+scratch/tm70 -- /scratch/tm70/jb4202/payu-telemetry-venv/bin/python3 /scratch/tm70/jb4202/payu-telemetry-venv/bin/payu-run",
        "executable": "<jsdl-hpcpa:Executable>/scratch/tm70/jb4202/payu-telemetry-venv/bin/python3</jsdl-hpcpa:Executable>",
        "argument_list": "<jsdl-hpcpa:Argument>/scratch/tm70/jb4202/payu-telemetry-venv/bin/payu-run</jsdl-hpcpa:Argument>",
        "project": "tm70",
        "submit_host": "gadi-login-02.gadi.nci.org.au",
        "job_id": "134857143"
    },
    "scheduler_job_info_version": "1.0",
    "scheduler_type": "pbs",
    "scheduler_job_id": "134857143.gadi-pbs"
}
  • There's a payu_walltime_seconds value. At the moment it's the time from experiment initialisation to just after the model run command (e.g. mpirun). Should it instead cover the whole initialise-setup-run-archive loop, and should there be a separate time for just the mpirun command? I'm currently resetting payu_start_time after a model run, so that when multiple runs occur in one submit job there's an idea of how long each one took.

  • qstat can output JSON using -F json - this gives a pbs_version and parses the resources_used/Resource_List/Variable_List fields into dictionaries. The pbs_version could be useful for the scheduler version?

  • Model runtime: add model-driver code to parse out the model run time for ACCESS-OM2 and ESM1.5 configurations?

  • Check that scheduler job information is accessible in a Slurm job on Setonix?

jo-basevi changed the title from "546 telemetry" to "Add telemetry for payu run" on Feb 10, 2025
charles-turner-1 and others added 7 commits February 11, 2025 09:12
- Added Telemetry class to store run state information
- Move posting the telemetry to later in the payu run job - to after archive, or before payu exits with a model run error
- Add logic for writing job.json file at different payu stages - e.g. to work directory, archive, error logs directory
@jo-basevi (Collaborator, Author)

Note: tests are failing due to "Endpoint for 'api_payu_run' not found" in the released access_py_telemetry package. I'll need to open a PR on that repository to update the endpoint.

@charles-turner-1

charles-turner-1 commented Feb 11, 2025

IIRC, I think it should be just payu_run, not api_payu_run - the API bit should be attached to ApiHandler.SERVER_URL... I think?

> Note: tests are failing due to "Endpoint for 'api_payu_run' not found" in the released access_py_telemetry package. I'll need to open a PR on that repository to update the endpoint.

Edit: Not necessarily true, but I think it will make life easier.

@jo-basevi (Collaborator, Author)

I haven't opened a PR in tracking-services to link to, but the endpoints in the develop branch have /api/*. Using payu_run with that pattern led to endpoints of /payu/run. So should access-py-telemetry automatically prefix the endpoint with /api/? Otherwise I was going to just modify the config.yaml in access-py-telemetry to have:

api:
  payu:
    run:

@jo-basevi (Collaborator, Author)

Oops, didn't see the latest edit - yeah, modifying the server_url to have api at the end would also work. I'll revert to payu_run.

@charles-turner-1

Yeah, just saves having a

api:
  service_1:
    subtree:
api:
  service_2:
    subtree:

situation where we have to put api: at the start of each service

jo-basevi self-assigned this on Feb 18, 2025