Add telemetry for payu run #558

Draft
wants to merge 9 commits into master

Conversation

jo-basevi (Collaborator)

This PR follows on from @charles-turner-1's pull request ACCESS-NRI#1, which adds telemetry for payu runs.

Configuring telemetry

As telemetry should only be enabled for ACCESS-NRI deployed versions of payu, the plan is to store an external configuration file for telemetry containing fields such as server_url. The path to this file will be stored in an environment variable set by released environments, e.g. PAYU_TELEMETRY_CONFIG_PATH. When this environment variable is set and the configuration file is present, payu will attempt to post run job information using the access-py-telemetry module. I've also added a payu config.yaml override option to disable telemetry, e.g.

telemetry:
  enable: false # Default is true

An example telemetry file could look something like the following:

{
  "server_url": "http://tracking-services.jb4202.tm70.ps.gadi.nci.org.au:8000",
  "hostname": "gadi"
}

The server URL is currently a persistent session hostname that is accessible from PBS jobs that do not have internet access.
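
To make the enable/disable behaviour concrete, here is a minimal sketch of the check described above (get_telemetry_config is a hypothetical helper name for illustration, not the actual payu implementation):

import json
import os
from pathlib import Path

TELEMETRY_CONFIG_ENV_VAR = "PAYU_TELEMETRY_CONFIG_PATH"

def get_telemetry_config(userconfig):
    """Return the external telemetry config, or None if telemetry is disabled."""
    # config.yaml override: telemetry.enable defaults to true
    if not userconfig.get("telemetry", {}).get("enable", True):
        return None

    # Only ACCESS-NRI deployed environments set this environment variable
    config_path = os.environ.get(TELEMETRY_CONFIG_ENV_VAR)
    if config_path is None or not Path(config_path).exists():
        return None

    # e.g. {"server_url": "...", "hostname": "gadi"}
    with open(config_path) as f:
        return json.load(f)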

Adding scheduler job information

Payu queries the scheduler (using qstat for PBS) to obtain job information such as resource usage. As this job information only gets updated periodically, I've moved posting the job information to later in the payu run job: after model archive has run, or just before the program exits if there is a model run error. This should hopefully pick up resource usage information closer to its final value.
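
For illustration, a PBS query along these lines could look like the sketch below (get_pbs_job_info is a hypothetical name; payu's actual scheduler interface may differ):

import json
import os
import subprocess

def get_pbs_job_info(job_id=None):
    """Query qstat for the given (or current) job and return its status dictionary."""
    job_id = job_id or os.environ.get("PBS_JOBID")
    if job_id is None:
        return None  # Not running inside a PBS job
    try:
        output = subprocess.check_output(
            ["qstat", "-f", "-F", "json", job_id], text=True
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    # qstat -F json keys jobs by their full ID, e.g. "134857143.gadi-pbs"
    jobs = json.loads(output).get("Jobs", {})
    return next(iter(jobs.values()), None)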

Telemetry class instance

Added a Telemetry class in telemetry.py to keep track of payu run state information (e.g. model runtime, counters, n_runs), which is updated just after the model is run in Experiment.run. Once the experiment run has finished (e.g. after archive), the scheduler information is queried, the extra fields are added, and an API request is sent using access_py_telemetry. The extra fields - the payu run state, metadata and scheduler fields - are also always logged to a job.json file. This file ends up in the error logs directory if the model exited with an error, in the work directory if the archive step is not enabled, or otherwise in the archive directory.
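
As a rough outline of the shape this class might take (attribute and method names are illustrative rather than the exact payu implementation):

import datetime
import json
from pathlib import Path

class Telemetry:
    """Tracks payu run state and writes it to job.json alongside scheduler fields."""

    def __init__(self):
        self.run_info = {}
        self.start_time = None

    def start_run(self):
        self.start_time = datetime.datetime.now()

    def end_run(self, run_id, current_run, n_runs, job_status):
        finish_time = datetime.datetime.now()
        self.run_info.update({
            "payu_run_id": run_id,
            "payu_current_run": current_run,
            "payu_n_runs": n_runs,
            "payu_job_status": job_status,
            "payu_start_time": self.start_time.isoformat(),
            "payu_finish_time": finish_time.isoformat(),
            "payu_walltime_seconds": (finish_time - self.start_time).total_seconds(),
        })
        # Reset so a subsequent run in the same submit job gets its own timing
        self.start_time = finish_time

    def write_job_file(self, directory, scheduler_info):
        """Write run state plus scheduler fields to job.json."""
        record = {**self.run_info, "scheduler_job_info": scheduler_info}
        with open(Path(directory) / "job.json", "w") as f:
            json.dump(record, f, indent=4)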

An example record follows:

{
    "id": 10,
    "timestamp": "2025-02-10T03:15:52.409758Z",
    "name": "jb4202",
    "function": "payu.subcommands.run_cmd.runscript",
    "args": {},
    "kwargs": {},
    // Below are fields from metadata.yaml
    "experiment_uuid": "1d00c5dd-b222-4f8a-8c00-1ef53bee6d66",
    "experiment_created": "2025-01-21", 
    "experiment_name": "mom6-double-grye-base-test-payu-tracking-object-1d00c5dd", 
    "model": "MOM6",
    // Fields filtered from the scheduler:
    "scheduler_job_info": {
        "resources_used_cpupercent": "0",
        "resources_used_cput": "00:00:00",
        "resources_used_mem": "0b",
        "resources_used_ncpus": "4",
        "resources_used_vmem": "0b",
        "resources_used_walltime": "00:00:00",
        "job_state": "R",
        "queue": "express-exec",
        "mtime": "Mon Feb 10 14:15:34 2025",
        "qtime": "Mon Feb 10 14:15:18 2025",
        "resource_list_jobfs": "104857600b",
        "resource_list_mem": "8589934592b",
        "resource_list_mpiprocs": "4",
        "resource_list_ncpus": "4",
        "resource_list_nodect": "1",
        "resource_list_select": "1:ncpus=4:mpiprocs=4:mem=8589934592:job_tags=express:jobfs=104857600",
        "resource_list_storage": "scratch/tm70+gdata/tm70",
        "resource_list_walltime": "00:30:00",
        "stime": "Mon Feb 10 14:15:31 2025",
        "project": "tm70",
        "job_id": "134856068"
    },
    "scheduler_job_info_version": "1.0",
    "scheduler_type": "pbs",
    "hostname": "gadi", // From the external telemetry config file
    // Payu run state fields:
    "payu_run_id": "93b1ac2ebc66255a8a784851097d9a7149eeac72",
    "payu_current_run": 88,
    "payu_n_runs": 1,
    "payu_job_status": 0,
    "payu_start_time": "2025-02-10T14:15:34.216724Z",
    "payu_finish_time": "2025-02-10T14:15:38.238044Z",
    "payu_walltime_seconds": 4.02132,
    "payu_version": "1.1.6+12.g11bad42.dirty",
    "payu_path": "/scratch/tm70/jb4202/payu-telemetry-venv/bin",
    "payu_control_dir": "/home/189/jb4202/test-payu/mom6-double-grye-base",
    "payu_archive_dir": "/scratch/tm70/jb4202/mom6/archive/mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "payu_remote_archive_dir": ""
}

Questions/TODOs

  • Should telemetry for failed jobs be logged? Currently telemetry is posted if the model exits with an error, but not if post-run userscripts/archive exit with an error.
  • Should the full scheduler job information be logged to file (e.g. job.json), with only a filtered version posted for telemetry? I'm currently saving to job.json the same extra fields that are being added to telemetry.
Example job.json files

job.json with filtered scheduler fields:

{
    "experiment_uuid": "1d00c5dd-b222-4f8a-8c00-1ef53bee6d66",
    "experiment_created": "2025-01-21",
    "experiment_name": "mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "model": "MOM6",
    "payu_run_id": "93b1ac2ebc66255a8a784851097d9a7149eeac72",
    "payu_current_run": 88,
    "payu_n_runs": 1,
    "payu_job_status": 0,
    "payu_start_time": "2025-02-10T14:15:34.216724",
    "payu_finish_time": "2025-02-10T14:15:38.238044",
    "payu_walltime_seconds": 4.02132,
    "payu_version": "1.1.6+12.g11bad42.dirty",
    "payu_path": "/scratch/tm70/jb4202/payu-telemetry-venv/bin",
    "payu_control_dir": "/home/189/jb4202/test-payu/mom6-double-grye-base",
    "payu_archive_dir": "/scratch/tm70/jb4202/mom6/archive/mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "scheduler_job_info": {
        "resources_used_cpupercent": "0",
        "resources_used_cput": "00:00:00",
        "resources_used_mem": "0b",
        "resources_used_ncpus": "4",
        "resources_used_vmem": "0b",
        "resources_used_walltime": "00:00:00",
        "job_state": "R",
        "queue": "express-exec",
        "mtime": "Mon Feb 10 14:15:34 2025",
        "qtime": "Mon Feb 10 14:15:18 2025",
        "resource_list_jobfs": "104857600b",
        "resource_list_mem": "8589934592b",
        "resource_list_mpiprocs": "4",
        "resource_list_ncpus": "4",
        "resource_list_nodect": "1",
        "resource_list_select": "1:ncpus=4:mpiprocs=4:mem=8589934592:job_tags=express:jobfs=104857600",
        "resource_list_storage": "scratch/tm70+gdata/tm70",
        "resource_list_walltime": "00:30:00",
        "stime": "Mon Feb 10 14:15:31 2025",
        "project": "tm70",
        "job_id": "134856068"
    },
    "scheduler_job_info_version": "1.0",
    "scheduler_type": "pbs",
    "scheduler_job_id": "134856068.gadi-pbs"
}

job.json with all the scheduler fields:

{
    "experiment_uuid": "1d00c5dd-b222-4f8a-8c00-1ef53bee6d66",
    "experiment_created": "2025-01-21",
    "experiment_name": "mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "model": "MOM6",
    "payu_run_id": "9b06e58faeace704392f3a1231e5abd0446bd817",
    "payu_current_run": 89,
    "payu_n_runs": 1,
    "payu_job_status": 0,
    "payu_start_time": "2025-02-10T14:24:28.591909",
    "payu_finish_time": "2025-02-10T14:24:32.081931",
    "payu_walltime_seconds": 3.490022,
    "payu_version": "1.1.6+12.g11bad42.dirty",
    "payu_path": "/scratch/tm70/jb4202/payu-telemetry-venv/bin",
    "payu_control_dir": "/home/189/jb4202/test-payu/mom6-double-grye-base",
    "payu_archive_dir": "/scratch/tm70/jb4202/mom6/archive/mom6-double-grye-base-test-payu-tracking-object-1d00c5dd",
    "scheduler_job_info": {
        "job_name": "double_gyre",
        "job_owner": "jb4202@gadi-login-02.gadi.nci.org.au",
        "resources_used_cpupercent": "0",
        "resources_used_cput": "00:00:00",
        "resources_used_mem": "0b",
        "resources_used_ncpus": "4",
        "resources_used_vmem": "0b",
        "resources_used_walltime": "00:00:00",
        "job_state": "R",
        "queue": "express-exec",
        "server": "gadi-pbs-01.gadi.nci.org.au",
        "checkpoint": "u",
        "ctime": "Mon Feb 10 14:23:49 2025",
        "error_path": "gadi.nci.org.au:/home/189/jb4202/test-payu/mom6-double-grye-base/double_gyre.e134857143",
        "exec_host": "gadi-cpu-clx-0773/19*4",
        "exec_vnode": "(gadi-cpu-clx-0773:ncpus=4:mem=8388608kb:jobfs=102400kb)",
        "group_list": "tm70",
        "hold_types": "n",
        "join_path": "n",
        "keep_files": "n",
        "mail_points": "a",
        "mtime": "Mon Feb 10 14:24:29 2025",
        "output_path": "gadi.nci.org.au:/home/189/jb4202/test-payu/mom6-double-grye-base/double_gyre.o134857143",
        "priority": "0",
        "qtime": "Mon Feb 10 14:23:49 2025",
        "rerunable": "False",
        "resource_list_jobfs": "104857600b",
        "resource_list_mem": "8589934592b",
        "resource_list_mpiprocs": "4",
        "resource_list_ncpus": "4",
        "resource_list_nodect": "1",
        "resource_list_place": "free",
        "resource_list_select": "1:ncpus=4:mpiprocs=4:mem=8589934592:job_tags=express:jobfs=104857600",
        "resource_list_storage": "scratch/tm70+gdata/tm70",
        "resource_list_walltime": "00:30:00",
        "resource_list_wd": "1",
        "stime": "Mon Feb 10 14:24:23 2025",
        "session_id": "361071",
        "jobdir": "/home/189/jb4202",
        "substate": "42",
        "variable_list": "PBS_O_HOME=/home/189/jb4202,PBS_O_LANG=en_AU.UTF-8,PBS_O_LOGNAME=jb4202,PBS_O_PATH=/scratch/tm70/jb4202/payu-telemetry-venv/bin:/home/189/jb4202/.local/bin:/home/189/jb4202/bin:/opt/pbs/default/bin:/opt/nci/bin:/opt/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin,PBS_O_MAIL=/var/spool/mail/jb4202,PBS_O_SHELL=/bin/bash,PBS_O_TZ=:/etc/localtime,PBS_O_INTERACTIVE_AUTH_METHOD=resvport,PBS_O_HOST=gadi-login-02.gadi.nci.org.au,PBS_O_WORKDIR=/home/189/jb4202/test-payu/mom6-double-grye-base,PBS_O_SYSTEM=Linux,LD_LIBRARY_PATH=/usr/lib64,PAYU_PATH=/scratch/tm70/jb4202/payu-telemetry-venv/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles,PAYU_TELEMETRY_CONFIG_PATH=/home/189/jb4202/test-telemetry/payu_telemetry_config.json,PBS_NCI_HT=0,PBS_NCI_STORAGE=scratch/tm70+gdata/tm70,PBS_NCI_IMAGE=,PBS_NCPUS=4,PBS_NGPUS=0,PBS_NNODES=1,PBS_NCI_NCPUS_PER_NODE=48,PBS_NCI_NUMA_PER_NODE=4,PBS_NCI_NCPUS_PER_NUMA=12,PROJECT=tm70,PBS_VMEM=8589934592,PBS_NCI_WD=1,PBS_NCI_JOBFS=104857600b,PBS_NCI_LAUNCH_COMPATIBILITY=0,PBS_NCI_FS_GDATA1=0,PBS_NCI_FS_GDATA1A=0,PBS_NCI_FS_GDATA1B=0,PBS_NCI_FS_GDATA2=0,PBS_NCI_FS_GDATA3=0,PBS_NCI_FS_GDATA4=0,PBS_O_QUEUE=express,PBS_JOBFS=/jobfs/134857143.gadi-pbs",
        "comment": "Job run at Mon Feb 10 at 14:24 on (gadi-cpu-clx-0773:ncpus=4:mem=8388608kb:jobfs=102400kb)",
        "etime": "Mon Feb 10 14:23:49 2025",
        "run_count": "1",
        "submit_arguments": "-q express -P tm70 -l walltime=0:30:00 -l ncpus=4 -l mem=8GB -N double_gyre -l wd -j n -v LD_LIBRARY_PATH=/usr/lib64,PAYU_PATH=/scratch/tm70/jb4202/payu-telemetry-venv/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles,PAYU_TELEMETRY_CONFIG_PATH=/home/189/jb4202/test-telemetry/payu_telemetry_config.json -l storage=gdata/tm70+scratch/tm70 -- /scratch/tm70/jb4202/payu-telemetry-venv/bin/python3 /scratch/tm70/jb4202/payu-telemetry-venv/bin/payu-run",
        "executable": "<jsdl-hpcpa:Executable>/scratch/tm70/jb4202/payu-telemetry-venv/bin/python3</jsdl-hpcpa:Executable>",
        "argument_list": "<jsdl-hpcpa:Argument>/scratch/tm70/jb4202/payu-telemetry-venv/bin/payu-run</jsdl-hpcpa:Argument>",
        "project": "tm70",
        "submit_host": "gadi-login-02.gadi.nci.org.au",
        "job_id": "134857143"
    },
    "scheduler_job_info_version": "1.0",
    "scheduler_type": "pbs",
    "scheduler_job_id": "134857143.gadi-pbs"
}
  • There's a payu_walltime_seconds value. At the moment it's the time from experiment initialisation to just after the model run command (e.g. mpirun). Should it instead cover the whole initialise-setup-run-archive loop, and should there be a separate time for just the mpirun command? I'm currently resetting payu_start_time after a model run, so that when multiple runs occur in one submit job there's an idea of how long each one took.

  • qstat can output JSON using -F json - this gives a pbs_version and parses the resources_used/Resource_List/Variable_List fields into dictionaries. The pbs_version could be useful for the scheduler version?

  • Model runtime: add model-driver code to parse out the model run time for ACCESS-OM2 and ESM1.5 configurations?

  • Check that scheduler job information is accessible in a Slurm job on Setonix?

jo-basevi changed the title from "546 telemetry" to "Add telemetry for payu run" on Feb 10, 2025
charles-turner-1 and others added 7 commits February 11, 2025 09:12
- Added Telemetry class to store run state information
- Move posting the telemetry to later in the payu run job - to after archive, or before payu exits with a model run error
- Add logic for writing job.json file at different payu stages - e.g. to work directory, archive, error logs directory
@jo-basevi (Collaborator, Author)

Note: tests are failing due to "Endpoint for 'api_payu_run' not found" in the released access_py_telemetry package. I'll need to open a PR on that repository to update the endpoint.

@charles-turner-1

charles-turner-1 commented Feb 11, 2025

IIRC, I think it should be just payu_run, not api_payu_run - the API bit should be attached to ApiHandler.SERVER_URL... I think?

> Note: tests are failing due to "Endpoint for 'api_payu_run' not found" in the released access_py_telemetry package. I'll need to open a PR on that repository to update the endpoint.

Edit: Not necessarily true, but I think it will make life easier.

@jo-basevi (Collaborator, Author)

I haven't opened a PR in tracking-services to link to, but the endpoints in the develop branch have /api/*. Using payu_run with that pattern led to endpoints of /payu/run. So should access-py-telemetry automatically prefix the endpoint with /api/? Otherwise I was going to just modify the config.yaml in access-py-telemetry to have:

api:
  payu:
    run:

@jo-basevi (Collaborator, Author)

Oops, didn't see the latest edit - yeah, modifying the server_url to have api at the end would also work. I'll revert to payu_run.

@charles-turner-1

Yeah, just saves having a

api:
  service_1:
    subtree:
api:
  service_2:
    subtree:

situation where we have to put api: at the start of each service

jo-basevi self-assigned this on Feb 18, 2025