Add telemetry for payu run #558
- Added Telemetry class to store run state information
- Move posting the telemetry later in the payu run job - to after archive, or before payu exits with a model run error
- Add logic for writing job.json file at different payu stages - e.g. to work directory, archive, error logs directory
Note tests are failing due to

IIRC, I think it should be just

Edit: Not necessarily true but I think it will make life easier.
I haven't opened a PR in tracking-services to link to, but the endpoints in:

```yaml
api:
  payu:
    run:
```
Oops, didn't see the latest edit - yeah, modifying the `server_url` with `api` at the end would also work - I'll revert to
Yeah, just saves having a

```yaml
api:
  service_1:
    subtree:
api:
  service_2:
    subtree:
```

situation where we have to put
This PR follows on from @charles-turner-1's Pull Request ACCESS-NRI#1 for adding telemetry for payu runs.
Configuring telemetry
As telemetry should only be enabled for ACCESS-NRI deployed versions of payu, the plan is to store an external configuration file for telemetry which contains fields such as `server_url`. The path to this file will be stored in an environment variable set by released environments, e.g. `PAYU_TELEMETRY_CONFIG_PATH`. When this environment variable is set and the configuration file is present, payu will attempt to post run job information using the `access-py-telemetry` module. I've also added a payu `config.yaml` over-ride option to disable telemetry. An example telemetry file could look something like the following:
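(A minimal sketch only, assuming a YAML file; apart from `server_url`, the contents here are illustrative assumptions rather than the PR's actual config.)

```yaml
# Hypothetical external telemetry configuration file,
# pointed to by PAYU_TELEMETRY_CONFIG_PATH
server_url: https://tracking-services.example.org
```

The `config.yaml` over-ride to disable telemetry might look something like the following (the key name is an assumption, not necessarily the actual option):

```yaml
# Hypothetical config.yaml entry to turn telemetry off
telemetry: false
```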
The server URL is currently a persistent session hostname that is accessible from PBS jobs that do not have internet access.
Adding scheduler job information
Payu queries the scheduler (using `qstat` for PBS) to obtain job information such as resource usage. As this job information only gets updated periodically, I've moved posting the job information later - to after the model archive has run, or just before the program exits if there is a model run error. So hopefully this will pick up resource usage information closer to its final value.
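As a rough illustration of the kind of scheduler query involved (the function name and JSON handling below are assumptions, not payu's actual implementation), PBS job information can be read from `qstat`'s JSON output, which is also mentioned in the TODOs further down:

```python
import json
import subprocess


def get_pbs_job_info(job_id: str) -> dict:
    """Illustrative sketch: fetch full PBS job information as a dictionary."""
    # `qstat -f -F json <job_id>` prints the full job record as JSON,
    # including resources_used, Resource_List and the job's Variable_List.
    output = subprocess.check_output(
        ["qstat", "-f", "-F", "json", job_id], text=True
    )
    data = json.loads(output)
    # Records are keyed by the full job id (e.g. "12345.gadi-pbs") under "Jobs".
    jobs = data.get("Jobs", {})
    return next(iter(jobs.values()), {})
```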
Telemetry class instance

Added a `Telemetry` class in `telemetry.py` to keep track of payu run state information (e.g. model runtime, counters, n_runs), which is run just after the model is run in `Experiment.run`. Once the experiment run has finished (e.g. after archive), the scheduler information is queried, the extra fields are added, and an API request is sent using `access_py_telemetry`.
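A minimal sketch of the kind of class this describes (the class layout, method and field names are assumptions for illustration, not the PR's actual code):

```python
import json
from pathlib import Path


class Telemetry:
    """Illustrative container for payu run state information (sketch only)."""

    def __init__(self) -> None:
        self.fields: dict = {}

    def record_run_state(self, run_counter: int, n_runs: int,
                         model_runtime_seconds: float) -> None:
        # Captured just after the model run inside Experiment.run
        self.fields.update({
            "payu_current_run": run_counter,
            "payu_n_runs": n_runs,
            "payu_model_run_time_seconds": model_runtime_seconds,
        })

    def add_scheduler_info(self, scheduler_fields: dict) -> None:
        # Added once archive has finished, or just before exiting on a run error
        self.fields.update(scheduler_fields)

    def write_job_file(self, directory: Path) -> None:
        # job.json ends up in the error logs, work, or archive directory
        with open(directory / "job.json", "w") as f:
            json.dump(self.fields, f, indent=2)
```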
The extra fields - so the payu run state, metadata and scheduler fields - are also always logged to a `job.json` file. This file ends up in an error logs directory if the model exited with an error, in the work directory if the archive step is not enabled, or otherwise in the archive directory. An example record looks like the following:
Questions/TODOs
- Should all the scheduler fields be saved to `job.json`, and then just a filtered version posted for the telemetry? I'm currently saving to `job.json` the same extra fields that are being added to telemetry. Example `job.json` files:
  - `job.json` with filtered scheduler fields
  - `job.json` with all the scheduler fields
- There's a `payu_walltime_seconds` value - at the moment, it's the time from the initialised experiment to just after the model run command (e.g. `mpirun`). Should it instead cover the whole initialise-setup-run-archive loop, and should there be a separate time for just the `mpirun` command? I'm currently resetting the `payu_start_time` after a model run, for when multiple runs are running in one submit job, so as to have an idea of how long each one took (see the timing sketch after this list).
- Qstat can output JSON using `-F json` in the command - this gives a `pbs_version` and parses `resources_used`/`Resource_List`/`Variable_List` into dictionaries. The `pbs_version` could be useful for the scheduler version?
- Model runtime: model-driver code to parse out the model run time for ACCESS-OM2 and ESM1.5 config files?
- Check scheduler job information is accessible in a Slurm job on Setonix?
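As referenced in the `payu_walltime_seconds` item above, a rough, self-contained sketch of the per-run timing idea (the stand-in functions and the loop structure are assumptions, not payu's actual code):

```python
import time


def run_model() -> None:
    """Hypothetical stand-in for the model run step."""
    time.sleep(0.1)


def archive() -> None:
    """Hypothetical stand-in for the archive step."""
    time.sleep(0.05)


n_runs = 3
payu_start_time = time.time()
for _ in range(n_runs):
    run_model()
    # payu_walltime_seconds currently measures from the initialised experiment
    # to just after the model run command; the question above is whether it
    # should instead span the whole initialise-setup-run-archive loop.
    payu_walltime_seconds = time.time() - payu_start_time
    archive()
    print(f"payu_walltime_seconds: {payu_walltime_seconds:.2f}")
    # Reset so each run in a multi-run submit job gets its own timing.
    payu_start_time = time.time()
```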