
Track jobs status in the runs table #350

Closed
jeanetteclark opened this issue Apr 5, 2023 · 7 comments
@jeanetteclark
Collaborator

jeanetteclark commented Apr 5, 2023

related to: #327

If we use a preemptive ack to keep unclosed connections from piling up, we need to keep track of job status. The code should:

  1. look for an entry in the runs table with that metadata pid using Model.Run.getRun() [Worker:469]
  2. if none exists, insert an entry into the runs Postgres table with a status of "running" and run_count of 1 when the worker receives a RabbitMQ message from the controller [TODO Worker:480]. If an entry already exists, set run_count to n+1. If run_count > 10 (or some other number), update status to "failed" and exit using Worker.Run.save()
  3. run the checks
  4. update the runs table entry so the status is "success" when the worker finishes
  5. periodically sweep the runs table for entries where the status is "processing" and the timestamp is more than 24 hours old (or some similar timeframe)
  6. requeue the running jobs, sending the flow back to the worker, which starts at step 1 above
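The bookkeeping in steps 1–2 and 4 can be sketched as follows. This is a minimal in-memory stand-in (a HashMap playing the role of the runs table, and RunTracker/registerAttempt are hypothetical names); the real code would go through Model.Run.getRun() and Worker.Run.save() against Postgres, and the threshold of 10 would ideally be configurable.

```java
import java.util.HashMap;
import java.util.Map;

public class RunTracker {
    static final int MAX_RUN_COUNT = 10; // threshold from step 2; configurable in practice

    static class Run {
        String status;
        int runCount;
        Run(String status, int runCount) { this.status = status; this.runCount = runCount; }
    }

    // In-memory stand-in for the runs table
    final Map<String, Run> runs = new HashMap<>();

    /** Steps 1-2: returns true if the worker should go on to run the checks. */
    boolean registerAttempt(String metadataPid) {
        Run run = runs.get(metadataPid);                  // step 1: Model.Run.getRun()
        if (run == null) {
            runs.put(metadataPid, new Run("running", 1)); // step 2: first attempt
            return true;
        }
        run.runCount += 1;                                // step 2: run_count = n + 1
        if (run.runCount > MAX_RUN_COUNT) {
            run.status = "failed";                        // give up on flapping jobs
            return false;
        }
        run.status = "running";
        return true;
    }

    /** Step 4: called when the worker finishes the checks. */
    void markSuccess(String metadataPid) {
        runs.get(metadataPid).status = "success";
    }
}
```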
@jeanetteclark
Collaborator Author

After talking to Matt last week, we decided that the audits for dangling jobs should be done in the Controller class, either using quartz or by creating a new thread. The method to retrieve pids for dangling jobs is fine where it is, but should be defined for all of the stores (local and fileSystem).
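If the new-thread route were taken instead of quartz, the audit could be a scheduled task owned by the Controller. A rough sketch, with the pid-retrieval and requeue steps passed in as hypothetical hooks (the real names would be whatever the store method and processQualityRequest() end up being):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class DanglingJobAuditor {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Periodically fetch dangling pids and hand each one back to the queue. */
    public void start(Callable<List<String>> findDanglingPids, Consumer<String> requeue,
                      long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                for (String pid : findDanglingPids.call()) {
                    requeue.accept(pid); // e.g. Controller.processQualityRequest(...)
                }
            } catch (Exception e) {
                e.printStackTrace(); // log so one bad sweep doesn't kill the thread
            }
        }, period, period, unit);
    }

    public void stop() { scheduler.shutdownNow(); }
}
```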

@jeanetteclark
Collaborator Author

jeanetteclark commented May 10, 2023

So far, we have:

  • a new controller method monitor(), which calls the MonitorJob quartz job subclass
  • MonitorJob queries the DB for runs where the status is "processing" and the timestamp is > 24 hours old
  • any runs returned are then submitted to processQualityRequest() from the Controller class
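The stale-run query MonitorJob needs can be built from a cutoff timestamp. A sketch under assumptions: the column names (metadata_id, status, timestamp) are guesses from the discussion above, the real code should use a PreparedStatement rather than string concatenation, and the 24-hour window should be configurable.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MonitorQuery {
    /** Runs still "processing" whose timestamp is older than maxAgeHours. */
    public static String staleRunsQuery(Instant now, long maxAgeHours) {
        Instant cutoff = now.minus(maxAgeHours, ChronoUnit.HOURS);
        // Shown as a plain string for clarity; parameterize in real code.
        return "SELECT metadata_id FROM runs WHERE status = 'processing'"
                + " AND timestamp < '" + cutoff + "'";
    }
}
```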

things not yet done:

  • add the "processing" status insert before the ack in the worker
  • implement the try counter
  • improve documentation
  • make it configurable

@jeanetteclark
Collaborator Author

jeanetteclark commented May 12, 2023

Got a working version of the monitor method with MonitorJob that successfully picked up a run stuck in "processing" for more than 24 hours in the SQL DB and resubmitted it to the worker, which (in this test) re-ran the job and changed the status in the DB to success. yay!

still to do:

  • add the "processing" status insert before the ack in the worker
  • implement the try counter
  • improve documentation
  • make it configurable
  • write another test?

@mbjones
Member

mbjones commented May 12, 2023

🏆 Nicely done.

@jeanetteclark
Collaborator Author

jeanetteclark commented May 19, 2023

I think I've tested everything I can test as a unit test. Without building an integration test framework, the best testing I can do is setting up some local scenarios, which I'll describe below. It's very manual and a bit of a hack, unfortunately.

To confirm that the RMQ bug is fixed, recreate it by:

  • set the RMQ timeout to something really short (10 seconds via consumer_timeout below, which is in milliseconds) in /opt/homebrew/etc/rabbitmq/rabbitmq.conf:
    consumer_timeout = 10000
    log.console = true
  • use a test check that includes a sys.sleep(60)
  • move the basicAck on Worker:193 to Worker:403
  • start a worker and controller, and run bin/sendAssessmentTest.py
  • observe the timeout error
  • change the Worker code back to the tip of the branch
  • run bin/sendAssessmentTest.py again
  • observe no error and successful insert in DB
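The ordering change being exercised above (moving the basicAck ahead of the long-running checks) boils down to the sketch below. Channel here is a minimal stand-in for com.rabbitmq.client.Channel, and the surrounding status bookkeeping lives in the runs table as described earlier; the method and class names are illustrative, not the actual Worker API.

```java
public class PreemptiveAckWorker {
    /** Minimal stand-in for com.rabbitmq.client.Channel. */
    interface Channel {
        void basicAck(long deliveryTag, boolean multiple);
    }

    static void handleDelivery(Channel channel, long deliveryTag, Runnable runChecks) {
        // Ack immediately so RabbitMQ's consumer_timeout can't fire while the
        // checks run; the "processing" row in the runs table now tracks the job.
        channel.basicAck(deliveryTag, false);
        runChecks.run(); // may take far longer than consumer_timeout
        // On completion the worker updates the run's status to "success".
    }
}
```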

To confirm that the quartz job picks up pids stuck in processing (the scenario where the worker dies after it acks the message from the controller): while the controller and worker are running, trick the controller into finding an old run stuck in processing:

  • update runs set status='processing',timestamp='2022-05-16 11:26:38.932-07' where status='success';
  • observe that the controller picks up the job, passes it to the worker, and updates the status in the DB
  • confirm that the run_count is incremented correctly here as well

All of this is working for me. @mbjones, I know my tests are a hack, but after talking Wednesday I think this is the path forward to release. Let me know if you want to see anything else before a PR to develop.

@mbjones
Member

mbjones commented May 19, 2023

LGTM. Let's discuss getting db testing into the framework to simplify your future testing.

jeanetteclark added a commit that referenced this issue Jun 7, 2023
@jeanetteclark
Collaborator Author

jeanetteclark commented Jun 30, 2023

This is finished and working correctly (deployed on the dev cluster in the snapshot release).
