
Track jobs status in the runs table #350

Closed
jeanetteclark opened this issue Apr 5, 2023 · 7 comments
@jeanetteclark
Collaborator

jeanetteclark commented Apr 5, 2023

related to: #327

If we use a preemptive ack to keep unclosed connections from piling up, we need to keep track of job status. The code should:

  1. look for an entry in the runs table with that metadata pid using Model.Run.getRun() [Worker:469]
  2. if none exists, insert an entry into the runs Postgres table with a status of "running" and run_count of 1 when the worker receives a RabbitMQ message from the controller [TODO Worker:480]. If an entry already exists, set run_count to n+1. If run_count > 10 (or some other number), update status to "failed" and exit using Worker.Run.save()
  3. run the checks
  4. update the runs table entry so the status is "success" when the worker finishes
  5. periodically sweep the runs table for entries where the status is "processing" and the timestamp is more than 24 hours old (or some similar timeframe)
  6. requeue the running jobs, sending the flow back to the worker, which starts at step 1 above
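The bookkeeping in steps 1–2 and 4 can be sketched as follows. This is a minimal in-memory stand-in (a HashMap playing the role of the runs table, and RunTracker/registerAttempt are hypothetical names); the real code would go through Model.Run.getRun() and Worker.Run.save() against Postgres, and the threshold of 10 would ideally be configurable.

```java
import java.util.HashMap;
import java.util.Map;

public class RunTracker {
    static final int MAX_RUN_COUNT = 10; // threshold from step 2; configurable in practice

    static class Run {
        String status;
        int runCount;
        Run(String status, int runCount) { this.status = status; this.runCount = runCount; }
    }

    // In-memory stand-in for the runs table
    final Map<String, Run> runs = new HashMap<>();

    /** Steps 1-2: returns true if the worker should go on to run the checks. */
    boolean registerAttempt(String metadataPid) {
        Run run = runs.get(metadataPid);                  // step 1: Model.Run.getRun()
        if (run == null) {
            runs.put(metadataPid, new Run("running", 1)); // step 2: first attempt
            return true;
        }
        run.runCount += 1;                                // step 2: run_count = n + 1
        if (run.runCount > MAX_RUN_COUNT) {
            run.status = "failed";                        // give up on flapping jobs
            return false;
        }
        run.status = "running";
        return true;
    }

    /** Step 4: called when the worker finishes the checks. */
    void markSuccess(String metadataPid) {
        runs.get(metadataPid).status = "success";
    }
}
```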
@jeanetteclark
Collaborator Author

After talking to Matt last week, we decided that the audits for dangling jobs should be done in the Controller class, either using quartz or by creating a new thread. The method to retrieve pids for dangling jobs is fine where it is, but should be defined for all of the stores (local and fileSystem).
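If the new-thread route were taken instead of quartz, the audit could be a scheduled task owned by the Controller. A rough sketch, with the pid-retrieval and requeue steps passed in as hypothetical hooks (the real names would be whatever the store method and processQualityRequest() end up being):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class DanglingJobAuditor {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Periodically fetch dangling pids and hand each one back to the queue. */
    public void start(Callable<List<String>> findDanglingPids, Consumer<String> requeue,
                      long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                for (String pid : findDanglingPids.call()) {
                    requeue.accept(pid); // e.g. Controller.processQualityRequest(...)
                }
            } catch (Exception e) {
                e.printStackTrace(); // log so one bad sweep doesn't kill the thread
            }
        }, period, period, unit);
    }

    public void stop() { scheduler.shutdownNow(); }
}
```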

@jeanetteclark
Collaborator Author

jeanetteclark commented May 10, 2023

So far, we have:

  • a new controller method monitor(), which calls the MonitorJob quartz job subclass
  • MonitorJob queries the DB for runs where the status is "processing" and the timestamp is > 24 hours old
  • any runs returned are then submitted to processQualityRequest() from the Controller class
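The stale-run query MonitorJob needs can be built from a cutoff timestamp. A sketch under assumptions: the column names (metadata_id, status, timestamp) are guesses from the discussion above, the real code should use a PreparedStatement rather than string concatenation, and the 24-hour window should be configurable.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MonitorQuery {
    /** Runs still "processing" whose timestamp is older than maxAgeHours. */
    public static String staleRunsQuery(Instant now, long maxAgeHours) {
        Instant cutoff = now.minus(maxAgeHours, ChronoUnit.HOURS);
        // Shown as a plain string for clarity; parameterize in real code.
        return "SELECT metadata_id FROM runs WHERE status = 'processing'"
                + " AND timestamp < '" + cutoff + "'";
    }
}
```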

things not yet done:

  • add the "processing" status insert before the ack in the worker
  • implement the try counter
  • improve documentation
  • make it configurable

@jeanetteclark
Collaborator Author

jeanetteclark commented May 12, 2023

Got a working version of the monitor method with MonitorJob that successfully picked up a run stuck in "processing" for more than 24 hours in the SQL DB and resubmitted it to the worker, which (in this test) re-ran the job and changed the status in the DB to success. yay!

still to do:

  • add the "processing" status insert before the ack in the worker
  • implement the try counter
  • improve documentation
  • make it configurable
  • write another test?

@mbjones
Member

mbjones commented May 12, 2023

🏆 Nicely done.

@jeanetteclark
Collaborator Author

jeanetteclark commented May 19, 2023

I think I've tested everything I can test as a unit test. Without building an integration test framework, the best testing I can do is setting up some local scenarios, which I'll describe below. It's very manual and a bit of a hack, unfortunately.

To confirm that the RMQ bug is fixed, recreate it by:

  • set the RMQ timeout to something really short (10 seconds via consumer_timeout below, which is in milliseconds) in /opt/homebrew/etc/rabbitmq/rabbitmq.conf:
    consumer_timeout = 10000
    log.console = true
  • use a test check that includes a sys.sleep(60)
  • move the basicAck on Worker:193 to Worker:403
  • start a worker and controller, and run bin/sendAssessmentTest.py
  • observe the timeout error
  • change the Worker code back to the tip of the branch
  • run bin/sendAssessmentTest.py again
  • observe no error and successful insert in DB
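The ordering change being exercised above (moving the basicAck ahead of the long-running checks) boils down to the sketch below. Channel here is a minimal stand-in for com.rabbitmq.client.Channel, and the surrounding status bookkeeping lives in the runs table as described earlier; the method and class names are illustrative, not the actual Worker API.

```java
public class PreemptiveAckWorker {
    /** Minimal stand-in for com.rabbitmq.client.Channel. */
    interface Channel {
        void basicAck(long deliveryTag, boolean multiple);
    }

    static void handleDelivery(Channel channel, long deliveryTag, Runnable runChecks) {
        // Ack immediately so RabbitMQ's consumer_timeout can't fire while the
        // checks run; the "processing" row in the runs table now tracks the job.
        channel.basicAck(deliveryTag, false);
        runChecks.run(); // may take far longer than consumer_timeout
        // On completion the worker updates the run's status to "success".
    }
}
```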

To confirm that the quartz job picks up pids stuck in processing (the scenario where the worker dies after it acks the message from the controller): while the controller and worker are running, trick the controller into finding an old run stuck in processing:

  • update runs set status='processing',timestamp='2022-05-16 11:26:38.932-07' where status='success';
  • observe that the controller picks up the job, passes it to the worker, and updates the status in the DB
  • confirm that the run_count is incremented correctly here as well

All of this is working for me. @mbjones, I know my tests are a hack, but after talking Wednesday I think this is the path forward to release. Let me know if you want to see anything else before a PR to develop.

@mbjones
Member

mbjones commented May 19, 2023

LGTM. Let's discuss getting db testing into the framework to simplify your future testing.

jeanetteclark added a commit that referenced this issue Jun 7, 2023
@jeanetteclark
Collaborator Author

jeanetteclark commented Jun 30, 2023

This is finished and working correctly (deployed on the dev cluster in the snapshot release).
