Track jobs status in the runs table #350
Comments
After talking to Matt last week, we decided that the audits for dangling jobs should be done in the controller class, either using Quartz or by creating a new thread. The method to retrieve pids for dangling jobs is okay where it is, but it should be defined for all of the stores (local and fileSystem). |
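As a rough illustration of the "new thread" option mentioned above, here is a minimal sketch that runs a periodic audit on a single background thread using the JDK's `ScheduledExecutorService`. It is not the project's implementation: the class name `DanglingJobAuditor`, the `findDanglingPids` supplier, and the `requeue` callback are all hypothetical, and resubmitting each pid is only an assumption about what the audit should do with what it finds. A Quartz job would wrap the same `auditOnce()` logic.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch: names and behavior are assumptions, not the project's API.
class DanglingJobAuditor {
    // Assumed lookup provided by whichever store is in use (local or fileSystem).
    private final Supplier<List<String>> findDanglingPids;
    // Assumed action for each dangling pid, e.g. resubmitting it to the queue.
    private final Consumer<String> requeue;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    DanglingJobAuditor(Supplier<List<String>> findDanglingPids, Consumer<String> requeue) {
        this.findDanglingPids = findDanglingPids;
        this.requeue = requeue;
    }

    // One audit pass: look up dangling pids and hand each one to the callback.
    void auditOnce() {
        for (String pid : findDanglingPids.get()) {
            requeue.accept(pid);
        }
    }

    // Run the audit repeatedly on a background thread.
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                auditOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so report and continue.
                System.err.println("audit failed: " + e.getMessage());
            }
        }, period, period, unit);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

Keeping the lookup behind a `Supplier` means the same auditor works regardless of which store supplies the pids, which fits the point that the retrieval method should be defined for all of the stores.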
So far we have: a new controller method. Things not yet done:
|
got a working version of the still to do:
|
🏆 Nicely done. |
I think I've tested everything I can test as a unit test. Without building an integration test framework, the best testing I can do is setting up some local scenarios, which I'll describe below. It's very manual and a bit of a hack, unfortunately. To confirm that the RMQ bug is fixed, recreate it by:
To confirm that the Quartz job picks up pids stuck in processing (the scenario where a worker dies after it acks the message from the controller), while the controller and worker are running, trick the controller into finding an old run stuck in processing:
All of this is working for me. @mbjones I know my tests are a hack, but after talking Wednesday I think this is the path forward to release. Let me know if you want to see anything else before a PR to |
LGTM. Let's discuss getting db testing into the framework to simplify your future testing. |
This is finished and working correctly (deployed on the dev cluster in the snapshot release). |
related to: #327
If we use a preemptive ack to keep unclosed connections from piling up, we need to keep track of job status. The code should:
- `runs` table with that metadata pid using `Model.Run.getRun()` [Worker:469]
- `runs` postgres table with a `status` of "running" and `run_count` to 1 when the worker receives a RabbitMQ message from the controller [TODO Worker:480]. If an entry already exists, set `run_count` to n+1. If `run_count` > 10 (or some other number), update status to "failed" and exit using `Worker.Run.save()`
- `runs` table entry so the status is "success" when the worker finishes
- `runs` table for entries where the status is "processing" and the timestamp is 24 hours old (or some similar timeframe)
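The bookkeeping described in this list can be sketched with an in-memory map standing in for the `runs` postgres table. This is only an illustration under stated assumptions: `RunRecord`, `RunTracker`, and every method name below are hypothetical and are not the project's `Model.Run` or `Worker` API. The list uses both "running" and "processing" for an in-flight job; the sketch uses "processing" so that the stale-entry audit matches. The limit of 10 and the 24-hour window are the placeholder values from the list.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row of the runs table; field names are illustrative only.
class RunRecord {
    String status;
    int runCount;
    Instant timestamp;
}

// Hypothetical in-memory stand-in for the status tracking described above.
class RunTracker {
    static final int MAX_RUN_COUNT = 10;                       // "or some other number"
    static final Duration STALE_AFTER = Duration.ofHours(24);  // "or some similar timeframe"

    private final Map<String, RunRecord> runs = new HashMap<>();

    // Called when the worker receives a message for this pid.
    // Returns false when the run has been attempted too often and is marked failed.
    boolean onMessageReceived(String pid, Instant now) {
        RunRecord run = runs.computeIfAbsent(pid, k -> new RunRecord());
        run.runCount += 1;  // 1 for a new entry, n+1 for an existing one
        run.timestamp = now;
        if (run.runCount > MAX_RUN_COUNT) {
            run.status = "failed";
            return false;
        }
        run.status = "processing";
        return true;
    }

    // Called when the worker finishes the job.
    void onFinished(String pid, Instant now) {
        RunRecord run = runs.get(pid);
        if (run != null) {
            run.status = "success";
            run.timestamp = now;
        }
    }

    // Audit: pids still "processing" whose timestamp is at least STALE_AFTER old.
    List<String> findStalePids(Instant now) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, RunRecord> entry : runs.entrySet()) {
            RunRecord run = entry.getValue();
            boolean old = Duration.between(run.timestamp, now).compareTo(STALE_AFTER) >= 0;
            if ("processing".equals(run.status) && old) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    String statusOf(String pid) {
        RunRecord run = runs.get(pid);
        return run == null ? null : run.status;
    }

    int runCountOf(String pid) {
        RunRecord run = runs.get(pid);
        return run == null ? 0 : run.runCount;
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside the methods, makes the 24-hour rule checkable in a unit test without waiting or touching a database.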