unclosed connections and channels in rabbitmq #327
it looks like there is a I think that this method should be called within a other locations where it needs to be implemented:
Let me know if this sounds right @mbjones. If so, I can implement those changes. I'll most likely need help with deployment. |
okay I pushed some things to |
So, we determined that the fix above isn't gonna work. The connections have to stay open because they are all both sending and receiving messages, so if we close them they won't be able to receive. I had a look at the k8s logs for a worker pod and see this:
This gets triggered in a catch block at Worker:362. Similarly, I see this in the rabbitmq pod logs:
From the docs:
I'm not sure of the exact sequence of events, but I think messages are going unacknowledged, causing channels to get shut down by this timeout. When the callback method is run for a new report, a new connection and channel are opened when the catch block (Worker:362) is hit, probably because the channel got closed on the previous timeout. Note that this catch block (unlike the try above it) does NOT have a call to I'm also not sure of the solution, but definitely a would love to discuss whether this makes sense @mbjones @iannesbitt. I will also see if I can recreate this fun behavior in the mini-program I made to try and mimic it. |
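For illustration, here is a minimal sketch of the consumer pattern being described, assuming the RabbitMQ Java client with manual acknowledgements; the class and method names are made up and this is not the actual Worker code. The point is that the failure path never acks or nacks the delivery and instead opens a fresh connection and channel, which leaks resources and leaves the original delivery unacknowledged until the broker's acknowledgement timeout closes the channel.

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

public class WorkerSketch extends DefaultConsumer {

    private final ConnectionFactory factory;

    public WorkerSketch(Channel channel, ConnectionFactory factory) {
        super(channel);
        this.factory = factory;
    }

    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body) {
        try {
            generateReport(body);                                     // may run longer than the broker's ack timeout
            getChannel().basicAck(envelope.getDeliveryTag(), false);  // ack happens only on the success path
        } catch (Exception e) {
            // No basicAck/basicNack here, so the delivery stays unacknowledged.
            // Opening a fresh connection and channel to publish the error result
            // works around the closed channel but leaks connections over time.
            try {
                Connection conn = factory.newConnection();
                Channel fresh = conn.createChannel();
                // ... publish a failure result on the new channel; neither is ever closed ...
            } catch (Exception ignored) {
                // swallowed for brevity in this sketch
            }
        }
    }

    private void generateReport(byte[] body) {
        // stand-in for long-running report generation
    }
}
```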
Nice writeup. You've definitely hit a core issue. As we discussed the other day, the whole design of using 2 queues is to get around these timeout problems. Clearly it was not sufficient. I think a sequence diagram will help conceptualize the flow so that it is clear what is happening. Similar issues might be latent in the new Metacat queue processing code, as illustrated in this sequence diagram. Let's discuss, maybe next week while we are in Santa Barbara -- a little whiteboard session might go a long way. |
Huh. Maybe this explains why the toy program always seems to have 1 unacked message whenever I feed the Consumer queue a bunch (10+) of things at a time... |
I agree Jeanette that the |
Now that I think about it, I think the 1 unacked message that I see when I fill up the toy program's queue may just represent the message that's currently being processed by the Worker, so there will always be one as long as the queue has unprocessed items. I was confusing this behavior with the behavior we saw in prod-metadig's queue, when we saw the admin interface report 3 unacked messages and no load. I thought this from the tutorial was interesting, as it also appears to describe a symptom of the issue:
|
@jeanetteclark @iannesbitt poking around, I realized Peter did generate some sequence diagrams -- the plant UML is in the docs folder, and here's a couple examples: |
Last night I removed the One thing we should consider is that there is a potential to lose jobs if the solution to this problem is to simply put a If we decide this is the solution, we will need some logic that removes the job from the queue if it fails continuously, but this should at least prevent unwarranted positive acks. We could also use this type of rejection with |
I just added jeanetteclark/rabbit-learning@9a6b691 to show an example of this. I think this type of fix would at least help keep the number of connections due to unacked messages from continuing to climb as happens at the moment. The three of us should discuss what should happen to queue messages that fail twice in a row (requeue or discard) because it's possible that this logic would just keep failures in a loop forever, which we obviously don't want. The other option is to keep track of failed message tags and discard the ones that go through the loop more than once (a total of 4 failures in the case of the toy logic). |
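For concreteness, here is a simplified sketch of that kind of requeue-on-failure logic against the RabbitMQ Java client. It is not the toy program's code: instead of tracking delivery tags it uses the broker's redelivered flag, so a failed message gets exactly one automatic retry before being discarded (or routed to a dead-letter exchange, if one is configured).

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import java.io.IOException;

public class RequeueOnceConsumer {

    // Requeue a failed delivery once; on the second failure, drop it
    // (or let a dead-letter exchange pick it up if the queue has one).
    public static void consume(Channel channel, String queueName) throws IOException {
        channel.basicConsume(queueName, false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body) throws IOException {
                try {
                    generateReport(body);
                    getChannel().basicAck(envelope.getDeliveryTag(), false);
                } catch (Exception e) {
                    boolean firstFailure = !envelope.isRedeliver();
                    // requeue=true on the first failure, requeue=false on a redelivery
                    getChannel().basicNack(envelope.getDeliveryTag(), false, firstFailure);
                }
            }
        });
    }

    private static void generateReport(byte[] body) {
        // stand-in for report generation
    }
}
```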
@jeanetteclark I left comments on commit 12d75eb. The deploy process is basically a helm upgrade, but first we have to build and publish a new image. I think we should use this opportunity to change where images are hosted, as Peter was using Docker Hub, which is not our working model. I could review this change with you to use GHCR instead. |
So initially, things are getting closed because of long processing jobs. See from the logs:
I think the solution might be to move the |
@mbjones pointed out on Slack that if we run things in the order:
we don't have a way of recovering if things fail in step 3. In doing a little googling around, I found this SO thread, which has an interesting idea, though it really seems like it is (again) rewriting a task queue that should be handling all that stuff for us. There is also the workaround of just increasing the timeout to something really long, which is by far the easiest solution but also the least satisfying. The only other solution would be figuring out how to make the controller anticipate a message from the completed queue for every message it sends to the quality queue. If it doesn't receive a completed message, it would re-queue its message to the quality queue. Again though, isn't the whole point of a work queue that you don't have to write all that stuff yourself? |
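Assuming the order being discussed is roughly receive, ack, then process, the preemptive-ack pattern looks something like the sketch below (RabbitMQ Java client, illustrative names). It makes the timeout problem go away, but a worker that dies inside the long-running step has already acked the message, which is exactly the step-3 recovery gap described above.

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import java.io.IOException;

public class PreemptiveAckConsumer {

    public static void consume(Channel channel, String queueName) throws IOException {
        channel.basicConsume(queueName, false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body) throws IOException {
                // Ack immediately so the long-running job can't trip the delivery ack timeout.
                getChannel().basicAck(envelope.getDeliveryTag(), false);
                // If the worker dies here, the message is already gone from the queue,
                // so something outside RMQ (e.g. a controller sweep) has to resubmit the job.
                generateReport(body);
            }
        });
    }

    private static void generateReport(byte[] body) {
        // stand-in for report generation
    }
}
```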
I went through that exact same internal debate yesterday. That strategy is called "premature acknowledgement" or "preemptive acknowledgement". I also spent a while poking around the interwebs for solutions to this - it seems to be a common problem/complaint. While mostly this will be a rare concern for us, the one place it could happen somewhat regularly is if k8s moves a worker from one node to another in the middle of processing a job, causing it to never finish. Our one saving grace is that report generation should be entirely idempotent. So, I think a periodic controller sweep to check for reports that were queued and acked but never reported back would be a decent solution - all it would have to do is resubmit the task to rmq again. If for some reason the first task did eventually complete, the second job should just produce identical output -- so no harm, no foul. The only concern to watch here is infinite cycles, where we just keep regenerating the same task over and over and it never completes in the window between sweeps (possibly due to a v. long process or bug). I see two fairly simple ways to guard against that:
In either case, once a job hits a certain number of retries, we'll want to shunt it off into a dead letter exchange for manual handling. This all assumes the Controller is managing the submission of quality messages, and that there is a way for it to track final completion of those -- maybe that's just a 'success' status for the run in the runs table. From a cursory inspection of the db, I don't see a way that quality messages are being tracked in the controller -- I think they are fire-and-forget right now. Happy to discuss. |
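A dead-letter exchange can be wired up at queue-declaration time; a rough sketch, with entirely illustrative names (the engine's actual queue names and durability settings may differ):

```java
import com.rabbitmq.client.BuiltinExchangeType;
import com.rabbitmq.client.Channel;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DeadLetterSetup {

    public static void declare(Channel channel) throws IOException {
        // Exchange and queue that collect messages rejected with requeue=false
        // (or that exceed a retry limit) so they can be dealt with manually.
        channel.exchangeDeclare("quality-dlx", BuiltinExchangeType.FANOUT, true);
        channel.queueDeclare("quality-dead", true, false, false, null);
        channel.queueBind("quality-dead", "quality-dlx", "");

        // The work queue routes dead-lettered messages to the exchange above.
        Map<String, Object> args = new HashMap<>();
        args.put("x-dead-letter-exchange", "quality-dlx");
        channel.queueDeclare("quality", true, false, false, args);
    }
}
```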
With a preemptive ack, we will definitely need to do some kind of sweep to make sure we don't lose jobs. I was just able to recreate the initial timeout issue locally (in a toy program). I fixed it using the preemptive ack; however, if I kill the worker mid-process, that message (as expected) gets lost and the job never completes. Other jobs queued up behind it are still completed and finished as normal. You are correct that it is fire and forget on the scheduler side. One way to implement a sweep could be to insert it into the RequestReportJob.java class (linked above). If we use the This keeps RMQ out of it, but I think it might be the easiest way. Otherwise I think we would need to figure out a way to store RMQ message payloads temporarily in the Controller, and that doesn't really make sense to me since we already do that in the runs table. Thoughts? |
Glad you were able to replicate the issue! Your approach sounds reasonable, although possibly complicated in that you are recalculating the PID/SUITE combo for everything as you check it, which requires querying the dataone servers, which change state frequently. So reconstructing things might not be simple. Another way might be to insert an entry in the runs table when the first job ack comes back, with basic fields filled in and a status of 'running' or something like that, along with a 'run_count' counter. When the completed queue fires and runs, the run row would be updated with full run info and status='success' as it is now. Checking for dangling jobs is then a matter of a simple SQL query for runs with status 'running' that were started delta time ago or longer (e.g., 24 hours or something). The info in the table would be sufficient to requeue the job and increment the |
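The sweep itself could then be as simple as a scheduled query plus a resubmit; a sketch, where the runs table columns (pid, suite_id, status, run_count, timestamp) and the PostgreSQL-style interval syntax are assumptions rather than the actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DanglingRunSweep {

    // Find runs that were acked and marked 'running' but never completed,
    // and resubmit them until a retry cap is reached.
    public static void sweep(Connection db) throws SQLException {
        String sql = "SELECT pid, suite_id, run_count FROM runs "
                   + "WHERE status = 'running' "
                   + "AND timestamp < now() - interval '24 hours' "
                   + "AND run_count < 3";
        try (PreparedStatement stmt = db.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                // Republish the pid/suite combo to the quality queue and bump run_count.
                resubmit(rs.getString("pid"), rs.getString("suite_id"));
            }
        }
    }

    private static void resubmit(String pid, String suite) {
        // stand-in for publishing a new quality message and incrementing run_count
    }
}
```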
Why would I need to recalculate everything? As far as I can tell, an object only gets inserted into the runs table if it succeeds. So if the query to Solr says pid0, pid1, pid2 were all updated between harvest0 and harvest1, and I do think that your way is more robust though, since it actually tracks the status of the jobs. I think this may have been the original intent of the runs table, but it doesn't seem to do that |
I think you need more than just the pid updates. The determination of which runs need to happen is the intersection of PID updates and SUITE schedule updates, and it is that intersection that you would have to recalculate each time to figure out what was missed. It may not even be possible: if the regex to match formatIds changes in the schedule, then you will no longer know whether some PIDs are missing from a previously requested schedule. For example, imagine a schedule change from this:
to this:
With that change, any ISO runs that would have been in the previous request are no longer calculable from the schedule. The approach I am outlining records the fact that a PID/SUITE combo was requested at some point in time, and makes sure that it eventually succeeds even if the schedule changes. While it's unlikely that we will reduce the scope of these schedule regexes, I always like it when systems behave consistently -- much easier to debug. |
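As a purely hypothetical illustration of that kind of schedule change (these are not the real schedule entries), a formatId filter that once matched both EML and ISO records could be narrowed to EML only, at which point the previously requested ISO PIDs can no longer be derived from the schedule:

```java
public class ScheduleRegexExample {
    // Hypothetical formatId filters, before and after a schedule change.
    static final String BEFORE = "^eml.*|^http://www\\.isotc211\\.org.*"; // EML and ISO records
    static final String AFTER  = "^eml.*";                                // EML only; ISO runs are dropped
}
```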
ohhh I see, that makes sense. okay I'll start wrapping my head around the change with the modifications to the runs table - and create a separate issue for it since I think we are well beyond just unclosed connections at this point. |
After lots of testing, the solution described above seems to be working correctly on the dev cluster using the snapshot release. To test, I submitted batches of 100 of the largest metadata records on the ADC a few times to the test server and observed both the logs and the RMQ dashboard. The only error I am seeing is #360, which seems unrelated to any of the code changed here. |
The current implementation seems to leave large numbers of connections and channels open. Generally we only want one or a few connections open for each producer and consumer, and they would typically have one channel per thread. See https://www.rabbitmq.com/connections.html#basics
Looking at the current management interface, in production I see:
I also note that in the code, I see only 4 places where these are created:
However, it also seems like they are not closed properly when the class is destroyed. At a minimum this would happen when new Worker and Scorer consumer pods are created.
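For reference, the cleanup being asked for would look roughly like the sketch below (RabbitMQ Java client; the method name is illustrative): close the channel and then the connection when a producer or consumer class is torn down, e.g. from its own shutdown/close method.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;

public class RabbitCleanup {

    // Close the channel first, then its connection, swallowing errors so one
    // failure doesn't prevent the other resource from being released.
    public static void closeQuietly(Channel channel, Connection connection) {
        try {
            if (channel != null && channel.isOpen()) {
                channel.close();
            }
        } catch (Exception e) {
            // log and continue
        }
        try {
            if (connection != null && connection.isOpen()) {
                connection.close();
            }
        } catch (Exception e) {
            // log and continue
        }
    }
}
```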