Queue for faulty webhooks #391

SeanStayn · 2019-12-02T18:21:18Z · edited by robotdan

Queue for faulty webhooks

Problem

If a webhook is fired with the transaction setting "All the Webhooks must succeed" and the webhook endpoint is not reachable, the user is blocked.

The following image shows the blocking message during login:

We use webhooks for several use cases, for example a custom audit log system, or creating a shadow user in another database system that holds additional user information.

Solution

Developing a queue for webhooks that could not be sent would solve the problem.

The queue acts as a buffer that sends the unsent webhooks as soon as the webhook endpoint is available again.

Perhaps it is possible to add a settings option to enable the webhook queue.

For our system it is necessary that all events are successfully transmitted and that users are never blocked on the basis of faulty webhooks.

So it would be great if a concept similar to that of a broker with QoS were integrated into FusionAuth.

Alternatives/workarounds

As a workaround, it would be possible to build a redundant system: for example, three web servers acting as webhook endpoints, hosted at different locations, with the transaction setting set to "Any single Webhook must succeed" for the three webhooks.

However, this workaround only works for one use case (e.g. audit logs).
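
To make the difference between the two transaction settings mentioned above concrete, here is a minimal Python sketch. It is an illustration only, not FusionAuth code: the `TransactionLevel` enum and the `request_allowed` function are hypothetical names, and each boolean stands for whether one webhook endpoint acknowledged the event.

```python
from enum import Enum
from typing import Sequence


class TransactionLevel(Enum):
    # Hypothetical names that mirror the settings quoted in this issue.
    ALL_MUST_SUCCEED = "all"
    ANY_SINGLE_MUST_SUCCEED = "any"


def request_allowed(results: Sequence[bool], level: TransactionLevel) -> bool:
    """Decide whether the user-facing request may proceed.

    `results` holds one boolean per configured webhook endpoint:
    True if the endpoint acknowledged the event, False otherwise.
    """
    if not results:
        # No webhooks configured, so nothing can block the request.
        return True
    if level is TransactionLevel.ALL_MUST_SUCCEED:
        return all(results)
    return any(results)


# Three redundant endpoints, one of which is down.
results = [True, False, True]
print(request_allowed(results, TransactionLevel.ALL_MUST_SUCCEED))         # False
print(request_allowed(results, TransactionLevel.ANY_SINGLE_MUST_SUCCEED))  # True
```

With three redundant endpoints and "Any single Webhook must succeed", the user is only blocked when all three endpoints are unreachable at the same time.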

How to vote

Please give us a thumbs up or thumbs down as a reaction to help us prioritize this feature. Feel free to comment if you have a particular need or comment on how this feature should work.

robotdan · 2019-12-02T20:26:57Z

> If a webhook is fired with the transaction setting "All the Webhooks must succeed" and the webhook endpoint is not reachable, the user is blocked.

This is working as designed. If you have configured the transaction setting to require every webhook to succeed and at least one fails, for example because we cannot contact your endpoint, we fail the request.

Are you asking for us to complete the request, but then continue to send the event until it succeeds? To be truly transactional, we have to ensure everything is successful in a synchronous fashion.

If you reduce the TX level to "some must succeed", for example, then we will allow the request to succeed and we'll queue any failed attempts for retry. The retry queue will try up to 3 times before giving up.
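
As a rough illustration of the bounded retry behaviour described above, the following Python sketch retries each failed event up to 3 times and then gives up. It is not FusionAuth's implementation: `PendingEvent`, `drain_retry_queue`, and the `send` callback are hypothetical, and the limit of 3 is taken from this comment only.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List

MAX_RETRIES = 3  # value taken from the comment above, not a documented constant


@dataclass
class PendingEvent:
    payload: dict
    attempts: int = 0


def drain_retry_queue(
    queue: Deque[PendingEvent],
    send: Callable[[dict], bool],
    max_retries: int = MAX_RETRIES,
) -> List[PendingEvent]:
    """Make one pass over the queue and return the events that were given up on."""
    dropped: List[PendingEvent] = []
    for _ in range(len(queue)):
        event = queue.popleft()
        event.attempts += 1
        if send(event.payload):
            continue  # delivered, nothing more to do
        if event.attempts >= max_retries:
            dropped.append(event)  # retry budget exhausted: the event is lost
        else:
            queue.append(event)  # keep it for the next pass
    return dropped


# An endpoint that stays down for all three passes loses the event.
queue = deque([PendingEvent({"type": "example.event"})])
lost: List[PendingEvent] = []
for _ in range(MAX_RETRIES):
    lost.extend(drain_retry_queue(queue, lambda payload: False))
print(len(queue), len(lost))  # 0 1
```

The `dropped` list is the weak point for use cases such as audit logging: once the retry budget is used up, the event is gone.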

You can also use a Kafka integration instead of a Webhook to receive the event, and then you could leverage the Kafka service, which may provide you additional redundancies: https://fusionauth.io/docs/v1/tech/integrations/kafka

SeanStayn · 2019-12-03T10:41:14Z

> This is working as designed. If you have configured the transaction setting to require every webhook to succeed and at least one fails, for example because we cannot contact your endpoint, we fail the request.

Yes, this is how it is currently implemented, but it blocks the user, which is not acceptable for us.

> Are you asking for us to complete the request, but then continue to send the event until it succeeds?

Yes, this is exactly what we need.

> If you reduce the TX level to "some must succeed", for example, then we will allow the request to succeed and we'll queue any failed attempts for retry. The retry queue will try up to 3 times before giving up.

Cool, good to know. But with this TX level we can lose some events. If we could change the number of retries from 3 to unlimited, it could solve our problem.

> You can also use a Kafka integration instead of a Webhook to receive the event, and then you could leverage the Kafka service which may provide you additional redundancies

Kafka is an option as well, but if the Kafka broker goes down for hours, we lose events or the UI will be blocked as well.

We want to make sure that we receive every webhook event successfully, even if our log server fails for a few hours. It is important that although the log server is down, users can use FusionAuth as usual.

In a nutshell: We need extremely high availability of FusionAuth, but we still need to ensure that all webhook events are successfully transmitted.

Do you have another idea for this scenario?

robotdan · 2019-12-03T20:46:53Z

We could add an additional TX level that says "all must succeed... eventually" - and in this mode we would not block on the request but queue "forever" until success.

SeanStayn · 2019-12-05T10:45:29Z

That would be a very good solution for us.

Does "forever" mean the permanent persistence of the faulty webhook events? So that after a restart of FusionAuth the faulty webhook events are still available?

robotdan · 2019-12-06T16:44:48Z

> Does "forever" mean the permanent persistence of the faulty webhook events? So that after a restart of FusionAuth the faulty webhook events are still available?

That is correct. Off the top of my head, we'd persist the events and then have nodes work off of that queue based upon the TX level until we can complete the request.

Once we start persisting these events for these types of scenarios, we may also add a webhook event log so that there is visibility into the sent webhooks, pending events that have not yet been successfully sent, retry counts, etc.
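
One way to picture the "all must succeed... eventually" idea is a small persisted queue. The Python sketch below is only an assumption about how such a queue could look, not a description of how FusionAuth stores events: the `webhook_events` table, its columns, and the `open_queue`, `enqueue`, and `deliver_pending` functions are hypothetical, and SQLite merely stands in for whatever database is used. Events are written to disk before delivery is attempted, so they survive a restart, and the `attempts` column provides the retry count that an event log could display.

```python
import json
import os
import sqlite3
import tempfile
from typing import Callable


def open_queue(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk queue. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS webhook_events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               attempts INTEGER NOT NULL DEFAULT 0,
               delivered INTEGER NOT NULL DEFAULT 0
           )"""
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, payload: dict) -> None:
    """Persist the event first; the user-facing request does not wait for delivery."""
    conn.execute(
        "INSERT INTO webhook_events (payload) VALUES (?)", (json.dumps(payload),)
    )
    conn.commit()


def deliver_pending(conn: sqlite3.Connection, send: Callable[[dict], bool]) -> int:
    """Try every undelivered event once, oldest first; return how many are still pending."""
    rows = conn.execute(
        "SELECT id, payload FROM webhook_events WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    for event_id, payload in rows:
        ok = send(json.loads(payload))
        conn.execute(
            "UPDATE webhook_events SET attempts = attempts + 1, delivered = ? WHERE id = ?",
            (1 if ok else 0, event_id),
        )
    conn.commit()
    (pending,) = conn.execute(
        "SELECT COUNT(*) FROM webhook_events WHERE delivered = 0"
    ).fetchone()
    return pending


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "queue.db")
    conn = open_queue(db_path)
    enqueue(conn, {"type": "example.event"})
    print(deliver_pending(conn, lambda payload: False))  # endpoint down: 1 pending
    conn.close()  # simulate a restart
    conn = open_queue(db_path)
    print(deliver_pending(conn, lambda payload: True))   # endpoint back: 0 pending
    conn.close()
```

A real implementation with several nodes working off the same queue would also need row locking or a claim mechanism so that two nodes do not send the same event at once; the sketch leaves that out.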

SeanStayn · 2019-12-09T08:59:15Z

Wow, that sounds amazing. Great and very helpful idea! :)

robotdan added the "enhancement" (New feature or request) label Dec 30, 2019

robotdan self-assigned this Dec 30, 2019

This was referenced Nov 22, 2021:

Add a webhook event log #1314 (Closed)

Add thread pool metrics for Webhooks to the health check / status endpoint #1499 (Closed)

robotdan mentioned this issue Jan 7, 2022:

[Category] Webhooks #1543 (Open)
