
cancelled_flow_query very slow #17080

Open
mattijsdp opened this issue Feb 10, 2025 · 3 comments · May be fixed by #17095
Labels
bug: Something isn't working
great writeup: This is a wonderful example of our standards

Comments

@mattijsdp

Bug summary

subflow_query in the cancellation_cleanup service selects all columns from the flow_run table, which has a huge effect on query runtime. From what I can see, only ~4 columns are actually required to cancel the subflows. I believe SQLAlchemy supports partial objects, so the code change would be minimal. On our Prefect database*, a select * query takes 33s versus 0.7s for select id, state_type, parent_task_run_id, deployment_id. I think this might also help with issue #16299.

Current query (as I understand it):

SELECT *
FROM flow_run
WHERE state_type IN ('PENDING', 'SCHEDULED', 'RUNNING', 'PAUSED', 'CANCELLING')
    AND id > '00000000-0000-0000-0000-000000000000'
    AND parent_task_run_id IS NOT NULL
ORDER BY id
LIMIT 200;

Proposed query:

SELECT id, state_type, parent_task_run_id, deployment_id  -- Only essential columns
FROM flow_run
WHERE state_type IN ('PENDING', 'SCHEDULED', 'RUNNING', 'PAUSED', 'CANCELLING')
    AND id > '00000000-0000-0000-0000-000000000000'
    AND parent_task_run_id IS NOT NULL
ORDER BY id
LIMIT 200;

Background: we have been struggling with our Postgres database being under heavy load, and as a result both RecentDeploymentsScheduler and CancellationCleanup have been taking longer than their loop intervals (~5-8s and ~50-70s versus the default 5s and 20s, respectively). Rather than just beefing up our database, it seemed there was some potential for efficiency improvements. Looking at the top SQL statements, the above query is the heaviest by an order of magnitude. Disclaimer: I don't have much experience with databases.

*Our Prefect database is probably moderately sized: ~400k flow runs

Version info

Version:             3.0.2
API version:         0.8.4
Python version:      3.11.11
Git commit:          c846de02
Built:               Fri, Sep 13, 2024 10:48 AM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         server
Pydantic version:    2.10.5

Additional context

No response

@mattijsdp added the bug label on Feb 10, 2025
@zzstoatzz linked pull request #17095 on Feb 11, 2025 (may close this issue)
@zzstoatzz added the great writeup label on Feb 11, 2025
@zzstoatzz (Collaborator)

hi @mattijsdp - thanks for the well-written issue!

The suggestion makes sense to me, so I've taken a crack at this optimization in #17095.

If this isn't already merged into main by the time you read this, you can install directly from the branch to test it out:

pip install git+https://github.com/prefecthq/prefect.git@cancelled-flow-query

@mattijsdp (Author)

@zzstoatzz great, thanks for the quick response and fix!

I tried to install from the branch, but that didn't build the dashboard... I'll update when it's released.

@mattijsdp (Author)

Small update: I think the above would fix our problem, but the underlying issue was actually that we had a few (~40) flow runs with a large matrix (~50MB) as an input parameter, which made queries against the flow_run table very slow.
