
cancelled_flow_query very slow #17080

Open
mattijsdp opened this issue Feb 10, 2025 · 3 comments · May be fixed by #17095
Labels
bug: Something isn't working
great writeup: This is a wonderful example of our standards

Comments

@mattijsdp

Bug summary

subflow_query in the cancellation_cleanup service selects all columns from the flow_run table, which has a huge effect on query runtime. From what I can see, only ~4 columns are actually required to cancel the subflows. I believe SQLAlchemy supports partial objects, so the code change would be minimal. On our Prefect database*, a select * query takes 33s versus 0.7s for select id, state_type, parent_task_run_id, deployment_id. I think this might also help with issue #16299.

Current query (as I understand it):

SELECT *
FROM flow_run
WHERE state_type IN ('PENDING', 'SCHEDULED', 'RUNNING', 'PAUSED', 'CANCELLING')
    AND id > '00000000-0000-0000-0000-000000000000'
    AND parent_task_run_id IS NOT NULL
ORDER BY id
LIMIT 200;

Proposed query:

SELECT id, state_type, parent_task_run_id, deployment_id  -- Only essential columns
FROM flow_run
WHERE state_type IN ('PENDING', 'SCHEDULED', 'RUNNING', 'PAUSED', 'CANCELLING')
    AND id > '00000000-0000-0000-0000-000000000000'
    AND parent_task_run_id IS NOT NULL
ORDER BY id
LIMIT 200;

Background: we have been struggling with our Postgres database being under heavy load, and as a result both RecentDeploymentsScheduler and CancellationCleanup have been taking longer than their loop intervals (~5-8s and ~50-70s versus the default 5s and 20s, respectively). Rather than just beefing up our database, it seemed there was some potential for efficiency improvements. Looking at the top SQL statements, the above query is the heaviest by an order of magnitude. Disclaimer: I don't have much experience with databases.

*Our Prefect database is probably moderately sized: ~400k flow runs

Version info

Version:             3.0.2
API version:         0.8.4
Python version:      3.11.11
Git commit:          c846de02
Built:               Fri, Sep 13, 2024 10:48 AM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         server
Pydantic version:    2.10.5

Additional context

No response

@mattijsdp added the bug label on Feb 10, 2025
@zzstoatzz linked pull request #17095 on Feb 11, 2025 (may close this issue)
@zzstoatzz added the great writeup label on Feb 11, 2025
@zzstoatzz (Collaborator)

hi @mattijsdp - thanks for the well-written issue!

The suggestion makes sense to me, so I've taken a crack at this optimization in #17095.

If this isn't already merged into main by the time you read this, you can install directly from the branch to test it out:

pip install git+https://github.com/prefecthq/prefect.git@cancelled-flow-query

@mattijsdp (Author)

@zzstoatzz great, thanks for the quick response and fix!

I tried to install from the branch, but that didn't build the dashboard... I'll update when it's released.

@mattijsdp (Author)

Small update: I think the above would fix our problem, but the underlying issue was actually that we had a few (~40) flow runs with a large matrix (~50MB) as an input parameter, which made queries against the flow_run table very slow.
