Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a retry mechanism for migrate_with_timeouts #20

Merged
merged 4 commits into from
Sep 19, 2024

Conversation

marcelofern
Copy link
Contributor

@marcelofern marcelofern commented Sep 3, 2024

This branch adds a retry mechanism for migrate_with_timeouts.

Before

  ./manage.py migrate_with_timeouts \
    --lock-timeout-in-ms=10000 \
    --statement-timeout-in-ms=10000

After

  ./manage.py migrate_with_timeouts \
    --lock-timeout-in-ms=10000 \
    --lock-timeout-max-retries=3 \
    --lock-timeout-retry-exp=2 \
    --lock-timeout-retry-min-wait-in-ms=3000 \
    --lock-timeout-retry-max-wait-in-ms=10000 \
    --statement-timeout-in-ms=10000 \
    --retry-callback--path="dotted.path.to.callback.function"

@marcelofern marcelofern marked this pull request as draft September 3, 2024 02:27
@marcelofern marcelofern force-pushed the add-retry-for-migrate branch 2 times, most recently from e23f421 to 7eefab0 Compare September 9, 2024 05:00
Copy link

github-actions bot commented Sep 9, 2024

Coverage Report Results

Name Stmts Miss Branch BrPart Cover
src/django_pg_migration_tools/management/commands/migrate_with_timeouts.py 114 0 38 0 100%
src/django_pg_migration_tools/operations.py 67 0 6 0 100%
src/django_pg_migration_tools/timeouts.py 73 5 36 2 92%
TOTAL 254 5 80 2 97%

3 empty files skipped.

@marcelofern marcelofern force-pushed the add-retry-for-migrate branch 5 times, most recently from 89a50a5 to 87b7234 Compare September 12, 2024 02:52
@marcelofern marcelofern marked this pull request as ready for review September 12, 2024 02:55
Copy link

@delfick delfick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice :)

I've added a suggestion that moves the code around a little.

super().handle(*args, **options)
def _migrate(self, *django_migrate_args: Any, **django_migrate_kwargs: Any) -> None:
migration_applied: bool = False
while self._can_migrate() and (migration_applied is False):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
while self._can_migrate() and (migration_applied is False):
while self._can_migrate():

and then do a return instead of migration_applied = True

And then the raising of the exception does need the if not migration_applied guard cause the only time that would be reached would be if the migration didn't happen.

# This raises ModuleNotFoundError, which gives a good explanation
# of the error already (see tests). We don't have to wrap this into
# our own exception.
*module, callback = callback_path.split(".")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my suggestion I do a module, attr_name = callaback_path.rsplit(".") here to avoid needing the ".".join(module). I also make sure the path has at least one dot in it and I assert callable(callback)

Copy link

@timb07 timb07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure of the need for retries on statement timeout — please convince me that the extra complexity they add is worthwhile.

Comment on lines 66 to 251
def _parse_non_required_ms_option_to_timedelta(
options: Any, key: str
) -> datetime.timedelta | None:
value = options.pop(key, None)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐼 The word "parse" in the function name suggests only parsing, not mutating the options dict. Naming things is hard, but perhaps "pop" or "extract" would be better.

Copy link
Contributor Author

@marcelofern marcelofern Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[addressed] I've picked up the pop prefix instead.

Comment on lines 24 to 28
statement_timeouts: int
statement_timeouts_max_retries: int
statement_timeout_retry_exp: int
statement_timeout_retry_max_wait: datetime.timedelta
statement_timeout_retry_min_wait: datetime.timedelta
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 There's a strong use case for having retries for lock timeouts, but I'm struggling to see why retrying in case of statement timeouts would be useful. Do we expect a statement that times out the first time to somehow become faster and not to timeout when retried? What's an example of such a statement where that's likely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timb07

Do we expect a statement that times out the first time to somehow become faster and not to timeout when retried?

Good question. I think that this is a possible situation because of temporary load spikes on the database, caused by either background tasks or maintenance tasks starving server resources.

E.g., a vacuum finishing between retries could make the schema change query execute faster and thus pass on a second attempt. Adding a unique constraint to a table with many dead rows is an example of such query. The constraint generation would need to step through many dead-rows and thus have to load more pages from disk to finish its job before the vacuum which could substantially change the results.

I do agree though, that it doesn't feel as useful as lock_timeout retries and its usage might be limited to fewer situations.

It almost makes sense to add a exponential back-off for setting statement_timeout to ever increasing values up to a limit, but that sounds like a different feature.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the examples you've given do fit the scenario I asked about. But they're not terribly convincing as a justification for the need for this feature. Whereas lock timeouts + automatic retries is well-established best practice (e.g. this blogpost) for applying migrations.

And let's go back to why we might want a statement timeout for migrations in the first place. The main reason (as I understand it) is that we don't want migrations to take too long, because they delay the overall deployment process. So if we hit the statement timeout, isn't retrying the wrong thing to do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problems, I will remove this part of the feature and if we find better use cases in the future we can add it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in:

@@ -31,6 +54,111 @@ def add_arguments(self, parser: base.CommandParser) -> None:
required=False,
help="Value to set as statement_timeout in milliseconds.",
)
parser.add_argument(
"--retry-callback--path",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra hyphen: ...callback--path

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fat finger + auto complete gone wrong:

help=(
"How many times to retry after a lock timeout happened. "
"Defaults to zero, which means no retries - the migration "
"fails immediatelly upon lock timeout."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "immediately"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required=False,
default=2,
help=(
"The value for the exponent for retry backoff propagation. "
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐼 "propagation" doesn't seem like the right word, but it's not clear to me what the right word is. Would "delay" work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if value is None:
return None
assert isinstance(value, int)
return datetime.timedelta(seconds=int(value / 1_000))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have the argument in milliseconds if it's going to be rounded down to a whole number of seconds anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Addressed] That's a problem. I wasn't aware that timedelta had a miliseconds argument as I was biased by the author's preference for microseconds:

image

)
self._attempt_callback(exc)

if migration_applied is False:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pet peeve: comparison of booleans.

Prefer: if not migration_applied:

(and also above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 323 to 232
# current_attempt is an integer, but it is turned into a float here
# because a huge exponentiation in Python between integers
# **never** overflows. Instead, the CPU is left trying to calculate
# the result forever and it will eventually return a memory error
# instead. Which we absolutely do not want. Please see:
# https://docs.python.org/3.12/library/exceptions.html#OverflowError
result = exp ** (float(current_attempt))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many retries do we expect users to attempt? More than thousands???

Counterpoint: this code will cope with any number of retries, and with no particular downsides apart from the need to explain why the cast to float is done.

Copy link
Contributor Author

@marcelofern marcelofern Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in the thousands. Hopefully not even in the hundreds.
You can, of course, run into overflow errors in less than a thousand attempts for a high exponential value - but I also hope this not to be a common scenario. Giving that I am "hoping for many things" I agree that it's best to make the code work for any number of retries lest an arbitrary limiting number is incorrectly chosen and we have to go through a soft deprecation process.

retry callback:
+++++++++++++++

* ``--retry-callback--path``: Sets a callback to be called after a timeout
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-paste: remove double hyphen here: ...callback--path

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to #20 (comment)

--statement-timeout-retry-exp=2 \
--statement-timeout-retry-min-wait-in-ms=3000 \
--statement-timeout-retry-max-wait-in-ms=10000 \
--retry-callback--path="dotted.path.to.callback.function"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to #20 (comment)

@marcelofern marcelofern force-pushed the add-retry-for-migrate branch 8 times, most recently from e56c8fc to a32c8a3 Compare September 17, 2024 03:40
Comment on lines 25 to 35
retry_callback: Callable[[RetryState], None] | None

statement_timeout: datetime.timedelta | None

lock_timeout: datetime.timedelta | None
lock_timeouts: int
lock_timeouts_max_retries: int
lock_timeout_retry_exp: int
lock_timeout_retry_max_wait: datetime.timedelta
lock_timeout_retry_min_wait: datetime.timedelta

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these aren't needed anymore right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yikes yes, too many stashes. Fixup: d33f59d

statement_timeout=timeout_options.statement_timeout,
):
super().handle(*args, **options)
migration_applied = True
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
migration_applied = True
return

And then remove the migration_applied flag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good shout db207cd

@marcelofern marcelofern force-pushed the add-retry-for-migrate branch 5 times, most recently from b39a90d to c482024 Compare September 19, 2024 02:20
This new dataclass is responsible for capturing the options necessary to
run a migration with timeouts.

It extracts the logic out of the Command class, so that once we add
further configuration (a retry mechanism), the Command class can remain
the same, at the expense of this dataclass growing instead.
This new class is responsible for parsing some of the input fields to
the migrate_with_timeouts command that cannot be parsed with the
`parser` class provided by Django.

This also removes the parsing logic out of the MigrationTimeoutOptions
class into a dedicated _Parser class.
Comment on lines +194 to +200
class MigrateRetryStrategy:
timeout_options: MigrationTimeoutOptions
retries: int

def __init__(self, timeout_options: MigrationTimeoutOptions):
self.timeout_options = timeout_options
self.retries = 0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not an dataclass.dataclass(frozen=True)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's something strange about putting that much logic behind a dataclass

"migrate_with_timeouts",
lock_timeout_in_ms=50_000,
lock_timeout_max_retries=2,
retry_callback_path="tests.django_pg_migration_tools.management.commands.test_migrate_with_timeouts.example_callback",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
retry_callback_path="tests.django_pg_migration_tools.management.commands.test_migrate_with_timeouts.example_callback",
retry_callback_path=f"{__name__}.example_callback",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

fixup: 9c0784b

Copy link

@delfick delfick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice :)

The `migrate_with_timeouts` MC currently supports two flags:

1. --lock-timeout-in-ms
2. --statement-timeout-in-ms

This commit will add changes to this command, so that we can embed retry
mechanisms into it. The command will be able to be called as following:

```python
from django.core.management import call_command

call_command(
    "migrate_with_timeouts",
    # Original flags
    lock_timeout_in_ms=1_000,
    statement_timeout_in_ms=2_000,

    # A callback function to be called back between retries.
    retry_callback_path="dotted.path.to.callback.function",

    # Lock retry configuration
    lock_timeout_max_retries=2,                    # Retry twice
    lock_timeout_retry_exp=2,                      # back-off expo base
    lock_timeout_retry_min_wait_in_ms=3_000,       # wait at least 3s
    lock_timeout_retry_max_wait_in_ms=10_000,      # wait no more than 10s
```

During design, adding a retry mechanism for statement timeouts was
discarded as no apparent use for it existed at the time. Therefore,
there's only support for lock retries.
To include input fields for the retry mechanism.
@marcelofern marcelofern merged commit 2cd7df2 into main Sep 19, 2024
10 checks passed
@marcelofern marcelofern deleted the add-retry-for-migrate branch September 19, 2024 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants