-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a retry mechanism for migrate_with_timeouts
#20
Conversation
e23f421
to
7eefab0
Compare
Coverage Report Results
3 empty files skipped. |
89a50a5
to
87b7234
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice :)
I've added a suggestion that moves the code around a little.
src/django_pg_migration_tools/management/commands/migrate_with_timeouts.py
Show resolved
Hide resolved
super().handle(*args, **options) | ||
def _migrate(self, *django_migrate_args: Any, **django_migrate_kwargs: Any) -> None: | ||
migration_applied: bool = False | ||
while self._can_migrate() and (migration_applied is False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while self._can_migrate() and (migration_applied is False): | |
while self._can_migrate(): |
and then do a return
instead of migration_applied = True
And then the raising of the exception does need the if not migration_applied
guard cause the only time that would be reached would be if the migration didn't happen.
# This raises ModuleNotFoundError, which gives a good explanation | ||
# of the error already (see tests). We don't have to wrap this into | ||
# our own exception. | ||
*module, callback = callback_path.split(".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in my suggestion I do a module, attr_name = callaback_path.rsplit(".")
here to avoid needing the ".".join(module). I also make sure the path has at least one dot in it and I assert callable(callback)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure of the need for retries on statement timeout — please convince me that the extra complexity they add is worthwhile.
def _parse_non_required_ms_option_to_timedelta( | ||
options: Any, key: str | ||
) -> datetime.timedelta | None: | ||
value = options.pop(key, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐼 The word "parse" in the function name suggests only parsing, not mutating the options
dict. Naming things is hard, but perhaps "pop" or "extract" would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[addressed] I've picked up the pop
prefix instead.
statement_timeouts: int | ||
statement_timeouts_max_retries: int | ||
statement_timeout_retry_exp: int | ||
statement_timeout_retry_max_wait: datetime.timedelta | ||
statement_timeout_retry_min_wait: datetime.timedelta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 There's a strong use case for having retries for lock timeouts, but I'm struggling to see why retrying in case of statement timeouts would be useful. Do we expect a statement that times out the first time to somehow become faster and not to timeout when retried? What's an example of such a statement where that's likely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect a statement that times out the first time to somehow become faster and not to timeout when retried?
Good question. I think that this is a possible situation because of temporary load spikes on the database, caused by either background tasks or maintenance tasks starving server resources.
E.g., a vacuum finishing between retries could make the schema change query execute faster and thus pass on a second attempt. Adding a unique constraint to a table with many dead rows is an example of such query. The constraint generation would need to step through many dead-rows and thus have to load more pages from disk to finish its job before the vacuum which could substantially change the results.
I do agree though, that it doesn't feel as useful as lock_timeout retries and its usage might be limited to fewer situations.
It almost makes sense to add a exponential back-off for setting statement_timeout
to ever increasing values up to a limit, but that sounds like a different feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the examples you've given do fit the scenario I asked about. But they're not terribly convincing as a justification for the need for this feature. Whereas lock timeouts + automatic retries is well-established best practice (e.g. this blogpost) for applying migrations.
And let's go back to why we might want a statement timeout for migrations in the first place. The main reason (as I understand it) is that we don't want migrations to take too long, because they delay the overall deployment process. So if we hit the statement timeout, isn't retrying the wrong thing to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problems, I will remove this part of the feature and if we find better use cases in the future we can add it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -31,6 +54,111 @@ def add_arguments(self, parser: base.CommandParser) -> None: | |||
required=False, | |||
help="Value to set as statement_timeout in milliseconds.", | |||
) | |||
parser.add_argument( | |||
"--retry-callback--path", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra hyphen: ...callback--path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help=( | ||
"How many times to retry after a lock timeout happened. " | ||
"Defaults to zero, which means no retries - the migration " | ||
"fails immediatelly upon lock timeout." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "immediately"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
required=False, | ||
default=2, | ||
help=( | ||
"The value for the exponent for retry backoff propagation. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐼 "propagation" doesn't seem like the right word, but it's not clear to me what the right word is. Would "delay" work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if value is None: | ||
return None | ||
assert isinstance(value, int) | ||
return datetime.timedelta(seconds=int(value / 1_000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have the argument in milliseconds if it's going to be rounded down to a whole number of seconds anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | ||
self._attempt_callback(exc) | ||
|
||
if migration_applied is False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pet peeve: comparison of booleans.
Prefer: if not migration_applied:
(and also above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# current_attempt is an integer, but it is turned into a float here | ||
# because a huge exponentiation in Python between integers | ||
# **never** overflows. Instead, the CPU is left trying to calculate | ||
# the result forever and it will eventually return a memory error | ||
# instead. Which we absolutely do not want. Please see: | ||
# https://docs.python.org/3.12/library/exceptions.html#OverflowError | ||
result = exp ** (float(current_attempt)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many retries do we expect users to attempt? More than thousands???
Counterpoint: this code will cope with any number of retries, and with no particular downsides apart from the need to explain why the cast to float is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in the thousands. Hopefully not even in the hundreds.
You can, of course, run into overflow errors in less than a thousand attempts for a high exponential value - but I also hope this not to be a common scenario. Giving that I am "hoping for many things" I agree that it's best to make the code work for any number of retries lest an arbitrary limiting number is incorrectly chosen and we have to go through a soft deprecation process.
docs/usage/management_commands.rst
Outdated
retry callback: | ||
+++++++++++++++ | ||
|
||
* ``--retry-callback--path``: Sets a callback to be called after a timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy-paste: remove double hyphen here: ...callback--path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to #20 (comment)
docs/usage/management_commands.rst
Outdated
--statement-timeout-retry-exp=2 \ | ||
--statement-timeout-retry-min-wait-in-ms=3000 \ | ||
--statement-timeout-retry-max-wait-in-ms=10000 \ | ||
--retry-callback--path="dotted.path.to.callback.function" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
♻️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to #20 (comment)
e56c8fc
to
a32c8a3
Compare
retry_callback: Callable[[RetryState], None] | None | ||
|
||
statement_timeout: datetime.timedelta | None | ||
|
||
lock_timeout: datetime.timedelta | None | ||
lock_timeouts: int | ||
lock_timeouts_max_retries: int | ||
lock_timeout_retry_exp: int | ||
lock_timeout_retry_max_wait: datetime.timedelta | ||
lock_timeout_retry_min_wait: datetime.timedelta | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these aren't needed anymore right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yikes yes, too many stashes. Fixup: d33f59d
statement_timeout=timeout_options.statement_timeout, | ||
): | ||
super().handle(*args, **options) | ||
migration_applied = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
migration_applied = True | |
return |
And then remove the migration_applied
flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout db207cd
b39a90d
to
c482024
Compare
This new dataclass is responsible for capturing the options necessary to run a migration with timeouts. It extracts the logic out of the Command class, so that once we add further configuration (a retry mechanism), the Command class can remain the same, at the expense of this dataclass growing instead.
This new class is responsible for parsing some of the input fields to the migrate_with_timeouts command that cannot be parsed with the `parser` class provided by Django. This also removes the parsing logic out of the MigrationTimeoutOptions class into a dedicated _Parser class.
c482024
to
2365f1e
Compare
class MigrateRetryStrategy: | ||
timeout_options: MigrationTimeoutOptions | ||
retries: int | ||
|
||
def __init__(self, timeout_options: MigrationTimeoutOptions): | ||
self.timeout_options = timeout_options | ||
self.retries = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not an dataclass.dataclass(frozen=True)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's something strange about putting that much logic behind a dataclass
"migrate_with_timeouts", | ||
lock_timeout_in_ms=50_000, | ||
lock_timeout_max_retries=2, | ||
retry_callback_path="tests.django_pg_migration_tools.management.commands.test_migrate_with_timeouts.example_callback", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retry_callback_path="tests.django_pg_migration_tools.management.commands.test_migrate_with_timeouts.example_callback", | |
retry_callback_path=f"{__name__}.example_callback", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
fixup: 9c0784b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice :)
The `migrate_with_timeouts` MC currently supports two flags: 1. --lock-timeout-in-ms 2. --statement-timeout-in-ms This commit will add changes to this command, so that we can embed retry mechanisms into it. The command will be able to be called as following: ```python from django.core.management import call_command call_command( "migrate_with_timeouts", # Original flags lock_timeout_in_ms=1_000, statement_timeout_in_ms=2_000, # A callback function to be called back between retries. retry_callback_path="dotted.path.to.callback.function", # Lock retry configuration lock_timeout_max_retries=2, # Retry twice lock_timeout_retry_exp=2, # back-off expo base lock_timeout_retry_min_wait_in_ms=3_000, # wait at least 3s lock_timeout_retry_max_wait_in_ms=10_000, # wait no more than 10s ``` During design, adding a retry mechanism for statement timeouts was discarded as no apparent use for it existed at the time. Therefore, there's only support for lock retries.
To include input fields for the retry mechanism.
9c0784b
to
92fd7b3
Compare
This branch adds a retry mechanism for
migrate_with_timeouts
.Before
After
./manage.py migrate_with_timeouts \ --lock-timeout-in-ms=10000 \ --lock-timeout-max-retries=3 \ --lock-timeout-retry-exp=2 \ --lock-timeout-retry-min-wait-in-ms=3000 \ --lock-timeout-retry-max-wait-in-ms=10000 \ --statement-timeout-in-ms=10000 \ --retry-callback--path="dotted.path.to.callback.function"