
Updated floodscan backfilling #67

Merged: 10 commits merged into main on Jan 23, 2025
Conversation

@hannahker (Collaborator) commented Jan 23, 2025

This PR replaces #66 with a simpler approach to backfilling any missing dates. We assume that all missing dates can be backfilled from the latest 90-days file (which is currently true for prod).

This PR primarily changes the query_api function to handle the date input argument more fully, by passing it through to the modified get_geotiff_from_90_days_file() function. The backfilling reruns query_api() for each missing date, which redownloads and unzips the latest 90-days file every time. Note that this will be highly inefficient if we have many missing dates; it seems unlikely that this will ever be the case, but it is something to be aware of.
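For illustration, here is a minimal sketch of the backfilling loop described above. The pipeline object, get_existing_dates(), and the exact query_api(date=...) signature are assumptions made for the sketch, not the module's actual interface.

import datetime

def backfill_missing_dates(pipeline, days_back: int = 90):
    """Re-run query_api() for each date missing from storage.

    Each call re-downloads and unzips the latest 90-days file, so this
    is only acceptable while the number of missing dates stays small.
    """
    today = datetime.date.today()
    expected = {
        today - datetime.timedelta(days=i) for i in range(1, days_back + 1)
    }
    existing = set(pipeline.get_existing_dates())  # dates already processed
    missing = sorted(expected - existing)

    for date in missing:
        # query_api() passes `date` through to get_geotiff_from_90_days_file(),
        # which pulls that day's raster out of the freshly downloaded archive.
        pipeline.query_api(date=date)

    return missing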

@hannahker marked this pull request as ready for review January 23, 2025 16:17
@isatotun (Collaborator) left a comment


Hannah and I went over the changes here a couple of times, and I feel the backfill logic works properly; it has been tested both locally and on Databricks. If any odd issues arise in the future, I would guess they would come from running the historical updates over an extensive amount of data. Since we are not planning to do that, and are mostly just keeping up with the daily updates and backfilling as necessary, this PR should be good to go.
Thanks Hannah for the refactoring done here and for all the effort put into understanding this pipeline!


shutil.rmtree(sfed_dir)
shutil.rmtree(mfed_dir)
for file in os.listdir(self.local_raw_dir):

I remember adding this cleanup because I was maxing out on space in Databricks when running for all the dates available from 1998. I don't think we need to worry right now about re-running for all the dates again, so any changes here should be safe and are just an optimisation.

@hannahker merged commit a52685a into main Jan 23, 2025
1 check passed