Updated floodscan backfilling #67
Conversation
Hannah and I went over the changes here a couple of times, and I feel the backfill logic works properly; it has been tested locally and on Databricks. If any odd issues arise in the future, I would guess they would be related to running the historical updates over an extensive amount of data. Since we are not planning to do that and are mostly keeping to the daily updates, backfilling as necessary, this PR should be good to go.
Thanks Hannah for the refactoring done here and for all the effort put into understanding this pipeline!
shutil.rmtree(sfed_dir)
shutil.rmtree(mfed_dir)
for file in os.listdir(self.local_raw_dir):
I remember adding this cleanup because I was maxing out on space in Databricks when running for all the dates available from 1998. I don't think we need to worry right now about re-running for all the dates again, so any changes here should be safe and just an optimisation.
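For context, a hedged sketch of what this cleanup step might look like in full; the loop body is truncated in the diff above, so the file-removal part is an assumption rather than the actual implementation:

```python
import os
import shutil

def cleanup_local_raw(sfed_dir, mfed_dir, local_raw_dir):
    # Remove the unzipped SFED/MFED directories to free local disk space
    shutil.rmtree(sfed_dir, ignore_errors=True)
    shutil.rmtree(mfed_dir, ignore_errors=True)
    # Remove any remaining files (e.g. the downloaded zip) from the raw dir.
    # This loop body is an assumed completion of the truncated diff above.
    for file in os.listdir(local_raw_dir):
        path = os.path.join(local_raw_dir, file)
        if os.path.isfile(path):
            os.remove(path)
```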
This PR replaces #66 as a simpler approach to backfilling any missing dates. We assume that all missing dates can be backfilled from the latest 90-days file (which is currently true for `prod`). This PR primarily changes the `query_api` function to more fully handle the `date` input argument by passing it to the modified `get_geotiff_from_90_days_file()` function. The backfilling will rerun `query_api()` for each missing date, which re-downloads and unzips the latest 90-days file each time. Note that this will be highly inefficient if we have many missing dates -- it seems unlikely that this will ever be the case, but it is something to be aware of.
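For illustration, a minimal sketch of the backfill loop described above, assuming a `pipeline` object that exposes `query_api()`; the `existing_dates()` helper and the exact call signatures are assumptions, not the actual implementation:

```python
from datetime import date, timedelta

def backfill_missing_dates(pipeline, start: date, end: date):
    """Re-run query_api() for every date in [start, end] with no output yet."""
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    # `existing_dates()` is a hypothetical helper standing in for however the
    # pipeline determines which dates have already been processed.
    missing = sorted(expected - set(pipeline.existing_dates()))
    for d in missing:
        # Each call re-downloads and unzips the latest 90-days file, then
        # extracts the GeoTIFF for this specific date via
        # get_geotiff_from_90_days_file() -- inefficient if `missing` is large.
        pipeline.query_api(d)
```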