
Regional analysis persistence cleanup #621

Open
ansoncfit opened this issue Oct 9, 2020 · 3 comments
Labels: optimization, p1 (Priority level 1: high, but not the top), t1 (Time level 1: think days)

Comments

ansoncfit commented Oct 9, 2020
How scenarios are saved (from @abyrd)

For a few reasons, we want to save a full scenario for each regional analysis run. This should allow exactly reproducing an old regional analysis on its associated worker version, or rendering a PDF/HTML report of the scenario for a completed regional analysis.

We currently save the full list of R5 modifications in MongoDB inside the regional analysis object. This doesn't meet the above goals for two reasons. First, our scenario report rendering works on UI/backend modifications, not R5 modifications. Second, we intentionally retain modification types that may be outdated or specific to a certain version of R5, and when the backend tries to fetch and deserialize such modifications it fails. This is already happening now that we have custom modifications.

Possible solutions:

1. Store UI/backend modifications in Mongo instead of R5 modifications, avoiding deserialization of the stored objects in case their structure differs from the ones known to the backend.
2. Store R5 modifications, but not in Mongo.

Regarding option 1: Storing UI/Backend modification objects in Mongo should make for less problematic migrations in the event that modifications change, because the same migration process could be applied to these stored scenarios as to the main store of scenarios. However this defeats the purpose of storing old scenarios - you might not be able to exactly reproduce the R5 modifications that were effectively used to run a specific regional analysis.

Regarding option 2: In fact we already do this, because every regional analysis saves its R5-formatted scenario to a file on S3. The only remaining problem is that there is no way to generate an HTML report for a completed regional analysis (because reports are generated from UI/backend modifications). If the modification structure may change, there will never be a reliable way to generate reports after the fact anyway. So maybe we should just generate a report for every regional analysis and store it on S3 alongside the regional results.
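
A rough sketch of that last idea, using the AWS SDK v1 directly; in practice the backend's own file storage layer would presumably be used, and the bucket and key names here are hypothetical:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class ReportArchiver {
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        /** Store a rendered HTML scenario report next to the regional results on S3. */
        public void storeReport(String regionalAnalysisId, String renderedHtml) {
            String key = regionalAnalysisId + "_report.html"; // hypothetical key scheme
            s3.putObject("analysis-results", key, renderedHtml);
        }
    }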

In the short term @trevorgerhardt has added exclusions to all the HTTP API endpoints that fetch Regional analyses so they don't fetch the subfield containing the scenario (in conveyal/analysis-backend#214).
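
For reference, the kind of exclusion described above can be expressed as a Mongo projection. Here is a minimal sketch with the MongoDB Java driver; the database and collection names are assumptions, but the request.scenario field path is the one named later in this issue:

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Projections;
    import org.bson.Document;

    public class RegionalAnalysisQuery {
        public static void main(String[] args) {
            MongoCollection<Document> regionalAnalyses = MongoClients.create()
                    .getDatabase("analysis")
                    .getCollection("regional-analyses");

            // Fetch regional analyses without the embedded scenario, which can be huge.
            for (Document doc : regionalAnalyses.find()
                    .projection(Projections.exclude("request.scenario"))) {
                System.out.println(doc.toJson());
            }
        }
    }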

Later we may just remove all this data from MongoDB entirely, since the R5 snapshot of each scenario is already being saved to S3 as a JSON file.

Superfluous copying and duplication of RegionalAnalysis fields (from @abyrd)

As noted in some inline code comments, many of the fields in the RegionalAnalysis Model/Mongo objects are duplicates of fields in the nested RegionalTask that is sent to the workers. When creating a regional analysis, besides this RegionalAnalysis instance, two distinct RegionalTask instances are made, one with an embedded scenario (nested in the RegionalAnalysis and saved in Mongo) and another with the scenario erased and replaced with its ID (saved in the Job as a more compact version for sending over the wire to the workers). Most of the fields in these objects are also duplicates of (or derived from) fields in the AnalysisRequest, which is less problematic and arguably good practice, as it keeps the external API used by the UI separate from internal representations.

It's not clear how many of these copies are actually needed. But this is certainly confusing and unmaintainable: because all these fields are mutable, I'm constantly second-guessing whether they're being overwritten, or why writes to the fields are split across different methods. We have already lost hours puzzling over what is happening here, and it is very unsatisfying not to know why certain statements are there or whether they have any effect.

Also, it's notable that many fields on the templateTask are set after the Persistence call for the RegionalAnalysis. It's not clear why we'd want to set these fields on the template for the tasks sent to the workers, while not updating them on the version stored in the RegionalAnalysis before it is saved in Mongo. It turns out that setting these fields on the template is probably a no-op, and somewhat by coincidence they end up matching the ones in the nested object in Mongo.

At first I thought this was because the values from AnalysisRequest#populateTask (which initializes the AnalysisWorkerTask) were for single-point analyses, and they were being overwritten for the regional case. But while the bounds do have two different meanings for regional and single-point requests, that differing interpretation happens during the calculations on the worker. In the backend, I have verified that all the fields sent over from the UI in the AnalysisRequest and then overwritten by templateTaskFromRegionalAnalysis (north/west/width/height/zoom/graphId/workerVersion) are just passed through identically, without interpretation, into both kinds of AnalysisWorkerTask (regional and single-point).

For a regional analysis, I traced execution and saw the following:

Tracing in RegionalAnalysisController#createRegionalAnalysis:
AnalysisRequest.bounds: 
  north = 45.52559248776562
  east = -122.62939453125
  south = 45.46976215263041
  west = -122.706298828125
  graphId = 5ec4e16683d6a604f47fb3f9
  workerVersion = v5.9.0
RegionalTask from analysisRequest.populateTask:
  zoom = 9
  west = 20860 
  north = 46878
  width = 29
  height = 30
  workerVersion = v5.9.0
RegionalAnalysis model/Mongo document:
  zoom = 9
  width = 29
  height = 30
  north = 46878
  west = 20860
  request = {RegionalTask above}
  bundleId = 5ec4e16683d6a604f47fb3f9
  workerVersion = v5.9.0
RegionalTask from templateTaskFromRegionalAnalysis (the "templateTask"):
  zoom = 9
  west = 20860
  north = 46878
  width = 29
  height = 30
  graphId = 5ec4e16683d6a604f47fb3f9 (same as bundleId above, and graphId in task)
  jobId = regionalAnalysis._id (** this is a main difference **)
  workerVersion = v5.9.0 (identical to all objects above)
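
For anyone puzzling over how the degree bounds at the top of this trace became the pixel values below them: this appears to be the standard web Mercator pixel conversion with 256-pixel tiles. The following is a sketch, not the actual R5 code, and R5's exact rounding or grid snapping may differ slightly:

    public class WebMercatorPixels {
        static double lonToPixel(double lon, int zoom) {
            double worldWidthPixels = 256 * Math.pow(2, zoom);
            return (lon + 180) / 360 * worldWidthPixels;
        }

        static double latToPixel(double lat, int zoom) {
            double worldWidthPixels = 256 * Math.pow(2, zoom);
            double latRad = Math.toRadians(lat);
            return (1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2 * worldWidthPixels;
        }

        public static void main(String[] args) {
            // west lon -122.706298828125 at zoom 9 -> ~20860, the trace's west
            System.out.println(lonToPixel(-122.706298828125, 9));
            // north lat 45.52559248776562 at zoom 9 -> ~46878, the trace's north
            System.out.println(latToPixel(45.52559248776562, 9));
        }
    }

Width and height then follow from the east/south edges of the bounds, modulo rounding.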

So, almost all items are identical, being copied from AnalysisRequest -> task (==regionalAnalysis.request) -> regionalAnalysis -> templateTask. For example, tracing the fields representing the width of the analysis grid:

task.width derived from analysisRequest.bounds
regionalAnalysis.request = task;
regionalAnalysis.width = task.width;
templateTask = regionalAnalysis.request.clone();
templateTask.width = regionalAnalysis.width;

So all the non-scenario-related field setting in templateTaskFromRegionalAnalysis (reflected in the last line of the above block) seems superfluous except the jobId, which I believe should be set in the source task being cloned, higher up the call stack.

Another strange thing is that the jobId is only set on the template task, and never on the request inside the RegionalAnalysis, so the Mongo document contains a meaningless jobId value that doesn't seem to ever be used. If we set the jobId on the regionalAnalysis.request, it would naturally be cloned into the templateTask, and the only difference in the templateTask would be scenario erasure (as implied by the documentation on the method that creates the templateTask). Perhaps scenario erasure could even be performed on the same task nested in regionalAnalysis and saved in Mongo, getting rid of all the near-duplicate fields and objects.
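
A toy, self-contained illustration of that suggestion (the classes here are hypothetical stand-ins, not the backend's types): set jobId on the task that will be nested in the RegionalAnalysis before cloning, so the clone differs only by scenario erasure.

    public class CloneOrderDemo {
        static class Task implements Cloneable {
            String jobId;
            String scenario;    // stands in for the full embedded scenario
            String scenarioId;

            @Override
            public Task clone() {
                try {
                    return (Task) super.clone();
                } catch (CloneNotSupportedException e) {
                    throw new AssertionError(e);
                }
            }
        }

        public static void main(String[] args) {
            Task stored = new Task();
            stored.jobId = "regionalAnalysisId";        // set once, on the stored task ...
            stored.scenario = "{ full scenario JSON }";

            Task template = stored.clone();             // ... and it carries over automatically
            template.scenario = null;                   // scenario erasure is then the only difference
            template.scenarioId = "scenarioKeyOnS3";

            System.out.println(template.jobId);         // prints regionalAnalysisId
        }
    }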

Questions:

  • Do we use these grid bounds fields on the top-level RegionalAnalysis object at all? Is there any reason to duplicate information that's in the nested RegionalTask at the top level?
  • Do we need the full scenario stored inside the RegionalAnalysis in Mongo, considering that it's written to S3 every time a regional analysis is run, under a key that is saved in the RegionalAnalysis?

A "scenario snapshot table"

A "scenario snapshot table" could store front/back-end representation of scenario when a regional analysis is run. This would also facilitate displaying associated modifications on the map of a completed regional analysis, and allow solving conveyal/analysis-ui#538.

The R5 representation would still be stored as JSON on S3.
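
A purely hypothetical sketch of what such a snapshot document might look like; none of these class or field names exist in the backend today:

    import java.util.List;
    import org.bson.Document;

    public class ScenarioSnapshot {
        public String _id;                    // snapshot ID
        public String regionalAnalysisId;     // the regional analysis this snapshot was taken for
        public String projectId;
        public List<Document> modifications;  // UI/backend modifications as stored in the editor
        public long createdAt;                // when the regional analysis was launched
    }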

@trevorgerhardt self-assigned this May 27, 2021
@trevorgerhardt added the optimization, p1 (Priority level 1: high, but not the top), and t1 (Time level 1: think days) labels May 27, 2021
abyrd commented Jul 13, 2021

Related to conveyal/analysis-ui#1572

abyrd commented Sep 28, 2021

We observed that some scenarios stored in the database are huge, so we have been revisiting this. To clarify a few points: currently we store the full expanded JSON of a regional analysis's scenario both in FileStorage (read by workers) and in Mongo (under regional-analyses.request.scenario). Both of these are still the R5 representation; we do not store a snapshot of the UI representation of the scenario, which is needed to produce the PDF-able scenario reports. So reports can be produced for any scenario in its current state, but not for the scenario used by a past regional analysis once that scenario has been changed in the editor.
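
To see which documents are actually huge, one option is a quick aggregation with $bsonSize (available in MongoDB 4.4+). This is a sketch with the MongoDB Java driver; the database name and the presence of a name field are assumptions:

    import java.util.Arrays;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class ScenarioSizeReport {
        public static void main(String[] args) {
            MongoCollection<Document> analyses = MongoClients.create()
                    .getDatabase("analysis")
                    .getCollection("regional-analyses");

            // List regional analyses by the BSON size of their stored document, largest first.
            for (Document d : analyses.aggregate(Arrays.asList(
                    new Document("$project", new Document("name", 1)
                            .append("bytes", new Document("$bsonSize", "$$ROOT"))),
                    new Document("$sort", new Document("bytes", -1))))) {
                System.out.println(d.toJson());
            }
        }
    }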

abyrd commented Jan 26, 2024

Some additional considerations are described in the review of #926.
