Segment large batch processes #2873

Open · 5 tasks done
K8Sewell opened this issue Jun 25, 2024 · 8 comments
Comments

K8Sewell commented Jun 25, 2024

Story
As described in a comment in #2859, a job may sometimes time out and fail before all records in a CSV are processed, which causes some jobs to run multiple times. We would like to change the batch process behavior to process CSVs in segments of 50 rows at a time, so the process does not time out and re-run (see the sketch after the list of jobs below).

This behavior should be applied to the following batch processes:

  • DeleteParentObjects
  • ReassociateChildOids
  • RecreateChildOidPtiffs
  • UpdateParentObjects
  • CreateParentObjects
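A minimal sketch of the 50-row segmentation idea, assuming a generic ActiveJob class; the class name, `process_row`, and the CSV re-read approach are illustrative assumptions, not the actual yul-dc-management implementation:

```ruby
require 'csv'

# Hypothetical job for illustration only; not the project's real class.
class SegmentedCsvJob < ApplicationJob
  SEGMENT_SIZE = 50

  # Enqueue one job per 50-row segment instead of a single long-running job
  # for the whole CSV.
  def self.enqueue_segments(csv_path)
    rows = CSV.read(csv_path, headers: true).map(&:to_h)
    rows.each_slice(SEGMENT_SIZE).with_index do |segment, index|
      perform_later(csv_path, index * SEGMENT_SIZE, segment.length)
    end
  end

  # Each invocation only touches its own slice, so a timeout in one segment
  # does not force the entire CSV to rerun.
  def perform(csv_path, offset, count)
    rows = CSV.read(csv_path, headers: true).map(&:to_h)[offset, count] || []
    rows.each { |row| process_row(row) } # process_row is a placeholder
  end
end
```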

Acceptance
The following jobs run in segments of 50 rows until completion:

  • DeleteParentObjects
  • ReassociateChildOids
  • RecreateChildOidPtiffs
  • UpdateParentObjects
  • CreateParentObjects

Engineering Notes
Jobs that already have batching patterns to pull from (a sketch of one such pattern follows the list):

  • SolrReindexAll
  • UpdateAllMetadata
  • UpdateDigitalObjects
  • UpdateManifests
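One common shape for that kind of batching in a Rails job, offered as an assumption about how the jobs above work rather than code taken from them, is ActiveRecord's find_in_batches:

```ruby
# Hypothetical example of the batching pattern referenced above; the model
# and index method names are placeholders, not the real yul-dc-management code.
class ExampleReindexJob < ApplicationJob
  def perform
    # find_in_batches loads 50 records at a time, so each iteration stays
    # short and a failure mid-run does not discard all prior work.
    ParentObject.find_in_batches(batch_size: 50) do |batch|
      batch.each { |record| record.solr_index } # solr_index is illustrative
    end
  end
end
```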
@sshetenhelm sshetenhelm added this to the Batch Process Refactoring milestone Jul 1, 2024
@sshetenhelm sshetenhelm changed the title [NEEDS EDITING] Segment large batch processes → [Segment large batch processes Jul 1, 2024
@sshetenhelm sshetenhelm changed the title [Segment large batch processes → Segment large batch processes Jul 1, 2024
@jpengst jpengst self-assigned this Jul 11, 2024
jpengst commented Jan 21, 2025

I know it's a long shot since it was back in June, but does anyone remember which GoodJob error this DeleteParentObjects job was receiving? (https://collections-uat.library.yale.edu/management/batch_processes/2039)

It would have only been displayed on the main GoodJob Dashboard under the job's name. Ex:

[screenshot: error shown on the GoodJob Dashboard under the job name]

K8Sewell (Author) commented:
PR ready for review - yalelibrary/yul-dc-management#1475

jpengst commented Feb 6, 2025

Deployed to Test v2.74.2

jpengst commented Feb 11, 2025

Confirmed that this is working on demo.
On Test, Solr falls over with this error:

[screenshot: Solr error on Test]

sshetenhelm commented:
I feel like something strange is going on with the 'UpdateParentObjects' batch process. It's taking far longer than I would expect just to update a single metadata field. The first parent received a 'Complete' but the rest have no status information.
Batch process -- https://collections-uat.library.yale.edu/management/batch_processes/2560

sshetenhelm commented:
Two objects had "Digital Object Source = None" but got dinged for not having a Preservica UUID. Also, Management tried to run them both roughly 15 times, and each time wrote an error in the batch process message:

[screenshot: repeated error in the batch process messages]

Management also reported not being able to log in to Preservica for a number of objects that have already been uploaded via Preservica. Also, why would it check Preservica if the job is only updating the 'Extent of Digitization' field?

jpengst commented Feb 20, 2025

Looking into why those "Digital Object Source = None" objects are being treated like Preservica objects. That's weird.

For the second issue, we always sync from Preservica when we update Preservica objects. I just tried updating one of the "Unable to login" objects with a single-line CSV upload and it updated the extent_of_digitization successfully with no errors. I'm looking into this. Putting this back in progress.

jpengst commented Feb 21, 2025

This ticket was spawned from this job (https://collections-uat.library.yale.edu/management/batch_processes/2039) that failed and reran multiple times because GoodJob lost connection and timed out. Instead of segmenting the jobs, it would be cleaner to have more robust error handling by rescuing and returning the specific GoodJob error. The main issue with this is that we no longer have the original GoodJob error to reference.

Putting this in Backlog. If a future job fails for a lost GoodJob connection, we will have the error to reference and can implement better error handling.
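A rough sketch of that error-handling approach, assuming an ActiveJob-style job class; the job name, the batch_process argument, and the batch_processing_event helper are placeholders for illustration, not the project's actual API:

```ruby
# Hypothetical example; class and method names are illustrative only.
class DeleteParentObjectsJob < ApplicationJob
  def perform(batch_process)
    batch_process.delete_parent_objects
  rescue StandardError => e
    # Record the specific error (e.g. a lost-connection timeout) on the batch
    # process so it is visible outside the GoodJob dashboard, then re-raise so
    # GoodJob still records the failed execution.
    batch_process.batch_processing_event("Job failed: #{e.class}: #{e.message}", "failed")
    raise
  end
end
```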
