Improve ElasticManager #203

Draft
wants to merge 4 commits into
base: master


oschulz
Contributor

@oschulz oschulz commented May 12, 2024

Adds several things to ElasticManager:

  • A callback option - this can be used to automatically run init code on new workers, add them to and remove them from worker pools, add custom logging when workers connect, etc.

  • More debug logging - often necessary to find out what's wrong if workers won't connect.

  • A mechanism to forward environment variables to workers. I haven't found a way to set them before the Julia worker process starts up, but this at least sets them before the worker does anything else.
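
To illustrate the callback idea from the list above, here is a minimal, self-contained sketch. It does not use the ElasticManager API from this PR: `ToyManager`, `worker_joined!`, `worker_left!`, `sync_pool` and the `(manager, worker_id, event)` callback signature are all hypothetical names chosen for illustration. It only shows how a user-supplied callback could keep a worker pool and a log in sync as workers connect and disconnect.

```julia
# Hypothetical sketch: none of these names come from ClusterManagers.jl.
struct ToyManager
    callback::Function   # invoked as callback(manager, worker_id, event)
    active::Set{Int}     # IDs of the workers that are currently connected
end

# Convenience constructor: start with no connected workers.
ToyManager(callback::Function) = ToyManager(callback, Set{Int}())

# Record a newly connected worker, then notify the user callback.
function worker_joined!(mgr::ToyManager, id::Int)
    push!(mgr.active, id)
    mgr.callback(mgr, id, :joined)
    return mgr
end

# Forget a disconnected worker, then notify the user callback.
function worker_left!(mgr::ToyManager, id::Int)
    delete!(mgr.active, id)
    mgr.callback(mgr, id, :left)
    return mgr
end

# User-side state that the callback keeps up to date.
pool = Set{Int}()                  # stand-in for a worker pool
events = Tuple{Int,Symbol}[]       # stand-in for custom logging

# Example callback: add/remove the worker from the pool and log the event.
function sync_pool(mgr, id, event)
    if event === :joined
        push!(pool, id)
    else
        delete!(pool, id)
    end
    push!(events, (id, event))
    return nothing
end

mgr = ToyManager(sync_pool)
worker_joined!(mgr, 2)
worker_joined!(mgr, 3)
worker_left!(mgr, 2)
```

In a real setup the callback would presumably run init code on the new worker or update a `Distributed` worker pool instead of a plain `Set`; that part is left out here because it depends on the final interface.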

I'm field-testing this via a local copy of ElasticManager in ParallelProcessingTools.jl (will release a new version soon) so I can still make breaking changes if necessary, but I'll sync this PR with that copy once it seems fully stable (looking pretty good so far, so hopefully soon).

oschulz added 3 commits May 7, 2024 16:18
* Add callback mechanism. Allows users to automatically initialize
  new workers, add workers to a given worker pool, etc.

* Make it easy to set worker timeout.

* Add debug logging, often necessary to figure out worker connection
  problems.
Revise has Distributed support, workers shouldn't run Revise separately.
@oschulz
Contributor Author

oschulz commented May 12, 2024

CC @JBlaschke, thanks for pointing out the potential of ElasticManager to me.

@oschulz
Contributor Author

oschulz commented Jul 13, 2024

It will take a bit longer before I upstream the ElasticManager changes from ParallelProcessingTools. I want to see if there's a clean way to handle network device selection and whether that requires interface changes.

@oschulz
Contributor Author

oschulz commented Jan 2, 2025

@DilumAluthge, sorry, I neglected this a bit. I should really get on with getting this release-ready.

@DilumAluthge
Member

@oschulz We currently do not have a maintainer for the ElasticManager functionality in this package.

Do you actively use the ElasticManager functionality? If so, would you be interested in becoming the maintainer for the ElasticManager functionality?

@DilumAluthge
Member

Also @oschulz it looks like there are some merge conflicts here.

Could you rebase this PR and fix the merge conflicts?

@oschulz
Contributor Author

oschulz commented Feb 10, 2025

> Do you actively use the ElasticManager functionality?

Yes, we do, quite actively, but currently the experimental version in ParallelProcessingTools. The plan is still to re-upstream it though.

I'll rebase and test and get on with this - gimme a bit.

> If so, would you be interested in becoming the maintainer for the ElasticManager functionality?

Sure, I can take that over.

@DilumAluthge
Member

For the other cluster managers (e.g. Slurm and LSF), I've moved the managers out to separate packages (SlurmClusterManager.jl and LSFClusterManager.jl), with the idea being that each manager has different maintainers, tests, CI, etc.

What do you think about moving the elastic manager out to a new standalone package, e.g. ElasticClusterManager.jl?

@oschulz
Contributor Author

oschulz commented Feb 12, 2025

> What do you think about moving the elastic manager out to a new standalone package, e.g. ElasticClusterManager.jl?

I'd be all for it! We have to release a ClusterManagers v2.0 then though, right?

@DilumAluthge
Member

> I'd be all for it! We have to release a ClusterManagers v2.0 then though, right?

Yep, which I'll need to do anyway once I remove Slurm from this package.

@oschulz
Contributor Author

oschulz commented Feb 12, 2025

> Yep, which I'll need to do anyway once I remove Slurm from this package.

Ok, that's perfect then, because I can then upstream my changes to ElasticClusterManager directly. I was hesitant to do that because I suspected I might need to make more breaking changes. But if ElasticClusterManager has its own version number, it's easy.

@DilumAluthge
Member

DilumAluthge commented Feb 16, 2025

I created the new repo:

@oschulz I've invited you to the repo: https://github.com/JuliaParallel/ElasticClusterManager.jl

You can accept the invitation here: https://github.com/JuliaParallel/ElasticClusterManager.jl/invitations

@oschulz
Contributor Author

oschulz commented Feb 16, 2025

> @oschulz I've invited you to the repo. You can accept the invitation here:

Thanks!


2 participants