
Is distributed training available in javacpp pytorch? #1585

Open
mullerhai opened this issue Feb 27, 2025 · 18 comments


@mullerhai

Hi,
I see that C++ libtorch now supports distributed training, though perhaps only via MPI:
https://github.com/pytorch/examples/blob/main/cpp/distributed/dist-mnist.cpp
https://github.com/pytorch/examples/blob/main/cpp/distributed/README.md

I checked javacpp and found only a few classes in package org.bytedeco.pytorch, such as DistributedBackend, DistributedBackendOptional, DistributedBackendOptions, DistributedSampler, and Work.
So does javacpp-pytorch support distributed training? Could you show us a distributed code demo? LLMs need distributed training.

But the Work class contains this comment:

// Please do not use Work API, it is going away, to be
// replaced by ivalue::Future.
// Python binding for this class might change, please do not assume
// this will be bound using pybind.

Have you tried using Work? I want to understand the javacpp pytorch distributed logic. Thanks.
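For reference, the core pattern in that C++ dist-mnist example is: every rank computes its local gradients, an allreduce averages them across ranks, and each replica applies the same update. A plain-Java sketch of that collective, simulated in a single process with made-up rank data (this is not javacpp API, just an illustration of the math):

```java
import java.util.Arrays;

public class AllreduceSketch {
    // Simulated allreduce-with-average: sum each element across ranks,
    // divide by the world size, and write the result back to every rank.
    static void allreduceAverage(double[][] gradsPerRank) {
        int worldSize = gradsPerRank.length;
        int n = gradsPerRank[0].length;
        double[] avg = new double[n];
        for (double[] grad : gradsPerRank)
            for (int i = 0; i < n; i++) avg[i] += grad[i];
        for (int i = 0; i < n; i++) avg[i] /= worldSize;
        for (double[] grad : gradsPerRank)
            System.arraycopy(avg, 0, grad, 0, n);
    }

    public static void main(String[] args) {
        // Two "ranks", each holding its own local gradient vector.
        double[][] grads = { {1.0, 2.0}, {3.0, 4.0} };
        allreduceAverage(grads);
        // After the allreduce, every rank holds the same averaged gradient.
        System.out.println(Arrays.toString(grads[0])); // [2.0, 3.0]
    }
}
```

In the real C++ example this averaging is done by `ProcessGroupMPI::allreduce` over the model's gradient tensors each step.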

@mullerhai
Author

I also found ProcessGroup, ProcessGroupGloo, NCCLPreMulSumSupplement, RecvWork, SendWork, ReduceOp, CustomClassHolder, AsyncWork, GlooStore, _SupplementBase, and Store.

@mullerhai
Author

I found the public enum BackendType:

    public enum BackendType {
        UNDEFINED((byte)(0)),
        GLOO((byte)(1)),
        NCCL((byte)(2)),
        UCC((byte)(3)),
        MPI((byte)(4)),
        CUSTOM((byte)(5));
    }

Could it really invoke any backend?

@mullerhai
Author

mullerhai commented Feb 27, 2025

@HGuillemet hi, there is only a ProcessGroupGloo class now. For NCCL, MPI, and UCC I cannot find class ProcessGroupMPI : public ProcessGroup, class ProcessGroupNCCL : public ProcessGroup, ProcessGroupUCC, FileStore : public Store, MPIStore, or NcclStore. Please add them @saudet.

https://github.com/gpgpu-sim/pytorch-gpgpu-sim/blob/0459e409e2fccbfc4eb908fe8138e1bf5deb4bed/torch/lib/c10d/ProcessGroupMPI.hpp#L64
https://github.com/gpgpu-sim/pytorch-gpgpu-sim/blob/0459e409e2fccbfc4eb908fe8138e1bf5deb4bed/torch/lib/c10d/ProcessGroupNCCL.cpp

Also, is public class ProcessGroupGloo extends DistributedBackend correct? In C++ it is class ProcessGroupGloo : public ProcessGroup. AlgorithmEntry and AlgorithmKey are also not found.

@mullerhai
Author


@HGuillemet @saudet please add them in the javacpp-pytorch 2.6 release. We need full support for distributed pytorch training in this version, thanks.

@mullerhai
Author


@sbrunk I think storch needs to support distributed pytorch training. Would you develop distributed code for storch?

@mullerhai
Author

javacpp also has GradBucket, Reducer, ReduceOp, and some @Namespace("c10d") classes. With more work on MPI, NCCL, and UCC, javacpp pytorch could really run distributed training.
By the way, could javacpp create a Discord group to chat @HGuillemet @saudet? We could work better and faster if we could talk directly.
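For context, the idea behind C++ DDP's GradBucket/Reducer is to flatten many small per-parameter gradients into one contiguous bucket so a single allreduce covers all of them instead of one collective per parameter. A minimal plain-Java sketch of the flatten/unflatten step (simulated buffers, no javacpp types):

```java
import java.util.List;

public class GradBucketSketch {
    // Flatten several per-parameter gradient arrays into one bucket buffer.
    static double[] flatten(List<double[]> grads) {
        int total = grads.stream().mapToInt(g -> g.length).sum();
        double[] bucket = new double[total];
        int off = 0;
        for (double[] g : grads) {
            System.arraycopy(g, 0, bucket, off, g.length);
            off += g.length;
        }
        return bucket;
    }

    // Copy the (reduced) bucket contents back into the per-parameter arrays.
    static void unflatten(double[] bucket, List<double[]> grads) {
        int off = 0;
        for (double[] g : grads) {
            System.arraycopy(bucket, off, g, 0, g.length);
            off += g.length;
        }
    }
}
```

In real DDP the bucket buffer is what gets handed to ProcessGroup::allreduce, and the results are scattered back to each parameter's gradient afterwards.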

@mullerhai
Author

We need to make javacpp's Work.java really work for DDP models. To solve this problem we need to do more debugging.

@mullerhai
Author

What about the DistributedDataParallel class? Does it need to be implemented in javacpp?
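As background, DistributedDataParallel does two things on top of a ProcessGroup: at construction it broadcasts rank 0's parameters so every replica starts identical, and during training it allreduces gradients each step. The broadcast half can be sketched in plain Java with simulated ranks (an illustration of the semantics, not javacpp API):

```java
public class BroadcastSketch {
    // Simulated broadcast: overwrite every other rank's parameters
    // with rank 0's, so all replicas start from the same weights.
    static void broadcastFromRankZero(double[][] paramsPerRank) {
        double[] src = paramsPerRank[0];
        for (int rank = 1; rank < paramsPerRank.length; rank++)
            System.arraycopy(src, 0, paramsPerRank[rank], 0, src.length);
    }
}
```

In real DDP this is a c10d broadcast collective over the flattened parameters rather than an in-process copy.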

@mullerhai
Author

I tried to use the javacpp DDP classes, but I could not complete the code and it does not run:

  // attempt at setting up a Gloo process group; still does not run
  val glooStore = new GlooStore()
  val options = new DistributedBackend.Options()
  options.timeout(new SecondsFloat())
  options.store(glooStore)

  // rank must be < worldSize, so a single-process group is rank 0 of size 1
  val processGroup = new ProcessGroupGloo(glooStore, 0, 1, options)
  val backend = new DistributedBackend("gloo", options, processGroup)
  val distributedBackend = DistributedBackend.withBackend(backend)
  val rank = processGroup.rank()
  val worldSize = processGroup.size()
  // in C++, allreduce takes the tensors to reduce: allreduce(std::vector<at::Tensor>&, AllreduceOptions)
  val work = processGroup.allreduce(tensors)
  work.wait() // may clash with Object.wait() in Java; the actual binding name is unclear

@mullerhai
Author

Hi, I think if javacpp pytorch implemented ProcessGroupMPI, ProcessGroupNCCL, ProcessGroupUCC, MpiStore, NcclStore, and UccStore, we could do much more. @HGuillemet @saudet, thanks, please bring them into the javacpp-pytorch 2.6 version.

@HGuillemet
Collaborator

Only the Gloo backend has been mapped. It's the only one that works identically on the 3 platforms.
See this question in one of your issues last year that was not answered, and the change list for 2.4.0.

@saudet has taken over the maintenance of the PyTorch presets and I doubt completing the mapping of the distributed framework is a priority for him. Unless maybe your company is willing to fund this work?

@mullerhai
Author

> Only the Gloo backend has been mapped. It's the only one that works identically on the 3 platforms. See this question in one of your issues last year that was not answered, and the change list for 2.4.0.
>
> @saudet has taken over the maintenance of the PyTorch presets and I doubt completing the mapping of the distributed framework is a priority for him. Unless maybe your company is willing to fund this work?

Thanks for your feedback, but no company supports me to do this. I just want to make JVM deep learning more available. Currently storch is the only Scala pytorch frontend that depends on javacpp-pytorch, and it is also open source.
@saudet we need your help to map ProcessGroupMPI, ProcessGroupNCCL, ProcessGroupUCC, MpiStore, NcclStore, and UccStore. If they could be used in Java/Scala, LLMs would be open to the JVM environment!

@saudet
Member

saudet commented Mar 1, 2025

I'm not aware of anyone who has money to invest in distributed training for Java... @frankfliu Anyone at Amazon?

@mullerhai
Author

> I'm not aware of anyone who has money to invest in distributed training for Java... @frankfliu Anyone at Amazon?

If you could help javacpp-pytorch map ProcessGroupMPI, ProcessGroupNCCL, ProcessGroupUCC, MpiStore, NcclStore, and UccStore, I am very willing to do the next steps.

@mullerhai
Author

> I'm not aware of anyone who has money to invest in distributed training for Java... @frankfliu Anyone at Amazon?

If you could help javacpp-pytorch map ProcessGroupMPI, ProcessGroupNCCL, ProcessGroupUCC, MpiStore, NcclStore, UccStore, and so on, to enable distributed pytorch training in Java, I am very willing to then compile them to Scala.

I have also generated Scala Native bindings for MPI: https://github.com/mullerhai/sn-bindgen-mpi

@mullerhai
Author

> I'm not aware of anyone who has money to invest in distributed training for Java... @frankfliu Anyone at Amazon?

I think you and Amazon could do it; only you have the ability. If you map all these distributed classes, it will be a great promotion for the Java/JVM family. In the future, Java and Scala will be as important as Python in deep learning. Please map these classes. Thank you very much for your hard work.

@frankfliu

> I'm not aware of anyone who has money to invest in distributed training for Java... @frankfliu Anyone at Amazon?

No, I'm not aware of any use case that requires distributed training in Java.

@mullerhai
Author

> > I'm not aware of anyone who has money to invest in distributed training for Java... @frankfliu Anyone at Amazon?
>
> No, I'm not aware of any use case that requires distributed training in Java.

For LLMs like MoE-type models with transformers or Mamba, currently only Python and C++ can do it.


4 participants