[Bugfix] Pad hidden_states to avoid cross-ring AllGatherV #963

Open · wants to merge 2 commits into main from community-fix-tp16-error

Conversation

ApsarasX
Collaborator

What this PR does / why we need it?

On the 910B2C hardware platform, each machine is equipped with 16 NPU cards. When running with tensor parallelism (TP) = 16, if num_tokens (the input sequence length) is not divisible by tp_size, the uneven gather triggers a cross-ring AllGatherV communication operation, which can fail with unexpected errors (see the figure below). This PR pads hidden_states so that the token dimension is divisible by tp_size, avoiding the AllGatherV path.

(Figure: screenshot of the unexpected error triggered by the cross-ring AllGatherV.)
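For reference, a minimal sketch of the padding idea (editorial illustration; the helper name pad_for_tp_gather and the call sites are assumptions, not the exact code in this PR): round the token dimension of hidden_states up to a multiple of tp_size before the tensor-parallel gather, then drop the padding afterwards, so every rank exchanges equally sized chunks and the regular AllGather path is used instead of AllGatherV.

import torch
import torch.nn.functional as F

def pad_for_tp_gather(hidden_states: torch.Tensor, tp_size: int):
    # Hypothetical helper: pad dim 0 (num_tokens) up to a multiple of tp_size.
    num_tokens = hidden_states.shape[0]
    pad_size = (-num_tokens) % tp_size
    if pad_size > 0:
        # F.pad pads the last dimension first, so (0, 0, 0, pad_size) pads dim 0 at the end.
        hidden_states = F.pad(hidden_states, (0, 0, 0, pad_size))
    return hidden_states, pad_size

# Usage sketch: pad, split into equal per-rank chunks, gather, then strip the padding.
# hidden_states, pad_size = pad_for_tp_gather(hidden_states, tp_size)
# chunks = torch.tensor_split(hidden_states, tp_size, dim=0)   # all chunks now equal
# dist.all_gather(list(chunks), chunks[rank])                  # plain AllGather, no AllGatherV
# if pad_size > 0:
#     hidden_states = hidden_states[:-pad_size]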

Does this PR introduce any user-facing change?

No

How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
@ApsarasX ApsarasX force-pushed the community-fix-tp16-error branch from f03a0ae to bb6ea4e Compare May 26, 2025 16:18
@ApsarasX ApsarasX added the ready read for review label May 26, 2025
Signed-off-by: ApsarasX <apsarax@outlook.com>
@ApsarasX ApsarasX force-pushed the community-fix-tp16-error branch from c87fac1 to cc7ec02 Compare May 27, 2025 02:00
@ApsarasX ApsarasX requested review from ganyi1996ppo, wangxiyuan, Yikun and MengqingCao and removed request for ganyi1996ppo May 27, 2025 03:55
@ganyi1996ppo
Collaborator

Have you tried torch.distributed.all_gather and then concatenating (torch.cat) the data collected from all ranks?

@ApsarasX
Collaborator Author

> Have you tried torch.distributed.all_gather and then concatenating (torch.cat) the data collected from all ranks?

import torch
import torch_npu
import torch.distributed as dist
import torch.multiprocessing as mp

def verify_tp_o_method(rank: int, world_size: int, hidden_states: torch.Tensor):
    torch.npu.set_device(rank)

    dist.init_process_group(
        backend='hccl',
        init_method='tcp://127.0.0.1:55223',
        world_size=world_size,
        rank=rank
    )

    hidden_states = hidden_states.to(f"npu:{rank}")

    # With inter_dim = 16369 and world_size = 16, tensor_split produces uneven chunks:
    # the first chunk has 1024 rows, the remaining fifteen have 1023.
    chunk_hidden_states = torch.tensor_split(hidden_states, world_size, dim=0)

    router_hidden_states = chunk_hidden_states[rank]

    print(f"[rank={rank}] router_hidden_states.shape = {router_hidden_states.shape}")

    # Gathering unequally sized chunks turns this into an AllGatherV-style collective.
    dist.all_gather(list(chunk_hidden_states), router_hidden_states)

    dist.barrier()

    dist.destroy_process_group()

def main():
    world_size = 16
    inter_dim = 16369        # not divisible by 16 -> uneven chunks
    # inter_dim = 16384      # divisible by 16 -> equal chunks
    output_dim = 7168
    hidden_states = torch.randn((inter_dim, output_dim), dtype=torch.half)

    mp.spawn(verify_tp_o_method, args=(world_size, hidden_states), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()

The above code will cause the following error:

(Screenshot 2025-05-28 17:13:24 showing the resulting error.)
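The chunk sizes explain the failure. As a quick sanity check (editorial sketch, runs on CPU, no NPU required), splitting 16369 rows 16 ways gives mixed sizes, while the padded 16384 from the commented-out inter_dim line splits evenly; per the PR description, the uneven case is what falls onto the cross-ring AllGatherV path.

import torch

for inter_dim in (16369, 16384):
    sizes = [c.shape[0] for c in torch.tensor_split(torch.empty(inter_dim, 1), 16, dim=0)]
    print(inter_dim, sorted(set(sizes)))

# 16369 -> [1023, 1024]  (uneven chunks: gather becomes AllGatherV)
# 16384 -> [1024]        (equal chunks: regular AllGather)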

@ganyi1996ppo
Collaborator

> [quoted the reproduction script and error from the previous comment]

Strange, this works fine on my machine.....

@ApsarasX
Collaborator Author

ApsarasX commented Jun 3, 2025

@ganyi1996ppo Any progress on this PR?

@github-actions github-actions bot added merge-conflicts and removed ready read for review labels Jun 4, 2025

github-actions bot commented Jun 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@jianzs
Collaborator

jianzs commented Jun 14, 2025

> [quoted the reproduction script, the error, and the "Strange, this works fine on my machine..." reply above]

Is it because your device has only 8 cards instead of 16?
