Multiple redundant calls to generate_example() when using multiple GPUs #1957

Open
TheLukaDragar opened this issue Mar 10, 2025 · 0 comments
Labels: bug (Something isn't working)

@TheLukaDragar

Bug description

When training with multiple devices using Fabric, the generate_example() function is redundantly called on every GPU/rank, leading to inefficient resource utilization. Each rank performs the identical computation while only the output from rank 0 is actually displayed through fabric.print(). This causes significant delays when using many devices.

From litgpt/finetune/lora.py (lines 368-736), with additional logging added:

if not is_accumulating and step_count % eval.interval == 0:
    print(f"Rank {fabric.global_rank} validating...")
    t0 = time.perf_counter()
    val_loss = validate(fabric, model, val_dataloader, eval)
    print(f"Rank {fabric.global_rank} completed validation in {time.perf_counter() - t0:.2f} seconds")

    print(f"Rank {fabric.global_rank} generating example...")
    t_gen = time.perf_counter()
    generate_example(fabric, model, tokenizer, eval, data)
    print(f"Rank {fabric.global_rank} completed example generation in {time.perf_counter() - t_gen:.2f} seconds")

    # Generation is timed with its own variable so that t1 still covers the
    # whole eval span (validation + generation), matching the "val time" label.
    t1 = time.perf_counter() - t0
    fabric.print(f"iter {iter_num}: val loss {val_loss.item():.4f}, val time: {t1 * 1000:.2f} ms")
    metrics = {"val_loss": val_loss, "val_ppl": math.exp(val_loss)}
    fabric.log_dict(metrics, step=iter_num)
    fabric.barrier()

Reproduction Steps

  1. Run a finetuning job using multiple GPUs (in my case 8 GPUs)
  2. Observe logs showing each rank independently generating examples

Evidence

The logs show multiple "Rank N generating example..." messages, with each rank performing the same operations:

Rank 0 validating...
Rank 0 completed validation in 56.52 seconds
Rank 0 generating example...

Rank 2 validating...
Rank 2 completed validation in 56.43 seconds
Rank 2 generating example...

Rank 3 validating...
Rank 3 completed validation in 56.44 seconds
Rank 3 generating example...

Rank 4 validating...
Rank 4 completed validation in 56.47 seconds
Rank 4 generating example...

Looking at the implementation, each GPU generates the same example independently, yet only rank 0's output is displayed in the logs, via the fabric.print() call inside generate_example().
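
That rank-0-only display is just fabric.print's default behavior; a quick sanity check (a sketch, not litgpt code) makes the contrast visible:

# With 8 processes, the first line appears once per rank, the second only
# once, because Fabric.print emits only on global rank 0 by default.
print(f"plain print from rank {fabric.global_rank}")
fabric.print(f"fabric.print from rank {fabric.global_rank}")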

Expected Behavior

Only rank 0 should generate the example, or the work should be properly distributed across devices with results gathered at rank 0. The current implementation wastes GPU resources by redundantly performing the same computation across all devices.
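
For the distribute-and-gather variant, a minimal sketch is below. prompts and generate_one() are hypothetical stand-ins (the real generate_example() in litgpt both generates and prints internally); the point is only the shape of the pattern: split the work round-robin across ranks, then collect the decoded strings with torch.distributed.all_gather_object so rank 0 can print everything.

import torch.distributed as dist

# Hypothetical: `prompts` is a list of example prompts and generate_one()
# returns a decoded string; neither is a litgpt API.
my_prompts = prompts[fabric.global_rank::fabric.world_size]
my_outputs = [generate_one(model, tokenizer, p) for p in my_prompts]

# Gather every rank's string list; all ranks participate, rank 0 prints.
gathered = [None] * fabric.world_size
dist.all_gather_object(gathered, my_outputs)
if fabric.global_rank == 0:
    for rank_outputs in gathered:
        for text in rank_outputs:
            print(text)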

Suggested Fix

Modify the code to only call generate_example() on rank 0, or implement proper distribution of this workload:

# Generate only on rank 0; every rank must still reach the barrier,
# so it stays outside the guard.
if fabric.global_rank == 0:
    generate_example(fabric, model, tokenizer, eval, data)
fabric.barrier()
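
One caveat, offered as an assumption rather than something verified against litgpt's internals: if the model is sharded (e.g. under FSDP), the forward pass inside generate_example() may itself issue collectives that every rank must join, in which case a rank-0-only guard would deadlock at the first collective. If that turns out to be the case, the redundancy is the price of keeping the ranks in lockstep, and the cheaper mitigation would be to run generation less often or make it optional.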

What operating system are you using?

Linux

LitGPT Version

Version: 0.5.8.dev1
@TheLukaDragar added the bug label on Mar 10, 2025