Multiple redundant calls to generate_example() when using multiple GPUs #1957

Open
TheLukaDragar opened this issue Mar 10, 2025 · 0 comments
Labels: bug (Something isn't working)

@TheLukaDragar

Bug description

When training with multiple devices using Fabric, the generate_example() function is redundantly called on every GPU/rank, leading to inefficient resource utilization. Each rank performs the identical computation while only the output from rank 0 is actually displayed through fabric.print(). This causes significant delays when using many devices.

From litgpt/finetune/lora.py (lines 368-736), with additional logging added:

if not is_accumulating and step_count % eval.interval == 0:
    print(f"Rank {fabric.global_rank} validating...")
    t0 = time.perf_counter()
    val_loss = validate(fabric, model, val_dataloader, eval)
    print(f"Rank {fabric.global_rank} completed validation in {time.perf_counter() - t0:.2f} seconds")

    print(f"Rank {fabric.global_rank} generating example...")
    t_gen = time.perf_counter()
    generate_example(fabric, model, tokenizer, eval, data)
    print(f"Rank {fabric.global_rank} completed example generation in {time.perf_counter() - t_gen:.2f} seconds")

    # Generation is timed with its own variable so that t1 still covers the
    # whole eval span (validation + generation), matching the "val time" label.
    t1 = time.perf_counter() - t0
    fabric.print(f"iter {iter_num}: val loss {val_loss.item():.4f}, val time: {t1 * 1000:.2f} ms")
    metrics = {"val_loss": val_loss, "val_ppl": math.exp(val_loss)}
    fabric.log_dict(metrics, step=iter_num)
    fabric.barrier()

Reproduction Steps

  1. Run a finetuning job using multiple GPUs (in my case 8 GPUs)
  2. Observe logs showing each rank independently generating examples

Evidence

The logs show multiple "Rank N generating example..." messages, with each rank performing the same operations:

Rank 0 validating...
Rank 0 completed validation in 56.52 seconds
Rank 0 generating example...

Rank 2 validating...
Rank 2 completed validation in 56.43 seconds
Rank 2 generating example...

Rank 3 validating...
Rank 3 completed validation in 56.44 seconds
Rank 3 generating example...

Rank 4 validating...
Rank 4 completed validation in 56.47 seconds
Rank 4 generating example...

Looking at the implementation, each GPU generates the same example independently, yet only rank 0's output is displayed in the logs, via the fabric.print() call inside generate_example().
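
That rank-0-only display is just fabric.print's default behavior; a quick sanity check (a sketch, not litgpt code) makes the contrast visible:

# With 8 processes, the first line appears once per rank, the second only
# once, because Fabric.print emits only on global rank 0 by default.
print(f"plain print from rank {fabric.global_rank}")
fabric.print(f"fabric.print from rank {fabric.global_rank}")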

Expected Behavior

Only rank 0 should generate the example, or the work should be properly distributed across devices with results gathered at rank 0. The current implementation wastes GPU resources by redundantly performing the same computation across all devices.
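
For the distribute-and-gather variant, a minimal sketch is below. prompts and generate_one() are hypothetical stand-ins (the real generate_example() in litgpt both generates and prints internally); the point is only the shape of the pattern: split the work round-robin across ranks, then collect the decoded strings with torch.distributed.all_gather_object so rank 0 can print everything.

import torch.distributed as dist

# Hypothetical: `prompts` is a list of example prompts and generate_one()
# returns a decoded string; neither is a litgpt API.
my_prompts = prompts[fabric.global_rank::fabric.world_size]
my_outputs = [generate_one(model, tokenizer, p) for p in my_prompts]

# Gather every rank's string list; all ranks participate, rank 0 prints.
gathered = [None] * fabric.world_size
dist.all_gather_object(gathered, my_outputs)
if fabric.global_rank == 0:
    for rank_outputs in gathered:
        for text in rank_outputs:
            print(text)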

Suggested Fix

Modify the code to only call generate_example() on rank 0, or implement proper distribution of this workload:

# Generate only on rank 0; every rank must still reach the barrier,
# so it stays outside the guard.
if fabric.global_rank == 0:
    generate_example(fabric, model, tokenizer, eval, data)
fabric.barrier()
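
One caveat, offered as an assumption rather than something verified against litgpt's internals: if the model is sharded (e.g. under FSDP), the forward pass inside generate_example() may itself issue collectives that every rank must join, in which case a rank-0-only guard would deadlock at the first collective. If that turns out to be the case, the redundancy is the price of keeping the ranks in lockstep, and the cheaper mitigation would be to run generation less often or make it optional.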

What operating system are you using?

Linux

LitGPT Version

Version: 0.5.8.dev1
@TheLukaDragar added the bug label on Mar 10, 2025