Proposed Enhancements: Multi-GPU Logging & Power Calculation #3

chrisbraddock opened this issue Mar 1, 2025 · 0 comments

Summary
This issue proposes adjustments to ensure accurate GPU power/efficiency measurements, especially when multiple GPUs are in use. The main goals are to:

  1. Correctly aggregate power across all GPUs to avoid under-reporting total power usage.
  2. Use the mean (instead of median) to calculate power draw over each run for better energy (Watt-min) accuracy.
  3. Optionally refine token counting for inference when outputs are variable length, and refine throughput logging.

1. Sum Power Across All GPUs

Currently, only the first GPU’s power draw is recorded. This underestimates total energy if multiple GPUs are active.
Proposed fix: In the logging step, iterate over all GPU metrics and sum their power_draw:

# gpu_metrics_utils.py

def collect_power_draw_all_gpus():
    metrics_list = get_gpu_metrics()
    total_power = sum(m.power_draw for m in metrics_list)  # Summation of power across GPUs
    return total_power

Then, use collect_power_draw_all_gpus() in training/inference loops instead of get_gpu_metrics()[0].power_draw.
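The aggregation can be exercised without GPU hardware. A minimal sketch, assuming a record with a `power_draw` field in watts like the one `get_gpu_metrics()` yields (the `GpuMetrics` dataclass here is illustrative, not the repo's actual type):

```python
from dataclasses import dataclass

@dataclass
class GpuMetrics:          # stand-in for the record get_gpu_metrics() yields
    index: int
    power_draw: float      # watts

def collect_power_draw_all_gpus(metrics_list):
    """Sum instantaneous power draw (watts) across every visible GPU."""
    return sum(m.power_draw for m in metrics_list)

# Two GPUs near their limit: the old first-GPU-only code would report 251 W
readings = [GpuMetrics(0, 251.0), GpuMetrics(1, 249.0)]
print(collect_power_draw_all_gpus(readings))  # 500.0
```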

2. Use Mean Instead of Median for Power Draw

The code currently summarizes power usage over a run with groupby(...).median(). Although the median reduces the influence of outliers, energy is fundamentally “(average power) × (time)”, so a median-based summary biases the energy estimate whenever the power distribution is skewed.

Proposed fix: Switch from median to mean in post-processing:

# process_experiment_data.py

# Instead of groupby(...).median(), do:
summaries = df.groupby('max_watt', as_index=False).mean(numeric_only=True)
# numeric_only avoids errors on non-numeric columns; energy calculations
# (power * time) now align with the average power draw.

If you need outlier handling, consider filtering data points before the mean rather than switching to median.
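The mean-vs-median distinction can be seen with a few sampled readings. A minimal sketch (the sample values and interval are made up for illustration): energy integrated from the samples equals mean power times total time exactly, while the median understates it when a burst skews the distribution.

```python
from statistics import mean, median

# Power samples (watts) taken every 30 s; one burst pushes a sample to 400 W
samples = [200.0, 210.0, 205.0, 400.0]
dt_min = 0.5  # sampling interval in minutes

true_energy = sum(p * dt_min for p in samples)           # exact Watt-min: 507.5
mean_energy = mean(samples) * dt_min * len(samples)      # 507.5 -- matches
median_energy = median(samples) * dt_min * len(samples)  # 415.0 -- underestimates
print(true_energy, mean_energy, median_energy)
```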

3. Refine Variable-Length Inference Token Counting

When using model.generate(), not all sequences reach the maximum length; shorter ones are padded out to a common length. Counting tokens as batch_size * seq_length therefore overestimates throughput.

Proposed fix: After generation, count actual output lengths:

# run_inference.py

outputs = model.generate(...)  # shape: (batch, seq_len), right-padded
for seq in outputs:
    # Count only real tokens; padding after an early EOS must not inflate the total
    actual_len = (seq != tokenizer.pad_token_id).sum().item()
    total_tokens += actual_len

Then compute tokens_per_second = total_tokens / elapsed_time. This yields more precise throughput numbers.
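The same bookkeeping can be checked without a model or tokenizer. A minimal sketch, treating the batched output as plain right-padded token-id lists (`PAD_ID` and `count_generated_tokens` are illustrative names, not part of the repo):

```python
PAD_ID = 0  # assumption: the tokenizer pads with id 0

def count_generated_tokens(sequences, prompt_len, pad_id=PAD_ID):
    """Count non-padding tokens produced after the prompt in a padded batch."""
    total = 0
    for seq in sequences:
        generated = seq[prompt_len:]  # drop the prompt tokens
        total += sum(1 for tok in generated if tok != pad_id)
    return total

# Batch of two right-padded sequences sharing a 2-token prompt
batch = [[5, 6, 7, 8, 0, 0],   # 2 generated tokens
         [5, 6, 9, 0, 0, 0]]   # 1 generated token
print(count_generated_tokens(batch, prompt_len=2))  # 3
```

The naive batch_size * seq_length count for this batch would be 12, four times the real number of generated tokens.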

4. Optional Improvements

  • DistributedDataParallel: For large-scale training, consider migrating from DataParallel to DDP for better scalability and performance.
  • Logging: Consider per-batch or per-iteration throughput logging if you want a more granular view of performance changes over time.
  • Energy Per Token: For clarity, you might also log energy / total_tokens (Joules per token or Watt-min per token) to emphasize efficiency.
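The energy-per-token metric from the last bullet is a one-line computation. A minimal sketch (the function name and example figures are illustrative):

```python
def energy_per_token(mean_power_w, elapsed_s, total_tokens):
    """Joules per token: (mean power in W) x (elapsed seconds) / tokens."""
    return (mean_power_w * elapsed_s) / total_tokens

# 250 W sustained for one minute while generating 3000 tokens
print(energy_per_token(250.0, 60.0, 3000))  # 5.0 J/token
```

Dividing by 60 converts to Watt-min per token if that unit is preferred for consistency with the run summaries.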

Expected Benefits

  • Accurate Energy Calculations: Summing power from all GPUs prevents under-reporting total usage.
  • Robust Averages: Using mean power correlates directly with time-based energy consumption.
  • Better Throughput Metrics: Accounting for actual tokens avoids skewing tokens/sec if generations finish early.
  • Scalability: Enhanced logging and multi-GPU handling support more accurate comparisons for different power limits.

via: o1-pro Deep Research
