Summary
This issue proposes adjustments to ensure accurate GPU power/efficiency measurements, especially when multiple GPUs are in use. The main goals are to:
Correctly aggregate power across all GPUs to avoid under-reporting total power usage.
Use the mean (instead of median) to calculate power draw over each run for better energy (Watt-min) accuracy.
Refine token counting for inference when outputs are variable length, and improve throughput logging.
1. Sum Power Across All GPUs
Currently, only the first GPU’s power draw is recorded. This underestimates total energy if multiple GPUs are active. Proposed fix: In the logging step, iterate over all GPU metrics and sum their power_draw:
```python
# gpu_metrics_utils.py
def collect_power_draw_all_gpus():
    metrics_list = get_gpu_metrics()
    # Sum instantaneous power draw across all GPUs
    total_power = sum(m.power_draw for m in metrics_list)
    return total_power
```
Then, use collect_power_draw_all_gpus() in training/inference loops instead of get_gpu_metrics()[0].power_draw.
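As a standalone illustration (with a mocked `GpuMetric` record standing in for whatever `get_gpu_metrics()` actually returns in this codebase), the aggregation can be sketched as:

```python
from dataclasses import dataclass

@dataclass
class GpuMetric:
    index: int
    power_draw: float  # watts

def collect_power_draw_all_gpus(metrics_list):
    # Sum instantaneous power draw across every visible GPU
    return sum(m.power_draw for m in metrics_list)

# Mocked readings for a 4-GPU node (values invented for illustration)
metrics = [GpuMetric(i, p) for i, p in enumerate([210.5, 198.0, 205.2, 201.3])]
total_power = collect_power_draw_all_gpus(metrics)  # 815.0 W vs. 210.5 W from GPU 0 alone
```

Recording only `metrics[0].power_draw` here would report roughly a quarter of the true draw.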
2. Use Mean Instead of Median for Power Draw
The code uses a groupby.median() approach to calculate power usage over a run. Although median reduces the influence of outliers, energy is fundamentally “(average power) × (time)”.
Proposed fix: Switch from median to mean in post-processing:
```python
# process_experiment_data.py
# Instead of groupby(...).median(), do:
summaries = df.groupby('max_watt', as_index=False).mean()
# This ensures energy calculations (power * time) align with average power draw.
```
If you need outlier handling, consider filtering data points before the mean rather than switching to median.
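A minimal stdlib sketch of this filter-then-mean approach (sample readings and the 2x-median cutoff are invented for illustration), showing that energy follows from the mean power, not the median:

```python
import statistics

# Hypothetical per-sample power readings (watts) over a run, with one transient spike
samples = [250.0, 252.0, 248.0, 251.0, 600.0]

# Optional outlier filtering before averaging (e.g., drop readings above 2x the median)
med = statistics.median(samples)
filtered = [p for p in samples if p <= 2 * med]

mean_power = statistics.mean(filtered)      # average power in watts
elapsed_min = 10.0                          # run duration in minutes (assumed)
energy_watt_min = mean_power * elapsed_min  # energy = average power x time
```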
3. Refine Variable-Length Inference Token Counting
When using model.generate(), not all sequences may reach the max length. Counting tokens as batch_size * seq_length can overestimate throughput.
Proposed fix: After generation, count actual output lengths:
```python
# run_inference.py
for outputs in model.generate(...):
    # Number of tokens in each generated sequence
    actual_len = outputs.shape[-1]
    total_tokens += actual_len
```
Then compute tokens_per_second = total_tokens / elapsed_time. This yields more precise throughput numbers.
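If generated batches are padded to a common length, the same idea can be sketched in pure Python by counting only non-pad tokens (`PAD_ID` and the sample batch are illustrative, not from the repo):

```python
PAD_ID = 0  # assumed pad token id

# Hypothetical generated batch: sequences padded to the batch max length
batch = [
    [5, 7, 9, PAD_ID, PAD_ID],  # finished early: 3 real tokens
    [4, 4, 4, 4, 4],            # ran to max length: 5 real tokens
]

# Count only actually generated (non-pad) tokens
total_tokens = sum(sum(1 for t in seq if t != PAD_ID) for seq in batch)

elapsed_time = 2.0  # seconds (assumed)
tokens_per_second = total_tokens / elapsed_time
```

The naive `batch_size * seq_length` count here would report 10 tokens instead of 8, inflating tokens/sec by 25%.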
4. Optional Improvements
DistributedDataParallel: For large-scale training, consider migrating from DataParallel to DDP for better scalability and performance.
Logging: Consider per-batch or per-iteration throughput logging if you want a more granular view of performance changes over time.
Energy Per Token: For clarity, you might also log energy / total_tokens (Joules per token or Watt-min per token) to emphasize efficiency.
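For the energy-per-token logging suggested above, the arithmetic is straightforward (the totals below are made up for illustration; 1 watt-minute = 60 joules):

```python
# Illustrative totals from a hypothetical run
energy_watt_min = 2502.5  # total energy over the run (watt-minutes)
total_tokens = 8000       # tokens processed during the run

watt_min_per_token = energy_watt_min / total_tokens
joules_per_token = watt_min_per_token * 60  # 1 W-min = 60 J
```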
Expected Benefits
Accurate Energy Calculations: Summing power from all GPUs prevents under-reporting total usage.
Robust Averages: Using mean power correlates directly with time-based energy consumption.
Better Throughput Metrics: Accounting for actual tokens avoids skewing tokens/sec if generations finish early.
Scalability: Enhanced logging and multi-GPU handling support more accurate comparisons for different power limits.
via: o1-pro Deep Research