Select input_ids explicitly after panda conversion #2335

seungduk-yanolja · 2025-02-15T17:54:48Z

Description

Without selecting the column, applying `len` counts each whole row as 1, which yields the total number of samples instead of the token count.
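
A minimal sketch of the difference, using a toy DataFrame rather than the actual axolotl code (the variable names below are illustrative assumptions). With the default `axis=0`, `DataFrame.apply` calls the function once per column, so `len` returns the number of rows. Selecting the `input_ids` column first gives a Series, and `Series.apply` calls `len` on each element instead.

```python
import pandas as pd

# Toy stand-in for a tokenized dataset after conversion to pandas:
# three samples holding 3, 2 and 4 tokens respectively.
df = pd.DataFrame({"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]})

# Without selecting the column: DataFrame.apply (default axis=0) passes each
# column to len, so the result is the number of rows, i.e. the sample count.
sample_count = int(df.apply(len).sum())

# With the column selected: Series.apply calls len on every token list,
# so the sum is the total token count.
token_count = int(df["input_ids"].apply(len).sum())

print(sample_count)  # 3
print(token_count)   # 9
```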

Motivation and Context

#2334

How has this been tested?

Ran a training with the fix.

Screenshots (if appropriate)

[2025-02-15 17:32:59,757] [DEBUG] [axolotl.calculate_total_num_steps:403] [PID:3274911] [RANK:0] total_num_tokens: 16_452_017_178
[2025-02-15 17:37:01,314] [DEBUG] [axolotl.calculate_total_num_steps:421] [PID:3274911] [RANK:0] `total_supervised_tokens: 16_452_017_178`

Types of changes

Social Handles (Optional)


winglian · 2025-02-15T18:59:07Z

Thanks! I noticed this but hadn't gotten around to digging into it. Is the operation still pretty fast on a large dataset?

seungduk-yanolja · 2025-02-15T19:08:31Z

> Thanks! I noticed this but hadn't gotten around to digging into it. Is the operation still pretty fast on a large dataset?

I think so. For the 60GB JSONL file, with the fix:

[2025-02-15 17:31:43,546] [INFO] [axolotl.utils.data.sft.load_tokenized_prepared_datasets:245] [PID:3274911] [RANK:0] Prepared dataset loaded from disk...
[2025-02-15 17:32:59,757] [DEBUG] [axolotl.calculate_total_num_steps:403] [PID:3274911] [RANK:0] total_num_tokens: 16_452_017_178
[2025-02-15 17:37:01,314] [DEBUG] [axolotl.calculate_total_num_steps:421] [PID:3274911] [RANK:0] `total_supervised_tokens: 16_452_017_178`

And with the previous version (release v0.6.0):

[2025-02-13 09:25:49,734] [INFO] [axolotl.load_tokenized_prepared_datasets:207] [PID:2735253] [RANK:0] Prepared dataset loaded from disk...
[2025-02-13 09:30:49,085] [DEBUG] [axolotl.calculate_total_num_steps:342] [PID:2735253] [RANK:0] total_num_tokens: 16_452_017_178
[2025-02-13 09:48:17,172] [DEBUG] [axolotl.calculate_total_num_steps:360] [PID:2735253] [RANK:0] `total_supervised_tokens: 16_452_017_178`
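
For a rough comparison, the gaps between consecutive log lines can be computed as in the sketch below. The timestamps are copied from the two runs above; the `elapsed` helper is a hypothetical name introduced here. Each interval covers everything that ran between two log lines, and the two runs used different versions, so this is not a controlled benchmark of this change alone.

```python
from datetime import datetime

# Timestamp layout used by the log lines: "YYYY-MM-DD HH:MM:SS,mmm".
FMT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed(start: str, end: str) -> float:
    """Seconds between two log timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Run with the fix: dataset loaded -> total_num_tokens -> total_supervised_tokens
new_total = elapsed("2025-02-15 17:31:43,546", "2025-02-15 17:32:59,757")
new_supervised = elapsed("2025-02-15 17:32:59,757", "2025-02-15 17:37:01,314")

# Run on release v0.6.0, same sequence of log lines
old_total = elapsed("2025-02-13 09:25:49,734", "2025-02-13 09:30:49,085")
old_supervised = elapsed("2025-02-13 09:30:49,085", "2025-02-13 09:48:17,172")

print(f"with fix: {new_total:.1f}s, then {new_supervised:.1f}s")  # 76.2s, then 241.6s
print(f"v0.6.0:   {old_total:.1f}s, then {old_supervised:.1f}s")  # 299.4s, then 1048.1s
```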

Select input_ids explicitly after panda conversion

5fb0d1f


winglian approved these changes Feb 15, 2025


bursteratom merged commit 97a2fa2 into axolotl-ai-cloud:main Feb 17, 2025
11 checks passed
