output.txt · 502 lines (311 loc) · 27.3 KB
Unsloth: unsloth/tinyllama-bnb-4bit can only handle sequence lengths of at most 2048.
But with kaiokendev's RoPE scaling of 2.0, it can be magically be extended to 4096!
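The 2048 → 4096 extension in the message above comes from linear (kaiokendev-style) RoPE scaling: position indices are divided by the scale factor before the rotary angles are computed, so positions up to 4096 land in the range the model was trained on. A minimal sketch — the helper name and tiny head dimension are illustrative, not Unsloth's code:

```python
def rope_angles(pos, dim=8, base=10000.0, scaling=1.0):
    """Rotary-embedding angles for one position (tiny illustrative head dim).

    Linear (kaiokendev-style) RoPE scaling divides the position index by
    the scale factor, so scaled positions reuse the trained 0..2048 range.
    """
    pos = pos / scaling
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# With scaling 2.0, position 4096 lands exactly where trained
# position 2048 did without scaling:
assert rope_angles(4096, scaling=2.0) == rope_angles(2048, scaling=1.0)
```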
Unsloth 2025.3.5 patched 22 layers with 22 QKV layers, 22 O layers and 22 MLP layers.
/workspace/Personalized_LLM/unsloth_compiled_cache/UnslothSFTTrainer.py:497: UserWarning: You passed a `packing` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
/workspace/Personalized_LLM/unsloth_compiled_cache/UnslothSFTTrainer.py:585: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
/workspace/Personalized_LLM/unsloth_compiled_cache/UnslothSFTTrainer.py:599: UserWarning: You passed a `dataset_num_proc` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
/workspace/Personalized_LLM/unsloth_compiled_cache/UnslothSFTTrainer.py:613: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
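The four UserWarnings above are benign: each argument passed directly to `SFTTrainer` simply overrides the matching `SFTConfig` field. They disappear if the values live in the config itself. A hedged sketch — the field values below are placeholders, not taken from this run:

```python
from trl import SFTConfig, SFTTrainer

# Putting these in SFTConfig instead of passing them to SFTTrainer
# avoids the override warnings; the values are placeholders.
config = SFTConfig(
    output_dir="outputs",
    packing=False,
    max_seq_length=4096,
    dataset_num_proc=2,
    dataset_text_field="text",
)
# trainer = SFTTrainer(model=model, train_dataset=dataset, args=config)
```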
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 3,013 | Num Epochs = 1 | Total steps = 94
O^O/ \_/ \ Batch size per device = 8 | Gradient accumulation steps = 4
\ / Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
"-____-" Trainable parameters = 25,231,360/640,837,632 (3.94% trained)
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.3.5: Fast Llama patching. Transformers: 4.49.0.
\\ /| NVIDIA A40. Num GPUs = 4. Max memory: 44.458 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
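The banner's numbers above are internally consistent: the effective batch size is 8 × 4 × 1 = 32, the 94 total steps match 3,013 examples at that batch size, and "3.94% trained" is the adapter parameter count over the full model. A quick check (all numbers taken from the log):

```python
# Cross-checking the banner's arithmetic.
examples, per_device, grad_accum, gpus = 3013, 8, 4, 1
effective_batch = per_device * grad_accum * gpus
assert effective_batch == 32
# 94 total steps is consistent with 3013 // 32 (last partial batch dropped).
assert examples // effective_batch == 94

trainable, total = 25_231_360, 640_837_632
assert round(100 * trainable / total, 2) == 3.94  # "3.94% trained"
```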
[training progress bar: steps 1-79/94 (0% -> 84%), ~27.5 s/it, elapsed 36:24]
Unsloth: Will smartly offload gradients to save VRAM!
{'loss': 4.8967, 'grad_norm': 2.310270309448242, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.01}
{'loss': 4.8638, 'grad_norm': 2.257821559906006, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}
{'loss': 4.8339, 'grad_norm': 2.302016019821167, 'learning_rate': 6e-06, 'epoch': 0.03}
{'loss': 4.882, 'grad_norm': 2.1854441165924072, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.04}
{'loss': 4.8903, 'grad_norm': 2.2418012619018555, 'learning_rate': 1e-05, 'epoch': 0.05}
{'loss': 4.8991, 'grad_norm': 2.2013003826141357, 'learning_rate': 1.2e-05, 'epoch': 0.06}
{'loss': 4.791, 'grad_norm': 2.2659144401550293, 'learning_rate': 1.4e-05, 'epoch': 0.07}
{'loss': 4.8486, 'grad_norm': 2.356858253479004, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.08}
{'loss': 4.7992, 'grad_norm': 2.281806468963623, 'learning_rate': 1.8e-05, 'epoch': 0.1}
{'loss': 4.767, 'grad_norm': 2.1651251316070557, 'learning_rate': 2e-05, 'epoch': 0.11}
{'loss': 4.7799, 'grad_norm': 2.2765543460845947, 'learning_rate': 1.9761904761904763e-05, 'epoch': 0.12}
{'loss': 4.705, 'grad_norm': 2.200310468673706, 'learning_rate': 1.9523809523809524e-05, 'epoch': 0.13}
{'loss': 4.6596, 'grad_norm': 2.059007406234741, 'learning_rate': 1.928571428571429e-05, 'epoch': 0.14}
{'loss': 4.6297, 'grad_norm': 2.0557570457458496, 'learning_rate': 1.904761904761905e-05, 'epoch': 0.15}
{'loss': 4.6128, 'grad_norm': 1.8565906286239624, 'learning_rate': 1.880952380952381e-05, 'epoch': 0.16}
{'loss': 4.5706, 'grad_norm': 1.7304823398590088, 'learning_rate': 1.8571428571428575e-05, 'epoch': 0.17}
{'loss': 4.5382, 'grad_norm': 1.5327904224395752, 'learning_rate': 1.8333333333333333e-05, 'epoch': 0.18}
{'loss': 4.5219, 'grad_norm': 1.480475902557373, 'learning_rate': 1.8095238095238097e-05, 'epoch': 0.19}
{'loss': 4.459, 'grad_norm': 1.309133768081665, 'learning_rate': 1.785714285714286e-05, 'epoch': 0.2}
{'loss': 4.4424, 'grad_norm': 1.1774230003356934, 'learning_rate': 1.761904761904762e-05, 'epoch': 0.21}
{'loss': 4.4484, 'grad_norm': 1.2359248399734497, 'learning_rate': 1.7380952380952384e-05, 'epoch': 0.22}
{'loss': 4.3697, 'grad_norm': 1.073711633682251, 'learning_rate': 1.7142857142857142e-05, 'epoch': 0.23}
{'loss': 4.381, 'grad_norm': 1.0492337942123413, 'learning_rate': 1.6904761904761906e-05, 'epoch': 0.24}
{'loss': 4.3429, 'grad_norm': 1.0871938467025757, 'learning_rate': 1.6666666666666667e-05, 'epoch': 0.25}
{'loss': 4.3309, 'grad_norm': 0.9203954339027405, 'learning_rate': 1.642857142857143e-05, 'epoch': 0.27}
{'loss': 4.3189, 'grad_norm': 0.9132876396179199, 'learning_rate': 1.6190476190476193e-05, 'epoch': 0.28}
{'loss': 4.2767, 'grad_norm': 0.959171712398529, 'learning_rate': 1.5952380952380954e-05, 'epoch': 0.29}
{'loss': 4.2825, 'grad_norm': 0.8654213547706604, 'learning_rate': 1.5714285714285715e-05, 'epoch': 0.3}
{'loss': 4.2717, 'grad_norm': 0.8454081416130066, 'learning_rate': 1.5476190476190476e-05, 'epoch': 0.31}
{'loss': 4.2575, 'grad_norm': 0.8942322134971619, 'learning_rate': 1.523809523809524e-05, 'epoch': 0.32}
{'loss': 4.2439, 'grad_norm': 0.811410665512085, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.33}
{'loss': 4.2275, 'grad_norm': 0.7689286470413208, 'learning_rate': 1.4761904761904763e-05, 'epoch': 0.34}
{'loss': 4.2043, 'grad_norm': 0.7827208638191223, 'learning_rate': 1.4523809523809524e-05, 'epoch': 0.35}
{'loss': 4.1946, 'grad_norm': 0.7508720755577087, 'learning_rate': 1.4285714285714287e-05, 'epoch': 0.36}
{'loss': 4.1864, 'grad_norm': 0.7675748467445374, 'learning_rate': 1.4047619047619048e-05, 'epoch': 0.37}
{'loss': 4.1645, 'grad_norm': 0.7707952260971069, 'learning_rate': 1.3809523809523811e-05, 'epoch': 0.38}
{'loss': 4.1642, 'grad_norm': 0.7825920581817627, 'learning_rate': 1.3571428571428574e-05, 'epoch': 0.39}
{'loss': 4.1302, 'grad_norm': 0.8003215789794922, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.4}
{'loss': 4.1328, 'grad_norm': 0.7642289996147156, 'learning_rate': 1.3095238095238096e-05, 'epoch': 0.41}
{'loss': 4.1293, 'grad_norm': 0.7745412588119507, 'learning_rate': 1.2857142857142859e-05, 'epoch': 0.42}
{'loss': 4.1015, 'grad_norm': 0.7470276951789856, 'learning_rate': 1.261904761904762e-05, 'epoch': 0.44}
{'loss': 4.0979, 'grad_norm': 0.793340802192688, 'learning_rate': 1.2380952380952383e-05, 'epoch': 0.45}
{'loss': 4.1179, 'grad_norm': 0.7525438070297241, 'learning_rate': 1.2142857142857142e-05, 'epoch': 0.46}
{'loss': 4.1076, 'grad_norm': 0.7825291156768799, 'learning_rate': 1.1904761904761905e-05, 'epoch': 0.47}
{'loss': 4.0744, 'grad_norm': 0.7988024950027466, 'learning_rate': 1.1666666666666668e-05, 'epoch': 0.48}
{'loss': 4.0647, 'grad_norm': 0.7582821249961853, 'learning_rate': 1.1428571428571429e-05, 'epoch': 0.49}
{'loss': 4.0615, 'grad_norm': 0.8074766993522644, 'learning_rate': 1.1190476190476192e-05, 'epoch': 0.5}
{'loss': 4.0755, 'grad_norm': 0.7270247936248779, 'learning_rate': 1.0952380952380955e-05, 'epoch': 0.51}
{'loss': 4.0713, 'grad_norm': 0.7767596244812012, 'learning_rate': 1.0714285714285714e-05, 'epoch': 0.52}
{'loss': 4.025, 'grad_norm': 0.742392361164093, 'learning_rate': 1.0476190476190477e-05, 'epoch': 0.53}
{'loss': 4.0311, 'grad_norm': 0.758925199508667, 'learning_rate': 1.0238095238095238e-05, 'epoch': 0.54}
{'loss': 4.025, 'grad_norm': 0.8127023577690125, 'learning_rate': 1e-05, 'epoch': 0.55}
{'loss': 4.0288, 'grad_norm': 0.7399987578392029, 'learning_rate': 9.761904761904762e-06, 'epoch': 0.56}
{'loss': 4.0109, 'grad_norm': 0.7770053744316101, 'learning_rate': 9.523809523809525e-06, 'epoch': 0.57}
{'loss': 3.991, 'grad_norm': 0.756092369556427, 'learning_rate': 9.285714285714288e-06, 'epoch': 0.58}
{'loss': 4.0127, 'grad_norm': 0.7763499021530151, 'learning_rate': 9.047619047619049e-06, 'epoch': 0.59}
{'loss': 4.0023, 'grad_norm': 0.8087713718414307, 'learning_rate': 8.80952380952381e-06, 'epoch': 0.6}
{'loss': 3.9964, 'grad_norm': 0.7578100562095642, 'learning_rate': 8.571428571428571e-06, 'epoch': 0.62}
{'loss': 3.977, 'grad_norm': 0.7600986957550049, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.63}
{'loss': 3.9964, 'grad_norm': 0.7667102813720703, 'learning_rate': 8.095238095238097e-06, 'epoch': 0.64}
{'loss': 3.9954, 'grad_norm': 0.7662025690078735, 'learning_rate': 7.857142857142858e-06, 'epoch': 0.65}
{'loss': 3.9666, 'grad_norm': 0.7695186734199524, 'learning_rate': 7.61904761904762e-06, 'epoch': 0.66}
{'loss': 3.9749, 'grad_norm': 0.7401809692382812, 'learning_rate': 7.380952380952382e-06, 'epoch': 0.67}
{'loss': 3.9254, 'grad_norm': 0.751841127872467, 'learning_rate': 7.1428571428571436e-06, 'epoch': 0.68}
{'loss': 3.9493, 'grad_norm': 0.7466264963150024, 'learning_rate': 6.9047619047619055e-06, 'epoch': 0.69}
{'loss': 3.9628, 'grad_norm': 0.7644298076629639, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.7}
{'loss': 3.9381, 'grad_norm': 0.810420036315918, 'learning_rate': 6.4285714285714295e-06, 'epoch': 0.71}
{'loss': 3.9331, 'grad_norm': 0.8251085877418518, 'learning_rate': 6.1904761904761914e-06, 'epoch': 0.72}
{'loss': 3.9269, 'grad_norm': 0.77117919921875, 'learning_rate': 5.9523809523809525e-06, 'epoch': 0.73}
{'loss': 3.9534, 'grad_norm': 0.7831836938858032, 'learning_rate': 5.7142857142857145e-06, 'epoch': 0.74}
{'loss': 3.9452, 'grad_norm': 0.7542756199836731, 'learning_rate': 5.476190476190477e-06, 'epoch': 0.75}
{'loss': 3.9413, 'grad_norm': 0.8254461884498596, 'learning_rate': 5.2380952380952384e-06, 'epoch': 0.76}
{'loss': 3.9406, 'grad_norm': 0.8346537947654724, 'learning_rate': 5e-06, 'epoch': 0.77}
{'loss': 3.9208, 'grad_norm': 0.8948437571525574, 'learning_rate': 4.761904761904762e-06, 'epoch': 0.79}
{'loss': 3.9414, 'grad_norm': 0.8242635726928711, 'learning_rate': 4.523809523809524e-06, 'epoch': 0.8}
{'loss': 3.8973, 'grad_norm': 0.80201655626297, 'learning_rate': 4.2857142857142855e-06, 'epoch': 0.81}
{'loss': 3.8862, 'grad_norm': 0.8336827158927917, 'learning_rate': 4.047619047619048e-06, 'epoch': 0.82}
{'loss': 3.8987, 'grad_norm': 0.8546895980834961, 'learning_rate': 3.80952380952381e-06, 'epoch': 0.83}
[training progress bar: steps 79-94/94 (84% -> 100%), ~27.5 s/it, finished in 43:18]
{'loss': 3.8895, 'grad_norm': 0.8185483813285828, 'learning_rate': 3.5714285714285718e-06, 'epoch': 0.84}
{'loss': 3.8524, 'grad_norm': 0.8314951062202454, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.85}
{'loss': 3.8641, 'grad_norm': 0.8705918192863464, 'learning_rate': 3.0952380952380957e-06, 'epoch': 0.86}
{'loss': 3.8937, 'grad_norm': 0.8476338386535645, 'learning_rate': 2.8571428571428573e-06, 'epoch': 0.87}
{'loss': 3.8809, 'grad_norm': 0.8714267611503601, 'learning_rate': 2.6190476190476192e-06, 'epoch': 0.88}
{'loss': 3.8844, 'grad_norm': 0.8967968821525574, 'learning_rate': 2.380952380952381e-06, 'epoch': 0.89}
{'loss': 3.8975, 'grad_norm': 0.8912258744239807, 'learning_rate': 2.1428571428571427e-06, 'epoch': 0.9}
{'loss': 3.8921, 'grad_norm': 0.8876487612724304, 'learning_rate': 1.904761904761905e-06, 'epoch': 0.91}
{'loss': 3.8804, 'grad_norm': 0.8485371470451355, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.92}
{'loss': 3.9176, 'grad_norm': 0.8650866150856018, 'learning_rate': 1.4285714285714286e-06, 'epoch': 0.93}
{'loss': 3.8728, 'grad_norm': 0.9165499210357666, 'learning_rate': 1.1904761904761906e-06, 'epoch': 0.94}
{'loss': 3.8875, 'grad_norm': 0.8359355330467224, 'learning_rate': 9.523809523809525e-07, 'epoch': 0.95}
{'loss': 3.8489, 'grad_norm': 2.6129801273345947, 'learning_rate': 7.142857142857143e-07, 'epoch': 0.97}
{'loss': 3.8747, 'grad_norm': 0.8886895775794983, 'learning_rate': 4.7619047619047623e-07, 'epoch': 0.98}
{'loss': 3.8409, 'grad_norm': 0.9427151083946228, 'learning_rate': 2.3809523809523811e-07, 'epoch': 0.99}
{'loss': 3.876, 'grad_norm': 0.8977408409118652, 'learning_rate': 0.0, 'epoch': 1.0}
{'train_runtime': 2598.3802, 'train_samples_per_second': 1.16, 'train_steps_per_second': 0.036, 'train_loss': 4.18370370408322, 'epoch': 1.0}
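The learning-rate column above follows a standard linear-warmup/linear-decay schedule: 10 warmup steps up to the 2e-05 peak, then linear decay to zero over the remaining 84 steps; the final throughput figures also check out against the 2,598 s runtime. A sketch reproducing both (the helper is illustrative, not the trainer's code):

```python
def lr_at(step, peak=2e-05, warmup=10, total=94):
    """Linear warmup to the peak, then linear decay to zero."""
    if step <= warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)

# Matches the logged values at the start, peak, first decay step, and end.
assert abs(lr_at(1) - 2e-06) < 1e-15
assert abs(lr_at(10) - 2e-05) < 1e-15
assert abs(lr_at(11) - 1.9761904761904763e-05) < 1e-15
assert lr_at(94) == 0.0

# Throughput: 94 steps of 32 samples over the 2598.38 s runtime.
assert round(94 * 32 / 2598.3802, 2) == 1.16  # train_samples_per_second
assert round(94 / 2598.3802, 3) == 0.036      # train_steps_per_second
```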