Model Not Converging (Issue with ControlNet Fine-Tuning Script) #1900
Replies: 3 comments 4 replies
-
Hi @MaybeRichard, thanks for your interest here! I recommend starting by checking your data to ensure the dataset is loaded correctly and contains meaningful samples. It’s a good idea to visualize a few examples (inputs and their corresponding labels/masks) to verify proper alignment. Next, confirm that the label masks are binary (0 and 1) and appropriately normalized. Finally, begin with a low learning rate to stabilize the training process. cc @guopengf for additional suggestions.
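The mask check above can be sketched roughly as follows; this is a minimal NumPy example, and the `check_mask` helper is illustrative, not part of the repository's script:

```python
import numpy as np

def check_mask(mask: np.ndarray) -> np.ndarray:
    """Verify a label mask is binary and normalize it to values {0, 1}."""
    values = np.unique(mask)
    if len(values) > 2:
        # More than two distinct values usually means the labels were
        # not binarized (e.g. raw multi-class segmentation labels).
        raise ValueError(f"mask is not binary, found values: {values}")
    # Normalize e.g. {0, 255} or {0.0, 1.0} down to integer {0, 1}.
    return (mask > values.min()).astype(np.uint8)
```

For the visual alignment check, one option is to overlay `check_mask(mask)` on the corresponding input slice with `matplotlib.pyplot.imshow(..., alpha=0.4)` for a handful of samples.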
-
Thanks for your suggestions! I downloaded the KiTS dataset and the corresponding JSON file from the NGC Catalog, so I assumed the data was already preprocessed. I will try your suggestions, thanks again!
-
Sorry to bother you again. I tried the solutions you suggested, but the network still does not converge.
Note: I am using the KiTS dataset and the corresponding JSON files provided in the NGC Catalog.
-
Description:
Hello, I am encountering an issue with the ControlNet fine-tuning script provided in this repository: when fine-tuning on both the KiTS dataset and my own custom dataset, the model fails to converge.
To Reproduce
Steps to reproduce the behavior:
Problem Background:
Screenshots

Output
[2024-12-14 17:52:38.415] INFO - load trained controlnet model from ./models/controlnet-20datasets-e20wl100fold0bc_noi_dia_fsize_current.pt
[2024-12-14 17:52:38.429] INFO - total number of training steps: 600.0.
[2024-12-14 17:52:38.430] INFO - apply weighted loss = 100 on labels: [129]
[Epoch 1/100] [Batch 1/6] [LR: 0.00000997] [loss: 0.0594] ETA: 0:00:57.509962
[Epoch 1/100] [Batch 2/6] [LR: 0.00000993] [loss: 0.0397] ETA: 0:00:05.898367
[Epoch 1/100] [Batch 3/6] [LR: 0.00000990] [loss: 0.0117] ETA: 0:00:04.425231
[Epoch 1/100] [Batch 4/6] [LR: 0.00000987] [loss: 0.0154] ETA: 0:00:02.951793
[Epoch 1/100] [Batch 5/6] [LR: 0.00000983] [loss: 0.0096] ETA: 0:00:01.475626
[Epoch 1/100] [Batch 6/6] [LR: 0.00000980] [loss: 0.0155] ETA: 0:00:00
[2024-12-14 17:52:58.030] INFO - best loss -> 0.02520664595067501.
[Epoch 2/100] [Batch 1/6] [LR: 0.00000977] [loss: 0.1253] ETA: 0:00:31.438770
[Epoch 2/100] [Batch 2/6] [LR: 0.00000974] [loss: 0.5812] ETA: 0:00:05.911908
[Epoch 2/100] [Batch 3/6] [LR: 0.00000970] [loss: 0.7666] ETA: 0:00:04.445841
[Epoch 2/100] [Batch 4/6] [LR: 0.00000967] [loss: 0.2461] ETA: 0:00:02.972570
[Epoch 2/100] [Batch 5/6] [LR: 0.00000964] [loss: 0.0693] ETA: 0:00:01.480636
[Epoch 2/100] [Batch 6/6] [LR: 0.00000960] [loss: 0.0123] ETA: 0:00:00
[Epoch 3/100] [Batch 1/6] [LR: 0.00000957] [loss: 0.3098] ETA: 0:00:32.303094
[Epoch 3/100] [Batch 2/6] [LR: 0.00000954] [loss: 0.4730] ETA: 0:00:06.003017
[Epoch 3/100] [Batch 3/6] [LR: 0.00000951] [loss: 0.2315] ETA: 0:00:04.474122
[Epoch 3/100] [Batch 4/6] [LR: 0.00000947] [loss: 0.2388] ETA: 0:00:02.972777
[Epoch 3/100] [Batch 5/6] [LR: 0.00000944] [loss: 0.4927] ETA: 0:00:01.482379
[Epoch 3/100] [Batch 6/6] [LR: 0.00000941] [loss: 0.6689] ETA: 0:00:00
[Epoch 4/100] [Batch 1/6] [LR: 0.00000938] [loss: 0.0097] ETA: 0:00:32.200603
[Epoch 4/100] [Batch 2/6] [LR: 0.00000934] [loss: 0.2147] ETA: 0:00:05.911577
[Epoch 4/100] [Batch 3/6] [LR: 0.00000931] [loss: 0.1359] ETA: 0:00:04.428170
[Epoch 4/100] [Batch 4/6] [LR: 0.00000928] [loss: 0.1607] ETA: 0:00:02.957243
[Epoch 4/100] [Batch 5/6] [LR: 0.00000925] [loss: 0.2544] ETA: 0:00:01.484895
[Epoch 4/100] [Batch 6/6] [LR: 0.00000922] [loss: 0.1566] ETA: 0:00:00
[Epoch 5/100] [Batch 1/6] [LR: 0.00000918] [loss: 0.3608] ETA: 0:00:39.206282
[Epoch 5/100] [Batch 2/6] [LR: 0.00000915] [loss: 0.3654] ETA: 0:00:05.935111
[Epoch 5/100] [Batch 3/6] [LR: 0.00000912] [loss: 0.1226] ETA: 0:00:04.445168
[Epoch 5/100] [Batch 4/6] [LR: 0.00000909] [loss: 0.4570] ETA: 0:00:02.968678
[Epoch 5/100] [Batch 5/6] [LR: 0.00000906] [loss: 0.1441] ETA: 0:00:01.491980
[Epoch 5/100] [Batch 6/6] [LR: 0.00000903] [loss: 0.2887] ETA: 0:00:00
[Epoch 6/100] [Batch 1/6] [LR: 0.00000899] [loss: 0.3012] ETA: 0:00:33.762065
[Epoch 6/100] [Batch 2/6] [LR: 0.00000896] [loss: 0.7926] ETA: 0:00:06.078450
[Epoch 6/100] [Batch 3/6] [LR: 0.00000893] [loss: 0.0092] ETA: 0:00:04.465131
[Epoch 6/100] [Batch 4/6] [LR: 0.00000890] [loss: 0.2825] ETA: 0:00:02.993035
[Epoch 6/100] [Batch 5/6] [LR: 0.00000887] [loss: 0.1342] ETA: 0:00:01.493944
[Epoch 6/100] [Batch 6/6] [LR: 0.00000884] [loss: 0.0131] ETA: 0:00:00
[Epoch 7/100] [Batch 1/6] [LR: 0.00000880] [loss: 0.6823] ETA: 0:00:36.323832
[Epoch 7/100] [Batch 2/6] [LR: 0.00000877] [loss: 0.0610] ETA: 0:00:06.089784
[Epoch 7/100] [Batch 3/6] [LR: 0.00000874] [loss: 0.5895] ETA: 0:00:04.482495
[Epoch 7/100] [Batch 4/6] [LR: 0.00000871] [loss: 0.0163] ETA: 0:00:02.972429
[Epoch 7/100] [Batch 5/6] [LR: 0.00000868] [loss: 0.7321] ETA: 0:00:01.492317
[Epoch 7/100] [Batch 6/6] [LR: 0.00000865] [loss: 0.6678] ETA: 0:00:00
[Epoch 8/100] [Batch 1/6] [LR: 0.00000862] [loss: 0.1816] ETA: 0:00:30.358437
[Epoch 8/100] [Batch 2/6] [LR: 0.00000859] [loss: 0.3500] ETA: 0:00:05.981302
[Epoch 8/100] [Batch 3/6] [LR: 0.00000856] [loss: 0.1950] ETA: 0:00:04.491425
[Epoch 8/100] [Batch 4/6] [LR: 0.00000853] [loss: 0.0150] ETA: 0:00:02.980718
[Epoch 8/100] [Batch 5/6] [LR: 0.00000849] [loss: 0.6160] ETA: 0:00:01.494416
[Epoch 8/100] [Batch 6/6] [LR: 0.00000846] [loss: 0.7288] ETA: 0:00:00
[Epoch 9/100] [Batch 1/6] [LR: 0.00000843] [loss: 0.2774] ETA: 0:00:30.263555
[Epoch 9/100] [Batch 2/6] [LR: 0.00000840] [loss: 0.2248] ETA: 0:00:05.925602
[Epoch 9/100] [Batch 3/6] [LR: 0.00000837] [loss: 0.1062] ETA: 0:00:04.446892
[Epoch 9/100] [Batch 4/6] [LR: 0.00000834] [loss: 0.7962] ETA: 0:00:02.997850
[Epoch 9/100] [Batch 5/6] [LR: 0.00000831] [loss: 0.3089] ETA: 0:00:01.493284
[Epoch 9/100] [Batch 6/6] [LR: 0.00000828] [loss: 0.5279] ETA: 0:00:00
[Epoch 10/100] [Batch 1/6] [LR: 0.00000825] [loss: 0.1299] ETA: 0:00:35.395999
[Epoch 10/100] [Batch 2/6] [LR: 0.00000822] [loss: 0.5649] ETA: 0:00:05.970950
[Epoch 10/100] [Batch 3/6] [LR: 0.00000819] [loss: 0.3622] ETA: 0:00:04.472051
[Epoch 10/100] [Batch 4/6] [LR: 0.00000816] [loss: 0.5737] ETA: 0:00:02.999292
[Epoch 10/100] [Batch 5/6] [LR: 0.00000813] [loss: 0.1142] ETA: 0:00:01.490247
[Epoch 10/100] [Batch 6/6] [LR: 0.00000810] [loss: 0.0369] ETA: 0:00:00
[Epoch 11/100] [Batch 1/6] [LR: 0.00000807] [loss: 0.7214] ETA: 0:00:29.923871
[Epoch 11/100] [Batch 2/6] [LR: 0.00000804] [loss: 0.0399] ETA: 0:00:06.025214
[Epoch 11/100] [Batch 3/6] [LR: 0.00000801] [loss: 0.0103] ETA: 0:00:04.477165
[Epoch 11/100] [Batch 4/6] [LR: 0.00000798] [loss: 0.3072] ETA: 0:00:02.993378
[Epoch 11/100] [Batch 5/6] [LR: 0.00000795] [loss: 0.5458] ETA: 0:00:01.499980
[Epoch 11/100] [Batch 6/6] [LR: 0.00000792] [loss: 0.2169] ETA: 0:00:00
[Epoch 12/100] [Batch 1/6] [LR: 0.00000789] [loss: 0.0542] ETA: 0:00:35.199655
[Epoch 12/100] [Batch 2/6] [LR: 0.00000786] [loss: 0.1887] ETA: 0:00:06.034812
[Epoch 12/100] [Batch 3/6] [LR: 0.00000783] [loss: 0.0109] ETA: 0:00:04.507118
[Epoch 12/100] [Batch 4/6] [LR: 0.00000780] [loss: 0.2449] ETA: 0:00:03.011293
[Epoch 12/100] [Batch 5/6] [LR: 0.00000777] [loss: 0.7797] ETA: 0:00:01.508624
[Epoch 12/100] [Batch 6/6] [LR: 0.00000774] [loss: 0.1621] ETA: 0:00:00
... (epochs 13–72 omitted) ...
[Epoch 73/100] [Batch 4/6] [LR: 0.00000075] [loss: 0.0447] ETA: 0:00:03.022383
[Epoch 73/100] [Batch 5/6] [LR: 0.00000074] [loss: 0.4118] ETA: 0:00:01.515217
[Epoch 73/100] [Batch 6/6] [LR: 0.00000073] [loss: 0.4299] ETA: 0:00:00
[Epoch 74/100] [Batch 1/6] [LR: 0.00000072] [loss: 0.0393] ETA: 0:00:31.598071
[Epoch 74/100] [Batch 2/6] [LR: 0.00000071] [loss: 0.0467] ETA: 0:00:06.056095
[Epoch 74/100] [Batch 3/6] [LR: 0.00000070] [loss: 0.7438] ETA: 0:00:04.549044
[Epoch 74/100] [Batch 4/6] [LR: 0.00000069] [loss: 0.7230] ETA: 0:00:03.040020
[Epoch 74/100] [Batch 5/6] [LR: 0.00000068] [loss: 0.0777] ETA: 0:00:01.519922
[Epoch 74/100] [Batch 6/6] [LR: 0.00000068] [loss: 0.0102] ETA: 0:00:00
[Epoch 75/100] [Batch 1/6] [LR: 0.00000067] [loss: 0.5960] ETA: 0:00:32.156808
[Epoch 75/100] [Batch 2/6] [LR: 0.00000066] [loss: 0.6143] ETA: 0:00:06.073716
[Epoch 75/100] [Batch 3/6] [LR: 0.00000065] [loss: 0.1679] ETA: 0:00:04.548900
[Epoch 75/100] [Batch 4/6] [LR: 0.00000064] [loss: 0.3282] ETA: 0:00:03.040982
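One note on reading the log above: per-batch diffusion losses are inherently noisy, since each step samples a random timestep and noise, so oscillation by itself does not prove divergence; a smoothed curve is a more reliable convergence signal. A minimal sketch of exponential moving-average smoothing (an illustrative helper, not part of the training script):

```python
def smooth_losses(losses, beta=0.98):
    """Exponential moving average of a noisy loss curve (higher beta = smoother)."""
    smoothed, avg = [], None
    for loss in losses:
        # Initialize with the first value, then blend each new loss in.
        avg = loss if avg is None else beta * avg + (1 - beta) * loss
        smoothed.append(avg)
    return smoothed
```

Plotting `smooth_losses` of the per-batch values over many epochs makes a genuine downward (or flat) trend much easier to see than the raw printout.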
Environment (please complete the following information):