Loss is zero while training ViTPose Base with custom dataset #138


Open
MaxRondelli opened this issue Jun 5, 2024 · 3 comments

Comments

@MaxRondelli

MaxRondelli commented Jun 5, 2024

I am trying to fine-tune ViTPose Base (trained on COCO 256x192) with a custom dataset. From the very beginning of training, my losses are already zero.

2024-06-05 17:09:38,939 - mmpose - INFO - Epoch [1][1/18] lr: 2.376e-10, eta: 14 days, 5:07:40, time: 682.635, data_time: 2.816, heatmap_loss: 0.0000, acc_pose: 0.0000, loss: 0.0000, grad_norm: 0.0000

While debugging, I've seen that the target tensor is composed entirely of zeros: target.any() returns False, and the losses object is {'heatmap_loss': tensor(0., grad_fn=<MulBackward0>), 'acc_pose': 0.0}.

The images are all in the images folder. My train.json and val.json follow this format (as described in the documentation):

```json
[
    {
        "image_file": "100-0.png",
        "image_size": [ ... ],
        "bbox": [ ... ],
        "keypoints": [ ... ],
        ...
    }
]
```
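As a quick sanity check, a short script like the following can flag entries with missing keys or with all visibility flags set to zero, which would produce exactly this kind of all-zero heatmap target. This is a minimal sketch: the flat `x, y, v` keypoint layout and the key names are assumptions based on the format above.

```python
import json

def validate_annotations(entries):
    """Return a list of problems found in ViTPose-style annotation entries."""
    problems = []
    for i, entry in enumerate(entries):
        # Every entry needs these keys; a missing one can yield empty targets.
        for key in ("image_file", "image_size", "bbox", "keypoints"):
            if key not in entry:
                problems.append(f"entry {i}: missing '{key}'")
        # Assuming a flat [x, y, v, x, y, v, ...] layout, the visibility flags
        # are every third value. If all flags are 0, the target heatmap is
        # all zeros and the loss stays at 0.0.
        vis = entry.get("keypoints", [])[2::3]
        if vis and not any(v > 0 for v in vis):
            problems.append(f"entry {i}: all keypoints invisible")
    return problems

# Demo with an inline entry whose visibility flags are all zero.
demo = [{"image_file": "100-0.png", "image_size": [192, 256],
         "bbox": [0, 0, 50, 80], "keypoints": [10, 20, 0, 30, 40, 0]}]
print(validate_annotations(demo))  # → ["entry 0: all keypoints invisible"]
```

To check a real file, replace the inline `demo` list with `json.load(open("train.json"))`.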

Does anyone know why this happens? Can anyone suggest documentation or a tutorial for fine-tuning a network on a custom dataset? I've seen some overlapping and contradictory information between the ViTPose and MMPose docs.

Thank you in advance.

@Logancreator

Hi @MaxRondelli,

I noticed that you closed this issue, which makes me think you might have already resolved it. Could you share some insights or suggestions on how you tackled it?

Regards,

@MaxRondelli
Author

Hi @Logancreator,

Actually, I closed the issue after a long time because I found another solution that doesn't use ViTPose.

I only closed it because I wasn't getting any feedback from the community. I could reopen it, though; it might be helpful.

Best,

@MaxRondelli MaxRondelli reopened this Mar 19, 2025
@KevinChan1799

@MaxRondelli @Logancreator
Hi, I ran into the same situation during training, but when I went back over my custom dataset (COCO format) I found the problem. I had annotated the keypoint data with labelme and converted it to COCO format with a conversion script, but the script did not write the "area" field. Below is the COCO annotation that trained correctly after my fix, for reference:

```json
{
    "id": 56,
    "image_id": 57,
    "category_id": 1,
    "iscrowd": 0,
    "bbox": [430.59523809523813, 282.49999999999994, 1455.9523809523812, 989.2857142857144],
    "area": 1440352.891156463,
    "segmentation": [[430.59523809523813, 282.49999999999994, 1886.5476190476193, 1271.7857142857144]],
    "keypoints": [606.7857142857144, 476.54761904761904, 1,
                  1856.7857142857144, 713.452380952381, 1,
                  1716.309523809524, 746.7857142857142, 1,
                  1656.7857142857144, 558.6904761904761, 1,
                  1573.4523809523812, 902.7380952380952, 1,
                  1512.7380952380954, 968.2142857142857, 1,
                  778.2142857142859, 614.6428571428571, 1,
                  763.9285714285716, 1001.5476190476189, 1,
                  738.9285714285716, 1214.6428571428573, 1,
                  693.6904761904763, 1003.9285714285714, 1,
                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 0, 0, 0, 0, 0, 0],
    "num_keypoints": 10
}
```

After correcting the "area" field, my program computed the AP and AR values correctly, so you may want to check whether your annotation file is correct.
That is my solution; I hope it helps!
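If the "area" field is missing from an annotation file, it can be backfilled from the bounding box. This is a minimal sketch using bbox width x height as the area; note that COCO formally defines "area" as the segmentation area, so bbox area is only a common approximation.

```python
def fill_missing_area(coco):
    """Add an 'area' field (bbox width * height) to COCO annotations that lack it.

    Returns the number of annotations that were fixed. Mutates `coco` in place.
    """
    fixed = 0
    for ann in coco.get("annotations", []):
        if not ann.get("area"):
            # COCO bbox layout is [x, y, width, height].
            x, y, w, h = ann["bbox"]
            ann["area"] = w * h
            fixed += 1
    return fixed

# Demo on a minimal COCO-style dict (hypothetical values).
coco = {"annotations": [{"id": 56, "image_id": 57, "category_id": 1,
                         "bbox": [430.6, 282.5, 1456.0, 989.3]}]}
print(fill_missing_area(coco))  # → 1
print(coco["annotations"][0]["area"])
```

For a real dataset, load the file with `json.load`, run `fill_missing_area`, and write the result back with `json.dump`.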
