Training on ImageNet 64x64 #2

gulperii · 2021-05-21T18:06:57Z

Hello,

I am using ImageNet 64x64 and running the code with the following command:

python BigGAN-PyTorch/train.py --dataset I64_hdf5 --parallel --shuffle --num_workers 8 --batch_size 128 --num_G_accumulations 1 --num_D_accumulations 1 --num_D_steps 1 --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 32 --D_attn 32 --G_nl relu --D_nl relu --SN_eps 1e-8 --BN_eps 1e-5 --adam_eps 1e-8 --G_ortho 0.0 --G_init xavier --D_init xavier --G_eval_mode --G_ch 32 --D_ch 32 --ema --use_ema --ema_start 2000 --test_every 5000 --save_every 1000 --num_best_copies 5 --num_save_copies 2 --seed 0 --which_best FID --num_iters 200000 --num_epochs 1000 --embedding inceptionv3 --density_measure gaussian --retention_ratio 50

and I get this error:

  File "train.py", line 229, in <module>
    main()
  File "train.py", line 226, in main
    run(config)
  File "train.py", line 184, in run
    metrics = train(x, y)
  File "/BigGAN-PyTorch/train_fns.py", line 42, in train
    split_D=config['split_D'])
  File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 140, in forward
    return self.module(*inputs, **kwargs)
  File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/BigGAN-PyTorch/BigGAN.py", line 443, in forward
    D_out = self.D(D_input, D_class)
  File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/BigGAN-PyTorch/BigGAN.py", line 403, in forward
    out = out + torch.sum(self.embed(y) * h, 1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered

Interestingly, when I create a "mini dataset" by randomly selecting 500 images per label from the original ImageNet dataset, the code runs fine. What could be the problem, and how can I solve it?

TDeVries · 2021-05-22T11:10:50Z

This is quite strange; I haven't seen this behaviour before. Is it possible that self.embed(y) is receiving label values outside the valid range, i.e. greater than or equal to the number of classes in the dataset? That is a particularly common failure case that produces this error.
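One way to test this hypothesis is to scan the dataset labels before training. The snippet below is only a minimal sketch: `find_out_of_range_labels` is a hypothetical helper (not part of BigGAN-PyTorch), and it assumes the labels can be loaded as a sequence of integers and that the embedding has `n_classes` rows.

```python
from collections import Counter


def find_out_of_range_labels(labels, n_classes):
    """Count label values that an embedding with n_classes rows cannot index.

    Valid indices are 0 .. n_classes - 1; anything else would trigger the
    device-side assert when passed to the embedding on the GPU.
    """
    bad = Counter()
    for y in labels:
        y = int(y)  # also handles NumPy integer scalars
        if y < 0 or y >= n_classes:
            bad[y] += 1
    return bad


if __name__ == "__main__":
    # Toy labels for illustration only; replace with the labels of the real dataset.
    toy_labels = [0, 5, 999, 1000, -1, 1000]
    print(find_out_of_range_labels(toy_labels, n_classes=1000))
```

If the result is non-empty, the reported values show which labels fall outside the embedding table. For an HDF5 dataset, the label array would first have to be read into memory (for example with h5py); the key under which it is stored is an assumption to check against your own file.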

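Otherwise, you could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set (if you haven't already) for a more informative stack trace. It makes CUDA kernel launches synchronous, so the error is reported at the call that actually caused it.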
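The variable has to be in the environment before the training process initializes CUDA. As a rough sketch, it can be set from a small Python launcher; `blocking_cuda_env` is a hypothetical helper name, and the commented command line is a placeholder for your own arguments.

```python
import os
import subprocess  # used only in the commented launch example below


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    env = blocking_cuda_env()
    print(env["CUDA_LAUNCH_BLOCKING"])
    # Example launch (fill in the same arguments as in the failing run):
    # subprocess.run(["python", "BigGAN-PyTorch/train.py", "--dataset", "I64_hdf5"], env=env)
```

Prefixing the shell command with `CUDA_LAUNCH_BLOCKING=1` has the same effect. Expect training to run slower while the variable is set, so it is best used only for debugging.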
