Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training on ImageNet 64x64 #2

Open
gulperii opened this issue May 21, 2021 · 1 comment
Open

Training on ImageNet 64x64 #2

gulperii opened this issue May 21, 2021 · 1 comment

Comments

@gulperii
Copy link

gulperii commented May 21, 2021

Hello,

I am using ImageNet 64x64 and run the code with the following command :

python BigGAN-PyTorch/train.py --dataset I64_hdf5 --parallel --shuffle --num_workers 8 --batch_size 128 --num_G_accumulations 1 --num_D_accumulations 1 --num_D_steps 1--G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 32 --D_attn 32 --G_nl relu --D_nl relu --SN_eps 1e-8 --BN_eps 1e-5 --adam_eps 1e-8 --G_ortho 0.0 --G_init xavier --D_init xavier --G_eval_mode --G_ch 32 --D_ch 32 --ema --use_ema --ema_start 2000 --test_every 5000 --save_every 1000 --num_best_copies 5 --num_save_copies 2 --seed 0 --which_best FID --num_iters 200000 --num_epochs 1000 --embedding inceptionv3 --density_measure gaussian --retention_ratio 50

and getting this error:

File "train.py", line 229, in
main()
File "train.py", line 226, in main
run(config)
File "train.py", line 184, in run
metrics = train(x, y)
File "/BigGAN-PyTorch/train_fns.py", line 42, in train
split_D=config['split_D'])
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 140, in forward
return self.module(*inputs, **kwargs)
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/BigGAN-PyTorch/BigGAN.py", line 443, in forward
D_out = self.D(D_input, D_class)
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/BigGAN-PyTorch/BigGAN.py", line 403, in forward
out = out + torch.sum(self.embed(y) * h, 1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered

The interesting thing is when I create a "mini dataset" by randomly selecting 500 images per label from the original ImageNet dataset, code runs fine. What could be the problem? How can I solve this issue?

@TDeVries
Copy link
Collaborator

This is quite strange, I haven't seen this behaviour before. Is it possible that self.embed(y) is receiving values greater than the number of classes in the dataset? That seems to be a particularly common failure case that produces this error.

Otherwise you could try running with the flag CUDA_LAUNCH_BLOCKING=1 (if you haven't already) for a more informative stack trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants