
Result Mismatch with the original results in paper in human faces #6


Open
mu-cai opened this issue Aug 9, 2020 · 11 comments

@mu-cai

mu-cai commented Aug 9, 2020

Hi Rosinality,

Thanks for your excellent code!
However, I found that the results on human faces are not good. For example, your results are:

[image]

(The eyeglasses in the generated images clearly differ from what the paper shows.)

[image]

Maybe the problem is the training scheme? Or the cropping method?

@rosinality
Owner

Yes, in many cases the effect of the texture code is quite restrictive. I suspect that differences in the training scheme might be crucial, but I don't know what they could be. (Maybe it is related to image/patch resolutions.)

@mu-cai
Author

mu-cai commented Aug 17, 2020

@rosinality
Thanks for your information!
Over the past few days, I have trained your model on the CelebA-HQ dataset, and the results are quite bad.

[image]

Each row shows original A, original B, reconstructed A, and structure A + texture B. You can see that the resulting image can't even keep the pose of A.

I also trained it on the LSUN church dataset, and the result is not good either.

[image]

You can see that the reconstruction quality is not very good, not to mention the swapping results.

I think there may be several possible issues:

(1) Padding: someone has already pointed that out.
(2) The cropping: for the church dataset, they don't resize first; instead, they crop.

[image]

I also wonder how they keep the original aspect ratio while fixing the shorter side to 256 (see the resize sketch after this list).

(3) The co-occurrence unit: in the paper they state that

[image]

And for each prediction, they do the following operation:

[image]

So this operation should be done 8 times.
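
For reference, a minimal sketch of keeping the shorter side at 256 while preserving the aspect ratio with standard torchvision transforms (the file path is a placeholder, and this is not the repo's actual preprocessing pipeline):

```python
from PIL import Image
from torchvision import transforms

# Passing an int to Resize scales the *shorter* side to that length and keeps
# the aspect ratio; passing a (256, 256) tuple would instead squash the image
# into a square and change the ratio.
preprocess = transforms.Compose([
    transforms.Resize(256),       # shorter side -> 256, ratio preserved
    transforms.RandomCrop(256),   # square crop for training
    transforms.ToTensor(),
])

img = Image.open("church.jpg")    # placeholder path
x = preprocess(img)               # 3 x 256 x 256 tensor, no aspect distortion
```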

Thanks again for your nice work!

Best,
Mu

@rosinality
Owner

rosinality commented Aug 17, 2020

  1. Could you let me know which part of the padding is incorrect?
  2. The LSUN church dataset is already resized so that the shorter side is 256, so resizing will not affect the result.
  3. Do you mean that the co-occurrence discriminator should be applied to 8 patches? Hmm, it seems this is different from the paper. I will try to fix this.

Thank you for your testing & checking!

@mu-cai
Author

mu-cai commented Aug 18, 2020

@rosinality

Thanks for your reply!

  1. You have already fixed this problem yesterday (already committed).

  2. Yes, the shorter side is 256, but the longer side is not fixed. During training, however, your code resizes the image into a square, which changes the aspect ratio and does not match the paper.
    [image]

  3. Yes! Your single operation should be done 8 times, because when you sample one patch from the real image and 8 patches from the fake image, you get just one prediction. You need 8N predictions, not N (see the patch-counting sketch below).
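
To make the counting concrete, here is a toy sketch of the difference; `patch_enc`, `cooccur_logit`, and all shapes are placeholders for illustration, not the repo's actual modules:

```python
import torch
from torch import nn

N, P, R = 4, 8, 4   # batch size, fake patches per image, reference patches

# Tiny placeholder encoder and head, only to make the shape bookkeeping runnable;
# the real co-occurrence discriminator is of course more involved.
patch_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
head = nn.Linear(128 * 2, 1)

def cooccur_logit(patch, refs):
    # One logit per target patch, conditioned on the mean of its reference features.
    target = patch_enc(patch)                                                  # (B, 128)
    ref = patch_enc(refs.flatten(0, 1)).view(refs.size(0), refs.size(1), -1).mean(1)
    return head(torch.cat([target, ref], dim=1))                               # (B, 1)

fake_patches = torch.randn(N, P, 3, 64, 64)   # 8 crops from each generated image
ref_patches = torch.randn(N, R, 3, 64, 64)    # 4 crops from each texture source image

# Treating each image as one sample gives only N predictions; treating every
# fake patch as its own sample gives 8N predictions instead.
flat_fake = fake_patches.view(N * P, 3, 64, 64)
flat_ref = ref_patches.unsqueeze(1).expand(N, P, R, 3, 64, 64).reshape(N * P, R, 3, 64, 64)

logits = cooccur_logit(flat_fake, flat_ref)   # shape (N * P, 1) -> 8N predictions
```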

Thanks again for your answer!

Best,
Mu

@rosinality
Owner

  1. Actually, the padding will not affect the results, as that bug only affects 1x1 convs in the current implementation.
  2. As prepare_data.py does the resizing with torchvision, it will respect aspect ratios by default.
  3. It seems like an important issue. Fixed it in 38cb3ae.

@mu-cai
Author

mu-cai commented Aug 18, 2020

@rosinality

Thanks for the quick fix!
I just ran your code, and I have one more question:
In your code, for each structure/texture pair, you have 8 crops for the real/fake images but only 4 crops for the reference image. However, I think that for each crop of the real/fake image we need 4 reference patches, that is, 4*8 = 32 reference patches in total.

This is my understanding; however, the author didn't state this in the paper... What is your opinion? (A quick shape sketch of the two readings is below.)
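
Spelling out the two readings in tensor shapes (the sizes are arbitrary and only for illustration):

```python
import torch

N = 2                          # images in the batch
crops, refs_per_crop = 8, 4    # target crops per image, reference crops per target crop

# Reading A (current code): 4 reference crops shared by all 8 target crops of one image.
shared_refs = torch.randn(N, refs_per_crop, 3, 64, 64)

# Reading B (proposed): 4 distinct reference crops for every target crop,
# i.e. 4 * 8 = 32 reference crops per image.
distinct_refs = torch.randn(N, crops, refs_per_crop, 3, 64, 64)

print(tuple(shared_refs.shape))    # (2, 4, 3, 64, 64)   ->  4 reference crops per image
print(tuple(distinct_refs.shape))  # (2, 8, 4, 3, 64, 64) -> 32 reference crops per image
```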

Mu

@rosinality
Owner

Hmm, maybe you are right. But since the model uses the mean of the reference patch vectors, it may not be very different from using distinct reference patches for each sample. (Hopefully.)

@rosinality
Owner

rosinality commented Aug 18, 2020

I have changed it to use distinct reference samples for each sample. It is less resource-consuming than I thought, and I suspect it will be a more robust way to do the training.
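
A toy sketch of what sampling distinct reference crops per fake patch could look like; `random_crops` is a made-up helper, not the patchify function actually used in the repo:

```python
import torch

def random_crops(imgs, n_crops, size=64):
    # Sample n_crops random square crops from every image in the batch.
    N, C, H, W = imgs.shape
    crops = []
    for img in imgs:
        for _ in range(n_crops):
            top = torch.randint(0, H - size + 1, (1,)).item()
            left = torch.randint(0, W - size + 1, (1,)).item()
            crops.append(img[:, top:top + size, left:left + size])
    return torch.stack(crops).view(N, n_crops, C, size, size)

real = torch.randn(2, 3, 256, 256)                  # stand-in for the texture source images
# 4 distinct reference crops for each of the 8 fake patches: 8 * 4 = 32 per image.
refs = random_crops(real, n_crops=8 * 4).view(2, 8, 4, 3, 64, 64)
# Only the crop tensors grow, so the extra memory cost stays small.
```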

@mu-cai
Author

mu-cai commented Aug 18, 2020

@rosinality

Thanks for your work! In my opinion, if there are enough training iterations, fixed reference samples would produce the same result as distinct reference samples. I also agree that the model should be more robust when adopting distinct reference patches. And the GPU memory doesn't increase much when doing so... I was also surprised.

Mu

@zhangqianhui

My TF implementation: https://github.com/zhangqianhui/Swapping-Autoencoder-tf. Hope it helps.

@virgile-blg

virgile-blg commented Oct 29, 2020

Hi @mu-cai,

Did the above corrections lead to better structure/style swapping results on your side?
