
Inference: own data #42

Open
sbharadwajj opened this issue Mar 4, 2022 · 9 comments

Comments

@sbharadwajj

Hi,

I just had a few questions regarding using our own data and running inference using PENet pretrained weights.

  1. How sparse can the depth map be?
    Currently, my inference image is from the KITTI-360 dataset, which is quite similar to the KITTI data the network was trained on, but there is no GT depth to sample a sparse depth map from, so my sparse depth map is quite sparse.
    When I run inference on this image, the prediction is also sparse, i.e. I only get predictions in the regions covered by the sparse depth map. Is this expected behaviour?

  2. What should my input be for 'positions' (i.e. the cropped image)? I don't want to crop the images for inference, so should I just set input['positions'] = input['rgb']?

It would be great if you could answer these questions when time permits :)

Regards,
Shrisha

@JUGGHM
Owner

JUGGHM commented Mar 4, 2022

Thanks for your interest!

  1. The single-frame sparse depth maps used as input have a density of about 5% of all pixels, while the ground-truth depth maps are about 15% dense (a quick way to check this is sketched below). The ground-truth maps are generated by accumulating 11 sequential frames. For details of the KITTI Depth dataset, you could refer to [Sparsity Invariant CNNs] by Dr. Uhrig. If you do not want to generate ground-truth depth maps for KITTI-360 manually, you could refer to [Self-supervised Sparse-to-Dense: Self-Supervised Depth Completion from LiDAR and Monocular Camera] by Dr. Fangchang Ma for self-/un-supervised depth completion.

  2. You do not need to change 'positions': it holds the positional encodings used as a prior for the network.
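
For illustration, here is a minimal sketch of how that input/GT density could be checked, assuming the standard KITTI depth-completion PNG format (16-bit, metres = value / 256, 0 = no measurement); the file paths are just placeholders:

```python
import numpy as np
from PIL import Image

def depth_density(png_path):
    """Fraction of pixels that carry a valid depth measurement."""
    # KITTI depth-completion convention: uint16 PNG, depth = value / 256, 0 = invalid.
    depth = np.asarray(Image.open(png_path), dtype=np.float32) / 256.0
    return float((depth > 0.0).mean())

# Sparse LiDAR inputs typically come out around 0.05 (5%),
# the accumulated ground-truth maps around 0.15 (15%):
# print(depth_density("velodyne_raw/0000000005.png"))
# print(depth_density("groundtruth/0000000005.png"))
```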

Feel free to let me know if you have further questions.

@sbharadwajj
Author

sbharadwajj commented Mar 4, 2022

Thank you for your quick reply.

  1. So the ground-truth depth maps with 15% density are used as supervision? Thanks for the references, I will check them right away.

Current setting: my sparse depth map has about 5% density, and I use the same map to supervise the network. So if I understood correctly, the ground-truth depth maps used for supervision need to be at least ~15% dense, correct?

Details of inference: model = PENet_C2; penet_accelerated = True; dilation rate = 2; convolutional-layer-encoding = 'xyz'; H = 192, W = 704 (I also changed the respective lines here).

Here is an example result --> do you think this is the expected output?

  2. Ah, but when I run PENet_C2 for evaluation, what should input['positions'] be?

@JUGGHM
Owner

JUGGHM commented Mar 4, 2022

  1. One important feature of the GT depth maps is that in some local regions the annotation can be much denser than the average density of 15%. In general, the denser the GT depth maps are, the better the trained model will predict, so there is no hard limit on GT density.

Using the same maps for both input and supervision is not sufficient. To alleviate this problem, you need to generate denser GT depth maps or deploy un-/self-supervised methods.

  2. The positional encodings do not need to be changed. If you don't want them, you could set --co to std when executing the training commands, but our pretrained models use the default settings.

@sbharadwajj
Author

sbharadwajj commented Mar 4, 2022

Okay, I understand the sparsity requirements for training now.

But I am still unsure about what the positional encodings are, because I am preparing my own data. Is there a flag for evaluation? I understand that setting --co to std is for training, right?

I can see in model.py that the u & v coordinates of input['position'] are used, but I am still not sure what the positional encodings are or how to create them for my own data.

@JUGGHM
Owner

JUGGHM commented Mar 4, 2022


You could refer to [An intriguing failing of convolutional neural networks and the CoordConv solution] by Liu for more details about positional encodings. In our default settings, we use the geometric encoding (i.e. 3D coordinates) described in our paper, and the evaluation and training processes should share consistent settings.
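
To make the idea concrete, here is a minimal sketch of the two encodings, assuming a pinhole camera model; the exact tensor layout and normalization expected by model.py and the dataloader may differ, so treat this as an illustration only:

```python
import numpy as np

def uv_position_map(h, w):
    """Per-pixel (u, v) coordinate channels -- a plain CoordConv-style encoding."""
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([u, v]).astype(np.float32)            # shape (2, H, W)

def geometric_encoding(sparse_depth, fx, fy, cx, cy):
    """Back-project each pixel with a pinhole model to get (x, y, z) channels.

    Pixels without a depth measurement simply end up with x = y = z = 0 here.
    If you crop the image (e.g. to 192 x 704), remember to shift cx, cy by the
    crop offset before calling this.
    """
    h, w = sparse_depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = sparse_depth.astype(np.float32)
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z]).astype(np.float32)          # shape (3, H, W)
```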

@sbharadwajj
Author

sbharadwajj commented Mar 4, 2022

I get it now. I was able to create the positional encoding (u, v coordinates using the camera intrinsics), but I still get a patchy result like this when I evaluate. Can you tell what else might be going wrong, or what the model may be sensitive to?

(I am just using the pretrained weights to evaluate on this data)

@JUGGHM
Owner

JUGGHM commented Mar 4, 2022


Intuitively, I guess the GT maps are not dense enough for supervised depth completion. GT maps are required to be much denser than the sparse inputs.

@sbharadwajj
Author

But we don't need GT maps in this case, where we are just evaluating.
The result that I've shared here is not based on fine-tuning but merely test_completion.

@JUGGHM
Owner

JUGGHM commented Mar 5, 2022


I think two points could be taken into consideration:

  1. The sparse depth maps in KITTI-360 seem denser than the ones we use in KITTI Depth. This means there is a domain gap between the two datasets, which leads to the poor predictions. We suggest that you either
    (i) construct denser GT maps for KITTI-360 for further training or fine-tuning (this step is necessary for transfer learning; see the sketch after this list), or
    (ii) consider depth completion methods with "sparsity invariance", which aim to counter the instability caused by unknown and varying density. You could refer to [Sparsity Invariant CNNs] by Uhrig or [A Normalized Convolutional Neural Network for Guided Sparse Depth Upsampling] from our group. Recently, this topic has also been discussed in [Boosting Monocular Depth Estimation with Lightweight 3D Point Fusion].

  2. A secondary reason is that not all pixels in the predicted depth map are reliable. You could refer to a previous issue for this.
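
Regarding (i), here is a rough sketch of the frame-accumulation idea (a hypothetical helper, not the script used to build the official KITTI ground truth; real pipelines additionally filter occlusions, outliers, and moving objects):

```python
import numpy as np

def accumulate_depth(points_per_frame, cam_from_frame, K, h, w):
    """Project LiDAR points from several neighbouring frames into one
    reference camera to obtain a denser ground-truth depth map.

    points_per_frame: list of (N_i, 3) point clouds, one per frame
    cam_from_frame:   list of 4x4 transforms from each frame to the reference camera
    K:                3x3 camera intrinsics
    """
    depth = np.zeros((h, w), dtype=np.float32)
    for pts, T in zip(points_per_frame, cam_from_frame):
        pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])   # homogeneous (N, 4)
        cam = (T @ pts_h.T).T[:, :3]
        cam = cam[cam[:, 2] > 0]                               # keep points in front of the camera
        uvw = (K @ cam.T).T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        z = cam[:, 2]
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # simple z-buffer: keep the closest point that lands on each pixel
        for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
            if depth[vi, ui] == 0.0 or zi < depth[vi, ui]:
                depth[vi, ui] = zi
    return depth
```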
