June 2020
tl;dr: Mono3D method based on CenterNet and MonoDIS.
This is a solid engineering paper that extends CenterNet, similar to MonoPair. It does not introduce many new tricks and resembles the popular solutions to the Kaggle mono3D competition.
- SMOKE eliminates 2D object detection altogether. Instead of predicting the 2D bbox center and the 3D/2D center offset, SMOKE predicts the projected 3D center directly.
- Rather than regressing the 7 DoF variables with separate loss functions, SMOKE transforms them into the 8-corner representation of the 3D box and regresses the corners with a unified loss function. This is a nice way to implicitly weight the losses (cf. To Learn or Not to Learn, which regresses an essential matrix).
- The disentangled loss from MonoDIS splits the 8 parameters into 3 groups. For each group, the prediction of that group is combined with the ground truth of the other groups to lift to 3D and compute the loss. The final loss is an unweighted average of the per-group losses, as sketched below.
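Below is a minimal numpy sketch (not the authors' code) of the 8-corner lifting and the MonoDIS-style disentangled corner loss; `corners_from_box` and `disentangled_corner_loss` are hypothetical helpers assuming the KITTI camera-frame convention (y pointing down, yaw around the y-axis).

```python
import numpy as np

def corners_from_box(loc, dim, yaw):
    """Lift a 7-DoF box (location, dimensions h/w/l, yaw) to its 8 corners, shape (8, 3)."""
    h, w, l = dim
    # Corners in the object frame; bottom face at y = 0, top face at y = -h (KITTI-style).
    x = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
    z = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    corners = np.stack([x, y, z], axis=0)        # (3, 8)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[ c, 0, s],
                  [ 0, 1, 0],
                  [-s, 0, c]])                   # rotation around the camera y-axis
    return (R @ corners).T + loc                 # (8, 3) in the camera frame

def disentangled_corner_loss(pred, gt):
    """Unweighted average of the L1 corner loss over 3 parameter groups.
    For each group, only that group's prediction is used; the rest is taken from gt."""
    losses = []
    for group in ("yaw", "dim", "loc"):
        loc = pred["loc"] if group == "loc" else gt["loc"]
        dim = pred["dim"] if group == "dim" else gt["dim"]
        yaw = pred["yaw"] if group == "yaw" else gt["yaw"]
        c_pred = corners_from_box(loc, dim, yaw)
        c_gt = corners_from_box(gt["loc"], gt["dim"], gt["yaw"])
        losses.append(np.abs(c_pred - c_gt).mean())
    return float(np.mean(losses))
```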
- Classification
- The projection of the 3D center is predicted as a virtual keypoint via a heatmap, similar to CenterNet; see the sketch below.
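A minimal sketch of building such a heatmap target, assuming a CenterNet-style downsampled heatmap and a pinhole intrinsic matrix `K`; the helper names and the fixed Gaussian sigma are illustrative, not from the paper.

```python
import numpy as np

def project_3d_center(center_3d, K):
    """Project a 3D center (x, y, z) in camera coordinates to pixel coordinates (u, v)."""
    uvw = K @ center_3d
    return uvw[:2] / uvw[2]

def draw_center_heatmap(heatmap, center_px, stride=4, sigma=2.0):
    """Splat a Gaussian peak at the projected 3D center on the downsampled heatmap."""
    h, w = heatmap.shape
    cx, cy = center_px / stride                  # pixel coords -> heatmap coords
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)          # keep the max where objects overlap
    return heatmap
```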
- Regression
- Regresses 8 parameters for the 7 DoF (cos and sin for the yaw angle). The regression targets are normalized to ease training, and the predictions are taken after a sigmoid. See the decoding sketch below.
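A minimal sketch of one plausible decoding of the 8 regressed values into a 7-DoF box, in the spirit of SMOKE; the statistics (`DEPTH_MEAN`, `DEPTH_STD`, `DIM_MEAN`), the exact normalization, and the angle convention are assumptions for illustration, not values from the paper.

```python
import numpy as np

DEPTH_MEAN, DEPTH_STD = 28.0, 16.0               # assumed dataset depth statistics
DIM_MEAN = np.array([1.53, 1.63, 3.88])          # assumed mean (h, w, l) for cars

def decode_box(keypoint_xy, reg, K_inv, stride=4):
    """keypoint_xy: keypoint on the downsampled map; reg: the 8 regressed values."""
    d_z, d_xc, d_yc, d_h, d_w, d_l, sin_a, cos_a = reg
    z = DEPTH_MEAN + d_z * DEPTH_STD                          # de-normalize depth
    u, v = (np.asarray(keypoint_xy) + [d_xc, d_yc]) * stride  # sub-pixel refined center
    center_3d = z * (K_inv @ np.array([u, v, 1.0]))           # back-project to camera frame
    dims = DIM_MEAN * np.exp(np.array([d_h, d_w, d_l]))       # de-normalize dimensions
    alpha = np.arctan2(sin_a, cos_a)                          # observation angle
    yaw = alpha + np.arctan2(center_3d[0], center_3d[2])      # convert to global yaw
    return center_3d, dims, yaw
```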
- Data augmentation is only used for the keypoint heatmap branch, not for the 3D regression targets.
- When a car's 3D center is outside the image, discard it; this affects about 5% of all objects.
- Runs in real time at 30 ms per frame on a Titan Xp.
- Distance estimation is quite good: about 3 meters of error at 60 meters, i.e., less than 5% error. This is much better than the frontal object distance estimation work from NYU and xmotors.ai.
- Projecting the predicted 3D bbox to 2D also achieves better results than many 2D --> 3D methods. This shows that 3D object detection can yield more robust detection results (see the projection sketch below).
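A minimal sketch of recovering a 2D bbox by projecting the predicted 3D box with the camera intrinsics; it reuses the hypothetical `corners_from_box` helper from the sketch above.

```python
import numpy as np

def bbox2d_from_box3d(loc, dim, yaw, K):
    """Axis-aligned 2D bbox as the min/max of the projected 3D box corners."""
    corners = corners_from_box(loc, dim, yaw)    # (8, 3) corners in the camera frame
    uvw = (K @ corners.T).T                      # project with the intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    return x1, y1, x2, y2
```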
- Ablation Details
- GroupNorm > BN
- Disentangled L1 > L1 > smooth L1.
- Vector (sin, cos) > Quaternion representation