Physical Reasoning: Predicting the Stability of Block Towers with Computer Vision Methods With Only Visual Cues
This project explores the use of deep learning and computer vision techniques to predict the stability of block towers from RGB images alone. The model has to infer physical principles such as gravity and balance purely from visual data. The study compares several deep learning architectures to determine how well each one handles this task.
ShapeStacks dataset
7,680 RGB images of block towers captured from various angles and lighting conditions.
Predict the stable height of a given tower before it collapses.
Stability levels range from 1 to 6.
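Because the stability levels run from 1 to 6 and the models are trained with categorical cross-entropy, the labels presumably have to be turned into 6-way one-hot vectors. The following is a minimal sketch of that mapping, assuming the stable height is stored as an integer; the helper names `one_hot_stable_height` and `decode_stable_height` are hypothetical and not taken from the project code.

```python
NUM_CLASSES = 6  # stable heights 1..6, as stated above


def one_hot_stable_height(stable_height):
    """Map a stable height in 1..6 to a 6-element one-hot vector."""
    if not 1 <= stable_height <= NUM_CLASSES:
        raise ValueError(f"stable height must be in 1..{NUM_CLASSES}")
    vector = [0.0] * NUM_CLASSES
    vector[stable_height - 1] = 1.0  # class index is the height minus one
    return vector


def decode_stable_height(probabilities):
    """Return the stable height (1..6) with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best_index + 1
```

Keeping the two helpers together makes the off-by-one between class index (0 to 5) and stable height (1 to 6) explicit in one place.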
The following deep learning models were tested:
ResNet50: Deep CNN model with residual connections for feature extraction.
EfficientNetB0: Optimized for computational efficiency and accuracy.
MobileNetV3 (Small & Large): Designed for lightweight applications while maintaining performance.
CLAHE (Contrast Limited Adaptive Histogram Equalization): Improves contrast while limiting noise.
Data Augmentation: Flipping, translation, rotation, and brightness adjustments to enhance model generalization.
Depth Estimation: Uses Depth Anything V2 to segment the block tower from the background.
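The augmentation step listed above can be sketched with plain NumPy. This is an illustration under assumptions, not the project's pipeline: the function names, the shift range, and the brightness range are hypothetical, images are assumed to be `H x W x 3` floats in `[0, 1]`, and rotation is left out to avoid extra dependencies.

```python
import numpy as np


def translate_x(image, shift):
    """Shift columns by `shift` pixels, filling the vacated columns with zeros."""
    out = np.zeros_like(image)
    if shift > 0:
        out[:, shift:] = image[:, :-shift]
    elif shift < 0:
        out[:, :shift] = image[:, -shift:]
    else:
        out[...] = image
    return out


def augment(image, rng, max_shift=10, max_brightness_delta=0.2):
    """Randomly flip, translate, and brighten an HxWx3 float image in [0, 1]."""
    out = image.copy()
    # Horizontal flip: mirrors the tower left to right, so the label is unchanged.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Horizontal translation by a random number of pixels (assumed range).
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = translate_x(out, shift)
    # Brightness adjustment: add a constant offset, then clip to the valid range.
    delta = rng.uniform(-max_brightness_delta, max_brightness_delta)
    return np.clip(out + delta, 0.0, 1.0)
```

Only horizontal flips are used in this sketch, since a vertical flip would turn the tower upside down and no longer match its stability label.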
Split: 80% training, 20% validation.
Loss Function: Categorical cross-entropy.
Metrics: Accuracy, validation loss.
Hyperparameter Tuning: Batch sizes (16, 32, 64)
Optimizers: Adam, RMSProp
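To make the training setup above concrete, here is a dependency-free sketch of the 80/20 split, the categorical cross-entropy loss, the accuracy metric, and the hyperparameter grid. The function names and the fixed seed are hypothetical, and pairing every batch size with every optimizer is an assumption the setup above does not confirm.

```python
import itertools
import math
import random


def train_val_split(samples, val_fraction=0.2, seed=0):
    """Shuffle the samples and return (train, validation) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Batch mean of -sum_k y_true[k] * log(y_pred[k])."""
    total = 0.0
    for target, probs in zip(y_true, y_pred):
        # Clamp probabilities away from zero so log() stays finite.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true)


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(y_true, y_pred):
    """Fraction of samples whose top prediction matches the one-hot target."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Values listed above; the full cross product is an assumption.
BATCH_SIZES = (16, 32, 64)
OPTIMIZERS = ("Adam", "RMSProp")
GRID = list(itertools.product(BATCH_SIZES, OPTIMIZERS))
```

If all 7,680 images were split this way, the split would give 6,144 training and 1,536 validation samples.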
Fine-tuning MobileNetV3Large with additional dense layers and dropout regularization.
Test Accuracy: 55.72% (after fine-tuning and augmentation).
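A fine-tuning setup of the kind described above might look like the following Keras sketch. The results above only mention additional dense layers and dropout, so the input size, the number of dense units, the dropout rate, the learning rate, and the choice to freeze the backbone are all assumptions, and `build_model` is a hypothetical name.

```python
import tensorflow as tf

NUM_CLASSES = 6          # stable heights 1..6
IMAGE_SIZE = (224, 224)  # assumed input resolution


def build_model(weights="imagenet", dense_units=256, dropout_rate=0.3,
                train_backbone=False):
    """MobileNetV3Large backbone with a small dense classification head."""
    # The Keras MobileNetV3 models rescale inputs internally by default,
    # so raw pixel values in [0, 255] are expected here.
    backbone = tf.keras.applications.MobileNetV3Large(
        input_shape=IMAGE_SIZE + (3,),
        include_top=False,
        weights=weights,
    )
    backbone.trainable = train_backbone  # frozen unless explicitly unfrozen

    inputs = tf.keras.Input(shape=IMAGE_SIZE + (3,))
    x = backbone(inputs, training=False)  # keep batch norm in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(dense_units, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_model()` downloads ImageNet weights on first use; passing `weights=None` builds the same architecture with random initialization, which is handy for a quick shape check.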
- Difficulty in generalizing complex structures and occluded blocks.
- Performance drop due to class imbalance (fewer samples for higher stability levels).
- Camera angle variations affecting predictions.
- MobileNetV3Large performed best among tested models.
- Data augmentation significantly improved accuracy.
- Additional techniques, such as multi-task learning and contrastive learning, could further enhance model performance.
- Future work may explore physics-based modelling for improved physical reasoning.
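One common way to counter the class imbalance noted above is to weight the loss by inverse class frequency. The project does not report using this, so the sketch below is only an illustration; `inverse_frequency_weights` is a hypothetical helper, and labels are assumed to be zero-based class indices (stable height minus one).

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes=6):
    """Return {class_index: N / (num_classes * count)} for classes that occur."""
    counts = Counter(labels)
    total = len(labels)
    return {
        cls: total / (num_classes * counts[cls])
        for cls in range(num_classes)
        if counts[cls] > 0  # skip classes with no samples to avoid dividing by zero
    }
```

A dictionary of this shape can be passed as the `class_weight` argument of Keras `Model.fit`, so that errors on rare stability levels contribute more to the loss.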
Language and framework: Python, TensorFlow
Hardware:
- Kaggle notebook: 29 GB RAM, 57.6 GB disk, T4 GPU, 4 x Intel Xeon CPU.
- Local environments: a MacBook Air M2 (16 GB RAM, 512 GB disk) and a MacBook Pro M1 (16 GB RAM, 1 TB disk).