Hand Segmentation Using Deep Learning for Arabic Sign Language
Problem statement
Hand segmentation remains a challenging task despite advances in hand detection and hand tracking.
Difficulties in hand segmentation can arise from a variety of factors, such as complex background removal, unreliable skin-color detection, etc.
Proposed solution
Deep learning is a branch of machine learning that builds a model by training on large amounts of data.
Algorithms of this kind learn features on their own, without those features having to be defined in advance, and they are among the best at enabling a machine to learn data features at different levels of abstraction (for images, for example).
Deep learning has excelled at creating new features that can be learned at different levels, which may lead researchers to focus on this vital aspect in the future. Features are the first factor in the success of any intelligent machine learning algorithm: the ability to extract and/or select features correctly, and to represent and prepare the data for learning, is the dividing line between success and failure of the algorithm. Deep learning can also be used to make predictions with pre-trained models that recognize or classify an object in an image, which helps the image segmentation process; segmentation is an image processing technique used to split an image into two or more significant parts.
In our work, we implemented a system that accurately segments ArSL images into two class labels: hand and background. We trained a deep learning model on 60% of a large dataset of ArSL gestures, splitting the rest equally into 20% for validation and 20% for testing, and then used the mask images to assess the model's performance.
The system uses the ResNet-18 architecture, a pre-trained convolutional neural network.
The system takes two inputs: one containing the unlabeled images and the other containing the ground-truth masks. The model learns the relation between the two and consequently is able to segment new hand images through the steps listed below.
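Before the step-by-step walkthrough, here is a minimal MATLAB sketch of how the two inputs could be set up as datastores. The folder names, and the pixel values assumed to encode each class in the masks, are hypothetical placeholders rather than the project's actual choices:

    % Hypothetical folder names; substitute the project's actual paths.
    imageDir = fullfile('data', 'arslImages');   % unlabeled ArSL images
    labelDir = fullfile('data', 'arslMasks');    % ground-truth masks

    % Datastore over the raw images.
    imds = imageDatastore(imageDir);

    % The two class labels and the mask pixel values assumed to encode
    % them (0 = background, 255 = hand).
    classNames    = ["Background", "Hand"];
    pixelLabelIDs = [0, 255];

    % Datastore that pairs each mask with the class labels.
    pxds = pixelLabelDatastore(labelDir, classNames, pixelLabelIDs);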
Training
1- Load the input and label files (see the datastore sketch above; a sketch of the remaining steps follows this list).
2- Create two classes for the classification step, namely, Background and Hand.
3- Split the data into 60% training, 20% validation, and 20% testing.
4- Load the ResNet-18 model.
5- Assign a weight to each class based on its class frequency.
6- Set the training options as follows:
Learning rate drop factor: 0.3
Learning rate drop period: 10
That is, the learning rate is reduced by a factor of 0.3 every 10 epochs. This lets the network learn quickly with a higher initial learning rate, while still being able to find a solution close to the local optimum once the learning rate drops.
Momentum: 0.9
L2 Regularization: 0.005
Max Epochs: 30
Mini Batch Size: 8
Shuffle: every epoch
Validation Patience: 4
This stops training early when the validation accuracy converges, which prevents the network from overfitting the training dataset.
7- Train the model with the training and validation datasets.
8- Save the trained network.
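Continuing from the datastore sketch above, the following MATLAB code outlines steps 3 through 8. The split proportions, class weighting, and training options follow the list; building the segmentation network on the ResNet-18 backbone with deeplabv3plusLayers, the network input size, the initial learning rate, and the saved file name are assumptions, not the project's confirmed choices:

    % 3- Shuffle once, then split 60% / 20% / 20%.
    numFiles = numel(imds.Files);
    idx      = randperm(numFiles);
    nTrain   = round(0.60 * numFiles);
    nVal     = round(0.20 * numFiles);
    trainIdx = idx(1:nTrain);
    valIdx   = idx(nTrain+1 : nTrain+nVal);
    testIdx  = idx(nTrain+nVal+1 : end);

    imdsTrain = imageDatastore(imds.Files(trainIdx));
    imdsVal   = imageDatastore(imds.Files(valIdx));
    imdsTest  = imageDatastore(imds.Files(testIdx));
    pxdsTrain = pixelLabelDatastore(pxds.Files(trainIdx), classNames, pixelLabelIDs);
    pxdsVal   = pixelLabelDatastore(pxds.Files(valIdx),   classNames, pixelLabelIDs);
    pxdsTest  = pixelLabelDatastore(pxds.Files(testIdx),  classNames, pixelLabelIDs);

    % 4- Segmentation network on the pre-trained ResNet-18 backbone.
    %    The input size is an assumption; match it to the dataset.
    imageSize  = [224 224 3];
    numClasses = numel(classNames);
    lgraph = deeplabv3plusLayers(imageSize, numClasses, 'resnet18');

    % 5- Weight each class by inverse frequency so the smaller Hand
    %    class is not swamped by Background pixels.
    tbl          = countEachLabel(pxdsTrain);
    imageFreq    = tbl.PixelCount ./ tbl.ImagePixelCount;
    classWeights = median(imageFreq) ./ imageFreq;
    pxLayer = pixelClassificationLayer('Name', 'labels', ...
        'Classes', tbl.Name, 'ClassWeights', classWeights);
    lgraph = replaceLayer(lgraph, 'classification', pxLayer);

    % 6- Training options from the list above. The initial learning
    %    rate is not stated in the list; 1e-3 is an assumed value.
    opts = trainingOptions('sgdm', ...
        'InitialLearnRate',    1e-3, ...
        'LearnRateSchedule',   'piecewise', ...
        'LearnRateDropFactor', 0.3, ...
        'LearnRateDropPeriod', 10, ...
        'Momentum',            0.9, ...
        'L2Regularization',    0.005, ...
        'MaxEpochs',           30, ...
        'MiniBatchSize',       8, ...
        'Shuffle',             'every-epoch', ...
        'ValidationData',      pixelLabelImageDatastore(imdsVal, pxdsVal), ...
        'ValidationPatience',  4);

    % 7- Train on the paired image/label datastore.
    net = trainNetwork(pixelLabelImageDatastore(imdsTrain, pxdsTrain), lgraph, opts);

    % 8- Save the trained network.
    save('arslSegNet.mat', 'net');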
Testing
1- Load the pretrained model.
2- Choose one image from the test set and pass it to the pretrained model.
3- Compare the predicted output with the expected output. White areas indicate correctly segmented pixels, whereas the green highlights mark the mismatched regions.
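A short MATLAB sketch of this comparison, assuming the trained network and test datastores from the sketches above; imshowpair leaves matched pixels white or gray and highlights mismatches in color:

    % 1- Load the pretrained model (file name from the training sketch).
    loaded = load('arslSegNet.mat');
    net    = loaded.net;

    % 2- Segment one image from the test set.
    I = readimage(imdsTest, 1);
    predicted = semanticseg(I, net);

    % 3- Compare the prediction with the ground-truth mask. Matched
    %    pixels appear white or gray; mismatches are highlighted.
    expected = readimage(pxdsTest, 1);
    imshowpair(uint8(predicted), uint8(expected))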
Evaluation
The model can be evaluated using the predefined evaluateSemanticSegmentation function, which reports the model's accuracy, IoU, and MeanBFScore for each class.
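A sketch of running that evaluation over the whole test set, assuming the objects defined in the earlier sketches; the WriteLocation scratch folder is simply where semanticseg stores the predicted label images:

    % Segment every test image, writing predictions to a scratch folder.
    pxdsResults = semanticseg(imdsTest, net, ...
        'MiniBatchSize', 8, 'WriteLocation', tempdir);

    % Compare the predictions with the ground truth and report the
    % per-class Accuracy, IoU, and MeanBFScore.
    metrics = evaluateSemanticSegmentation(pxdsResults, pxdsTest);
    metrics.ClassMetrics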
In our training, the accuracy rose as training progressed while the loss dropped.
Overall, our validation accuracy was 96.08%.
The image placed on the left is the input image and the image placed on the right is the output image. The white areas that appear on the output images are the regions where the prediction and the mask match successfully.