- Time: 13:20 - 13:50
- Attendees: Bob Zhang (Supervisor), Mai Jiajun, Huang Yanzhen
- We discovered the exact configuration needed to run MMPose on CUDA.
- Introduced the new repo `posture_mmpose`, which is the MMPose version of the old `posture` repo.
  - The required environments and configuration steps are listed in the README.md of the `posture_mmpose` repository.
- Lower processing time per frame compared to MediaPipe.
- MMPose uses configuration files to determine which key landmarks to recognize for each person.
- Originally, we used the 133-point config, which recognizes all possible landmarks, including each person's facial details.
- However, its real-time performance was poor, with a high processing time per frame under stress testing. Moreover, facial detection performed worse than body detection.
- Therefore, we dropped facial detection, leaving only the 17-point body detection.
- MMPose comes with built-in object detection, so we no longer need a dedicated YOLOv5s model for image cropping.
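A minimal sketch of this workflow, assuming MMPose 1.x's `MMPoseInferencer`; the `'human'` model alias and the image path are illustrative placeholders, not the repo's actual code:

```python
from mmpose.apis import MMPoseInferencer

# The 'human' alias bundles a person detector with a 17-keypoint COCO
# body model, so no separate YOLOv5s cropping stage is needed.
inferencer = MMPoseInferencer('human')

# Inference yields one result per image; 'frame.jpg' is a placeholder.
result = next(inferencer('frame.jpg'))

# Keypoints and confidence scores of the first detected person.
person = result['predictions'][0][0]
keypoints, scores = person['keypoints'], person['keypoint_scores']
```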
- The original `posture` repo will be archived as a backup.
- Angle score: the geometric mean of the scores of the three keypoints that form the angle:
  $\text{angle\_score} = (score_1 \times score_2 \times score_3)^{\frac{1}{3}}$
- Backup calculation plan: harmonic mean (harder for computers to compute, lower accuracy).
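As a quick illustration, the angle score and the harmonic-mean backup can be computed in a few lines of NumPy; the placeholder values in `scores` stand in for real keypoint confidences from the pose model:

```python
import numpy as np

# Confidence scores of the three keypoints that define one angle
# (placeholder values; real ones come from the pose estimator).
scores = np.array([0.92, 0.85, 0.78])

# Geometric mean, as in the formula above.
angle_score = scores.prod() ** (1 / 3)

# Harmonic-mean backup plan, for comparison.
harmonic_score = 3 / np.sum(1.0 / scores)
```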
- Input: each person is flattened, giving a [num_people x (2 x num_angles)] matrix stored in `.npy` file format.
  - Each person: $9$ key angles, each paired with its angle score, flattened into an $18$-d vector.
  - The input matrix is of shape $(\text{num\_test} \times 18)$.
  - Normalization (a sketch follows below):
    - Angle values are normalized into $[0, 1]$ by dividing by $180$.
    - The entire input matrix is normalized using the Z-score approach, i.e., to a normal distribution with $\mu=0$ and $\sigma=1$.
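A minimal sketch of this preprocessing, assuming NumPy; the file name `samples.npy` and the column layout (angles in even columns, scores in odd columns) are assumptions for illustration:

```python
import numpy as np

# Load the flattened per-person matrix, shape (num_test, 18).
X = np.load("samples.npy")

# Step 1: scale raw angle values (degrees) into [0, 1].
# Assumed layout: even columns are angles, odd columns are scores.
X[:, 0::2] /= 180.0

# Step 2: Z-score the matrix so each feature has mu = 0, sigma = 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)
```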
- Output: 2 classes
- Structure:
  - 4 fully-connected layers
  - ReLU as the activation function
- Training parameters:
  - input size = $18$, which is the dimension of the data samples
  - hidden size = $100$
  - learning rate = $0.01$
  - number of epochs = $100$
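A minimal sketch of such a network, assuming PyTorch; the class name, the SGD optimizer, and the exact arrangement of the four layers are illustrative assumptions beyond what the notes specify:

```python
import torch.nn as nn
import torch.optim as optim

class PostureMLP(nn.Module):
    """4 fully-connected layers with ReLU activations (hypothetical layout)."""
    def __init__(self, input_size=18, hidden_size=100, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, num_classes),   # 2 output classes
        )

    def forward(self, x):
        return self.net(x)

model = PostureMLP()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # learning rate = 0.01
criterion = nn.CrossEntropyLoss()                   # 2-class classification
```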
- We introduced how to use several target angles, together with their corresponding confidence levels calculated from site information, as input for training the deep learning model.
- Loss does not drop below $0.5$ from around the $160^{th}$ epoch onward.
- Training accuracy is around $0.61$ and test accuracy is $0.16$: there are too few data samples.
  - Add more samples by capturing videos of ourselves.
  - Convert the videos to images, and crop the person out of each frame to build the dataset (a sketch follows below).
  - The videos should cover all camera visual perspectives.
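A minimal sketch of the video-to-frames step, assuming OpenCV; the function name, sampling stride, and paths are placeholders, and the person-cropping step (e.g., using the detector's bounding boxes) is left out:

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, every_n: int = 5) -> int:
    """Save every n-th frame of a video as a JPEG (hypothetical helper)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

extract_frames("ourselves.mp4", "dataset/raw_frames")
```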
- Potential need for multi-class classification:
  - "Calling"
  - "Using"
  - "Not using"
  - etc.
- Added a Pause function to the frontend.
- Significantly extend the training data by using videos. The goal is up to 2000 samples; the more the better.
- Explore more of the internal functionality of the MLP (multi-layer perceptron).