- Time: 16:10 - 16:45
- Attendees: Bob Zhang (Supervisor), Huang Yanzhen, Mai Jiajun
We introduced a new process that filters out pedestrians whose posture detection results are of low quality. These pedestrians are ignored by the classifier, freeing up performance for the detectable individuals. Currently, we have two types of undetectable pedestrians:
- Back (discussed in previous meetings): The pedestrian is showing their back to the camera.
  - Criteria: The backside ratio is smaller than -0.2 and the confidences of the two shoulder landmarks are both higher than 0.3 (a sketch of this check follows the list).
  - Backside ratio: $\frac{l\_shoulder_x - r\_shoulder_x}{bbox_w}$
- Out of frame (new): Most of the landmarks of this pedestrian are out of frame.
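To make the back-side criterion concrete, here is a minimal sketch in Python; the `(x, y, confidence)` landmark format and the function name are illustrative assumptions, not the project's actual data structures:

```python
# Back-side check: ratio below -0.2 with both shoulders confidently detected.
# Landmarks are assumed to be (x, y, confidence) tuples in pixel coordinates.

def is_back_side(l_shoulder, r_shoulder, bbox_w):
    lx, _, l_conf = l_shoulder
    rx, _, r_conf = r_shoulder
    backside_ratio = (lx - rx) / bbox_w
    return backside_ratio < -0.2 and l_conf > 0.3 and r_conf > 0.3
```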
We use the binary digits of an integer to denote such filtering states for extensibility.
| Filtering State | Base-10 Integer | Binary Integer |
|---|---|---|
| OK_CLASSIFY | 0 | 0000, 0000 |
| OUT_OF_FRAME | 1 | 0000, 0001 |
| BACK_SIDE | 2 | 0000, 0010 |
| Possible future filtering state 1 | 4 | 0000, 0100 |
| Possible future filtering state 2 | 8 | 0000, 1000 |
By AND-ing the classification state with the filtering state, we can efficiently determine whether to classify a person: perform classification if and only if the AND result is 0.
P.S. `OUT_OF_FRAME` has higher priority than `BACK_SIDE`. That means, if both conditions are fulfilled, the current mechanism reports only `OUT_OF_FRAME`. This saves performance by skipping determinations whose results would not be used.
The Python packages `ultralytics` and `roboflow` are used. Roboflow's online workspace is used to manage the training data for fine-tuning a pre-trained YOLOv11 model to detect phones specifically. (Training is done locally; we added a `step03_yolo_phone_detection` directory in the `yolo_dev` branch of `posture_mmpose`.)
The data currently come from Google and Baidu. The original data consists of about 200 pictures; around 600 individual training samples were obtained after various image augmentations.
The performance of the model is still under optimization.
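For reference, a rough sketch of the local fine-tuning step using the standard `ultralytics` API; the dataset YAML path, checkpoint, and hyperparameters are placeholders, not the actual values used in `step03_yolo_phone_detection`:

```python
from ultralytics import YOLO

# Start from a pre-trained YOLOv11 checkpoint and fine-tune it on the
# phone dataset exported from the Roboflow workspace (path is assumed).
model = YOLO("yolo11n.pt")
model.train(data="phone_dataset/data.yaml", epochs=100, imgsz=640)

# Inference on a cropped hand sub-frame returns candidate phone boxes.
results = model.predict("hand_crop.jpg", conf=0.25)
```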
We expose the interface for the YOLOv11 model by cropping sub-pictures of a person's hands and feeding them into the model as inputs.
The center of the sub-frame is determined by:
- the landmark of the person's wrist, and
- the landmark of the person's elbow:

$lm_{center} = lm_{l\_wrist} + (lm_{l\_wrist} - lm_{l\_elbow}) \times 0.5$

That is, the center is obtained by offsetting the wrist landmark by half of the vector pointing from the pedestrian's elbow to their wrist. The width and height of the sub-frame are determined by the width and height of the bounding box:

$hand_w = \frac{bbox_w}{2}$, $hand_h = \frac{bbox_h}{2}$
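A sketch of this cropping step under the formulas above; landmark arrays are assumed to be (x, y) pixel coordinates, and the names are illustrative rather than the actual `posture_mmpose` identifiers:

```python
import numpy as np

def hand_subframe(frame, lm_wrist, lm_elbow, bbox_w, bbox_h):
    wrist = np.asarray(lm_wrist, dtype=float)
    elbow = np.asarray(lm_elbow, dtype=float)
    # Offset the wrist by half of the elbow-to-wrist vector.
    center = wrist + (wrist - elbow) * 0.5
    w, h = bbox_w / 2, bbox_h / 2  # sub-frame is half the bbox size
    # Clamp the crop to the frame boundaries.
    x0 = int(max(center[0] - w / 2, 0))
    y0 = int(max(center[1] - h / 2, 0))
    x1 = int(min(center[0] + w / 2, frame.shape[1]))
    y1 = int(min(center[1] + h / 2, frame.shape[0]))
    return frame[y0:y1, x0:x1]
```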
We added frame-rate monitoring; the model runs at around 25 FPS on a single person without YOLO, and around 10 FPS with YOLO.
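The monitoring itself can be as simple as smoothing per-frame wall time; a minimal sketch follows (an exponential moving average is one common choice, not necessarily the project's exact mechanism):

```python
import time

class FPSMeter:
    """Smoothed frames-per-second counter; call tick() once per frame."""

    def __init__(self, smoothing: float = 0.9):
        self.smoothing = smoothing
        self._last = None
        self.fps = 0.0

    def tick(self) -> float:
        now = time.perf_counter()
        if self._last is not None:
            inst = 1.0 / (now - self._last)
            self.fps = self.smoothing * self.fps + (1 - self.smoothing) * inst
        self._last = now
        return self.fps
```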
We added an additional dataset for conditions where pedestrians wear thick clothes. This was done because we discovered that the posture detection model performs weakly on pedestrians in thick clothes; inspecting the MMPose inference results, we found they are quite different for people in thick clothes. Therefore, we recorded a full set of data with the recorded subject wearing thick clothes.
| Posture Estimation Model | Equipment (GPU) | Data Num | Epoch Num | LR |
|---|---|---|---|---|
| RTMPose Medium | RTX-4060 Laptop | 8221 +, 5100 - | Early stopped at 211 | 5e-6 |
The model recovered its performance after the additional data was given.