Information

  • Time: 16:10 - 16:45
  • Attendees: Bob Zhang (Supervisor), Huang Yanzhen, Mai Jiajun

Discussion Summary

1. Undetectable Pedestrians

We introduced a new process that filters out pedestrians whose posture detection results are unreliable. These pedestrians are ignored by the classifier so that performance is concentrated on the detectable individuals. Currently, we have two categories of undetectable pedestrians, with detection criteria sketched in code after the list:

  • Back (discussed in previous meetings): The pedestrian is showing their back to the camera.
    • Criteria: The backside ratio is smaller than -0.2 and the confidence of both shoulder landmarks is higher than 0.3.
    • Backside ratio $= \frac{x_{l\_shoulder} - x_{r\_shoulder}}{w_{bbox}}$
  • Out of frame (new): Most of this pedestrian's landmarks are out of frame.
    • Criteria: More than 5 of the first 13 landmarks have confidence scores smaller than 0.3.
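
A minimal sketch of the two criteria, assuming hypothetical inputs: `landmarks` is an (N, 2) array of (x, y) keypoints and `scores` their confidences, with shoulders at the COCO-17 indices (5 = left, 6 = right), which is an assumption here:

```python
import numpy as np

L_SHOULDER, R_SHOULDER = 5, 6  # COCO-17 keypoint indices (assumed)

def is_back_side(landmarks: np.ndarray, scores: np.ndarray, bbox_width: float) -> bool:
    # Backside ratio = (x_l_shoulder - x_r_shoulder) / bbox_width.
    ratio = (landmarks[L_SHOULDER, 0] - landmarks[R_SHOULDER, 0]) / bbox_width
    shoulders_confident = scores[L_SHOULDER] > 0.3 and scores[R_SHOULDER] > 0.3
    return ratio < -0.2 and shoulders_confident

def is_out_of_frame(scores: np.ndarray) -> bool:
    # More than 5 of the first 13 landmarks score below 0.3.
    return int(np.sum(scores[:13] < 0.3)) > 5
```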

We use the binary digits of an integer to encode these filtering states, which keeps the scheme easily extensible.

| State | Base-10 Integer | Binary Integer |
| --- | --- | --- |
| OK_CLASSIFY | 0 | 0000 0000 |
| OUT_OF_FRAME | 1 | 0000 0001 |
| BACK_SIDE | 2 | 0000 0010 |
| Possible future filtering state 1 | 4 | 0000 0100 |
| Possible future filtering state 2 | 8 | 0000 1000 |
By AND-ing a pedestrian's state with the filtering mask, we can efficiently determine whether to classify that person: perform classification if and only if the AND result is 0.

P.S. OUT_OF_FRAME has higher priority than BACK_SIDE: if both conditions are fulfilled, the current mechanism reports only OUT_OF_FRAME. This saves the cost of computing determinations that would not be used.
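
A minimal sketch of the state encoding and the AND check, reusing the hypothetical `is_out_of_frame` / `is_back_side` helpers from the sketch above; the enum values mirror the table:

```python
from enum import IntFlag

class FilterState(IntFlag):
    OK_CLASSIFY = 0    # 0000 0000
    OUT_OF_FRAME = 1   # 0000 0001
    BACK_SIDE = 2      # 0000 0010
    # Future filtering states would take 4, 8, ... (one bit each).

def determine_state(landmarks, scores, bbox_width) -> FilterState:
    # OUT_OF_FRAME has priority: when it fires, BACK_SIDE is not checked at all.
    if is_out_of_frame(scores):
        return FilterState.OUT_OF_FRAME
    if is_back_side(landmarks, scores, bbox_width):
        return FilterState.BACK_SIDE
    return FilterState.OK_CLASSIFY

def should_classify(state: FilterState,
                    mask: FilterState = FilterState.OUT_OF_FRAME | FilterState.BACK_SIDE) -> bool:
    # Classify if and only if the AND of the state and the filter mask is 0.
    return (state & mask) == 0
```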

2. YOLO11

2.1 Model Training (In Progress)

The Python packages ultralytics and roboflow are used. Roboflow's online workspace manages the training data, which is used to fine-tune a pre-trained YOLOv11 model specifically for phone detection. (Training is done locally; we added a step03_yolo_phone_detection directory in the yolo_dev branch of posture_mmpose.)
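
A minimal sketch of the local fine-tuning step, assuming a dataset exported from Roboflow in YOLO format; the checkpoint name, paths, and hyperparameters below are placeholders, not our actual configuration:

```python
from ultralytics import YOLO

# Start from a pre-trained YOLO11 checkpoint and fine-tune it on the
# phone dataset managed in the Roboflow workspace.
model = YOLO("yolo11n.pt")
model.train(
    data="datasets/phone/data.yaml",  # class names and split paths exported by Roboflow
    epochs=100,
    imgsz=640,
)
metrics = model.val()  # evaluate on the validation split
```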

The data currently come from Google and Baidu image searches. The original set has about 200 pictures; around 600 individual training samples are obtained after various image augmentations.

The model's performance is still being optimized.

2.2 Interface Exposure

We expose the interface for the YOLOv11 model by cropping sub-pictures around a person's hands and feeding them into the model as inputs.

The center and size of the sub-frame are determined by:

  • the landmarks of the pedestrian's wrist and elbow, and
  • the width and height of the bounding box.

The center is obtained by offsetting the wrist landmark by half of the vector pointing from the pedestrian's elbow to their wrist (shown for the left hand; the right hand is symmetric):

$c_{l\_hand} = lm_{l\_wrist} + (lm_{l\_wrist} - lm_{l\_elbow}) \times 0.5$

The width and height of the sub-frame are half of the bounding box's:

$hand_w = \frac{bbox_w}{2}, \quad hand_h = \frac{bbox_h}{2}$

A sketch of this crop computation follows.
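
A minimal sketch of the crop computation, assuming hypothetical names: `wrist` and `elbow` are (x, y) NumPy arrays for one arm, and `frame` is the full image:

```python
import numpy as np

def hand_crop(frame: np.ndarray, wrist: np.ndarray, elbow: np.ndarray,
              bbox_w: float, bbox_h: float) -> np.ndarray:
    # Offset the wrist by half of the elbow-to-wrist vector to get the center.
    center = wrist + (wrist - elbow) * 0.5
    # The sub-frame is half the bounding box in each dimension.
    w, h = bbox_w / 2, bbox_h / 2
    x0, y0 = int(center[0] - w / 2), int(center[1] - h / 2)
    x1, y1 = int(center[0] + w / 2), int(center[1] + h / 2)
    # Clamp to the frame so hands near the border do not produce empty slices.
    H, W = frame.shape[:2]
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, W), min(y1, H)
    return frame[y0:y1, x0:x1]
```

The resulting crop is what gets fed to the YOLOv11 phone detector as input.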

We added frame-rate monitoring: on a single person, the pipeline runs at around 25 FPS without YOLO and around 10 FPS with YOLO.
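
A minimal sketch of such a monitor, assuming a simple sliding-window timer; illustrative only, not the project's actual implementation:

```python
import time
from collections import deque

class FPSMonitor:
    """Average FPS over the most recent `window` frames."""

    def __init__(self, window: int = 30):
        self.stamps = deque(maxlen=window)

    def tick(self) -> float:
        # Call once per processed frame; returns the current FPS estimate.
        self.stamps.append(time.perf_counter())
        if len(self.stamps) < 2:
            return 0.0
        return (len(self.stamps) - 1) / (self.stamps[-1] - self.stamps[0])
```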

3. Additional Dataset

We added an additional dataset for conditions where pedestrians wear thick clothes. We did this because we discovered that the posture detection model performs weakly on pedestrians in thick clothes; inspecting the MMPose inference results showed they are quite different for such people. Therefore, we recorded a full dataset with the subject wearing thick clothes.

| Posture Estimation Model | Equipment (GPU) | Data Num (positive / negative) | Epoch Num | LR |
| --- | --- | --- | --- | --- |
| RTMPose Medium | RTX-4060 Laptop | 8221 / 5100 | Early stopped at 211 | 5e-6 |

The model recovered its performance after the additional data was added.