Skip to main content

Evaluation & Performance

This page outlines the evaluation metrics and experimental results obtained during the system validation. All numbers are aligned with the research findings documented in the academic thesis.


1. YOLOv11 Knife Detection (Object Detection)

The custom-trained knife detection model was evaluated on a test split of 704 images containing 779 knife instances.

Performance Metrics

ClassPrecision (P)Recall (R)F1-ScoremAP@0.5mAP@0.5:0.95
Knife (All)0.8900.8250.8560.9100.612
  • Precision (0.890): Demonstrates high reliability, minimizing false weapon alerts (crucial for security monitoring to prevent unnecessary panic).
  • mAP@0.5 (0.910): Indicates exceptional spatial overlap accuracy under loose boundary criteria.
  • mAP@0.5:0.95 (0.612): Shows expected regression limitations under extremely strict boundary alignments, typical for small objects in motion.

Confusion Matrix (Object Detection)

  • True Positives (Knife correctly detected): 685
  • False Positives (Background marked as Knife): 151
  • False Negatives (Knife missed by model): 94
  • Note: True Negatives are not tracked, which is standard for single-class object detectors.

2. YOLOv11-Pose Estimation Benchmarks

To establish baseline skeletal accuracy, the pose estimator was benchmarked against the standard COCO val2017 dataset.

Performance Metrics

TaskPrecision (P)Recall (R)F1-ScoremAP@0.5mAP@0.5:0.95
Skeletal Pose0.8490.7600.8020.8100.511

The model provides stable, occlusion-resistant tracking of key joints (shoulders, elbows, wrists) even during physical contact or overlapping profiles.


3. Temporal Bi-LSTM Classifier

The temporal sequence classifier was evaluated using a strictly isolated Test Set composed of 183 skeletal sequences derived from 55 source videos, ensuring zero data leakage.

Classification Metrics

ClassPrecisionRecallF1-Score
Non-Violent (Safe)0.650.790.71
Violent (Danger)0.870.760.81
Overall Accuracy0.77 (77%)
  • High Violent Precision (0.87): Ensures that when the system logs a violent interaction or sounds a warning, there is an 87% chance an actual threat is occurring.
  • Balanced Violent Recall (0.76): Correctly flags three-quarters of all violent sequence events.

Confusion Matrix (Bi-LSTM per Video)

Predicted \ ActualTrue ViolentTrue Non-Violent
Predicted Violent26 (True Positive)4 (False Positive)
Predicted Non-Violent8 (False Negative)15 (True Negative)
  • Violent Detection Rate: Correctly classified 76.4% of the violent video streams (26 out of 34).
  • Safe Detection Rate: Correctly classified 78.9% of the non-violent control streams (15 out of 19).

ROC-AUC Score

  • ROC-AUC: 0.6749 The ROC curve demonstrates performance significantly above random chance (0.5), showing that the Bi-LSTM model successfully isolates the kinematics of stabbing gestures.

4. Signal Stability (Flicker Rate)

The Flicker Rate measures the temporal stability of model predictions across consecutive frames to prevent erratic visual alert toggling.

  • Flicker Rate: 0.0202 (2.02%)

A rate of 2.02% is excellent. It indicates that once a person is classified under a threat status, the alert label remains solid and constant, preventing visual "flicker" on the operator screen. This stability is achieved by:

  1. Posing a temporal threshold constraint of 30 consecutive frames before starting inferences.
  2. Implementing a fallback rendering mechanism that repeats the last processed frame during frame skips instead of switching back to unannotated feeds.