Data Preprocessing
Before training the LSTM model, raw videos must be converted into structured, normalized keypoint sequences. This is handled by the script src/preprocess_videos.py.
Script Purpose
The script automates the extraction of human pose keypoints from video datasets using YOLOv11-Pose and saves them as NumPy (.npz) file packets ready for LSTM-based action classification.
┌─────────────┐ ┌─────────────────────┐ ┌───────────────────────┐ ┌─────────────┐
│ Raw Video │ ──> │ YOLOv11-Pose Track │ ──> │ Normalize (To Torso) │ ──> │ Save .npz │
│ (.mp4/.avi)│ │ (Extract Keypoints)│ │ & Filter Movements │ │ Dataset │
└─────────────┘ └─────────────────────┘ └───────────────────────┘ └─────────────┘
Code Workflow
The preprocessing pipeline follows these steps:
1. File Scanner & Label Assignment
- The script scans directories for video files (
.mp4,.avi,.mov). - It automatically assigns binary labels based on the folder path:
- Folders with "violent" in their path: Assigned base label
1(Violent). - Other folders: Assigned label
0(Non-violent/Safe).
- Folders with "violent" in their path: Assigned base label
2. Multi-Person Tracking
- It processes each video frame-by-frame and feeds them into the YOLOv11 pose estimation model (
yolo11n-pose.pt). - It initializes tracking using the
botsort.yamlconfig. This allows the script to follow the keypoints of a specific person across multiple frames, separating their movement from others in the background.
3. Torso-Relative Normalization
To make the coordinate sequence scale-invariant (e.g. when a person walks closer to or further from the camera), keypoints are normalized relative to the torso size:
- Calculates the center point between the left shoulder (Index 5) and the right shoulder (Index 6).
- Calculates the Euclidean distance between the two shoulders, representing the shoulder width.
- Centers all 17 keypoints by subtracting the shoulder center, and scales them by dividing by the shoulder width:
Normalized Point = (Point - Torso Center) / Shoulder Width - Re-appends the detection confidence score to each keypoint to create a triplet array of shape
(frames, 17, 3).
4. Logic Filtering & Movement Score
Not all people in a "violent" video are acting violently. Spectators, for instance, stand still while violence occurs. To filter out passive tracks:
For Violent Videos:
- Discards tracks that contain less than 30 frames of data.
- Calculates a Movement Score: The average frame-to-frame change (Euclidean distance magnitude) of keypoint vectors.
- Movement Score Less than 1.5: Reclassified as Spectator (Label 0).
- Movement Score Greater than or equal to 1.5: Kept as Violent (Label 1).
For Non-Violent Videos:
- Discards short tracks with 15 frames or less.
- Saves all remaining tracks with Label 0.
5. Saving Keypoints
Processed sequences are saved in the ../datasets/lstm_dataset/ directory. Files use the naming convention:
{violent|nonviolent}_video{filename}_{cam}_person{id}.npz
Each .npz file contains two keys:
data: Float32 NumPy array of shape(frames, 17, 3)containing keypoints(x, y, conf).label: Integer scalar (0or1).
Parallel Execution
Extracting pose keypoints from high-resolution video databases is computationally expensive. The script uses Python's multiprocessing.Pool to spawn multiple camera folder workers in parallel.
The number of active worker processes is set dynamically:
num_workers = min(len(video_dirs), os.cpu_count() // 2)
Execution Command
Ensure the pre-trained pose model ../models/yolo11n-pose.pt is present, and run the preprocessing script from the src/ directory:
python preprocess_videos.py
This will populate the datasets/lstm_dataset/ directory with .npz files.