The proliferation of streaming services necessitates robust automatic movie genre classification. While 3D Convolutional Neural Networks (3D CNNs) and Video Transformers achieve high accuracy, they are computationally prohibitive for real-time or edge applications. This paper introduces MovieSMobileNet, a novel architecture that combines a patched frame sampling strategy with a modified MobileNetV3 backbone. By dividing each frame into spatial patches and applying a temporal attention mechanism across patch sequences, MovieSMobileNet captures both local textures and short-term motion cues without 3D convolutions. Experimental results on MMAct and a subset of MovieNet show that our patched approach improves F1-score by 4.2% over standard frame aggregation, achieving 89.1% accuracy with only 5.2M parameters and 1.8 GFLOPs, which makes it suitable for mobile deployment.
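To make the patch-then-attend idea concrete, the following is a minimal, dependency-free sketch and not the MovieSMobileNet implementation. It assumes grayscale frames stored as nested lists, non-overlapping square patches, and a hypothetical `patch_feature` function (mean and maximum intensity) standing in for the modified MobileNetV3 backbone. The attention step is plain scaled dot-product self-attention without learned projections. All function names are illustrative.

```python
import math


def split_into_patches(frame, patch_size):
    """Split an H x W frame (list of rows) into non-overlapping square
    patches, returned in row-major order."""
    height, width = len(frame), len(frame[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            rows = frame[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def patch_feature(patch):
    """Hypothetical stand-in for the backbone: [mean, max] intensity."""
    values = [v for row in patch for v in row]
    return [sum(values) / len(values), max(values)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(sequence):
    """Scaled dot-product self-attention over time for one patch position.

    `sequence` is a list of T feature vectors. Assumption: queries, keys,
    and values are the raw features (no learned projection matrices).
    """
    dim = len(sequence[0])
    scale = math.sqrt(dim)
    attended = []
    for query in sequence:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in sequence]
        weights = softmax(scores)
        attended.append([sum(w * value[d] for w, value in zip(weights, sequence))
                         for d in range(dim)])
    return attended


def clip_descriptor(frames, patch_size):
    """Attend over time at each patch position, then average over time
    and concatenate the per-position vectors into one clip descriptor."""
    per_frame = [[patch_feature(p) for p in split_into_patches(f, patch_size)]
                 for f in frames]
    descriptor = []
    for idx in range(len(per_frame[0])):
        sequence = [features[idx] for features in per_frame]
        attended = temporal_attention(sequence)
        dim = len(attended[0])
        descriptor.extend(sum(step[d] for step in attended) / len(attended)
                          for d in range(dim))
    return descriptor
```

For three 4x4 frames and `patch_size=2`, `clip_descriptor` returns a vector of length 8 (four patch positions, two features each). In a trained model the descriptor would feed a genre classification head, and the backbone and attention weights would be learned.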