Abstract:
The ability to automatically recognize violent behavior is a key technology for CCTV systems. However, obtaining effective features for detecting violence in CCTV videos remains challenging owing to the limited visual quality of the video data. In this work, we propose a novel representation learning approach to improve the detection rate of violent behaviors in videos. The proposed approach consists of two parts. In the first part, we leverage features extracted from an image-based deep convolutional neural network to describe the spatial information in each video frame. We also introduce a new kind of image feature, named multiscale convolutional features, to handle variations in the video data. In the second part, a bidirectional long short-term memory (LSTM) network is applied to learn a video-level classifier from both violent and non-violent video sequences. Experimental results on three challenging datasets demonstrate that the proposed detection method achieves higher accuracy than previous methods.