With the spread of smartphones, camera devices, and surveillance and monitoring equipment all over the world, video data is becoming increasingly easy to capture and store, and it is growing at an exponential rate. Moreover, video acquisition is no longer a professional's job: anyone can record a video and upload it to the internet. However, users tend to follow a 'capture first, filter later' mentality, giving little thought to timing, cutting, content, or view selection. Consequently, these user-generated videos consist of long, poorly filmed (suffering from poor illumination, shakiness, dynamic backgrounds, and so on), and unedited content, such as surveillance feeds, home videos, or video dumps from a wearable camera. These videos still contain useful information, yet most of them are neither practical nor enjoyable to review in detail. Hence, the demand for efficient ways to search and retrieve desired content is growing fast, while processing these videos consumes enormous resources in time, manpower, and machine configurations. Currently, users preview a video through various metadata, such as its thumbnail, title, description, or length, or by quickly skimming through it. However, this rarely gives users a concrete sense of the video content or lets them find significant content quickly. Consequently, the best videos have usually been carefully and manually edited to feature the highlights and trim out the boring segments. Video summarization therefore plays an important role in this context. A summary of a video is a brief representation of it that is still able to convey its significant content. A good summary should be concise yet offer high coverage, retaining the most informative and significant content. However, generating a good video summary is a challenging task because these requirements pull in opposite directions: the summary must be compact while including as much of the significant content as possible.
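As a toy illustration of this compactness-versus-coverage trade-off, summarization can be cast as subset selection over per-frame feature vectors. The sketch below (a hypothetical baseline, not the method proposed in this work) greedily picks a fixed budget of keyframes so as to maximize a facility-location coverage score, i.e., the summed similarity of every frame to its closest selected keyframe:

```python
import numpy as np

def greedy_summary(features, budget):
    """Greedily select `budget` keyframes maximizing coverage.

    Coverage of a selection S is the sum, over all frames, of the
    highest cosine similarity to any frame in S (a facility-location
    objective), so the summary stays compact yet representative.
    `features` is an (n_frames, dim) array of frame descriptors.
    """
    # Normalize rows so dot products become cosine similarities.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T                  # pairwise similarity matrix
    selected = []
    best_cover = np.zeros(len(feats))      # best similarity so far, per frame
    for _ in range(budget):
        # Marginal gain of adding each candidate frame to the summary.
        gains = np.maximum(sim, best_cover).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf          # never re-pick a selected frame
        pick = int(np.argmax(gains))
        selected.append(pick)
        best_cover = np.maximum(best_cover, sim[pick])
    return sorted(selected)
```

Because the facility-location objective is monotone submodular, this greedy procedure comes with the classic (1 - 1/e) approximation guarantee; in practice, richer objectives would also penalize redundancy and reward semantic importance.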
Moreover, the most basic yet challenging part of video summarization is identifying which parts are the most significant and important. Most user-generated videos contain only a few segments of frames in which significant content occurs. Traditional approaches based on frequently occurring content do not produce semantically meaningful results without prior knowledge, so they focus on specific domains such as sports or news. Identifying the semantics of a generic video thus remains difficult at the current stage of machine intelligence, and it is consequently necessary to develop new techniques to summarize user-generated videos. The objective of this research is to develop a video summarization method for user-generated video. Video summarization can be applied in many practical settings, such as analyzing surveillance data, video browsing, action recognition, or creating a visual diary. In specific domains, it can also be used to generate movie, sports, or news highlights. In addition, these summarization techniques may naturally translate to robotics applications in the future.
Zhuo Lei, Ke Sun, Qian Zhang, and Guoping Qiu. 2016. User Video Summarization Based on Joint Visual and Semantic Affinity Graph. In Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion (iV&L-MM '16).
This work was carried out at the International Doctoral Innovation Centre (IDIC). The authors acknowledge the financial support from Ningbo Education Bureau, Ningbo Science and Technology Bureau, China's MOST, and the University of Nottingham. The work is also partially supported by EPSRC grant no. EP/G037574/1.