Video and 3D Media Processing
Much research over the last two decades has been devoted to object recognition from large-scale image collections, and some systems have been reported to surpass human recognition performance. The next frontier is relation inference and knowledge extraction from video data. This project thus aims to go beyond image and object recognition, to video and relation inference, with the ultimate aim of inferring knowledge in the form of relation triplets from video. A related effort is to extract (pseudo) 3D models from video.
During Phase I of NExT, we developed a deep learning model for unlimited-vocabulary object recognition from social image collections. Research in Phase II will focus on detecting the relations between objects in images and videos. It involves tracking object motion trajectories in video, recognizing object types and actions, and inferring relations between pairs of objects. The result is a set of fundamental relation triplets underlying the basic semantics of video. This will bring the level of video semantics towards that of language, and facilitate the fusion of language and video in multimodal analysis. Research will also be carried out to infer higher-level semantics such as events, and to support other innovative tasks such as multimodal question answering, knowledge graph generation, and conversational frameworks. Separately, we will conduct research on transforming video into 3D models as an alternative approach to inferring video semantics. Part of this research is to transform video to 3D for VR/AR applications.
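To make the notion of a relation triplet concrete, the following is a minimal sketch of one possible representation, pairing two tracked objects with a predicate over a frame span. All class and field names here are illustrative assumptions, not the project's actual data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectTrack:
    # Hypothetical: identity of one object tracked across video frames.
    track_id: int
    category: str  # recognized object type, e.g. "person", "bicycle"

@dataclass(frozen=True)
class RelationTriplet:
    # Hypothetical: a relation holding between two tracks over a frame span.
    subject: ObjectTrack   # the acting object
    predicate: str         # relation or action, e.g. "ride", "next_to"
    obj: ObjectTrack       # the object acted upon
    start_frame: int       # first frame where the relation holds
    end_frame: int         # last frame where the relation holds

    def as_tuple(self):
        """Flatten to the (subject, predicate, object) form used in text."""
        return (self.subject.category, self.predicate, self.obj.category)

person = ObjectTrack(track_id=1, category="person")
bicycle = ObjectTrack(track_id=2, category="bicycle")
triplet = RelationTriplet(person, "ride", bicycle, start_frame=30, end_frame=120)
print(triplet.as_tuple())  # ('person', 'ride', 'bicycle')
```

Grounding each triplet in object tracks and a temporal extent, rather than in category labels alone, is what distinguishes video relation inference from its still-image counterpart.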
For the initial two years, we will work on video relation inference, video-to-3D conversion, videoQA on "how-to" videos in the cooking domain, and multimodal knowledge graph generation in the e-commerce and food domains. To kick-start our research and to spur this line of research in the international community, we will take the lead in building a large video relation corpus.