Unstructured data analytics is a fundamental technology in many AI applications across various domains, ranging from conversation to recommender systems and from fintech to healthcare. It aims to perceive data of different modalities, e.g., texts and videos, by recognizing their labels or hidden structures with deep learning models. We are particularly interested in incorporating knowledge guidance from a Multimodal Knowledge Graph (MMKG) into deep neural models for analyzing heterogeneous data, including texts, videos, and time-series data, and in verifying these models in any domain of interest.


Unlike a conventional text-based Knowledge Graph (KG) such as YAGO, an MMKG highlights the complementary knowledge of different modalities. To date, however, only a few studies have addressed MMKG, and the curated KGs are still far from complete. To fill this research gap, we aim to extend research on text-based KG construction to multimodal information sources. In addition, we will organize and explore knowledge based on events, which are more expressive than entities for understanding the semantics of multimodal data.


Based on MMKG, we aim to push forward the analysis of unstructured texts from shallow to deep understanding and from coarse-grained to fine-grained levels. The incorporated knowledge can provide the background needed for text understanding and the reasoning ability needed by downstream applications. However, finer granularities bring the problem of limited annotations for training. Thus, we will focus on incorporating prior knowledge for few-shot learning on texts and on developing knowledge inference to verify and correct unreliable text analytics.


The visual medium is another key component of MMKG because of the abundance of such information on the Internet. For video, visual relation understanding is a challenging task since it requires the system to understand many aspects of two visual entities, including their appearance, their actions, and the interactions between them. Visual relation understanding permits the convergence of multimodal knowledge, in which visual relations and semantic relations can be seamlessly linked to form a coherent MMKG. Here, we aim to mine the relations in an extensive collection of videos and discover visual knowledge to enhance existing knowledge graphs with visual support and common-sense knowledge. By taking this prior knowledge as guidance, we can further build a robust information filtering system, which is particularly useful for many video websites and apps.


Given the increasing amount of multimodal data, it is essential to advance the study of multimodal search and recommendation to alleviate the information overload faced by users. However, current recommendation systems estimate user preferences from historical user behaviors; they hardly know what exactly the user likes or the exact reasons the user likes an item. Hence, they often fail to provide accurate and explainable recommendation services. To tackle this problem, we envision a Multimodal Conversational Search and Recommendation (MCSR) system that is capable of resolving the information asymmetry problem by directly asking for the user’s intent through multiple turns of conversation. Such a system is able to make decisions based on clearer user intentions, thus greatly improving transparency and robustness. Our team has laid solid foundations for MCSR with heterogeneous data and has played a leading role in the current stage of MCSR. Building on this foundation, we will move towards the next step of developing advanced MCSR techniques and their real-world applications.
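To make the idea of asking for intent over multiple turns concrete, the following is a minimal sketch, not the method of any system cited here. It assumes a hypothetical toy catalog of items with yes/no attributes (`CATALOG`), a hypothetical `pick_question` heuristic that asks about the attribute splitting the remaining candidates most evenly, and a callable `answer` standing in for the user.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy catalog (an assumption for illustration): item -> yes/no attributes.
CATALOG: Dict[str, Dict[str, bool]] = {
    "cooking_video": {"is_video": True, "is_tutorial": True, "is_long": False},
    "travel_vlog": {"is_video": True, "is_tutorial": False, "is_long": True},
    "recipe_article": {"is_video": False, "is_tutorial": True, "is_long": False},
    "news_report": {"is_video": False, "is_tutorial": False, "is_long": True},
}


def pick_question(candidates: List[str], asked: List[str]) -> Optional[str]:
    """Return the unasked attribute whose yes/no split over candidates is most balanced."""
    best: Optional[str] = None
    best_gap: Optional[int] = None
    attributes = sorted({attr for item in candidates for attr in CATALOG[item]})
    for attr in attributes:
        if attr in asked:
            continue
        yes = sum(CATALOG[item][attr] for item in candidates)
        if yes in (0, len(candidates)):
            continue  # every candidate agrees, so asking would not narrow anything
        gap = abs(2 * yes - len(candidates))
        if best_gap is None or gap < best_gap:
            best, best_gap = attr, gap
    return best


def converse(answer: Callable[[str], bool], max_turns: int = 5) -> Tuple[List[str], List[str]]:
    """Ask attribute questions until one candidate remains or no useful question is left."""
    candidates = sorted(CATALOG)
    asked: List[str] = []
    for _ in range(max_turns):
        if len(candidates) <= 1:
            break
        attr = pick_question(candidates, asked)
        if attr is None:
            break
        asked.append(attr)
        reply = answer(attr)
        # Keep only the items consistent with the user's stated preference.
        candidates = [item for item in candidates if CATALOG[item][attr] == reply]
    return candidates, asked


if __name__ == "__main__":
    # A scripted user whose true intent is "travel_vlog".
    items, questions = converse(lambda attr: CATALOG["travel_vlog"][attr])
    print(items, questions)
```

Because each recommendation follows from explicit answers, the list of asked attributes doubles as a simple explanation of why an item was recommended; a real MCSR system would replace the balance heuristic with a learned policy and the attribute table with multimodal evidence.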


The research breakthroughs that we aim to achieve in this direction include:


  • The incorporation of textual and visual semantics for the robust construction and updating of MMKG.

  • Representation learning over KGs that captures the evolution of structures over time.

  • The fundamental theories and framework for MCSR; in particular, how to design intervention strategies that seamlessly integrate conversations into multimodal search in order to better help users navigate the complex search process.

  • Advanced multimodal search techniques, including the study of emerging techniques such as contrastive learning and causal reasoning, and the exploration of how such techniques can contribute to the effectiveness of MCSR.

  • A user simulator that offers a scalable, reproducible, and cost-efficient solution for building and evaluating a dialogue agent for MCSR. This is in contrast to the current reliance on human-based trials, which are costly and neither scalable nor reproducible.
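The user simulator item above can be illustrated with a small sketch. Everything here is an assumption for illustration: the toy `ITEMS` table, the `SimulatedUser` class, and the baseline agent in `run_episode` that asks attributes in random order. The point it shows is that a seeded simulator makes evaluation repeatable and cheap compared with human trials.

```python
import random
from typing import Dict, Tuple

# Hypothetical toy item table (an assumption for illustration): item -> yes/no attributes.
ITEMS: Dict[str, Dict[str, bool]] = {
    "cooking_video": {"is_video": True, "is_tutorial": True, "is_long": False},
    "travel_vlog": {"is_video": True, "is_tutorial": False, "is_long": True},
    "recipe_article": {"is_video": False, "is_tutorial": True, "is_long": False},
    "news_report": {"is_video": False, "is_tutorial": False, "is_long": True},
}


class SimulatedUser:
    """A simulated user who has one target item and answers truthfully about it."""

    def __init__(self, target: str) -> None:
        self.target = target

    def answer(self, attribute: str) -> bool:
        return ITEMS[self.target][attribute]

    def accepts(self, item: str) -> bool:
        return item == self.target


def run_episode(user: SimulatedUser, rng: random.Random, max_turns: int = 4) -> Tuple[bool, int]:
    """Baseline agent: ask attributes in random order, then recommend one item."""
    candidates = sorted(ITEMS)
    attributes = sorted(next(iter(ITEMS.values())))
    rng.shuffle(attributes)
    turns = 0
    for attr in attributes[:max_turns]:
        if len(candidates) == 1:
            break
        turns += 1
        reply = user.answer(attr)
        candidates = [item for item in candidates if ITEMS[item][attr] == reply]
    turns += 1  # the final recommendation counts as one turn
    return user.accepts(candidates[0]), turns


def evaluate(seed: int, episodes: int = 100) -> Tuple[float, float]:
    """Return (success rate, average turns) over simulated dialogues for a given seed."""
    rng = random.Random(seed)
    successes = 0
    total_turns = 0
    for _ in range(episodes):
        user = SimulatedUser(rng.choice(sorted(ITEMS)))
        ok, turns = run_episode(user, rng)
        successes += int(ok)
        total_turns += turns
    return successes / episodes, total_turns / episodes


if __name__ == "__main__":
    print(evaluate(seed=0))
```

Running `evaluate` twice with the same seed yields identical numbers, which is the reproducibility property that human trials lack. A realistic simulator would also need to model noisy or inconsistent answers, which this sketch deliberately omits.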


Rebele, T., Suchanek, F., Hoffart, J., Biega, J., Kuzey, E., & Weikum, G. YAGO: A multilingual knowledge base from Wikipedia, WordNet, and GeoNames. In International Semantic Web Conference, 177-185, 2016.

Wang, M., Wang, H., Qi, G., & Zheng, Q. Richpedia: A large-scale, comprehensive multimodal knowledge graph. In Big Data Research, 22, 2020.


Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. ERNIE: Enhanced language representation with informative entities. In ACL, 1441-1451, 2019.


Ding, M., Zhou, C., Chen, Q., Yang, H., & Tang, J. Cognitive graph for multi-hop reading comprehension at scale. In ACL, 2694-2703, 2019.


Xiao, J., Shang, X., Yang, X., Tang, S., & Chua, T.S. Visual relation grounding in videos. In ECCV, 447-464, 2020.


Lei, W., Zhang, G., He, X., Miao, Y., Wang, X., Chen, L., & Chua, T.S. Interactive path reasoning on graph for conversational recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2073-2083, 2020.

Lei, W., He, X., de Rijke, M., & Chua, T.S. Conversational recommendation: Formulation, methods, and evaluation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2425-2428, 2020.

Lei, W., He, X., Miao, Y., Wu, Q., Hong, R., Kan, M.Y., & Chua, T.S. Estimation-action-reflection: Towards deep interaction between conversational and recommender systems. In Proceedings of the 13th International Conference on Web Search and Data Mining, 304-312, 2020.

Liao, L., Ma, Y., He, X., Hong, R., & Chua, T.S. Knowledge-aware multimodal dialogue systems. In Proceedings of the ACM Multimedia Conference, 801-809, 2018.