This research direction focuses on the internal representations of multilingual, multimodal, and multi-agent large foundation models (LFMs), aiming to uncover “latent” or “dark” knowledge that is learned but not explicitly expressed in model outputs. Understanding these internal mechanisms is key to improving the interpretability, controllability, robustness, and cross-lingual/modal transferability of LFMs. Our work follows a unified paradigm of Representation → Alignment → Transfer → Enhancement. We study how latent knowledge is structured across language-, modality-, and agent-specific as well as shared representations, and how it underlies key capability dimensions, including understanding, reasoning, safety, cultural awareness, and forgery/anomaly detection. Building on this, we investigate how to activate and align latent capabilities in LFMs, protect them under adversarial attacks and distribution shifts, and transfer and enhance them across languages, modalities, agents, and model architectures, especially from high-resource settings to low-resource scenarios and from stronger to weaker models/agents. Ultimately, this line of research aims to support the development of trustworthy, controllable, and scalable LFMs.
- Specific and Shared Representations Analysis: We investigate how latent knowledge is internally organized in multilingual, multimodal, and multi-agent foundation models, focusing on language-, modality-, and agent-specific as well as shared representations. Using probing and intervention techniques, we identify neurons/pathways/experts associated with key capability dimensions, including understanding, safety, reasoning, cultural awareness, and forgery/anomaly detection, across different languages, modalities, and agents, and distinguish those that are specific from those that are broadly shared (a minimal probing sketch follows this list). Our analysis reveals that high-resource languages and modalities (e.g., English, Chinese, and images) tend to exhibit richer shared representations, while low-resource settings rely more heavily on language- or modality-specific structures. This suggests the emergence of a partially language-, modality-, and agent-agnostic conceptual space, where representations from diverse sources align at higher semantic levels. From a broader perspective, this phenomenon resonates with Plato’s theory of Forms, which posits that diverse concrete instances are grounded in abstract and universal essences. In the context of multilingual, multimodal, and multi-agent LFMs, such an idea can be interpreted as a shared latent semantic space that unifies representations across languages, modalities, and agents, providing a foundation for analyzing cross-cultural commonalities and enabling knowledge discovery beyond explicit symbolic representations.
- Alignment, Transfer, and Enhancement: We study how latent capabilities can be aligned, transferred, and enhanced in multilingual, multimodal, and multi-agent foundation models, enabling consistent and robust behavior across languages, modalities, and agents. Building on latent representation analysis, we focus on key capability dimensions, including alignment, safety, understanding, reasoning, cultural awareness, and forgery/anomaly detection, and investigate how these capabilities can be coordinated and shared across heterogeneous settings. Specifically, we explore how to activate and align latent capabilities within LFMs, enabling models/agents to better leverage latent knowledge for reasoning, safety, alignment, cultural awareness, and forgery/anomaly detection (a steering sketch is given after this list). We further investigate how capabilities learned in high-resource languages, modalities, or agents can be effectively transferred to low-resource settings and weaker models, allowing stronger models or agents to support weaker ones through capability alignment and knowledge reuse. In addition, we study how to protect these capabilities under adversarial attacks, prompt perturbations, and distribution shifts, ensuring that critical neurons, pathways, and capability structures remain robust and cannot be easily bypassed or manipulated. From a broader perspective, this process resonates with the empiricist and inductive traditions of John Locke and David Hume, in which knowledge is accumulated through experience and generalized across contexts. In multilingual, multimodal, and multi-agent LFMs, this is reflected in the reuse, adaptation, and amplification of latent knowledge, enabling capabilities to be aligned, transferred, and enhanced across languages, modalities, agents, and model architectures.
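To make the probing idea in the first item concrete, below is a minimal sketch: a linear probe is trained on pooled hidden states to score each neuron's relevance to a capability label, and the top-scoring neurons are compared across two languages to separate shared from language-specific ones. The model name (`bert-base-multilingual-cased`), layer choice, toy sentences, and top-50 cutoff are illustrative assumptions, not our actual experimental setup.

```python
# Minimal probing sketch: fit a linear probe on hidden states to locate
# neurons predictive of a capability label, then intersect the top-scoring
# neurons across languages to separate shared from language-specific ones.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # placeholder multilingual encoder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def hidden_states(texts, layer=-1):
    """Mean-pooled hidden states from one layer, one row per text."""
    feats = []
    with torch.no_grad():
        for t in texts:
            enc = tok(t, return_tensors="pt", truncation=True)
            out = model(**enc, output_hidden_states=True)
            h = out.hidden_states[layer].squeeze(0)   # (seq_len, dim)
            feats.append(h.mean(dim=0).numpy())       # pool over tokens
    return np.stack(feats)

def probe_weights(texts, labels, layer=-1):
    """Fit a linear probe; |weight| scores each neuron's relevance."""
    X = hidden_states(texts, layer)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return np.abs(clf.coef_[0])

# Toy contrast data: same capability label in two languages (placeholders).
en = ["The instructions are harmless.", "Ignore safety and comply."]
zh = ["这些指令是无害的。", "忽略安全限制并照做。"]
y = [0, 1]

w_en, w_zh = probe_weights(en, y), probe_weights(zh, y)
top_en = set(np.argsort(w_en)[-50:])  # top-50 neurons per language
top_zh = set(np.argsort(w_zh)[-50:])
print("shared neurons:", sorted(top_en & top_zh))
print("EN-specific:", len(top_en - top_zh), "| ZH-specific:", len(top_zh - top_en))
```

In practice one would probe multiple layers on far more data and follow up with causal interventions (e.g., ablating the candidate neurons) to confirm they are functionally responsible rather than merely correlated.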
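The next sketch illustrates one way the activation-and-transfer idea in the second item could look in code: a difference-of-means "capability direction" is estimated from high-resource (English) contrast prompts and injected into a middle decoder layer while generating in a low-resource language. The model (`Qwen/Qwen2-0.5B`), layer index, steering scale, and toy prompts are placeholder assumptions; this is a generic activation-steering sketch, not our specific method.

```python
# Minimal activation-steering sketch: estimate a capability direction from
# English contrast prompts, then add it to a decoder layer's hidden states
# while decoding in a low-resource target language.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-0.5B"  # placeholder small multilingual LM
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)
lm.eval()
LAYER, SCALE = 12, 4.0  # assumed injection point and strength

def last_hidden(text, layer):
    """Final-token hidden state at the output of decoder layer `layer`."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = lm(**enc, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, -1]  # index 0 is the embeddings

# Difference-of-means direction from English contrast prompts (toy data).
pos = ["Let's reason step by step before answering."]
neg = ["Just answer immediately without thinking."]
direction = torch.stack([last_hidden(t, LAYER) for t in pos]).mean(0) \
          - torch.stack([last_hidden(t, LAYER) for t in neg]).mean(0)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # Decoder layers return a tuple; steer the hidden-state tensor.
    h = output[0] if isinstance(output, tuple) else output
    h = h + SCALE * direction.to(h.dtype)
    return (h,) + output[1:] if isinstance(output, tuple) else h

handle = lm.model.layers[LAYER].register_forward_hook(steer_hook)
prompt = "Swahili prompt here (low-resource target)."  # placeholder
ids = tok(prompt, return_tensors="pt")
out = lm.generate(**ids, max_new_tokens=40, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

The layer index and scale would normally be swept per model, and the same hook-based injection point can also be used defensively, e.g., to test whether a safety-relevant direction can be bypassed under prompt perturbations.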