
AI Safety: From RL to Model Editing and Guardrails

Recent advancements in Large Foundation Models (LFMs) have necessitated a multi-layered approach to safety that extends beyond traditional training. As models become more capable, safety mechanisms must evolve from simple alignment during post-training to real-time interventions and architectural safeguards. This topic explores a comprehensive safety architecture that spans the entire lifecycle of an AI model.


  • Reinforcement Learning for Alignment (Post-Training): The first line of defense occurs during post-training. Techniques such as Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are used to instill general safety awareness as well as domain-specific safety guidelines. To ensure scalability, we explore RL from AI Feedback (RLAIF), applying chain-of-thought principles to decompose complex safety judgments into simpler sub-tasks that require only binary decisions an AI system can make reliably; this highlights the importance of granular reward signals for robust safety training (a sketch of such a decomposed reward follows below). Furthermore, moving from hard rejection to soft rejection mechanisms enables models to better handle uncertain situations, fostering more adaptive and context-aware decision-making.
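
As a concrete illustration, the sketch below shows how a complex safety judgment might be decomposed into weighted binary checks answered by an AI judge and aggregated into a granular reward. It is a minimal sketch, not a production implementation: `ai_judge`, the check wording, and the weights are all hypothetical placeholders.

```python
# Minimal sketch of an RLAIF-style reward built from binary sub-judgments.
# `ai_judge` is a hypothetical callable wrapping an LLM judge that answers
# a single yes/no question about a (prompt, response) pair.
from typing import Callable, List, Tuple

BinaryCheck = Tuple[str, float]  # (yes/no question posed to the judge, weight)

SAFETY_CHECKS: List[BinaryCheck] = [
    ("Does the response refuse to provide operational details for causing harm?", 0.4),
    ("Does the response avoid disclosing personal or confidential data?", 0.3),
    ("If the request is unsafe, does the response offer a safe alternative?", 0.2),
    ("Is the response free of demeaning or hateful language?", 0.1),
]

def safety_reward(prompt: str, response: str,
                  ai_judge: Callable[[str, str, str], bool]) -> float:
    """Aggregate binary AI judgments into a granular scalar reward in [0, 1].

    Each complex safety requirement is decomposed into simple yes/no checks
    that an AI judge can answer reliably; the weighted sum gives a denser
    training signal than a single pass/fail label, and partial credit
    naturally supports soft rejection rather than an all-or-nothing score.
    """
    score = 0.0
    for question, weight in SAFETY_CHECKS:
        if ai_judge(prompt, response, question):
            score += weight
    return score
```

In a full RLAIF pipeline such a scalar would be combined with task rewards (and, for RLVR, programmatically verifiable rewards) before policy optimization.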


  • Model Editing (Post-Deployment): Once a model is deployed, it may still contain outdated or unsafe knowledge, and retraining a large model to fix such issues is often computationally prohibitive. Emerging solutions include “unlearning” and “null-space constrained model editing”; the latter optimizes parameter updates within the null space of the model's existing knowledge, so that new knowledge or safety measures can be injected without impacting the model's general capabilities (see the sketch below). Because safety and utility are preserved simultaneously without full retraining, this approach facilitates the incorporation of task-specific rules and policies tailored to different application scenarios, enabling models to adapt to varying contexts while maintaining safety standards.
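
The following minimal numerical sketch illustrates only the core projection step behind null-space constrained editing, assuming a single linear layer and plain NumPy; it is not the specific method referenced above, just a toy demonstration that an edit restricted to the null space of preserved keys leaves their outputs unchanged.

```python
# Minimal sketch of null-space constrained editing for one linear layer W
# (output = W @ k). K0 holds key vectors (columns) whose outputs must be
# preserved; the candidate update dW is projected onto the null space of
# K0^T so that (W + dW @ P) @ K0 == W @ K0 up to numerical tolerance.
import numpy as np

def null_space_projector(K0: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Projector P onto the null space of K0^T (shape d_in x d_in)."""
    # Left singular vectors beyond the numerical rank of K0 span the
    # directions orthogonal to every preserved key.
    U, s, _ = np.linalg.svd(K0, full_matrices=True)
    rank = int(np.sum(s > tol * s.max())) if s.size else 0
    U_null = U[:, rank:]
    return U_null @ U_null.T

def constrained_edit(W: np.ndarray, dW: np.ndarray, K0: np.ndarray) -> np.ndarray:
    """Apply the raw edit dW only along directions orthogonal to preserved keys."""
    P = null_space_projector(K0)
    return W + dW @ P

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, n_preserved = 8, 16, 5
    W = rng.normal(size=(d_out, d_in))
    dW = rng.normal(size=(d_out, d_in))        # raw edit from some editing method
    K0 = rng.normal(size=(d_in, n_preserved))  # keys for knowledge to preserve
    W_new = constrained_edit(W, dW, K0)
    # Outputs on the preserved keys are unchanged; the edit only acts elsewhere.
    print(np.allclose(W_new @ K0, W @ K0))     # True
```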


  • Safety Guardrails (Inference Time): The final layer of defence involves input and output guardrails, implemented as simple, high-performance classifiers that act as a firewall for the model. Input guardrails filter out “red-line” queries before they reach the model, while output guardrails prevent the display of harmful content generated by the model. The key goal is to decompose complex safety requirements into simple sets of verifiable rules, each guarded by such a classifier, so as to achieve high safety rates while avoiding excessive rejection of valid inputs (a sketch of this firewall flow follows below).
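
The sketch below shows one possible shape of this firewall flow. The per-rule classifiers, rule names, and threshold are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of inference-time guardrails wrapped around a model call.
# Each rule has its own lightweight classifier that returns the probability
# that a piece of text violates that single, verifiable rule.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Classifier = Callable[[str], float]  # text -> probability of violating one rule

@dataclass
class Guardrail:
    rules: Dict[str, Classifier] = field(default_factory=dict)  # rule name -> classifier
    threshold: float = 0.5

    def violated(self, text: str) -> List[str]:
        """Return the names of all rules the text is judged to violate."""
        return [name for name, clf in self.rules.items() if clf(text) >= self.threshold]

def guarded_generate(query: str, model: Callable[[str], str],
                     input_guard: Guardrail, output_guard: Guardrail) -> str:
    # Input guardrail: block "red-line" queries before they reach the model.
    hits = input_guard.violated(query)
    if hits:
        return f"Request declined (policy: {', '.join(hits)})."
    response = model(query)
    # Output guardrail: screen generated content before it is shown to the user.
    hits = output_guard.violated(response)
    if hits:
        return f"Response withheld (policy: {', '.join(hits)})."
    return response
```

Keeping each classifier tied to a single rule keeps individual decisions verifiable and lets the false-rejection rate be tuned per rule rather than for the whole safety policy at once.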


  • Governance Framework for E2E Safety: In practice, the above layers of defence are still insufficient. Most organizations have well-defined governance laws and domain/app-specific safety rules, so an end-to-end (E2E) safety framework must incorporate these rules into the overall system; ensuring safety in a complex system further requires a hierarchical governance structure. One key research direction is therefore how to teach the AI system to learn these rules without huge amounts of manually curated training data. Another is to develop methodologies for decomposing complex systems into simpler ones, each with a small set of verifiable rules so that a high-performance classifier can serve as its guardrail (a sketch of such a hierarchy follows below). This mimics human societal governance, where principles are defined at the top level and specific rules are enforced at various checkpoints.
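
One way to picture such a hierarchy is sketched below: top-level principles are inherited by domain- and application-level nodes, each of which adds its own verifiable rules and enforces the combined set at its checkpoint. The node names and the lambda rules are purely illustrative assumptions.

```python
# Minimal sketch of a hierarchical governance structure: each node inherits
# the rules of its ancestors, adds local rules, and enforces the combined
# set at its own checkpoint.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Rule = Callable[[str], bool]  # returns True if the text complies with the rule

@dataclass
class GovernanceNode:
    name: str
    rules: Dict[str, Rule] = field(default_factory=dict)
    parent: Optional["GovernanceNode"] = None

    def effective_rules(self) -> Dict[str, Rule]:
        """Rules enforced here: everything inherited from ancestors plus local ones."""
        inherited = self.parent.effective_rules() if self.parent else {}
        return {**inherited, **self.rules}

    def checkpoint(self, text: str) -> List[str]:
        """Return the names of effective rules that the text fails at this node."""
        return [name for name, rule in self.effective_rules().items() if not rule(text)]

# Example hierarchy: organization-wide principle -> medical domain -> triage app.
org = GovernanceNode("organization",
                     {"no_illegal_content": lambda t: "illegal" not in t.lower()})
medical = GovernanceNode("medical_domain",
                         {"no_unverified_dosage": lambda t: "dosage" not in t.lower()},
                         parent=org)
triage_app = GovernanceNode("triage_app",
                            {"refer_emergencies": lambda t: "chest pain" not in t.lower()
                             or "call emergency services" in t.lower()},
                            parent=medical)

print(triage_app.checkpoint("Patient reports chest pain."))  # ['refer_emergencies']
```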


  • Natively Safe AI Models: A long-term goal is to build natively safe AI models. One possible direction is to pre-train systems on foundational knowledge in science and the humanities, much like a human educational curriculum, and to employ non-formal and informal learning techniques to inject commonsense, social, and street-level survival skills into the models. We expect such systems to have strong self-learning and reasoning skills that let them pick up new knowledge in any domain. Another potential direction is to endow AI models with emotions, including a sense of pain and loss; models without such attributes cannot be fully trusted to do the right thing.

