MIT researchers have developed a method to reduce the tendency of large language models (LLMs) to produce confident but incorrect answers, a phenomenon known as hallucination. The approach, called Reinforcement Learning with Calibration Rewards (RLCR), trains models to assess their own uncertainty and to report a confidence estimate alongside each answer.
Standard training methods, including Reinforcement Learning from Human Feedback (RLHF), typically reward only the correctness of the final answer, which inadvertently encourages models to guess confidently even when they lack the relevant knowledge. RLCR addresses this by adding a calibration term to the reward signal, penalizing the model when its stated confidence does not match how often it is actually right, so it learns to say "I don't know" or to hedge when appropriate, as sketched below.
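To make the idea concrete, here is a minimal Python sketch of a calibration-aware reward of this kind. Combining a binary correctness term with a Brier-score penalty on the stated confidence is an illustration consistent with the description above, not a verbatim reproduction of the researchers' implementation.

```python
def calibration_aware_reward(is_correct: bool, confidence: float) -> float:
    """Illustrative RLCR-style reward (an assumed form, not the paper's exact code).

    A binary correctness term is combined with a Brier-score penalty on the
    model's stated confidence, so a confidently wrong answer is punished far
    more than an honestly uncertain one.
    """
    correct = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - correct) ** 2  # zero when confidence matches the outcome
    return correct - brier_penalty


# Confidently wrong answers score worst; hedged wrong answers lose little.
print(calibration_aware_reward(False, 0.95))  # -0.9025
print(calibration_aware_reward(False, 0.10))  # about -0.01
print(calibration_aware_reward(True, 0.95))   #  0.9975
```

Under a reward like this, the best strategy when unsure is to state low confidence rather than bluff, which is exactly the behavior the training is meant to elicit.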
In experiments, models trained with RLCR were significantly better calibrated: their expressed confidence tracked their actual accuracy much more closely. This advance could be critical for deploying AI in high-stakes fields such as medicine, law, and engineering, where reliability is paramount.
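Calibration of this kind is commonly quantified with expected calibration error (ECE), which compares a model's average stated confidence to its empirical accuracy within confidence buckets. The sketch below shows a standard formulation of that metric; it is an assumption about how such results are typically measured, not necessarily the exact evaluation used in the study.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Standard ECE: bucket predictions by stated confidence, then average the
    accuracy-vs-confidence gap across buckets, weighted by bucket size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width confidence buckets.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bucket
    return ece


# A perfectly calibrated model (e.g. right 70% of the time when it says it is
# 70% confident) scores 0.0; larger values indicate over- or under-confidence.
```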
The research was conducted by scientists at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL).