
Researchers at Nvidia have developed a new technique that flips the script on how large language models (LLMs) learn to reason.
The method, called reinforcement learning pre-training (RLP), integrates RL into the initial training phase rather than saving it for the end.
This approach encourages the model to "think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in the pretraining," the researchers state in their paper.
By learning to reason on plain text without needing external verifiers, models trained with RLP show significant improvements in learning complex reasoning tasks downstream, hinting at a future of more capable and adaptable AI for real-world tasks.
The typical LLM training cycle
Typically, large language models are first pre-trained on vast amounts of text using a "next-token prediction" objective, where they are given a string of text and asked to repeatedly guess what the next word (or token) will be. In this phase, they learn grammar, facts, and basic associations.
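As a rough illustration (not code from the paper), that objective reduces to a cross-entropy loss over shifted tokens. The sketch below assumes a PyTorch-style causal language model whose forward pass returns a logits tensor; the function name and interface are hypothetical.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Minimal sketch of the standard pre-training objective.

    Assumes `model(token_ids)` returns an object with a `.logits`
    tensor of shape [batch, seq_len, vocab_size], as in common
    causal-LM APIs (illustrative interface, not Nvidia's code).
    """
    logits = model(token_ids).logits
    # Score each position against the token that actually follows it.
    preds = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(preds, targets)
```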
In the later post-training phase, models usually learn complex reasoning skills such as chain-of-thought (CoT), where a model lays out its reasoning step by step. This stage typically involves supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), which require specialized, curated datasets.
The paper's authors argue this sequential process doesn't match human comprehension, which is "not a linear token-by-token process, but rather a parallel integration of input with prior knowledge." Current pre-training methods lack this mechanism, hindering a model's ability to develop deep reasoning from the start.
How reinforcement learning pre-training works
RLP reframes this process by treating CoT generation as an action the model takes before predicting the next token. At each step, the model first generates an internal "thought," or reasoning chain. It then predicts the next word in the text, using the original context augmented with its new thought.
The model receives a reward based on how much its thought improved the accuracy of its prediction compared to a baseline that did not generate a thought (pure next-token prediction). This reward signal is calculated automatically from the change in likelihood, eliminating the need for external verifiers or human-labeled data.
The reward is positive only when the generated thought helps the model better predict the next token. By rewarding thoughts based on their predictive benefit, RLP effectively teaches the model how to think usefully on the same massive, unstructured datasets used for standard pre-training.
This continuous feedback loop allows the model to learn when a simple predictive guess is sufficient and when it needs to engage in deeper reasoning. As the researchers put it, "RLP is designed to shape thinking in base models by rewarding only those thoughts that measurably help next-token prediction."
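A minimal sketch of that reward idea, assuming a Hugging Face-style causal LM and a single ground-truth next token, might look like the following. This is an illustration of the verifier-free signal described above, not the paper's exact formulation, and the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def rlp_style_reward(model, context_ids, thought_ids, next_token_id):
    """Illustrative verifier-free reward: how much does a sampled
    'thought' raise the log-likelihood of the true next token compared
    with predicting from the raw context alone? (Sketch only, not the
    paper's implementation.)
    """
    with torch.no_grad():
        # Baseline: plain next-token prediction, no thought.
        base_logits = model(context_ids).logits[:, -1, :]
        base_logp = F.log_softmax(base_logits, dim=-1)[0, next_token_id]

        # Thought-conditioned: append the generated reasoning chain,
        # then score the same ground-truth token.
        augmented = torch.cat([context_ids, thought_ids], dim=1)
        aug_logits = model(augmented).logits[:, -1, :]
        aug_logp = F.log_softmax(aug_logits, dim=-1)[0, next_token_id]

    # Positive only when the thought makes the true token more likely.
    return (aug_logp - base_logp).item()
```

In the full method, a signal like this would drive an RL update over the thought tokens; the sketch only shows how a reward can be computed from likelihoods alone, with no external verifier or labeled data.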
This foundational approach, however, doesn't make later fine-tuning stages obsolete. According to Bryan Catanzaro, VP of applied deep learning research at Nvidia and a co-author of the paper, RLP is designed to complement, not replace, these crucial steps. "RLP isn't meant to replace the later post-training stages like supervised fine-tuning or reinforcement learning from human feedback," Catanzaro told VentureBeat. "These stages remain crucial for refining model behavior… It's really designed to amplify the effectiveness of those later phases by giving the model a head start."
RLP in action
In experiments with Qwen3-1.7B and Nemotron-Nano-12B, Nvidia's team tested RLP across a suite of math and science reasoning benchmarks. The results show that models enhanced with RLP consistently outperformed their conventionally trained counterparts, with particularly strong gains on reasoning-heavy tasks.
For an enterprise, this improved reasoning could translate to more reliable outputs in multi-step workflows like financial analysis or legal document summarization.
"RLP encourages the model during pretraining to think before it predicts, helping the model internalize a more coherent reasoning style," said Catanzaro. "This could help reduce subtle logical errors, especially in longer workflows."
While stressing that RLP-trained models will still need the usual guardrails such as verification layers, human oversight, and consistency checks, Catanzaro said that "RLP gives you a stronger baseline."
Importantly, the benefits of RLP compound rather than disappear during subsequent fine-tuning stages (catastrophic forgetting is a common problem in LLM training, where later training stages cause the model to forget its previously learned skills and knowledge). The RLP-trained model achieved an overall score that was 7-8% higher than baselines after an identical post-training regimen. The researchers conclude that RLP "establishes strong reasoning foundations that are not washed out by downstream alignment but instead compound with post-training."
The efficiency of the technique is a key finding. On the Qwen3-1.7B model, RLP improved performance by 17% over standard continuous pre-training and also beat a similar technique called Reinforcement Pretraining via prefix-matching rewards (RPT). This advantage held even when the baseline model was trained with 35 times more data to match the computational cost, confirming the gains come from the method itself, not just more processing.
Moreover, RLP demonstrates impressive scalability and versatility, successfully extracting a reasoning signal from general-purpose web data, not just curated datasets. When applied to the hybrid Mamba-Transformer model Nemotron-Nano-12B, RLP achieved a 35% relative improvement over a heavily trained baseline while using just a tiny fraction of the data.
While these results point toward a more efficient path for building powerful models, Catanzaro frames the innovation as a fundamental shift in the learning process itself, rather than an immediate solution to high training costs.
"This research is exciting because it offers a shift in how models absorb information during pretraining, leading to a smarter learning process," he explained. "It wouldn't replace large-scale pretraining, but offers another creative method for building the best models."
A new foundation for AI training
Ultimately, RLP points toward a future where pre-training is no longer a monolithic process of next-token prediction. Instead, the next generation of models could be built on a hybrid of objectives, creating AI that learns to think more robustly from day one. Catanzaro offers a powerful analogy to frame this shift:
"Next-token prediction teaches a model what the world looks like; reinforcement-style objectives like RLP can teach it how to think about what it's seeing," he said. "The combination of these two objectives could help models develop deeper, more structured thinking much earlier in training… Tools like RLP can build on top of that foundation, making learning more active, curious, and even more efficient."
There's still a lot to learn about the dynamics of reinforcement learning in the pre-training phase, but what seems clear is that "introducing exploration earlier in training opens a new axis for scaling — not just in size, but in how models learn to reason," Catanzaro said.
