This year has brought major advances in reinforcement fine-tuning (RFT) of large language models (LLMs). A major development was the release of DeepSeek-R1, an open-source LLM that demonstrated how reinforcement fine-tuning with verifiable rewards could push models to reason as effectively as the leading closed-source LLMs at the time.
Getting familiar with the research
Since its release, research teams have rapidly explored how RFT’s effects compare with those of supervised fine-tuning (SFT), in which a pre-trained model is further trained on curated, often non-public, data or examples to better handle a specific use case. Recently, two published academic papers offered seemingly conflicting takes on how RFT changes model behavior, and on what is actually happening inside these models. For the SmarterDx machine learning team, unpacking and reconciling these new ideas is simply part of the job. I’d like to share our insights below.
To begin, let’s get familiar with the overall arguments of each paper:
- SFT memorizes, RL generalizes: The first paper shows that in card-based reasoning tasks, supervised fine-tuning boosted accuracy on familiar problems but hurt performance on unfamiliar ones, which suggests that SFT models rely on memorization. Reinforcement fine-tuning, by contrast, improved performance on both familiar and unfamiliar reasoning tasks, which supports the conclusion that RFT generalizes better. In summary: SFT handles reasoning well as long as the tasks are familiar, while RFT handles both familiar and unfamiliar tasks.
- Does reinforcement learning really incentivize reasoning?: The second paper looks at how RFT handled math and coding tasks. On both, RFT models did better, but only on the first try. When models were allowed multiple attempts per task (a setup called pass@k, where a problem counts as solved if any one of k attempts is correct; see the sketch after this list), the base models caught up. The takeaway: RFT, at least in today’s LLMs, doesn’t invent new reasoning strategies; it just steers models toward the better ones more consistently.
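For concreteness, here is a minimal sketch of the standard unbiased pass@k estimator used in code-generation evaluations; the sample counts in the usage example are hypothetical, not numbers from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k attempts
    is correct, given c correct answers among n sampled attempts."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 4 correct answers out of 100 sampled attempts.
print(pass_at_k(n=100, c=4, k=1))   # 0.04  -> single-try accuracy is low
print(pass_at_k(n=100, c=4, k=50))  # ~0.94 -> with many tries, rare correct paths surface
```

This is exactly the dynamic the paper highlights: at k=1 the fine-tuned model’s consistency wins, but at large k a base model that occasionally finds the right path closes the gap.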
Opposing perspectives?
At first glance, these results might look like they’re at odds with each other. But to me, they’re actually complementary. Both papers suggest RFT makes models more reliable without fundamentally changing their underlying capabilities. And while the findings are solid, I don’t see them as revolutionary — rather, they’re more like useful clarifications of what RFT is and isn’t.
But to take a step back, let’s clarify why the conclusions seem to contradict each other: the first paper says that RFT enables generalization, while the second says that RFT restricts it.
Breaking it down
How can these claims both be true? From a machine learning perspective, I’d argue the tension arises because the two papers use different definitions of the term generalization. That doesn’t surprise me. After all, the term is heavily overloaded in AI research.
- Does reinforcement learning really incentivize reasoning?’s definition of generalization: coverage of the space of possible reasoning paths. By this definition, RFT prunes alternatives, making models less creative but more reliably correct.
- SFT memorizes, RL generalizes’ definition of generalization: robustness to out-of-distribution (OOD) inputs. By this definition, RFT helps models stay accurate even when prompts shift and inputs look less like the data the model was trained on.
This distinction reflects the nature of reinforcement learning (RL) itself. Unlike RL in constrained environments such as Atari and Go, RFT begins from a pretrained base model with useful priors. Deviating too far from those priors can yield nonsensical outputs and heavy penalties, so RFT naturally favors a smaller set of safe, high-reward reasoning paths.
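To make the “heavy penalties” point concrete: many RFT recipes subtract a KL term from the reward, explicitly discouraging the policy from drifting away from the frozen base model. Below is a minimal sketch of that penalized reward; the function name, tensor shapes, and beta value are illustrative, not taken from either paper.

```python
import torch

def kl_penalized_reward(task_reward: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_ref: torch.Tensor,
                        beta: float = 0.05) -> torch.Tensor:
    """Task reward minus a penalty for drifting away from the pretrained prior.

    task_reward:     (batch,)      verifiable reward, e.g. 1.0 if the final answer is correct
    logprobs_policy: (batch, seq)  log-probs of the sampled tokens under the fine-tuned policy
    logprobs_ref:    (batch, seq)  log-probs of the same tokens under the frozen base model
    beta:            strength of the KL penalty (illustrative value)
    """
    # Monte Carlo estimate of the sequence-level KL divergence from the base model
    kl_estimate = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return task_reward - beta * kl_estimate
```

The larger beta is, the more tightly the policy is anchored to its pretrained priors, which is one mechanism behind RFT converging on a narrower set of safe, high-reward reasoning paths.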
Yet contrary to popular belief, RFT doesn’t magically expand reasoning capacity. What it actually does is steer LLMs toward reliably correct paths. On the surface, that seems great, but it often comes at the expense of creative exploration.
As scientists, we must weigh this critical trade-off before building a model. In a given scenario, do we prioritize robustness even if it limits diversity? Or do we favor diversity at the cost of some reliability?
Broader implications
At SmarterDx, our goal is not to overfit to our current clients but to generalize across all clients. For this reason, RFT is the right choice for us right now: it guides the model toward consistent reasoning that works across varied inputs. But we also place a premium on clarity. The reasoning trace should be short and easy for clinical documentation integrity (CDI) specialists to follow. By rewarding both correctness and concise explanations, we aim for the best of both worlds: robust generalization without sacrificing interpretability.
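As a rough illustration of how a reward might combine correctness with concise reasoning, here is a toy sketch; it is not our production reward, and every name, weight, and threshold below is made up for illustration.

```python
def composite_reward(is_correct: bool, reasoning_trace: str,
                     target_words: int = 120, brevity_weight: float = 0.2) -> float:
    """Toy reward mixing answer correctness with conciseness of the reasoning trace.

    Correctness dominates; a small bonus shrinks linearly to zero as the trace
    grows toward target_words. All values here are hypothetical.
    """
    correctness = 1.0 if is_correct else 0.0
    num_words = len(reasoning_trace.split())
    brevity = max(0.0, 1.0 - num_words / target_words)  # 1.0 for very short traces
    return correctness + brevity_weight * brevity
```

Keeping the brevity term small relative to correctness reflects the design choice described above: never trade away accuracy, but nudge the model toward explanations a CDI specialist can read quickly.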