In today’s column, I examine the recently revealed feature augmenting OpenAI’s advanced o1 AI model that was briefly showcased during the second day of the “12 Days Of OpenAI” video-streamed announcement. The feature is referred to as reinforcement fine-tuning (RFT).
Much of the media has been clamoring that this is “new” as though nobody has ever thought of RFT before.
Sad and silly.
There has indeed been AI research on reinforcement fine-tuning, sometimes labeled as RFT or ReFT. In any case, yes, this is ostensibly new in the sense that it is an additional capability for OpenAI o1 and thus a new feature for the product. That is surely exciting. Please note that OpenAI may have opted to establish RFT differently than others have. Right now, their version of RFT is only available on a limited preview basis, and they often keep the nitty-gritty technical details under wraps since they consider their AI models proprietary.
So, one must do a modicum of armchair AI-soothsaying detective work to know what it’s all about.
Let’s talk about it.
This analysis of an innovative proposition is part of my ongoing Forbes column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here). For my analysis of the key features and vital advancements in the OpenAI o1 AI model, see the link here and the link here, covering various aspects such as chain-of-thought reasoning, reinforcement learning, and the like.
The Overarching Aim Of Reinforcement Fine-Tuning
Here’s how RFT is typically conceived.
First, suppose you want to take a generic generative AI or large language model (LLM) and turn it into a domain-specific wizard of sorts.
This is a big trend these days. Most AI is rather generic and a jack-of-all-trades. Some refer to this as AI being a mile wide and an inch deep. The aim is to apply generative AI to particular domains such as legal, finance, medical, and the like. Doing so requires going from a mile wide and an inch deep to becoming at least many feet deep in a narrow niche of interest.
In case you are interested in how domain-specific instances are derived, I’ve discussed extensively the adaptation of generative AI for performing legal advisement, see the link here, while another domain that I’ve explored in-depth is the use of generative AI for mental health guidance, see the link here. The usual method or technique employed consists of in-context modeling, or retrieval-augmented generation (RAG), which you can read about in my explanation at the link here.
There is a kind of pursuit of the Holy Grail when it comes to finding the best way to push a generic generative AI into achieving domain-specific proficiency.
RFT Is One Such Method For Domain-Specificity
Voila, that takes us to the grand promise and hope of using reinforcement fine-tuning or RFT.
The deal is this.
RFT is a method or technique that leans into fine-tuning a generic generative AI model to become domain-specific in some respects. You can accomplish this by putting together data that pertains to the domain of interest, feeding it into the generative AI, and using the RFT approach to guide the AI toward “learning” about the domain.
The AI model is incrementally fine-tuned by providing a semblance of reinforcement to the AI. When AI gets things right, it is instructed that it’s doing well and should adjust toward producing future answers similarly (essentially, being given a reward for being correct). When the AI during this data training gets something wrong, it is instructed that the response was incorrect, and therefore the AI ought to steer away from that approach in the future (a penalty for being incorrect).
That’s how reinforcement works.
Note that I earlier put the word “learning” into quotes. I did so because we are excessively anthropomorphizing AI by using terminology that applies to humans and then outstretching those words to suggest the same applies to AI. The type of “learning” that the AI is doing should not be considered on par with human learning, see my discussion at the link here. It is a form of mathematical and computational reformulation and adjustment.
The Balance Of Generic Versus Specific
Keep in mind that you usually retain the generic aspects that are within the AI model and aren’t necessarily reducing those when trying to bring the AI up to speed on a particular domain. That being said, if you don’t especially need the full breadth of generic generative AI, you might strip the AI down to its bare bones and then apply RFT, or possibly do the RFT first and then strip down the resultant AI. It all depends on what your goals are.
Why strip out some of the generic stuff?
Most generative AI is large in size and won’t run natively on smartphones, ergo requiring you to access the AI online. This means you need a reliable online connection. It is also costly due to your accessing expensive servers in the cloud. All in all, a movement toward small language models (SLM) is being avidly pursued so that a reduced-sized and likely reduced functionality version of generative AI can run on a standalone basis on everyday devices, see my analysis at the link here.
The same is often the case when producing domain-specific AI models. You are likely to want it to run on smartphones and not have to depend on the cloud. Thus, you can potentially hack out all sorts of generic aspects that don’t seem relevant to the domain at hand (does the AI need to know, for example, about Abraham Lincoln to dispense medical advice on, say, a particular disease?).
The downside is that the AI won’t be able to respond well to across-the-board prompts and could be seen as weaker than the larger-sized AI.
The Fundamental Steps For Performing RFT
My way of depicting reinforcement fine-tuning is to say that RFT consists of five major steps:
- (1) Dataset Preparation: Put together a suitable custom dataset for the chosen domain and format the prepared data into a common structured format (e.g., JSONL).
- (2) Grader Formation: Devise a computer-based grader capability and/or leverage existing automated grading systems, which will be used to evaluate the model outputs. The evaluations usually include scoring the AI responses for correctness (topmost priority) and possibly also scoring for quality and reasoning.
- (3) Reinforcement Fine-Tuning: The AI model receives iterative feedback through computational rewards for accurate reasoning (incentives) and penalties for errors (disincentives), gradually improving performance. During RFT, feed in a selected portion of the prepared dataset and retain the other portions for later use during validation.
- (4) Validation Process: Make use of the held-back or unseen dataset portions to validate and assess the AI model’s ability to generalize effectively. This step is tremendously crucial for ascertaining whether the RFT has made a significant positive difference in the AI model’s domain specificity. Iterate as needed.
- (5) Optimization and Roll-out: Finalize the RFT to ensure that the AI model is suitably efficient and effective, determine if the footprint is sized well (usually, smallness is preferred), and whether the AI is sufficiently specialized for the chosen targeted domain. Deploy the completed AI model. Keep tabs on ongoing usage and feedback. Make updates to the AI model including performing maintenance as required.
Those five steps capture the essence of what needs to be undertaken for RFT. Variations exist that have six steps, seven steps, and even ten steps. My indicated five steps pretty much cover the gamut and do so in a tidy way.
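To make steps #1 and #4 a bit more tangible, here is a minimal sketch of preparing a small dataset in JSONL and holding back a portion for validation. The field names (`prompt`, `reference_answer`), the 20% holdout, and the sample questions are my own illustrative assumptions, not a format that OpenAI has published for its RFT preview.

```python
import json
import random


def to_jsonl(records):
    """Serialize a list of dicts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def split_dataset(records, validation_fraction=0.2, seed=0):
    """Shuffle the records and split them into (training, validation) portions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical legal-domain examples; the field names are illustrative only.
records = [
    {
        "prompt": f"Is clause {i} enforceable?",
        "reference_answer": "yes" if i % 2 == 0 else "no",
    }
    for i in range(10)
]

train, validation = split_dataset(records)
jsonl_text = to_jsonl(train)  # this text could be written to a .jsonl file
```

The training portion is what gets fed in during step #3, while the validation portion stays unseen until step #4.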
Importance Of The Grading
One aspect that might have caught your eye is step #2, grader formation.
Allow me to elaborate on this.
I had already noted that the reinforcement process consists of telling the AI when it is right and when it is wrong, doing so during the RFT overall endeavor. Parlance amongst AI insiders is that the AI is being graded, almost like getting a letter grade in school.
An “A” grade in school means things went well. The dreaded “F” grade means the answers were incorrect. Instead of assigning letter grades during RFT, a numeric value is usually used. The common practice is to assign a score of zero for a wrong response, and a score of 1 for a response that is correct. Since not all answers will be completely right or completely wrong, a value between 0 and 1 is used to suggest how right or wrong the response was.
For example, go ahead and envision that I am data training a generic generative AI by using RFT. It is being tuned to the legal domain. I’ve fed in a bunch of legal content consisting of various laws, regulations, and so on. During the RFT process, I feed in a prompt asking the AI to decide whether a given legal clause is legally sound. The AI churns through the computational assessment and comes back with an answer that the clause is good to go.
If that was a correct answer, the grade given would be a 1, while if incorrect the grade would be a 0. But the world isn’t always quite so binary. Suppose the AI indicated that the clause is legally correct in certain circumstances but has loopholes in other circumstances. Perhaps that is a relatively fair answer, though in some ways correct and some ways incorrect. The grade given might be 0.60, suggesting that the response was mostly right (because it is assigned a score above 0.50 and inching toward a full 1.0), though it also was partially incorrect (thus it isn’t a full 1.0 and only given a score of 0.60).
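Here is a minimal sketch of how a partial-credit grade like that might be computed. It assumes a deliberately simplistic grader that checks how many expected key points show up in the response. The function name, the key points, and the substring-matching approach are all hypothetical; a real grader would need to be far more discerning.

```python
def grade(response, reference_points):
    """Return the fraction of reference points mentioned in the response.

    The score runs from 0.0 (nothing matched) to 1.0 (everything matched).
    Matching is naive case-insensitive substring search, so this is a toy:
    for instance, "enforceable" would also match inside "unenforceable".
    """
    if not reference_points:
        raise ValueError("reference_points must not be empty")
    text = response.lower()
    hits = sum(1 for point in reference_points if point.lower() in text)
    return hits / len(reference_points)


# Hypothetical key points that a complete answer is expected to cover.
reference_points = [
    "enforceable",
    "loophole",
    "jurisdiction",
    "notice period",
    "severability",
]

response = (
    "The clause is enforceable in most cases, "
    "but there is a loophole tied to the notice period."
)

score = grade(response, reference_points)  # 3 of 5 points covered -> 0.6
```

A response that covers every key point earns a 1, one that covers none earns a 0, and everything else lands in between.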
How is the grading determined?
You could employ a human during the RFT who doles out grades. This is laborious, tends to be slow, and can be expensive. Generally, the grading component is some form of automation. It could be a specialized program that was developed for a particular domain. It could be a generic grading system that can be used across various domains. You can even use another generative AI as a grader, such as having a second generative AI standing there that does the grading during the RFT.
The bottom line is that the grader is vital and if you don’t get that set up properly, the rest of the RFT is going to be kaput.
Grand Twist Is The Introduction Of Chain-Of-Thought
I’ve got an important twist for you.
An ongoing assumption that is subject to heated debate is that the use of RFT will notably shine when the generative AI contains advanced AI features such as chain-of-thought reasoning (CoT), see my discussion about CoT at the link here.
Chain-of-thought refers to the conception that when the AI is trying to solve a problem or come up with an answer, the AI is instructed to perform a series of logical steps when doing so. If trying to diagnose a patient, the AI might first assess basic patient data such as age, weight, health, etc. The second step might be to examine medical tests like a blood test. The third step might be to then review what kinds of ailments seem to fit that patient. The fourth step might be to reach a medical diagnosis and explain how that diagnosis was determined.
Let’s bring RFT back into the picture.
A generative AI that leverages a chain of thought could be exercised and fine-tuned with reinforcement processes in the following way. We let the AI proceed trying to diagnose a patient based on data that we’ve collected for data training purposes. A particular chain-of-thought is derived. Great, that’s what we want to have happen.
Lots And Lots Of CoTs Make For Choosiness
As the old saw goes, there is more than one way to skin a cat (sorry, that’s a bit dour). We could have the AI take another shot at the diagnosis. The second time around the chain-of-thought might differ. We do this a third time and keep getting the AI to try out a wide variety of CoTs. For each of the attempts, we assign a grade to the derived answer, using whatever grader or grading system we’ve established.
What does this accomplish?
Aha, the hope is that by telling the AI which answers were right, and which were wrong, this also sheds light on which of the chains of thought were right and wrong. The AI will presumably begin, mathematically, to lean toward CoTs that are being rewarded and shift away from CoTs that are being penalized or disincentivized.
The act of this reinforcement fine-tuning is indirectly guiding the generative AI toward hopefully stronger and better chain-of-thought approaches and steering it away from CoTs that aren’t as good.
If this is done well, we are not merely arriving at the right answers, we are also in a sense shaping the nature of the chains of thought that the AI is going to use. A cheeky way to express this is the famous adage that if you give a person a fish, you feed them for a day, but if you teach them how to fish, they will be fed for a lifetime.
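To illustrate the leaning-toward and shifting-away dynamic, here is a toy sketch that uses a gradient-bandit update, which is a much-simplified stand-in for reinforcement learning and is not OpenAI’s undisclosed method. It assumes three hypothetical CoT styles whose answers always earn fixed grades of 0.2, 0.6, and 1.0. The function names, learning rate, and step count are my own illustrative choices.

```python
import math
import random


def softmax(prefs):
    """Turn preference scores into probabilities that sum to 1."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_cot_preferences(grades, steps=3000, lr=0.1, seed=0):
    """Sample a CoT style, grade its answer, and nudge the preferences.

    grades[i] is the (assumed fixed) grade earned by CoT style i.
    Styles graded above the running average are made more likely;
    styles graded below it are made less likely.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(grades)
    baseline = 0.0  # running average of the grades seen so far
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        choice = rng.choices(range(len(grades)), weights=probs)[0]
        reward = grades[choice]
        baseline += (reward - baseline) / t
        for i in range(len(prefs)):
            indicator = 1.0 if i == choice else 0.0
            prefs[i] += lr * (reward - baseline) * (indicator - probs[i])
    return softmax(prefs)


# Hypothetical grades for three CoT styles: weak, middling, and strong.
final_probs = reinforce_cot_preferences([0.2, 0.6, 1.0])
```

Notice that only the answers are graded in this sketch, yet the probability of choosing each CoT style shifts as a result, which is the indirect shaping described above.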
Boom, drop the mic.
OpenAI Has Opened The Door To RFT
Previously, OpenAI had embraced the use of supervised fine-tuning (SFT), which I describe at the link here. SFT as adopted by OpenAI was mainly about tuning the AI tone and style of responses. That was handy. RFT is aimed at digging into specific domains and getting the AI up-to-speed on answering domain-specific prompts. It is a different angle on fine-tuning.
Both techniques have their particular aims.
OpenAI’s RFT is considered available only on a limited preview basis right now and will be more widely accessible sometime next year. Meanwhile, OpenAI has also indicated that they are earnestly seeking to identify ripe domains to use RFT on. AI researchers and domain experts who want to have ready access to the preview capability can submit their keen interest to OpenAI (see the OpenAI official blog for details).
Here’s what OpenAI officially said about RFT in their formal announcement as noted in “OpenAI’s Reinforcement Fine-Tuning Research Program”, OpenAI blog, December 6, 2024 (excerpts):
- “This new model customization technique enables developers to customize our models using dozens to thousands of high-quality tasks and grade the model’s response with provided reference answers.”
- “This technique reinforces how the model reasons through similar problems and improves its accuracy on specific tasks in that domain.”
- “We’ve seen promising results in domains like Law, Insurance, Healthcare, Finance, and Engineering because Reinforcement Fine-Tuning excels at tasks where the outcome has an objectively “correct” answer that most experts would agree with.”
- “We’re expanding our Reinforcement Fine-Tuning Research Program to enable developers and machine learning engineers to create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.”
- “We encourage research institutes, universities, and enterprises to apply, particularly those that currently execute narrow sets of complex tasks led by experts and would benefit from AI assistance.”
If you are versed in a specific domain and believe that generative AI would be a boon, and if you are intrigued with RFT as a potential approach, you might want to consider putting your hat in the ring to make use of this latest OpenAI o1 model augmentation.
The Future Is Bright With More Approaches
A final comment for the moment.
There is a fascinating twist upon the twist that I earlier brought to your attention. It goes like this. The prevailing approach of RFT is usually that the grades are only assigned based on the AI responses. My point is that the chain of thought is not being directly graded. The CoT is only indirectly being graded.
An interesting next step consists of grading the actual CoT and even pieces or slices of the CoT.
Let me frame this in human terms, cautiously so. Imagine that a student gives me their completed test and they were instructed to write down the logic for their answers on the test, immediately adjacent to each question. One means of grading would be to simply look at the answer and assign a grade. As a grader, I utterly ignore the logic the student has displayed.
Another form of grading would be to look at how they came up with the answer and assign a grade based on both the answer and the logic used.
Mull over that approach to grading.
Maybe that’s a lot better means of grading since the student will have some semblance of where or how their logic went awry. If they only know that the answer is merely right or wrong, they aren’t getting much feedback about how they arrived at the answer. You could persuasively argue that doing grading at a more granular level could significantly enhance their capabilities.
There are tradeoffs. The grader must do a lot more work. The grader has to be a lot better at grading since they are no longer simply comparing one answer against an answer key. Also, suppose the grader messes up and gives foul guidance about the logic that the student used. Oops, that could frazzle a student, and they are worse off than they were beforehand. Etc.
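To see the difference between the two forms of grading, consider this minimal sketch that contrasts an answer-only grade with a grade that also scores each step of the logic. The step checks, the equal weighting between the answer and the steps, and the sample diagnosis are hypothetical assumptions made purely for illustration.

```python
def outcome_grade(final_answer, reference_answer):
    """Answer-only grading: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer == reference_answer else 0.0


def process_grade(steps, step_checks, final_answer, reference_answer,
                  answer_weight=0.5):
    """Blend the answer grade with the average grade of the individual steps.

    Returns (blended_score, per_step_scores) so that the per-step feedback
    remains visible rather than being folded into a single number.
    """
    if len(steps) != len(step_checks):
        raise ValueError("each step needs exactly one check")
    step_scores = [
        1.0 if check(step) else 0.0 for step, check in zip(steps, step_checks)
    ]
    step_average = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = outcome_grade(final_answer, reference_answer)
    blended = answer_weight * outcome + (1 - answer_weight) * step_average
    return blended, step_scores


# A hypothetical chain of thought for a diagnosis, one string per step.
steps = [
    "reviewed age and weight",
    "skipped blood test",
    "matched symptoms to ailments",
    "diagnosis: anemia",
]

# Hypothetical per-step checks; a real grader would be far more sophisticated.
step_checks = [
    lambda s: "age" in s,
    lambda s: "blood test" in s and "skipped" not in s,
    lambda s: "symptoms" in s,
    lambda s: s.startswith("diagnosis"),
]

answer_only = outcome_grade("anemia", "anemia")
blended, per_step = process_grade(steps, step_checks, "anemia", "anemia")
```

The answer-only grade is a perfect 1.0, whereas the blended grade comes out to 0.875 because the second step was flagged, and the per-step scores pinpoint where the logic went awry.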
If we do proceed to further enhance RFT in that manner, should we refer to that as some kind of super RFT, perhaps noted as SRFT or SURFT?
You never know what nomenclature catches hold.
Let’s end with a famous proverb: “Learning is a treasure that will follow its owner everywhere.” I suppose we can say that this motto applies to humans and perhaps even applies to the advancement and future of AI.
Keep on learning.