Making any LLM model "reasoning"

Community Article Published March 23, 2025

Reasoning models are incredibly powerful. But, do you know that you can "fake" the reasoning task with any LLM model? Even with the smaller ones, without framework or extra libraries. I'll show you how to use any "instruct" LLM model not intended for reasoning, by forcing it to have this answer improvement phase. All of this "from scratch".

I created a little "demo space" you can try here Force reasoning any model demo. It uses what I explain in this article, with a few adaptations.

Important note:

First of all, I want to point out that I looked to see if anyone else had written an article on this method. I found a few references, including the "autoreason" framework, whose research paper can be found here: https://arxiv.org/pdf/2412.06975

Also, a very interesting paper on the "chain of thought" can be found here: https://arxiv.org/pdf/2210.03493

However, few people offer a demonstration and Python code that simply implements the concept from scratch.

So, the method I'm going to explain to you exists in different forms. It's also implicitly what reasoning models use. My goal is to make it simple and quick to use, without dependencies, and, above all, to be educational.

The general idea is relatively simple.

  • Using a framework won't be necessary; we'll only use the model's ability to do its basic job: completing the text.
  • Instead, we'll simply tweak its response a bit so that it doesn't go directly to the answer it wants to produce.

The result is sometimes impressive, despite the simplicity of the implementation I'm going to propose here.

And to top it all off, it works very well with small LLM models, such as Qwen2 1.5B.

First, what does reasoning models?

The arrival of DeepSeek R1 demonstrated how a model that knows how to reason can provide highly relevant answers. These models are trained to respond with reasoning as a preamble. In this case, they do exactly what other models do - produce text. But instead of responding directly, they go through a “thinking” phase. This is usually found between two "<think>...</think>" tags.

With RAG workflows, the context is generated from database searches. The principle remains the same, since we generally reformulate the question to the LLM by saying “with this context: XXX, answer question: YYY”.

The only difference, for reasoning models, is that the model itself will attempt to build a context.

The genius of this models is that they produces a fuller context themself for the question posed. This causes dissonance, then rebalancing, reformulation, questioning, and therefore a response that seems less subject to bias. But also more precise.

Now, how to "force" a non reasoning model to think?

If we think about it a little, we might wonder whether, in the end, this phase of reflection can't be forced on our side. And so, do it with any other LLM model, whether light or heavy. If the context window is wide enough, we can manage to push the model to produce a reflection, even if it's not trained to do so.

Because in definition, the reflection phase is nothing more or less than “noisy” text generation, which serves to contextualize the response. Models using reasoning are simply trained to start their response with this phase. But, as you may have guessed, the model can be induced to do so.

The legacy manner to use a "pipeline" in hugginface is:

# pip install transformers accelerate
import transformers

pipe = transformers.pipeline(
    "text-generation",
    model="Qwen/Qwen2-1.5B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
response = pipe(
  [
    {
      "role": "user",
      "content": "Explain how a LLM works."
    }
  ],
  max_new_token: 512,
  )
print(response[0]["generated_text"][-1]["content"])

The response contains a "generated_text" where the answer of the "assistant" is provided.

The truncated answer is like:

A Language Model (LLM) is a type of artificial intelligence that uses natural
language processing techniques to...

But what many of us forget is that we can propose the beginning of the answer, which the model will “complete”. It's hardly different, but it will change everything!

Let's try:

import transformers

pipe = transformers.pipeline(
    "text-generation",
    model="Qwen/Qwen2-1.5B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
response = pipe(
    [
        {
            "role": "user",
            "content": "Explain how a LLM works.",
        },
        {
            "role": "assistant",
            "content": "Let's reformulate the question: ",
        },
    ],
    max_new_tokens=512,
)
print(response[0]["generated_text"][-1]["content"])

The (truncated) answer will be:

Let's reformulate the question:  What is a machine learning model? 

A machine learning model, also known as a machine learning algorithm or simply a model, is an
artificial intelligence system that learns from data and makes predictions or decisions based
on patterns it has identified in the data...

As you can see, the answer begins by what we proposed, "Let's reformulate the question: ". The answer is, actually, approximativly the same, but it probably helps a bit.

What happend is simple.

Rather than waiting for the assistant to respond to our "question", we forced its hand by giving it the beginning of the sentence it should generate. This is the job of any LLM in reality, to deduce the rest of the text in a coherent way. And if it sees “Let's reformulate the question:” then it will follow the logic and reformulate the user's question.

Now, let's try to simulate what DeepSeek R1 does!

Make the model thinking "a lot"

As DeepSeek R1 does it, we will force the model to make several reformulation and rethink. We're going to encourage him to question his answer, to reformulate it again, and to do so a considerable number of times.

The simple method is:

  • create a list of sentence starters that will prompt the model to revise his copy
  • generate a response based on each item in the list, and concatenate it with the previous response
  • and optionnaly, give only the “final” answer that will follow the reflection phase

Let me clarify: we're not going to create multiple answers, but rather to complete the current answer as we go. This is important because the model must complete the answer, not produce multiple outputs. All the reasoning must fit into a single phase.

To summarize, the idea is that we'll ask a question, as usual, but instead of waiting for an answer, we'll force it to complete sentences, which will prompt the model to produce more text first.

The phase will look like this:

  • We ask the question as a user: "Tell me what an LLM is"
  • We force the beginning of the answer to: "OK, I need to figure out," and the model will reformulate what it needs to do.
  • And we continue by injecting another sentence starter afterwards, for example, "I think," which will potentially prompt it to say what it thinks the answer will be in the first place.
  • And we continue, for example, with "Wait, maybe," which will prompt it to produce text that will question the answer.
  • Etc.

After a while, we'll end up telling it to provide an answer with the generated context (which is similar to what we do in the RAG workflow).

This is the Python code:

import transformers

# prepends sentences
rethink_prepends = [
    "OK, I need to figure out ",
    "I think ",
    "Wait, I think ",
    "Let me check if ",
    "I should also remember that ",
    "Another thing to note is that ",
    "I also recall that ",
    "I think I have a good grasp ",
    "\n**ANSWER**\n",
]

pipe = transformers.pipeline(
    "text-generation",
    model="Qwen/Qwen2-1.5B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)


def ask(message, max_new_token=255):
    messages = [{"role": "user", "content": message}]
    messages.append(
        {
            "role": "assistant",
            "content": "",
        }
    )
    # we use all the starter sequences and append them to sequential answers
    for prepend in rethink_prepends:
        content = messages[-1]["content"]
        content += "\n\n" + prepend.format(question=message)  # pyright: ignore
        messages[-1]["content"] = content
        response = pipe(messages, max_new_tokens=max_new_tokens)
        content = response[0]["generated_text"][-1]["content"]  # pyright: ignore
        messages[-1]["content"] = content  # force the latest message
    
    return response


question = "Explain how a LLM works."
response = ask(question)  # pyright: ignore
print(reponse[0]["generated_text"][-1]['content'])

Here, an example of what answered the model:

OK, I need to figure out 1) what you mean by "a LLM" and 2) why you want me to explain it. Can
you please clarify?

I think 1) is that the acronym stands for Large Language Model. The term has been used in the
field of artificial intelligence (AI), particularly with regards to machine learning models.

In the context of natural language processing (NLP), a large language model refers to an AI model
that can understand, generate, and reason about human language. It's capable of generating text
that is grammatically correct, coherent, and meaningful, as well as understanding the meaning
behind words and phrases in a sentence or paragraph. This type of model is often referred to
as a generative pre-trained language model, such as BERT or GPT.

The goal of using a large language model is to create a tool that can help humans better
communicate and interact with each other through natural language. For example, a chatbot
powered by a large language model could be trained to understand and respond to user queries
in a conversational manner. Additionally, these models can be used for tasks like summarizing
long documents, translating text from one language to another, and even writing original pieces
of text based on prompts given to them.

Wait, I think 2) was: I want you to explain it in simple terms so someone who doesn't know much
about AI can understand it. Is that okay?

Let me check if 2) is understood correctly now:

Yes, I want you to explain it in simple terms so someone who doesn't know much about AI can
understand it.  Understood! Here's a simple explanation of what a large language model is:
A large language model is a type of artificial intelligence (AI) model that is designed to
understand and generate human language. These models use algorithms to analyze vast amounts
of data and learn patterns and relationships between words and sentences. They can then use
this knowledge to generate new text that is grammatically correct, coherent, and meaningful.
For example, imagine a chatbot that uses a large language model to answer questions and provide
information to users. When a user types a question into the chatbot, the model analyzes the text
and generates a response that is both accurate and relevant to the user's query. This process
allows the chatbot to understand and generate responses in a way that is similar to how humans
would do it.  Overall, large language models have the potential to revolutionize many industries,
including education, healthcare, customer service, and more. They can help people communicate
more effectively, access information quickly, and improve decision-making processes. However,
they also raise important ethical and privacy concerns, which must be addressed carefully when
developing and using these technologies

I should also remember that 3) there may be some technical jargon involved in explaining something
simple. So I'll try my best to make sure everything is explained clearly. Let me know if you
need any clarification on anything.

Another thing to note is that 4) a large language model is not just a single piece of software
but rather a set of algorithms and training data that together form a system. It's important
to keep in mind that while we're talking about a specific model, it's still part of a larger
system that includes the underlying algorithms and data that power it.

I also recall that 5) a large language model can be used to create chatbots, virtual assistants,
and other forms of intelligent agents that can understand and respond to human language in a
natural way. These systems can be used in various fields such as customer service, education,
healthcare, and more.  And finally, I'd like to add that 6) while large language models have
the potential to greatly benefit society, they also come with certain risks and challenges that
must be considered. For example, the misuse of these technologies, such as in social engineering
attacks or data breaches, can lead to serious consequences. Therefore, it's crucial to develop
and use these technologies responsibly and ethically.

I hope this helps! Let me know if you have any further questions or if there's anything else
I can assist you with.

I think I have a good grasp 1) of what a LLM is, 2) why you wanted me to explain it, and 3)
what it does.

**ANSWER**

A large language model is a complex system composed of several components, including algorithms,
data, and infrastructure. Algorithms are the core component of the model and determine how the
model learns and makes predictions. Data is the input to the algorithm and provides the necessary
information for the model to learn. Infrastructure, on the other hand, is the physical hardware
and software resources needed to run the model.

The algorithms within a large language model are designed to analyze vast amounts of text data,
identify patterns and relationships between words and sentences, and generate new text based
on those patterns. By analyzing large volumes of text data, the model is able to build up a
rich corpus of knowledge that enables it to understand and generate text in a wide range of
contexts and domains.

The data that powers a large language model consists of a vast amount of text data from various
sources, such as books, articles, web pages, and online forums. This data is typically collected
and annotated by humans and is then fed into the model during training.

The infrastructure required to run a large language model is critical to its performance and
scalability. This includes servers, storage systems, and network connectivity. In addition,
the model needs to be trained and deployed in a secure environment to protect sensitive data
and prevent unauthorized access.

Overall, a large language model is a powerful tool that leverages massive amounts of text data and
sophisticated algorithms to enable humans to communicate and interact more effectively. However,
it's important to recognize that the technology is still evolving and faces ongoing challenges
related to bias, transparency, and accountability. As such, it's essential to approach the
development and deployment of large language models with caution and responsibility.

The final answer, after “**ANSWER**”, took into account all the reasoning we'd forced. It's a fairly simple and, apparently, effective method for producing more complete answers, with a little more security in the veracity of the statement.

Of course, not all biases are magically erased, and the model may still produce hallucinations. But asking it to “rethink” its answer can make it doubt a little... and correct itself.

Of course, if the begining of the reasoning is biased, and if the model trust itself, so the entire answer will be bad. For example, I ask the model to propose a tanh(x)tanh(x) function adding aa parameter to change the slope and bb parameter to make it move left and right. Without any reason, it started by describe the tanh(x)tanh(x) function with exponential function (which is not wrong, but not what I asked) and it continues by trying to demonstrate general properties.

Of course, the final answer would be:

tanh(a(xb))tanh(a⋅(x-b))

But, by dint of drifting in his reasoning, he ended up giving me a function based on exponentials, undrinkable, and above all absolutely wrong.

Spliting reasoning and final answer

It's important to think about it. You need to have something in your prefixes to detect the final answer and the reasoning.

DeepSeek R1 uses <think></think> markup. I tried to force the model using the same, but it often tries to add <answer> tags, and it forget to close it too...

I prefered a simpler (and unsecured) marker.

I force the model to prepend the answer with "**ANSWER**", I can finally take the anwser only and not the thinking process.

final_anwser = reponse[0]["generated_text"][-1]['content'].split('**ANSWER**\n')[1]
print(final_answer)

With "Gradio", in my demo space, it's a bit different. I split the reasoning phase and the final answer in two ChatMessage objects using metadata to indicate what is the reasoning and what is the answer. It helps to filter the history too, as I don't want to inject the reasoning in the history that I send to the model. I could, but I will quickly override the context window.

One more time, you can use the demo and read the soure code. It's free and open source.

I said "unsecured". That's because we can use the user prompt inside the reasoning to ensure that the model didn't forget what it should answer. So the user can "break" the phase by injecting **ANSWER** itself in its prompt.

So you need to find a way to avoid the user to send it in prompt.

My demo simply removes this term from the user prompt.

But does it really improve the answer?

It is difficult, at my level, to compare my results with what is produced by models specifically trained to reason. What a model like DeepSeek R1, Gemini, or GPT produces is impressive. However, I've often been impressed by the thought chain injection method I present in this article.

The advantage over a model trained for reasoning is that I can finely control what I expect from it. Therefore, depending on the domain, I can induce the model to specialize in a particular domain by emphasizing certain details during forced reasoning. I can also reduce or increase the size of the reasoning context, and significantly improve the results.

Another interesting point is that I was able to train small custom langage models for a particular domain without using a reasoning process, and then use this method to improve the results. This reduced the training time, but also the workload.

So, it is not a question of competing with the capacity of reasoning models, but of having an additional tool.

Also, what I remark is that the quality clearly depends on three things:

  • the model used, and its context window size
  • the quality of the reasoning forcing sentences
  • the number of token to produce (255 is a good value for the thinking phases. (In my demo, I propose to use 2 values, one for the reasoning phase and another one for the final answer)

The examples I've given you in the code above are pretty much viable for the tests I've carried out. They're certainly not perfect, and they sometimes generate odd reactions from the model. For example, “Let's reformulate the question” sometimes results in “in 10 words”. In the end, this has the opposite effect, degrading the answer. It's rare, but it happens.

In the demo, I used another list of reflection prefixes, and I force the question to be reinjected in the final step. I helps to not derive and to answer in others languages. See the code.

rethink_prepends = [
    "OK, I need to figure out ",
    "I think ",
    "Wait, I think ",
    "Let me check if ",
    "I should also remember that ",
    "Another thing to note is that ",
    "I also recall that ",
    "I think I have a good grasp ",
    "Now, using all the above information, I can answer the question using the original language used for the question:"
      "\n{question}\n"
      "\n**ANSWER**\n",
]

Be careful, the last three lines are misleading; they're actually a single string. Note the absence of a comma. Python can be a bit tricky sometimes.

What is certain, however, is that it gives fairly convincing results for most of the tests I've carried out. Notably on mathematical reasoning, algorithmic and code production, and on complex reasoning about technical choices.

Conclusion

First, thanks to HuggingFace to offer free (Zero) GPU to the community. I can't describe how wonderful it is to be able to offer GPU-enabled demos or applications to users without breaking the bank. Such a philosophy coming from a large company deserves to be highlighted.

And I thank you, the readers and testers, because I certainly wouldn't spend so much time writing code and articles on various sites if you didn't volunteer your time.

One more time, I'm certainly not the inventor of this method. I don't find many reference of project using this (LLamaIndex uses something similar, but there isn't a lot of references about the reasoning forcing, it's weird).

I really hope this article has given you some ideas and perhaps some insights. If you have any questions, I'd be happy to answer them whenever I have time.

Community

Your need to confirm your account before you can post a new comment.

Sign up or log in to comment