About

This is an experiment to add reasoning to Lucie-7B-Instruct with GRPO fine-tuning.

I used 500 examples from the Science subset of open-r1/Mixture-of-Thoughts.

Evaluation procedure

I used the same system prompt and the same generation parameters on 100 test examples from the Science subset of open-r1/Mixture-of-Thoughts. I used gemini-2.0-flash-lite to compare each model's answer to the ground truth.
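As a cheap cross-check on the judged scores, the boxed letter can also be compared to the ground truth directly. The sketch below is not the procedure used here (that relied on gemini-2.0-flash-lite); it assumes answers follow the `\boxed{letter}` format from the system prompt with letters A to D, and the helper names `extract_letter` and `exact_match_accuracy` are hypothetical.

```python
import re

# Assumption: the final answer is a single letter A-D inside \boxed{...},
# as requested by the system prompt.
BOXED = re.compile(r"\\boxed\{([A-D])\}")


def extract_letter(text):
    """Return the last boxed letter found in text, or None if there is none."""
    matches = BOXED.findall(text)
    return matches[-1] if matches else None


def exact_match_accuracy(outputs, references):
    """Fraction of outputs whose boxed letter equals the reference letter."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0
    correct = sum(extract_letter(o) == r for o, r in zip(outputs, references))
    return correct / len(outputs)
```

An output with no boxed letter counts as wrong, which is stricter than a judge model that may accept a correct answer written in another form.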

Usage

import transformers

messages = [
  {'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks and then provides the user with the answer. You begin you answer with the reasoning process and answer enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer>\\boxed{letter}</answer>. Your reasoning process should be detailed and should include all the steps you took to arrive at the answer. The answer should be based on the reasoning process and should be only the answer letter.',
   'role': 'system'},
  {'content': 'What happens to the equilibrium constant when the concentration of a reactant is increased in a reversible reaction?A: The equilibrium constant will fluctuate until a new equilibrium is reached.\nB: The equilibrium constant will increase.\nC: The equilibrium constant will decrease.\nD: The equilibrium constant will not change.',
   'role': 'user'}
]

model_name = "PhilSad/Lucie-7B-GRPO-Science-500"
# Load the fine-tuned model; device_map="auto" places it on the available device(s).
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)
tokenizer = transformers.AutoTokenizer.from_pretrained("OpenLLM-France/Lucie-7B-Instruct-v1.1")


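# The model is already placed on a device above, so no device argument is passed here.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.1,
    top_p=0.95,
    top_k=50,
)

# Passing the chat messages applies the tokenizer's chat template.
# The pipeline runs generation without gradient tracking, so no torch.no_grad() wrapper is needed.
out = pipeline(messages)

# The last message in the returned conversation is the assistant's reply.
print(out[0]["generated_text"][-1]["content"])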

# > <think> When the concentration of a reactant is increased in a reversible reaction, the system will shift towards the products to re-establish equilibrium. This shift will cause the equilibrium constant to decrease, as the reaction will favor the formation of more products. </think><answer>\boxed{D}</answer>
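To use the reply programmatically, the reasoning and the final answer can be separated with a regular expression. This is a minimal sketch that assumes the model follows the `<think> ... </think><answer> ... </answer>` layout requested in the system prompt; `split_response` is a hypothetical helper, not part of any library.

```python
import re

# Assumption: the reply contains one <think> block followed by one <answer> block.
RESPONSE = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
    re.DOTALL,  # the reasoning may span several lines
)


def split_response(text):
    """Return (reasoning, answer) from a model reply, or None if the tags are missing."""
    match = RESPONSE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```

Returning `None` for malformed replies lets the caller decide whether to retry or to count the sample as a format failure.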
