AbleCredit Reasoner R0 Qwen 2.5 3B Instruct

Introduction

This model is trained with DeepSeek R1 style reinforcement learning (GRPO) on Qwen 2.5 3B Instruct as the base model. It is primarily intended for research into applying small LLMs trained with GRPO/RL to domains such as finance and credit underwriting.

Model Description

  • Fine-tuned by: AbleCredit (LightBees Technologies Private Limited, Bengaluru, India)
  • License: We've retained the original Qwen research license. Note that the license does not allow commercial use.
  • Finetuned from model: Qwen/Qwen2.5-3B-Instruct

How to Get Started with the Model

Use with a standard Hugging Face setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AbleCredit/AbleCredit-R0-Qwen-2.5-3B-Instruct"  # or local path to model

# System prompt used for R1-style reasoning: the model thinks inside
# <think> tags before giving its final answer.
system_prompt = {
    "role": "system",
    "content": (
        "You are a helpful assistant. User asks a question the assistant answers it.\n"
        "The assistant first thinks about reasoning process in mind and then provides the user with the answer."
    ),
}

# Partial assistant turn that opens the <think> block; generation continues from here.
suffix_prompt = {
    "role": "assistant",
    "content": "Let me solve this step by step.\n<think>",
}

prompt_msgs = [
    system_prompt,
    {"role": "user", "content": "What is 15 times 3?"},
    suffix_prompt,
]

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Render the chat template as text, continuing the final (assistant)
# message instead of appending a fresh generation prompt.
prompt = tokenizer.apply_chat_template(
    prompt_msgs,
    tokenize=False,
    continue_final_message=True,
    add_generation_prompt=False,
)

# Tokenize the prompt and move it to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("\nGenerating response...\n")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.5,
    min_p=0.01,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nResponse:\n", response)

Training Details

Training Data

Trained on open-source logical-reasoning datasets and a proprietary finance dataset created by AbleCredit.com.

Training Procedure

Trained with DeepSeek-style reinforcement learning using GRPO and rule-based rewards.
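The exact reward functions are proprietary and not published here. As a rough illustration, rule-based rewards in the DeepSeek R1 style typically combine a format check (did the completion produce a well-formed <think> block?) with an exact-match answer check. A minimal sketch with hypothetical helper names, not the actual training code:

import re

def format_reward(completion: str) -> float:
    # The completion starts after the seeded "<think>", so a well-formed
    # output looks like "reasoning</think> final answer".
    return 0.5 if re.match(r".*</think>\s*\S+", completion, re.DOTALL) else 0.0

def answer_reward(completion: str, gold_answer: str) -> float:
    # Exact match on the text after the closing </think> tag.
    final = completion.split("</think>")[-1].strip()
    return 1.0 if final == gold_answer else 0.0

def rule_based_reward(completion: str, gold_answer: str) -> float:
    # GRPO uses this scalar to rank a group of sampled completions.
    return format_reward(completion) + answer_reward(completion, gold_answer)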

Evaluation

  • The model achieves a ~67% score on the GSM8K benchmark in a zero-shot setting (see the benchmarking script for details); a sketch of the usual scoring convention follows below.
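The benchmarking script itself is not reproduced here. As a sketch, GSM8K scoring conventionally extracts the number after the #### marker in the reference answer and compares it against the last number in the model's output; the helpers below are hypothetical, not the actual script:

import re

def extract_gold(reference: str) -> str:
    # GSM8K reference answers end with "#### <number>".
    return reference.split("####")[-1].strip().replace(",", "")

def extract_prediction(model_output: str) -> str | None:
    # Take the last number that appears in the model's answer text.
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", model_output)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output: str, reference: str) -> bool:
    return extract_prediction(model_output) == extract_gold(reference)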

Model Card Contact

Contact Harshad Saykhedkar via LinkedIn.
