Vinnnf committed (verified)
Commit ea333b8 · 1 Parent(s): c1c8054

Update README.md

Files changed (1): README.md (+4, -0)
README.md CHANGED
@@ -47,6 +47,10 @@ library_name: transformers
 
 We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, \<short\> for concise responses and \<think\> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50\% - 90\%, significantly reducing the computational cost of Reasoning Language Models.
 
+## Pipeline
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/646a1939c37ca1e12308fe81/2FK8C2Hp9maxJF-_m1VzG.png)
+
 ## QuickStart
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
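The README excerpt above describes the DeGRPO objective only in prose. As a rough illustration of the idea, the sketch below splits a policy-gradient loss into a control-token term and a response term. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `degrpo_loss`, the `alpha`/`beta` weights, and the mask-based split between the control token and the response tokens are all hypothetical.

```python
import torch

def degrpo_loss(logps, advantages, control_mask, alpha=1.0, beta=1.0):
    """Illustrative decoupled GRPO-style objective (hypothetical sketch).

    logps:        (B, T) log-probs of the sampled tokens under the policy
    advantages:   (B,)   group-relative advantages (reward minus group mean)
    control_mask: (B, T) 1.0 at the <short>/<think> control-token position,
                  0.0 at ordinary response tokens
    alpha, beta:  assumed weights controlling each objective's contribution
    """
    # Standard policy-gradient term, broadcast to every token in the sample.
    per_token = -logps * advantages.unsqueeze(-1)

    # (1) Control-token loss: drives the selection of the reasoning mode.
    ctrl = (per_token * control_mask).sum() / control_mask.sum().clamp(min=1.0)

    # (2) Response loss: drives the accuracy of the generated answer.
    resp_mask = 1.0 - control_mask
    resp = (per_token * resp_mask).sum() / resp_mask.sum().clamp(min=1.0)

    # Separate weighting gives fine-grained control over each term's
    # contribution, the property the README credits for stable training.
    return alpha * ctrl + beta * resp
```

One plausible reading of why the decoupling matters: each sample contains a single control token but many response tokens, so averaging the two terms separately keeps the mode-selection signal from being drowned out, consistent with the README's claim that the formulation prevents the collapse seen in vanilla GRPO.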