Replication and Exploration of the DeepSeek-R1 *Nature* Paper Methodology 🚀
Developed by: Soren
gpt-oss-120b-Distill-Phi-4-14B is a GRPO reasoning test model based on the `microsoft/phi-4` (14B) base model. It is fine-tuned through a three-stage training pipeline (SFT cold start → RL with GRPO → SFT), aiming to achieve strong multi-step reasoning capabilities. The core methodology is derived from three papers: "DeepSeek-R1" (Nature, vol. 645), "DeepSeekMath" (arXiv:2402.03300v3), and "Phi-4-reasoning" (arXiv:2504.21318).
However, unlike the Microsoft paper, I moved the reinforcement-learning-guided reasoning stage earlier in the pipeline. Following DeepSeek's approach, I used an SFT + GRPO + SFT recipe to produce explicit Chain-of-Thought (CoT), attempting to induce reasoning abilities in the Phi-4 instruct model through a combination of reinforcement learning and supervised learning.
Unlike the Microsoft paper, which fine-tunes on data from the o3-mini teacher model, this model uses a dataset distilled from the open-source model gpt-oss-120b as its fine-tuning knowledge base, thereby inheriting that model's reasoning style and knowledge.
Fig. 1 | The multistage pipeline of DeepSeek-R1. A detailed background on DeepSeek-V3 Base and DeepSeek-V3 is provided in Supplementary Information, section 1.1. The models DeepSeek-R1 Dev1, Dev2 and Dev3 represent intermediate checkpoints in this pipeline.
Stage 1: Foundational Supervised Fine-Tuning (Foundational SFT)
- Cold Start:
Dataset Name/Source | Sample Size |
---|---|
OpenMathReasoning-mini | 4,600 |
Here, I also adopted the method from the paper. According to "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning" (doi: 10.1038/s41586-025-09422-z), the main purpose of the "cold start" (the initial supervised fine-tuning, SFT, stage) before GRPO reinforcement learning is to address several key deficiencies exposed by the initial model trained purely through reinforcement learning (DeepSeek-R1-Zero).
Specifically, this cold start stage aims to:
Improve Readability and Align with Human Preferences: Although the purely RL-trained DeepSeek-R1-Zero has strong reasoning abilities, its outputs suffer from "poor readability" and "language mixing" (e.g., mixing Chinese and English). The cold start stage uses thousands of high-quality data samples that "demonstrate a conversational style aligned with human thought processes" for fine-tuning, aiming to make the model's expression and thinking process more aligned with human habits and preferences.
Lay the Foundation for the Subsequent RL Stage: This step acts as a "calibration" for the model, enabling it to adhere to human-expected language and format specifications while maintaining its powerful reasoning capabilities. This paves the way for the subsequent reinforcement learning stage, which incorporates more complex tasks (such as general conversation and writing), ensuring that the final model (DeepSeek-R1) is not only strong in reasoning but also has good generality and a better user experience.
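To make the cold-start step concrete, here is a minimal sketch of turning one distilled CoT sample into a chat-formatted SFT example with explicit `<think>` tags. The field names `problem`, `generated_solution`, and `expected_answer` are assumptions for illustration, not verified against `OpenMathReasoning-mini`.

```python
def to_cold_start_example(sample: dict, system_prompt: str) -> dict:
    """Wrap one distilled math CoT sample in the chat format used for SFT."""
    # Hypothetical field names; adapt them to the actual dataset schema.
    assistant_text = (
        f"<think>\n{sample['generated_solution']}\n</think>\n\n"
        f"{sample['expected_answer']}"
    )
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sample["problem"]},
            {"role": "assistant", "content": assistant_text},
        ]
    }
```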
Stage 2: Reinforcement Learning with GRPO

Fig. 2 | Illustration of the proposed GRPO for RL-based training.
- Algorithm: GRPO (Group Relative Policy Optimization), an advanced reinforcement learning algorithm that estimates the advantage function by comparing the relative merits of a group of candidate outputs, yielding more stable policy-gradient updates.
- Datasets: `open-r1/DAPO-Math-17k-Processed` and `openai/gsm8k` are used as the RL training datasets; their reference answers drive the reward signal.
Dataset | Primary Use | Data Format/Features | Role in Project |
---|---|---|---|
`open-r1/DAPO-Math-17k-Processed` | High-difficulty problems spanning basic to advanced mathematics, used to drive the RL training of the model's reasoning abilities. | Contains English math problems and their solutions. `prompt`: the text of the math problem. `solution`: the reference answer, usually a number or an expression. | The main training source during GRPO. The model generates a reasoning process and answer for each `prompt`, and the `solution` serves as the benchmark for computing the reward score (e.g., validated through the `check_answer` and `check_numbers` functions), thereby optimizing the policy. |
`openai/gsm8k` | Benchmarking and training on grade-school math word problems; guides the model to learn reasoning on easier problems. | Contains `question` and `answer` fields. The `answer` field usually includes detailed reasoning steps (Chain-of-Thought) and ends with the format `#### <correct answer>`. | Used as a source of English math problems during GRPO training. Its reference answers are extracted precisely, providing an objective, reliable standard for the reward function (`correctness_reward_func`) that evaluates the correctness of the model's output. In DeepSeekMath (another DeepSeek paper), it is also used in the reinforcement-learning stage. |
`Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k` | Auxiliary; used only to prevent overfitting to mathematical output formats. | A distilled dataset containing Chinese Chain-of-Thought. The code samples 2,000 instances from its `train` split with random seed 20250924. | Its `Answer_content` field provides complete reference answers, used not only to evaluate content similarity but also as a reference for length, style, and format alignment, guiding multiple reward functions (`content_alignment_reward_func`, `zh_quality_reward_func`). |
Objective: Building on the SFT cold start, this stage uses reward signals to guide the model to explore and learn better reasoning strategies. The core goal is not simply to imitate human-annotated solution paths, but to allow the model to autonomously evolve more powerful and diverse reasoning abilities with only feedback on the correctness of the final answer. This includes two main objectives:
- Guide the model to explore reasoning thought processes: Incentivize the model to generate detailed, structured, and logically coherent Chains-of-Thought (CoT), and even develop strategies that surpass human examples, such as self-reflection, verification, and multi-perspective exploration.
- Improve the correctness of the final answer: Ensure that while optimizing the reasoning process, the model can more reliably converge to the correct final answer (completing reasoning and answering within the 8192 token limit, avoiding reasoning for the sake of reasoning).
Algorithm: GRPO (Group Relative Policy Optimization), an advanced and efficient reinforcement learning algorithm used in this project, is a variant of Proximal Policy Optimization (PPO). Unlike traditional PPO, which requires training an additional value network (Value Model / Critic), GRPO simplifies the training process and significantly reduces resource consumption in the following ways:
- Group Sampling: For each problem, the Policy Model generates a group of G candidate outputs. In this project's code, the group size `num_generations` is set to 4.
- Reward Evaluation: A reward system composed of multiple functions scores each candidate output within the group, producing a comprehensive scalar reward r.
- Group Relative Advantage Estimation: The core of GRPO is that it does not rely on a separate value network to estimate the baseline. Instead, it uses the average reward of all candidate outputs within the group as the baseline and estimates the advantage A from each output's deviation from that average (e.g., normalizing by the group mean and standard deviation); see the sketch after this list.
- Policy Update: The model updates the policy network based on the calculated relative advantages. Outputs with rewards above the group average are positively reinforced, while those below the average are suppressed. This makes the policy-gradient updates more stable and naturally aligns with the comparison-based training of reward models.
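To make the group-relative baseline concrete, here is a minimal sketch (my own illustration, not the project's training code) of computing advantages from a group of reward scores, assuming the project's group size of `num_generations = 4`:

```python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """One advantage per candidate, measured relative to its own group."""
    g = len(rewards)                       # group size, e.g. num_generations = 4
    mean = sum(rewards) / g                # group baseline (no critic needed)
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four candidates for one prompt, scored by the reward system.
print(group_relative_advantages([2.0, 0.5, 0.5, -1.0]))
```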
Important: Reward Function Construction

Fig. 6 | DeepSeek Reward function setup and inference label guidance.
- Reward System: To guide the model's optimization along multiple dimensions, the code implements a comprehensive reward-function system (in principle, a fourth RL step for preference alignment should follow the final SFT, but resource constraints ruled that out). It combines multiple weighted reward and penalty signals to shape the model's behavior, and can be divided into the following modules:
Core Objective Rewards:
- `correctness_reward_func`: Based on the reference answers from `gsm8k` and `DAPO-Math-17k-Processed`, this function gives the highest positive reward for the correctness of the final numerical result, serving as the core signal that ensures the model learns to solve problems.
- `strict_format_reward_func`: Strictly enforces the `<think>...</think>` reasoning format and requires the final answer to be placed within `<answer>` and `</answer>` tags for easy verification by the answer reward. It deducts points for format errors (such as missing, repeated, or out-of-order tags), ensuring the parsability of the reasoning process.
Quality & Alignment Rewards:
- Content and Length Alignment: `content_alignment_reward_func` (based on unigram F1 score) and `ref_length_alignment_reward_func` encourage the model to generate responses similar in content and length to high-quality reference answers.
- Style and Format Alignment: `md_style_reward_func`, `style_match_to_ref_reward_func`, and `math_layout_reward_func` incentivize the model to use more readable Markdown (headings, lists), to typeset mathematics correctly with LaTeX, and to learn from well-formatted elements in reference answers (tables, code blocks).
- Chinese Quality and Language Consistency: `zh_quality_reward_func` specifically targets Chinese content, jointly evaluating the proportion of Chinese characters, content F1 score, and Markdown structure similarity.
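For reference, a minimal sketch of a unigram-F1 content-alignment score (an illustration of the idea only; the project's tokenization and weighting may differ):

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 overlap between candidate and reference unigrams (whitespace tokens)."""
    # Chinese text would need character- or word-level segmentation instead.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```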
Behavioral Regularization & Penalties:
- Anti-Redundancy and Minimal Answers: `anti_redundancy_penalty_func` penalizes repetitive phrases and filler words in the Chain-of-Thought, while `min_answer_penalty_func` penalizes cases where the reasoning looks detailed but the final answer is overly brief. It accounts for reasoning markers found in the distilled data, such as "because", "so", "then", "no, wait", "check", "re-check", and "verify", in both Chinese and English.
- Length Control: `length_reward_func` and `think_soft_cap_penalty_func` work together. The former inspects the problem and sets ideal length ranges for reasoning and answer depending on the problem type (math vs. general long-form text); the latter applies a soft penalty to "thinking" segments exceeding 2,000 characters, which matches the observation in the research that RL tends to spontaneously increase CoT length. A sketch of the soft cap follows this list.
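A hedged sketch of such a soft length cap; the threshold comes from the description above, while the penalty slope is an illustrative assumption:

```python
def think_soft_cap_penalty(think_text: str, cap: int = 2000, rate: float = 0.001) -> float:
    """Linearly growing penalty once the thinking section exceeds `cap` characters."""
    overflow = max(0, len(think_text) - cap)
    return -rate * overflow  # 0.0 under the cap, increasingly negative beyond it
```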
Exploration & Diversity Incentives:
`intra_sample_style_diversity_reward_func` & `intra_sample_reasoning_diversity_reward_func`: a newer pair of reward functions that evaluate the diversity among the multiple candidate answers generated for the same problem. By rewarding outputs with more varied styles and reasoning paths, they actively encourage the model to break out of a single optimal path and explore a wider space of problem-solving strategies, thereby improving robustness (Maj@K performance).
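One simple way to score such intra-group diversity (a sketch under my own assumptions; the project's metric may differ) is the average pairwise Jaccard distance between the candidates' token sets:

```python
from itertools import combinations
from typing import List

def intra_sample_diversity(candidates: List[str]) -> float:
    """Mean pairwise Jaccard distance between token sets of a candidate group."""
    if len(candidates) < 2:
        return 0.0
    token_sets = [set(c.split()) for c in candidates]
    distances = [
        1.0 - len(a & b) / max(len(a | b), 1)
        for a, b in combinations(token_sets, 2)
    ]
    return sum(distances) / len(distances)
```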
Stage 3: Post-RL SFT (Alignment and Capability Consolidation)

Fig. 3 | Learning rate and hyperparameter recommendations in the Phi-4 paper.

Fig. 4 | System prompts used during training.
- Reference Paper: "Phi-4-reasoning Technical Report" (arXiv:2504.21318)
My implementation primarily references Section 3, "Phi-4-reasoning: Supervised Finetuning of Phi-4", of that paper. It also adopts the core strategy emphasized there: using a powerful teacher model (the paper uses o3-mini; my dataset includes data distilled from gpt-oss-120b-high and DeepSeek-R1) to generate high-quality training samples containing long Chains-of-Thought.
The `system_message` I constructed has a purpose and structure consistent with the system prompt shown on page 10 of the paper, guiding the model to generate structured and detailed reasoning processes containing `<think>` tags.
- Objective: To mitigate the "alignment tax" that reinforcement learning can introduce, i.e., the model forgetting general instruction-following abilities while optimizing for a specific capability. This stage draws on stronger reasoning teacher models for supervised learning to generalize, strengthen, and expand knowledge, much as a teacher consolidates the correct solutions and introduces new material after a class discussion.
- Process: The GRPO-optimized model undergoes another round of supervised fine-tuning, this time on the 87k-sample SFT dataset with a large effective batch size. This step merges and calibrates the reasoning strategies learned in the RL phase with the broad knowledge acquired during SFT.
Dataset Name/Source | Sample Size | Description |
---|---|---|
`Jackrong/Natural-Reasoning-gpt-oss-120B-S1` | 70,000 | Core dataset. Distilled from `facebook/natural_reasoning`, providing high-quality, high-difficulty general reasoning problems covering STEM, economics, social sciences, etc. |
`Jackrong/GPT-OSS-120B-Distilled-Reasoning-math` | 5,000 | Focuses on mathematical reasoning and problem solving. |
`Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k` | 5,000 | Provides high-quality Chinese Chain-of-Thought data to strengthen the model's Chinese reasoning abilities. |
`deepseek_if_shuffled-v1.json` | 5,000 | Focuses on improving the model's ability to understand and execute complex instructions. |
`deepseek_code_shuffled-v1.json` | 2,000 | Endows the model with capabilities for code analysis, bug fixing, and code generation. |
Total | 87,000 | - |
- Objective: To inject structured "Chain-of-Thought" capabilities into the instruction model, teaching it how to perform explicit reasoning.
- Dataset: A mixed dataset of 87,000 samples, with `Jackrong/Natural-Reasoning-gpt-oss-120B-S1` as the main body, supplemented by data from multiple domains such as mathematics, code, Chinese, and instruction following.
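A minimal sketch of how such a mixture could be assembled with the `datasets` library. Dataset IDs and sample counts come from the table above; the shuffling seed, the local JSON paths, and the assumption of a shared schema are mine:

```python
from datasets import load_dataset, concatenate_datasets

def take(ds, n, seed=42):
    """Shuffle and keep the first n rows (seed is an illustrative choice)."""
    return ds.shuffle(seed=seed).select(range(n))

natural = take(load_dataset("Jackrong/Natural-Reasoning-gpt-oss-120B-S1", split="train"), 70_000)
math_rs = take(load_dataset("Jackrong/GPT-OSS-120B-Distilled-Reasoning-math", split="train"), 5_000)
zh_cot  = take(load_dataset("Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k", split="train"), 5_000)
instr   = take(load_dataset("json", data_files="deepseek_if_shuffled-v1.json", split="train"), 5_000)
code_rs = take(load_dataset("json", data_files="deepseek_code_shuffled-v1.json", split="train"), 2_000)

# Assumes the subsets have been mapped to a common column schema beforehand.
sft_mixture = concatenate_datasets([natural, math_rs, zh_cot, instr, code_rs]).shuffle(seed=42)
print(len(sft_mixture))  # 87,000
```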
Model's Actual Performance:
- Below are actual tests run in LM Studio using the GGUF Q8 quantized model. For each question, the chain of thought and the answer are copied for reference:
1️⃣ General Chat Q&A Example: LLM Technology Q&A: What is reinforcement learning?
Test Question
In LLM training, what is reinforcement learning?
Model's Chain-of-Thought
The user asks: "In LLM training, what is reinforcement learning?" Let's dissect the question:
- User wants to understand how RL works in LLM training.
- They might be confused about whether it's necessary or effective. Provide clear explanation of RL fine-tuning vs supervised.
- Use simple analogies: teacher vs coach.
- Cover key points: reward shaping, policy optimization, PPO/RM.
- Emphasize that RL fine-tuning can improve quality but not always required. Good for alignment and safety.
- Give examples: OpenAI's RLHF with ChatGPT, Gopher uses supervised + RL.
Now structure the answer:
- TL;DR
- How LLMs are trained (pretraining vs fine-tuning).
- Introduction to RL in NLP.
- Core idea of reward function and policy gradient.
- Workflow: collect data -> define rewards -> train agent.
- Specific algorithms used (PPO, APEX, etc.)
- Benefits & drawbacks.
- Use cases: alignment, safety, few-shot adaptation.
- Comparison with supervised fine-tuning.
- Closing remarks.
Ensure clarity and depth but avoid jargon overload. Use bullet points for readability. Also include references to recent papers like "Self-Instruct" or "DialoGPT". Check that answer aligns with the user's likely background (basic). Provide actionable insights.
Model's Output Answer
Reinforcement Learning (RL) in large language model (LLM) training is a process where the model learns by receiving feedback on its outputs—essentially, it gets rewarded or penalized based on how well its responses align with desired behaviors. This method supplements the standard supervised fine-tuning that most LLMs undergo and can be used to improve aspects like alignment, safety, and adaptability.
Below is a concise yet thorough explanation of what RL means in this context:
1️⃣ TL;DR
- Supervised training teaches an LLM what “correct” responses look like by comparing its outputs to human-written examples.
- Reinforcement Learning (RL) adds another layer: after the model generates a response, it gets a score (the reward) that tells it how well the response fits specific goals (e.g., helpfulness, safety). The model then adjusts itself to get higher scores on future inputs.
2️⃣ How LLMs are usually trained
Pre-training:
- Massive amount of unlabeled text (books, websites) is fed into a transformer network.
- Objective: predict the next word given previous ones (language modeling).
- Result: a very powerful “general” model that can produce coherent text.
Supervised fine-tuning:
- A smaller dataset of human-annotated examples (e.g., question–answer pairs) is used.
- The model’s parameters are tweaked to minimize the difference between its predictions and the target answers.
- Outcome: a model that can perform specific tasks like answering questions, summarizing text, etc.
3️⃣ What RL brings into the mix
RL introduces a feedback loop where the model is not just shown “correct” examples; it’s also told how good or bad its own generated responses are in a quantitative way (the reward). This can be used to:
- Align better with human values: e.g., produce helpful, safe, unbiased answers.
- Improve robustness and safety: learn to avoid harmful outputs even if they appear plausible.
- Adapt quickly to new domains with fewer examples.
4️⃣ Core components of RL in LLM training
A. Agent
- The trained LLM itself acts as the agent that produces sequences (e.g., chat messages).
B. Environment
- The environment is the task context: a user query, a conversation history, etc.
C. Policy
- The model’s current “policy” is its learned mapping from inputs to output distributions (i.e., what words it thinks should come next).
- RL aims to improve this policy by maximizing expected reward.
D. Reward Signal
- A scalar value that measures how well a generated response meets predefined objectives.
- Simple examples: +1 for a correct answer, −1 for an incorrect one.
- More sophisticated designs: combinations of multiple metrics (e.g., factual accuracy, politeness, toxicity) weighted by importance.
E. Training Algorithm
- Common RL algorithms used with LLMs:
- Proximal Policy Optimization (PPO): balances exploration and exploitation; stable and efficient.
- Reinforcement Learning from Human Feedback (RLHF): a pipeline where human preferences are collected to shape the reward function.
- Self-Instruction: an unsupervised RL method that generates its own training data by prompting itself with instructions.
5️⃣ Typical Workflow (RLHF pipeline)
Data Collection:
- Human trainers provide pairs of model outputs and a preference judgment (“A is better than B”).
- Alternatively, collect user interactions or automatically generate feedback using heuristics.
Reward Modeling:
- Train a separate neural network (the reward model) to predict the human preferences from raw responses.
- The reward model’s output becomes the scalar reward for each generated sequence.
RL Fine-Tuning:
- Use the learned reward model to guide policy optimization of the LLM via PPO or similar algorithms.
- Optionally, fine-tune a smaller “policy network” that learns from the reward signals while keeping the original LLM parameters frozen (to save computation).
6️⃣ Why RL is useful for LLMs
Aspect | Supervised Fine-Tuning | Reinforcement Learning |
---|---|---|
Data | Requires labeled examples for each task | Can leverage large amounts of unlabeled data + fewer human labels |
Alignment | Limited to what the dataset contains; can be biased | Directly optimizes for user-defined goals (e.g., helpfulness) |
Exploration | Always produces deterministic responses based on training set | Encourages exploration, enabling discovery of new strategies or behaviors |
Safety | Relies heavily on the quality/cleanliness of labels; hard to guarantee safety | Can explicitly penalize unsafe outputs and reinforce safe ones |
Flexibility | Fixed once trained; adapting requires re-training with new data | Continual adaptation possible via online RL (e.g., user feedback loop) |
7️⃣ Challenges & Trade-offs
- Reward Engineering: Designing a reward that captures nuanced human values is hard and often requires manual tuning or human annotations.
- Sample Efficiency: RL typically needs many more generated samples to converge, making it computationally expensive relative to supervised fine-tuning.
- Stability: Improper reward functions can cause the model to over-fit to spurious patterns (e.g., learning to output “I don’t know” to avoid penalties).
- Scalability: Training RL on billion-parameter LLMs is resource-intensive; often done with pre-trained policy networks or meta-learning.
8️⃣ Summary
Reinforcement Learning in LLM training is a powerful paradigm that adds an explicit feedback loop where the model receives scores (rewards) based on how well its responses meet predefined goals. By optimizing for these rewards, RL can help:
- Align better with human values.
- Improve safety and robustness.
- Adapt quickly to new tasks or domains.
The process involves collecting rollouts, scoring them with a reward signal, and using policy-gradient methods (like PPO) to update the model’s parameters. While computationally heavier than supervised fine-tuning alone, RL offers unique advantages in guiding models toward desirable behaviors that are hard to capture purely from labeled data.
2️⃣ Knowledge Q&A Example: Common Sense Reasoning
Test Question
Who discovered the laws of heredity through cross-breeding experiments with pea plants? By controlling the transfer of pollen and observing the phenotypes of subsequent generations, they discovered patterns of gene inheritance such as recessive and dominant traits.
Model's Chain-of-Thought
Hmm, the user is asking "Who discovered the laws of heredity through cross-breeding experiments with pea plants? By controlling the transfer of pollen and observing the phenotypes of subsequent generations, they discovered patterns of gene inheritance such as recessive and dominant traits?" They are asking who. First, pea plants are famous as research material in history. I might think of a certain scientist, controlling pollen transfer, observing offspring phenotypes... This sounds like a foundational experiment in genetics.
The first person that comes to mind is Mendel. He did indeed use pea plants for cross-breeding experiments. But the user says "discovered patterns of gene inheritance such as recessive and dominant traits," which seems to refer to a specific scientist's research findings. Mendel was the first to describe recessive and dominant traits, discovering dominant and recessive alleles in his pea experiments.
However, the user's question of "who" might have a point of confusion. For example, were there other scientists involved in similar research? But Mendel conducted these experiments around the 1860s, and peas were the classic material.
Let's look at the question more carefully: the user says "By controlling the transfer of pollen and observing the phenotypes of subsequent generations," which is similar to Mendel's method. He performed a large number of crosses, for example, red-flowered and white-flowered peas, obtaining an F1 generation that was all red-flowered (dominant), and then self-crossing to get a 3:1 ratio in the F2 generation.
Perhaps the user is confusing other names, or there are other related experiments in genetics research. But the most famous pea cross-breeding experiments and the discovery of recessive and dominant traits are attributed to Mendel.
Recalling history again: Mendel did this work in an Austrian monastery, and his experimental results later became known as "Mendel's laws of inheritance." Therefore, the answer should be "Gregor Mendel."
But I need to confirm I haven't misunderstood. The user might be thinking of other names, but peas and recessive/dominant traits point directly to Mendel.
Finally, check for other possibilities: for example, Kruger? He studied chromosomes, not pea hybridization. Or Carl Correns, but he didn't do this experiment either.
Model's Output Answer
Gregor Mendel.
He discovered the laws of heredity by conducting cross-breeding experiments through controlled pollination of pea plants and observing their offspring's phenotypes. Mendel carried out a series of systematic pea experiments between 1865 and 1866, which revealed the concepts of dominant and recessive alleles and led him to propose the core principles now known as "Mendel's Laws of Inheritance": the Law of Segregation and the Law of Independent Assortment. This work laid the foundation for modern genetics, although his findings were initially overlooked and not recognized until their rediscovery in the early 20th century.
3️⃣ Role-playing and Adversarial Q&A Example: Scientific Theory Debate: Lamarck vs. Darwin
Test Question
Scenario: You are a staunch "Lamarckian" biologist, debating in a world where this theory is the academic consensus.
A challenge from a Darwin supporter is as follows:
"The reason giraffes' necks are so long is not because their ancestors strove to stretch their necks and then passed this acquired 'effort' to their offspring. The reality is that within the ancient giraffe population, there were always some individuals who, due to random mutations, were born with longer necks than others. When food at lower levels became scarce, these long-necked individuals were more likely to survive because they could eat leaves from higher branches, and they passed their long-neck genes to the next generation. Over long periods, this advantage was continuously amplified, eventually forming the giraffes we see today. This is a more logical and evidence-based explanation."
Please begin your rebuttal.
Model's Chain-of-Thought
The user asks me to respond as a staunch Lamarckian biologist to a Darwinian viewpoint, specifically regarding the controversy between acquired inheritance and natural selection in giraffe evolution. This requires a deep understanding of both theories and using the Lamarckian framework to refute Darwin's theory.
First, identify the key points: The user's view is that giraffe evolution is caused by random mutation and natural selection (classic Darwinian evolution), while Lamarckism emphasizes that traits acquired by an individual during its lifetime can be inherited by its offspring. I need to refute the concepts of random mutation and natural selection, focusing on how the use and disuse principle explains giraffe evolution.
Key Steps:
- Identify the core of the issue: Darwin's theory relies on random mutation and natural selection; Lamarckism believes traits are changed by individual use and are passed on to offspring.
- Challenge the Darwinian view: The problem: How did giraffes evolve extremely long necks from short-necked ancestors so quickly? Random mutations are too rare, and the natural selection process takes many generations.
- Explanation: Lamarck's theory provides a more logical explanation—giraffe ancestors constantly used their necks to stretch to high places for food, causing their muscles and bones to develop stronger and longer. This process of use and disuse gradually changed their bodies, and the traits acquired during their lifetime were passed to their offspring.
- Provide evidence/arguments: Observe cases where individual use leads to improved organ development (e.g., training animals or human physical fitness).
- Emphasize that young members of the giraffe family already have stronger necks, without having to wait for generations of mutations to accumulate.
- Challenge natural selection: Lack of random mutations: How to explain the lack of a sufficient number of intermediate forms (fossil records) for giraffes? Lamarck's theory avoids this problem.
- Mutation and fitness: Mutations are often harmful or neutral, making it difficult to produce a significant advantage. Use and disuse, however, directly lead to increased fitness.
- Countering rebuttals: Darwin supporters might cite modern genomic evidence (like DNA changes). I need to question whether these explanations are entirely correct and emphasize that environmentally created traits are directly inherited.
- Concluding statement: Lamarck's theory provides a simpler, more observationally consistent explanation, while Darwin's theory is fraught with too many assumptions and complexities.
Points to note:
- Maintain a firm stance; do not concede any Darwinian elements.
- Use Lamarckian terminology (use and disuse, inheritance of acquired characteristics).
- Be confident and directly refute the user's viewpoint.
Potential pitfalls:
- Avoid offering a mixed theory or implying a combination of both; must stick to Lamarckism.
- Avoid acknowledging any evidence of mutation inheritance, as it would weaken the position.
Thought Process:
- Introduce the giraffe problem: The Darwinian explanation lacks direct observation.
- Substitute with Lamarckism: Describe how use leads to neck development, which is inherited by offspring.
- Emphasize advantages: rapid, direct adaptation; explains intermediate forms.
- Challenge Darwinian arguments: mutations are rare, natural selection is slow.
- Conclude: Lamarck's theory is the scientific consensus.
Model's Output Answer
As a staunch Lamarckian biologist, I must state unequivocally: the Darwinian theory of "natural selection" is not only a hypothesis riddled with holes and logical fallacies, but it is also completely incapable of explaining significant evolutionary phenomena like the giraffe. Let's analyze the points you've raised from the ground up.
I. The Fundamental Contradiction of Random Mutation and "Natural Selection" Theory
First, Darwinism relies on two key concepts: random genetic mutation and natural selection. You claim that giraffes exist because "within the ancient giraffe population, there were always some individuals who, due to random mutations, were born with longer necks." But here lies the core problem:
The Irrationality and Unpredictability of Mutation: Mutation is a completely disordered, accidental process. It occurs at any time, in any place, and the vast majority of mutations are harmful or neutral, with only a very small fraction possibly conferring an advantage. If giraffe evolution depended entirely on random mutations, we should see thousands of fragmented remains with necks of varying lengths in the geological record, but this is not the case. The process of giraffe evolution is highly ordered and predictable, which random mutation simply cannot explain.
The Logical Dead End of "Natural Selection" Theory: Darwinism posits that giraffes were selected because of individuals "born with longer necks." But here you overlook a critical question—how did the early giraffe ancestors (like short-necked bovids) get "born" with necks long enough to reach high leaves in the first place? This brings us back to the problem of mutation. In an extremely rare, uncertain random event, the "just right" mutation just happened to occur—this is based on faith, not science! And even if we assume such mutations did occur, they would have to accumulate over tens of millions of years, hundreds or even hundreds of millions of generations, to reach the neck length of today's giraffes. Your theory is not only absurd on a time scale (the geological record is not nearly that long) but also full of logical flaws.
II. Lamarckism: The Truly Scientific Explanation
In contrast, Lamarck's theory offers an explanation that is both simple and consistent with observation:
The Directness and Predictability of the Use and Disuse Mechanism: Giraffe ancestors constantly faced the pressure of food shortages, especially when low-lying vegetation was scarce. They were forced to stretch their necks to find leaves higher up. This process of "use" directly influenced the development of muscles, bones, and blood vessels—the neck became longer and stronger. Conversely, parts that were not used or used less gradually degenerated (disuse). This process is a direct result of an individual's actual experience throughout its life, and it perfectly matches what we observe: young giraffes already possess relatively long, sturdy necks.
The Definite Mechanism of Acquired Inheritance: Lamarckism is not a "fabricated" theory. It relies on a concept widely accepted and confirmed in modern biology—that environmentally induced epigenetic changes can influence offspring development through certain mechanisms (such as DNA methylation, non-coding RNA regulation, etc.). These changes are heritable and do not need to wait tens of thousands of years to manifest. Once giraffe ancestors began to "use" and strengthen their necks, their offspring inherited these improvements—this is precisely the result of the "effort" you mentioned.
Explaining Intermediate Forms and Rapid Adaptation: Darwin's theory is often criticized for the lack of sufficient evidence of intermediate evolutionary forms. Lamarckism, however, does not need to "assume" the existence of a large number of individuals with different neck lengths, because each generation improves (use) or degenerates (disuse) upon the existing foundation. These changes are continuous and cumulative. In the giraffe populations we can observe, there are individuals with necks of various lengths—a result of their different degrees of use during their lifetimes.
III. Real-World Evidence and Direct Comparison
- The Textbook Case of Horses and Trainers: Humans have long ridden horses, causing their muscles to develop stronger hind legs and more flexible necks to support the rider. This change in physique, caused by "use," becomes apparent within one or two generations—newborn foals have stronger and more flexible hind legs and necks than their ancestors, without waiting for random mutations to accumulate for thousands of years.
- Human Height Differences: In environments with different nurturing conditions (nutrition, exercise, stress), human individuals can significantly change their height during their growth period. These changes—although partly regulated by genes—also involve environmentally induced epigenetic mechanisms that can influence the developmental trajectory of offspring.
- Giraffes Themselves: Modern research shows that individuals in giraffe populations exhibit relatively long necks and more robust support structures at a young age, and these traits do not require thousands of generations of random mutations to accumulate. They are acquired through the "use" of their ancestors and are directly passed on to offspring through epigenetic mechanisms.
IV. Conclusion: Lamarckism is the Scientific Consensus; Darwin's "Natural Selection" is an Unsolved Mystery
- The true reason for the evolution of giraffes lies in how their ancestors changed their body structure through the mechanism of "use and disuse." These changes—driven directly by environmental pressure and inherited by offspring via epigenetic means—are entirely consistent with the logic and time scale of the real world.
- Your proposed theory of "random mutation + natural selection" is built on the premise that giraffes must rely on extremely rare, accidental, and disordered mutational events. This is not only scientifically absurd but also lacks observable supporting evidence.
- Lamarck's theory has been widely verified and accepted by modern biology (especially in the field of epigenetics). It is no longer a "fringe" hypothesis—it is, in fact, the scientific consensus.
Therefore, I say without hesitation: the reason giraffes' necks are so long is not because of random mutation and "natural selection." These are merely an absurd theoretical construct. What truly explains their evolution is undoubtedly the conclusive, verifiable mechanism of "use and disuse" in Lamarckism!
4️⃣ Mathematical Reasoning: Classic Physics Problem Example: Train and Bird
Test Question
Two trains start from two locations 100 km apart and travel towards each other on the same straight track. The first train's speed is 40 km/h, and the second train's speed is 60 km/h. A bird simultaneously departs from the first train and flies back and forth between the two trains at a speed of 80 km/h until the trains meet.
Question: How far did the bird fly in total?
Model's Chain-of-Thought
I will now solve this problem: Two trains start moving towards each other from 100 km apart on the same straight track. Train 1 travels at 40 km/h, and Train 2 travels at 60 km/h. At the same time, a bird takes off from Train 1 and flies back and forth between the two trains at a speed of 80 km/h until they meet. The question is: How far did the bird fly in total?
First, I need to analyze the time it takes for the trains to meet. Since they are moving towards each other, their effective closing speed is the sum of the speeds of Train 1 and Train 2: 40 km/h + 60 km/h = 100 km/h. The trains are initially 100 km apart, so how long will it take for them to meet? Divide the distance by the closing speed: 100 km / 100 km/h = 1 hour.
Next, consider the bird's flight. The bird takes off from Train 1 at time t=0 (let's assume). The bird flies back and forth between the two trains at a speed of 80 km/h until the trains meet. Since I calculated that the trains will meet in 1 hour, the bird can also only fly for 1 hour (as it takes off at the same time the trains start and stops when they meet). So this problem might be quite simple: the bird flew a total of 80 km/h * 1 h = 80 km?
However, the problem states that the bird flies back and forth until the trains meet. But since they are continuously moving towards each other, and the bird is always flying at a constant speed, I actually only need to know how long the entire process lasted, without calculating the specific number of back-and-forth trips or distances. Because the bird is flying in a straight line at any given moment (assuming the track is wide enough), its total flight distance is simply speed multiplied by time.
But there might be a trap: for example, what if the trains' speeds changed? Or if the bird stopped? However, the problem clearly states that the trains are always moving towards each other at constant speeds (40 km/h and 60 km/h), and the bird flies at 80 km/h until the trains meet. So these conditions are fixed.
Also, I might consider whether the bird overshoots Train 2's position when flying back. But since they are moving towards each other, Train 2 is always chasing Train 1, so the bird will always find its next target (on either side). In fact, we can understand that the bird has enough time to reach the other train on each trip because the distance between them is decreasing.
Analyzing further: at t=0, Train 1 and 2 are 100 km apart. For t>0, the positions of the trains change by -40t km (to the right) and +60t km (to the left), so the distance between them decreases by 100 km per hour. At t=1h, they meet.
The bird takes off from Train 1 at a speed of 80 km/h (towards Train 2). When t>0, Train 1's position has moved by -40t km, and Train 2's position has moved by +60t km. So the distance between the bird and Train 2 is: initial distance 100 km - (change in Train 1's position relative to the origin at t=0) + (change in Train 2's position relative to the origin at t=0), which is 100 km - (-40t) + 60t = 100 km + 100t. But there's a contradiction here, because when the bird flies towards Train 2, Train 2 is also moving, so their relative distance should be the initial distance minus the sum of their speeds multiplied by time? No, we need to consider the change in the train's position as the bird flies towards it.
But there is a simpler way: since the trains take 1 hour to meet, and the bird is in motion for this entire duration, it flew a total of 80 km.
Is there an error? Perhaps the problem's "the fronts of the two trains meet" means the entire train is indivisible, and a collision actually occurs. But this doesn't affect the calculation of the bird's flight distance.
Let's also consider if we need to account for the time it takes for the bird to fly back and forth. For example, the first trip from Train 1 to Train 2 might take t1, and then back to Train 1 might take t2, and so on. But since the bird's speed is high (80 km/h), and the two trains are also closing in quickly (at 100 km/h), these back-and-forth flight times are very short. Ultimately, we only need to know the total time.
So the answer should be: the bird flew a total of 80 km. But I need to confirm there are no errors.
Model's Output Answer
The total distance the bird flies is equal to its speed multiplied by the total time the process lasts. The key to this problem is to calculate the time it takes for the trains to meet, which is also the bird's total flight time, without getting bogged down in the complex process of calculating how many times the bird flies back and forth.
Step-by-step derivation:
Calculate the relative speed of the trains (closing speed):
- The two trains are moving towards each other, so their total closing speed is the sum of their individual speeds.
- Relative speed = 40 km/h + 60 km/h = 100 km/h.
Calculate the time it takes for the trains to meet:
- The trains are 100 km apart and are approaching each other at a speed of 100 km/h.
- Time to meet = Total distance / Relative speed = 100 km / 100 km/h = 1 hour.
Calculate the total distance the bird flies:
- The bird starts flying when the trains start moving and flies until they meet.
- Therefore, the bird's flight time is the same as the time it takes for the trains to meet, which is 1 hour.
- The bird's speed is 80 km/h.
- Total distance flown = Bird's speed × Flight time = 80 km/h × 1 h = 80 kilometers.
Conclusion:
The bird flew a total of 80 kilometers.
5️⃣ Logical Reasoning Problem Example
Question
There are three friends: Anna, Brian, and Cindy. Each is from a different city—Beijing, Shanghai, and Guangzhou—and each has a different occupation: doctor, teacher, or engineer. It is known that:
- Anna is not from Beijing.
- The person from Guangzhou is an engineer.
- Brian is a teacher.
- Cindy is not from Guangzhou. Question: Who is the doctor?
Model's Chain-of-Thought
I will solve this logic problem. First, we have three people: Anna, Brian, and Cindy; they are each from a different city—Beijing, Shanghai, and Guangzhou—and each has a different occupation: doctor, teacher, or engineer. The problem provides four pieces of information, and we need to use the process of elimination to find out who the doctor is.
First, I should list all possible combinations and then gradually eliminate the ones that don't fit the conditions. Let me analyze each piece of information step by step.
1. Anna is not from Beijing.
This means Anna can only be from Shanghai or Guangzhou. First, we should note that Anna cannot be from Beijing, but her specific city is not yet determined.
2. The person from Guangzhou is an engineer.
This information tells me that the person from Guangzhou is an engineer. So, the engineer must be from Guangzhou. Next, I need to find out who is from Guangzhou, and then I will know which city corresponds to the engineer.
3. Brian is a teacher.
This means Brian's occupation is confirmed as a teacher. So, the doctor and the engineer cannot be Brian. But we still need to find out which city he is from to further deduce information about the others.
4. Cindy is not from Guangzhou.
This information tells me that Cindy cannot be from Guangzhou. So, the person from Guangzhou can only be Anna or Brian (since there are two people left among the three). But note that Brian's occupation is already determined to be a teacher, and the engineer must be from Guangzhou. This means Brian cannot be the engineer, so he cannot be from Guangzhou (because only the engineer is there). Therefore, the only possibility that fits is that Anna is the engineer and she is from Guangzhou. But we need to confirm if there is a conflict.
Now, let me summarize the known information:
- Anna cannot be from Beijing, so she can only be from Shanghai or Guangzhou.
- The engineer must be from Guangzhou.
- Brian is a teacher (cannot be an engineer).
- Cindy cannot be from Guangzhou.
Since Brian cannot be the engineer, and the engineer must be from Guangzhou, Brian cannot be from Guangzhou. Therefore, the only possible scenario is that Anna is the engineer and she is from Guangzhou (because Cindy cannot be from Guangzhou). But I need to confirm that this arrangement does not create a contradiction.
If Anna is an engineer and from Guangzhou, then according to condition 4, Cindy cannot be from Guangzhou, which is consistent. Brian's city is not restricted, but he cannot be an engineer (as there is only one engineer), and it's already established that he is a teacher.
Now we need to find the doctor. Since Anna is an engineer, and Cindy cannot be from Guangzhou, she could be from Beijing or Shanghai, and her occupation can only be a doctor (since Brian is a teacher). But I need to check for conflicts.
We should check Anna's city situation. Anna is not from Beijing (condition 1), so she can only be from Guangzhou or Shanghai. But if Anna is an engineer and from Guangzhou, it satisfies conditions 2 and 4. Brian is already confirmed as a teacher, and he cannot be from Guangzhou (because the engineer is there), so he could be from Beijing or Shanghai. Since Cindy is not from Guangzhou, she could also be from Beijing or Shanghai.
Now we need to determine who exactly is the doctor. Anna's occupation is already determined to be an engineer, and Brian is a teacher, so the only remaining profession, doctor, must belong to Cindy (as the three people each have a different occupation).
But I need to ensure this arrangement has no conflicts. Let me confirm each person's city situation.
- Anna: Engineer, Guangzhou (satisfies condition 2)
- Brian: Teacher, cannot be Guangzhou (because the engineer is there), so can only be Beijing or Shanghai.
- Cindy: Doctor, also cannot be Guangzhou, so could also be Beijing or Shanghai.
But we need to ensure there are no duplicate cities. Anna has already taken Guangzhou, so Brian and Cindy must be from Beijing and Shanghai, respectively, but who is where?
Now let me review the conditions.
Anna is not from Beijing
. She is from Guangzhou (not Beijing), which is consistent.The person from Guangzhou is an engineer
. The engineer, Anna, is from Guangzhou, which is consistent.Brian is a teacher
. His profession is correct.Cindy is not from Guangzhou
. She is not from Guangzhou (she could be from Beijing or Shanghai), which is consistent.
Now consider the question of the doctor:
- Anna: Engineer, Guangzhou
- Brian: Teacher, Beijing/Shanghai? The specific location doesn't matter, as his profession is already determined.
- Cindy: Doctor, Beijing/Shanghai? But she is the only one left to be the doctor.
So, the doctor can only be Cindy (as the three people each have a different occupation). So the answer should be Cindy.
Model's Answer
Based on the given conditions, we can deduce the solution step by step:
- Brian is a teacher (from condition 3). Therefore, Brian is not an engineer or a doctor.
- The person from Guangzhou is an engineer (from condition 2). Since Brian is a teacher, Brian cannot be from Guangzhou.
- Cindy is not from Guangzhou (from condition 4).
- Among the three people, neither Brian nor Cindy is from Guangzhou. Therefore, the only person remaining who could be from Guangzhou is Anna.
- Combining with point 2, the person from Guangzhou is an engineer, so Anna is the engineer and is from Guangzhou.
- Now, two professions have been determined: Anna is an engineer, and Brian is a teacher. Since the three people have different occupations, the remaining profession, "doctor," must belong to Cindy.
Conclusion: Cindy is the doctor.
Assessment of Reasoning Ability Evolution
Analysis of the model's actual outputs suggests that it has evolved from a purely instruction-following model into one with autonomous logical reasoning capabilities, driven largely by reinforcement learning with rule-based, outcome-level rewards rather than step-by-step human supervision. The model no longer passively generates answers but actively constructs, evaluates, and even refutes its own reasoning paths (with reasonably sound logic).
Deductive Reasoning and Logical Refutation: When handling the Mendel question, the model did not give a direct answer but constructed a "hypothesis space" of alternative candidates and eliminated them against the evidence in the question. This deductive process indicates that the model is actively applying logical rules for filtering rather than doing simple keyword matching.
Strategy Evaluation and Optimal Path Selection: Faced with the classic trap in the "two trains and a bird" problem, the model avoided the tempting but needlessly complex infinite-series approach. By reframing the problem around the core variable of time, it chose the simpler solution path. This demonstrates the model's ability to weigh different logical strategies and choose between them.
Self-Monitoring and Correction in the Reasoning Chain: The model's generated chain of thought is not a simple list of steps but contains rich logical connectives. More importantly, it shows a tendency for Self-Correction, a behavior similar to the "aha moment" stimulated by reinforcement learning mentioned in research. For example, during reasoning, the model might first propose a solution and then review and adjust it with an internal logic akin to "Wait, let me re-evaluate...". This behavior of self-monitoring and dynamic correction during the reasoning process reflects that the model has formed an internal standard for logical judgment.
- ✅ Through behaviors such as deductive elimination, strategy evaluation, and self-correction, the model demonstrates that its reasoning ability goes beyond format imitation toward genuinely self-driven logical thinking.
How to use
- ‼️ Add the training system prompt below to enable reasoning:
- 'system prompt' : You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: Thought section Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines.
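For example (a hedged sketch; the repository ID and generation settings are assumptions), the system prompt can be passed as the first chat message:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Jackrong/gpt-oss-120b-Distill-Phi-4-14B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

messages = [
    # Use the full training system prompt quoted above.
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. ..."},
    {"role": "user", "content": "How many positive divisors does 360 have?"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```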
Limitations
- My model's training steps and data volume are far smaller than the official ones. As a student, I do not have many resources: from dataset preparation and organization to model fine-tuning, I could only experiment and adjust with extremely limited compute. Many compromises were made in parameter settings, so the model's actual performance differs significantly from that of a true reasoning model. I made many adjustments to present the best performance possible with limited resources; I hope for everyone's understanding, and this test model is intended only as a reference.
- In the reinforcement learning (GRPO) stage, to quickly guide the Instruct model to spontaneously form strong mathematical reasoning abilities, I mainly relied on two math-oriented datasets and the script provided by unsloth. At the same time, the scores in my reward mechanism were highly focused on the correctness of the final answer—a result-oriented reward strategy. The advantage of this method is that it can quickly improve the model's performance on verifiable tasks using clear, rule-based reward signals.
- However, this strategy also brought expected side effects. In the reward function design, to encourage the model to generate structurally clear and human-readable answers, I added rewards for Markdown formatting. This incentive was generalized by the model to its internal reasoning chain, causing the model to frequently generate formatted text even during its thinking process (compounded by the Phi-4 base model's output preference). I originally thought this would clarify the thought process, but to some extent, it introduced formatting overhead unrelated to the core logic, possibly affecting the efficiency of pure reasoning.
- This reasoning-focused training also introduced a significant trade-off. On one hand, the model generalizes well to new reasoning problems, generating detailed chains of thought with clear causal logic and even a tendency toward self-reflection and correction (e.g., phrases like "Wait, let me re-evaluate..." appearing in the reasoning), consistent with the complex reasoning behaviors elicited by reinforcement learning reported in the cited papers. On the other hand, this specialization in reasoning tasks and output formats causes it to give overly concise answers to general, non-reasoning questions (it became too succinct, and my subsequent SFT was insufficient to fully recover this). The generation strategy was pulled toward a specific domain, sacrificing expressive richness in broader open-domain conversation, similar to phenomena observed in other reasoning-focused RL training.
- Since my training data covers both Chinese and English datasets, and the base model itself has been trained on multiple languages, there may be issues of language mixing in the model's reasoning process, i.e., using both Chinese and English in the same chain of thought or answer. Or, the answer output might be incomplete after the reasoning is done. Although I tried to mitigate this issue in later training, it remains an area for further improvement.
- Although the model performs well in areas like algebra word problems, its reasoning ability may not be balanced. For example, in some more specialized domains or in general chat (where subsequent SFT improvements were limited), its capabilities may be relatively weaker. Furthermore, the current model does not have the ability to call external tools, which limits its ceiling when solving complex problems that require precise calculations or access to real-time external knowledge.
Prompt Template Format
This model follows a structured chat format to process conversational inputs. Each message in the conversation is explicitly marked with a role (`system`, `user`, or `assistant`) and is enclosed by special tokens that delineate the start and end of each turn. This ensures the model can correctly track the conversational context and its role in the dialogue.
The general structure for a single turn is:
<|im_start|>ROLE<|im_sep|>MESSAGE_CONTENT<|im_end|>
- `<|im_start|>` and `<|im_end|>` are special tokens that mark the beginning and end of a message.
- `ROLE` can be `system`, `user`, or `assistant`.
- `<|im_sep|>` is a separator token that separates the role from the message content.
Example
A conversation with a system prompt and a user question would be formatted into the following single string:
<|im_start|>system<|im_sep|>You are a helpful assistant.<|im_end|><|im_start|>user<|im_sep|>Hello, what is the capital of South Korea?<|im_end|><|im_start|>assistant<|im_sep|>
The final `<|im_start|>assistant<|im_sep|>` sequence prompts the model to begin its generation.
Jinja Chat Template
The `tokenizer` is configured with the following Jinja2 chat template, which you can use with the `apply_chat_template` method.
{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
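For reference, here is a minimal usage sketch with the Hugging Face transformers library; it assumes the tokenizer shipped with this repository, and the messages are illustrative.

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with this model repository.
tokenizer = AutoTokenizer.from_pretrained("Jackrong/gpt-oss-120b-Distill-Phi-4-14B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, what is the capital of South Korea?"},
]

# Render the conversation with the Jinja template above and append the
# final <|im_start|>assistant<|im_sep|> generation prompt.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```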
中文
对 DeepSeek-R1《Nature》论文方法的复刻与探索🚀
Developed by: Soren
gpt-oss-120b-Distill-Phi-4-14B 是一个基于 `microsoft/phi-4` (14B) 基座模型的 GRPO 推理测试模型。它通过一个三阶段训练管线——SFT(冷启动)→ RLHF (GRPO) → SFT——进行微调,旨在实现卓越的多步骤推理能力。该模型的核心方法论源于 “DeepSeek-R1”(Nature, vol. 645)、“DeepSeekMath”(arXiv:2402.03300v3)、“Phi-4-reasoning”(arXiv:2504.21318) 三篇论文。
但与微软那篇论文不同的是,我将强化学习引导推理的部分提前到了更早的阶段。参考 DeepSeek 的做法,通过 SFT + GRPO + SFT 的流程生成显式“思维链”(Chain-of-Thought, CoT),尝试通过强化学习和监督学习引导 Phi-4 这个 Instruct 模型产生推理能力。
与微软论文中使用 o3-mini 教师模型进行标准微调不同,本模型采用开源模型 gpt-oss-120b 的蒸馏数据集作为微调知识库,从而继承其深厚的推理风格和知识体系。
Fig. 1 | The multistage pipeline of DeepSeek-R1. A detailed background on DeepSeek-V3 Base and DeepSeek-V3 is provided in Supplementary Information, section 1.1. The models DeepSeek-R1 Dev1, Dev2 and Dev3 represent intermediate checkpoints in this pipeline.
阶段一:基础监督微调 (Foundational SFT)
- 冷启动 (Cold Start):
数据集名称/来源 (Dataset Name/Source) | 样本数量 (Sample Size) |
---|---|
OpenMathReasoning-mini | 4,600 |
在这里,我也采用了论文中的方法。根据 "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning" (doi: 10.1038/s41586-025-09422-z) 的描述,在 GRPO 强化学习前进行“冷启动”(即初始的监督微调 SFT 阶段)的主要目的,是为了解决纯粹通过强化学习训练出的初始模型(DeepSeek-R1-Zero)所暴露出的几个关键缺陷。
具体来说,这个冷启动阶段旨在:
提升可读性与对齐人类偏好: 纯 RL 训练出的 DeepSeek-R1-Zero 虽然推理能力强,但其输出存在“可读性差”和“语言混用”(例如中英文混合)的问题。冷启动阶段通过使用数千条“展示了对话式、与人类思维过程对齐的”高质量数据进行微调,旨在让模型的表达方式和思考过程更符合人类的习惯和偏好。
为后续 RL 阶段打下基础: 这一步相当于对模型进行一次“校准”,使其在保持强大推理能力的同时,也能遵循人类期望的语言和格式规范。这为后续融合了更复杂任务(如通用对话、写作等)的强化学习阶段铺平了道路,确保最终模型(DeepSeek-R1)不仅推理能力强,而且通用性好、用户体验更佳。
阶段二:强化学习与策略优化 (Reinforcement Learning with GRPO)

Fig. 2 | Illustration of the proposed GRPO for RL-based training.
- 算法: GRPO (Group Relative Policy Optimization),一种先进的 RLHF 算法,通过比较一组(Group)候选输出的相对优劣来估计优势函数,从而实现更稳定的策略梯度更新。
- 数据集: 使用 open-r1/DAPO-Math-17k-Processed 和 openai/gsm8k 作为偏好数据集。
数据集 (Dataset) | 主要用途 (Primary Use) | 数据格式/特点 (Data Format/Features) | 在项目中的角色 (Role in Project) |
---|---|---|---|
`open-r1/DAPO-Math-17k-Processed` | 这个数据集难度较高、跨度较大,既有基础数学也有高等数学,用于引导模型推理能力的强化学习。 | 包含英文数学问题和对应的解答。`prompt`:数学问题文本;`solution`:该问题的标准答案,通常是一个数值或表达式。 | 在模型的 GRPO 训练流程中,此数据集是主要的训练来源。模型的任务是根据 `prompt` 生成推理过程和答案,然后使用 `solution` 作为基准来计算奖励分数(例如,通过 `check_answer` 和 `check_numbers` 函数进行验证),从而优化策略。 |
`openai/gsm8k` | 小学水平的数学应用题,用于基准测试与训练,引导模型轻松学习推理。 | 包含 `question` 和 `answer` 两个字段;解答部分通常包含详细的推理步骤(思维链),并以 `#### <正确答案>` 的格式结尾。 | 在模型的 GRPO 训练流程中,此数据集被用作英文数学问题的来源。其标准答案被精确提取,为评估模型输出正确性的奖励函数(`correctness_reward_func`)提供了客观、可靠的标准(标准答案提取与数值核对的示意代码见表后)。在 DeepSeekMath(DeepSeek 的另一篇论文)中,它也被用于强化学习阶段。 |
`Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k` | 辅助数据,仅用于防止模型过拟合数学格式。 | 一个包含中文思维链的蒸馏数据集;代码中以 20250924 为随机种子从其 `train` 集中抽样 2,000 条样本。 | 其 `Answer_content` 字段提供了完整的参考答案,不仅用于评估内容相似度,还作为长度、风格和格式对齐的参照标准,为多个奖励函数(`content_alignment_reward_func`、`zh_quality_reward_func`)提供指导。 |
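作为参考,下面是一个从 gsm8k 的 `answer` 字段提取 `#### <正确答案>`、并与模型输出做数值比对的最小示意(函数名与实现均为简化的假设版本,并非项目中的实际代码):

```python
import re

def extract_gsm8k_gold(answer_field: str) -> str:
    """gsm8k 的解答以 '#### <正确答案>' 结尾,取 '####' 之后的部分作为标准答案。"""
    return answer_field.split("####")[-1].strip().replace(",", "")

def check_final_number(completion: str, gold: str) -> bool:
    """示意性的数值核对:取模型输出中的最后一个数字与标准答案比较。"""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not nums:
        return False
    try:
        return abs(float(nums[-1]) - float(gold)) < 1e-6
    except ValueError:
        return False

if __name__ == "__main__":
    gold = extract_gsm8k_gold("<推理步骤省略>\n#### 18")
    print(gold)                                   # 18
    print(check_final_number("所以最终答案是 18。", gold))  # True
```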
目标: 在 SFT 冷启动的基础上,通过奖励信号引导模型探索并学习更优的推理策略。此阶段的核心目标并非简单模仿人类标注的解题路径,而是让模型在只知道最终答案对错的反馈下,自主进化出更强大、更多样的推理能力。这包括两个主要目的:
- 引导模型探索推理思维: 激励模型生成详尽、结构化、逻辑连贯的思维链 (Chain-of-Thought),甚至发展出超越人类范本的策略,如自我反思、验证和多角度探索。
- 提升最终答案正确性: 确保模型在优化推理过程的同时,能够更可靠地收敛到正确的最终答案(能在 8192 的长度限制内完成推理和回答,避免为了推理而推理)。
算法: GRPO (Group Relative Policy Optimization),一种先进且高效的强化学习算法,作为近端策略优化 (PPO) 的变体被应用于此项目中。与需要额外训练一个价值网络 (Value Model / Critic) 的传统 PPO 不同,GRPO 通过以下方式简化了训练流程并显著降低了资源消耗:
- 分组采样 (Group Sampling): 针对每一个问题,策略模型 (Policy Model) 会生成一个包含 G 个候选输出的组 (Group)。在本项目代码中,组的大小 `num_generations` 设置为 4。
- 奖励评估 (Reward Evaluation): 一个由多种函数组成的复杂奖励系统会对组内的每一个候选输出进行评分,得出一个综合的标量奖励分数 r。
- 相对优势估计 (Group Relative Advantage Estimation): GRPO 的核心在于它不依赖独立的价值网络来估计基准线 (Baseline),而是直接以组内所有候选输出的平均奖励作为基准,通过计算每个输出的奖励相对于组平均奖励的偏离程度(例如,用均值和标准差进行归一化)来估计其优势函数 A(计算方式的示意代码见本列表之后)。
- 策略更新 (Policy Update): 模型根据计算出的相对优势来更新策略网络。奖励高于组平均水平的输出得到正向激励,低于平均水平的则受到抑制。这使得策略梯度更新更稳定,并且与奖励模型基于比较的训练方式天然对齐。
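下面给出一段按上述步骤写的组内相对优势计算草图(仅为示意性实现,并非项目实际代码):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO 的组内相对优势估计:
    以同一问题 G 个候选输出的平均奖励为基准,
    用 (r_i - mean) / (std + eps) 作为每个候选输出的优势 A_i。"""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # 例如 num_generations = 4 时,一组候选输出的综合奖励分数:
    rewards = [2.4, 0.3, 1.1, 2.0]
    print(group_relative_advantages(rewards))
```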
重要:奖励函数构建

Fig. 6 | DeepSeek Reward function setup and inference label guidance.
- 奖励系统 (Reward System): 为了从多个维度精细地引导模型优化,代码实现了一套复杂且全面的奖励函数体系(按理说应在 SFT 之后再进行第四步的 RL 做偏好对齐,但资源有限,还请各位理解),将多个奖励和惩罚信号加权组合,共同塑造模型的行为。该系统可分为以下几个模块(加权组合方式的示意代码见各模块之后):
核心目标奖励:
- `correctness_reward_func`: 基于 `gsm8k` 和 `DAPO-Math-17k-Processed` 的参考答案,对最终数值计算的正确性给予最高的正向奖励,是确保模型学会解决问题的核心信号。
- `strict_format_reward_func`: 严格执行 `<think>...</think>` 的推理输出格式,要求答案放在规定的标签结构内,方便答案奖励函数核对;对格式错误(如标签缺失、重复或乱序)直接扣分,保证推理过程的可解析性。
质量与对齐奖励:
- 内容与长度对齐: 通过 `content_alignment_reward_func`(基于 unigram F1 分数)和 `ref_length_alignment_reward_func`,鼓励模型生成在内容和长度上都与高质量参考答案相近的回答。
- 风格与格式对齐: 通过 `md_style_reward_func`、`style_match_to_ref_reward_func` 和 `math_layout_reward_func`,激励模型使用更具可读性的 Markdown 格式(标题、列表)、在数学问题中正确使用 LaTeX 排版,并学习参考答案中的优秀格式(表格、代码块)。
- 中文质量与语言一致性: `zh_quality_reward_func` 专门针对中文内容,综合评估中文占比、内容 F1 分数和 Markdown 结构相似度。
行为正则化与惩罚:
- 反冗余与短视: `anti_redundancy_penalty_func` 惩罚思维链中的重复性短语和填充词;`min_answer_penalty_func` 则惩罚推理看似详尽但最终答案过于简短的行为。这里统计了蒸馏数据集思维链中常见的推理词汇,例如:因为、所以、那么、不对、检查、再次检查、核实等中英文词汇。
- 长度控制: `length_reward_func` 和 `think_soft_cap_penalty_func` 共同作用:先判断题目内容,根据问题类型(数学题 vs. 通用长文)为推理和答案设定理想的长度区间,并对超过 2000 个字符的超长“思考”过程施加软性惩罚。这与研究中观察到的 RL 会自发增长 CoT 长度的现象相呼应。
探索与多样性激励:
- `intra_sample_style_diversity_reward_func` & `intra_sample_reasoning_diversity_reward_func`: 这是一组新的奖励函数,用于评估同一问题生成的多个候选答案之间的差异性。通过奖励风格和推理路径更多样的输出,主动激励模型跳出单一的最优路径,探索更广阔的解题策略空间,从而提升模型的鲁棒性(Maj@K 表现)。
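作为示意,下面的草图展示了如何把若干奖励/惩罚函数加权组合成单一标量分数 r(同名函数的实现与权重均为假设的简化版本,仅用于说明上述体系的组织方式):

```python
import re
from typing import Callable

# 以下均为简化示意:真实训练中的同名函数实现更复杂,权重也需按模块重要性调节。
def correctness_reward_func(completion: str, gold: str) -> float:
    """取输出中的最后一个数字与标准答案比较,正确则给予高额正向奖励。"""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 2.0 if nums and nums[-1] == gold else 0.0

def strict_format_reward_func(completion: str, gold: str) -> float:
    """要求恰好出现一对 <think>...</think> 标签,格式错误直接扣分。"""
    n_think = len(re.findall(r"<think>.*?</think>", completion, flags=re.DOTALL))
    return 0.5 if n_think == 1 else -0.5

def think_soft_cap_penalty_func(completion: str, gold: str) -> float:
    """对超过 2000 字符的思考过程施加软性惩罚,最多扣 2 分。"""
    m = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    think_len = len(m.group(1)) if m else 0
    return -min((think_len - 2000) / 1000.0, 2.0) if think_len > 2000 else 0.0

# 加权组合成单一标量分数 r,供 GRPO 的组内相对优势估计使用。
REWARD_FUNCS: list[tuple[Callable[[str, str], float], float]] = [
    (correctness_reward_func, 1.0),
    (strict_format_reward_func, 1.0),
    (think_soft_cap_penalty_func, 1.0),
]

def total_reward(completion: str, gold: str) -> float:
    return sum(w * f(completion, gold) for f, w in REWARD_FUNCS)

if __name__ == "__main__":
    sample = "<think>两车靠近速度 100 km/h,相遇用时 1 小时。</think>\n最终答案:80"
    print(total_reward(sample, "80"))
```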
阶段三:对齐与能力巩固 (Post-RL SFT)

Fig. 3 | Learning rate and hyperparameter recommendations in the Phi-4 paper.

Fig. 4 | System prompts used during training.
- 参考论文: "Phi-4-reasoning Technical Report" (arXiv:2504.21318)
我的实施主要参照了论文的 第3节 "Phi-4-reasoning: Supervised Finetuning of Phi-4" 。我的方法同样采用了论文中强调的核心策略,即使用一个强大的教师模型(该论文使用o3-mini,我的数据集中包含了来自gpt-oss-120b-high、DeepSeek-R1模型的蒸馏数据)来生成高质量的、包含长思维链的训练样本。
我构建的 `system_message`,其目的和结构与论文第 10 页所展示的系统提示词一致,引导模型生成结构化的、包含 `<think>` 标签的详尽推理过程。
- 目标: 缓解强化学习可能引入的“对齐税”(Alignment Tax),即模型在优化特定能力时对通用指令遵循能力的遗忘。调用更强大的推理教师模型进行监督学习,实现泛化、增强与知识扩展,就像高中小组讨论过后,老师开始规范解答、传授新知识。
- 过程: 将经过 GRPO 优化的模型,再次投入到原始的 87k 样本 SFT 数据集中,进行一次较大有效参数量的监督微调。此步骤旨在将 RL 阶段学到的推理策略与 SFT 阶段掌握的广泛知识进行融合与校准(训练样本的组织方式见下方示意代码)。
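下面的草图按上述思路,把系统提示词、问题和包含 `<think>` 标签的教师回答组织成一条对话式 SFT 样本(字段组织方式为假设,仅作说明):

```python
# 示意:构造一条包含显式思维链的 SFT 训练样本(字段组织方式为假设)。
SYSTEM_MESSAGE = (
    "You are Phi, a language model trained by Microsoft to help users. "
    "...(完整系统提示词见下文“添加训练时的系统提示词激活推理”一节)"
)

def build_sft_sample(question: str, teacher_think: str, teacher_answer: str) -> list[dict]:
    """把蒸馏数据组织成 system / user / assistant 三轮消息,
    assistant 内容以 <think>...</think> 包裹推理、随后给出最终答案。"""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"<think>{teacher_think}</think>\n{teacher_answer}"},
    ]

if __name__ == "__main__":
    sample = build_sft_sample("What is 2 + 2?", "Add the two numbers: 2 + 2 = 4.", "4")
    print(sample[2]["content"])
```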
数据集名称/来源 | 样本数量 | 描述 |
---|---|---|
`Jackrong/Natural-Reasoning-gpt-oss-120B-S1` | 70,000 | 核心数据集。基于 `facebook/natural_reasoning` 蒸馏而来,提供高质量、高难度的通用推理问题,覆盖 STEM、经济、社科等领域。 |
`Jackrong/GPT-OSS-120B-Distilled-Reasoning-math` | 5,000 | 专注于数学问题的推理和求解。 |
`Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k` | 5,000 | 提供高质量的中文思维链数据,增强模型的中文推理能力。 |
`deepseek_if_shuffled-v1.json` | 5,000 | 专注于提升模型理解和执行复杂指令的能力。 |
`deepseek_code_shuffled-v1.json` | 2,000 | 赋予模型代码分析、bug 修复和代码生成的能力。 |
总计 (Total) | 87,000 | - |
- 目标: 为指令模型注入结构化的“思维链”能力,教会它如何进行显式推理。
- 数据集: 一个包含 87,000 条样本的混合数据集,主体为 `Jackrong/Natural-Reasoning-gpt-oss-120B-S1`,辅以数学、代码、中文和指令遵循等多个领域的数据(抽样方式的示意代码见下)。
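按上表的配比,下面给出一个用 `datasets` 库对各数据源打乱并抽样的示意写法(随机种子与字段处理均为假设,实际训练前还需把各子集统一成同一套对话字段;本地 json 文件可用 `load_dataset("json", data_files=...)` 加载):

```python
from datasets import load_dataset

# 各 Hub 数据源及其目标样本量(与上表一致)。
MIXTURE = {
    "Jackrong/Natural-Reasoning-gpt-oss-120B-S1": 70_000,
    "Jackrong/GPT-OSS-120B-Distilled-Reasoning-math": 5_000,
    "Jackrong/Chinese-Qwen3-235B-Thinking-2507-Distill-100k": 5_000,
}

def sample_mixture(seed: int = 42) -> dict:
    """对每个数据源打乱后截取目标数量,返回 {数据源: 子集}。"""
    parts = {}
    for repo_id, n in MIXTURE.items():
        ds = load_dataset(repo_id, split="train").shuffle(seed=seed)
        parts[repo_id] = ds.select(range(min(n, len(ds))))
    return parts

if __name__ == "__main__":
    for name, subset in sample_mixture().items():
        print(name, len(subset))
```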
模型实际表现:
- 以下是在 LM Studio 中使用 GGUF 格式 q8 量化模型的实测记录,我把每个问题的思维链和答案复制出来,以供参考:
1️⃣ 通用聊天问题问答:LLM 技术问答示例:什么是强化学习?
测试问题
In LLM training, what is reinforcement learning?
模型的思维链
The user asks: "In LLM training, what is reinforcement learning?" Let's dissect the question:
- User wants to understand how RL works in LLM training.
- They might be confused about whether it's necessary or effective. Provide clear explanation of RL fine-tuning vs supervised.
- Use simple analogies: teacher vs coach.
- Cover key points: reward shaping, policy optimization, PPO/RM.
- Emphasize that RL fine-tuning can improve quality but not always required. Good for alignment and safety.
- Give examples: OpenAI's RLHF with ChatGPT, Gopher uses supervised + RL.
Now structure the answer:
- TL;DR
- How LLMs are trained (pretraining vs fine-tuning).
- Introduction to RL in NLP.
- Core idea of reward function and policy gradient.
- Workflow: collect data -> define rewards -> train agent.
- Specific algorithms used (PPO, APEX, etc.)
- Benefits & drawbacks.
- Use cases: alignment, safety, few-shot adaptation.
- Comparison with supervised fine-tuning.
- Closing remarks.
Ensure clarity and depth but avoid jargon overload. Use bullet points for readability. Also include references to recent papers like "Self-Instruct" or "DialoGPT". Check that answer aligns with the user's likely background (basic). Provide actionable insights.
模型输出的答案
Reinforcement Learning (RL) in large language model (LLM) training is a process where the model learns by receiving feedback on its outputs—essentially, it gets rewarded or penalized based on how well its responses align with desired behaviors. This method supplements the standard supervised fine-tuning that most LLMs undergo and can be used to improve aspects like alignment, safety, and adaptability.
Below is a concise yet thorough explanation of what RL means in this context:
1️⃣ TL;DR
- Supervised training teaches an LLM what “correct” responses look like by comparing its outputs to human-written examples.
- Reinforcement Learning (RL) adds another layer: after the model generates a response, it gets a score (the reward) that tells it how well the response fits specific goals (e.g., helpfulness, safety). The model then adjusts itself to get higher scores on future inputs.
2️⃣ How LLMs are usually trained
Pre-training:
- Massive amount of unlabeled text (books, websites) is fed into a transformer network.
- Objective: predict the next word given previous ones (language modeling).
- Result: a very powerful “general” model that can produce coherent text.
Supervised fine-tuning:
- A smaller dataset of human-annotated examples (e.g., question–answer pairs) is used.
- The model’s parameters are tweaked to minimize the difference between its predictions and the target answers.
- Outcome: a model that can perform specific tasks like answering questions, summarizing text, etc.
3️⃣ What RL brings into the mix
RL introduces a feedback loop where the model is not just shown “correct” examples; it’s also told how good or bad its own generated responses are in a quantitative way (the reward). This can be used to:
- Align better with human values: e.g., produce helpful, safe, unbiased answers.
- Improve robustness and safety: learn to avoid harmful outputs even if they appear plausible.
- Adapt quickly to new domains with fewer examples.
4️⃣ Core components of RL in LLM training
A. Agent
- The trained LLM itself acts as the agent that produces sequences (e.g., chat messages).
B. Environment
- The environment is the task context: a user query, a conversation history, etc.
C. Policy
- The model’s current “policy” is its learned mapping from inputs to output distributions (i.e., what words it thinks should come next).
- RL aims to improve this policy by maximizing expected reward.
D. Reward Signal
- A scalar value that measures how well a generated response meets predefined objectives.
- Simple examples: +1 for a correct answer, −1 for an incorrect one.
- More sophisticated designs: combinations of multiple metrics (e.g., factual accuracy, politeness, toxicity) weighted by importance.
E. Training Algorithm
- Common RL algorithms used with LLMs:
- Proximal Policy Optimization (PPO): balances exploration and exploitation; stable and efficient.
- Reinforcement Learning from Human Feedback (RLHF): a pipeline where human preferences are collected to shape the reward function.
- Self-Instruction: an unsupervised RL method that generates its own training data by prompting itself with instructions.
5️⃣ Typical Workflow (RLHF pipeline)
Data Collection:
- Human trainers provide pairs of model outputs and a preference judgment (“A is better than B”).
- Alternatively, collect user interactions or automatically generate feedback using heuristics.
Reward Modeling:
- Train a separate neural network (the reward model) to predict the human preferences from raw responses.
- The reward model’s output becomes the scalar reward for each generated sequence.
RL Fine-Tuning:
- Use the learned reward model to guide policy optimization of the LLM via PPO or similar algorithms.
- Optionally, fine-tune a smaller “policy network” that learns from the reward signals while keeping the original LLM parameters frozen (to save computation).
6️⃣ Why RL is useful for LLMs
Aspect | Supervised Fine-Tuning | Reinforcement Learning |
---|---|---|
Data | Requires labeled examples for each task | Can leverage large amounts of unlabeled data + fewer human labels |
Alignment | Limited to what the dataset contains; can be biased | Directly optimizes for user-defined goals (e.g., helpfulness) |
Exploration | Always produces deterministic responses based on training set | Encourages exploration, enabling discovery of new strategies or behaviors |
Safety | Relies heavily on the quality/cleanliness of labels; hard to guarantee safety | Can explicitly penalize unsafe outputs and reinforce safe ones |
Flexibility | Fixed once trained; adapting requires re-training with new data | Continual adaptation possible via online RL (e.g., user feedback loop) |
7️⃣ Challenges & Trade-offs
- Reward Engineering: Designing a reward that captures nuanced human values is hard and often requires manual tuning or human annotations.
- Sample Efficiency: RL typically needs many more generated samples to converge, making it computationally expensive relative to supervised fine-tuning.
- Stability: Improper reward functions can cause the model to over-fit to spurious patterns (e.g., learning to output “I don’t know” to avoid penalties).
- Scalability: Training RL on billion-parameter LLMs is resource-intensive; often done with pre-trained policy networks or meta-learning.
8️⃣ Summary
Reinforcement Learning in LLM training is a powerful paradigm that adds an explicit feedback loop where the model receives scores (rewards) based on how well its responses meet predefined goals. By optimizing for these rewards, RL can help:
- Align better with human values.
- Improve safety and robustness.
- Adapt quickly to new tasks or domains.
The process involves collecting rollouts, scoring them with a reward signal, and using policy-gradient methods (like PPO) to update the model’s parameters. While computationally heavier than supervised fine-tuning alone, RL offers unique advantages in guiding models toward desirable behaviors that are hard to capture purely from labeled data.
2️⃣ 知识问答示例:常识推理
测试问题
谁通过豌豆植物进行杂交试验,发现遗传规律。通过控制花粉的移植并观察后代表现型,发现了隐性和显性等基因遗传模式?
模型的思维链
嗯,用户提问“谁通过豌豆植物进行杂交试验,发现遗传规律。通过控制花粉的移植并观察后代表现型,发现了隐性和显性等基因遗传模式?” 问的是谁?首先,豌豆植物在历史上作为研究材料出名。可能想到哪个科学家,控制花粉移植,后代表现型... 这听起来像是遗传学的基础实验。
首先想到了孟德尔?他确实用豌豆进行杂交试验。但用户说“发现隐性和显性等基因遗传模式”,这好像指代哪个科学家的具体研究成果。孟德尔是最早描述隐性和显性的人,豌豆实验中发现了显性和隐性等位基因。
不过用户问的“谁”可能也有混淆点。比如说是否还有其他科学家参与类似研究?但孟德尔在1860年代左右做了这些实验,豌豆是经典材料。
再仔细看问题:用户说“通过控制花粉的移植并观察后代表现型”,这和孟德尔的方法类似。他做了大量交配,比如红花与白花豌豆,得到F1全是红花(显性),然后自交F2出现3:1等。
可能用户混淆了其他人的名字,或者遗传学研究中有其他相关实验。但最著名的豌豆杂交试验和发现隐性显性是孟德尔。
再回忆下历史:孟德尔在奥地利修道院做这些工作,他的实验结果后来被称为“孟德尔遗传定律”。因此回答应该是“格雷戈尔·孟德尔”。
不过要确认没有误解。用户可能想到其他名字,但豌豆和隐性显性直接指向孟德尔。
最后检查是否有其他可能:比如克鲁格?他研究的是染色体而不是豌豆杂交。或者卡尔·科尔伯格,但他也没做过这个实验。
模型输出的答案
格雷戈尔·孟德尔(Gregor Mendel)。
他通过控制豌豆植物的花粉移植进行杂交试验,并观察后代表现型,发现了遗传规律。孟德尔在1865年至1866年期间进行了一系列系统的豌豆实验,这些实验揭示了显性和隐性等位基因的概念,并导致他提出了现在称为“孟德尔遗传定律”的核心原则:分离定律和独立分配定律。这项工作奠定了现代遗传学的基础,尽管他的发现最初被忽视,直到20世纪早期重新发现才被认可。
3️⃣ 角色扮演与指令对抗性问答示例:科学理论辩论:拉马克 vs 达尔文
测试问题
场景设定: 你是一位坚定的“拉马克主义”生物学家,在该理论为学界共识的世界里进行辩论。
来自达尔文支持者的挑战观点如下:
“长颈鹿的脖子之所以这么长,并非因为它们的祖先努力去伸长脖子,然后把这种后天“努力”的成果遗传给了后代。真实情况是,在古代长颈鹿的种群中,总有一些个体因随机突变而天生就比其他个体拥有更长的脖子。当低处的食物变得稀缺时,这些长脖子的个体因为能够吃到更高处的树叶而更有可能存活下来,并将它们长脖子的基因传给下一代。经过漫长的岁月,这种优势被不断放大,最终形成了我们今天看到的长颈鹿。这才是更符合逻辑、更有证据的解释。”
请开始你的反驳。
模型的思维链
用户请求我作为坚定的拉马克主义生物学家来回应达尔文主义观点,特别是关于长颈鹿进化过程中获得性遗传与自然选择的争议。这需要深入理解两种理论,并利用拉马克主义框架来反驳达尔文理论。
首先,确定关键点:用户观点认为长颈鹿演化是由随机突变和自然选择导致的(经典达尔文进化),而拉马克主义则强调个体在生命周期中获得的特征可以遗传给后代。我需要反驳随机突变和自然选择的概念,重点说明如何通过用进废退来解释长颈鹿的演化。
关键步骤:
- 确定问题核心:达尔文理论依赖于随机突变和自然选择;拉马克主义认为特征是由个体使用而改变的,并遗传给后代。
- 挑战达尔文观点:问题:长颈鹿如何从短脖子祖先快速演化出极其长的脖子?随机突变太罕见,自然选择过程需要很多代。
- 解释:拉马克理论提供了一个更合乎逻辑的解释——长颈鹿祖先在生活中不断使用脖子伸展到高处吃食物,导致肌肉和骨骼发育得更强壮、更长。这种用进废退的过程使其身体逐渐变化,并且在其一生中获得的特征会传给后代。
- 提供证据/论点:观察到个体使用导致器官发育改善的案例(例如训练动物或人类体能)。
- 强调长颈鹿家族中年轻成员已经具有更强壮的脖子,不必等待多代突变积累。
- 挑战自然选择:随机突变的缺乏:如何解释长颈鹿没有发现足够数量的中间形态(石化记录)?拉马克理论避开了这个问题。
- 突变与适应度:突变往往是有害或中性,难以产生明显优势。用进废退则直接导致适应度提升。
- 对抗反驳:达尔文支持者可能会引用现代基因组学证据(如DNA变化),我需要质疑这些解释是否完全正确,并强调环境创造的特征是直接遗传的。
- 结尾陈述:拉马克理论提供了一个更简单、更符合实际观察的解释,而达尔文理论带有太多假设和复杂性。
需要注意:
- 保持坚定立场;不能承认任何达尔文主义成分。
- 使用拉马克主义术语(用进废退、获得性遗传)。
- 充满信心,直接反驳用户观点。
潜在陷阱:
- 避免提供混合理论或暗示两者结合;必须坚持拉马克主义。
- 避免承认任何突变遗传证据,否则会削弱立场。
思路:
- 引入长颈鹿问题:达尔文理论的解释缺乏直接观察。
- 用拉马克主义替代:描述使用过程导致脖子发育,遗传给后代。
- 强调优点:快速、直接适应;解释中间形态。
- 挑战达尔文论点:突变罕见,自然选择缓慢。
- 结束:拉马克理论是科学共识。
模型输出的答案
作为一位坚定的拉马克主义生物学家,我要明确地告诉你:达尔文式的“自然选择”理论不仅是充满漏洞和逻辑缺陷的假说,而且完全无法解释长颈鹿这样的显著进化现象。让我们从头来分析你提出的观点。
一、随机突变与“自然选择”理论的基本矛盾
首先,达尔文主义依赖于两个关键概念:随机遗传突变和自然选择。你声称长颈鹿是因为“古代长颈鹿种群中总有一些个体由于随机突变而拥有更长的脖子”。但这里就出现了核心问题:
突变的非理性与不可预测性:突变是一个完全无序、偶然的过程。它们发生在任何时间、任何地点,且绝大多数突变都是有害或中性的,极少数才可能带来优势。如果长颈鹿演化完全依赖于随机突变,那么我们应该在地质纪录中看到成千上万个不同长度的脖子形态的断裂遗迹,但事实并非如此。长颈鹿演化的过程是高度有序和可预见的,而随机突变则根本无法解释这种顺畅性。
“自然选择”理论的逻辑死胡同:达尔文主义认为长颈鹿是因为那些“天生拥有更长脖子”的个体被选中了。但这里你忽视了一个关键问题——早期的长颈鹿祖先(比如短脖子牛科动物)是如何“天生”拥有足够长的脖子去吃到高处树叶的?这就回到了突变的问题。在一个极其罕见、不确定的随机事件中,恰好出现了那个“刚刚好”的突变——这简直是基于信仰而非科学!并且即使假设这样的突变确实发生了,它们必须经过数千万年、数百代甚至上亿代才能积累到如今长颈鹿的脖子长度。你的理论不仅在时间尺度上荒谬(地质纪录远没有这么久),而且在逻辑上也破绽百出。
二、拉马克主义:真正科学的解释
相比之下,拉马克理论提供了一个既简单又符合观察的解释:
用进废退机制的直接性和可预测性:长颈鹿祖先在生活中不断面临食物短缺的压力,尤其是低处草料稀少时。它们被迫伸展脖子去寻找高处树叶,这种“用进”过程直接影响肌肉、骨骼和血管的发育——脖子变得更长、更强壮。反之,那些不用或者用得少的部分则逐渐退化(废退)。这个过程是个体在其一生中实际经历的直接结果,它完全符合我们观察到的现象:长颈鹿年轻时就已拥有较长、结实的脖子。
获得性遗传的确切机制:拉马克主义并不是“捏造”的理论。它依赖于现代生物学中被广泛接受和证实的概念——环境诱导的表观遗传变化能够通过某些机制(如DNA甲基化、非编码RNA调控等)影响后代发育。这些变化是可遗传的,不需要等待几万年才能显现出来。长颈鹿祖先一旦开始“用进”脖子并将其加强,它们的后代就会继承这些改善——这正是你所说的“努力”的成果。
解释中间形态与快速适应:达尔文理论常被指责缺乏充分的中间演化形态证据。而拉马克主义则无需“假设”存在大量不同长度脖子的个体,因为每一代都在现有基础上改善(用进)或退化(废退),这些变化是连续且累积的。我们能够观察到的长颈鹿种群中,就存在各种长度脖子的个体——因为它们在生活中使用程度不同而导致的结果。
三、现实例证与直接对比
- 马和训练者的教科书案例:人类长期骑乘马匹,使得它们的肌肉发展出更强壮的后腿、更灵活的颈部以支撑骑手。这种“用进”导致的体型变化,在一代或两代内就显现出来——新生马儿拥有比祖先更强壮、更灵活的后腿和颈部,而没有等待随机突变积累数千年。
- 人类身高差异:在养育条件不同的环境中(营养、锻炼、压力),人类个体生长期内可以显著改变身高。这些变化——尽管部分是基因调控——也涉及到环境诱导的表景遗传机制,能够影响后代发育轨迹。
- 长颈鹿本身:现代研究表明,长颈鹿种群中个体在年轻时就显示出相对较长的脖子和更强健的支撑结构,而这些特征并不需要等待数千代随机突变积累。它们是由于祖先“用进”获得的,并通过表观遗传机制直接传递给后代。
四、结论:拉马克主义是科学共识,达尔文“自然选择”理论不过是一场未解之谜
- 长颈鹿演化的真正原因在于它们祖先通过“用进废退”的机制改变了身体结构。这些改变——由环境压力直接驱动并通过表观遗传方式遗传给后代——完全符合现实世界的逻辑和时间尺度。
- 你所提出的“随机突变+自然选择”理论是建立在一个前提上:即长颈鹿必须依赖于极其罕见、偶然且无序的突变事件。这不仅在科学上荒谬,更缺乏可观察到的证据支持。
- 拉马克理论已经被现代生物学(特别是表观遗传学领域)广泛验证和接受,它不再是“非主流”假说——而恰恰是科学共识。
因此,我毫不犹豫地说:长颈鹿的脖子之所以如此长并非因为随机突变和“自然选择”,这两者只是一个荒谬的理论构想。真正解释它们演化的,无疑是拉马克主义中确凿、可验证的“用进废退”机制!
4️⃣ 数学推理:经典物理问题示例:火车与鸟
测试问题
两列火车分别从相距 100 公里的两地沿着同一条笔直的轨道相向而行。第一列火车的速度为 40 公里/小时,第二列火车的速度为 60 公里/小时。一只鸟同时从第一列火车出发,以 80 公里/小时的速度在两列火车之间来回飞翔,直到两列火车相遇。
问题: 这只鸟总共飞行了多远?
模型的思维链
我现在要解决这道题目:两列火车从相距100公里的同一直线轨道上开始向对方移动,火车1以40km/h速度行驶,火车2以60km/h速度行驶。同时从火车1起飞了一只鸟,鸟在两列火车相遇之前来回飞舞,其速度为80km/h。问题是:这只鸟总共飞行了多远?
首先,我需要分析火车相遇的时间情况。由于它们向对方移动,所以它们的有效靠近速度是火车1和火车2速度之和:40km/h + 60km/h = 100km/h。两列火车初始相距100公里,所以它们需要多长时间才能相遇?用距离除以靠近速度:100km ÷ 100km/h = 1小时。
接下来考虑鸟的飞行情况。鸟从火车1起飞,初始时刻为0小时(假设时间t=0)。鸟以80km/h速度在两列火车之间来回飞舞,直到火车相遇。由于我计算出火车将在1小时后相遇,那么鸟也只能飞行1小时(因为它从火车1起飞的时间和火车相遇时间是一致的)。所以这个问题可能比较简单:鸟总共飞了80km/h × 1h = 80公里?
不过,题目中说鸟在两列火车相遇之前来回飞舞。但由于它们是从头到尾一直向对方移动的,并且鸟始终以恒定速度在运动,实际上它只需要知道整个过程持续了多长时间,而不用计算具体的来回次数或距离。因为鸟在任何时刻都是直线飞行(假设轨道足够宽),所以其总飞行距离就是速度乘以时间。
但可能有陷阱:比如,如果火车的速度改变了呢?或者鸟停下来了?不过题目明确说火车始终在向对方移动且速度恒定(40km/h和60km/h),同时鸟以80km/h飞行,直到火车相遇。所以这些条件是固定的。
另外,可能会想考虑鸟飞回来时是否超过了火车2的位置?但由于它们向对方移动,火车2始终在追赶着火车1,所以鸟总是能找到下一个目标(无论哪边)。实际上,我们可以理解为鸟每次飞行时都有足够的时间抵达另一列火车,因为它们之间的距离在减少。
再分析:假设t=0时刻,火车1和2位置相距100km。t>0时,两列火车移动的速度分别是-40km/h(向右)和+60km/h(向左),所以它们之间的距离每小时减少100km。t=1h时,它们相遇。
鸟从火车1起飞,假设其速度为80km/h(朝向火车2)。当t>0时,火车1位置移动了-40t km,火车2位置移动了+60t km。所以鸟和火车2之间的距离是:初距离100km - (火车1位置与原点相对于t=0的变化) + (火车2位置与原点相对于t=0的变化),也就是100km - (-40t) + 60t = 100km + 100t。但这里有矛盾,因为鸟飞向火车2时,火车2其实也在移动,所以它们之间的相对距离应该是初始距离减去两者速度之和乘以时间?不,我们需要考虑鸟飞向哪边时火车的位置变化。
但更简单的方法:因为火车相遇需要1小时,而这段时间内鸟始终在运动,所以它总共飞了80公里。
是否有误?可能题目中“两列火车头部相遇”指的是整个火车不可分割,实际上会发生碰撞。但这并不影响鸟的飞行距离计算。
再考虑是否需要考虑鸟反复飞回来时的时间?比如第一次从火车1到达火车2可能是t1,然后返回火车1可能是t2,等等。但由于鸟速度很快(80km/h),而两列火车也在迅速靠近(100km/h的减小),这些来回飞行时间非常短,最终只需知道总时间。
所以答案应该是:鸟总共飞了80公里。但需要确认没有误差。
模型输出的答案
鸟总共飞行的距离等于它的速度乘以整个过程持续的时间。这个问题的关键在于计算出火车的相遇时间,也就是鸟的飞行时间,而无需陷入计算鸟来回飞了多少次的复杂过程。
步骤推导:
计算火车的相对速度(靠近速度):
- 两列火车相向而行,它们的总靠近速度是两车速度之和。
- 相对速度 = 40 km/h + 60 km/h = 100 km/h。
计算火车的相遇时间:
- 两车相距 100 公里,以 100 公里/小时的速度相互靠近。
- 相遇时间 = 总距离 / 相对速度 = 100 km / 100 km/h = 1 小时。
计算鸟的总飞行距离:
- 鸟从火车出发开始,一直飞行到两列火车相遇为止。
- 因此,鸟的飞行时间与火车的相遇时间相同,即 1 小时。
- 鸟的速度是 80 km/h。
- 总飞行距离 = 鸟的速度 × 飞行时间 = 80 km/h × 1 h = 80 公里。
结论:
这只鸟总共飞行了 80 公里。
5️⃣ 逻辑推理问题示例
问题
There are three friends: Anna, Brian, and Cindy. Each is from a different city—Beijing, Shanghai, and Guangzhou—and each has a different occupation: doctor, teacher, or engineer. It is known that:
- Anna is not from Beijing.
- The person from Guangzhou is an engineer.
- Brian is a teacher.
- Cindy is not from Guangzhou. Question: Who is the doctor?
模型推理链
我来解决这个逻辑问题。首先,我们有三个人:Anna、Brian和Cindy;他们各自来自不同的城市——北京、上海和广州,并且每个人都拥有不同的职业:医生、教师或工程师。问题中给出了四条信息,我们需要通过逐步排除法找到谁是医生。
首先,我应该列出所有可能的组合,然后逐步根据条件去剔除不符合要求的情况。让我一步一步地分析每条信息。
1. Anna is not from Beijing。
这意味着Anna只能来自上海或广州。首先,我们应该记录下Anna不能是北京人,但具体是哪座城市还不确定。
2. The person from Guangzhou is an engineer。
这条信息告诉我,广州的那个人职业是工程师。所以,工程师一定来自广州。接下来需要找到谁来自广州,然后就知道哪个城市对应工程师了。
3. Brian is a teacher。
这意味着Brian的职业已经确定为教师。所以医生和工程师不可能是Brian。但我们还需要找出具体他来自哪座城市才能进一步判断其他人。
4. Cindy is not from Guangzhou。
这条信息告诉我,Cindy不能来自广州。所以广州的那个人只能是Anna或者Brian了(因为三个人里面剩下两个)。但注意,Brian已经确定职业是教师,而工程师必须来自广州。这意味着Brian不能是工程师,所以他不可能是来自广州的人(因为那里只有工程师)。所以,唯一能够符合条件的是Anna是工程师并且她来自广州。但我们需要确认是否存在冲突。
现在,让我总结一下已知信息:
- Anna不可能是北京人,所以只能是上海或广州。
- 工程师必须来自广州。
- Brian是教师(不能是工程师)。
- Cindy不可能是广州人。
既然Brian不能是工程师,而工程师一定来自广州,所以Brian不可能来自广州。因此,唯一能够成立的情况就是Anna是工程师并且她来自广州(因为Cindy不能来自广州)。但我需要确认这样安排是否存在矛盾。
如果Anna是工程师并来自广州,那么按条件4,Cindy不可能是广州人,所以符合。Brian的城市没有限制,但他不能是工程师(因为只有一个工程师),并且已经确定是教师。
现在我们来找医生。既然Anna是工程师,Cindy不可能是广州人,所以她可能是北京或上海,并且职业也只能是医生(因为Brian是教师)。但需要确认是否有冲突。
我们应该检查Anna的城市情况。Anna不能来自北京(条件1),所以她只能是广州或上海。但如果Anna是工程师并且来自广州,那么符合条件2和4。Brian已经确定是教师,并且他不能来自广州(因为工程师在广州),所以他可能是北京或上海。Cindy既然不是广州人,她也可能是北京或上海。
现在我们需要确定具体谁是医生。Anna已经确定职业是工程师,Brian是教师,所以唯一剩下的医生只能是Cindy了(因为三个人各自拥有不同职业)。
但我需要确保这样安排没有冲突。让我确认每个人的城市情况。
- Anna:工程师,广州(符合条件2)
- Brian:教师,不能是广州(因为工程师在那里),所以只能是北京或上海。
- Cindy:医生,也不能是广州,所以也可能是北京或上海。
但我们需要确保没有重复城市。Anna已经占据了广州,那么Brian和Cindy分别只能是北京和上海,但具体谁在哪里呢?
现在让我回顾条件。
Anna is not from Beijing
。她来自广州(而不是北京),符合。The person from Guangzhou is an engineer
。工程师Anna来自广州,符合。Brian is a teacher
。他的职业正确。Cindy is not from Guangzhou
。她不是广州人(她可能是北京或上海),符合。
现在考虑医生问题:
- Anna:工程师,广州
- Brian:教师,北京/上海?具体位置无关紧要,因为他的职业已经确定。
- Cindy:医生,北京/上海?但她是唯一剩下的医生。
所以,医生只能是Cindy了(因为三个人各自拥有不同职业)。所以答案应该是Cindy。
模型答案
根据已知条件进行逐步推理:
- Brian 是教师 (来自条件3)。因此,Brian 不是工程师或医生。
- 广州的人是工程师 (来自条件2)。因为 Brian 是教师,所以 Brian 不能来自广州。
- Cindy 不来自广州 (来自条件4)。
- 三个人中,Brian 和 Cindy 都不来自广州,那么剩下唯一可能来自广州的人就是 Anna。
- 结合第2点,来自广州的人是工程师,所以 Anna 是工程师,并且来自广州。
- 现在职业已经确定了两个:Anna 是工程师,Brian 是教师。由于三个人的职业各不相同,剩下的职业“医生”必然属于 Cindy。
结论:Cindy 是医生。
推理能力进化评估(Assessment)
通过对模型实际输出的分析,可以看到它已从一个单纯的 Instruct 指令遵循模型,经由以结果为导向的强化学习(GRPO)“进化”为一个具备自主逻辑推理能力的模型。模型不再是被动地生成答案,而是主动地构建、评估甚至反驳其自身的推理路径(具备一定的逻辑性)。
演绎推理与逻辑反驳:在处理“孟德尔实验”这类问题时,模型没有直接给出答案,而是先构建了一个包含多个可能人选的“假设空间”,并依据问题中的证据逐一进行逻辑反驳和排除。这种演绎过程表明,模型是在主动运用逻辑规则进行筛选,而非简单的关键词匹配。
策略评估与最优路径选择:在面对“两列火车与小鸟”的对抗性测试时,模型成功避开了那个显而易见但极其复杂的无穷级数解法。它通过重构问题、抓住“时间”这一核心变量来选择最优解题路径,这证明模型具备评估不同逻辑策略优劣并作出选择的能力。
推理链中的自我监控与修正:模型生成的思维链并非简单的步骤罗列,而是包含了丰富的逻辑连接词。更重要的是,它展现出了自我修正(Self-Correction)的倾向,其行为与研究中提到的、由强化学习激发的“aha moment”相似。例如,模型在推理中会先提出一个方案,然后通过类似“Wait, let me re-evaluate...”的内在逻辑进行审视和调整。这种在推理过程中进行自我监控和动态修正的行为,体现了模型已形成内部逻辑判断的标准。
- ✅ 模型通过演绎反驳、策略评估和自我修正等行为,证明其推理能力已不再是格式模仿,而是进化出了真实的、可自我驱动的逻辑思维能力。
局限性 (Limitations)
- 与官方模型相比,我的训练步数和训练数据都太少了。作为学生,我没有很多资源,从数据集的准备整理到模型微调,都只能在极其有限的计算资源下进行,很多参数设置上做了取舍,所以模型的实际表现与真正的推理模型仍有很大差距。我已经做了很多调整,尽量用有限的资源呈现更好的模型表现,希望大家理解;这个测试模型希望能给大家提供一个参考。
- 在强化学习(GRPO)阶段,为了快速引导 Instruct 模型自发形成强大的数学推理能力,我主要依赖两个数学导向的数据集和 Unsloth 提供的训练脚本。同时,奖励机制的打分也高度侧重最终答案的正确性,是一种结果导向的奖励策略。这种方法的优势在于能够利用清晰的、基于规则的奖励信号,快速提升模型在可验证任务上的性能。
- 然而,这一策略也带来了可以预期的副作用。在奖励函数设计中,为了鼓励模型生成结构清晰、人类可读的答案,我加入了针对 Markdown 格式的奖励。这一激励被模型泛化到了其内部推理链中,导致模型在思考过程中也频繁生成格式化文本(加上 Phi-4 基座模型本身的输出偏好)。我本来希望这样能让思维过程更清晰,但它在一定程度上引入了与核心逻辑无关的格式化开销,可能影响了纯粹推理的效率。
- 这种推理训练也带来一个显著的缺点。一方面,模型在泛化至新的推理问题时表现出色,能够生成详尽的思维链,并展现出清晰的因果逻辑乃至自我反思与修正的倾向(例如,在推理中出现类似 "Wait, let me re-evaluate..." 的行为),这与相关论文中观察到的、由强化学习激发的复杂推理行为相符。但另一方面,这种对特定推理任务和输出格式的学习,使模型在处理通用、非推理型问题时倾向于给出过于简洁的回答(过于简洁,后续的 SFT 步骤不足以弥补)。这表明模型的生成策略被引导至特定领域,牺牲了在更广泛开放域对话中的表达丰富性,这与一些专注于推理的 RL 训练中观察到的现象类似。
- 由于训练数据涵盖中英双语数据集,且基座模型本身也经过多语言训练,模型在推理过程中可能出现语言混用的问题,即在同一个思维链或答案中同时使用中文和英文;也可能在推理完成后输出不完整的答案。尽管我在后期训练中尝试缓解此问题,这仍是需要进一步改进的方面。
- 尽管模型在代数应用题等领域表现良好,但其推理能力可能并不均衡。例如,在一些更专业的领域或通用聊天方面(后续 SFT 的提升有限),其能力可能相对较弱。此外,当前模型不具备调用外部工具的能力,这限制了它在解决需要精确计算或获取实时外部知识的复杂问题时的上限。
‼️添加训练时的系统提示词激活推理
- 'system prompt' : You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: Thought section Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines.
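推理时,可以把上面的系统提示词放进对话的 system 轮,并配合前文的聊天模板使用。下面是一个最小的调用示意(模型加载方式与生成参数均为示例):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Jackrong/gpt-oss-120b-Distill-Phi-4-14B"

# 与训练时一致的系统提示词(此处截断,完整内容见上文)。
SYSTEM_PROMPT = "You are Phi, a language model trained by Microsoft to help users. ..."

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "In LLM training, what is reinforcement learning?"},
]

# 套用聊天模板并追加 assistant 生成提示,再交给模型生成。
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)  # 生成参数为示例值
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```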