Learning Smooth Reward Models with Temporal Difference for LLM RL and Inference
Dan Zhang
zd21
Models (18)
zd21/qwen2.5-7b-td2
zd21/qwen2.5-7b-baseline-prm
zd21/DeepSeek-TD1-PRM
zd21/GLM-Z1-9B-0414-TDRM • 9B • 1
zd21/GLM4-9B-0414-TDRM • 9B
zd21/Qwen2.5-1.5B-TDRM
zd21/Qwen2.5-0.5B-TDRM
zd21/Qwen2.5-Math-7B-TDRM
zd21/Qwen2.5-Math-1.5B-TDRM
zd21/DS-R1-Distill-Qwen-7.5B-TDRM
Datasets (27)
zd21/DataSciBench • 88 • 1
zd21/TDRM-3-step-TD • 1.41M • 6
zd21/TDRM-2-step-TD • 1.41M • 30
zd21/TDRM-1-step-TD • 1.41M • 9
zd21/ReST-MCTS_SciGLM-6B_Self-Rewarding-DPO_2nd • 1 • 8
zd21/ReST-MCTS_SciGLM-6B_ReST-MCTS_Policy_2nd • 40.9k • 7
zd21/ReST-MCTS_SciGLM-6B_ReST-EM-CoT_2nd • 28.9k • 8
zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_Self-Rewarding-DPO_2nd • 1 • 4
zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-MCTS_2nd • 26k • 6
zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-EM-CoT_2nd • 36.6k • 7