VoladorLuYu's Collections: Super Alignment
Trusted Source Alignment in Large Language Models
Paper • 2311.06697 • Published • 10
Diffusion Model Alignment Using Direct Preference Optimization
Paper • 2311.12908 • Published • 47
SuperHF: Supervised Iterative Learning from Human Feedback
Paper • 2310.16763 • Published • 1
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Paper • 2311.15657 • Published • 2
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
Paper • 2311.13231 • Published • 26
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
Paper • 2310.03739 • Published • 21
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Paper • 2309.00267 • Published • 47
Aligning Language Models with Offline Reinforcement Learning from Human Feedback
Paper • 2308.12050 • Published • 1
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
Paper • 2309.10150 • Published • 24
Secrets of RLHF in Large Language Models Part I: PPO
Paper • 2307.04964 • Published • 28
Efficient RLHF: Reducing the Memory Usage of PPO
Paper • 2309.00754 • Published • 13
Aligning Large Multimodal Models with Factually Augmented RLHF
Paper • 2309.14525 • Published • 30
Nash Learning from Human Feedback
Paper • 2312.00886 • Published • 14
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Paper • 2312.00849 • Published • 8
Training Chain-of-Thought via Latent-Variable Inference
Paper • 2312.02179 • Published • 8
Reinforcement Learning from Diffusion Feedback: Q* for Image Search
Paper • 2311.15648 • Published • 1
OneLLM: One Framework to Align All Modalities with Language
Paper • 2312.03700 • Published • 20
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Paper • 2204.05862 • Published • 2
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper • 2403.05135 • Published • 42
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715 • Published • 60
Dataset Reset Policy Optimization for RLHF
Paper • 2404.08495 • Published • 8
Learn Your Reference Model for Real Good Alignment
Paper • 2404.09656 • Published • 82
RLHF Workflow: From Reward Modeling to Online RLHF
Paper • 2405.07863 • Published • 67
Iterative Reasoning Preference Optimization
Paper • 2404.19733 • Published • 47
Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
Paper • 2406.11817 • Published • 13
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Paper • 2406.18629 • Published • 41
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Paper • 2312.08935 • Published • 4
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
Paper • 2407.00782 • Published • 23
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
Paper • 2410.18451 • Published • 13