Agustín Piqueres Lajarín

plaguss

AI & ML interests

None yet

Organizations

Hugging Face, SomosNLP, Hugging Face H4, Argilla, Blog-explorers, Hugging Face TB Research, Argilla Explorers, distilabel-internal-testing, Data Is Better Together, LLHF, SLLHF, Hugging Quants, argilla-internal-testing, Argilla Warehouse, Hugging Face FineVideo, smol-explorers, Hugging Face Science, Data Is Better Together Contributor

plaguss's activity

upvoted an article 4 days ago
upvoted an article 13 days ago
Process Reinforcement through Implicit Rewards

By ganqu
16
reacted to lewtun's post with 🔥 19 days ago
This paper (HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (2412.18925)) has a really interesting recipe for inducing o1-like behaviour in Llama models:

* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify the correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine, where there are lots of aliases)
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait", etc., of the kind one sees in o1
* Use the resulting data for SFT & RL
* Use sparse rewards from GPT-4o to guide RL training. They find that RL gives an average boost of ~3 points across medical benchmarks, and that SFT on this data alone already gives a strong improvement.

Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
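
The sample-then-verify loop and the sparse reward described in the post can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `collect_trajectory`, `sparse_reward`, the strategy names, and the toy sampler and verifier are all hypothetical stand-ins. In the actual recipe the sampler would be the Llama model and the verifier would be GPT-4o, and the GPT-4o step that rewrites the attempts into a single stream is left out here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Illustrative placeholder names for the search strategies (an assumption).
STRATEGIES = ["backtrack", "explore_new_path", "verify", "correct"]

# A sampler takes (question, previous attempts, strategy) and returns (cot, answer).
Sampler = Callable[[str, Sequence[str], str], Tuple[str, str]]
# A verifier takes (answer, reference) and says whether they match semantically.
Verifier = Callable[[str, str], bool]


def collect_trajectory(
    question: str,
    reference: str,
    sample_cot: Sampler,
    is_correct: Verifier,
    max_attempts: int = 4,
) -> Optional[List[str]]:
    """Sample CoTs until the verifier accepts one.

    Returns every attempt in order (failed ones included, so they can later be
    merged into one stream), or None if no attempt was accepted.
    """
    attempts: List[str] = []
    for i in range(max_attempts):
        strategy = STRATEGIES[i % len(STRATEGIES)]
        cot, answer = sample_cot(question, tuple(attempts), strategy)
        attempts.append(cot)
        if is_correct(answer, reference):
            return attempts
    return None


def sparse_reward(answer: str, reference: str, is_correct: Verifier) -> float:
    """Reward only on the final answer: 1.0 if the verifier accepts it, else 0.0."""
    return 1.0 if is_correct(answer, reference) else 0.0


# Toy stand-ins so the sketch runs without any model or API access.
def toy_sampler(question: str, previous: Sequence[str], strategy: str) -> Tuple[str, str]:
    # Pretends the model finds the right answer on its third attempt.
    answer = "myocardial infarction" if len(previous) == 2 else "unknown"
    return f"[{strategy}] attempt {len(previous) + 1}", answer


def toy_verifier(answer: str, reference: str) -> bool:
    # Alias-aware matching, which plain exact match would miss.
    aliases = {"heart attack": {"heart attack", "myocardial infarction"}}
    return answer in aliases.get(reference, {reference})


if __name__ == "__main__":
    trajectory = collect_trajectory(
        "What is the diagnosis?", "heart attack", toy_sampler, toy_verifier
    )
    print(trajectory)
    print(sparse_reward("myocardial infarction", "heart attack", toy_verifier))
```

Passing the verifier in as a callable reflects the post's closing point: the same loop applies to any domain where a problem's answer can be checked, whether by a model judge or by a programmatic check.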
liked a Space 22 days ago
liked a Space about 1 month ago