Improve model card for Vision-SR1-7B with metadata and links


This PR significantly improves the model card for the Osilly/Vision-R1-7B model, which corresponds to the Vision-SR1-7B model introduced in the paper "Self-Rewarding Vision-Language Model via Reasoning Decomposition".

The changes include:

  • Adding library_name: transformers to the metadata. This enables the automated "Use in Transformers" code snippet, based on the Qwen2_5_VLForConditionalGeneration architecture declared in config.json and the transformers dependency mentioned in the GitHub README (both the metadata change and the resulting loading code are sketched after this list).
  • Adding pipeline_tag: image-text-to-text to the metadata, which makes the model discoverable under the appropriate task on the Hugging Face Hub (e.g., https://huggingface.co/models?pipeline_tag=image-text-to-text) and matches its functionality as a vision-language model.
  • Integrating the paper's abstract and key information from the GitHub README to provide a comprehensive "About Vision-SR1" section.
  • Including direct links to the official paper (https://huggingface.co/papers/2508.19652) and the GitHub repository (https://github.com/zli12321/Vision-SR1).
  • Embedding relevant figures from the GitHub repository to visually explain the method and dataset.
  • Adding links to associated models and datasets on the Hugging Face Hub, as found in the GitHub README.
  • Providing installation and training setup instructions based on the GitHub repository.
  • Noting that the BibTeX entry given for the Vision-SR1 paper in the original GitHub README is incorrect, and including the correct citation for the EasyR1 codebase.
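
For reference, the two metadata additions can also be applied programmatically. Below is a minimal sketch using the huggingface_hub client, assuming a write token and the repo id quoted above; opening a pull request via create_pr mirrors how this change is proposed here.

```python
from huggingface_hub import metadata_update

# Metadata fields this PR adds to the model card's YAML front matter.
new_metadata = {
    "library_name": "transformers",        # enables the "Use in Transformers" snippet
    "pipeline_tag": "image-text-to-text",  # lists the model under the image-text-to-text task
}

# Open a pull request against the model repo with the updated metadata
# (repo id as referenced above; requires a token with write access).
metadata_update(
    "Osilly/Vision-R1-7B",
    new_metadata,
    repo_type="model",
    create_pr=True,
)
```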
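Once library_name: transformers is set, the Hub can surface a loading snippet along the following lines. This is a sketch of standard Qwen2.5-VL usage rather than the exact snippet the Hub will generate; the image URL and generation settings are illustrative, and qwen_vl_utils is an extra helper package used in the upstream Qwen2.5-VL examples.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper from the Qwen2.5-VL examples

model_id = "Osilly/Vision-R1-7B"  # repo id referenced in this PR

# Load the checkpoint with the architecture detected in config.json.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Illustrative image/question; replace with your own inputs.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/sample.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and preprocess the visual inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```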

Please review these updates.
