Introduction to MedVideoCap-55K: A New, Large-Scale, High-Quality Medical Video-Caption Pair Dataset
This blog post introduces MedVideoCap-55K, the first large-scale, high-quality medical video dataset with detailed captions, comprising over 55,000 clips across diverse medical scenarios. Built on this dataset, we developed MedGen, a medical video generation model that achieves strong performance in both visual quality and medical accuracy.
Our models, data, and code are open-sourced; links are provided at the end of this post.
In recent years, although end-to-end text-to-video (T2V) generation technology has made significant progress in general domains, it often produces serious errors when applied to the medical field, such as anatomical distortions and misaligned surgical steps. These issues make it difficult to meet the professional requirements of clinical training, surgical simulation, and patient education. The core bottleneck lies in the lack of large-scale, high-quality medical video data, which prevents video generation models from replicating the expertise of experienced physicians—accurately capturing key anatomical details while rigorously preserving the temporal logic of medical procedures.
MedVideoCap-55K was created to address this challenge. It is the first dataset built specifically for medical video generation, containing over 55,000 finely annotated samples. On this dataset we trained MedGen, a medical video generation model that rivals commercial systems and marks a significant step toward professional-grade video generation in the medical domain.
A comparison between the MedVideoCap-55K dataset and existing medical video datasets. MedVideoCap-55K demonstrates significant improvements in data scale, video resolution, annotation granularity, and diversity of video types.
1. Data Foundation: Construction of the MedVideoCap-55K Dataset
To drive breakthrough advancements in medical video generation, we developed MedVideoCap-55K, the first large-scale, high-quality, and finely annotated medical video dataset. We initially collected approximately 25 million raw medical-related video resources from platforms such as YouTube, covering more than ten real-world medical scenarios—including clinical diagnosis and treatment, medical imaging, surgical education, and scientific animation. Through a rigorous process of selection, cleaning, and fine-grained annotation, we ultimately curated 55,000 high-quality video samples, forming the MedVideoCap-55K dataset. This dataset provides a solid data foundation for the development of multimodal medical video generation models.
1.1. Data Collection and Annotation: A Four-Step Pipeline to Ensure Medical Accuracy
Data Collection, Processing, and Annotation Workflow.
We designed a four-stage data processing workflow to ensure that each data sample faithfully reflects real-world medical environments:
Initial Semantic Filtering: Using dual validation with a medical keyword dictionary (anatomical terms, surgical terminology, etc.) and a medical-text classifier, we filtered the 25 million raw videos down to 37,000 candidate medical videos.
Channel-Level Mining: For verified medical videos, we retrospectively crawled their publishers' channels, retrieving an additional 140,000 hours of related content and expanding the dataset's coverage of medical scenarios to 98.6%.
Frame-Level Quality Control: A frame-level classifier, built on CLIP and trained with human-annotated data, analyzed video content at 1 frame per second; only segments with more than 6 consecutive seconds of medically relevant content were retained, yielding 111,000 candidate clips (a sketch of this step follows the list).
Multimodal Annotation with GPT-4o: To support structured learning of medical knowledge by generation models, each video clip was paired with a detailed text description. We uniformly sampled multiple frames from each clip and combined them with the video title, description, and speech transcript; this multimodal input was fed to GPT-4o to generate comprehensive captions covering environmental setup, anatomical structures, pathological features, and procedural standards.
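For intuition, the snippet below sketches the frame-level relevance check from the quality-control step. It substitutes an off-the-shelf zero-shot CLIP classifier for the purpose-trained frame classifier described above, and the prompt texts and 0.5 threshold are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of 1 FPS frame-level filtering with zero-shot CLIP.
# The actual pipeline uses a dedicated trained classifier; this stand-in
# only illustrates the idea.
import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = [
    "a frame from a medical procedure or clinical setting",  # assumed prompt
    "an unrelated, non-medical video frame",                 # assumed prompt
]

def relevance_per_second(path):
    """Return one medical-relevance flag per second of video (1 FPS sampling)."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    flags, second = [], 0
    while True:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(second * fps))
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        inputs = processor(text=labels, images=rgb, return_tensors="pt", padding=True)
        with torch.no_grad():
            probs = model(**inputs).logits_per_image.softmax(dim=-1)
        flags.append(probs[0, 0].item() > 0.5)  # assumed threshold
        second += 1
    cap.release()
    return flags

def keep_segments(flags, min_len=6):
    """Start/end seconds of runs of at least `min_len` consecutive relevant frames."""
    segments, start = [], None
    for i, relevant in enumerate(list(flags) + [False]):  # sentinel closes a run
        if relevant and start is None:
            start = i
        elif not relevant and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments
```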
1.2. Enhancing Data Quality: A Four-Step Strategy to Ensure Clean Training Data
Data Retention and Filtering Changes at Each Stage of the Pipeline.
Although clearly irrelevant and low-resolution videos had already been filtered out, issues such as black borders, burned-in subtitles, and camera shake remained, degrading data quality and the performance of downstream models. To address this, we designed a four-stage data filtering pipeline to improve the reliability of the dataset:
- Black Border Removal: Videos with black borders were identified and filtered out using OpenCV-based detection (see the sketch after this list).
- Subtitle Occlusion Filtering: EasyOCR was applied to detect subtitles, and videos with excessive subtitle occlusion were excluded.
- Aesthetic Quality Screening: The LAION aesthetic predictor was used to evaluate video frames and remove low-quality videos affected by blurriness, overexposure, or heavy watermarks.
- Technical Score Filtering: The DOVER video quality assessor was employed to exclude videos with severe camera shake or visual artifacts.
- Joint Filtering: The DOVER technical scores and LAION aesthetic scores from the two stages above were combined to ensure both technical and visual quality while retaining visually simple yet clinically valuable medical content.
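To make the first two filters concrete, here is a minimal sketch of black-border and subtitle detection with OpenCV and EasyOCR. The thresholds and the drop policy are our illustrative assumptions; the exact settings used in the pipeline may differ.

```python
# Illustrative sketch of the black-border and subtitle filters, assuming
# simple heuristics (exact thresholds are not specified in the paper).
import cv2
import easyocr
import numpy as np

def black_border_fraction(frame, thresh=16):
    """Fraction of frame area taken up by near-black rows/columns at the edges."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    dark_rows = gray.mean(axis=1) < thresh
    dark_cols = gray.mean(axis=0) < thresh

    def edge_run(mask):
        # Length of the consecutive dark run starting at one edge.
        run = 0
        for dark in mask:
            if not dark:
                break
            run += 1
        return run

    top, bottom = edge_run(dark_rows), edge_run(dark_rows[::-1])
    left, right = edge_run(dark_cols), edge_run(dark_cols[::-1])
    h, w = gray.shape
    interior = max(h - top - bottom, 0) * max(w - left - right, 0)
    return 1.0 - interior / (h * w)

reader = easyocr.Reader(["en"], gpu=False)  # loads the OCR model once

def subtitle_area_fraction(frame, min_conf=0.4):
    """Approximate fraction of the frame covered by detected text boxes."""
    h, w = frame.shape[:2]
    covered = 0.0
    for box, _text, conf in reader.readtext(frame):
        if conf < min_conf:
            continue
        xs, ys = [p[0] for p in box], [p[1] for p in box]
        covered += (max(xs) - min(xs)) * (max(ys) - min(ys))
    return covered / (h * w)

# Example policy (our assumption): drop a clip if sampled frames average
# more than 5% black-border area or more than 10% subtitle coverage.
```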
Sample from MedVideoCap-55K. Each data point consists of a medical video clip, a brief caption, and a detailed caption.
With this data refinement pipeline, we finalized the MedVideoCap-55K dataset, which exhibits the following key characteristics:
Balanced Duration: All video clips are between 6 and 10 seconds long, providing a complete view of medical procedures while maintaining efficiency for model training.
Standardized Resolution: All videos are formatted at a mainstream resolution of 720×480, balancing visual clarity and computational efficiency.
Domain Coverage: The dataset spans diverse medical use cases, including:
- Clinical practice (26.12%)
- Medical education (55.93%)
- Medical imaging (2.39%)
- Public health education (9.41%)
- Medical animation (6.15%)
Quality Metrics: Aesthetic scores for all videos fall between 4 and 7 (out of 10), and most DOVER scores exceed 0.5 (out of 1.0), significantly surpassing existing medical video datasets.
The MedVideoCap-55K dataset provides a solid foundation for medical video generation research, combining both scale and professional depth. It addresses critical pain points in the field, such as data scarcity and coarse annotations. Looking ahead, researchers can build even larger medical video datasets based on our data pipeline, and further extend MedVideoCap-55K to other tasks, such as image-to-video (I2V) generation.
2. Data Validation: The Birth of the MedGen Medical Video Generation Model
To verify two key hypotheses:
- (1) the scarcity of medical video data is the core bottleneck limiting medical video generation research, and
- (2) releasing the MedVideoCap-55K dataset will significantly drive progress in the field,
we conducted systematic experiments centered on MedVideoCap-55K, providing solid empirical support and a research foundation for the domain.
Based on the open-source general-domain video generation model HunyuanVideo and our MedVideoCap-55K dataset, we trained the first general-purpose medical video generation model, MedGen. MedGen demonstrated strong competitiveness in both automated benchmark evaluations and human assessments.
MedGen’s Performance on the Med-Vbench Automated Evaluation Benchmark.
Comparison of MedGen with Closed-Source Commercial Video Generation Models in Human Evaluation.
- In automated benchmark evaluations, MedGen outperformed 15 mainstream open-source video generation models and produced fewer deformation errors in its generated videos.
- In human evaluations, MedGen demonstrated performance comparable to commercial closed-source video generation models in terms of medical factual consistency, text alignment, and visual quality.
Commercial closed-source systems currently lead most open-source models in overall performance, largely owing to advantages in training resources. However, MedGen's strong results on medical-specific metrics significantly narrow this gap, demonstrating that high-quality specialized data and effective transfer learning can make open-source models highly competitive in professional domains.
Furthermore, model size is not the sole determinant of performance: the smaller Wan2.1-T2V-1.3B achieved results comparable to the larger CogVideoX-5B across multiple metrics, underscoring the critical importance of domain adaptation and training-data quality in medical video generation.
3. Value Realization: Exploring MedGen's Multi-Scenario Applications
As the first medical video generation model, MedGen not only ensures the accuracy of medical content but also achieves high-quality video generation, demonstrating broad application potential in surgical simulation, medical education, scientific animation, and remote consultation.
3.1. Data Augmentation for Downstream Tasks
Supervised medical video models play a crucial role in scenarios such as surgical workflow recognition, lesion detection, and diagnostic assistance, helping clinicians improve both efficiency and accuracy. However, these tasks commonly face scarce labeled data, imbalanced samples, and privacy constraints, all of which limit model generalization and real-world effectiveness. Augmenting training sets with high-quality synthetic data has therefore become a key approach to improving downstream supervised models.
We therefore explored MedGen as a data augmentation tool across several medical video classification tasks: combining MedGen-generated videos with the original training data had a clear positive impact on downstream task performance, as sketched below.
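A minimal PyTorch sketch of this setup, with dummy tensors standing in for extracted video features and labels (the dataset sizes and mixing ratio below are placeholders, not the exact training recipe):

```python
# Illustrative augmentation setup: mix synthetic MedGen clips into the
# real training set before fitting the downstream classifier.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for real and MedGen-generated video features with class labels.
real_train = TensorDataset(torch.randn(100, 512), torch.randint(0, 3, (100,)))
synthetic = TensorDataset(torch.randn(40, 512), torch.randint(0, 3, (40,)))

augmented = ConcatDataset([real_train, synthetic])
loader = DataLoader(augmented, batch_size=8, shuffle=True)
# The downstream classifier (e.g., for MedVidCL) then trains on `loader`
# exactly as it would on real data alone.
```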
Comparison of MedGen and HunyuanVideo as Data Augmentation Sources on Three Supervised Medical Video Downstream Tasks.
Models trained with MedGen-generated data show significant improvements across metrics on three benchmark datasets: MedVidCL, HyperKvasir, and SurgVisDom. This demonstrates that MedGen, as a high-quality, domain-specific data augmentation source, can substantially enhance medical video understanding.
3.2. Medical Science Communication, Medical Education, and Patient Simulation
Applications of MedGen in Patient Simulation, Public Health Education, Surgical Training, Medical Animation, and Medical Imaging.
MedGen can generate diverse, high-quality video content covering public health education, surgical simulation, medical imaging presentation, teaching materials, and patient interaction simulation, greatly enriching the forms medical video resources can take. In scenarios where real video data is scarce, privacy requirements are strict, or data collection is costly, MedGen's visually coherent and medically relevant videos serve as an important aid for medical content creation, simulation, and healthcare communication, further expanding the application boundaries and innovation potential of medical video technology.
MedVideoCap-55K and MedGen have been open-sourced on GitHub and Hugging Face, and the accompanying paper is available on arXiv.
Links:
- GitHub: https://github.com/FreedomIntelligence/MedGen
- Paper: https://arxiv.org/pdf/2507.05675
- Dataset: https://huggingface.co/datasets/FreedomIntelligence/MedVideoCap-55K
- Model: https://huggingface.co/FreedomIntelligence/MedGen
- Blog: https://huggingface.co/blog/wangrongsheng/medvideocap-55k