Medical Finetuning Datasets
Viewer • Updated • 134M • 148 • 1
Note: Huge (130M). LEVEL 1
FremyCompany/AGCT-Dataset
Viewer • Updated • 421k • 40 • 17
Note: This dataset contains 422,070 short, computer-generated definitions for SnomedCT concepts, covering domains such as diseases, procedures, drugs, and anatomy. The definitions were generated by prompting the OpenAI Turbo model, a variant of GPT 3.5, with a high-quality verbalization of the SnomedCT relationships of the concept being defined. Good for continued pretraining (CP). LEVEL 1
uiyunkim-hub/pubmed-abstract
Viewer • Updated • 27.7M • 625 • 4
Note: Base-vocabulary pretraining. LEVEL 1
dmariko/clinical-trials-xml-2018-2024
Viewer • Updated • 628k • 59 • 4
Note: Good long clinical-trial texts for CP. LEVEL 1
darknight054/pubmed_clean
Viewer • Updated • 4.36M • 442 • 1
Note: Best one for continued pretraining.
Jonas7/pubmed_full
Viewer • Updated • 6.62M • 116
Note: Anchor, positive, and negative triplets. Might be useful in downstream RL tasks.
rntc/mm-icd-notes
Viewer • Updated • 122k • 216 • 6
Note: 122k clean notes, but a lot is missing from the original 330k. Good for SFT and the later stages of ICD code prediction.
yanjx21/PubMed
Viewer • Updated • 19.5M • 36 • 1
Note: 81% of entries are under 2,000 characters and 99% are under 4,000. A good list of abstracts. 19M entries? Not sure how they were collected.
TomTBT/pmc_open_access_section
Viewer • Updated • 7.22M • 91 • 3
Note: Full PubMed texts, with commercial, non-commercial, and other subsets. Too big to handle.
EleutherAI/pile
Updated • 1.27k • 434
Note: The big Pile: 800 GB of text data. Suitable for general LLM training, but not specific to the medical domain.
starmpcc/Asclepius-Synthetic-Clinical-Notes
Viewer • Updated • 158k • 492 • 89
Note: Good for SFT, and also for CP, but the data quality needs a manual check. I have doubts.
ravistech/clinical-trial-llm-Open_condition_Cleaned_dup_NCT_ID
Viewer • Updated • 277k • 85 • 1
Note: Good-quality clinical trial data with inclusion and exclusion criteria.
laion/medrXiv-pdf
Viewer • Updated • 57.6k • 47 • 3
Note: Needs a PDF-to-text conversion script, but 80 GB of text is too much. Could perhaps be converted into a knowledge graph for insights. A RAG system built on this would be exceptional.
FreedomIntelligence/medical-o1-reasoning-SFT
Viewer • Updated • 90.1k • 9.11k • 764
FreedomIntelligence/DxBench
Viewer • Updated • 2.79k • 100 • 8
FreedomIntelligence/Disease_Database
Viewer • Updated • 19.2k • 173 • 25
FreedomIntelligence/CoD-PatientSymDisease
Viewer • Updated • 76.4k • 206 • 14
FreedomIntelligence/XMedbench
Viewer • Updated • 21.3k • 278 • 11
stellalisy/MediQ_AskDocs_preference
Viewer • Updated • 193k • 46 • 2
empirischtech/med-qa-orpo-dpo
Viewer • Updated • 357k • 39 • 5
hw-hwei/MedThoughts-8K
Viewer • Updated • 7.72k • 20 • 3
FremyCompany/BioLORD-Dataset
Viewer • Updated • 25M • 15 • 18
FreedomIntelligence/ApolloMoEDataset
Viewer • Updated • 293k • 319 • 5
rntc/open-clinical-cases-pubmed
Viewer • Updated • 913k • 80 • 2
blue-blues/medical_cot
Viewer • Updated • 403k • 74 • 6
xz97/MedInstruct
Viewer • Updated • 216 • 107 • 19
dmis-lab/meerkat-instructions
Viewer • Updated • 440k • 102 • 4
dvssr/umls
Viewer • Updated • 150M • 280 • 4
adlbh/umls-concepts
Viewer • Updated • 475k • 33 • 4
Jaafer/cleaned_umls_corpus
Viewer • Updated • 1.46M • 37 • 1
Jaafer/biomedical_question_one_disease_word
Viewer • Updated • 1.19k • 32 • 1
Jaafer/ICD_ontology_dataset
Viewer • Updated • 15.1k • 26 • 1
Jaafer/disease_ontology_dataset
Viewer • Updated • 11.5k • 29 • 1
epfl-llm/guidelines
Viewer • Updated • 38k • 1.08k • 130
MedRAG/textbooks
Viewer • Updated • 126k • 462 • 40
RecurvAI/Recurv-Clinical-Dataset
Viewer • Updated • 12.6k • 97 • 4
GeneralReasoning/GeneralThought-430K
Viewer • Updated • 431k • 5.49k • 48
FreedomIntelligence/Medical-R1-Distill-Data
Viewer • Updated • 22k • 498 • 51
UCSC-VLAA/m23k-tokenized
Viewer • Updated • 23.5k • 189 • 4
UCSC-VLAA/MedReason
Viewer • Updated • 32.7k • 1.46k • 70
openlifescienceai/medmcqa
Viewer • Updated • 193k • 14.7k • 161
bigbio/med_qa
Updated • 2.49k • 100
leowei31/MIMIC_IV_lab_test_individual
Viewer • Updated • 633k • 13 • 1
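
The note on yanjx21/PubMed quotes length figures (81% of entries under 2,000 characters, 99% under 4,000). A minimal sketch of how such figures could be computed is shown here. The helper name `length_coverage` and its parameters are hypothetical, not taken from any of the datasets above. It accepts any iterable of strings and can stop after a fixed number of records, so it could also be run on a sample of the collections described as too big to handle.

```python
from itertools import islice
from typing import Dict, Iterable, Optional, Sequence


def length_coverage(
    texts: Iterable[str],
    thresholds: Sequence[int] = (2000, 4000),
    max_records: Optional[int] = None,
) -> Dict[int, float]:
    """For each threshold, return the fraction of texts strictly shorter than it.

    Only the first `max_records` texts are read (all of them if None),
    so a lazily streamed dataset is never fully loaded into memory.
    """
    counts = {t: 0 for t in thresholds}
    total = 0
    for text in islice(texts, max_records):
        total += 1
        n_chars = len(text)
        for t in thresholds:
            if n_chars < t:
                counts[t] += 1
    if total == 0:
        # No records seen: report zero coverage rather than dividing by zero.
        return {t: 0.0 for t in thresholds}
    return {t: counts[t] / total for t in thresholds}


# Toy data standing in for abstracts (lengths 100, 1999, 2000, 5000).
sample = ["a" * 100, "b" * 1999, "c" * 2000, "d" * 5000]
print(length_coverage(sample))  # {2000: 0.5, 4000: 0.75}

# Possible use with the Hugging Face `datasets` library in streaming mode.
# The split name "train" and the column name "text" are assumptions and
# should be checked against the dataset card before running:
#
#   from datasets import load_dataset
#   stream = load_dataset("yanjx21/PubMed", split="train", streaming=True)
#   print(length_coverage((row["text"] for row in stream), max_records=100_000))
```

Measuring on a capped sample gives only an estimate, and the first records of a stream are not necessarily representative of the whole dataset.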