Update README.md
## III. Evaluation Results

Our II-Medical-8B-1706 model achieved a 46.8% score on [HealthBench](https://openai.com/index/healthbench/), a comprehensive open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date. We provide a comparison to models available in ChatGPT below.

<!--  -->

Detailed results for HealthBench can be found [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1).

<!--  -->

We also evaluate on nine other medical QA benchmarks: MedMCQA, MedQA, PubMedQA, medical-related questions from MMLU-Pro, small QA sets from The Lancet and the New England Journal of Medicine, the 4 Options and 5 Options splits from the MedBullets platform, and MedXpertQA.

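The results table reports one score per benchmark plus an Avg column. As a minimal illustrative sketch — assuming Avg is simply the unweighted mean of the ten per-benchmark scores (this excerpt does not specify the aggregation), and using placeholder numbers except for the 46.8 HealthBench score quoted above — the `average_score` helper below is hypothetical, not part of the released evaluation code:

```python
# Hedged sketch: derive an "Avg" column as the unweighted mean of
# per-benchmark scores. All scores except HealthBench's 46.8 are
# illustrative placeholders, not II-Medical-8B-1706's actual results.

def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean over benchmark scores (0-100 scale), rounded to 2 dp."""
    return round(sum(scores.values()) / len(scores), 2)

placeholder_scores = {
    "MedMC": 70.0, "MedQA": 85.0, "PubMed": 78.0, "MMLU-P": 77.0,
    "HealthBench": 46.8, "Lancet": 70.0, "MedB-4": 75.0, "MedB-5": 70.0,
    "MedX": 25.0, "NEJM": 70.0,
}
print(average_score(placeholder_scores))  # → 66.68
```

If the published Avg instead weights benchmarks by question count, the helper would need per-benchmark sizes; the table itself does not say which convention is used.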
| Model | MedMC | MedQA | PubMed | MMLU-P | HealthBench | Lancet | MedB-4 | MedB-5 | MedX | NEJM | Avg |