Update README.md
## III. Evaluation Results

Our II-Medical-8B-1706 model achieved a 46.8% score on [HealthBench](https://openai.com/index/healthbench/), a comprehensive open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date. We provide a comparison to models available in ChatGPT below.

<!--  -->

Detailed results for HealthBench can be found [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1).

<!--  -->

We also evaluate on nine other medical QA benchmarks: MedMCQA, MedQA, PubMedQA, medical-related questions from MMLU-Pro, small QA sets from The Lancet and the New England Journal of Medicine, the 4 Options and 5 Options splits from the MedBullets platform, and MedXpertQA.

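The results table reports one score per benchmark plus an Avg column. As a minimal illustrative sketch — assuming Avg is simply the unweighted mean of the ten per-benchmark scores (this excerpt does not specify the aggregation), and using placeholder numbers except for the 46.8 HealthBench score quoted above — the `average_score` helper below is hypothetical, not part of the released evaluation code:

```python
# Hedged sketch: derive an "Avg" column as the unweighted mean of
# per-benchmark scores. All scores except HealthBench's 46.8 are
# illustrative placeholders, not II-Medical-8B-1706's actual results.

def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean over benchmark scores (0-100 scale), rounded to 2 dp."""
    return round(sum(scores.values()) / len(scores), 2)

placeholder_scores = {
    "MedMC": 70.0, "MedQA": 85.0, "PubMed": 78.0, "MMLU-P": 77.0,
    "HealthBench": 46.8, "Lancet": 70.0, "MedB-4": 75.0, "MedB-5": 70.0,
    "MedX": 25.0, "NEJM": 70.0,
}
print(average_score(placeholder_scores))  # → 66.68
```

If the published Avg instead weights benchmarks by question count, the helper would need per-benchmark sizes; the table itself does not say which convention is used.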
| Model | MedMC | MedQA | PubMed | MMLU-P | HealthBench | Lancet | MedB-4 | MedB-5 | MedX | NEJM | Avg |