Update README.md
We tested models on seven medical benchmarks: [MedQA](https://arxiv.org/abs/2009.13081), …
| **Model** | **Average** | **MedQA** | **USMLE** | **Medbullets-4** | **Medbullets-5** | **MedMCQA** | **MMLU-Medical** |
|:--------------------------------|:-----------:|:---------:|:---------:|:----------------:|:----------------:|:-----------:|:----------------:|
| GPT-4 | 76.6 | 81.4 | 86.6 | 68.8 | 63.3 | 72.4 | 87.1 |
| GPT-3.5 | 54.8 | 53.6 | 58.5 | 51.0 | 47.4 | 51.0 | 67.3 |
| MediTron-70B (Ensemble, 5 runs) | - | 70.2 | - | - | - | 66.0 | 78.0 |
| *Open-source (7B)* | | | | | | | |
| MediTron-7B | 51.0 | 50.2 | 44.6 | 51.1 | 45.5 | 57.9 | 56.7 |
| BioMistral-7B | 55.4 | 54.3 | 51.4 | 52.3 | 48.7 | 61.1 | 64.6 |
| Meerkat-7B | 62.6 | 70.6 | 70.3 | 58.7 | 52.9 | 60.6 | 70.5 |
| Meerkat-8B (**New**) | **67.3** | **74.0** | **74.2** | **62.3** | **55.5** | **62.7** | **75.2** |
Please note that the MMLU-Medical scores are the average accuracies across six medical-related subjects in the original MMLU benchmark; per-subject results are presented below.
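For context, accuracies on multiple-choice benchmarks like these are typically obtained by posing each question to the model and checking whether its preferred option matches the answer key. The sketch below shows one common scoring approach, comparing the model's next-token likelihoods of the option letters. It is an illustration under assumed details, not the evaluation harness behind the table above, and the `dmis-lab/llama-3-meerkat-8b-v1.0` repository id is an assumption to be checked against the model card.

```python
# Illustrative sketch only: scores one multiple-choice question by comparing
# the model's likelihood of each option letter as the next token. The actual
# evaluation harness and prompt format behind the table above may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmis-lab/llama-3-meerkat-8b-v1.0"  # assumed repo id; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def choose_answer(question: str, options: dict[str, str]) -> str:
    """Return the option letter the model assigns the highest next-token probability."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    scores = {}
    for letter in options:
        # Log-probability of " A", " B", ... immediately after "Answer:".
        token_id = tokenizer.encode(" " + letter, add_special_tokens=False)[0]
        scores[letter] = log_probs[token_id].item()
    return max(scores, key=scores.get)

question = "Which vitamin deficiency causes scurvy?"
options = {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"}
print(choose_answer(question, options))  # expected: "C"
```

Benchmark accuracy is then the fraction of questions answered correctly. For MMLU-Medical, per the note above, accuracy is computed per subject and averaged across the six medical-related subjects (commonly anatomy, clinical knowledge, college biology, college medicine, medical genetics, and professional medicine).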