Update README.md
We tested models on seven medical benchmarks: [MedQA](https://arxiv.org/abs/2009.13081), …
| **Model** | **Average** | **MedQA** | **USMLE** | **Medbullets-4** | **Medbullets-5** | **MedMCQA** | **MMLU-Medical** |
|:--------------------------------|:-----------:|:---------:|:---------:|:----------------:|:----------------:|:-----------:|:----------------:|
| GPT-4 | 76.6 | 81.4 | 86.6 | 68.8 | 63.3 | 72.4 | 87.1 |
| GPT-3.5 | 54.8 | 53.6 | 58.5 | 51.0 | 47.4 | 51.0 | 67.3 |
| MediTron-70B (Ensemble, 5 runs) | - | 70.2 | - | - | - | 66.0 | 78.0 |
| *Open-source (7B)* | | | | | | | |
| MediTron-7B | 51.0 | 50.2 | 44.6 | 51.1 | 45.5 | 57.9 | 56.7 |
| BioMistral-7B | 55.4 | 54.3 | 51.4 | 52.3 | 48.7 | 61.1 | 64.6 |
| Meerkat-7B | 62.6 | 70.6 | 70.3 | 58.7 | 52.9 | 60.6 | 70.5 |
| Meerkat-8B (**New**) | **67.3** | **74.0** | **74.2** | **62.3** | **55.5** | **62.7** | **75.2** |
Please note that the MMLU-Medical scores are the average accuracies across six medical-related subjects in the original MMLU benchmark; per-subject results are presented below.
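For context, accuracies on multiple-choice benchmarks like these are typically obtained by posing each question to the model and checking whether its preferred option matches the answer key. The sketch below shows one common scoring approach, comparing the model's next-token likelihoods of the option letters. It is an illustration under assumed details, not the evaluation harness behind the table above, and the `dmis-lab/llama-3-meerkat-8b-v1.0` repository id is an assumption to be checked against the model card.

```python
# Illustrative sketch only: scores one multiple-choice question by comparing
# the model's likelihood of each option letter as the next token. The actual
# evaluation harness and prompt format behind the table above may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmis-lab/llama-3-meerkat-8b-v1.0"  # assumed repo id; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def choose_answer(question: str, options: dict[str, str]) -> str:
    """Return the option letter the model assigns the highest next-token probability."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    scores = {}
    for letter in options:
        # Log-probability of " A", " B", ... immediately after "Answer:".
        token_id = tokenizer.encode(" " + letter, add_special_tokens=False)[0]
        scores[letter] = log_probs[token_id].item()
    return max(scores, key=scores.get)

question = "Which vitamin deficiency causes scurvy?"
options = {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"}
print(choose_answer(question, options))  # expected: "C"
```

Benchmark accuracy is then the fraction of questions answered correctly. For MMLU-Medical, per the note above, accuracy is computed per subject and averaged across the six medical-related subjects (commonly anatomy, clinical knowledge, college biology, college medicine, medical genetics, and professional medicine).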