Update README.md
Rating: [[0]]
```

## Evaluation

### Diff
The score is calculated by:

2. Assigning full points for a difference of 0, and half a point for a difference of 1.
3. The total score is the sum of all points divided by the number of data points.
|    | model       | wrong    | score | length | 0          | 1          | 2         | 3         | 4        | 5        | 6        | 7        | 8 | 9 | 10       |
|---:|:------------|:---------|:------|-------:|:-----------|:-----------|:----------|:----------|:---------|:---------|:---------|:---------|--:|--:|:---------|
|  0 | keval-2-9b  | 0 (0.0%) | 61.4% |     22 | 11 (50.0%) | 5 (22.7%)  | 2 (9.1%)  | 3 (13.6%) | 0        | 0        | 0        | 0        | 0 | 0 | 1 (4.5%) |
|  1 | keval-2-3b  | 0 (0.0%) | 59.1% |     22 | 10 (45.5%) | 6 (27.3%)  | 4 (18.2%) | 2 (9.1%)  | 0        | 0        | 0        | 0        | 0 | 0 | 0        |
|  2 | gpt-4o      | 0 (0.0%) | 54.5% |     22 | 7 (31.8%)  | 10 (45.5%) | 2 (9.1%)  | 2 (9.1%)  | 1 (4.5%) | 0        | 0        | 0        | 0 | 0 | 0        |
|  3 | keval-2-1b  | 0 (0.0%) | 43.2% |     22 | 8 (36.4%)  | 3 (13.6%)  | 5 (22.7%) | 2 (9.1%)  | 1 (4.5%) | 0        | 1 (4.5%) | 0        | 0 | 0 | 2 (9.1%) |
|  4 | gpt-4o-mini | 1 (4.5%) | 36.4% |     22 | 4 (18.2%)  | 8 (36.4%)  | 4 (18.2%) | 3 (13.6%) | 0        | 1 (4.5%) | 0        | 1 (4.5%) | 0 | 0 | 0        |

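The diff metric described above can be sketched in a few lines of Python. This is a minimal sketch with a hypothetical function name; only the scoring rule (full point for a difference of 0, half a point for a difference of 1) comes from the text.

```python
def diff_score(predicted, actual):
    """Diff score: 1 point for an exact match, 0.5 for an off-by-one
    prediction, 0 otherwise; averaged over all data points."""
    points = 0.0
    for pred, gold in zip(predicted, actual):
        diff = abs(pred - gold)
        if diff == 0:
            points += 1.0
        elif diff == 1:
            points += 0.5
    return points / len(actual)

# One exact match, one off-by-one, one miss out of three:
print(diff_score([3, 5, 7], [3, 6, 0]))  # → 0.5
```

For example, keval-2-9b's 61.4% corresponds to 11 exact matches and 5 off-by-one predictions over 22 data points: (11 + 5 · 0.5) / 22 ≈ 0.614.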
### Accuracy

The `score` column is the ratio of exactly matched labels to the total number of data points. The `wrong` column shows the count and percentage of incorrectly formatted answers. The columns labeled "0" through "10" give the number and percentage of correct predictions for each individual label.

|    | model       | wrong    | score | length | 0          | 1         | 2          | 3         | 4          | 5         | 6         | 7         | 8         | 9         | 10         |
|---:|:------------|:---------|:------|-------:|:-----------|:----------|:-----------|:----------|:-----------|:----------|:----------|:----------|:----------|:----------|:-----------|
|  0 | keval-2-9b  | 0 (0.0%) | 50.0% |     22 | 1 (50.0%)  | 1 (50.0%) | 2 (100.0%) | 0         | 2 (100.0%) | 0         | 0         | 1 (50.0%) | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
|  1 | keval-2-3b  | 0 (0.0%) | 45.5% |     22 | 2 (100.0%) | 1 (50.0%) | 0          | 0         | 2 (100.0%) | 1 (50.0%) | 0         | 1 (50.0%) | 1 (50.0%) | 0         | 2 (100.0%) |
|  2 | keval-2-1b  | 0 (0.0%) | 36.4% |     22 | 0          | 1 (50.0%) | 2 (100.0%) | 0         | 1 (50.0%)  | 0         | 1 (50.0%) | 0         | 0         | 1 (50.0%) | 2 (100.0%) |
|  3 | gpt-4o      | 0 (0.0%) | 31.8% |     22 | 2 (100.0%) | 0         | 0          | 1 (50.0%) | 0          | 1 (50.0%) | 0         | 0         | 1 (50.0%) | 0         | 2 (100.0%) |
|  4 | gpt-4o-mini | 1 (4.5%) | 18.2% |     22 | 2 (100.0%) | 0         | 0          | 0         | 0          | 0         | 0         | 0         | 1 (50.0%) | 0         | 1 (50.0%)  |
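
The accuracy columns above can be sketched the same way. In this sketch, `parse_rating` and `accuracy_report` are hypothetical helper names; the `Rating: [[N]]` answer format is taken from the prompt example shown earlier, and a `None` prediction stands for an incorrectly formatted answer (the `wrong` column).

```python
import re
from collections import Counter

def parse_rating(answer):
    """Extract N from a 'Rating: [[N]]' answer; None if the format is wrong."""
    match = re.search(r"Rating: \[\[(\d+)\]\]", answer)
    return int(match.group(1)) if match else None

def accuracy_report(predicted, actual):
    """Exact-match score, malformed-answer count, and per-label correct
    counts (the '0' through '10' columns)."""
    per_label = Counter(gold for pred, gold in zip(predicted, actual) if pred == gold)
    score = sum(per_label.values()) / len(actual)
    wrong = sum(1 for pred in predicted if pred is None)
    return score, wrong, per_label
```

For example, keval-2-9b's 50.0% accuracy corresponds to 11 exactly matched labels out of 22 data points, with the per-label counts spread across the "0" through "10" columns.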