davidkim205 committed on
Commit 8b586eb · verified · 1 Parent(s): a1c1487

Update README.md

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -150,22 +150,22 @@ The score is calculated by:
  2. Assigning full points for a difference of 0, and half a point for a difference of 1.
  3. The total score is the sum of all points divided by the number of data points.
 
- | | file | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
- |---:|:------------------|:---------|:--------|---------:|:-----------|:-----------|:----------|:----------|:---------|:---------|:---------|:---------|----:|----:|:---------|
- | 0 | keval-2-9b.jsonl | 0 (0.0%) | 61.4% | 22 | 11 (50.0%) | 5 (22.7%) | 2 (9.1%) | 3 (13.6%) | 0 | 0 | 0 | 0 | 0 | 0 | 1 (4.5%) |
- | 1 | keval-2-3b.jsonl | 0 (0.0%) | 59.1% | 22 | 10 (45.5%) | 6 (27.3%) | 4 (18.2%) | 2 (9.1%) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
- | 2 | gpt-4o.jsonl | 0 (0.0%) | 54.5% | 22 | 7 (31.8%) | 10 (45.5%) | 2 (9.1%) | 2 (9.1%) | 1 (4.5%) | 0 | 0 | 0 | 0 | 0 | 0 |
- | 3 | keval-2-1b.jsonl | 0 (0.0%) | 43.2% | 22 | 8 (36.4%) | 3 (13.6%) | 5 (22.7%) | 2 (9.1%) | 1 (4.5%) | 0 | 1 (4.5%) | 0 | 0 | 0 | 2 (9.1%) |
- | 4 | gpt-4o-mini.jsonl | 1 (4.5%) | 36.4% | 22 | 4 (18.2%) | 8 (36.4%) | 4 (18.2%) | 3 (13.6%) | 0 | 1 (4.5%) | 0 | 1 (4.5%) | 0 | 0 | 0 |
+ | | model | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
+ |---:|:------------|:---------|:--------|---------:|:-----------|:-----------|:----------|:----------|:---------|:---------|:---------|:---------|----:|----:|:---------|
+ | 0 | keval-2-9b | 0 (0.0%) | 61.4% | 22 | 11 (50.0%) | 5 (22.7%) | 2 (9.1%) | 3 (13.6%) | 0 | 0 | 0 | 0 | 0 | 0 | 1 (4.5%) |
+ | 1 | keval-2-3b | 0 (0.0%) | 59.1% | 22 | 10 (45.5%) | 6 (27.3%) | 4 (18.2%) | 2 (9.1%) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | 2 | gpt-4o | 0 (0.0%) | 54.5% | 22 | 7 (31.8%) | 10 (45.5%) | 2 (9.1%) | 2 (9.1%) | 1 (4.5%) | 0 | 0 | 0 | 0 | 0 | 0 |
+ | 3 | keval-2-1b | 0 (0.0%) | 43.2% | 22 | 8 (36.4%) | 3 (13.6%) | 5 (22.7%) | 2 (9.1%) | 1 (4.5%) | 0 | 1 (4.5%) | 0 | 0 | 0 | 2 (9.1%) |
+ | 4 | gpt-4o-mini | 1 (4.5%) | 36.4% | 22 | 4 (18.2%) | 8 (36.4%) | 4 (18.2%) | 3 (13.6%) | 0 | 1 (4.5%) | 0 | 1 (4.5%) | 0 | 0 | 0 |
 
  ### Accuracy
 
  The `score` column represents the ratio of correctly predicted labels to the total number of data points. The `wrong` column shows the count and percentage of incorrectly formatted answers. The columns labeled "0" through "10" represent the number and percentage of correct predictions for each label, based on how well the model predicted each specific label.
 
- | | file | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
- |---:|:------------------|:---------|:--------|---------:|:-----------|:----------|:-----------|:----------|:-----------|:----------|:----------|:----------|:----------|:----------|:-----------|
- | 0 | keval-2-9b.jsonl | 0 (0.0%) | 50.0% | 22 | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) | 0 | 2 (100.0%) | 0 | 0 | 1 (50.0%) | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
- | 1 | keval-2-3b.jsonl | 0 (0.0%) | 45.5% | 22 | 2 (100.0%) | 1 (50.0%) | 0 | 0 | 2 (100.0%) | 1 (50.0%) | 0 | 1 (50.0%) | 1 (50.0%) | 0 | 2 (100.0%) |
- | 2 | keval-2-1b.jsonl | 0 (0.0%) | 36.4% | 22 | 0 | 1 (50.0%) | 2 (100.0%) | 0 | 1 (50.0%) | 0 | 1 (50.0%) | 0 | 0 | 1 (50.0%) | 2 (100.0%) |
- | 3 | gpt-4o.jsonl | 0 (0.0%) | 31.8% | 22 | 2 (100.0%) | 0 | 0 | 1 (50.0%) | 0 | 1 (50.0%) | 0 | 0 | 1 (50.0%) | 0 | 2 (100.0%) |
- | 4 | gpt-4o-mini.jsonl | 1 (4.5%) | 18.2% | 22 | 2 (100.0%) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 (50.0%) | 0 | 1 (50.0%) |
+ | | model | wrong | score | length | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
+ |---:|:------------|:---------|:--------|---------:|:-----------|:----------|:-----------|:----------|:-----------|:----------|:----------|:----------|:----------|:----------|:-----------|
+ | 0 | keval-2-9b | 0 (0.0%) | 50.0% | 22 | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) | 0 | 2 (100.0%) | 0 | 0 | 1 (50.0%) | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
+ | 1 | keval-2-3b | 0 (0.0%) | 45.5% | 22 | 2 (100.0%) | 1 (50.0%) | 0 | 0 | 2 (100.0%) | 1 (50.0%) | 0 | 1 (50.0%) | 1 (50.0%) | 0 | 2 (100.0%) |
+ | 2 | keval-2-1b | 0 (0.0%) | 36.4% | 22 | 0 | 1 (50.0%) | 2 (100.0%) | 0 | 1 (50.0%) | 0 | 1 (50.0%) | 0 | 0 | 1 (50.0%) | 2 (100.0%) |
+ | 3 | gpt-4o | 0 (0.0%) | 31.8% | 22 | 2 (100.0%) | 0 | 0 | 1 (50.0%) | 0 | 1 (50.0%) | 0 | 0 | 1 (50.0%) | 0 | 2 (100.0%) |
+ | 4 | gpt-4o-mini | 1 (4.5%) | 18.2% | 22 | 2 (100.0%) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 (50.0%) | 0 | 1 (50.0%) |
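
The two scoring rules quoted in the context lines of this diff can be checked against the table rows with a short sketch. It assumes that the "0" through "10" columns of the first table count data points by the difference between predicted and reference score (the first step of the rule lies above this hunk and is not shown), and that the same columns in the Accuracy table count correct predictions per label. The function names `difference_score` and `accuracy` are hypothetical and are not taken from the repository.

```python
from collections.abc import Mapping


def difference_score(diff_counts: Mapping[int, int], total: int) -> float:
    """Score from a histogram of prediction/reference differences.

    Assumption: a difference of 0 earns 1 point, a difference of 1 earns
    0.5 points, any larger difference earns 0. The sum of points is then
    divided by the number of data points.
    """
    points = diff_counts.get(0, 0) * 1.0 + diff_counts.get(1, 0) * 0.5
    return points / total


def accuracy(correct_per_label: Mapping[int, int], total: int) -> float:
    """Ratio of correctly predicted labels to the total number of data points."""
    return sum(correct_per_label.values()) / total


# keval-2-9b row of the first table: difference -> count (zero cells omitted).
keval_9b_diffs = {0: 11, 1: 5, 2: 2, 3: 3, 10: 1}
# keval-2-9b row of the Accuracy table: label -> correct predictions.
keval_9b_correct = {0: 1, 1: 1, 2: 2, 4: 2, 7: 1, 8: 1, 9: 1, 10: 2}

print(f"{difference_score(keval_9b_diffs, 22):.1%}")  # 61.4%
print(f"{accuracy(keval_9b_correct, 22):.1%}")        # 50.0%
```

Under these assumptions both printed values match the `score` cells for keval-2-9b (61.4% and 50.0%), which suggests the reading of the columns is at least consistent with the published numbers.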