Commit 1c63404 by matsuo-lab (1 parent: 96db70d)
Update README.md
README.md CHANGED
@@ -47,7 +47,21 @@ This repository provides a Japanese-centric multilingual GPT-NeoX model of 10 bi
|
|
47 |
|
48 |
# Benchmarking
|
49 |
|
50 |
-
* **Japanese benchmark**
|
51 |
|
52 |
- *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
|
53 |
- *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
|
|
|
47 |
|
48 |
# Benchmarking
|
49 |
|
50 |
+
* **Japanese benchmark: JGLUE 8-task (2023-08-27)**
|
51 |
+
|
52 |
+
- *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
|
53 |
+
- *The 8-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, JSQuAD-1.1, jaqket_v2-0.2, xlsum_ja-1.0, xwinograd_ja, and mgsm-1.0.*
|
54 |
+
- *Models are loaded in float16, and evaluation uses template version 0.3 with few-shot in-context learning.*
|
55 |
+
- *The number of few-shot examples per task is 3, 3, 3, 2, 1, 1, 0, 5.*
|
56 |
+
- *special_tokens_map.json was modified to avoid errors when evaluating the second half of the benchmarks. As a result, the results for the first half of the benchmarks changed slightly.*
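
As a rough sketch, the task list and the few-shot counts in the notes above can be paired up as follows. This assumes the counts are given in the same order as the tasks, which the README does not state explicitly; the variable names are illustrative only.

```python
# Task names as listed in the README note on the 8-task average.
TASKS = [
    "JCommonsenseQA-1.1", "JNLI-1.1", "MARC-ja-1.1", "JSQuAD-1.1",
    "jaqket_v2-0.2", "xlsum_ja-1.0", "xwinograd_ja", "mgsm-1.0",
]

# Few-shot counts as listed in the README note.
NUM_FEWSHOT = [3, 3, 3, 2, 1, 1, 0, 5]

# Assumption: the counts follow the same order as the task list.
assert len(TASKS) == len(NUM_FEWSHOT)
few_shot_by_task = dict(zip(TASKS, NUM_FEWSHOT))

for task, shots in few_shot_by_task.items():
    print(f"{task}: {shots}-shot")
```

Under that ordering assumption, xwinograd_ja would be the only zero-shot task and mgsm-1.0 would use the most examples.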
|
57 |
+
|
58 |
+
| model | average | jcommonsenseqa | jnli | marc_ja | jsquad | jaqket_v2 | xlsum_ja | xwinograd_ja | mgsm |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| weblab-10b-instruction-sft | 59.11 | 74.62 | 66.56 | 95.49 | 78.34 | 63.32 | 20.57 | 71.95 | 2 |
| weblab-10b | 50.74 | 66.58 | 53.74 | 82.07 | 62.94 | 56.19 | 10.03 | 71.95 | 2.4 |
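
The `average` column appears to be the unweighted mean of the eight task scores. The following is a minimal check under that assumption, using only the scores from the table; the names `SCORES` and `mean_score` are illustrative and not part of the evaluation harness.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task scores copied from the table, in column order
# (jcommonsenseqa, jnli, marc_ja, jsquad, jaqket_v2, xlsum_ja, xwinograd_ja, mgsm).
SCORES = {
    "weblab-10b-instruction-sft": [
        "74.62", "66.56", "95.49", "78.34", "63.32", "20.57", "71.95", "2",
    ],
    "weblab-10b": [
        "66.58", "53.74", "82.07", "62.94", "56.19", "10.03", "71.95", "2.4",
    ],
}


def mean_score(values):
    """Unweighted mean of the scores, rounded half-up to two decimals."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


averages = {model: mean_score(values) for model, values in SCORES.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}")
```

Computed this way, both values agree with the reported averages (59.11 and 50.74), which supports reading the column as a simple mean over the eight tasks.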
|
62 |
+
|
63 |
+
|
64 |
+
* **Japanese benchmark: JGLUE 4-task (2023-08-18)**
|
65 |
|
66 |
- *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
|
67 |
- *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
|