Update README.md
README.md
CHANGED
@@ -15,28 +15,62 @@ model-index:
       type: text-generation
     dataset:
       type: tianyang/repobench_python_v1.1
-      name: RepoBench 1.1 (Python)
+      name: RepoBench 1.1 (Python, 2k)
     metrics:
-    - name:
+    - name: pass@1
       type: pass@1
       value: 0.2820
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 4k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2795
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 8k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2777
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 12k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2453
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 16k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2110
       verified: false
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.2591
+      verified: false
   - task:
       type: text-generation
     dataset:
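The hunk above replaces the single RepoBench entry in the `model-index` front matter with one result per evaluated context length (2k, 4k, 8k, 12k, 16k) plus an overall average entry. For anyone reading the card metadata programmatically, here is a minimal sketch of how those per-context scores could be collected; it assumes PyYAML and uses an abbreviated two-entry excerpt of the front matter rather than the full block:

```python
import yaml

# Abbreviated excerpt of the restructured model-index front matter from the
# hunk above (only two of the six RepoBench entries are reproduced here).
card_metadata = yaml.safe_load("""
model-index:
- name: Mellum-4b-base
  results:
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 2k)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.2820
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 4k)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.2795
      verified: false
""")

# Collect {dataset display name -> metric value} for every RepoBench result.
scores = {
    result["dataset"]["name"]: result["metrics"][0]["value"]
    for entry in card_metadata["model-index"]
    for result in entry["results"]
    if result["dataset"]["type"] == "tianyang/repobench_python_v1.1"
}
print(scores)
# {'RepoBench 1.1 (Python, 2k)': 0.282, 'RepoBench 1.1 (Python, 4k)': 0.2795}
```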
@@ -106,43 +140,41 @@ Keep in mind that base model is not fine-tuned for downstream tasks out-of-the-b
 # Benchmarks
 In addition to the base model scores, we are providing scores for a Mellum fine-tuned for Python to provide model’s users with some estimation about potential capabilities.

-## RepoBench
+## RepoBench 1.1
 - Type: single-line
 - Languages: Python and Java
 - Metric: Exact Match (EM), %

 ### Python Subset
-
-
-
-| Mellum-4b-
-| Mellum-4b-base | 28.20% | 27.95% | 27.77% | 24.53% | 21.10% | 28.37% | 27.97% |
+| Model | 2k | 4k | 8k | 12k | 16k | Avg | Avg ≤ 8k |
+|----------------------|--------|--------|--------|--------|--------|--------|----------|
+| Mellum-4b-sft-python | 29.24% | 30.60% | 29.77% | 26.80% | 25.43% | 28.37% | 29.87% |
+| Mellum-4b-base | 28.20% | 27.95% | 27.77% | 24.53% | 21.10% | 25.91% | 27.97% |

 ### Java Subset
-
 | Model | 2K Context | 4K Context | 8K Context |
 |---------------|------------|------------|------------|
 | Mellum-4b-base | 33.15% | 33.48% | 27.79% |

-## SAFIM
+## Syntax-Aware Fill-in-the-Middle (SAFIM)
 - Type: mix of multi-line and single-line
 - Languages: multi-language
 - Metric: pass@1, %

 | Model | Algorithmic | Control | API | Average |
 |----------------------|-------------|---------|--------|---------|
-| Mellum-4b-sft-python
-| Mellum-4b-base
+| Mellum-4b-sft-python | 33.16% | 36.11% | 57.10% | 42.12% |
+| Mellum-4b-base | 25.30% | 38.39% | 50.65% | 38.11% |

 ## HumanEval Infilling
 - Type: single-line and multi-line
 - Languages: Python
 - Metric: pass@1, %

-| Model | Single-
+| Model | Single-Line | Multi-Line | Random Span |
 |----------------------|-------------|------------|-------------|
-| Mellum-4b-sft-python
-| Mellum-4b-base
+| Mellum-4b-sft-python | 80.45% | 48.19% | 37.68% |
+| Mellum-4b-base | 66.21% | 38.52% | 29.70% |

 # Limitations
 - Biases: May reflect biases present in public codebases. For example it will likely produce code which is similar in style to the open-source repositories.
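For context on the RepoBench tables in the hunk above: the benchmark scores single-line completions by exact match against the reference line. A rough illustrative sketch of that metric follows; it is not the official RepoBench 1.1 harness, and the whitespace stripping is an assumed normalisation:

```python
def exact_match_percent(predictions, references):
    """Share of completions (in %) that reproduce the reference line exactly.

    Illustrative only: the official RepoBench 1.1 evaluation may normalise
    or post-process completions differently.
    """
    assert len(predictions) == len(references)
    hits = sum(
        pred.strip() == ref.strip()  # assumed normalisation: strip surrounding whitespace
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Toy usage with made-up completions.
preds = ["return x + y", "for i in range(n):", "print(result)"]
refs = ["return x + y", "for i in range(n) :", "print(result)"]
print(f"EM: {exact_match_percent(preds, refs):.2f}%")  # EM: 66.67%
```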
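SAFIM and HumanEval Infilling, by contrast, report pass@1: the share of problems whose completed program passes its unit tests. With a single sample per problem this is simply c/n; the unbiased estimator from the HumanEval paper (Chen et al., 2021) covers the general n-sample case. A small reference sketch, not the evaluation code behind the numbers above:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al., 2021: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# With one sample per task (n=1), pass@1 reduces to the fraction of tasks solved.
print(pass_at_k(n=1, c=1, k=1))            # 1.0
print(round(pass_at_k(n=10, c=3, k=1), 4)) # 0.3
```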