dustalov committed · verified
Commit 55fe2ca · Parent(s): 1637e54

Update README.md

Files changed (1):
  1. README.md +51 -19
README.md CHANGED
@@ -15,28 +15,62 @@ model-index:
       type: text-generation
     dataset:
       type: tianyang/repobench_python_v1.1
-      name: RepoBench 1.1 (Python)
+      name: RepoBench 1.1 (Python, 2k)
     metrics:
-    - name: 2k
+    - name: pass@1
       type: pass@1
       value: 0.2820
       verified: false
-    - name: 4k
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 4k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2795
       verified: false
-    - name: 8k
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 8k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2777
       verified: false
-    - name: 12k
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 12k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2453
       verified: false
-    - name: 16k
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 16k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2110
       verified: false
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.2591
+      verified: false
   - task:
       type: text-generation
     dataset:
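The point of this restructuring is that Hub tooling parses `model-index` programmatically: each context length now gets its own task entry instead of being encoded in a metric's name. A minimal sketch of how the new layout surfaces through the API, assuming the card is published at `JetBrains/Mellum-4b-base` (the model named in the tables below) and that `huggingface_hub` is installed:

```python
# Minimal sketch: read the restructured model-index via huggingface_hub.
# The repo id JetBrains/Mellum-4b-base is an assumption; adjust if needed.
from huggingface_hub import ModelCard

card = ModelCard.load("JetBrains/Mellum-4b-base")

# After this commit, each context length (2k..16k) is a separate task entry,
# so every score appears as its own EvalResult rather than as a metric whose
# *name* encodes the context length.
for res in card.data.eval_results:
    print(res.dataset_name, res.metric_type, res.metric_value)
```

With the old layout the five scores were metrics named "2k" through "16k" on a single task; with the new one each prints as its own `RepoBench 1.1 (Python, …)` row, plus the aggregate entry.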
@@ -106,43 +140,41 @@ Keep in mind that the base model is not fine-tuned for downstream tasks out-of-the-box
 # Benchmarks
 In addition to the base model scores, we provide scores for a Mellum fine-tuned for Python to give users an estimate of the model's potential capabilities.
 
-## RepoBench
+## RepoBench 1.1
 - Type: single-line
 - Languages: Python and Java
 - Metric: Exact Match (EM), %
 
 ### Python Subset
-
-| Model | 2k | 4k | 8k | 12k | 16k | Avg | Avg ≤ 8k |
-|------------------------|--------|--------|--------|--------|--------|--------|----------|
-| Mellum-4b-sft-python | 29.24% | 30.60% | 29.77% | 26.80% | 25.43% | 25.91% | 29.87% |
-| Mellum-4b-base | 28.20% | 27.95% | 27.77% | 24.53% | 21.10% | 28.37% | 27.97% |
+| Model                | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
+|----------------------|--------|--------|--------|--------|--------|--------|----------|
+| Mellum-4b-sft-python | 29.24% | 30.60% | 29.77% | 26.80% | 25.43% | 28.37% | 29.87%   |
+| Mellum-4b-base       | 28.20% | 27.95% | 27.77% | 24.53% | 21.10% | 25.91% | 27.97%   |
 
 ### Java Subset
-
 | Model          | 2K Context | 4K Context | 8K Context |
 |----------------|------------|------------|------------|
 | Mellum-4b-base | 33.15%     | 33.48%     | 27.79%     |
 
-## SAFIM Benchmark
+## Syntax-Aware Fill-in-the-Middle (SAFIM)
 - Type: mix of multi-line and single-line
 - Languages: multi-language
 - Metric: pass@1, %
 
 | Model                | Algorithmic | Control | API    | Average |
 |----------------------|-------------|---------|--------|---------|
-| Mellum-4b-sft-python | 33.16% | 36.11% | 57.10% | 42.12% |
-| Mellum-4b-base | 25.30% | 38.39% | 50.65% | 38.11% |
+| Mellum-4b-sft-python | 33.16%      | 36.11%  | 57.10% | 42.12%  |
+| Mellum-4b-base       | 25.30%      | 38.39%  | 50.65% | 38.11%  |
 
 ## HumanEval Infilling
 - Type: single-line and multi-line
 - Languages: Python
 - Metric: pass@1, %
 
-| Model | Single-line | Multi-line | Random Span |
+| Model                | Single-Line | Multi-Line | Random Span |
 |----------------------|-------------|------------|-------------|
-| Mellum-4b-sft-python | 80.45% | 48.19% | 37.68% |
-| Mellum-4b-base | 66.21% | 38.52% | 29.70% |
+| Mellum-4b-sft-python | 80.45%      | 48.19%     | 37.68%      |
+| Mellum-4b-base       | 66.21%      | 38.52%     | 29.70%      |
 
 # Limitations
 - Biases: May reflect biases present in public codebases. For example, it will likely produce code similar in style to open-source repositories.
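The Python-subset fix above deserves a sanity check: the old table had the two models' `Avg` values swapped, and the corrected numbers (as well as the aggregate `value: 0.2591` added to the model-index) are plain means of the per-context EM scores. A quick verification using only the numbers from the table itself:

```python
# Sanity check for the corrected Avg and "Avg <= 8k" columns: both are plain
# means over the per-context Exact Match scores in the Python-subset table.
scores = {
    "Mellum-4b-sft-python": [29.24, 30.60, 29.77, 26.80, 25.43],
    "Mellum-4b-base":       [28.20, 27.95, 27.77, 24.53, 21.10],
}

for model, ems in scores.items():
    avg = sum(ems) / len(ems)     # Avg over 2k, 4k, 8k, 12k, 16k
    avg_le_8k = sum(ems[:3]) / 3  # Avg over 2k, 4k, 8k only
    print(f"{model}: Avg={avg:.2f}%  Avg<=8k={avg_le_8k:.2f}%")
```

This prints 28.37/29.87 for the fine-tuned model and 25.91/27.97 for the base model, matching the corrected rows; 25.91% is also the 0.2591 aggregate pass@1 now recorded in the metadata.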
 