Update README.md
README.md
CHANGED
@@ -15,28 +15,62 @@ model-index:
       type: text-generation
     dataset:
       type: tianyang/repobench_python_v1.1
-      name: RepoBench 1.1 (Python)
+      name: RepoBench 1.1 (Python, 2k)
     metrics:
-    - name:
+    - name: pass@1
       type: pass@1
       value: 0.2820
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 4k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2795
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 8k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2777
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 12k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2453
       verified: false
-
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 16k)
+    metrics:
+    - name: pass@1
       type: pass@1
       value: 0.2110
       verified: false
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.2591
+      verified: false
   - task:
       type: text-generation
     dataset:
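The hunk above replaces the single RepoBench entry in the `model-index` front matter with one result per evaluated context length (2k, 4k, 8k, 12k, 16k) plus an overall average entry. For anyone reading the card metadata programmatically, here is a minimal sketch of how those per-context scores could be collected; it assumes PyYAML and uses an abbreviated two-entry excerpt of the front matter rather than the full block:

```python
import yaml

# Abbreviated excerpt of the restructured model-index front matter from the
# hunk above (only two of the six RepoBench entries are reproduced here).
card_metadata = yaml.safe_load("""
model-index:
- name: Mellum-4b-base
  results:
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 2k)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.2820
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 4k)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.2795
      verified: false
""")

# Collect {dataset display name -> metric value} for every RepoBench result.
scores = {
    result["dataset"]["name"]: result["metrics"][0]["value"]
    for entry in card_metadata["model-index"]
    for result in entry["results"]
    if result["dataset"]["type"] == "tianyang/repobench_python_v1.1"
}
print(scores)
# {'RepoBench 1.1 (Python, 2k)': 0.282, 'RepoBench 1.1 (Python, 4k)': 0.2795}
```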
@@ -106,43 +140,41 @@ Keep in mind that base model is not fine-tuned for downstream tasks out-of-the-b
 # Benchmarks
 In addition to the base model scores, we are providing scores for a Mellum fine-tuned for Python to provide model’s users with some estimation about potential capabilities.

-## RepoBench
+## RepoBench 1.1
 - Type: single-line
 - Languages: Python and Java
 - Metric: Exact Match (EM), %

 ### Python Subset
-
-
-
-| Mellum-4b-
-| Mellum-4b-base | 28.20% | 27.95% | 27.77% | 24.53% | 21.10% | 28.37% | 27.97% |
+| Model | 2k | 4k | 8k | 12k | 16k | Avg | Avg ≤ 8k |
+|----------------------|--------|--------|--------|--------|--------|--------|----------|
+| Mellum-4b-sft-python | 29.24% | 30.60% | 29.77% | 26.80% | 25.43% | 28.37% | 29.87% |
+| Mellum-4b-base | 28.20% | 27.95% | 27.77% | 24.53% | 21.10% | 25.91% | 27.97% |

 ### Java Subset
-
 | Model | 2K Context | 4K Context | 8K Context |
 |---------------|------------|------------|------------|
 | Mellum-4b-base | 33.15% | 33.48% | 27.79% |

-## SAFIM
+## Syntax-Aware Fill-in-the-Middle (SAFIM)
 - Type: mix of multi-line and single-line
 - Languages: multi-language
 - Metric: pass@1, %

 | Model | Algorithmic | Control | API | Average |
 |----------------------|-------------|---------|--------|---------|
-| Mellum-4b-sft-python
-| Mellum-4b-base
+| Mellum-4b-sft-python | 33.16% | 36.11% | 57.10% | 42.12% |
+| Mellum-4b-base | 25.30% | 38.39% | 50.65% | 38.11% |

 ## HumanEval Infilling
 - Type: single-line and multi-line
 - Languages: Python
 - Metric: pass@1, %

-| Model | Single-
+| Model | Single-Line | Multi-Line | Random Span |
 |----------------------|-------------|------------|-------------|
-| Mellum-4b-sft-python
-| Mellum-4b-base
+| Mellum-4b-sft-python | 80.45% | 48.19% | 37.68% |
+| Mellum-4b-base | 66.21% | 38.52% | 29.70% |

 # Limitations
 - Biases: May reflect biases present in public codebases. For example it will likely produce code which is similar in style to the open-source repositories.
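For context on the RepoBench tables in the hunk above: the benchmark scores single-line completions by exact match against the reference line. A rough illustrative sketch of that metric follows; it is not the official RepoBench 1.1 harness, and the whitespace stripping is an assumed normalisation:

```python
def exact_match_percent(predictions, references):
    """Share of completions (in %) that reproduce the reference line exactly.

    Illustrative only: the official RepoBench 1.1 evaluation may normalise
    or post-process completions differently.
    """
    assert len(predictions) == len(references)
    hits = sum(
        pred.strip() == ref.strip()  # assumed normalisation: strip surrounding whitespace
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Toy usage with made-up completions.
preds = ["return x + y", "for i in range(n):", "print(result)"]
refs = ["return x + y", "for i in range(n) :", "print(result)"]
print(f"EM: {exact_match_percent(preds, refs):.2f}%")  # EM: 66.67%
```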
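SAFIM and HumanEval Infilling, by contrast, report pass@1: the share of problems whose completed program passes its unit tests. With a single sample per problem this is simply c/n; the unbiased estimator from the HumanEval paper (Chen et al., 2021) covers the general n-sample case. A small reference sketch, not the evaluation code behind the numbers above:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al., 2021: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# With one sample per task (n=1), pass@1 reduces to the fraction of tasks solved.
print(pass_at_k(n=1, c=1, k=1))            # 1.0
print(round(pass_at_k(n=10, c=3, k=1), 4)) # 0.3
```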