Commit 3f7fb2a (verified; 1 parent: c177089) by topshik: Update README.md

Files changed (1): README.md (+252 −3)
---
license: apache-2.0
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
library_name: transformers
tags:
- code
base_model:
- JetBrains/Mellum-4b-base
model-index:
- name: Mellum-4b-sft-python
  results:
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2837
      verified: false
    - name: EM ≤ 8k
      type: exact_match
      value: 0.2987
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 2k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2924
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 4k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.3060
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 8k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2977
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 12k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2680
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 16k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2543
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.4212
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM (Algorithmic)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.3316
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM (Control)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.3611
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM (API)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.5710
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval Infilling (Single-Line)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.8045
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval Infilling (Multi-Line)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.4819
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval Infilling (Random Span)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.3768
      verified: false
---

# Model Description
Mellum-4b-sft-python is a fine-tuned version of JetBrains' first open-source large language model (LLM), optimized for code-related tasks.

Pre-trained on over 4 trillion tokens across multiple programming languages with a context window of 8192 tokens, and then fine-tuned, Mellum-4b-sft-python is tailored specifically for code completion in Python.
The model follows a LLaMA-style architecture with 4 billion parameters, making it efficient for both cloud inference (e.g., via vLLM) and local deployment (e.g., using llama.cpp or Ollama).

Mellum was trained using Automatic Mixed Precision (AMP) with bf16 precision.
The version uploaded to Hugging Face retains the bf16 format for public use.

Designed for integration into professional developer tooling (e.g., intelligent code suggestions in IDEs), AI-powered coding assistants, and research on code understanding and generation, Mellum is also well suited for educational applications and fine-tuning experiments.

# Limitations
- Biases: May reflect biases present in public codebases; for example, the model will likely produce code similar in style to open-source repositories.
- Security: Code suggestions should not be assumed to be secure or free of vulnerabilities.

# Sample Usage
Here are examples of how to run and sample from the model.

## Generic generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

example = """
import sys
import os
import time

sys.path.append(os.getcwd())

from cluster.prepare_data import get_headers_pairs_list, write_dist_matrix
from cluster.token_edit_distance import get_distance_matrix

if len(sys.argv) < 3:
    print(
        "Too few arguments. You should provide: \n1. dataset_filename" +
        "\n2. output_data_filename"
    )
    sys.exit()

start = time.perf_counter()
dataset_filename_ = sys.argv[1]
output_data_filename_ = sys.argv[2]

headers_pairs = get_headers_pairs_list(dataset_filename_, verbose=True)

dist_matrix, max_dist = get_distance_matrix(
    list(map(lambda x: x[1], headers_pairs)),
    verbose=True
)

write_dist_matrix(dist_matrix, max_dist, output_data_filename_, verbose=True)

end = time.perf_counter()
"""

tokenizer = AutoTokenizer.from_pretrained('JetBrains/Mellum-4b-sft-python')
model = AutoModelForCausalLM.from_pretrained('JetBrains/Mellum-4b-sft-python')
encoded_input = tokenizer(example, return_tensors='pt', return_token_type_ids=False)
input_len = len(encoded_input["input_ids"][0])
out = model.generate(
    **encoded_input,
    max_new_tokens=100,
)
print("### Context")
print(tokenizer.decode(out[0][:input_len]))
print("### Prediction")
print(tokenizer.decode(out[0][input_len:]))
```

## Fill-in-the-middle generation
```python
prefix = """
def fibonacci(n: int) -> int:
"""

suffix = """
if __name__ == "__main__":
    print(fibonacci(10))
"""

# Reuses the tokenizer and model loaded in the previous example.
encoded_input = tokenizer(
    f"<fim_suffix>{suffix}<fim_prefix>{prefix}<fim_middle>",
    return_tensors='pt',
    return_token_type_ids=False,
)
out = model.generate(
    **encoded_input,
    max_new_tokens=100,
)
print(tokenizer.decode(out[0][len(encoded_input["input_ids"][0]):]))
```
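
Note the token order in the prompt above: suffix first, then prefix, with generation continuing after `<fim_middle>`. As a minimal sketch of that layout, the prompt assembly can be isolated into a small helper (the `build_fim_prompt` name is illustrative, not part of the model's API; the sentinel strings are taken from the example above):

```python
# Sentinel tokens used by the fill-in-the-middle prompt format above.
FIM_SUFFIX = "<fim_suffix>"
FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt: the suffix comes first, then the prefix;
    the model generates the missing middle after <fim_middle>."""
    return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}"

prompt = build_fim_prompt(
    "def add(a: int, b: int) -> int:\n    return ",
    "\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The string returned by this helper can be passed to the tokenizer exactly as in the example above.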

# Citation
If you use this model, please cite:

```bibtex
@misc{Mellum-4b-base,
    title = {Mellum-4b-base},
    author = {Pavlichenko, Nikita and Nazarov, Iurii and Dolgov, Ivan and Reshetnikova, Julia and Garanina, Ekaterina and Lasocki, Karol and Boitsov, Sergei and Karaeva, Dariia and Bondyrev, Ivan and Sheptyakov, Maksim and Ustalov, Dmitry and Abramov, Nikita and Kolomyttseva, Olga and Lysaniuk, Kseniia and Zavidnyi, Ilia and Semenkin, Anton and Sazanovich, Uladzislau},
    year = {2025},
}
```

# Contact
For questions, collaborations, and requests, reach out to us at [email protected]