---
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
- nvidia/OpenCodeReasoning
library_name: transformers
tags:
- code
license: mit
pipeline_tag: text-generation
---

# Spec Coder V1
**Spec Coder** is an open-source AI model designed to assist with everyday coding tasks. It is built on the **Llama architecture**, so it can be run through tools such as **llama.cpp** and **Ollama** in addition to the Hugging Face **transformers** library, making it straightforward to deploy both locally and in the cloud.
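
For local inference through llama.cpp, the weights can first be converted to GGUF and then loaded with the `llama-cpp-python` bindings. A minimal sketch, assuming such a GGUF export exists (the filename below is a placeholder, not a file shipped with this repository):

```python
from llama_cpp import Llama

# Placeholder path: assumes the Hugging Face weights have been converted to GGUF
# (e.g. with llama.cpp's conversion script) and optionally quantized.
llm = Llama(model_path="spec-coder-4b-v1.Q4_K_M.gguf", n_ctx=8192)

completion = llm("def fibonacci(n):", max_tokens=64, temperature=0.2)
print(completion["choices"][0]["text"])
```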

Trained on the large code corpora listed under Training Data below, **Spec Coder** generates and completes code and follows programming instructions across multiple languages. It can be used for code completion, debugging assistance, and automated code generation, acting as a versatile assistant for developers.

**Spec Coder** is optimized for integration into developer tools, providing intelligent coding assistance and supporting research on programming languages. Its 4-billion-parameter transformer architecture is compact enough to run efficiently in a range of environments.

The model can also serve as a base for further post-training, including supervised fine-tuning (SFT) and reinforcement learning (RL), to improve its performance on specific programming tasks (see the SFT sketch below).
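
As an illustration, here is a minimal SFT sketch using the standard `transformers` Trainer with a causal language-modeling collator. The JSONL file and its `text` column are placeholders for your own fine-tuning corpus, and the hyperparameters are not tuned:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Causal LM tokenizers often lack a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder dataset: one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="sft_examples.jsonl")["train"]

def tokenize(batch):
    # Truncate to the 8,192-token context window used during pre-training.
    return tokenizer(batch["text"], truncation=True, max_length=8192)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False -> causal language modeling: labels are the input ids, shifted inside the model.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="spec-coder-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```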

# Training Data
- Total Training Tokens: ~4.3 trillion tokens
- Corpus: The Stack, StarCoder Training Dataset, The Stack v2, CommitPack, OpenCodeReasoning, English Wikipedia

# Training Details
- Context Window: 8,192 tokens
- Optimization: Standard language modeling objective (next-token prediction; see the sketch after this list)
- Hardware: Cluster of 5 x RTX 4090 GPUs
- Training Duration: ~140 days (approximately 4.5 months)
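
For clarity, the "standard language modeling objective" above is ordinary next-token prediction trained with cross-entropy. A small PyTorch sketch of the loss (illustrative only, not the actual training code):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy next-token loss: position t predicts token t+1."""
    shift_logits = logits[:, :-1, :].contiguous()   # (batch, seq_len - 1, vocab)
    shift_labels = input_ids[:, 1:].contiguous()    # (batch, seq_len - 1)
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```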

# Benchmarks
## RepoBench 1.1 (Python)
| Model              | 2k    | 4k    | 8k    | 12k   | 16k   | Avg   | Avg ≤ 8k |
|--------------------|-------|-------|-------|-------|-------|-------|----------|
| Spec-Coder-4b-V1 | 30.42%| 38.55%| 36.91%| 32.75%| 30.34%| 34.59%| 36.23%   |

## Syntax-Aware Fill-in-the-Middle (SAFIM)
| Model                | Algorithmic | Control | API    | Average |
|----------------------|-------------|---------|--------|---------|
| Spec-Coder-4b-V1   | 38.22%      | 41.18%  | 60.45% | 46.28%  |

## HumanEval Infilling
| Model                | Single-Line | Multi-Line | Random Span |
|----------------------|-------------|------------|-------------|
| Spec-Coder-4b-V1   | 72.34%      | 45.65%     | 39.12%      |

# Limitations
- **Biases**: The model may reflect biases present in the public codebases it was trained on.
- **Security**: Generated code may contain security vulnerabilities. Always verify and audit model output before using it in production.

# Sample Usage
Here is an example of how to load and run **Spec Coder** with the `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt the model with the start of a function and let it complete the body.
input_code = "def factorial(n):\n    if n == 0:"

inputs = tokenizer(input_code, return_tensors="pt")

# Pass the attention mask along with the input ids; max_new_tokens bounds the
# length of the generated continuation rather than the total sequence length.
outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Python code:\n", generated_code)
```