---
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
- nvidia/OpenCodeReasoning
library_name: transformers
tags:
- code
license: mit
pipeline_tag: text-generation
---
# Spec Coder V1
**Spec Coder** is a cutting-edge, open-source AI model designed to assist with fundamental coding tasks. It is built on the **Llama architecture**, allowing seamless access via tools like **llama.cpp** and **Ollama**. This makes **Spec Coder** highly compatible with a variety of systems, enabling flexible deployment both locally and in the cloud.
Trained on vast datasets, **Spec Coder** excels in generating code, completing code snippets, and understanding programming tasks across multiple languages. It can be used for code completion, debugging, and automated code generation, acting as a versatile assistant for developers.
**Spec Coder** is optimized for integration into developer tools, providing intelligent coding assistance and facilitating research in programming languages. Its advanced transformer-based architecture, with 4 billion parameters, allows it to perform tasks across different environments efficiently.
The model supports various downstream tasks including supervised fine-tuning (SFT) and reinforcement learning (RL) to improve its performance for specific programming tasks.
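As an illustration of how SFT data for a causal LM like this one is commonly prepared, the sketch below masks prompt tokens out of the loss using the `-100` ignore index from the `transformers`/PyTorch convention. This is a generic recipe, not an official fine-tuning procedure for Spec Coder:

```python
def build_sft_example(prompt_ids, completion_ids, eos_id):
    """Build one supervised fine-tuning example for a causal LM.

    The model sees prompt + completion as input, but labels set the
    prompt positions to -100 so the loss is computed only on the
    completion (and the EOS token).
    """
    input_ids = prompt_ids + completion_ids + [eos_id]
    labels = [-100] * len(prompt_ids) + completion_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}
```

Examples built this way can be padded and batched for any standard causal-LM trainer.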
# Training Data
- Total Training Tokens: ~4.3 trillion tokens
- Corpus: The Stack, StarCoder Training Dataset, The Stack v2, CommitPack, OpenCodeReasoning, English Wikipedia
# Training Details
- Context Window: 8,192 tokens
- Optimization: Standard language modeling objective
- Hardware: Cluster of 5 x RTX 4090 GPUs
- Training Duration: ~140 days (roughly 4.5 months)
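Sequences longer than the 8,192-token context window must be split before training or inference. One common preprocessing approach, shown here as an illustrative sketch rather than the pipeline actually used for this model, is overlapping windows:

```python
def chunk_tokens(ids, window=8192, stride=4096):
    """Split a long token sequence into overlapping windows.

    Each chunk is at most `window` tokens; consecutive chunks overlap by
    `window - stride` tokens so no boundary context is lost. The final
    chunk may be shorter than `window`.
    """
    if len(ids) <= window:
        return [ids]
    return [ids[i:i + window] for i in range(0, len(ids) - window + stride, stride)]
```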
# Benchmarks
## RepoBench 1.1 (Python)
| Model | 2k | 4k | 8k | 12k | 16k | Avg | Avg ≤ 8k |
|--------------------|-------|-------|-------|-------|-------|-------|----------|
| Spec-Coder-4b-V1 | 30.42%| 38.55%| 36.91%| 32.75%| 30.34%| 34.59%| 36.23% |
## Syntax-Aware Fill-in-the-Middle (SAFIM)
| Model | Algorithmic | Control | API | Average |
|----------------------|-------------|---------|--------|---------|
| Spec-Coder-4b-V1 | 38.22% | 41.18% | 60.45% | 46.28% |
## HumanEval Infilling
| Model | Single-Line | Multi-Line | Random Span |
|----------------------|-------------|------------|-------------|
| Spec-Coder-4b-V1 | 72.34% | 45.65% | 39.12% |
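The SAFIM and HumanEval Infilling benchmarks above evaluate fill-in-the-middle (FIM) completion. A minimal sketch of PSM-style (prefix-suffix-middle) prompt construction is shown below; the sentinel token names follow the StarCoder-family convention and are an assumption here, not confirmed tokens for this model — check the tokenizer's special tokens before use:

```python
def build_fim_prompt(prefix, suffix,
                     pre="<fim_prefix>", suf="<fim_suffix>", mid="<fim_middle>"):
    """Assemble a fill-in-the-middle prompt in PSM order.

    The model is expected to generate the missing middle span after the
    <fim_middle> sentinel. Sentinel names here are assumptions.
    """
    return f"{pre}{prefix}{suf}{suffix}{mid}"
```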
# Limitations
- **Biases**: The model may reflect biases present in the public codebases.
- **Security**: Generated code may contain security vulnerabilities. Always review and audit the model's output for potential risks before use.
# Sample Usage
Here are examples of how to run and interact with **Spec Coder**:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Complete a partial Python function
input_code = "def factorial(n):\n    if n == 0:"
inputs = tokenizer(input_code, return_tensors="pt")

# Pass the attention mask along with the input ids and bound the
# number of newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Python code:\n", generated_code)
```
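Because the model follows the Llama architecture, it can also be run locally with **llama.cpp** or **Ollama** once converted to GGUF. The sketch below shows an Ollama Modelfile; the GGUF filename, quantization level, and tag are illustrative assumptions, not official artifacts:

```
# Modelfile (illustrative) — import a local GGUF build into Ollama
FROM ./spec-coder-4b-v1.Q4_K_M.gguf
PARAMETER num_ctx 8192
```

Then create and run the model with `ollama create spec-coder -f Modelfile` followed by `ollama run spec-coder`.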