---
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
- nvidia/OpenCodeReasoning
library_name: transformers
tags:
- code
license: mit
pipeline_tag: text-generation
---
|
|
|
# Spec Coder V1 |
|
**Spec Coder** is an open-source AI model designed to assist with everyday coding tasks. It is built on the **Llama architecture**, so it works out of the box with tools like **llama.cpp** and **Ollama**. This makes **Spec Coder** compatible with a wide variety of systems and enables flexible deployment both locally and in the cloud.
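Because the model uses the Llama architecture, a typical local workflow is to convert the checkpoint to GGUF and serve it with llama.cpp. The commands below are a sketch, not official instructions: the local paths and the Q4_K_M quantization choice are assumptions.

```shell
# Convert the Hugging Face checkpoint to GGUF (converter script ships with llama.cpp)
python convert_hf_to_gguf.py ./Spec-Coder-4b-V1 --outfile spec-coder-4b.gguf

# Optionally quantize to reduce memory use (Q4_K_M is a common quality/size trade-off)
./llama-quantize spec-coder-4b.gguf spec-coder-4b-Q4_K_M.gguf Q4_K_M

# Start an interactive session
./llama-cli -m spec-coder-4b-Q4_K_M.gguf -p "def fibonacci(n):"
```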
|
|
|
Trained on large public code corpora, **Spec Coder** excels at generating code, completing code snippets, and understanding programming tasks across multiple languages. It can be used for code completion, debugging, and automated code generation, acting as a versatile assistant for developers.
|
|
|
**Spec Coder** is optimized for integration into developer tools, providing intelligent coding assistance and facilitating research in programming languages. Its advanced transformer-based architecture, with 4 billion parameters, allows it to perform tasks across different environments efficiently. |
|
|
|
The model supports downstream adaptation, including supervised fine-tuning (SFT) and reinforcement learning (RL), to improve its performance on specific programming tasks.
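As an illustration of how SFT data is commonly prepared for causal language models (this is the standard Hugging Face convention, not something specific to Spec Coder): prompt tokens are masked with `-100` in the labels so the loss is computed only on the completion. The token ids and EOS id below are made up for the example.

```python
def build_sft_example(prompt_ids, completion_ids, eos_id=2):
    # Concatenate prompt and completion, appending the end-of-sequence token.
    input_ids = prompt_ids + completion_ids + [eos_id]
    # Mask the prompt positions with -100 so cross-entropy loss is computed
    # only on the completion (and the EOS token).
    labels = [-100] * len(prompt_ids) + completion_ids + [eos_id]
    return input_ids, labels

inp, lab = build_sft_example([5, 6, 7], [8, 9])
print(inp)  # [5, 6, 7, 8, 9, 2]
print(lab)  # [-100, -100, -100, 8, 9, 2]
```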
|
|
|
# Training Data |
|
- Total Training Tokens: ~4.3 trillion
|
- Corpus: The Stack, StarCoder Training Dataset, The Stack v2, CommitPack, OpenCodeReasoning, English Wikipedia |
|
|
|
# Training Details |
|
- Context Window: 8,192 tokens |
|
- Optimization: Standard language modeling objective |
|
- Hardware: Cluster of 5 x RTX 4090 GPUs |
|
- Training Duration: ~140 days (about 4.5 months)
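As a back-of-envelope sanity check, the figures above imply the following average throughput (a rough calculation assuming uninterrupted training; not an official number):

```python
TOTAL_TOKENS = 4.3e12   # ~4.3 trillion training tokens
DAYS = 140              # ~140 days of training
GPUS = 5                # 5 x RTX 4090

seconds = DAYS * 86_400
tokens_per_second = TOTAL_TOKENS / seconds
per_gpu = tokens_per_second / GPUS

print(f"~{tokens_per_second:,.0f} tokens/s overall, ~{per_gpu:,.0f} tokens/s per GPU")
# → ~355,489 tokens/s overall, ~71,098 tokens/s per GPU
```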
|
|
|
# Benchmarks |
|
## RepoBench 1.1 (Python) |
|
| Model            | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
|------------------|--------|--------|--------|--------|--------|--------|----------|
| Spec-Coder-4b-V1 | 30.42% | 38.55% | 36.91% | 32.75% | 30.34% | 34.59% | 36.23%   |
|
|
|
## Syntax-Aware Fill-in-the-Middle (SAFIM) |
|
| Model            | Algorithmic | Control | API    | Average |
|------------------|-------------|---------|--------|---------|
| Spec-Coder-4b-V1 | 38.22%      | 41.18%  | 60.45% | 46.28%  |
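Fill-in-the-middle prompts like those used in SAFIM are typically assembled with sentinel tokens. The sketch below follows the StarCoder-style sentinel convention, which is an assumption here — check the model tokenizer's special tokens for the exact strings Spec Coder expects.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # StarCoder-style sentinels (assumed; verify against the model's tokenizer).
    # The model is asked to generate the middle span given the surrounding code.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(
    "def add(a, b):\n    return ",
    "\n\nprint(add(2, 3))",
)
print(prompt)
```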
|
|
|
## HumanEval Infilling |
|
| Model            | Single-Line | Multi-Line | Random Span |
|------------------|-------------|------------|-------------|
| Spec-Coder-4b-V1 | 72.34%      | 45.65%     | 39.12%      |
|
|
|
# Limitations |
|
- **Biases**: The model may reflect biases present in the public codebases it was trained on.
|
- **Security**: Code generated by the model may contain security vulnerabilities. Always review and audit generated code before using it.
|
|
|
# Sample Usage |
|
Here is a basic example of generating code with **Spec Coder** using the `transformers` library:
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_code = "def factorial(n):\n    if n == 0:"

inputs = tokenizer(input_code, return_tensors="pt")

# Pass the attention mask along with the input ids, and cap the number of
# newly generated tokens rather than the total sequence length.
outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Python code:\n", generated_code)
```