---
license: apache-2.0
datasets:
- bigcode/starcoderdata
- bigcode/starcoder2data-extras
language:
- en
tags:
- code
- python
- java
- javascript
- typescript
- go
- rust
- php
- ruby
- cpp
- c
- sql
---
# CodeModernBERT-Crow-v1-Pre

## Model Description

**CodeModernBERT-Crow-v1-Pre** is a pretrained language model based on the ModernBERT architecture, specifically adapted for source code and docstring-style natural language.
It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.

* **License**: Apache-2.0
* **Supported Languages**: Python, JavaScript, TypeScript, Java, Go, Rust, PHP, Ruby, C++, C, SQL
* **Datasets**:

  * [bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)
  * [bigcode/starcoder2data-extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras)
* **Pipeline tag**: `fill-mask`

This model is a **pretraining checkpoint**, designed for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization.

---

## Training Objective

The model was pretrained on large-scale multilingual code corpora with the following goals:

* Learn robust code representations across multiple programming languages.
* Capture semantic relations between code tokens and natural language descriptions.
* Provide a strong initialization point for fine-tuning on code-related downstream tasks.
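
The card does not reproduce the full pretraining recipe, but the checkpoint is a masked language model (see the `fill-mask` pipeline tag), so continued pretraining follows the standard MLM setup. The sketch below is illustrative only: the toy snippets and the 15% masking ratio are assumptions, not values taken from this card.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Toy snippets standing in for the large-scale code corpora listed above.
snippets = [
    "def add(a, b):\n    return a + b",
    "function greet(name) { return `Hello, ${name}`; }",
]
# Inputs were limited to 1024 tokens during the original pretraining.
encodings = [tokenizer(s, truncation=True, max_length=1024) for s in snippets]

# Dynamic masking; the 15% ratio is an assumed default, not stated in this card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(encodings)

outputs = model(**batch)
print(outputs.loss)  # MLM loss over the masked positions
```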

---

## Tokenizer

A custom **BPE tokenizer** was trained for code and docstrings.

* **Vocabulary size**: 50,368
* **Special tokens**: Standard Hugging Face special tokens + custom tokens for code/document structure.
* **Training process**:

  * Up to 1M examples per dataset.
  * Each example truncated to 10,000 characters.
  * Trained with files from multiple datasets (see above).
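
The exact training script is not reproduced here; the sketch below shows how a comparable BPE tokenizer could be trained with the `tokenizers` library. The byte-level pre-tokenization, the special-token list, and the placeholder corpus are illustrative assumptions.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# The card only states "BPE"; byte-level pre-tokenization is a common choice for code
# and is assumed here for illustration.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50368,
    special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"],  # assumed list
)

def corpus_iterator(examples, max_chars=10_000, limit=1_000_000):
    """Yield up to 1M examples per dataset, each truncated to 10,000 characters."""
    for i, text in enumerate(examples):
        if i >= limit:
            break
        yield text[:max_chars]

# `examples` stands in for raw code/docstring strings streamed from the datasets above.
examples = ["def add(a, b):\n    return a + b"]
tokenizer.train_from_iterator(corpus_iterator(examples), trainer=trainer)
```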

---

## Architecture

* **Base**: ModernBERT
* **Hidden size**: 768
* **Number of layers**: 12
* **Attention heads**: 12
* **Intermediate size**: 3072
* **Max sequence length**: 8192 (during training, inputs were limited to 1024 tokens)
* **RoPE positional encoding**: supported
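
These values can be read back from the published configuration once the checkpoint is downloaded; the attribute names below are the standard ModernBERT config fields.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 8192
```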

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model(**inputs)
```
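
Because the checkpoint is a masked language model, it can also be queried through the `fill-mask` pipeline. The snippet below is a minimal example that reuses the tokenizer's own mask token.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Use the tokenizer's mask token rather than hard-coding a string.
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"def add(a, b): {mask} a + b"):
    print(prediction["token_str"], prediction["score"])
```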

The model can be fine-tuned for:

* Code search (query ↔ code retrieval), as sketched below
* Code clone detection
* Code summarization (docstring prediction)
* Bug detection and repair (masked language modeling or cloze-style)
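
As a concrete starting point for code search, one common recipe is to mean-pool the encoder's hidden states into fixed-size vectors and rank code snippets by cosine similarity against the query embedding. This is an illustrative sketch, not a procedure prescribed by this card; the pooling strategy and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # ignore padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # mean pooling
    return F.normalize(pooled, dim=-1)

query_vec = embed(["add two numbers"])
code_vecs = embed(["def add(a, b): return a + b", "def read_file(path): ..."])
print(query_vec @ code_vecs.T)  # cosine similarities, higher = better match
```

For an actual retrieval system, this encoder would typically be fine-tuned with a contrastive objective on paired (query, code) data before the embeddings are indexed.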

---

## Limitations

* The model is not optimized for direct code generation.
* Pretraining does not guarantee correctness of code execution.
* Fine-tuning is recommended for specific downstream applications.

---

## Intended Use

* Research in software engineering and natural language processing for code.
* Educational exploration of pretrained models for code tasks.
* Baseline for continued pretraining or fine-tuning.

---