---
license: apache-2.0
datasets:
- bigcode/starcoderdata
- bigcode/starcoder2data-extras
language:
- en
tags:
- code
- python
- java
- javascript
- typescript
- go
- rust
- php
- ruby
- cpp
- c
- sql
---
# CodeModernBERT-Crow-v1-Pre

## Model Description

**CodeModernBERT-Crow-v1-Pre** is a pretrained language model based on the ModernBERT architecture, specifically adapted for source code and docstring-style natural language.
It supports multiple programming languages and was trained on large-scale code datasets curated from open-source repositories.

* **License**: Apache-2.0
* **Supported Languages**: Python, JavaScript, TypeScript, Java, Go, Rust, PHP, Ruby, C++, C, SQL
* **Datasets**:

  * [bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)
  * [bigcode/starcoder2data-extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras)
* **Pipeline tag**: `fill-mask`

This model is a **pretraining checkpoint**, designed for further fine-tuning on downstream tasks such as semantic code search, bug detection, or code summarization.

---

## Training Objective

The model was pretrained on large-scale multilingual code corpora with the following goals:

* Learn robust code representations across multiple programming languages.
* Capture semantic relations between code tokens and natural language descriptions.
* Provide a strong initialization point for fine-tuning on code-related downstream tasks.
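
The card does not reproduce the full pretraining recipe, but the checkpoint is a masked language model (see the `fill-mask` pipeline tag), so continued pretraining follows the standard MLM setup. The sketch below is illustrative only: the toy snippets and the 15% masking ratio are assumptions, not values taken from this card.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Toy snippets standing in for the large-scale code corpora listed above.
snippets = [
    "def add(a, b):\n    return a + b",
    "function greet(name) { return `Hello, ${name}`; }",
]
# Inputs were limited to 1024 tokens during the original pretraining.
encodings = [tokenizer(s, truncation=True, max_length=1024) for s in snippets]

# Dynamic masking; the 15% ratio is an assumed default, not stated in this card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(encodings)

outputs = model(**batch)
print(outputs.loss)  # MLM loss over the masked positions
```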

---

## Tokenizer

A custom **BPE tokenizer** was trained for code and docstrings.

* **Vocabulary size**: 50,368
* **Special tokens**: Standard Hugging Face special tokens + custom tokens for code/document structure.
* **Training process**:

  * Up to 1M examples per dataset.
  * Each example truncated to 10,000 characters.
  * Trained with files from multiple datasets (see above).
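
The exact training script is not reproduced here; the sketch below shows how a comparable BPE tokenizer could be trained with the `tokenizers` library. The byte-level pre-tokenization, the special-token list, and the placeholder corpus are illustrative assumptions.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# The card only states "BPE"; byte-level pre-tokenization is a common choice for code
# and is assumed here for illustration.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50368,
    special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"],  # assumed list
)

def corpus_iterator(examples, max_chars=10_000, limit=1_000_000):
    """Yield up to 1M examples per dataset, each truncated to 10,000 characters."""
    for i, text in enumerate(examples):
        if i >= limit:
            break
        yield text[:max_chars]

# `examples` stands in for raw code/docstring strings streamed from the datasets above.
examples = ["def add(a, b):\n    return a + b"]
tokenizer.train_from_iterator(corpus_iterator(examples), trainer=trainer)
```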

---

## Architecture

* **Base**: ModernBERT
* **Hidden size**: 768
* **Number of layers**: 12
* **Attention heads**: 12
* **Intermediate size**: 3072
* **Max sequence length**: 8192 (during training, inputs were limited to 1024 tokens)
* **RoPE positional encoding**: supported
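
These values can be read back from the published configuration once the checkpoint is downloaded; the attribute names below are the standard ModernBERT config fields.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 8192
```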

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model(**inputs)
```
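
Because the checkpoint is a masked language model, it can also be queried through the `fill-mask` pipeline. The snippet below is a minimal example that reuses the tokenizer's own mask token.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Shuu12121/CodeModernBERT-Crow-v1-Pre")

# Use the tokenizer's mask token rather than hard-coding a string.
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"def add(a, b): {mask} a + b"):
    print(prediction["token_str"], prediction["score"])
```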

The model can be fine-tuned for:

* Code search (query ↔ code retrieval), as sketched below
* Code clone detection
* Code summarization (docstring prediction)
* Bug detection and repair (masked language modeling or cloze-style)
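
As a concrete starting point for code search, one common recipe is to mean-pool the encoder's hidden states into fixed-size vectors and rank code snippets by cosine similarity against the query embedding. This is an illustrative sketch, not a procedure prescribed by this card; the pooling strategy and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1-Pre")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # ignore padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # mean pooling
    return F.normalize(pooled, dim=-1)

query_vec = embed(["add two numbers"])
code_vecs = embed(["def add(a, b): return a + b", "def read_file(path): ..."])
print(query_vec @ code_vecs.T)  # cosine similarities, higher = better match
```

For an actual retrieval system, this encoder would typically be fine-tuned with a contrastive objective on paired (query, code) data before the embeddings are indexed.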

---

## Limitations

* The model is not optimized for direct code generation.
* Pretraining does not guarantee correctness of code execution.
* Fine-tuning is recommended for specific downstream applications.

---

## Intended Use

* Research in software engineering and natural language processing for code.
* Educational exploration of pretrained models for code tasks.
* Baseline for continued pretraining or fine-tuning.

---