ModernGBERT 1B
ModernGBERT 1B is a German ModernBERT language model with 1 billion parameters and a native context length of up to 8,192 tokens. It follows the BERT-style architecture and training procedure of the ModernBERT codebase and was pre-trained on the same 1.27 trillion tokens from the German portion of RedPajama V2 as our LLäMmlein decoder family.
We provide two model sizes:
- ModernGBERT 1B (← you are here): 28 layers, hidden size 2,048, 1 billion parameters
- ModernGBERT 134M: 22 layers, hidden size 768, 134 million parameters
Find more details in our preprint!
Usage
You can use ModernGBERT with the transformers library from version v4.48.0 onwards. Optionally, install flash-attn to achieve the highest efficiency.
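For example (a minimal setup sketch; the flash-attn line is optional and assumes a CUDA toolchain is available):
pip install "transformers>=4.48.0"
pip install flash-attn --no-build-isolation  # optional, for faster attention kernels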
Since ModernGBERT 1B is a masked language model (MLM), you can load it via AutoModelForMaskedLM. For downstream tasks such as classification, retrieval, or QA, fine-tune the model following standard BERT fine-tuning recipes (a sequence-classification sketch is shown after the example below).
Example using AutoModelForMaskedLM:
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "Die Hauptstadt von Frankreich ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris
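For downstream fine-tuning, here is a minimal sketch for sequence classification. The label count, example texts, and hyperparameters are illustrative placeholders, not recommendations from the preprint:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Adds a freshly initialized classification head on top of the encoder (2 labels chosen for illustration)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny toy batch; replace with your own labelled German data
texts = ["Das Essen war hervorragend.", "Der Service war leider enttäuschend."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # passing labels makes the model return a cross-entropy loss
outputs.loss.backward()                  # one illustrative gradient step
optimizer.step()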
NOTE: If you want to use Hugging Face's PEFT library for LoRA training, you need to specify the target modules explicitly, e.g.:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    task_type="TOKEN_CLS",
    r=8,
    lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
model = get_peft_model(model, peft_config)
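After wrapping the model, you can check how many parameters are actually trainable using PEFT's built-in helper:
# Prints the number of trainable (LoRA) parameters vs. the total parameter count
model.print_trainable_parameters()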
Intermediate Checkpoints
In addition to the final model checkpoint, we publish intermediate checkpoints from throughout the training process as separate branches of this repository. A specific checkpoint can be loaded like this:
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "LSX-UniWue/ModernGBERT_1B"
revision = "base-head-12000-ckpt"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision)
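The available checkpoint branches can be listed programmatically, for example with the huggingface_hub client (a small sketch; branch names follow the scheme shown above):
from huggingface_hub import list_repo_refs

# Every intermediate checkpoint is published as a branch ("ref") of the model repository
refs = list_repo_refs("LSX-UniWue/ModernGBERT_1B")
for branch in refs.branches:
    print(branch.name)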
Performance
We evaluate our models across a broad range of tasks. For natural language understanding we use the SuperGLEBer benchmark, and for embedding capabilities the German MTEB benchmark (after unsupervised fine-tuning of every model on the German portion of mMARCO). The following table compares this encoder with other German and multilingual encoders; see our preprint for more details on the evaluation.
Model | SuperGLEBer Avg | MTEB Avg |
---|---|---|
ModernGBERT 1B (you are here) | 0.808 | 0.551 |
ModernGBERT 134M | 0.749 | 0.501 |
GBERT-base | 0.718 | 0.500 |
GBERT-large | 0.768 | 0.521 |
GeBERTa-base | 0.716 | 0.493 |
GeBERTa-large | 0.749 | 0.494 |
GeBERTa-xlarge | 0.767 | 0.521 |
Gerturax-3 | 0.740 | 0.472 |
XLM-RoBERTa-large | 0.730 | 0.460 |
XLM-RoBERTa-xlarge | 0.758 | 0.479 |
License
We release the ModernGBERT models under a research-only RAIL-M license. See license.md for details.