ModernGBERT 134M
ModernGBERT 134M is a German ModernBERT language model with 134 million parameters and a native context length of up to 8,192 tokens. It follows the BERT-style architecture and training procedure of the ModernBERT codebase and was pre-trained on 470 billion tokens from the German portion of RedPajama V2, a subset of the same training data used for our LLäMmlein decoder family.
We provide two model sizes:
- ModernGBERT 1B: 28 layers, hidden size 2,048, 1 billion parameters
- ModernGBERT 134M (← you are here): 22 layers, hidden size 768, 134 million parameters
Find more details in our preprint!
Usage
You can use ModernGBERT with the transformers library from version v4.48.0 onwards. (Optional: install flash-attn to achieve highest efficiency.)
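If flash-attn is available, you can also request it explicitly when loading the model. A minimal sketch (attn_implementation and torch_dtype are standard from_pretrained arguments; FlashAttention-2 requires a half-precision dtype on a GPU):

import torch
from transformers import AutoModelForMaskedLM

# Request FlashAttention-2 explicitly; assumes flash-attn is installed and a CUDA device is available.
model = AutoModelForMaskedLM.from_pretrained(
    "LSX-UniWue/ModernGBERT_134M",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")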
Since ModernGBERT 134M is a Masked Language Model (MLM), you can load it via AutoModelForMaskedLM. For downstream tasks such as classification, retrieval, or QA, fine-tune the model by following standard BERT fine-tuning recipes.
Example using AutoModelForMaskedLM:
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "LSX-UniWue/ModernGBERT_134M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "Die Hauptstadt von Frankreich ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris
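For downstream tasks, the same checkpoint can be loaded with a task-specific head. A minimal sketch for sequence classification (num_labels and the example sentence are illustrative, not from the model card; the freshly initialized head must be fine-tuned before its predictions are meaningful):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "LSX-UniWue/ModernGBERT_134M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the encoder with a randomly initialized classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, num_labels)
# Fine-tune with the Trainer API or a standard PyTorch loop, as in any BERT recipe.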
NOTE: If you want to use Hugging Face's PEFT library for LoRA training, you need to specify the target modules explicitly, e.g.:
from peft import LoraConfig, get_peft_model
peft_config = LoraConfig(
    task_type="TOKEN_CLS", r=8, lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
model = get_peft_model(model, peft_config)
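After wrapping the model, you can confirm that only the LoRA adapter weights are trainable (print_trainable_parameters is a standard PEFT helper):

# Prints the number of trainable parameters versus the total parameter count.
model.print_trainable_parameters()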
Intermediate Checkpoints
In addition to the final model checkpoint, we publish intermediate checkpoints throughout the full training process as unique branches in this repository. A specific checkpoint can be loaded like this:
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "LSX-UniWue/ModernGBERT_134M"
revision = "base-100000-ckpt"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision)
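To discover which checkpoint branches exist without browsing the repository, the huggingface_hub library can list the repository's refs (a small sketch; each intermediate checkpoint is published as its own branch):

from huggingface_hub import list_repo_refs

refs = list_repo_refs("LSX-UniWue/ModernGBERT_134M")
# Each intermediate checkpoint is a branch, e.g. "base-100000-ckpt".
print(sorted(branch.name for branch in refs.branches))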
Performance
We evaluate our models across a broad range of tasks. For natural language understanding, we use the SuperGLEBer benchmark, and for embedding capabilities, we use the German MTEB benchmark (after unsupervised fine-tuning of every model on the German mMARCO portion). The following table provides a comparison of this encoder with other German and multilingual encoders. See our preprint for more details about the evaluation.
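For embedding-style use as evaluated on MTEB, a common approach is to load the plain encoder via AutoModel and mean-pool the last hidden states over non-padding tokens. A minimal sketch (standard mean pooling; not necessarily the exact setup used in our evaluation):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "LSX-UniWue/ModernGBERT_134M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Das Wetter ist heute schön.", "Heute scheint die Sonne."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Average the token embeddings, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())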
| Model | SuperGLEBer Avg | MTEB Avg |
|---|---|---|
| ModernGBERT 1B | 0.808 | 0.551 |
| ModernGBERT 134M (you are here) | 0.749 | 0.501 |
| GBERT-base | 0.718 | 0.500 |
| GBERT-large | 0.768 | 0.521 |
| GeBERTa-base | 0.716 | 0.493 |
| GeBERTa-large | 0.749 | 0.494 |
| GeBERTa-xlarge | 0.767 | 0.521 |
| Gerturax-3 | 0.740 | 0.472 |
| XLM-RoBERTa-large | 0.730 | 0.460 |
| XLM-RoBERTa-xlarge | 0.758 | 0.479 |
License
We release the ModernGBERT models under a research-only RAIL-M license. See license.md for details.