ModernGBERT 134M
ModernGBERT 134M is a German ModernBERT language model with 134 million parameters and a native context length of up to 8,192 tokens. It follows the BERT-style architecture and training procedure of the ModernBERT codebase and was pre-trained on 470 billion tokens from the German portion of RedPajama V2, a subset of the same training data used for our LLäMmlein decoder family.
We provide two model sizes:
- ModernGBERT 1B: 28 layers, hidden size 2,048, 1 billion parameters
- ModernGBERT 134M (← you are here): 22 layers, hidden size 768, 134 million parameters
Find more details in our preprint!
Usage
You can use ModernGBERT with the transformers library from version v4.48.0 onwards. (Optional: install flash-attn to achieve highest efficiency.)
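If flash-attn is available, you can also request it explicitly when loading the model. A minimal sketch (attn_implementation and torch_dtype are standard from_pretrained arguments; FlashAttention-2 requires a half-precision dtype on a GPU):

import torch
from transformers import AutoModelForMaskedLM

# Request FlashAttention-2 explicitly; assumes flash-attn is installed and a CUDA device is available.
model = AutoModelForMaskedLM.from_pretrained(
    "LSX-UniWue/ModernGBERT_134M",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")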
Since ModernGBERT 134M is a Masked Language Model (MLM), you can load it via AutoModelForMaskedLM. For downstream tasks such as classification, retrieval, or QA, fine-tune the model by following standard BERT fine-tuning recipes.
Example using AutoModelForMaskedLM:
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "LSX-UniWue/ModernGBERT_134M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "Die Hauptstadt von Frankreich ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris
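For downstream tasks, the same checkpoint can be loaded with a task-specific head. A minimal sketch for sequence classification (num_labels and the example sentence are illustrative, not from the model card; the freshly initialized head must be fine-tuned before its predictions are meaningful):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "LSX-UniWue/ModernGBERT_134M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the encoder with a randomly initialized classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, num_labels)
# Fine-tune with the Trainer API or a standard PyTorch loop, as in any BERT recipe.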
NOTE: If you want to use Hugging Face's PEFT library for LoRA training, you need to specify the target modules explicitly, e.g.:
from peft import LoraConfig, get_peft_model
peft_config = LoraConfig(
    task_type="TOKEN_CLS", r=8, lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
model = get_peft_model(model, peft_config)
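After wrapping the model, you can confirm that only the LoRA adapter weights are trainable (print_trainable_parameters is a standard PEFT helper):

# Prints the number of trainable parameters versus the total parameter count.
model.print_trainable_parameters()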
Intermediate Checkpoints
In addition to the final model checkpoint, we publish intermediate checkpoints throughout the full training process as unique branches in this repository. A specific checkpoint can be loaded like this:
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "LSX-UniWue/ModernGBERT_134M"
revision = "base-100000-ckpt"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision)
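To discover which checkpoint branches exist without browsing the repository, the huggingface_hub library can list the repository's refs (a small sketch; each intermediate checkpoint is published as its own branch):

from huggingface_hub import list_repo_refs

refs = list_repo_refs("LSX-UniWue/ModernGBERT_134M")
# Each intermediate checkpoint is a branch, e.g. "base-100000-ckpt".
print(sorted(branch.name for branch in refs.branches))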
Performance
We evaluate our models across a broad range of tasks. For natural language understanding, we use the SuperGLEBer benchmark, and for embedding capabilities, we use the German MTEB benchmark (after unsupervised fine-tuning of every model on the German mMARCO portion). The following table provides a comparison of this encoder with other German and multilingual encoders. See our preprint for more details about the evaluation.
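For embedding-style use as evaluated on MTEB, a common approach is to load the plain encoder via AutoModel and mean-pool the last hidden states over non-padding tokens. A minimal sketch (standard mean pooling; not necessarily the exact setup used in our evaluation):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "LSX-UniWue/ModernGBERT_134M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Das Wetter ist heute schön.", "Heute scheint die Sonne."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Average the token embeddings, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())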
| Model | SuperGLEBer Avg | MTEB Avg |
|---|---|---|
| ModernGBERT 1B | 0.808 | 0.551 |
| ModernGBERT 134M (you are here) | 0.749 | 0.501 |
| GBERT-base | 0.718 | 0.500 |
| GBERT-large | 0.768 | 0.521 |
| GeBERTa-base | 0.716 | 0.493 |
| GeBERTa-large | 0.749 | 0.494 |
| GeBERTa-xlarge | 0.767 | 0.521 |
| Gerturax-3 | 0.740 | 0.472 |
| XLM-RoBERTa-large | 0.730 | 0.460 |
| XLM-RoBERTa-xlarge | 0.758 | 0.479 |
License
We release the ModernGBERT models under a research-only RAIL-M license. See license.md for details.