# Encoder Model from "Should We Still Pretrain Encoders with Masked Language Modeling?"
This repository contains an encoder model, part of the research presented in the paper "Should We Still Pretrain Encoders with Masked Language Modeling?".
This paper investigates the effectiveness of Masked Language Modeling (MLM) versus Causal Language Modeling (CLM) for pretraining text encoders to achieve high-quality text representations. It demonstrates that while MLM generally yields better performance, CLM-trained models are more data-efficient. The research further proposes a biphasic training strategy that sequentially applies CLM and then MLM, achieving optimal performance under a fixed computational budget.
- Paper: Should We Still Pretrain Encoders with Masked Language Modeling?
- Project Page: https://hf.co/MLMvsCLM
- Code: https://github.com/Nicolas-BZRD/EuroBERT
## Model Description
This model is an encoder designed to produce robust text representations for a wide range of natural language processing tasks. It is trained as part of an extensive study on encoder pretraining objectives, focusing on the trade-offs and benefits of MLM and CLM, and the effectiveness of a biphasic training approach. The model architecture is `SLModel`, as identified in the `config.json`.
## Usage
You can use this model for feature extraction with the Hugging Face `transformers` library. Since this model may use a custom architecture (`SLModel`), you may need to install the associated `EuroBERT` package and pass `trust_remote_code=True` when loading the model.
First, install the `EuroBERT` package:
```bash
pip install git+https://github.com/Nicolas-BZRD/EuroBERT.git
```
Then, you can load and use the model as follows:
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Replace with the actual model ID if different, e.g., "AhmedAliHassan/MLMvsCLM-Biphasic-210M".
# This placeholder assumes the current repository is the model you want to load.
model_name = "<YOUR_MODEL_ID_HERE>"

# Load the tokenizer and model; trust_remote_code is required for custom architectures.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "This is an example sentence to extract features from."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The last hidden state contains the token embeddings (features).
last_hidden_state = outputs.last_hidden_state
print(f"Shape of last hidden state: {last_hidden_state.shape}")

# For sentence-level embeddings, common approaches include:
# 1. Averaging the token embeddings (excluding padding tokens)
# 2. Using the embedding of the [CLS] token (if applicable for the model's architecture)

# Example: mean pooling (average over non-padding tokens)
attention_mask = inputs["attention_mask"]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
mean_pooled_embedding = sum_embeddings / sum_mask
print(f"Shape of mean pooled embedding: {mean_pooled_embedding.shape}")
```