Encoder Model from "Should We Still Pretrain Encoders with Masked Language Modeling?"

This repository contains an encoder model, part of the research presented in the paper "Should We Still Pretrain Encoders with Masked Language Modeling?".

This paper investigates the effectiveness of Masked Language Modeling (MLM) versus Causal Language Modeling (CLM) for pretraining text encoders to achieve high-quality text representations. It demonstrates that while MLM generally yields better performance, CLM-trained models are more data-efficient. The research further proposes a biphasic training strategy that sequentially applies CLM and then MLM, achieving optimal performance under a fixed computational budget.
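
The two objectives differ mainly in how training labels are built from raw text: CLM predicts each next token from a left-to-right (causal) view, while MLM hides a fraction of tokens and predicts them from bidirectional context. The snippet below is a minimal illustration of that difference using the generic Hugging Face DataCollatorForLanguageModeling; the bert-base-uncased tokenizer is a stand-in chosen for illustration only, and the snippet does not reproduce this model's actual pretraining setup.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Stand-in tokenizer purely for illustration; not the tokenizer shipped with this model.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
examples = [tok("A short example sentence.")]

# CLM-style labels: no masking; the model learns to predict each next token.
clm_collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)
print(clm_collator(examples)["labels"])

# MLM-style labels: a fraction of tokens is replaced by [MASK] and must be recovered.
# 15% is the library default; the ratio used for this checkpoint is not asserted here.
mlm_collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True, mlm_probability=0.15)
print(mlm_collator(examples)["labels"])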

Model Description

This model is an encoder designed to produce robust text representations for a wide range of natural language processing tasks. It was trained as part of an extensive study of encoder pretraining objectives, focusing on the trade-offs between MLM and CLM and on the effectiveness of a biphasic training approach. The architecture is identified as SLModel in the repository's config.json.
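
If you want to confirm the architecture before downloading the full weights, you can inspect the configuration directly. This is a quick, optional check rather than part of the official usage; the model ID below is taken from the collection name, and trust_remote_code is required because the architecture is custom (see the Usage section for the accompanying EuroBERT package install).

from transformers import AutoConfig

config = AutoConfig.from_pretrained("MLMvsCLM/610m-clm-11k-mlm40-22k", trust_remote_code=True)
print(config.model_type)      # architecture identifier stored in config.json
print(config.architectures)   # model class name(s); expected to list the SLModel class mentioned above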

Usage

You can use this model for feature extraction with the Hugging Face transformers library. Because the model uses a custom architecture (SLModel), install the associated EuroBERT package and pass trust_remote_code=True when loading it.

First, install the EuroBERT package:

pip install git+https://github.com/Nicolas-BZRD/EuroBERT.git

Then, you can load and use the model as follows:

from transformers import AutoTokenizer, AutoModel
import torch

# Model ID for this repository, taken from the collection name (MLMvsCLM/610m-clm-11k-mlm40-22k)
model_name = "MLMvsCLM/610m-clm-11k-mlm40-22k"

# Load the tokenizer and model, ensuring trust_remote_code for custom architectures
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "This is an example sentence to extract features from."

inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The last hidden state contains the token embeddings (features)
last_hidden_state = outputs.last_hidden_state
print(f"Shape of last hidden state: {last_hidden_state.shape}")

# For sentence-level embeddings, common approaches include:
# 1. Averaging the token embeddings (excluding special tokens)
# 2. Using the embedding of the [CLS] token (if applicable for the model's architecture)
# Example: Mean pooling (simple average over non-padding tokens)
attention_mask = inputs["attention_mask"]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
mean_pooled_embedding = sum_embeddings / sum_mask
print(f"Shape of mean pooled embedding: {mean_pooled_embedding.shape}")