GRADIEND Gender-Debiased BERT
This model is a gender-debiased version of bert-large-cased, produced with GRADIEND, a gradient-based debiasing method that updates the model weights via a learned representation, eliminating the need for additional pretraining.
Model Sources
- Repository: https://github.com/aieng-lab/gradiend
- Paper: https://arxiv.org/abs/2502.01406
Uses
This model is intended for use in applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).
Bias, Risks, and Limitations
While the model is designed to reduce gender bias, the debiasing effect is not perfect. The model is less gender-biased than the original bert-large-cased, but:
- Residual gender bias remains; a quick probe is sketched after this list.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.
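As a minimal sanity check for residual bias, one can compare the probabilities the model assigns to gendered pronouns in a stereotyped template. This is a hedged sketch: the template and the probed words are illustrative choices, not an official benchmark.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Probe residual bias: compare P(He) and P(She) at the masked position
model_id = "aieng-lab/bert-large-cased-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("[MASK] worked as a nurse.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, mask_pos].softmax(dim=-1)
for word in ["He", "She"]:
    token_id = tokenizer.convert_tokens_to_ids(word)
    print(f"P({word}) = {probs[token_id]:.4f}")  # closer values suggest weaker gender preference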
How to Get Started with the Model
Use the code below to get started with the model.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/bert-large-cased-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Run masked language modeling on an example sentence
input_text = "The woman worked as a [MASK]."
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Select the logits at the [MASK] position and take the most likely token
mask_logits = logits[0, inputs["input_ids"][0] == tokenizer.mask_token_id]
predicted_token_id = torch.argmax(mask_logits, dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted token: {predicted_token}")
Example outputs for our model and comparisons with the original model's outputs can be found in Appendix F of our paper.
Training Details
Training Procedure
Unlike traditional debiasing methods based on specialized pretraining (e.g., CDA and Dropout) or post-processing (e.g., INLP, RLACE, LEACE, SelfDebias, SentenceDebias), this model was debiased with GRADIEND, which learns a representation that is then used to update the original model weights, yielding a debiased version. See Section 3 of the GRADIEND paper for the full methodology; a rough sketch of the update step follows.
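For intuition only, the sketch below shows the general shape of such a weight update. The names dec, h, and alpha are illustrative stand-ins for the trained gradient decoder, its latent input, and a scaling factor; they are not the paper's API, so consult the paper and repository for the actual training and update procedure.

import torch

def apply_gradiend_update(model, dec, h, alpha):
    # Assumption: dec maps a one-dimensional latent value to a single flat
    # weight-space update vector covering all edited parameters.
    with torch.no_grad():
        update = dec(torch.tensor([h]))
        offset = 0
        for param in model.parameters():
            n = param.numel()
            param.add_(alpha * update[offset:offset + n].view_as(param))
            offset += n
    return model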
GRADIEND Training Data
Evaluation
The model has been evaluated on:
- Gender Bias Metrics: SEAT, the Stereotype Score (SS) of StereoSet, and CrowS-Pairs (CrowS)
- Language Modeling Metrics: the Language Modeling Score (LMS) of StereoSet and GLUE
Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including CDA, Dropout, INLP, RLACE, LEACE, SelfDebias, and SentenceDebias.
See Appendix D.2 and Table 11 of the paper for full results.
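For intuition about the bias metrics, here is a minimal sketch of a SEAT-style effect size computed from mean-pooled sentence embeddings, following the WEAT/SEAT formulation. The sentence sets below are toy examples, not the official SEAT templates, so the resulting number is only illustrative.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "aieng-lab/bert-large-cased-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

def embed(sentences):
    # Mean-pool the last hidden state as a simple sentence embedding
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def association(w, A, B):
    # s(w, A, B): mean cosine similarity to A minus mean cosine similarity to B
    return F.cosine_similarity(w, A).mean() - F.cosine_similarity(w, B).mean()

def effect_size(X, Y, A, B):
    sx = torch.stack([association(x.unsqueeze(0), A, B) for x in X])
    sy = torch.stack([association(y.unsqueeze(0), A, B) for y in Y])
    return ((sx.mean() - sy.mean()) / torch.cat([sx, sy]).std()).item()

# Toy target (X, Y) and attribute (A, B) sentence sets
X = embed(["This is a man.", "He is here."])
Y = embed(["This is a woman.", "She is here."])
A = embed(["This is a career.", "This is an office."])
B = embed(["This is a family.", "This is a home."])
print(f"SEAT-style effect size: {effect_size(X, Y, A, B):.3f}")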
Citation
If you use this model or GRADIEND in your work, please cite:
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
  title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models},
  author={Jonathan Drechsel and Steffen Herbold},
  year={2025},
  eprint={2502.01406},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01406},
}
Base model: google-bert/bert-large-cased