LGBeTO_detection_Model
This is the LGBeTO model, a fine-tuned version of dccuchile/bert-base-spanish-wwm-uncased (Cañete et al., 2023). It achieves the following results on the evaluation set:
- Accuracy: 0.835
- F1: 0.8533
- Precision: 0.8205
- Recall: 0.8889
Authors
- Developed by: Claudia Martínez-Araneda, Mariella Gutiérrez V., Pedro Gómez M., Diego Maldonado M., Alejandra Segura N., Christian Vidal-Castro
- Model type: BERT-based model for text classification and sentiment analysis.
- Language(s) (NLP): Spanish
- License: CC BY 4.0
- Finetuned from model: BETO (Cañete et al., 2023)
Cite as:
@misc{claudia_martínez-araneda_2025,
  author    = {Claudia Martínez-Araneda and Mariella Gutiérrez V. and Pedro Gómez M. and Diego Maldonado M. and Alejandra Segura N. and Christian Vidal-Castro},
  title     = {LGBeTO_detection_Model (Revision a8b5b38)},
  year      = 2025,
  url       = {https://huggingface.co/LaProfeClaudis/LGBeTO_detection_Model},
  doi       = {10.57967/hf/5406},
  publisher = {Hugging Face}
}
Model description
LGBeTO was designed to detect discriminatory or hateful language directed toward the LGBTQIA+ community, aiming to support safer and more inclusive online environments.
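A minimal inference sketch using the transformers pipeline API is shown below. The repository id is taken from this card; the label names returned by the pipeline are not documented here, so the example output is only indicative.

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hub (repository id from this card).
classifier = pipeline(
    "text-classification",
    model="LaProfeClaudis/LGBeTO_detection_Model",
)

# Label names (e.g. LABEL_0 / LABEL_1) are not documented in this card and may differ.
print(classifier("Texto de ejemplo a clasificar."))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}]
```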
Intended uses & limitations
This model was created for a study conducted strictly for academic and research purposes. The targets of hate speech have been anonymised, and there is no intent to harm the perpetrators in any way. We prioritise protecting the privacy and confidentiality of vulnerable individuals: identifying data such as user IDs, phone numbers, and addresses was carefully removed before the data was shared with our annotators. All of the data collected comes from public sources.
As authors, we affirm our deep respect for all individuals and explicitly state that we have no intention of prejudicing, biasing, or disrespecting the LGBTQIA+ community or any group. Our work seeks to contribute constructively to inclusive and ethical research in artificial intelligence.
Training and evaluation data
LGBeTO was fine-tuned on comments collected from digital media such as Twitter, Instagram, YouTube, and other websites. The dataset is available in the Zenodo repository.
Cite as: Martínez-Araneda, C., Maldonado Montiel, D., Gutiérrez Valenzuela, M., Gómez Meneses, P., Segura Navarrete, A., & Vidal-Castro, C. (2025). LGBTQIAphobia dataset (augmented and balanced) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15385622
Training procedure
- Step 1: Load the dataset
- Step 2: Tokenization and model generation
- Step 3: Train-validation split
- Step 4: Training configuration
- Step 5: Training and evaluation (see the sketches below)
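A minimal sketch of steps 1-3 follows. The exact preprocessing script is not published in this card, so the CSV file name, the `text`/`label` column names, the maximum sequence length, and the split ratio are assumptions; the seed (42) matches the hyperparameters listed below.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Step 1: load the dataset (hypothetical local CSV export of the Zenodo data).
dataset = load_dataset("csv", data_files="lgbtqiaphobia_dataset.csv")["train"]

# Step 2: tokenize with the tokenizer of the BETO base model.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

def tokenize(batch):
    # max_length=128 is an assumption; it is not reported in this card.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Step 3: train/validation split (seed 42 as reported below; the 80/20 ratio is an assumption).
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```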
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW (ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- num_epochs: 3
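The sketch below mirrors steps 4-5 with the hyperparameters listed above. `train_ds`, `eval_ds`, and `tokenizer` refer to the previous sketch; `output_dir` and `num_labels=2` are assumptions not stated in this card.

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Classification head on top of BETO (num_labels=2 is an assumption).
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-uncased", num_labels=2
)

# Step 4: training configuration with the reported hyperparameters.
training_args = TrainingArguments(
    output_dir="lgbeto-detection",   # hypothetical output directory
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    seed=42,
    eval_strategy="epoch",           # one evaluation per epoch, as in the table below
)

# Step 5: training and evaluation. The Trainer defaults to the torch AdamW
# optimizer with the betas/epsilon listed above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)

trainer.train()
print(trainer.evaluate())
```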
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     | Precision | Recall |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.4655        | 1.0   | 50   | 0.5517          | 0.755    | 0.7538 | 0.8242    | 0.6944 |
| 0.1928        | 2.0   | 100  | 0.4830          | 0.825    | 0.8523 | 0.7829    | 0.9352 |
| 0.0718        | 3.0   | 150  | 0.5393          | 0.835    | 0.8533 | 0.8205    | 0.8889 |
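The per-epoch columns correspond to standard classification metrics. The exact metric implementation is not specified in this card; a typical compute_metrics function, assuming binary labels with F1, precision, and recall computed on the positive class, looks like this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair produced by the Trainer at evaluation time.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
    }

# Passing compute_metrics=compute_metrics to the Trainer sketched above would
# produce these columns at every evaluation step.
```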
Framework versions
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1