roberta_nli_ensemble
A fine-tuned RoBERTa model for the Natural Language Inference (NLI) task: given a premise and a hypothesis, it classifies the relationship between the two sentences.
Model Details
Model Description
This model builds upon the roberta-base architecture, adding a multi-layer classification head for NLI. It computes average-pooled representations of the premise and hypothesis tokens (identified via `token_type_ids`), concatenates them, and passes the result through additional linear and non-linear layers. The final output classifies the sentence pair into one of three classes.
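For concreteness, the pooling-and-concatenation head described above could look roughly like the PyTorch sketch below. The class name, layer sizes, activation, dropout, and default label count are assumptions for illustration; they are not the exact `roBERTaClassifier` implementation.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel


class RobertaNLIClassifier(nn.Module):
    """Illustrative sketch of the described architecture (details are assumptions)."""

    def __init__(self, num_labels: int = 3):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        hidden = self.encoder.config.hidden_size  # 768 for roberta-base
        # The concatenated premise + hypothesis vectors have size 2 * hidden.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.Tanh(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, hidden)

        # Average-pool premise tokens (token_type_id == 0) and hypothesis
        # tokens (token_type_id == 1) separately, ignoring padding.
        mask = attention_mask.unsqueeze(-1).float()
        premise_mask = mask * (token_type_ids == 0).unsqueeze(-1).float()
        hypothesis_mask = mask * (token_type_ids == 1).unsqueeze(-1).float()
        premise_vec = (hidden_states * premise_mask).sum(1) / premise_mask.sum(1).clamp(min=1e-9)
        hypothesis_vec = (hidden_states * hypothesis_mask).sum(1) / hypothesis_mask.sum(1).clamp(min=1e-9)

        # Concatenate the two pooled vectors and classify.
        return self.head(torch.cat([premise_vec, hypothesis_vec], dim=-1))
```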
- Developed by: Dev Soneji and Patrick Mermelstein Lyons
- Language(s): English
- Model type: Supervised
- Model architecture: RoBERTa encoder with a multi-layer classification head
- Finetuned from model: roberta-base
Model Resources
- Repository: Devtrick/roberta_nli_ensemble
- Paper or documentation: RoBERTa: A Robustly Optimized BERT Pretraining Approach
Training Details
Training Data
The model was trained on a dataset located in train.csv. This dataset consists of 24K premise-hypothesis pairs, each with a label indicating whether the hypothesis is true given the premise. The label is binary: 0 = hypothesis is false, 1 = hypothesis is true. No further details were given on the origin and validity of this dataset.
The data was passed through a tokenizer (AutoTokenizer) from the Hugging Face Transformers library. No other pre-processing was done, aside from relabelling columns to match the expected format.
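As an illustration, the preprocessing could be reproduced roughly as follows. The column names ("premise", "hypothesis", "label") and the maximum sequence length are assumptions, since they are not documented here.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
dataset = load_dataset("csv", data_files={"train": "train.csv"})

def tokenize(batch):
    # Tokenize each premise/hypothesis pair together so the model sees
    # both sentences in a single input sequence.
    return tokenizer(
        batch["premise"],
        batch["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=128,  # assumption; the actual maximum length is not documented
    )

tokenized = dataset.map(tokenize, batched=True)
# Rename the label column to the name expected by the Trainer.
tokenized = tokenized.rename_column("label", "labels")
```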
Training Procedure
The model was trained in the following way:
- The model was trained on the data described under Training Data, after column renaming and tokenization.
- The model was initialised with a custom configuration class, `roBERTaConfig`, which sets the essential parameters. The model itself, `roBERTaClassifier`, extends the pretrained RoBERTa model with multiple linear layers for pooling and classification.
- Hyperparameter selection was carried out in a separate grid search to identify the best-performing hyperparameters, resulting in the values listed under Training Hyperparameters.
- The model was validated on the data described under Testing Data, producing the results reported under Results.
- Checkpoints were saved after each epoch, and finally the best checkpoint was reloaded and pushed to the Hugging Face Hub.
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 128
- eval_batch_size: 128
- weight_decay: 0.01
- seed: 42
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 10
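For reference, these values might map onto `TrainingArguments` roughly as sketched below. The evaluation/saving strategy and `load_best_model_at_end` are assumptions based on the per-epoch checkpointing described above, not documented settings.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta_nli_ensemble",
    learning_rate=3e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=10,
    eval_strategy="epoch",       # evaluate and checkpoint each epoch (assumption)
    save_strategy="epoch",
    load_best_model_at_end=True,  # reload the best checkpoint after training
)
```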
Speeds, Sizes, Times
- Training time: this model took 12 minutes 17 seconds to train on the hardware specified below. Training was configured for 10 epochs, but early stopping ended training after 5 epochs.
- Model size: 126M parameters.
Evaluation
Testing Data & Metrics
Testing Data
The development (and effectively testing) dataset is located in dev.csv. It contains 6K premise-hypothesis pairs used as validation data, in the same format as the training data. No further details were given on the origin and validity of this dataset.
The data was passed through a tokenizer (AutoTokenizer) from the Hugging Face Transformers library. No other pre-processing was done, aside from relabelling columns to match the expected format.
Metrics
- Accuracy: Proportion of correct predictions.
- Matthews Correlation Coefficient (MCC): Correlation coefficient between predicted and true labels, ranging from -1 to 1.
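A `compute_metrics` function producing these two metrics could look like the following sketch; the exact implementation used for this card is not published.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair passed by the Trainer.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "mcc": matthews_corrcoef(labels, predictions),
    }
```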
Results
Final results on the evaluation set:
- Loss: 0.4849
- Accuracy: 0.8848
- MCC: 0.7695
| Training Loss | Epoch | Step | Validation Loss | Accuracy | MCC |
|---|---|---|---|---|---|
| 0.6552 | 1.0 | 191 | 0.3383 | 0.8685 | 0.7377 |
| 0.2894 | 2.0 | 382 | 0.3045 | 0.8778 | 0.7559 |
| 0.1891 | 3.0 | 573 | 0.3255 | 0.8854 | 0.7705 |
| 0.1209 | 4.0 | 764 | 0.3963 | 0.8829 | 0.7657 |
| 0.0843 | 5.0 | 955 | 0.4849 | 0.8848 | 0.7695 |
Technical Specifications
Hardware
The model was trained on a PC with the following specifications:
- CPU: AMD Ryzen 7 7700X
- GPU: NVIDIA GeForce RTX 5070 Ti
- Memory: 32GB DDR5
- Motherboard: MSI MAG B650 TOMAHAWK WIFI
Software
- Transformers 4.50.2
- Pytorch 2.8.0.dev20250326+cu128
- Datasets 3.5.0
- Tokenizers 0.21.1
Bias, Risks, and Limitations
- The model's performance and biases depend on the data it was trained on; however, since nothing is known about the data's origin, potential biases cannot be assessed.
- There is a risk in trusting the model's labels without manual verification: the model can make mistakes, so outputs should be checked.
- The model is limited by training data that cannot cover all possible premise-hypothesis combinations that could occur in real use. Additional training and validation data would have been useful.
Additional Information
- This model was pushed to the Hugging Face Hub with `trainer.push_to_hub()` after training locally.
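A hypothetical inference sketch is shown below. Because the model relies on a custom configuration and classifier class, loading via `AutoModelForSequenceClassification` is an assumption rather than a documented usage path for this repository; it may require `trust_remote_code=True` (if the custom code is published with the weights) or importing the custom classes locally.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Devtrick/roberta_nli_ensemble")
model = AutoModelForSequenceClassification.from_pretrained(
    "Devtrick/roberta_nli_ensemble",
    trust_remote_code=True,  # assumption: needed only if custom classes are hosted with the model
)

# Example premise/hypothesis pair (illustrative only).
inputs = tokenizer(
    "A man is playing a guitar.",
    "A person is making music.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted class index
```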