metadata

license: apache-2.0
base_model: distilroberta-base
tags:
  - generated_from_trainer
  - rejection
  - no_answer
  - chatgpt
metrics:
  - accuracy
  - recall
  - precision
  - f1
model-index:
  - name: distilroberta-base-rejection-v1
    results: []
language:
  - en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
  - argilla/notus-uf-dpo-closest-rejected

Model Card: distilroberta-base-rejection-v1

This model was originally developed by Protect AI. It is a fine-tuned version of distilroberta-base, trained on multiple datasets that combine rejection responses from LLMs with standard outputs from RLHF datasets.

The model detects cases where an LLM rejects a prompt that did not pass content moderation. It classifies each response into one of two categories:

0: Normal output
1: Rejection detected
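
As a rough illustration of how the two class ids relate to raw classifier outputs, the sketch below applies a softmax and an argmax to a pair of logits. The `ID2LABEL` mapping and the `decode_logits` helper are hypothetical names introduced here for illustration; the label strings stored in the published model's config may differ.

```python
import math

# Hypothetical id-to-name mapping, copied from the two classes listed above.
ID2LABEL = {0: "Normal output", 1: "Rejection detected"}


def decode_logits(logits):
    """Convert a pair of raw logits into (label, probability) via softmax + argmax."""
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability (first index wins ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits, used only to show the call shape.
label, prob = decode_logits([-1.2, 2.3])
print(label, round(prob, 3))
```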

On the evaluation set, the model achieves:

Loss: 0.0544
Accuracy: 0.9887
Recall: 0.9810
Precision: 0.9279
F1 Score: 0.9537
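
The reported F1 score can be cross-checked against the precision and recall above, since F1 is their harmonic mean. A minimal sketch (the `f1_score` helper is defined here for illustration and is not part of the model's code):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation results listed above.
print(round(f1_score(0.9279, 0.9810), 4))  # 0.9537
```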

Model Details

Developed & fine-tuned by: ProtectAI.com
Base model: distilroberta-base
Language(s): English
License: Apache 2.0
Task: Text classification (Rejection detection)

Intended Use & Limitations

The model is designed to identify rejection responses in LLM outputs, particularly where a refusal or safeguard message is generated.

Limitations:

Performance depends on the quality and domain of the training data.
May underperform on text styles or topics underrepresented in training.
Because it is based on distilroberta-base, the model is case-sensitive.

Usage

With Hugging Face Transformers

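from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline; inputs longer than 512 tokens are truncated.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    # Use the GPU when one is available, otherwise fall back to the CPU.
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# The pipeline returns a list of dicts, each with a "label" and a "score".
print(classifier("Sorry, but I can't assist with that."))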
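
The pipeline returns a list of dictionaries with `label` and `score` keys. One possible way to turn those into a yes/no decision is sketched below. The label names `"REJECTION"` and `"LABEL_1"` and the `0.5` threshold are assumptions made for illustration; check `model.config.id2label` for the names the model actually emits and adjust accordingly. The sample outputs are made up and were not produced by the model.

```python
def is_rejection(result, rejection_labels=("REJECTION", "LABEL_1"), threshold=0.5):
    """Return True when a pipeline result names a rejection label with enough confidence.

    `rejection_labels` and `threshold` are illustrative assumptions, not values
    documented for this model.
    """
    return result["label"] in rejection_labels and result["score"] >= threshold


# Made-up results in the shape the text-classification pipeline returns.
sample_outputs = [
    {"label": "REJECTION", "score": 0.99},
    {"label": "NORMAL", "score": 0.97},
    {"label": "REJECTION", "score": 0.41},
]

flags = [is_rejection(r) for r in sample_outputs]
print(flags)  # [True, False, False]
```

Raising the threshold trades recall for precision, which may matter when a false rejection flag is costly downstream.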