metadata
license: apache-2.0
base_model: distilroberta-base
tags:
  - generated_from_trainer
  - rejection
  - no_answer
  - chatgpt
metrics:
  - accuracy
  - recall
  - precision
  - f1
model-index:
  - name: distilroberta-base-rejection-v1
    results: []
language:
  - en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
  - argilla/notus-uf-dpo-closest-rejected

Model Card: distilroberta-base-rejection-v1

This model was originally developed and fine-tuned by Protect AI. It is a fine-tuned version of distilroberta-base, trained on multiple datasets that combine rejection responses from LLMs with standard outputs from RLHF datasets.

The model detects rejections: responses in which an LLM declines to answer because the prompt did not pass content moderation. It classifies each response into one of two categories:

  • 0: Normal output
  • 1: Rejection detected
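
As a minimal sketch of how the two class ids relate to raw classifier outputs, the snippet below applies a softmax to a pair of logits and picks the more probable class. The logit values and the `classify_logits` helper are made up for illustration; only the id-to-meaning mapping comes from this card.

```python
import math

# Mapping taken from the category list above.
ID2MEANING = {0: "Normal output", 1: "Rejection detected"}

def classify_logits(logits):
    """Return (class_id, probability) for a pair of raw logits."""
    # Numerically stable softmax over the two logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the index with the highest probability.
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return class_id, probs[class_id]

# Hypothetical logits, not real model output.
class_id, prob = classify_logits([-2.0, 3.0])
print(class_id, ID2MEANING[class_id], round(prob, 4))
```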

On the evaluation set, the model achieves:

  • Loss: 0.0544
  • Accuracy: 0.9887
  • Recall: 0.9810
  • Precision: 0.9279
  • F1 Score: 0.9537
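
The reported F1 score can be cross-checked from the precision and recall above, since F1 is their harmonic mean. A quick sanity check, using only the numbers listed in this card:

```python
precision = 0.9279
recall = 0.9810

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # 0.9537
```

The result agrees with the reported F1 to four decimal places. Recall being higher than precision suggests that, on this evaluation set, the model misses few rejections but flags some normal outputs as rejections.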

Model Details

  • Developed & fine-tuned by: ProtectAI.com
  • Base model: distilroberta-base
  • Language(s): English
  • License: Apache 2.0
  • Task: Text classification (Rejection detection)

Intended Use & Limitations

The model is designed to identify rejection responses in LLM outputs, particularly where a refusal or safeguard message is generated.

Limitations:

  • Performance depends on the quality and domain of the training data.
  • May underperform on text styles or topics underrepresented in training.
  • Being based on distilroberta-base, it is case-sensitive.

Usage

With Hugging Face Transformers

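from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline; inputs longer than 512 tokens are truncated.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    # Use the GPU when one is available, otherwise fall back to the CPU.
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Prints a list with one dict holding the predicted "label" and its "score".
print(classifier("Sorry, but I can't assist with that."))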
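
In application code, the pipeline result usually needs to be reduced to a yes/no decision. The helper below is a sketch, not part of the model's API: `is_rejection`, the `"REJECTION"` and `"NORMAL"` label strings, and the thresholds are assumptions for illustration. Check `model.config.id2label` for the label names the checkpoint actually uses. The sample results are hand-written stand-ins for pipeline output, so the snippet runs without downloading the model.

```python
def is_rejection(result, rejection_label="REJECTION", threshold=0.5):
    """Decide whether one text-classification result is a rejection.

    `result` is a dict with "label" and "score" keys, the shape the
    text-classification pipeline returns for each input.
    """
    return result["label"] == rejection_label and result["score"] >= threshold

# Hand-written stand-ins for pipeline output (label names are assumed).
samples = [
    {"label": "REJECTION", "score": 0.98},
    {"label": "NORMAL", "score": 0.97},
    {"label": "REJECTION", "score": 0.80},
]

# A stricter threshold trades recall for fewer false alarms.
flags = [is_rejection(r, threshold=0.9) for r in samples]
print(flags)  # [True, False, False]
```

With the real pipeline, the equivalent call for a single string would be `is_rejection(classifier(text)[0])`, since the pipeline returns a list with one result per input.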