BERT Probe for Unsafe Reasoning Detection

This model is a BERT-based probe trained to detect "unsafe" reasoning patterns in mathematical problem-solving.

Model Details

  • Base Model: bert-base-uncased
  • Task: Binary classification (safe vs unsafe reasoning)
  • Training: Fine-tuned on mathematical reasoning examples
  • Use Case: Research into AI safety and reasoning patterns

Usage

from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("ksw1/bert-probe-unsafe-reasoning")
model = BertForSequenceClassification.from_pretrained("ksw1/bert-probe-unsafe-reasoning")

# Example usage
text = "To solve this problem, I'll work step by step..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
    # Two-label classification head: take the softmax probability of the "unsafe" class (index 1)
    probs = torch.softmax(outputs.logits, dim=-1)
    prob_unsafe = probs[0, 1].item()

print(f"Probability of unsafe reasoning: {prob_unsafe:.3f}")
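Because the probe is a two-label classifier, the "unsafe" probability comes from a softmax over both logits, not a sigmoid applied to a single logit. A minimal pure-Python sketch of that conversion, using hypothetical logit values for illustration:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max logit before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical [safe, unsafe] logits from the probe's classification head
logits = [1.2, -0.8]
p_safe, p_unsafe = softmax(logits)
print(f"P(unsafe) = {p_unsafe:.3f}")  # P(safe) + P(unsafe) sums to 1.0
```

The two probabilities always sum to 1, so thresholding `p_unsafe` (e.g. at 0.5) gives a safe/unsafe decision directly.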

Training Data

Trained on mathematical reasoning examples with labels for safe/unsafe reasoning patterns.

Intended Use

This model is intended for research purposes only, specifically for studying reasoning patterns in AI systems.
