# BERT Probe for Unsafe Reasoning Detection
This model is a BERT-based probe trained to detect "unsafe" reasoning patterns in mathematical problem-solving.
## Model Details
- Base Model: bert-base-uncased
- Task: Binary classification (safe vs unsafe reasoning)
- Training: Fine-tuned on mathematical reasoning examples
- Use Case: Research into AI safety and reasoning patterns
## Usage

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("ksw1/bert-probe-unsafe-reasoning")
model = BertForSequenceClassification.from_pretrained("ksw1/bert-probe-unsafe-reasoning")
model.eval()

# Example usage
text = "To solve this problem, I'll work step by step..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the two class logits (safe, unsafe) gives P(unsafe)
prob_unsafe = torch.softmax(outputs.logits, dim=-1)[0, 1].item()
print(f"Probability of unsafe reasoning: {prob_unsafe:.3f}")
```
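Because the classification head emits two logits (safe, unsafe), P(unsafe) is the softmax of the pair, not a sigmoid of a single logit. A minimal pure-Python sketch of that conversion, with made-up example logits so it runs without downloading the model:

```python
import math

def unsafe_probability(logits):
    """Softmax over a (safe, unsafe) logit pair; returns P(unsafe)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

# Hypothetical logits for illustration only
print(f"{unsafe_probability([0.3, 2.1]):.3f}")  # → 0.858
print(f"{unsafe_probability([5.0, 5.0]):.3f}")  # equal logits → 0.500
```

This is the same computation `torch.softmax(outputs.logits, dim=-1)[0, 1]` performs on the model's output tensor.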
## Training Data
Trained on mathematical reasoning examples with labels for safe/unsafe reasoning patterns.
## Intended Use
This model is intended for research purposes only, specifically for studying reasoning patterns in AI systems.