license: mit
pipeline_tag: text-classification
library_name: transformers
base_model: answerdotai/ModernBERT-large
tags:
  - academic
  - reasoning
  - verification
  - weaver
  - cross-encoder
  - mmlu
language:
  - en

Weaver Distilled for MMLU-Pro

This is a distilled cross-encoder based on ModernBERT-large, trained to predict whether an answer to an MMLU-Pro question is correct. Its training targets are Weaver scores aggregated over 35 LM judges and reward models.

Model Details

Base Model: answerdotai/ModernBERT-large (395M parameters)
Architecture: Cross-encoder with MLP head (1024 → 512 → 256 → 1)
Max Sequence Length: 4096 tokens
Training Data: MMLU-Pro problems with Weaver scores from 35 LM judges and reward models
Task: Binary classification for answer correctness prediction
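The head shape listed above (1024 → 512 → 256 → 1) can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `VerifierHead`, the ReLU activations, and the absence of dropout are assumptions, since the card only states the layer widths.

```python
import torch
from torch import nn


class VerifierHead(nn.Module):
    """Hypothetical MLP head with the widths from the card: 1024 -> 512 -> 256 -> 1."""

    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 512),
            nn.ReLU(),  # assumption: the card does not name the activation
            nn.Linear(512, 256),
            nn.ReLU(),  # assumption, as above
            nn.Linear(256, 1),  # single logit for "answer is correct"
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_size) encoder representation of the pair
        return self.net(pooled)


# Stand-in for pooled encoder outputs of two (instruction, response) pairs.
head = VerifierHead()
pooled = torch.randn(2, 1024)
logits = head(pooled)                        # shape (2, 1)
scores = torch.sigmoid(logits).squeeze(-1)   # shape (2,), values in [0, 1]
```

Applying a sigmoid to the single output logit gives the correctness score used in the Quick Start.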

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "hazyresearch/Weaver_Distilled_for_MMLU-Pro"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
instruction = "Which of the following is NOT a fundamental force in physics? A) Electromagnetic force B) Weak nuclear force C) Strong nuclear force D) Centrifugal force"
response = "The answer is D) Centrifugal force. Centrifugal force is not a fundamental force but rather a fictitious force that appears in rotating reference frames..."

# Tokenize input pair
inputs = tokenizer(
    instruction, 
    response,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt"
)

# Get correctness score: the head emits one logit, which sigmoid maps to [0, 1]
with torch.no_grad():
    outputs = model(**inputs)
    score = torch.sigmoid(outputs.logits).item()

print(f"Correctness score: {score:.3f}")
print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")
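If you have several candidate responses to the same question, one possible use of the score is to rank them and keep the highest-scoring one. The sketch below assumes that use case; `rank_responses` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in so the example runs without downloading the model. In practice, `score_fn` would wrap the tokenizer and model calls from the Quick Start.

```python
from typing import Callable, List, Tuple


def rank_responses(
    instruction: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score each (instruction, response) pair and sort from highest to lowest score."""
    scored = [(score_fn(instruction, response), response) for response in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(instruction: str, response: str) -> float:
    # Stand-in scorer (assumption): replace with a function that runs the verifier.
    return 0.9 if "D)" in response else 0.1


candidates = [
    "The answer is A) Electromagnetic force.",
    "The answer is D) Centrifugal force.",
]
ranked = rank_responses(
    "Which of the following is NOT a fundamental force in physics?",
    candidates,
    toy_score,
)
best_score, best_response = ranked[0]
print(f"Best response: {best_response} (score {best_score:.3f})")
```

Passing the scorer in as a callable keeps the ranking logic independent of how the model is loaded or batched.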

Training Details

This model was trained using the Weaver distillation pipeline. For training your own distilled models, see the distillation README.

Evaluation

Evaluate this model using:

python evaluate_crossencoder.py \
  --model_name "answerdotai/ModernBERT-large" \
  --checkpoint_path "hazyresearch/Weaver_Distilled_for_MMLU-Pro" \
  --dataset_path "hazyresearch/MMLU-Pro_with_Llama_3.1_70B_Instruct_v1" \
  --dataset_split "data" \
  --max_length 4096 \
  --batch_size 64

Citation

@article{weaver2025,
  title={Weaver: Shrinking the Generation-Verification Gap with Weak Verifiers},
  author={},
  journal={arXiv preprint},
  year={2025}
}