---
license: mit
pipeline_tag: text-classification
library_name: transformers
base_model: answerdotai/ModernBERT-large
tags:
- math
- reasoning
- verification
- weaver
- cross-encoder
language:
- en
---

# Weaver Distilled for MATH500

This is a distilled cross-encoder based on ModernBERT-large that predicts whether an answer to a MATH500 problem is correct. It is a specialized verifier trained on Weaver scores, which aggregate the outputs of 35 different verifiers and reward models.

## Model Details

- **Base Model**: [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) (395M parameters)
- **Architecture**: Cross-encoder with MLP head (1024 → 512 → 256 → 1)
- **Max Sequence Length**: 4096 tokens
- **Training Data**: MATH500 problems with Weaver scores from 35 LM judges and reward models
- **Task**: Binary classification for answer correctness prediction
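
To make the head dimensions above concrete, here is a minimal PyTorch sketch of a 1024 → 512 → 256 → 1 MLP applied to a pooled encoder representation. Only the layer widths come from this card; the class name `MLPHead`, the GELU activations, the dropout rate, and the use of a pooled 1024-dimensional vector as input are assumptions for illustration and may differ from the released checkpoint.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Hypothetical scoring head: 1024 -> 512 -> 256 -> 1.

    Layer widths follow the model card; activation and dropout are assumed.
    """

    def __init__(self, hidden_size: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 512),
            nn.GELU(),            # assumed activation
            nn.Dropout(dropout),  # assumed regularization
            nn.Linear(512, 256),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(256, 1),    # single logit per (instruction, response) pair
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_size) representation from the encoder
        return self.net(pooled)


head = MLPHead().eval()
pooled = torch.randn(2, 1024)  # stand-in for encoder output on 2 input pairs
with torch.no_grad():
    logits = head(pooled)
print(logits.shape)  # torch.Size([2, 1])
```

The final layer emits one logit per input pair, which matches the Quick Start usage of applying a sigmoid to obtain a correctness score.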

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "hazyresearch/Weaver_Distilled_for_MATH500"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
instruction = "Solve: What is the derivative of x^2 + 3x + 2?"
response = "The derivative is 2x + 3. Using the power rule..."

# Tokenize input pair
inputs = tokenizer(
    instruction, 
    response,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt"
)

# Get correctness score: the single output logit is mapped to [0, 1] with a sigmoid
with torch.no_grad():
    outputs = model(**inputs)
    score = torch.sigmoid(outputs.logits).item()

print(f"Correctness score: {score:.3f}")
print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")
```
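
One plausible use of these scores is picking the best of several candidate responses to the same problem. The sketch below assumes you have already collected one logit per candidate (for example by running the snippet above on each response); the helper names `sigmoid` and `select_best`, the example logits, and the 0.5 threshold are illustrative and are not part of the released code.

```python
import math


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def select_best(candidates, logits, threshold=0.5):
    """Return (best candidate, its score, whether the score exceeds threshold)."""
    if len(candidates) != len(logits) or not candidates:
        raise ValueError("candidates and logits must be non-empty and equal length")
    scores = [sigmoid(x) for x in logits]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx], scores[best_idx], scores[best_idx] > threshold


# Made-up logits for three candidate answers to the same problem
candidates = ["2x + 3", "x + 3", "2x"]
logits = [2.0, -1.0, 0.5]

best, score, accepted = select_best(candidates, logits)
print(f"Best: {best} (score={score:.3f}, accepted={accepted})")
```

Because the sigmoid is monotonic, ranking by raw logits gives the same order; converting to scores is only needed when a threshold is applied.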

## Training Details

This model was trained using the [Weaver distillation pipeline](https://github.com/ScalingIntelligence/scaling-verification/tree/main/distillation). For training your own distilled models, see the [distillation README](https://github.com/ScalingIntelligence/scaling-verification/blob/main/distillation/README.md).

## Evaluation

Evaluate this model using:

```bash
python evaluate_crossencoder.py \
  --model_name "answerdotai/ModernBERT-large" \
  --checkpoint_path "hazyresearch/Weaver_Distilled_for_MATH500" \
  --dataset_path "hazyresearch/MATH500_with_Llama_3.1_70B_Instruct_v1" \
  --dataset_split "data" \
  --max_length 4096 \
  --batch_size 64
```

## Citation

```bibtex
@article{weaver2025,
  title={Weaver: Shrinking the Generation-Verification Gap with Weak Verifiers},
  author={},
  journal={arXiv preprint},
  year={2025}
}
```