roberta-base-unified-mcqa-pairwise

This model is a fine-tuned version of roberta-base on the nomath configuration of the pszemraj/unified-mcqa-all dataset. It achieves the following results on the evaluation set:

  • Loss: 0.3167
  • MCQA Accuracy: 0.8
  • Binary Accuracy: 0.8670
  • Num Input Tokens Seen: 2340133056

Usage

Use the pairwise_inference.py script:

#!/usr/bin/env python
"""
Pairwise Classification Inference Script for Multiple Choice Question Answering.

This script loads a Hugging Face Transformer model fine-tuned for pairwise
binary classification. For a given question, context, and a list of answer choices,
it processes each choice individually against the question and context.
The model predicts which choice is the most likely correct answer.

The input model is expected to be a sequence classification model with two output
labels (e.g., predicting if a (context, question+choice) pair is "correct" or "incorrect").
The fine-tuning process should have saved the model configuration appropriately,
including the number of labels (which should be 2 for this pairwise setup).

Dependencies:
  - torch
  - transformers

Usage:
  python pairwise_inference.py <model_name_or_path> [options]

Examples:
  1. Basic usage with a model from Hugging Face Hub (replace with your actual model):
     python pairwise_inference.py "your-username/your-finetuned-pairwise-mcqa-model"

  2. Using a local model path:
     python pairwise_inference.py "./path/to/your/local_model_directory"

  3. With custom context, question, and choices:
     python pairwise_inference.py "your-model" \
       --context "The capital of France is Paris." \
       --question "What is the capital of France?" \
       --choices "London" "Berlin" "Paris" "Madrid"

  4. If the model requires trusting remote code (e.g., custom model architectures):
     python pairwise_inference.py "user/model-with-custom-code" --trust_remote_code

  5. To disable CUDA and force CPU usage:
     python pairwise_inference.py "your-model" --no_cuda
"""
import argparse
import logging

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def get_parser():
    """Configures and returns the argument parser."""
    parser = argparse.ArgumentParser(
        description="Pairwise Classification Inference Script for Multiple Choice Question Answering.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "model_name_or_path",
        type=str,
        help="Name or path of the fine-tuned model from the Hugging Face Hub or a local directory.",
    )
    parser.add_argument(
        "--context",
        type=str,
        default="In Shakespeare's 'Hamlet,' the protagonist is famously indecisive. His delay in avenging his father's murder is a central theme, prompting much critical debate. Some argue his inaction stems from a melancholic disposition or an over-intellectualizing nature that paralyzes action. Others suggest his hesitation is a rational response to the ambiguous nature of the Ghost's command and the political complexities of the Danish court. Ultimately, his internal conflict and external pressures contribute to the tragic outcome.",
        help="Context for the question."
    )
    parser.add_argument(
        "--question",
        type=str,
        default="The passage suggests that critical interpretations of Hamlet's indecisiveness primarily diverge on whether his inaction is rooted in:",
        help="The question to answer."
    )
    parser.add_argument(
        "--choices",
        nargs='+',
        default=[
            "His desire for political power versus his moral obligations.",
            "Personal psychological traits versus a reasoned assessment of his situation.",
            "The clarity of the Ghost's message versus the opinions of other characters.",
            "His fear of death versus his duty to his father."
        ],
        help="A list of potential answer choices."
    )
    parser.add_argument(
        "--max_length",
        type=int,
        default=512,
        help="Maximum sequence length for tokenization.",
    )
    parser.add_argument(
        "--no_cuda",
        action="store_true",
        help="Disable CUDA, use CPU even if CUDA is available.",
    )
    parser.add_argument(
        "--trust_remote_code",
        action="store_true",
        help="Allow loading models with custom code from the Hub.",
    )
    return parser


def perform_pairwise_inference(
    model_name_or_path: str,
    context: str,
    question: str,
    choices: list[str],
    device: str = "cpu",
    max_length: int = 512,
    trust_remote_code: bool = False,
):
    """
    Performs pairwise classification inference for a given question, context, and choices.

    Args:
        model_name_or_path (str): Name or path of the fine-tuned model.
        context (str): The context for the question. Can be empty.
        question (str): The question to be answered.
        choices (list[str]): A list of potential answer choices.
        device (str): Device to run the model on ('cuda' or 'cpu').
        max_length (int): Maximum sequence length for tokenization.
        trust_remote_code (bool): Whether to trust remote code when loading from Hub.

    Returns:
        tuple: (predicted_choice_text, predicted_choice_score, all_choice_scores)
               - predicted_choice_text (str): The text of the predicted best choice.
               - predicted_choice_score (float): The score (logit for class 1) of the predicted choice.
               - all_choice_scores (list[dict]): A list of dictionaries, each containing 'choice' and 'score'.
    """
    if not choices:
        logger.error("No choices provided.")
        return None, float("-inf"), []

    try:
        logger.info(f"Loading tokenizer from '{model_name_or_path}'...")
        tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path, trust_remote_code=trust_remote_code
        )
        logger.info(f"Loading model from '{model_name_or_path}'...")
        # Assumes the model was fine-tuned for sequence classification with 2 labels
        # and its configuration reflects this.
        model = AutoModelForSequenceClassification.from_pretrained(
            model_name_or_path, trust_remote_code=trust_remote_code
        )
        model.to(device)
        model.eval()
        logger.info("Model and tokenizer loaded successfully.")
    except Exception as e:
        logger.error(
            f"Error loading model or tokenizer from '{model_name_or_path}': {e}"
        )
        raise

    first_sentences = []
    second_sentences = []
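    # Each answer choice is paired with the context and scored independently:
    # the pair is (context, question + choice), matching the fine-tuning setup.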
    for choice in choices:
        first_sentences.append(context if context else "")
        second_sentences.append(f"{question} {choice}".strip())

    try:
        logger.info("Tokenizing inputs...")
        inputs = tokenizer(
            first_sentences,
            second_sentences,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        logger.info(f"Inputs tokenized for {len(choices)} choices.")
    except Exception as e:
        logger.error(f"Error during tokenization: {e}")
        raise

    with torch.no_grad():
        try:
            logger.info("Running model inference...")
            outputs = model(**inputs)
            logits = outputs.logits  # Shape: [num_choices, 2]
            logger.info("Inference completed.")
        except Exception as e:
            logger.error(f"Error during model inference: {e}")
            raise

    # The score for each choice is the logit for the "correct" class (index 1)
    choice_scores = logits[:, 1].tolist()

    all_choice_scores_detailed = []
    for i, choice_text in enumerate(choices):
        all_choice_scores_detailed.append(
            {"choice": choice_text, "score": choice_scores[i]}
        )

    if not choice_scores:
        logger.warning("No scores were generated.")
        return None, float("-inf"), all_choice_scores_detailed

    best_choice_idx = choice_scores.index(max(choice_scores))
    predicted_choice_text = choices[best_choice_idx]
    predicted_choice_score = choice_scores[best_choice_idx]

    return predicted_choice_text, predicted_choice_score, all_choice_scores_detailed


def main():
    parser = get_parser()
    args = parser.parse_args()

    if not args.no_cuda and torch.cuda.is_available():
        device = "cuda"
        logger.info("CUDA is available. Using GPU.")
    else:
        device = "cpu"
        if not args.no_cuda and not torch.cuda.is_available():
            logger.warning("CUDA is not available on this system. Falling back to CPU.")
        else:
            logger.info("Using CPU.")

    logger.info(f"Model Name/Path: {args.model_name_or_path}")
    logger.info(f'Context: "{args.context}"')
    logger.info(f'Question: "{args.question}"')
    logger.info(f"Choices: {args.choices}")
    if args.trust_remote_code:
        logger.info("Trust remote code: Enabled")

    predicted_choice, predicted_score, all_scores = perform_pairwise_inference(
        model_name_or_path=args.model_name_or_path,
        context=args.context,
        question=args.question,
        choices=args.choices,
        device=device,
        max_length=args.max_length,
        trust_remote_code=args.trust_remote_code,
    )

    if predicted_choice is not None:
        print("\n--- Inference Results ---")
        print(f"Predicted Best Choice: {predicted_choice}")
        print(f"Score (logit for class 1): {predicted_score:.4f}")
        print("\nScores for all choices (sorted best to worst):")
        for item in sorted(all_scores, key=lambda x: x["score"], reverse=True):
            print(f"  - \"{item['choice']}\": {item['score']:.4f}")
    else:
        print("\n--- Inference Failed ---")
        print("Could not determine a prediction. Check logs for details.")

if __name__ == "__main__":
    main()

The output will look like:

2025-05-10 21:42:08,096 - INFO - Model Name/Path: pszemraj/roberta-base-unified-mcqa-all-nomath-pairwise
2025-05-10 21:42:08,096 - INFO - Context: "In Shakespeare's 'Hamlet,' the protagonist is famously indecisive. His delay in avenging his father's murder is a central theme, prompting much critical debate. Some argue his inaction stems from a melancholic disposition or an over-intellectualizing nature that paralyzes action. Others suggest his hesitation is a rational response to the ambiguous nature of the Ghost's command and the political complexities of the Danish court. Ultimately, his internal conflict and external pressures contribute to the tragic outcome."
2025-05-10 21:42:08,096 - INFO - Question: "The passage suggests that critical interpretations of Hamlet's indecisiveness primarily diverge on whether his inaction is rooted in:"
2025-05-10 21:42:08,096 - INFO - Choices: ['His desire for political power versus his moral obligations.', 'Personal psychological traits versus a reasoned assessment of his situation.', "The clarity of the Ghost's message versus the opinions of other characters.", 'His fear of death versus his duty to his father.']
2025-05-10 21:42:08,096 - INFO - Loading tokenizer from 'pszemraj/roberta-base-unified-mcqa-all-nomath-pairwise'...
2025-05-10 21:42:10,449 - INFO - Loading model from 'pszemraj/roberta-base-unified-mcqa-all-nomath-pairwise'...
--- Inference Results ---
Predicted Best Choice: Personal psychological traits versus a reasoned assessment of his situation.
Score (logit for class 1): 1.0856

Scores for all choices (sorted best to worst):
  - "Personal psychological traits versus a reasoned assessment of his situation.": 1.0856
  - "His fear of death versus his duty to his father.": -0.9173
  - "His desire for political power versus his moral obligations.": -1.1767
  - "The clarity of the Ghost's message versus the opinions of other characters.": -1.7393

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a rough TrainingArguments sketch follows the list):

  • learning_rate: 5e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 69
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • optimizer: adamw_torch_fused with betas=(0.9, 0.99), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: inverse_sqrt
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 2.0
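
As a rough sketch, these settings might map onto a transformers.TrainingArguments configuration like the one below. This is an approximation for reference, not the exact training setup: output_dir and any options not listed above (logging, evaluation cadence, etc.) are placeholders.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-base-unified-mcqa-pairwise",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=69,
    gradient_accumulation_steps=4,  # 16 x 4 = effective batch size of 64
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-8,
    lr_scheduler_type="inverse_sqrt",
    warmup_steps=500,
    num_train_epochs=2.0,
)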