Calibration of Ministral-8B-Instruct Logprobs

by philipmuellerdev

Hey everyone,

I'm currently working on Uncertainty Quantification for LLMs for my master's thesis.

In the GPT-4 Technical Report, OpenAI used multiple-choice question answering on MMLU to measure the calibration of the logprobs. They showed that calibration for the instruction-tuned model was significantly worse than for the base model, which was very well calibrated.
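For context, this is roughly the setup I used to replicate that test: score the answer letters with a causal LM and renormalize their logprobs over the four options. Below is a minimal sketch using Hugging Face transformers; the prompt format, the model id, and the single-token letter scoring are my assumptions, not the exact protocol from the report:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Ministral-8B-Instruct-2410"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

LETTERS = ["A", "B", "C", "D"]

def option_probs(question: str, choices: list[str]) -> torch.Tensor:
    """Probability over the four answer letters, renormalized from the
    next-token logits at the "Answer:" position."""
    prompt = (
        question
        + "\n"
        + "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Token ids of " A" ... " D"; whether each letter is a single token with
    # a leading space is tokenizer-dependent -- verify for your tokenizer.
    letter_ids = [
        tokenizer.encode(" " + l, add_special_tokens=False)[-1] for l in LETTERS
    ]
    return torch.softmax(logits[letter_ids], dim=-1)
```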

I replicated the test on other models to check whether this holds for large language models in general, and I also ran it on Ministral-8B-Instruct. Unfortunately there is no base model available to compare against; nonetheless, Ministral-8B-Instruct appears to be very well calibrated on MMLU, which is surprising:

[Image: calibration_plot_ministral.png — calibration plot of Ministral-8B-Instruct on MMLU]
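For reference, a reliability curve and expected calibration error like the one plotted above can be computed as below; `confidences` and `correct` are hypothetical arrays of per-question max probabilities and 0/1 correctness collected from the loop over MMLU:

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10):
    """Bin predictions by confidence and compare mean confidence to accuracy;
    also return the expected calibration error (ECE)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, bins) - 1, 0, n_bins - 1)
    xs, ys, ece = [], [], 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        conf, acc = confidences[mask].mean(), correct[mask].mean()
        ece += mask.mean() * abs(conf - acc)  # weight by bin size
        xs.append(conf)
        ys.append(acc)
    return xs, ys, ece

# A perfectly calibrated model lies on the diagonal y = x:
# xs, ys, ece = reliability(confidences, correct)
# plt.plot(xs, ys, marker="o", label=f"ECE = {ece:.3f}")
# plt.plot([0, 1], [0, 1], "--", color="gray")
# plt.xlabel("confidence"); plt.ylabel("accuracy"); plt.legend(); plt.show()
```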

I was wondering if anyone (perhaps from Mistral directly) could provide any information on why this is the case. Was effort put specifically into ensuring calibration? Earlier Mistral models, both base and instruction-tuned, are not well calibrated compared to this one.

Thanks in advance!
