Qwen3-0.6B-MNLP_M2_mcqa_model

This model is a fine-tuned version of unsloth/Qwen3-0.6B-Base on a mixture of multiple-choice question answering (MCQA) datasets, listed under Training and evaluation data.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

Training was done on the training splits of the following datasets:

  • MedMCQA
  • MMLU
  • SciQ
  • AI2 ARC
  • MathQA
  • ScienceQA
  • OpenBookQA

Training procedure

Only questions with exactly four answer choices were kept. For each example, the whole prompt (the question followed by its choices) is passed through the model in a single forward pass, and only the logits at the final position are used. Cross-entropy loss is then computed on those final-position logits restricted to the four option-letter tokens (A, B, C and D), rather than over the whole vocabulary.
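The objective described above can be sketched in PyTorch as follows. This is a minimal sketch, not the training code used for this model: it assumes right-padded batches, that each option letter maps to a single token whose vocabulary id is known (`letter_token_ids`, which in practice would come from the tokenizer and may be the " A" or "A" variant depending on how the prompt ends), and that `labels` holds the index of the correct option (0 to 3). The function name is hypothetical.

```python
import torch
import torch.nn.functional as F


def letter_restricted_loss(
    logits: torch.Tensor,            # (batch, seq_len, vocab) from a causal LM forward pass
    attention_mask: torch.Tensor,    # (batch, seq_len), 1 = real token, 0 = padding
    letter_token_ids: torch.Tensor,  # (4,) vocabulary ids for the A, B, C, D tokens (assumed)
    labels: torch.Tensor,            # (batch,) index of the correct option, 0..3
) -> torch.Tensor:
    """Cross-entropy over the four option letters at the last prompt position."""
    # Position of the last real token in each row; assumes right padding.
    last_pos = attention_mask.sum(dim=1) - 1
    rows = torch.arange(logits.size(0), device=logits.device)
    last_logits = logits[rows, last_pos]               # (batch, vocab)
    # Keep only the four letter logits, so the softmax is over 4 classes, not the vocabulary.
    option_logits = last_logits[:, letter_token_ids]   # (batch, 4)
    return F.cross_entropy(option_logits, labels)


# Usage with random tensors standing in for model outputs.
torch.manual_seed(0)
logits = torch.randn(2, 5, 10)
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
letter_ids = torch.tensor([3, 4, 5, 6])  # hypothetical ids; use the tokenizer in practice
labels = torch.tensor([0, 2])
print(letter_restricted_loss(logits, attention_mask, letter_ids, labels))
```

In a real training loop, `logits` would come from `model(input_ids, attention_mask=attention_mask).logits`. With left padding instead of right padding, the last position is simply index `-1` for every row.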

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 64
  • optimizer: 8-bit AdamW (OptimizerNames.ADAMW_8BIT) with betas=(0.9, 0.999), epsilon=1e-08 and no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.04
  • num_epochs: 2
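The batch-size and learning-rate settings combine as follows. This pure-Python sketch assumes a single device (so 2 × 32 = 64 examples per optimizer update, matching total_train_batch_size) and the Hugging Face linear schedule convention: ceil(warmup_ratio × total_steps) warmup steps rising linearly from 0, then linear decay to 0. The number of training examples is not given in this card, so `total_steps` is left as a free parameter; the function names are illustrative.

```python
import math

LEARNING_RATE = 1e-5
PER_DEVICE_TRAIN_BATCH = 2
GRAD_ACCUM_STEPS = 32
WARMUP_RATIO = 0.04


def effective_batch_size(per_device: int, accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to one optimizer update."""
    return per_device * accum_steps * num_devices


def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = LEARNING_RATE,
                       warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at an optimizer step: linear warmup from 0, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(effective_batch_size(PER_DEVICE_TRAIN_BATCH, GRAD_ACCUM_STEPS))  # 64
```

For a hypothetical run of 1,000 optimizer steps, the first 40 steps are warmup and the learning rate peaks at 1e-05 at step 40.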

Evaluation Results

The model was evaluated on a suite of multiple-choice question answering (MCQA) benchmarks, using the validation and test sets of each benchmark respectively. NLP4Education consists only of the roughly 1,000 questions and answers that were provided to us.

Important Note on MCQA Evals Benchmark:

The performance on these benchmarks is as follows:

First evaluation (prompt type 5): the tests were done with this prompt:

This question assesses challenging STEM problems as found on graduate standardized tests. Carefully evaluate the options and select the correct answer.

---
[Insert Question Here]
---
[Insert Choices Here, e.g.:
A. Option 1
B. Option 2
C. Option 3
D. Option 4]
---

Your response should include the letter and the exact text of the correct choice.
Example: B. Entropy increases.
Answer:

Scoring was done on continuations of the form [Letter]. [Text answer].
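The type 5 template and the scored continuations can be assembled as below. This is a sketch assuming the line breaks shown in the prompt above and a single space between "Answer:" and the letter; the exact whitespace used in the actual evaluation is not stated, and the function names are hypothetical.

```python
LETTERS = "ABCD"
TYPE5_HEADER = (
    "This question assesses challenging STEM problems as found on graduate "
    "standardized tests. Carefully evaluate the options and select the correct answer."
)


def build_type5_prompt(question: str, choices: list[str]) -> str:
    """Fill the type 5 template with a question and exactly four choices."""
    if len(choices) != 4:
        raise ValueError("expected exactly four choices")
    options = "\n".join(f"{letter}. {text}" for letter, text in zip(LETTERS, choices))
    return (
        f"{TYPE5_HEADER}\n\n---\n{question}\n---\n{options}\n---\n\n"
        "Your response should include the letter and the exact text of the correct choice.\n"
        "Example: B. Entropy increases.\n"
        "Answer:"
    )


def candidate_continuations(choices: list[str]) -> list[str]:
    """One scored continuation per option, in the '[Letter]. [Text answer]' form."""
    return [f" {letter}. {text}" for letter, text in zip(LETTERS, choices)]


choices = ["Entropy decreases.", "Entropy increases.", "Entropy is constant.", "Entropy is zero."]
print(build_type5_prompt("What happens to entropy in an irreversible process?", choices))
print(candidate_continuations(choices))
```

Each continuation is appended to the same prompt and scored by the model's log-likelihood; the option with the best score is the prediction.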

Benchmark        Accuracy (Acc)   Normalized Accuracy (Acc Norm)
ARC Challenge    66.28%           64.92%
ARC Easy         84.22%           81.33%
GPQA             38.84%           36.61%
Math QA          25.03%           24.67%
MCQA Evals       43.51%           40.91%
MMLU             52.17%           52.17%
MMLU Pro         16.45%           15.04%
MuSR             53.17%           52.25%
NLP4Education    44.45%           42.65%
Overall          47.12%           45.62%
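The Acc and Acc Norm columns differ only in how the per-choice log-likelihoods are compared. A minimal sketch, assuming the lm-evaluation-harness style of scoring: Acc picks the choice whose continuation has the highest summed log-likelihood, while Acc Norm first divides each score by the character length of its continuation. The function and its inputs are illustrative, not taken from the evaluation code used here.

```python
from typing import Sequence


def score_question(loglikelihoods: Sequence[float],
                   continuations: Sequence[str],
                   gold: int) -> tuple[float, float]:
    """Return (acc, acc_norm) for one question.

    loglikelihoods[i] is the summed log-probability of continuations[i]
    (e.g. " B. Entropy increases.") given the prompt; gold is the correct index.
    """
    idx = range(len(loglikelihoods))
    pred = max(idx, key=lambda i: loglikelihoods[i])
    # Length normalization: a long answer is no longer penalized for having more tokens.
    pred_norm = max(idx, key=lambda i: loglikelihoods[i] / len(continuations[i]))
    return float(pred == gold), float(pred_norm == gold)


print(score_question([-5.0, -6.0], ["A. x", "B. a much longer answer"], gold=1))
```

Here the raw score prefers the short option A, while the length-normalized score prefers B. This is why the two columns can diverge when the full answer text is part of the scored continuation.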

Second evaluation (prompt type 0):

The following are multiple choice questions (with answers) about knowledge and skills in advanced master-level STEM courses.

---
[Insert Question Here]
---
[Insert Choices Here, e.g.:
A. Option 1
B. Option 2
C. Option 3
D. Option 4]
---
Answer:

Scoring was done on continuations of the form [Letter]. [Text answer].

Benchmark        Accuracy (Acc)   Normalized Accuracy (Acc Norm)
ARC Challenge    69.95%           65.33%
ARC Easy         84.45%           78.51%
GPQA             31.92%           28.57%
Math QA          27.02%           26.88%
MCQA Evals       43.90%           35.32%
MMLU             52.17%           52.17%
MMLU Pro         15.04%           13.27%
MuSR             53.17%           52.25%
NLP4Education    49.14%           42.85%
Overall          47.42%           43.91%
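The Overall rows appear to be unweighted means of the nine benchmark rows: for this table's Acc column, the nine values sum to 426.76, and 426.76 / 9 ≈ 47.42, matching the reported figure. The card does not state how Overall was computed, so the sketch below simply assumes a plain macro-average, in which every benchmark counts once regardless of its size.

```python
# Accuracy (Acc) column of the second evaluation, copied from the table above.
SECOND_EVAL_ACC = {
    "ARC Challenge": 69.95, "ARC Easy": 84.45, "GPQA": 31.92,
    "Math QA": 27.02, "MCQA Evals": 43.90, "MMLU": 52.17,
    "MMLU Pro": 15.04, "MuSR": 53.17, "NLP4Education": 49.14,
}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over benchmarks (no weighting by number of questions)."""
    return sum(scores.values()) / len(scores)


print(f"{macro_average(SECOND_EVAL_ACC):.2f}%")  # 47.42%
```

A question-weighted average would instead be dominated by the largest benchmarks, so the two can differ noticeably.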

Third evaluation (prompt type 2):


This is part of an assessment on graduate-level science, technology, engineering, and mathematics (STEM) concepts. Each question is multiple-choice and requires a single correct answer.

---
[Insert Question Here]
---
[Insert Choices Here, e.g.:
A. Option 1
B. Option 2
C. Option 3
D. Option 4]
---
For grading purposes, respond with: [LETTER]. [VERBATIM TEXT]
Example: D. Planck constant
Your Response:

Scoring was done on continuations of the form [Letter]. [Text answer].

Benchmark        Accuracy (Acc)   Normalized Accuracy (Acc Norm)
ARC Challenge    55.34%           55.34%
ARC Easy         74.00%           74.00%
GPQA             29.69%           29.69%
Math QA          22.35%           22.35%
MCQA Evals       37.92%           37.92%
MMLU             52.14%           52.14%
MMLU Pro         12.98%           12.98%
MuSR             53.04%           53.04%
NLP4Education    36.36%           36.36%
Overall          41.53%           41.53%

Fourth evaluation (prompt type 0, letter only):

The following are multiple choice questions (with answers) about knowledge and skills in advanced master-level STEM courses.

---
[Insert Question Here]
---
[Insert Choices Here, e.g.:
A. Option 1
B. Option 2
C. Option 3
D. Option 4]
---
Answer:

Scoring was done on the letter only ([Letter]).

Benchmark        Accuracy (Acc)   Normalized Accuracy (Acc Norm)
ARC Challenge    70.63%           70.63%
ARC Easy         85.13%           85.13%
GPQA             25.45%           25.45%
Math QA          27.35%           27.35%
MCQA Evals       45.97%           45.97%
MMLU             52.14%           52.14%
MMLU Pro         14.97%           14.97%
MuSR             53.04%           53.04%
NLP4Education    50.86%           50.86%
Overall          47.28%           47.28%
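When only the letter is scored, all four continuations have the same length, so length normalization cannot change which option wins; this is consistent with the identical Acc and Acc Norm columns in the table above. The prediction can then be read straight from the final-position logits, mirroring the training objective. A minimal sketch, assuming each letter is a single token with a known vocabulary id (the ids in the usage line are hypothetical) and a hypothetical function name.

```python
import torch

LETTERS = ["A", "B", "C", "D"]


def predict_letters(last_logits: torch.Tensor, letter_token_ids: torch.Tensor) -> list[str]:
    """Pick A/B/C/D from the logits at the final prompt position.

    last_logits: (batch, vocab); letter_token_ids: (4,) ids of the letter tokens.
    """
    option_logits = last_logits[:, letter_token_ids]  # (batch, 4)
    # log_softmax over the full vocabulary or over these 4 entries gives the same
    # argmax, and dividing by a shared continuation length would not change it either.
    best = option_logits.argmax(dim=-1)
    return [LETTERS[i] for i in best.tolist()]


last_logits = torch.zeros(2, 10)
last_logits[0, 5] = 3.0
last_logits[1, 3] = 1.0
print(predict_letters(last_logits, torch.tensor([3, 4, 5, 6])))  # ['C', 'A']
```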

Framework versions

  • Transformers 4.51.3
  • Pytorch 2.5.1+cu121
  • Datasets 3.6.0
  • Tokenizers 0.21.0
Model size: 596M params (Safetensors, BF16 tensors)