
Model consistently outputting the first definition label

#1
by Oddtwang - opened

I've been trying the fine-tuned LLM-wsd-FT-ALL model out for disambiguation of literal and idiomatic instances of expressions like "red flag", but the model seems to be consistently outputting the label for the first definition included in the instructions, regardless of the order they're presented in or what sense is used in the input sentence.
Trying the same with the example provided on the model card produces the same effect: if I swap the order of the definitions and/or amend the input sentence to use a different sense of "hurry" (without fixing the seed), the output always seems to be "1", i.e. the label given to whichever definition comes first.

Has anyone observed similar behaviour? I'm not seeing the same thing with the base Llama Instruct model when using the same instructions.
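
Roughly what I'm running is sketched below. The repo id, sentence, definitions and prompt wording are placeholders of my own; the actual instruction format I use follows the model card example.

```python
# Minimal sketch of the order-swap test. The repo id, sentence, definitions and
# prompt wording below are placeholders; the real instruction format follows
# the model card example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "swap-uniba/LLM-wsd-FT-ALL"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

sentence = "Seeing that in his profile was a bit of a red flag for me."
definitions = [
    "a warning sign or indicator of potential problems",
    "a literal flag that is red in colour",
]

def ask(defs):
    # Number the definitions in the order given, then ask for the matching number.
    numbered = "\n".join(f"{i + 1}) {d}" for i, d in enumerate(defs))
    prompt = (
        f'Given the sentence "{sentence}", which of the following definitions '
        f'matches the meaning of "red flag"?\n{numbered}\n'
        "Answer with the number of the correct definition only."
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=5, do_sample=False)
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

print(ask(definitions))        # definitions in the original order
print(ask(definitions[::-1]))  # same definitions, order swapped
```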

When you run this model's evaluation on data/llmwsd-test/FT/test-en.data.xml.ml.jsonl, it gets an impressive 80% right. Shuffle the answers in the validation dataset randomly and it only gets ~52% right. So I'd say the chosen validation dataset is not really representative of the real performance of this model, or of any other model evaluated without randomising the answer order. Anecdotally, I managed to get ~50% right by training mamba-130m on a very small subset of the training data while keeping the validation data unshuffled, and that also dropped to ~32% with the shuffle. These models (also) learn structural artefacts of the training/validation data rather than what they are supposed to learn.
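
The shuffle I mean is essentially the following. The field names here are placeholders rather than the actual jsonl schema; the important part is remapping the gold label after reordering.

```python
# Sketch of the answer shuffle. "definitions" and "answer" are placeholder field
# names, not the actual schema of test-en.data.xml.ml.jsonl; the key step is
# remapping the gold label after reordering.
import json
import random

random.seed(0)

with open("data/llmwsd-test/FT/test-en.data.xml.ml.jsonl") as f:
    records = [json.loads(line) for line in f]

for rec in records:
    defs = rec["definitions"]            # candidate definitions, in presented order
    gold = int(rec["answer"]) - 1        # 1-based gold label -> 0-based index
    order = list(range(len(defs)))
    random.shuffle(order)                # new presentation order
    rec["definitions"] = [defs[i] for i in order]
    rec["answer"] = str(order.index(gold) + 1)  # gold label under the new order

with open("test-en.shuffled.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```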

SWAP Research Group@UNIBA org

Did you carefully analyse the dataset to confirm your hypothesis? We shuffle the answers during the creation of the training and test data precisely to avoid a 'first sense' bias. If you analyse the training and test data, you will find that the distribution of gold answers over positions is as follows (we report only the first five senses):

  • training data TT: 52%, 20%, 10%, 5%, 3%
  • training data FT: 58%, 20%, 8%, 4%, 2%
  • testing data TT: 42%, 20%, 12%, 7%, 5%
  • testing data FT: 45%, 19%, 12%, 7%, 4%
If the model always gave the first answer, its performance should be around 42-45%. A quick way to check these counts is sketched below.

Maybe the problem is the prompt: the system may be sensitive to the prompt format. Anyway, the training data, test data and outputs of all models are available here: https://zenodo.org/records/15007563
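
Something along these lines is enough to reproduce the position counts above (treating the gold label as a 1-based "answer" field is an assumption; adapt it to the actual jsonl schema):

```python
# Count how often the gold answer sits in each position. The "answer" field as a
# 1-based label is an assumption about the jsonl schema.
import json
from collections import Counter

counts = Counter()
with open("data/llmwsd-test/FT/test-en.data.xml.ml.jsonl") as f:
    for line in f:
        counts[int(json.loads(line)["answer"])] += 1

total = sum(counts.values())
for position, n in sorted(counts.items()):
    print(f"position {position}: {n / total:.1%}")
```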
1. If the data was shuffled, why is there such a big skew toward the first answer being correct, as you describe? Maybe the shuffle went wrong in the same way in both the training and the validation set (e.g. the same random seed or some such)?
2. The https://github.com/swapUniba/LLM-wsd/blob/main/eval.py script contains no shuffling logic (so I assume the shuffle you refer to was either done once at dataset-creation time, or you used a different evaluation method in the paper than on GitHub). That script is where I added the shuffle that brought your model's accuracy on FT/test-en.data.xml.ml.jsonl down from 80% to 52%.
3. The system prompt produced by the chat template is variable (it contains the current date), but this does not appear to be the issue, because I used the same system prompt for both the 80% and the 52% result; a quick way to inspect the rendered prompt is sketched below.
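
For point 3, rendering the chat template without tokenising makes the system block (including the date) visible, so it is easy to confirm both runs saw the same prompt. The repo id here is a placeholder.

```python
# Render the chat template to inspect the system prompt; only the date line of
# the default Llama template changes from day to day. Repo id is a placeholder.
import hashlib
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("swap-uniba/LLM-wsd-FT-ALL")  # assumed repo id
messages = [{"role": "user", "content": "example question"}]
rendered = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print(rendered)  # the system block, including the date, appears here
print(hashlib.sha256(rendered.encode()).hexdigest())  # compare this hash across runs
```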
