
Benchmark evaluation

#3
by jglowa - opened

Please provide benchmark results, e.g. EuroEval or Eurolingua. Perplexity alone doesn't tell us much.

Tilde org

We are working on it right now.
We currently have issues with the LM Evaluation Harness: the results we get from it do not match the ones we get from a plain Hugging Face reimplementation of the same tests.

Using lm-eval-harness or lighteval by HF allows for standardized evaluation that is close to an industry standard and is battle-tested. A plain Hugging Face reimplementation (whatever exactly is meant by that) will not allow a proper comparison, as it will be prone to errors that have been ironed out over many years of testing in both of those frameworks.
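
For illustration, a standardized run with lm-eval-harness via its Python API looks roughly like the sketch below; the model id and task list are placeholders, not the actual evaluation setup discussed in this thread.

```python
# Minimal lm-eval-harness run via its Python API (placeholders throughout).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model,dtype=bfloat16",   # placeholder model id
    tasks=["hellaswag", "arc_challenge"],                         # example tasks, not the real benchmark set
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics as reported by the harness
```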

This is why we are waiting for the results from lm-eval-harness instead of publishing what we got with our HF reimplementation. The mismatch turned out to be in how the tokeniser is loaded in lm-eval-harness.
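
In case it is useful to others, a rough way to check whether the harness and a direct AutoTokenizer load agree is sketched below; the model id is a placeholder, and note that constructing HFLM loads the full model weights.

```python
# Compare the tokeniser constructed by lm-eval-harness with a plain
# transformers AutoTokenizer load (placeholder model id, illustrative only).
from transformers import AutoTokenizer
from lm_eval.models.huggingface import HFLM

MODEL_ID = "your-org/your-model"  # placeholder, not the model under discussion

harness_lm = HFLM(pretrained=MODEL_ID)                   # tokeniser as the harness loads it
reference_tok = AutoTokenizer.from_pretrained(MODEL_ID)  # tokeniser loaded directly

sample = "A short sentence to compare tokenisations."
print(harness_lm.tokenizer(sample)["input_ids"])
print(reference_tok(sample)["input_ids"])
```

If the two id sequences differ, the harness scores will not be comparable to the plain HF numbers.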
