
Benchmark evaluation

#3
by jglowa - opened

Please provide benchmark results, e.g. EuroEval or Eurolingua. Perplexity alone doesn't tell us much.

Tilde org

We are working on it right now.
We currently have issues with the LM Evaluation Harness: the results we get from it do not match the ones we get from a plain Hugging Face reimplementation of the same tests.

Using lm-eval-harness or lighteval by HF allows for standardized evaluation that is close to an industry standard and is battle-tested. A plain Hugging Face reimplementation (whatever exactly is meant by that) will not allow a proper comparison, as it will be prone to errors that have been ironed out over many years of testing in both of those frameworks.
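
For illustration, a standardized run with lm-eval-harness via its Python API looks roughly like the sketch below; the model id and task list are placeholders, not the actual evaluation setup discussed in this thread.

```python
# Minimal lm-eval-harness run via its Python API (placeholders throughout).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model,dtype=bfloat16",   # placeholder model id
    tasks=["hellaswag", "arc_challenge"],                         # example tasks, not the real benchmark set
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics as reported by the harness
```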

This is why we are waiting for the results from lm-eval-harness instead of publishing what we got with our HF reimplementation. The mismatch turned out to be in how the tokeniser is loaded in lm-eval-harness.
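
In case it is useful to others, a rough way to check whether the harness and a direct AutoTokenizer load agree is sketched below; the model id is a placeholder, and note that constructing HFLM loads the full model weights.

```python
# Compare the tokeniser constructed by lm-eval-harness with a plain
# transformers AutoTokenizer load (placeholder model id, illustrative only).
from transformers import AutoTokenizer
from lm_eval.models.huggingface import HFLM

MODEL_ID = "your-org/your-model"  # placeholder, not the model under discussion

harness_lm = HFLM(pretrained=MODEL_ID)                   # tokeniser as the harness loads it
reference_tok = AutoTokenizer.from_pretrained(MODEL_ID)  # tokeniser loaded directly

sample = "A short sentence to compare tokenisations."
print(harness_lm.tokenizer(sample)["input_ids"])
print(reference_tok(sample)["input_ids"])
```

If the two id sequences differ, the harness scores will not be comparable to the plain HF numbers.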
