Independent analysis
Hi
I'm pleased to share our independent evaluation of the model using our implementation of the MMLU-Pro benchmark. I hope you find this useful.
The results demonstrate impressive performance for the model across multiple categories compared with other models, including many surprising ones (see 'Unity Subjects' tab for detailed breakdown).
We will release additional benchmarks and cost/performance data as time permits.
Thanks for your insights. Any chance you published a research paper for the evaluation?
@theupsider I didn't think about it - do you think there's anything interesting / innovative enough here, to justify a paper?
Also - if there are other models that anyone would like us to run the analysis on, please let us know.
Maybe an article on arXiv? Replication is really important in science
That's really nice of you. Really appreciate it. Did you think about evaluating DeepSeek?
Again, concerning your website: for us scientists, it's not ideal to cite a source that does not transparently show how its benchmarks were run. But since you are a company, I assume my request is not the top priority for you, haha. Thanks anyway for sharing the data! It's very insightful.
@theupsider
We can share the source code of the benchmark, and the raw results database if that helps. That makes it completely reproducible.
As for DeepSeek, with pleasure, if someone is willing to cover the costs (tokens).
The smaller models we just host and run on our own infrastructure, no problem.
Yes, that would actually help. I can reference this in my paper.
As for the costs, I really hope someone will be willing to pay you for it. Thanks for the offer!