Commit History

Upload from GitHub Actions: Exclude TruthfulQA from proficiency score
3fbff09
verified

davidpomerenke committed on

Upload from GitHub Actions: TruthfulQA translation WIP
fd102e9
verified

davidpomerenke committed on

Upload from GitHub Actions: Scatterplot
353f761
verified

davidpomerenke committed on

Upload from GitHub Actions: Get more results, compute average based on all tasks
98c6811
verified

davidpomerenke committed on

Upload from GitHub Actions: Translate MMLU and evaluate
4c5c136
verified

davidpomerenke committed on

Upload from GitHub Actions: Correlation plot
b0aa389
verified

davidpomerenke committed on

Upload from GitHub Actions: Evaluate on autotranslated GSM dataset
f3a09a2
verified

davidpomerenke committed on

Upload from GitHub Actions: Evaluate Google Translate
338dc9b
verified

davidpomerenke committed on

Upload from GitHub Actions: More models and languages
a73f888
verified

davidpomerenke committed on

Upload from GitHub Actions: Improve UX and style
53d2039
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge remote changes and apply terminology updates: Commercial->closed-source, Open->open-source
ebaf279
verified

davidpomerenke committed on

Upload from GitHub Actions: Use task subset for average score
b1e5b40
verified

davidpomerenke committed on

Upload from GitHub Actions: Evaluate on 40 languages
941d5c5
verified

davidpomerenke committed on

Upload from GitHub Actions: Add math benchmarks
549360a
verified

davidpomerenke committed on

Upload from GitHub Actions: More results
52abc5b
verified

davidpomerenke committed on

Upload from GitHub Actions: Update model ranking fetching
f840423
verified

davidpomerenke committed on

Upload from GitHub Actions: Use FLORES+ via Huggingface
913253a
verified

davidpomerenke committed on

Upload from GitHub Actions: Quick fixes
9c2c019
verified

davidpomerenke committed on

Upload from GitHub Actions: More models
0bd935e
verified

davidpomerenke committed on

Upload from GitHub Actions: Increase n_models
d09b095
verified

davidpomerenke committed on

Upload from GitHub Actions: New results
b311dd5
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #4 from datenlabor-bmz/jonas-dev
7c6a118
verified

davidpomerenke committed on

Upload from GitHub Actions: Fix vibecoding
75010c2
verified

davidpomerenke committed on

Upload from GitHub Actions: Ugly fix for CI errors
adc94d7
verified

davidpomerenke committed on

Upload from GitHub Actions: Try moving `cache` calls that cause CI issues
bc4afa0
verified

davidpomerenke committed on

Upload from GitHub Actions: Exclude free models from evals
c9e9db6
verified

davidpomerenke committed on

Upload from GitHub Actions: Display N/A scores as such
1e8952a
verified

davidpomerenke committed on

Block gemini-2.5-pro-exp-03-25
092c06a

David Pomerenke committed on

Pass through kwargs
5fa433f

David Pomerenke committed on

Fix dataset loading
c990cb9

David Pomerenke committed on

Temporarily disable classification task
a48ff53

David Pomerenke committed on

Fix path and dev group declaration
1614427

David Pomerenke committed on

Fix import paths
c567aee

David Pomerenke committed on

added download function and edited INFO
f529b7b

jonas committed on

Use most popular current + historical models
9983b5f

David Pomerenke committed on

Only run tasks for which there is no result yet
2f9dee1

David Pomerenke committed on

Run on 40 languages, additional models
260c1a3

David Pomerenke committed on

Shorter classification prompt + error handling
0384b92

David Pomerenke committed on

Move functions for sharing them
55406ba

David Pomerenke committed on

Fix response when no evals data is available
32d50b0

David Pomerenke committed on

Fix: don't cache model metadata forever
c29b8da

David Pomerenke committed on

Run on 15 languages
f8a3dad

David Pomerenke committed on

Update models
8941a67

David Pomerenke committed on

Implement MMLU task
a683732

David Pomerenke committed on

MMLU data loader for 3 parallel datasets
47170a5

David Pomerenke committed on

Analyze MMLU datasets
031925d

David Pomerenke committed on

Add Global MMLU benchmark
ce2acb0

David Pomerenke committed on

Translation both from and to
731eddd

David Pomerenke committed on

Get popular models from OpenRouter
a32a92f

David Pomerenke committed on

Add OpenRouter metadata to models
9002fc2

David Pomerenke committed on