Non-reproducible Retail Domain Results

#3
by dipta007 - opened

I ran the 32B model on the cleaned tau-bench repo, with Claude 3.5 as the user model, and got the following pass@1 results over five runs:

46.09, 39.13, 38.26, 40.87, 38.26 - avg 40.52
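
For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the average and the run-to-run spread from the five numbers above. The variable names are just illustrative and nothing here comes from the tau-bench code itself.

```python
from statistics import mean, stdev

# pass@1 (%) for the five runs listed above
pass_at_1 = [46.09, 39.13, 38.26, 40.87, 38.26]

avg = mean(pass_at_1)       # arithmetic mean across runs
spread = stdev(pass_at_1)   # sample standard deviation across runs

print(f"avg = {avg:.2f}")
print(f"stdev = {spread:.2f}, min = {min(pass_at_1):.2f}, max = {max(pass_at_1):.2f}")
```

The first run is noticeably higher than the other four, so the spread may be worth keeping in mind when comparing against a single reported number.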

Any idea why I can't reproduce the reported retail-domain results?
