@s-emanuilov on Hugging Face: "A new benchmark (DPAB-α) has been released that evaluates LLM function calling…"

A new benchmark (DPAB-α) has been released that evaluates LLM function calling using both Pythonic and JSON approaches.

It shows that Pythonic function calling often outperforms traditional JSON-based methods, especially for complex multi-step tasks.
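To make the distinction concrete, here is a minimal sketch of the two styles on a two-step task. Everything in it is an assumption for illustration: the tools `get_price` and `total_cost`, the JSON call shape, and the `run_json_call` / `run_pythonic` helpers are hypothetical and are not taken from DPAB-α or its evaluation harness.

```python
import json

# Hypothetical tools, invented for this illustration (not part of DPAB-α).
def get_price(item: str) -> float:
    prices = {"apple": 0.5, "pear": 0.75}
    return prices[item]

def total_cost(price: float, quantity: int) -> float:
    return price * quantity

TOOLS = {"get_price": get_price, "total_cost": total_cost}

def run_json_call(call_text: str):
    """Execute one JSON-style call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(call_text)
    return TOOLS[call["name"]](**call["arguments"])

def run_pythonic(code: str) -> dict:
    """Execute model-written Python with only the tools in scope.

    Note: exec() with emptied builtins is NOT a real sandbox; a production
    agent would need proper isolation.
    """
    scope = {"__builtins__": {}, **TOOLS}
    exec(code, scope)
    return scope

# JSON style: one call per model turn. The second call can only be written
# after the first result has been returned to the model.
price = run_json_call('{"name": "get_price", "arguments": {"item": "apple"}}')
second_call = json.dumps(
    {"name": "total_cost", "arguments": {"price": price, "quantity": 4}}
)
json_total = run_json_call(second_call)

# Pythonic style: the model emits a single snippet that chains both steps
# through an ordinary variable.
model_code = """
price = get_price("apple")
result = total_cost(price, 4)
"""
pythonic_total = run_pythonic(model_code)["result"]

print(json_total, pythonic_total)
```

Both paths compute the same total. The difference is that the JSON path needs a round trip for each dependent step, while the Pythonic path expresses the whole plan in one output, which is one plausible reason multi-step tasks favor it.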

Key findings from the benchmark:
— Claude 3.5 Sonnet leads with 87% on Pythonic vs 45% on JSON
— Smaller models show impressive results (Dria-Agent-α-3B: 72% Pythonic)
— Even larger models like DeepSeek V3 (685B) show significant gaps (63% Pythonic vs 33% JSON)

If you're building or using LLM agents, these results suggest that how you implement function calling can affect performance, so it may be worth reconsidering JSON-only approaches.

The benchmark: https://github.com/firstbatchxyz/function-calling-eval
Blog post: https://huggingface.co/blog/andthattoo/dpab-a
