NetPress: Dynamically Generated LLM Benchmarks for Network Applications
Abstract
NetPress generates dynamic benchmarks for evaluating large language model agents in network operations, providing realistic tests across correctness, safety, and latency.
Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
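To make the dynamic-generation idea above concrete, here is a minimal Python sketch of what config-driven, on-the-fly query generation could look like. Every name in it (BenchmarkConfig, generate_queries, and their fields) is a hypothetical stand-in for illustration, not the actual NetPress API; see the linked repository for the real interface.

```python
# Hypothetical sketch of a NetPress-style dynamic benchmark generator.
# All class, function, and field names are illustrative assumptions,
# not the real NetPress API.

from dataclasses import dataclass
import random


@dataclass
class BenchmarkConfig:
    application: str      # e.g. a routing or ACL configuration task
    num_queries: int      # how many queries to generate on the fly
    max_complexity: int   # bound on per-query scenario complexity
    seed: int = 0         # for reproducible dynamic generation


def generate_queries(config: BenchmarkConfig):
    """Yield (query, ground_truth) pairs derived from a state/action abstraction.

    This stand-in only samples placeholder states; the real framework derives
    ground truth from the emulated network state for each generated query.
    """
    rng = random.Random(config.seed)
    for i in range(config.num_queries):
        state = {
            "app": config.application,
            "complexity": rng.randint(1, config.max_complexity),
        }
        query = (f"[{config.application}] resolve scenario {i} "
                 f"at complexity {state['complexity']}")
        ground_truth = {"expected_state": state}  # placeholder ground truth
        yield query, ground_truth


if __name__ == "__main__":
    cfg = BenchmarkConfig(application="route-config", num_queries=5, max_complexity=3)
    for query, truth in generate_queries(cfg):
        print(query, truth)
```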
Community
NetPress is the first benchmark specifically designed for evaluating LLMs in network and system applications. It dynamically generates benchmark datasets with over 10,000 unique queries per application, covering a wide range of complexities. NetPress also offers automated and comprehensive evaluation metrics for LLM outputs, including correctness, safety, and latency, all assessed using real emulators.
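As a rough illustration of the three evaluation dimensions listed above, the sketch below scores a single query for correctness, safety, and latency. The function and its check callbacks are hypothetical placeholders; in NetPress itself, correctness and safety are assessed through real emulators rather than from the raw text answer as done here.

```python
# Hypothetical per-query evaluation loop over the three metrics NetPress
# reports (correctness, safety, latency). The agent callable and the
# check_* callbacks are illustrative placeholders, not real NetPress code.

import time
from typing import Callable, Dict


def evaluate_query(agent: Callable[[str], str],
                   query: str,
                   check_correct: Callable[[str], bool],
                   check_safe: Callable[[str], bool]) -> Dict[str, object]:
    """Run one query through the agent and score it.

    check_correct / check_safe stand in for emulator-backed checks: the real
    framework applies the agent's actions in a network emulator and inspects
    the resulting state.
    """
    start = time.perf_counter()
    answer = agent(query)
    latency = time.perf_counter() - start
    return {
        "correct": check_correct(answer),   # did the action reach the intended state?
        "safe": check_safe(answer),         # did it avoid unsafe intermediate states?
        "latency_s": latency,               # wall-clock time for the agent's response
    }


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end.
    dummy_agent = lambda q: "apply: allow traffic on port 80"
    result = evaluate_query(dummy_agent,
                            "open HTTP access on the edge firewall",
                            check_correct=lambda a: "port 80" in a,
                            check_safe=lambda a: "allow all" not in a)
    print(result)
```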
We welcome you to try the NetPress benchmark and share your feedback! Please feel free to leave comments here (or open GitHub issues) if you have any questions.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation (2025)
- SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning (2025)
- IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property (2025)
- CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios (2025)
- SWE-bench Goes Live! (2025)
- SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use (2025)
- CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming (2025)