NetPress: Dynamically Generated LLM Benchmarks for Network Applications
Abstract
NetPress generates dynamic benchmarks for evaluating large language model agents in network operations, providing realistic tests across correctness, safety, and latency.
Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
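To make the dynamic-generation idea above concrete, here is a minimal Python sketch of what config-driven, on-the-fly query generation could look like. Every name in it (BenchmarkConfig, generate_queries, and their fields) is a hypothetical stand-in for illustration, not the actual NetPress API; see the linked repository for the real interface.

```python
# Hypothetical sketch of a NetPress-style dynamic benchmark generator.
# All class, function, and field names are illustrative assumptions,
# not the real NetPress API.

from dataclasses import dataclass
import random


@dataclass
class BenchmarkConfig:
    application: str      # e.g. a routing or ACL configuration task
    num_queries: int      # how many queries to generate on the fly
    max_complexity: int   # bound on per-query scenario complexity
    seed: int = 0         # for reproducible dynamic generation


def generate_queries(config: BenchmarkConfig):
    """Yield (query, ground_truth) pairs derived from a state/action abstraction.

    This stand-in only samples placeholder states; the real framework derives
    ground truth from the emulated network state for each generated query.
    """
    rng = random.Random(config.seed)
    for i in range(config.num_queries):
        state = {
            "app": config.application,
            "complexity": rng.randint(1, config.max_complexity),
        }
        query = (f"[{config.application}] resolve scenario {i} "
                 f"at complexity {state['complexity']}")
        ground_truth = {"expected_state": state}  # placeholder ground truth
        yield query, ground_truth


if __name__ == "__main__":
    cfg = BenchmarkConfig(application="route-config", num_queries=5, max_complexity=3)
    for query, truth in generate_queries(cfg):
        print(query, truth)
```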
Community
NetPress is the first benchmark specifically designed for evaluating LLMs in network and system applications. It dynamically generates benchmark datasets with over 10,000 unique queries per application, covering a wide range of complexities. NetPress also offers automated and comprehensive evaluation metrics for LLM outputs, including correctness, safety, and latency, all assessed using real emulators.
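As a rough illustration of the three evaluation dimensions listed above, the sketch below scores a single query for correctness, safety, and latency. The function and its check callbacks are hypothetical placeholders; in NetPress itself, correctness and safety are assessed through real emulators rather than from the raw text answer as done here.

```python
# Hypothetical per-query evaluation loop over the three metrics NetPress
# reports (correctness, safety, latency). The agent callable and the
# check_* callbacks are illustrative placeholders, not real NetPress code.

import time
from typing import Callable, Dict


def evaluate_query(agent: Callable[[str], str],
                   query: str,
                   check_correct: Callable[[str], bool],
                   check_safe: Callable[[str], bool]) -> Dict[str, object]:
    """Run one query through the agent and score it.

    check_correct / check_safe stand in for emulator-backed checks: the real
    framework applies the agent's actions in a network emulator and inspects
    the resulting state.
    """
    start = time.perf_counter()
    answer = agent(query)
    latency = time.perf_counter() - start
    return {
        "correct": check_correct(answer),   # did the action reach the intended state?
        "safe": check_safe(answer),         # did it avoid unsafe intermediate states?
        "latency_s": latency,               # wall-clock time for the agent's response
    }


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end.
    dummy_agent = lambda q: "apply: allow traffic on port 80"
    result = evaluate_query(dummy_agent,
                            "open HTTP access on the edge firewall",
                            check_correct=lambda a: "port 80" in a,
                            check_safe=lambda a: "allow all" not in a)
    print(result)
```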
We welcome you to try the NetPress benchmark and share your feedback! Please feel free to leave comments here (or open GitHub issues) if you have any questions.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation (2025)
- SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning (2025)
- IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property (2025)
- CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios (2025)
- SWE-bench Goes Live! (2025)
- SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use (2025)
- CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming (2025)