arXiv:2507.02964

Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens

Published on Jun 30

Abstract

AI-generated summary: Domain-Adaptive Continuous Pretraining enhances cybersecurity capabilities in large language models using a curated corpus, achieving state-of-the-art performance with smaller datasets.

While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models lack the specialized domain knowledge needed for effective cybersecurity analysis. In this work, we investigate Domain-Adaptive Continuous Pretraining (DAP) as a methodology for enhancing cybersecurity understanding in pretrained LLMs while preserving general language capabilities. We systematically adapted three decoder-based architectures -- Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct -- using a curated 126-million-word cybersecurity corpus drawn from standards, academic literature, and other sources. Our approach employed constrained training parameters and distributed FSDP training to balance domain specialization with knowledge preservation. Evaluation on three cybersecurity benchmarks (CTI-MCQ, CyberMetric, and SecEval) shows consistent improvements after adaptation. The Llama-3.3-70B-Ins-DAP model achieved state-of-the-art accuracies of 0.718, 0.933, and 0.864, respectively, outperforming specialized models including Llama-Primus-Base. Notably, competitive performance was achieved with a substantially smaller dataset (118.8 million versus 2.77 billion tokens), demonstrating the viability of efficient domain specialization. We establish that targeted continuous pretraining enables effective cybersecurity domain adaptation at feasible computational cost, providing a foundation for specialized AI assistants in threat analysis, vulnerability assessment, and security documentation, while challenging prevailing assumptions about the data requirements for LLM specialization.
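
As a rough illustration of the kind of pipeline the abstract describes, the sketch below shows domain-adaptive continued pretraining of a causal LLM with Hugging Face Transformers and FSDP. The model choice, corpus file name (cybersec_corpus.txt), sequence length, and hyperparameters are illustrative assumptions, not the authors' exact configuration, which the abstract does not specify.

```python
# Minimal sketch of domain-adaptive continued pretraining (DAP) on a
# cybersecurity text corpus with Hugging Face Transformers + FSDP.
# Corpus path, model, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.1-8B"  # one of the three adapted base models
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no dedicated pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain-text cybersecurity corpus (standards, papers, reports), one document per line.
raw = load_dataset("text", data_files={"train": "cybersec_corpus.txt"})

def tokenize(batch):
    # Truncate each document to the context window; a full pipeline would
    # typically pack documents into fixed-length blocks instead.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="llama3.1-8b-dap",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,           # conservatively small to limit forgetting
    num_train_epochs=1,
    bf16=True,
    fsdp="full_shard auto_wrap",  # shard parameters/optimizer state across GPUs
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    # mlm=False gives standard next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Such a script would be launched with a distributed launcher, e.g. `torchrun --nproc_per_node=8 dap_train.py` (file name assumed), so that FSDP can shard the model across all available GPUs.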
