---
license: mit
language:
- en
- ja
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- cybersecurity
---

# Primus: A Pioneering Pretraining Dataset for Large Language Models in Cybersecurity

## Introduction

Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, with promising applications in specialized domains such as finance, law, and biomedicine. In cybersecurity, however, open-source datasets specifically designed for LLM pre-training are scarce, even though much research has shown that LLMs acquire their knowledge during pre-training. To fill this gap, we present a collection of datasets covering multiple stages of LLM training: pre-training (_Primus-Seed_ and _Primus-FineWeb_), instruction fine-tuning (_Primus-Instruct_), and reasoning data for distillation (_Primus-Reasoning_). Based on these datasets and Llama-3.1-8B-Instruct, we trained _Llama-Primus-Base_, _Llama-Primus-Merged_, and _Llama-Primus-Reasoning_. This model card describes **Llama-Primus-Merged**.

> **Note:** No TrendMicro customer information is included.
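## Quickstart

Llama-Primus-Merged is a standard decoder-only chat model built on Llama-3.1-8B-Instruct, so it can be loaded with the Hugging Face Transformers chat API. The snippet below is only a minimal sketch: the repository id is a placeholder (replace it with the actual Hugging Face repo for this model), and the prompt and generation settings are illustrative rather than recommended defaults.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual Hugging Face repository for Llama-Primus-Merged.
model_id = "<org>/Llama-Primus-Merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are a helpful cybersecurity assistant."},
    {"role": "user", "content": "Explain what CWE-79 (cross-site scripting) is and how to mitigate it."},
]

# Build the Llama 3.1 chat prompt and generate a reply.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```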
## Benchmark Results

- [Cybersecurity](#cybersecurity)
- [Function Calling](#function-calling)
- [Safety & Toxicity](#safety--toxicity)
- [Multilingual](#multilingual)
- [General Chat Performance](#general-chat-performance)
- [Long-Context](#long-context)

### Cybersecurity

| **Metric** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|---------------------------------|---------------------------|-------------------------|
| **CtiBench (MCQ)** | 0.6420 | 0.6656 |
| **CtiBench (CVE → CWE)** | 0.5910 | 0.6620 |
| **CtiBench (CVSS, _lower is better_)** | 1.2712 | 1.1233 |
| **CtiBench (ATE)** | 0.2721 | 0.3387 |
| **CyberMetric (500)** | 0.8560 | 0.8660 |
| **SecEval** | 0.4966 | 0.5062 |
| **CISSP (exams in book)** | 0.7073 | 0.7191 |

References:
- **CyberMetric**: [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented...](https://arxiv.org/abs/2402.07688)
- **CtiBench**: [CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence](https://arxiv.org/abs/2406.07599)
- **SecEval**: [SecEval: A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models](https://xuanwuai.github.io/SecEval/)
### Function Calling

| **Metric** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|---------------|---------------------------|-------------------------|
| **BFCL (V2)** | 73.02 (prompt) | 74.77 (prompt) |

Reference:
- [BFCL (V2)](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html)
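The "(prompt)" annotation indicates BFCL's prompt mode, in which function schemas are supplied as text in the prompt rather than through a native tool-calling API. The sketch below only illustrates that pattern: the repository id is a placeholder, and the `get_cve_details` schema and prompt wording are hypothetical examples, not part of the BFCL harness.

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/Llama-Primus-Merged"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Hypothetical function schema, embedded directly in the prompt ("prompt" mode).
tool_schema = {
    "name": "get_cve_details",
    "description": "Look up summary information for a CVE identifier.",
    "parameters": {
        "type": "object",
        "properties": {"cve_id": {"type": "string", "description": "e.g. CVE-2021-44228"}},
        "required": ["cve_id"],
    },
}

system_prompt = (
    "You may call the following function. Respond only with a JSON object of the form "
    '{"name": <function name>, "arguments": <arguments object>}.\n'
    "Function schema:\n" + json.dumps(tool_schema, indent=2)
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Look up details for CVE-2021-44228."},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```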
### Safety & Toxicity

| **Metric** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|----------------------------------|---------------------------|-------------------------|
| **dan (Jailbreak)** | 28.98% | 41.70% |
| **encoding (Jailbreak)** | 100.00% | 100.00% |
| **goodside (Hallucination/Injection)** | 77.08% | 72.10% |
| **latentinjection (Injection)** | 75.55% | 74.00% |
| **leakreplay (Copyright)** | 95.71% | 96.90% |
| **malwaregen (Disallowed)** | 14.34% | 29.00% |
| **realtoxicityprompts (Disallowed)** | 90.03% | 85.40% |
| **snowball (Hallucination)** | 59.67% | 84.20% |
| **xss (Injection)** | 100.00% | 98.30% |
| **XSTest (Over-refusal)** | 93.20% | 83.20% |

References:
- **Garak**: [Garak Repository](https://github.com/leondz/garak)
- **XSTest**: [XSTest Repository](https://github.com/paul-rottger/exaggerated-safety)

### Multilingual

| **Language** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|---------------|---------------------------|-------------------------|
| **MMLU (English)** | 68.16% | 67.36% |
| **MMLU (Japanese)** | 49.22% | 47.85% |
| **MMLU (French)** | 58.91% | 58.14% |
| **MMLU (German)** | 57.70% | 56.68% |

References:
- **English**: [MMLU Dataset](https://arxiv.org/abs/2009.03300)
- **German/French**: [MLMM Evaluation](https://github.com/nlp-uoregon/mlmm-evaluation?tab=readme-ov-file)
- **Japanese**: [Freedom Intelligence MMLU Japanese](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Japanese)

### General Chat Performance

| **Metric** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|-----------------|---------------------------|-------------------------|
| **MT Bench** | 8.3491 | 8.29375 |

Reference:
- [MT Bench](https://arxiv.org/abs/2306.05685)

### Long-Context

| **Length** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|------------|---------------------------|-------------------------|
| **8K+** | 51.08 | 50.66 |
| **16K+** | 29.18 | 27.13 |

Reference:
- [Long-Context Benchmarks](https://arxiv.org/abs/2308.14508)
## License

This model is released under the MIT license. However, because it is built on Llama-3.1-8B-Instruct, you must also comply with the Llama 3.1 Community License Agreement.