youyaoching committed (verified) · commit 0647dd1 · 1 parent: 799cd02

Update README.md

Files changed (1): README.md (+136 −3)

README.md:
---
license: mit
language:
- en
- ja
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- cybersecurity
---
# Primus: A Pioneering Pretraining Dataset for Large Language Models in Cybersecurity

## Introduction

Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, with promising applications in specialized domains such as finance, law, and biomedicine. In cybersecurity, however, we noticed a lack of open-source datasets designed specifically for LLM pre-training, even though much research has shown that LLMs acquire their knowledge during pre-training. To fill this gap, we present a collection of datasets covering multiple stages of LLM training: pre-training (_Primus-Seed_ and _Primus-FineWeb_), instruction fine-tuning (_Primus-Instruct_), and reasoning data for distillation (_Primus-Reasoning_). Based on these datasets and Llama-3.1-8B-Instruct, we trained _Llama-Primus-Base_, _Llama-Primus-Merged_, and _Llama-Primus-Reasoning_. This model card is for **Llama-Primus-Merged**.

> **Note:** No Trend Micro customer information is included.

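Below is a minimal usage sketch with Hugging Face Transformers. The repository id is a placeholder (this card does not spell out the full model path), and the prompt and generation settings are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual path of this model card.
model_id = "<org>/Llama-Primus-Merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",
)

# The chat template is inherited from Llama-3.1-8B-Instruct.
messages = [
    {"role": "system", "content": "You are a helpful cybersecurity assistant."},
    {"role": "user", "content": "Briefly explain CWE-79 (cross-site scripting)."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
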
## Benchmark Results

- [Cybersecurity](#cybersecurity)
- [Function Calling](#function-calling)
- [Safety & Toxicity](#safety--toxicity)
- [Multilingual](#multilingual)
- [General Chat Performance](#general-chat-performance)
- [Long-Context](#long-context)

### Cybersecurity

| **Metric**                              | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|-----------------------------------------|---------------------------|-------------------------|
| **CTIBench (MCQ)**                      | 0.6420                    | 0.6656                  |
| **CTIBench (CVE → CWE)**                | 0.5910                    | 0.6620                  |
| **CTIBench (CVSS, _lower is better_)**  | 1.2712                    | 1.1233                  |
| **CTIBench (ATE)**                      | 0.2721                    | 0.3387                  |
| **CyberMetric (500)**                   | 0.8560                    | 0.8660                  |
| **SecEval**                             | 0.4966                    | 0.5062                  |
| **CISSP (exam questions from book)**    | 0.7073                    | 0.7191                  |

References:
- **CyberMetric**: [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented...](https://arxiv.org/abs/2402.07688)
- **CTIBench**: [CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence](https://arxiv.org/abs/2406.07599)
- **SecEval**: [SecEval: A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models](https://xuanwuai.github.io/SecEval/)

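As a point of reference for how multiple-choice scores like those above are typically computed, here is a hypothetical accuracy-scoring sketch; the prompt format, field names, and letter-extraction regex are illustrative assumptions, not the exact harness behind these numbers:

```python
import re

def build_prompt(question: str, choices: dict[str, str]) -> str:
    # Lay out the question with lettered options (format is an assumption).
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    return f"{question}\n{options}\nAnswer with the letter of the correct option only."

def extract_choice(completion: str) -> str | None:
    # Take the first standalone A-D letter in the model's reply.
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

def accuracy(examples: list[dict], generate) -> float:
    # `generate` is any callable mapping a prompt string to model output text.
    correct = 0
    for ex in examples:
        prediction = extract_choice(generate(build_prompt(ex["question"], ex["choices"])))
        correct += int(prediction == ex["answer"])
    return correct / len(examples)
```
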
### Function Calling

| **Metric**    | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|---------------|---------------------------|-------------------------|
| **BFCL (V2)** | 73.02 (prompt)            | 74.77 (prompt)          |

Reference:

- [BFCL (V2)](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html)

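The BFCL scores above are for prompt-based tool calling. As a rough illustration (not the BFCL harness), recent versions of Transformers accept JSON-schema tool definitions through the chat template; the `lookup_cve` tool below is hypothetical, and `tokenizer`/`model` are assumed loaded as in the usage sketch earlier:

```python
# Hypothetical tool schema; `tokenizer` and `model` come from the usage sketch above.
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_cve",
        "description": "Look up the summary of a CVE by its identifier.",
        "parameters": {
            "type": "object",
            "properties": {
                "cve_id": {"type": "string", "description": "e.g. CVE-2021-44228"},
            },
            "required": ["cve_id"],
        },
    },
}]

messages = [{"role": "user", "content": "What is CVE-2021-44228 about?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
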
### Safety & Toxicity

| **Metric**                             | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|----------------------------------------|---------------------------|-------------------------|
| **dan (Jailbreak)**                    | 28.98%                    | 41.70%                  |
| **encoding (Jailbreak)**               | 100.00%                   | 100.00%                 |
| **goodside (Hallucination/Injection)** | 77.08%                    | 72.10%                  |
| **latentinjection (Injection)**        | 75.55%                    | 74.00%                  |
| **leakreplay (Copyright)**             | 95.71%                    | 96.90%                  |
| **malwaregen (Disallowed)**            | 14.34%                    | 29.00%                  |
| **realtoxicityprompts (Disallowed)**   | 90.03%                    | 85.40%                  |
| **snowball (Hallucination)**           | 59.67%                    | 84.20%                  |
| **xss (Injection)**                    | 100.00%                   | 98.30%                  |
| **XSTest (Over-refusal)**              | 93.20%                    | 83.20%                  |

References:

- **Garak**: [Garak Repository](https://github.com/leondz/garak)
- **XSTest**: [XSTest Repository](https://github.com/paul-rottger/exaggerated-safety)

### Multilingual

| **Benchmark (Language)** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|--------------------------|---------------------------|-------------------------|
| **MMLU (English)**       | 68.16%                    | 67.36%                  |
| **MMLU (Japanese)**      | 49.22%                    | 47.85%                  |
| **MMLU (French)**        | 58.91%                    | 58.14%                  |
| **MMLU (German)**        | 57.70%                    | 56.68%                  |

References:
- **English**: [MMLU Dataset](https://arxiv.org/abs/2009.03300)
- **German/French**: [MLMM Evaluation](https://github.com/nlp-uoregon/mlmm-evaluation?tab=readme-ov-file)
- **Japanese**: [FreedomIntelligence MMLU Japanese](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Japanese)

### General Chat Performance

| **Metric**   | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|--------------|---------------------------|-------------------------|
| **MT-Bench** | 8.3491                    | 8.29375                 |

Reference:
- [MT-Bench](https://arxiv.org/abs/2306.05685)

### Long-Context

| **Length** | **Llama-3.1-8B-Instruct** | **Llama-Primus-Merged** |
|------------|---------------------------|-------------------------|
| **8K+**    | 51.08                     | 50.66                   |
| **16K+**   | 29.18                     | 27.13                   |

Reference:
- [Long-Context Benchmarks](https://arxiv.org/abs/2308.14508)

## License

This model is released under the MIT license. However, because it is built on Llama-3.1-8B-Instruct, you must also comply with the Llama 3.1 Community License Agreement.