mahdin70 committed
Commit ba0824f · verified · 1 Parent(s): ae20617

Update README.md

Files changed (1)
  1. README.md +3 -101
README.md CHANGED
@@ -36,7 +36,7 @@ The architecture is implemented as a custom `MultiTaskUnixCoder` class in PyTorc
36
 
37
  ## Training Dataset
38
 
39
- The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset (configuration: `10_per_commit`), which combines:
40
  - **BigVul**: A dataset of real-world vulnerabilities from open-source projects.
41
  - **PrimeVul**: A dataset focused on prime vulnerabilities in code.
42
 
@@ -45,10 +45,12 @@ The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset
45
  - Train: 124,780 samples
46
  - Validation: 26,740 samples
47
  - Test: 26,738 samples
 
48
  - **Features**:
49
  - `func`: Code snippet (text)
50
  - `vul`: Binary label (0 = non-vulnerable, 1 = vulnerable)
51
  - `CWE ID`: CWE identifier (e.g., CWE-89) or None for non-vulnerable samples
 
52
  - **Preprocessing**:
53
  - CWE labels were encoded using a `LabelEncoder` with 134 unique CWE classes identified across the dataset.
54
  - Non-vulnerable samples assigned a CWE label of -1 (mapped to 0 in the model).
@@ -105,106 +107,6 @@ Install the required libraries:
105
  ```bash
106
  pip install transformers torch datasets huggingface_hub
107
 
108
- ```
109
- Apologies for the oversight! Below is the corrected README.md with the entire content, including the "Sample Code Snippet" section through to the end, formatted properly in Markdown.
110
-
111
- markdown
112
-
113
- Collapse
114
-
115
- Wrap
116
-
117
- Copy
118
- # UnixCoder-Primevul-BigVul Model Card
119
-
120
- ## Model Overview
121
-
122
- `UnixCoder-Primevul-BigVul` is a multi-task model based on Microsoft's `unixcoder-base`, fine-tuned to detect vulnerabilities (`vul`) and classify Common Weakness Enumeration (CWE) types in code snippets. It was developed by [mahdin70](https://huggingface.co/mahdin70) and trained on a balanced dataset combining BigVul and PrimeVul datasets. The model performs binary classification for vulnerability detection and multi-class classification for CWE identification.
123
-
124
- - **Model Repository**: [mahdin70/UnixCoder-Primevul-BigVul](https://huggingface.co/mahdin70/UnixCoder-Primevul-BigVul)
125
- - **Base Model**: [microsoft/unixcoder-base](https://huggingface.co/microsoft/unixcoder-base)
126
- - **Tasks**: Vulnerability Detection (Binary), CWE Classification (Multi-class)
127
- - **License**: MIT (assumed; adjust if different)
128
- - **Date**: Trained and uploaded as of March 11, 2025
129
-
130
- ## Model Architecture
131
-
132
- The model extends `unixcoder-base` with two task-specific heads:
133
- - **Vulnerability Head**: A linear layer mapping 768-dimensional hidden states to 2 classes (vulnerable or not).
134
- - **CWE Head**: A linear layer mapping 768-dimensional hidden states to 135 classes (134 CWE types + 1 for "no CWE").
135
-
136
- The architecture is implemented as a custom `MultiTaskUnixCoder` class in PyTorch, with the loss computed as the sum of cross-entropy losses for both tasks.
137
-
138
- ## Training Dataset
139
-
140
- The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset (configuration: `10_per_commit`), which combines:
141
- - **BigVul**: A dataset of real-world vulnerabilities from open-source projects.
142
- - **PrimeVul**: A dataset focused on prime vulnerabilities in code.
143
-
144
- ### Dataset Details
145
- - **Splits**:
146
- - Train: 124,780 samples
147
- - Validation: 26,740 samples
148
- - Test: 26,738 samples
149
- - **Features**:
150
- - `func`: Code snippet (text)
151
- - `vul`: Binary label (0 = non-vulnerable, 1 = vulnerable)
152
- - `CWE ID`: CWE identifier (e.g., CWE-89) or None for non-vulnerable samples
153
- - **Preprocessing**:
154
- - CWE labels were encoded using a `LabelEncoder` with 134 unique CWE classes identified across the dataset.
155
- - Non-vulnerable samples assigned a CWE label of -1 (mapped to 0 in the model).
156
-
157
- The dataset is balanced to ensure a fair representation of vulnerable and non-vulnerable samples, with a maximum of 10 samples per commit where applicable.
158
-
159
- ## Training Details
160
-
161
- ### Training Arguments
162
- The model was trained using the Hugging Face `Trainer` API with the following arguments:
163
- - **Output Directory**: `./unixcoder_multitask`
164
- - **Evaluation Strategy**: Per epoch
165
- - **Save Strategy**: Per epoch
166
- - **Learning Rate**: 2e-5
167
- - **Batch Size**: 8 (per device, train and eval)
168
- - **Epochs**: 3
169
- - **Weight Decay**: 0.01
170
- - **Logging**: Every 10 steps, logged to `./logs`
171
- - **WANDB**: Disabled
172
-
173
- ### Training Environment
174
- - **Hardware**: NVIDIA Tesla T4 GPU
175
- - **Framework**: PyTorch 2.5.1+cu121, Transformers 4.47.0
176
- - **Duration**: ~6 hours, 34 minutes, 53 seconds (23,397 steps)
177
-
178
- ### Training Metrics
179
- Validation metrics across epochs:
180
-
181
- | Epoch | Training Loss | Validation Loss | Vul Accuracy | Vul Precision | Vul Recall | Vul F1 | CWE Accuracy |
182
- |-------|---------------|-----------------|--------------|---------------|------------|----------|--------------|
183
- | 1 | 0.3038 | 0.4997 | 0.9570 | 0.8082 | 0.5379 | 0.6459 | 0.1887 |
184
- | 2 | 0.6092 | 0.4859 | 0.9587 | 0.8118 | 0.5641 | 0.6657 | 0.2964 |
185
- | 3 | 0.4261 | 0.5090 | 0.9585 | 0.8114 | 0.5605 | 0.6630 | 0.3323 |
186
-
187
- - **Final Training Loss**: 0.4430 (average over all steps)
188
-
189
- ## Evaluation
190
-
191
- The model was evaluated on the test split (26,738 samples) with the following metrics:
192
- - **Vulnerability Detection**:
193
- - Accuracy: 0.9571
194
- - Precision: 0.7947
195
- - Recall: 0.5437
196
- - F1 Score: 0.6457
197
- - **CWE Classification** (on vulnerable samples):
198
- - Accuracy: 0.3288
199
-
200
- The model excels at identifying non-vulnerable code (high accuracy) but has moderate recall for vulnerabilities and lower CWE classification accuracy, indicating room for improvement in CWE prediction.
201
-
202
- ## Usage
203
-
204
- ### Installation
205
- Install the required libraries:
206
- ```bash
207
- pip install transformers torch datasets huggingface_hub
208
  ```
209
 
210
  ### Sample Code Snippet
 
36
 
37
  ## Training Dataset
38
 
39
+ The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset, which combines:
40
  - **BigVul**: A dataset of real-world vulnerabilities from open-source projects.
41
  - **PrimeVul**: A dataset focused on prime vulnerabilities in code.
42
 
 
45
  - Train: 124,780 samples
46
  - Validation: 26,740 samples
47
  - Test: 26,738 samples
48
+
49
  - **Features**:
50
  - `func`: Code snippet (text)
51
  - `vul`: Binary label (0 = non-vulnerable, 1 = vulnerable)
52
  - `CWE ID`: CWE identifier (e.g., CWE-89) or None for non-vulnerable samples
53
+
54
  - **Preprocessing**:
55
  - CWE labels were encoded using a `LabelEncoder` with 134 unique CWE classes identified across the dataset.
56
  - Non-vulnerable samples assigned a CWE label of -1 (mapped to 0 in the model).
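
The preprocessing bullets above can be illustrated with a minimal sketch. It is not the author's code: it uses a plain-Python stand-in for scikit-learn's `LabelEncoder` (which assigns integer indices to the sorted unique classes), and it assumes the "-1 mapped to 0" step is a shift by one, so that class 0 means "no CWE" and the 134 CWE classes occupy 1..134. The helper names `build_cwe_encoder`, `encode_cwe`, and `to_model_label` are hypothetical.

```python
def build_cwe_encoder(cwe_ids):
    """Index each distinct CWE string in sorted order (stand-in for LabelEncoder)."""
    classes = sorted({c for c in cwe_ids if c is not None})
    return {cwe: idx for idx, cwe in enumerate(classes)}


def encode_cwe(cwe_id, encoder):
    """Return -1 for samples without a CWE ID, else the encoded class index."""
    if cwe_id is None:
        return -1
    return encoder[cwe_id]


def to_model_label(encoded):
    """ASSUMPTION: shift by one so -1 ("no CWE") becomes class 0."""
    return encoded + 1


# Toy data: the `CWE ID` column holds None for non-vulnerable samples.
samples = ["CWE-89", None, "CWE-119", "CWE-89"]
encoder = build_cwe_encoder(samples)
model_labels = [to_model_label(encode_cwe(c, encoder)) for c in samples]
print(model_labels)
```

Under this assumption the label space has one more class than the encoder, which is consistent with the 134 CWE classes plus one "no CWE" class described in the model card.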
 
107
  ```bash
108
  pip install transformers torch datasets huggingface_hub
109
 
110
  ```
111
 
112
  ### Sample Code Snippet