@@ -36,7 +36,7 @@ The architecture is implemented as a custom `MultiTaskUnixCoder` class in PyTorc
 ## Training Dataset
-The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset (configuration: `10_per_commit`), which combines:
 - **BigVul**: A dataset of real-world vulnerabilities from open-source projects.
 - **PrimeVul**: A dataset focused on prime vulnerabilities in code.
@@ -45,10 +45,12 @@ The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset
   - Train: 124,780 samples
   - Validation: 26,740 samples
   - Test: 26,738 samples
 - **Features**:
   - `func`: Code snippet (text)
   - `vul`: Binary label (0 = non-vulnerable, 1 = vulnerable)
   - `CWE ID`: CWE identifier (e.g., CWE-89) or None for non-vulnerable samples
 - **Preprocessing**:
   - CWE labels were encoded using a `LabelEncoder` with 134 unique CWE classes identified across the dataset.
   - Non-vulnerable samples assigned a CWE label of -1 (mapped to 0 in the model).
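The CWE label preprocessing in the hunk above can be sketched as follows. This is a minimal, dependency-free stand-in for scikit-learn's `LabelEncoder`, not the author's code. The README only states that non-vulnerable samples get -1 and that this is "mapped to 0 in the model"; the `+1` shift used here to achieve that, and the helper names `fit_cwe_classes`, `encode_raw`, and `to_model_label`, are assumptions for illustration.

```python
from typing import Dict, List, Optional


def fit_cwe_classes(cwe_ids: List[Optional[str]]) -> Dict[str, int]:
    """Assign each distinct CWE ID an index 0..N-1 in sorted order,
    mirroring how a label encoder orders its classes."""
    classes = sorted({c for c in cwe_ids if c is not None})
    return {cwe: idx for idx, cwe in enumerate(classes)}


def encode_raw(cwe_id: Optional[str], classes: Dict[str, int]) -> int:
    """Return the encoded class index, or -1 for a non-vulnerable sample."""
    return -1 if cwe_id is None else classes[cwe_id]


def to_model_label(raw_label: int) -> int:
    """Assumed mapping: shift by one so -1 becomes 0 ("no CWE")."""
    return raw_label + 1


# Toy data; the real dataset has 134 distinct CWE classes.
samples = ["CWE-89", None, "CWE-119", "CWE-89", None]
classes = fit_cwe_classes(samples)
labels = [to_model_label(encode_raw(c, classes)) for c in samples]
num_classes = len(classes) + 1  # CWE classes plus the "no CWE" slot
```

Under this assumed shift, 134 CWE classes plus the reserved index 0 give the 135 outputs that the README reports for the CWE head.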
@@ -105,106 +107,6 @@ Install the required libraries:
 ```bash
 pip install transformers torch datasets huggingface_hub
-```
-Apologies for the oversight! Below is the corrected README.md with the entire content, including the "Sample Code Snippet" section through to the end, formatted properly in Markdown.
-markdown
-Collapse
-Wrap
-Copy
-# UnixCoder-Primevul-BigVul Model Card
-## Model Overview
-`UnixCoder-Primevul-BigVul` is a multi-task model based on Microsoft's `unixcoder-base`, fine-tuned to detect vulnerabilities (`vul`) and classify Common Weakness Enumeration (CWE) types in code snippets. It was developed by [mahdin70](https://huggingface.co/mahdin70) and trained on a balanced dataset combining BigVul and PrimeVul datasets. The model performs binary classification for vulnerability detection and multi-class classification for CWE identification.
-- **Model Repository**: [mahdin70/UnixCoder-Primevul-BigVul](https://huggingface.co/mahdin70/UnixCoder-Primevul-BigVul)
-- **Base Model**: [microsoft/unixcoder-base](https://huggingface.co/microsoft/unixcoder-base)
-- **Tasks**: Vulnerability Detection (Binary), CWE Classification (Multi-class)
-- **License**: MIT (assumed; adjust if different)
-- **Date**: Trained and uploaded as of March 11, 2025
-## Model Architecture
-The model extends `unixcoder-base` with two task-specific heads:
-- **Vulnerability Head**: A linear layer mapping 768-dimensional hidden states to 2 classes (vulnerable or not).
-- **CWE Head**: A linear layer mapping 768-dimensional hidden states to 135 classes (134 CWE types + 1 for "no CWE").
-The architecture is implemented as a custom `MultiTaskUnixCoder` class in PyTorch, with the loss computed as the sum of cross-entropy losses for both tasks.
-## Training Dataset
-The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset (configuration: `10_per_commit`), which combines:
-- **BigVul**: A dataset of real-world vulnerabilities from open-source projects.
-- **PrimeVul**: A dataset focused on prime vulnerabilities in code.
-### Dataset Details
-- **Splits**:
-  - Train: 124,780 samples
-  - Validation: 26,740 samples
-  - Test: 26,738 samples
-- **Features**:
-  - `func`: Code snippet (text)
-  - `vul`: Binary label (0 = non-vulnerable, 1 = vulnerable)
-  - `CWE ID`: CWE identifier (e.g., CWE-89) or None for non-vulnerable samples
-- **Preprocessing**:
-  - CWE labels were encoded using a `LabelEncoder` with 134 unique CWE classes identified across the dataset.
-  - Non-vulnerable samples assigned a CWE label of -1 (mapped to 0 in the model).
-The dataset is balanced to ensure a fair representation of vulnerable and non-vulnerable samples, with a maximum of 10 samples per commit where applicable.
-## Training Details
-### Training Arguments
-The model was trained using the Hugging Face `Trainer` API with the following arguments:
-- **Output Directory**: `./unixcoder_multitask`
-- **Evaluation Strategy**: Per epoch
-- **Save Strategy**: Per epoch
-- **Learning Rate**: 2e-5
-- **Batch Size**: 8 (per device, train and eval)
-- **Epochs**: 3
-- **Weight Decay**: 0.01
-- **Logging**: Every 10 steps, logged to `./logs`
-- **WANDB**: Disabled
-### Training Environment
-- **Hardware**: NVIDIA Tesla T4 GPU
-- **Framework**: PyTorch 2.5.1+cu121, Transformers 4.47.0
-- **Duration**: ~6 hours, 34 minutes, 53 seconds (23,397 steps)
-### Training Metrics
-Validation metrics across epochs:
-| Epoch | Training Loss | Validation Loss | Vul Accuracy | Vul Precision | Vul Recall | Vul F1   | CWE Accuracy |
-|-------|---------------|-----------------|--------------|---------------|------------|----------|--------------|
-| 1     | 0.3038        | 0.4997          | 0.9570       | 0.8082        | 0.5379     | 0.6459   | 0.1887       |
-| 2     | 0.6092        | 0.4859          | 0.9587       | 0.8118        | 0.5641     | 0.6657   | 0.2964       |
-| 3     | 0.4261        | 0.5090          | 0.9585       | 0.8114        | 0.5605     | 0.6630   | 0.3323       |
-- **Final Training Loss**: 0.4430 (average over all steps)
-## Evaluation
-The model was evaluated on the test split (26,738 samples) with the following metrics:
-- **Vulnerability Detection**:
-  - Accuracy: 0.9571
-  - Precision: 0.7947
-  - Recall: 0.5437
-  - F1 Score: 0.6457
-- **CWE Classification** (on vulnerable samples):
-  - Accuracy: 0.3288
-The model excels at identifying non-vulnerable code (high accuracy) but has moderate recall for vulnerabilities and lower CWE classification accuracy, indicating room for improvement in CWE prediction.
-## Usage
-### Installation
-Install the required libraries:
-```bash
-pip install transformers torch datasets huggingface_hub
 ```
 ### Sample Code Snippet

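The test metrics quoted in the removed README text (precision 0.7947, recall 0.5437, F1 0.6457) can be cross-checked by recomputing F1 as the harmonic mean of precision and recall. The sketch below uses only those reported figures; the function name `f1_from_pr` is illustrative.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported for vulnerability detection on the test split.
reported = {"precision": 0.7947, "recall": 0.5437, "f1": 0.6457}
recomputed = f1_from_pr(reported["precision"], reported["recall"])

print(f"recomputed F1 = {recomputed:.4f}, reported F1 = {reported['f1']:.4f}")
```

The recomputed value agrees with the reported F1 to four decimal places, so the three figures are internally consistent.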
 ## Training Dataset
+The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset, which combines:
 - **BigVul**: A dataset of real-world vulnerabilities from open-source projects.
 - **PrimeVul**: A dataset focused on prime vulnerabilities in code.