GraphCodeBERT-VulnCWE - Fine-Tuned GraphCodeBERT for Vulnerability and CWE Classification

Model Overview

This model is a fine-tuned version of microsoft/graphcodebert-base, trained on a curated and enriched dataset for vulnerability detection and CWE classification. It predicts whether a given code snippet is vulnerable and, if so, which of the covered CWE categories applies.

Dataset

The model was fine-tuned using the dataset mahdin70/cwe_enriched_balanced_bigvul_primevul. The dataset contains both vulnerable and non-vulnerable code samples and is enriched with CWE metadata.

CWE IDs Covered:

CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
CWE-20: Improper Input Validation
CWE-125: Out-of-bounds Read
CWE-399: Resource Management Errors
CWE-200: Information Exposure
CWE-787: Out-of-bounds Write
CWE-264: Permissions, Privileges, and Access Controls
CWE-416: Use After Free
CWE-476: NULL Pointer Dereference
CWE-190: Integer Overflow or Wraparound
CWE-189: Numeric Errors
CWE-362: Concurrent Execution using Shared Resource with Improper Synchronization
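
For downstream code that needs human-readable labels, the list above can be captured as a lookup table. This is a minimal sketch: CWE_NAMES and describe_cwe are hypothetical helper names, not part of the model repository, and the dictionary order simply mirrors the list above. It should not be assumed to match the model's class indices.

```python
# Hypothetical lookup table built from the CWE list in this model card.
# The ordering here is NOT claimed to match the model's output class indices.
CWE_NAMES = {
    "CWE-119": "Improper Restriction of Operations within the Bounds of a Memory Buffer",
    "CWE-20": "Improper Input Validation",
    "CWE-125": "Out-of-bounds Read",
    "CWE-399": "Resource Management Errors",
    "CWE-200": "Information Exposure",
    "CWE-787": "Out-of-bounds Write",
    "CWE-264": "Permissions, Privileges, and Access Controls",
    "CWE-416": "Use After Free",
    "CWE-476": "NULL Pointer Dereference",
    "CWE-190": "Integer Overflow or Wraparound",
    "CWE-189": "Numeric Errors",
    "CWE-362": "Concurrent Execution using Shared Resource with Improper Synchronization",
}


def describe_cwe(cwe_id: str) -> str:
    """Return 'CWE-xxx: name', or flag the ID as outside this model's coverage."""
    name = CWE_NAMES.get(cwe_id)
    if name is None:
        return f"{cwe_id}: not covered by this model"
    return f"{cwe_id}: {name}"


print(describe_cwe("CWE-416"))
```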

Model Training

The model was trained for 3 epochs with the following configuration:

Learning Rate: 2e-5
Weight Decay: 0.01
Batch Size: 8
Optimizer: AdamW
Scheduler: Linear
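
To make the "Linear" scheduler concrete, the sketch below computes the learning rate at a given step under a linear decay from 2e-5 to zero. It assumes the schedule spans all 5916 training steps reported in this card and uses no warmup. The card states neither, so warmup_steps=0 and the function name linear_lr are assumptions for illustration only.

```python
BASE_LR = 2e-5       # learning rate from the configuration above
TOTAL_STEPS = 5916   # total training steps reported in this card


def linear_lr(step: int,
              base_lr: float = BASE_LR,
              total_steps: int = TOTAL_STEPS,
              warmup_steps: int = 0) -> float:
    """Linear warmup (assumed to be 0 steps here) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, the midpoint, and the end of training.
for step in (0, 2958, 5916):
    print(step, linear_lr(step))
```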

Training Loss and Validation Metrics Per Epoch:

Epoch	Training Loss	Validation Loss	Vul Accuracy	Vul Precision	Vul Recall	Vul F1	CWE Accuracy
1	1.2824	1.4160	0.7914	0.8990	0.5200	0.6589	0.3551
2	1.1292	1.2632	0.8007	0.8037	0.6426	0.7142	0.4433
3	0.8598	1.2436	0.7945	0.7669	0.6747	0.7179	0.4605
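
The Vul F1 column is the harmonic mean of the precision and recall columns, F1 = 2PR / (P + R). The short check below recomputes it from the table values; the variable and function names are illustrative.

```python
# (epoch, precision, recall, reported F1) copied from the table above
rows = [
    (1, 0.8990, 0.5200, 0.6589),
    (2, 0.8037, 0.6426, 0.7142),
    (3, 0.7669, 0.6747, 0.7179),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for epoch, p, r, reported in rows:
    print(f"epoch {epoch}: computed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")
```

Across the three epochs recall rises from 0.52 to 0.67 while precision falls from 0.90 to 0.77, which is why F1 improves even though accuracy stays between 0.79 and 0.80.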

Training Summary:

Total Training Steps: 5916
Training Loss: 1.2380
Training Time: 4785.0 seconds (~80 minutes)
Training Speed: 9.89 samples per second
Steps Per Second: 1.236
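
The throughput figures above can be cross-checked against the batch size and epoch count. The sketch below assumes one optimizer step per batch of 8 on a single device (no gradient accumulation), which the card does not state explicitly. The derived sample count is an estimate, since the last batch of each epoch may be smaller.

```python
total_steps = 5916
train_seconds = 4785.0
batch_size = 8   # from the training configuration
epochs = 3

# Steps and samples processed per second over the whole run.
steps_per_second = total_steps / train_seconds
samples_per_second = total_steps * batch_size / train_seconds

# Approximate size of the training split implied by these numbers.
steps_per_epoch = total_steps // epochs
approx_train_samples = steps_per_epoch * batch_size

print(f"steps/s:   {steps_per_second:.3f}")
print(f"samples/s: {samples_per_second:.2f}")
print(f"steps per epoch: {steps_per_epoch}, about {approx_train_samples} training samples")
```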

How to Use the Model

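import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True runs the custom model class shipped in the model repository.
model = AutoModel.from_pretrained("mahdin70/GraphCodeBERT-VulnCWE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

code_snippet = "int main() { int arr[10]; arr[11] = 5; return 0; }"
# Truncate long inputs to the encoder's 512-token limit.
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, max_length=512)

# Gradients are not needed for prediction.
with torch.no_grad():
    outputs = model(**inputs)

vul_logits = outputs["vul_logits"]
cwe_logits = outputs["cwe_logits"]

vul_pred = vul_logits.argmax(dim=1).item()
# cwe_pred is a class index, not a CWE number; the index-to-CWE mapping
# depends on the label encoding used during training.
cwe_pred = cwe_logits.argmax(dim=1).item()

print(f"Vulnerability: {'Vulnerable' if vul_pred == 1 else 'Non-vulnerable'}")
print(f"CWE class index: {cwe_pred if vul_pred == 1 else 'N/A'}")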
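
The example above reports only the argmax of each head. When a confidence score or an adjustable decision threshold is useful, the logits can be passed through a softmax first. The following is a minimal sketch in plain Python; interpret and its threshold parameter are hypothetical, the logits shown are made-up values, and index 1 is treated as "vulnerable" as in the example above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(vul_logits_row, cwe_logits_row, threshold=0.5):
    """Turn one row of logits from each head into a labelled prediction."""
    vul_prob = softmax(vul_logits_row)[1]  # probability of the "vulnerable" class
    is_vulnerable = vul_prob >= threshold
    cwe_index, cwe_prob = None, None
    if is_vulnerable:
        cwe_probs = softmax(cwe_logits_row)
        cwe_index = max(range(len(cwe_probs)), key=cwe_probs.__getitem__)
        cwe_prob = cwe_probs[cwe_index]
    return {
        "vulnerable": is_vulnerable,
        "vul_prob": vul_prob,
        "cwe_index": cwe_index,  # class index, not a CWE number
        "cwe_prob": cwe_prob,
    }


# Made-up logits for illustration. With the real model, pass
# vul_logits[0].tolist() and cwe_logits[0].tolist() instead.
result = interpret([0.2, 1.4], [0.1, 2.0, -0.5])
print(result)
```

Raising the threshold above 0.5 trades recall for precision on the vulnerability head, which may matter given the recall figures reported in the validation table.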
