CodeBERT-VulnCWE - Fine-Tuned CodeBERT for Vulnerability and CWE Classification

Model Overview

This model is a fine-tuned version of microsoft/codebert-base on a curated and enriched dataset for vulnerability detection and CWE classification. It is capable of predicting whether a given code snippet is vulnerable and, if vulnerable, identifying the specific CWE ID associated with it.

Dataset

The model was fine-tuned using the dataset mahdin70/cwe_enriched_balanced_bigvul_primevul. The dataset contains both vulnerable and non-vulnerable code samples and is enriched with CWE metadata.
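 
As a quick sketch, the dataset can be pulled from the Hub with the datasets library. The split and field names below are assumptions; check the dataset card for the actual schema.

from datasets import load_dataset

dataset = load_dataset("mahdin70/cwe_enriched_balanced_bigvul_primevul")
print(dataset)              # inspect the available splits
print(dataset["train"][0])  # inspect one sample (assumed "train" split; fields include code and CWE metadata)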

CWE IDs Covered:

  1. CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
  2. CWE-20: Improper Input Validation
  3. CWE-125: Out-of-bounds Read
  4. CWE-399: Resource Management Errors
  5. CWE-200: Information Exposure
  6. CWE-787: Out-of-bounds Write
  7. CWE-264: Permissions, Privileges, and Access Controls
  8. CWE-416: Use After Free
  9. CWE-476: NULL Pointer Dereference
  10. CWE-190: Integer Overflow or Wraparound
  11. CWE-189: Numeric Errors
  12. CWE-362: Concurrent Execution using Shared Resource with Improper Synchronization
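 
When decoding CWE predictions back to these IDs, a mapping along the lines of the sketch below can be used. The index order shown is an assumption and must match the label encoding used during fine-tuning.

# Assumed index-to-CWE mapping; verify against the label encoding used during training.
CWE_LABELS = [
    "CWE-119", "CWE-20", "CWE-125", "CWE-399", "CWE-200", "CWE-787",
    "CWE-264", "CWE-416", "CWE-476", "CWE-190", "CWE-189", "CWE-362",
]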

Model Training

The model was trained for 3 epochs with the following configuration:

  • Learning Rate: 2e-5
  • Weight Decay: 0.01
  • Batch Size: 8
  • Optimizer: AdamW
  • Scheduler: Linear

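A minimal sketch of this setup using Hugging Face TrainingArguments is shown below. It only mirrors the reported hyperparameters; the actual fine-tuning script (multi-task heads, loss weighting, preprocessing) is not reproduced here, and the output_dir name is a placeholder.

from transformers import TrainingArguments

# Reproduces the reported hyperparameters only; the multi-task training loop is omitted.
training_args = TrainingArguments(
    output_dir="codebert-vulncwe",    # placeholder path
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    optim="adamw_torch",              # AdamW optimizer
    lr_scheduler_type="linear",       # linear learning-rate schedule
)
 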
Training Loss and Validation Metrics Per Epoch:

Epoch   Training Loss   Validation Loss   Vul Accuracy   Vul Precision   Vul Recall   Vul F1   CWE Accuracy
1       1.4663          1.4988            0.7887         0.8526          0.5498       0.6685   0.2932
2       1.2107          1.3474            0.8038         0.8493          0.6002       0.7034   0.3688
3       1.1885          1.3096            0.8034         0.8020          0.6541       0.7205   0.3963

Training Summary:

  • Total Training Steps: 2958
  • Training Loss: 1.3862
  • Training Time: 3058.7 seconds (~51 minutes)
  • Training Speed: 15.47 samples per second
  • Steps Per Second: 0.967

How to Use the Model

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("mahdin70/CodeBERT-VulnCWE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

code_snippet = "int main() { int arr[10]; arr[11] = 5; return 0; }"
inputs = tokenizer(code_snippet, return_tensors="pt")
outputs = model(**inputs)

vul_logits = outputs["vul_logits"]
cwe_logits = outputs["cwe_logits"]

vul_pred = vul_logits.argmax(dim=1).item()  # 0 = non-vulnerable, 1 = vulnerable
cwe_pred = cwe_logits.argmax(dim=1).item()  # index into the CWE label set

print(f"Vulnerability: {'Vulnerable' if vul_pred == 1 else 'Non-vulnerable'}")
print(f"CWE ID: {cwe_pred if vul_pred == 1 else 'N/A'}")

Limitations and Future Improvements

  • The model achieves a CWE classification accuracy of 39.63% on the validation set, indicating significant room for improvement. Advanced architectures, better data balancing, or additional pretraining could enhance performance.
  • The model's vulnerability detection F1-score (72.05% on validation) is moderate but could be improved with further tuning or a larger dataset.
  • The model may struggle with edge cases or CWEs not well-represented in the training data.
  • Test set evaluation metrics are pending. Running the model on the test set will provide a clearer picture of its generalization.

Notes

  • Ensure the trust_remote_code=True flag is used when loading the model, as it relies on custom code for the MultiTaskCodeBERT architecture.
  • The model expects input code snippets tokenized using the CodeBERT tokenizer (microsoft/codebert-base).
  • For best results, preprocess code snippets consistently with the training dataset (e.g., a maximum length of 512 tokens), as in the sketch below.
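 
For example, truncation and padding to the 512-token limit mentioned above can be applied as follows; the exact padding strategy used during training is an assumption.

# Tokenize with a 512-token limit; padding="max_length" is an assumed choice.
inputs = tokenizer(
    code_snippet,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)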