Active Virulence Prediction using Evolutionary Scale Modeling
This model is a fine-tuned version of the `facebook/esm2_t30_150M_UR50D` protein language model for binary classification of protein sequences, designed to distinguish virulence factors from non-virulence factors.
Intended Use
The primary purpose of this model is to predict whether a given protein sequence functions as a virulence factor. This can be applied in bioinformatics research, drug discovery, and pathogen analysis to identify and characterize potential therapeutic targets. The model is not intended for clinical diagnostic purposes.
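For programmatic use, the sketch below shows one way the model could be queried with the Hugging Face `transformers` library. The repository id `kssrikar4/AVP-ESM2-150m` and the label order (index 0 = non-virulent, index 1 = virulent) are assumptions here; check the checkpoint's `id2label` mapping in `config.json` for the actual ordering. The network-dependent model download is kept behind a `__main__` guard.

```python
# Minimal inference sketch. The repo id and label order are assumptions;
# verify them against the checkpoint's config before relying on them.
import math

LABELS = ["Non-Virulent", "Virulent"]  # assumed label order

def softmax(logits):
    """Convert raw logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpret(logits):
    """Map a 2-logit output to (label, confidence)."""
    probs = softmax(logits)
    idx = probs.index(max(probs))
    return LABELS[idx], probs[idx]

if __name__ == "__main__":
    # Heavyweight, network-dependent part kept behind the main guard.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "kssrikar4/AVP-ESM2-150m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example sequence
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    print(interpret(logits))
```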
Data Sources
The model was trained on a curated dataset of protein sequences from multiple sources, categorized as positive (virulence factors) and negative (non-virulence factors).
Positive Data:
- VFDB (Virulence Factor Database): The Core and Full datasets from mgc.ac.cn were used as sources of positive samples.
- VPAgs-Dataset4ML: The `positive.fasta` file containing 210 protective antigen sequences from data.mendeley.com was included.
- VirulentPred 2.0: The Positive Test dataset from bioinfo.icgeb.res.in contributed additional positive samples.
Negative Data:
- VPAgs-Dataset4ML: The `negative.fasta` file containing 1,935 non-protective protein sequences from data.mendeley.com was used.
- VirulentPred 2.0: The Negative Test dataset was included as a source of non-virulence factors.
- InterPro: A selection of proteins from InterPro (ebi.ac.uk/interpro), specifically those with conserved domains, was used to augment the negative dataset.
Training Procedure
The model was fine-tuned using a multi-iteration active learning approach.
Model Architecture
- Base Model: `facebook/esm2_t30_150M_UR50D`
- Head: A linear classification head added on top of the base model's final hidden state.
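A minimal PyTorch sketch of this architecture, assuming the 640-dimensional hidden size of `esm2_t30_150M_UR50D`; the class name `VirulenceHead` and the use of the first-token representation are illustrative, not the project's exact training code.

```python
# Sketch of the classifier head: a single linear layer applied to the
# first-token ([CLS]) position of the encoder's final hidden state.
# Hidden size 640 matches esm2_t30_150M_UR50D; the head is illustrative.
import torch
import torch.nn as nn

class VirulenceHead(nn.Module):
    def __init__(self, hidden_size: int = 640, num_labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden) from the base model.
        cls_repr = last_hidden_state[:, 0, :]  # first-token embedding
        return self.classifier(cls_repr)       # (batch, num_labels) logits

if __name__ == "__main__":
    head = VirulenceHead()
    fake_hidden = torch.randn(4, 128, 640)  # batch of 4 toy sequences
    print(head(fake_hidden).shape)          # torch.Size([4, 2])
```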
Training Strategy
The training process employed an active learning loop with a Least Confidence querying strategy.
- Initial Training: The model was first trained on a small, randomly sampled subset of the labeled data.
- Iterative Querying: In each iteration, the model predicted on a large pool of unlabeled data.
- Uncertainty Sampling: The `query_size` samples with the lowest top-class confidence were identified as the most uncertain.
- Re-labeling and Retraining: The newly selected samples were added to the labeled training set, and the model was retrained on the expanded dataset. This process was repeated for several iterations, progressively improving the model's performance by focusing on the most challenging examples.
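The Least Confidence selection step can be sketched as follows; `least_confidence_query` is an illustrative name, not the project's actual function.

```python
# Least Confidence sampling sketch: given class probabilities for an
# unlabeled pool, pick the query_size samples whose top-class probability
# is lowest (i.e. where the model is least sure of its own prediction).
import numpy as np

def least_confidence_query(probs: np.ndarray, query_size: int) -> np.ndarray:
    """probs: (n_samples, n_classes) predicted probabilities.
    Returns indices of the query_size most uncertain samples."""
    confidence = probs.max(axis=1)            # probability of predicted class
    return np.argsort(confidence)[:query_size]  # lowest confidence first

if __name__ == "__main__":
    pool = np.array([[0.99, 0.01],   # very confident
                     [0.55, 0.45],   # uncertain
                     [0.51, 0.49],   # most uncertain
                     [0.80, 0.20]])
    print(least_confidence_query(pool, 2))  # -> [2 1]
```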
Evaluation
The model's performance was evaluated using two distinct methods: a final evaluation on a held-out test set and an intermediate evaluation on a validation set used during the active learning process. This distinction clarifies the purpose of each report.
Intermediate Evaluation on Validation Set
This report reflects the model's performance on a validation set. Because this data was used during the training loop to monitor progress and guide the active learning strategy, the perfect scores below are an optimistic estimate: they show how well the model fits the data that steered its own retraining, not how well it generalizes to unseen sequences.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Non-Virulent | 1.00 | 1.00 | 1.00 | 2554 |
| Virulent | 1.00 | 1.00 | 1.00 | 2545 |
| Accuracy | - | - | 1.00 | 5099 |
| Macro Avg | 1.00 | 1.00 | 1.00 | 5099 |
| Weighted Avg | 1.00 | 1.00 | 1.00 | 5099 |
- Intermediate Evaluation Confusion Matrix:
A detailed breakdown of correct and incorrect predictions on the validation set.
Final Evaluation on Held-out Test Set
This report presents the model's final, unbiased performance on data it has never seen. These metrics are the most reliable indicators of the model's performance on new, unseen protein sequences.
- Final Test Accuracy: 0.9600
- Final Test F1 Score (Macro): 0.9600
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Negative | 0.97 | 0.95 | 0.96 | 6491 |
| Positive | 0.95 | 0.97 | 0.96 | 6492 |
| Accuracy | - | - | 0.96 | 12983 |
| Macro Avg | 0.96 | 0.96 | 0.96 | 12983 |
| Weighted Avg | 0.96 | 0.96 | 0.96 | 12983 |
- Final Test Confusion Matrix:
A detailed breakdown of correct and incorrect predictions on the test set.
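Reports and confusion matrices like those above can be produced with scikit-learn; the sketch below is illustrative (the card does not include the actual evaluation script), using `classification_report` and `confusion_matrix`.

```python
# Illustrative evaluation sketch: per-class report plus confusion matrix.
# The label names mirror the tables above; y_true/y_pred are toy data.
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(y_true, y_pred, labels=("Negative", "Positive")):
    """Return (text report, confusion matrix) for binary predictions."""
    report = classification_report(y_true, y_pred, target_names=list(labels))
    cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
    return report, cm

if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    report, cm = evaluate(y_true, y_pred)
    print(report)
    print(cm)
```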
Visualization
The following visualizations provide further insight into the model's training process and performance.
Active Learning Performance:
This plot shows the model's accuracy improvement over successive active learning iterations.
CLS Embeddings Visualization (t-SNE):
This plot shows the separability of the positive and negative classes in a reduced-dimension space using t-SNE.
CLS Embeddings Visualization (UMAP):
This plot shows the separability of the positive and negative classes in a reduced-dimension space using UMAP.
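A minimal sketch of how such a 2-D projection can be computed with scikit-learn's `TSNE` (the UMAP plot is analogous via the `umap-learn` package). The random embeddings here are stand-ins; in practice they would be CLS vectors extracted from the fine-tuned encoder.

```python
# Project high-dimensional CLS embeddings to 2-D with t-SNE for a
# class-separability scatter plot. Embeddings are random stand-ins.
import numpy as np
from sklearn.manifold import TSNE

def project_2d(embeddings: np.ndarray, perplexity: float = 30.0) -> np.ndarray:
    """Reduce (n_samples, hidden_dim) embeddings to (n_samples, 2)."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    return tsne.fit_transform(embeddings)

if __name__ == "__main__":
    fake_cls = np.random.default_rng(0).normal(size=(100, 640))
    coords = project_2d(fake_cls)
    print(coords.shape)  # (100, 2)
```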
Web Interface for Easy Access
For a user-friendly way to interact with the model, you can use the Streamlit web application. This interface allows you to predict on a single protein sequence or upload a multi-sequence FASTA file for batch processing. The application is also designed to automatically handle hardware constraints, switching to CPU if a CUDA device is not available or if it runs out of memory.
To run the web interface, follow these steps:
Clone the repository: Open your terminal and clone the `AVP-ESM` repository from GitHub.

```shell
git clone https://github.com/kssrikar4/AVP-ESM.git
cd AVP-ESM
```

Install dependencies: Create and activate a virtual environment, then install the required Python libraries from the repository's `requirements.txt` file.

```shell
python -m venv py
source py/bin/activate  # On Windows: py\Scripts\activate
pip install -r requirements.txt
```
Run the application: Once the dependencies are installed, launch the web interface with the following command:

```shell
streamlit run app.py
```
Your default web browser should open automatically, displaying the Protein Virulence Predictor application.
Licensing
This model is licensed under the GNU Lesser General Public License v2.1 (LGPL-2.1).
Model tree for kssrikar4/AVP-ESM2-150m
- Base model: `facebook/esm2_t30_150M_UR50D`