Active Virulence Prediction using Evolutionary Scale Modeling
This model is a fine-tuned version of the `facebook/esm2_t30_150M_UR50D` protein language model for binary classification of protein sequences, designed to distinguish virulence factors from non-virulence factors.
Intended Use
The primary purpose of this model is to predict whether a given protein sequence functions as a virulence factor. This can be applied in bioinformatics research, drug discovery, and pathogen analysis to identify and characterize potential therapeutic targets. The model is not intended for clinical diagnostic purposes.
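For programmatic use, the sketch below shows one way the model could be queried with the Hugging Face `transformers` library. The repository id `kssrikar4/AVP-ESM2-150m` and the label order (index 0 = non-virulent, index 1 = virulent) are assumptions here; check the checkpoint's `id2label` mapping in `config.json` for the actual ordering. The network-dependent model download is kept behind a `__main__` guard.

```python
# Minimal inference sketch. The repo id and label order are assumptions;
# verify them against the checkpoint's config before relying on them.
import math

LABELS = ["Non-Virulent", "Virulent"]  # assumed label order

def softmax(logits):
    """Convert raw logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpret(logits):
    """Map a 2-logit output to (label, confidence)."""
    probs = softmax(logits)
    idx = probs.index(max(probs))
    return LABELS[idx], probs[idx]

if __name__ == "__main__":
    # Heavyweight, network-dependent part kept behind the main guard.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "kssrikar4/AVP-ESM2-150m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example sequence
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    print(interpret(logits))
```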
Data Sources
The model was trained on a curated dataset of protein sequences from multiple sources, categorized as positive (virulence factors) and negative (non-virulence factors).
Positive Data:
- VFDB (Virulence Factor Database): The Core and Full datasets from mgc.ac.cn were used as sources of positive samples.
- VPAgs-Dataset4ML: The `positive.fasta` file containing 210 protective antigen sequences from data.mendeley.com was included.
- VirulentPred 2.0: The Positive Test dataset from bioinfo.icgeb.res.in contributed additional positive samples.
Negative Data:
- VPAgs-Dataset4ML: The `negative.fasta` file containing 1,935 non-protective protein sequences from data.mendeley.com was used.
- VirulentPred 2.0: The Negative Test dataset was included as a source of non-virulence factors.
- InterPro: A selection of proteins from InterPro (ebi.ac.uk/interpro), specifically those with conserved domains, was used to augment the negative dataset.
Training Procedure
The model was fine-tuned using a multi-iteration active learning approach.
Model Architecture
- Base Model: `facebook/esm2_t30_150M_UR50D`
- Head: A linear classification head added on top of the base model's final hidden state.
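A minimal PyTorch sketch of this architecture, assuming the 640-dimensional hidden size of `esm2_t30_150M_UR50D`; the class name `VirulenceHead` and the use of the first-token representation are illustrative, not the project's exact training code.

```python
# Sketch of the classifier head: a single linear layer applied to the
# first-token ([CLS]) position of the encoder's final hidden state.
# Hidden size 640 matches esm2_t30_150M_UR50D; the head is illustrative.
import torch
import torch.nn as nn

class VirulenceHead(nn.Module):
    def __init__(self, hidden_size: int = 640, num_labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden) from the base model.
        cls_repr = last_hidden_state[:, 0, :]  # first-token embedding
        return self.classifier(cls_repr)       # (batch, num_labels) logits

if __name__ == "__main__":
    head = VirulenceHead()
    fake_hidden = torch.randn(4, 128, 640)  # batch of 4 toy sequences
    print(head(fake_hidden).shape)          # torch.Size([4, 2])
```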
Training Strategy
The training process employed an active learning loop with a Least Confidence querying strategy.
- Initial Training: The model was first trained on a small, randomly sampled subset of the labeled data.
- Iterative Querying: In each iteration, the model predicted on a large pool of unlabeled data.
- Uncertainty Sampling: The `query_size` samples with the lowest top-class confidence were identified as the most uncertain.
- Re-labeling and Retraining: The newly selected samples were added to the labeled training set, and the model was retrained on the expanded dataset. This process was repeated for several iterations, progressively improving the model's performance by focusing on the most challenging examples.
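The Least Confidence selection step can be sketched as follows; `least_confidence_query` is an illustrative name, not the project's actual function.

```python
# Least Confidence sampling sketch: given class probabilities for an
# unlabeled pool, pick the query_size samples whose top-class probability
# is lowest (i.e. where the model is least sure of its own prediction).
import numpy as np

def least_confidence_query(probs: np.ndarray, query_size: int) -> np.ndarray:
    """probs: (n_samples, n_classes) predicted probabilities.
    Returns indices of the query_size most uncertain samples."""
    confidence = probs.max(axis=1)            # probability of predicted class
    return np.argsort(confidence)[:query_size]  # lowest confidence first

if __name__ == "__main__":
    pool = np.array([[0.99, 0.01],   # very confident
                     [0.55, 0.45],   # uncertain
                     [0.51, 0.49],   # most uncertain
                     [0.80, 0.20]])
    print(least_confidence_query(pool, 2))  # -> [2 1]
```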
Evaluation
The model's performance was evaluated using two distinct methods: a final evaluation on a held-out test set and an intermediate evaluation on a validation set used during the active learning process. This distinction clarifies the purpose of each report.
Intermediate Evaluation on Validation Set
This report reflects the model's performance on a validation set. Because this data was used during the training loop to monitor progress and guide the active learning strategy, the perfect scores below are an optimistic estimate: they show how well the model fits the data that steered its own retraining, not how well it generalizes to unseen sequences.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Non-Virulent | 1.00 | 1.00 | 1.00 | 2554 |
| Virulent | 1.00 | 1.00 | 1.00 | 2545 |
| Accuracy | - | - | 1.00 | 5099 |
| Macro Avg | 1.00 | 1.00 | 1.00 | 5099 |
| Weighted Avg | 1.00 | 1.00 | 1.00 | 5099 |
- Intermediate Evaluation Confusion Matrix:
A detailed breakdown of correct and incorrect predictions on the validation set.
Final Evaluation on Held-out Test Set
This report presents the model's final, unbiased performance on data it has never seen. These metrics are the most reliable indicators of the model's performance on new, unseen protein sequences.
- Final Test Accuracy: 0.9600
- Final Test F1 Score (Macro): 0.9600
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Negative | 0.97 | 0.95 | 0.96 | 6491 |
| Positive | 0.95 | 0.97 | 0.96 | 6492 |
| Accuracy | - | - | 0.96 | 12983 |
| Macro Avg | 0.96 | 0.96 | 0.96 | 12983 |
| Weighted Avg | 0.96 | 0.96 | 0.96 | 12983 |
- Final Test Confusion Matrix:
A detailed breakdown of correct and incorrect predictions on the test set.
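Reports and confusion matrices like those above can be produced with scikit-learn; the sketch below is illustrative (the card does not include the actual evaluation script), using `classification_report` and `confusion_matrix`.

```python
# Illustrative evaluation sketch: per-class report plus confusion matrix.
# The label names mirror the tables above; y_true/y_pred are toy data.
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(y_true, y_pred, labels=("Negative", "Positive")):
    """Return (text report, confusion matrix) for binary predictions."""
    report = classification_report(y_true, y_pred, target_names=list(labels))
    cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
    return report, cm

if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    report, cm = evaluate(y_true, y_pred)
    print(report)
    print(cm)
```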
Visualization
The following visualizations provide further insight into the model's training process and performance.
Active Learning Performance:
This plot shows the model's accuracy improvement over successive active learning iterations.
CLS Embeddings Visualization (t-SNE):
This plot shows the separability of the positive and negative classes in a reduced-dimension space using t-SNE.
CLS Embeddings Visualization (UMAP):
This plot shows the separability of the positive and negative classes in a reduced-dimension space using UMAP.
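A minimal sketch of how such a 2-D projection can be computed with scikit-learn's `TSNE` (the UMAP plot is analogous via the `umap-learn` package). The random embeddings here are stand-ins; in practice they would be CLS vectors extracted from the fine-tuned encoder.

```python
# Project high-dimensional CLS embeddings to 2-D with t-SNE for a
# class-separability scatter plot. Embeddings are random stand-ins.
import numpy as np
from sklearn.manifold import TSNE

def project_2d(embeddings: np.ndarray, perplexity: float = 30.0) -> np.ndarray:
    """Reduce (n_samples, hidden_dim) embeddings to (n_samples, 2)."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    return tsne.fit_transform(embeddings)

if __name__ == "__main__":
    fake_cls = np.random.default_rng(0).normal(size=(100, 640))
    coords = project_2d(fake_cls)
    print(coords.shape)  # (100, 2)
```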
Web Interface for Easy Access
For a user-friendly way to interact with the model, you can use the Streamlit web application. This interface allows you to predict on a single protein sequence or upload a multi-sequence FASTA file for batch processing. The application is also designed to automatically handle hardware constraints, switching to CPU if a CUDA device is not available or if it runs out of memory.
To run the web interface, follow these steps:
Clone the repository: Open your terminal and clone the `AVP-ESM` repository from GitHub.

```shell
git clone https://github.com/kssrikar4/AVP-ESM.git
cd AVP-ESM
```

Install dependencies: Create and activate a virtual environment, then install the required Python libraries from the repository's `requirements.txt` file.

```shell
python -m venv py
source py/bin/activate  # On Windows: py\Scripts\activate
pip install -r requirements.txt
```
Run the application: Once the dependencies are installed, launch the web interface with the following command:

```shell
streamlit run app.py
```
Your default web browser should open automatically, displaying the Protein Virulence Predictor application.
Licensing
This model is licensed under the GNU Lesser General Public License v2.1 (LGPL-2.1).
Model tree for kssrikar4/AVP-ESM2-150m
- Base model: `facebook/esm2_t30_150M_UR50D`