---
license: lgpl-2.1
base_model:
- facebook/esm2_t30_150M_UR50D
tags:
- virulence-prediction
- biology
- bioinformatics
- microbiology
pipeline_tag: text-classification
library_name: transformers
---

# Active Virulence Prediction using Evolutionary Scale Modeling

This model is a fine-tuned `facebook/esm2_t30_150M_UR50D` protein language model for the binary classification of protein sequences, designed to distinguish virulence factors from non-virulence factors.

## Intended Use

The primary purpose of this model is to predict whether a given protein sequence functions as a virulence factor. It can be applied in bioinformatics research, drug discovery, and pathogen analysis to identify and characterize potential therapeutic targets. The model is not intended for clinical diagnostic purposes.

## Data Sources

The model was trained on a curated dataset of protein sequences from multiple sources, categorized as positive (virulence factors) and negative (non-virulence factors).

* **Positive Data:**
    * **VFDB (Virulence Factor Database):** The Core and Full datasets from mgc.ac.cn were used as sources of positive samples.
    * **VPAgs-Dataset4ML:** The `positive.fasta` file containing 210 protective antigen sequences from data.mendeley.com was included.
    * **VirulentPred 2.0:** The Positive Test dataset from bioinfo.icgeb.res.in contributed to the positive samples.
* **Negative Data:**
    * **VPAgs-Dataset4ML:** The `negative.fasta` file containing 1,935 non-protective protein sequences from data.mendeley.com was used.
    * **VirulentPred 2.0:** The Negative Test dataset was included as non-virulence factors.
    * **InterPro:** A selection of proteins with specific conserved domains from InterPro (ebi.ac.uk/interpro) was used to augment the negative dataset.

## Training Procedure

The model was fine-tuned using a multi-iteration active learning approach.
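As a minimal usage sketch, the fine-tuned model can be loaded for sequence classification with the `transformers` library. The Hub id, the `AutoModelForSequenceClassification` head, and the example amino-acid sequence below are assumptions for illustration; substitute this model's actual repository id.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "kssrikar4/AVP-ESM"  # hypothetical Hub id; replace with this model's actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# An arbitrary example amino-acid sequence (single-letter codes)
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()
label = "Virulent" if probs[1] > probs[0] else "Non-Virulent"
print(f"{label} (p={probs.max():.3f})")
```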
### Model Architecture

* **Base Model:** `facebook/esm2_t30_150M_UR50D`
* **Head:** A linear classification head added on top of the base model's final hidden state.

### Training Strategy

The training process employed an active learning loop with a Least Confidence querying strategy:

1. **Initial Training:** The model was first trained on a small, randomly sampled subset of the labeled data.
2. **Iterative Querying:** In each iteration, the model predicted on a large pool of unlabeled data.
3. **Uncertainty Sampling:** The `query_size` most uncertain samples were identified.
4. **Re-labeling and Retraining:** The newly selected samples were added to the labeled training set, and the model was retrained on the expanded dataset.

This process was repeated for several iterations, progressively improving the model's performance by focusing on the most challenging examples.

## Evaluation

The model's performance was evaluated in two ways: an intermediate evaluation on a validation set used during the active learning process, and a final evaluation on a held-out test set.

#### Intermediate Evaluation on Validation Set

This report reflects the model's performance on a validation set. This data was used during the training loop to monitor progress and guide the active learning strategy, allowing the model to focus on the most uncertain examples. The perfect scores therefore indicate strong performance on data the model was exposed to during the retraining process.
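The Least Confidence querying of steps 2–3 above can be sketched as follows; the function and variable names are illustrative, not taken from the training code. Confidence is the probability of the most likely class, so the lowest-confidence samples are the most uncertain.

```python
import numpy as np

def least_confidence_query(probs: np.ndarray, query_size: int) -> np.ndarray:
    """Return indices of the `query_size` least-confident pool samples.

    `probs` is an (n_samples, n_classes) array of predicted class
    probabilities for the unlabeled pool.
    """
    confidence = probs.max(axis=1)          # probability of the top class
    return np.argsort(confidence)[:query_size]  # lowest confidence first

# Example: of four pool samples, the two closest to 0.5/0.5 are queried.
pool_probs = np.array([
    [0.99, 0.01],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.51, 0.49],
])
picked = least_confidence_query(pool_probs, query_size=2)
print(sorted(picked.tolist()))  # → [1, 3]
```

The selected indices would then be labeled and appended to the training set before the next retraining round.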
| Class | Precision | Recall | F1-Score | Support |
| :--- | :--- | :--- | :--- | :--- |
| Non-Virulent | 1.00 | 1.00 | 1.00 | 2554 |
| Virulent | 1.00 | 1.00 | 1.00 | 2545 |
| **Accuracy** | **-** | **-** | **1.00** | **5099** |
| **Macro Avg** | **1.00** | **1.00** | **1.00** | **5099** |
| **Weighted Avg** | **1.00** | **1.00** | **1.00** | **5099** |

* **Intermediate Evaluation Confusion Matrix:**

![Placeholder for Confusion Matrix](validation.png)

_A detailed breakdown of correct and incorrect predictions on the validation set._

#### Final Evaluation on Held-out Test Set

This report presents the model's final, unbiased performance on data it has never seen. These metrics are the most reliable indicators of the model's performance on new, unseen protein sequences.

* **Final Test Accuracy:** 0.9600
* **Final Test F1 Score (Macro):** 0.9600

| Class | Precision | Recall | F1-Score | Support |
| :--- | :--- | :--- | :--- | :--- |
| Negative | 0.97 | 0.95 | 0.96 | 6491 |
| Positive | 0.95 | 0.97 | 0.96 | 6492 |
| **Accuracy** | **-** | **-** | **0.96** | **12983** |
| **Macro Avg** | **0.96** | **0.96** | **0.96** | **12983** |
| **Weighted Avg** | **0.96** | **0.96** | **0.96** | **12983** |

* **Final Test Confusion Matrix:**

![Placeholder for Confusion Matrix](test.png)

_A detailed breakdown of correct and incorrect predictions on the test set._

## Visualization

The following visualizations provide further insight into the model's training process and performance.
* **Active Learning Performance:**

![Placeholder for Active Learning Performance Plot](alp.png)

_This plot shows the model's accuracy improvement over successive active learning iterations._

* **CLS Embeddings Visualization (t-SNE):**

![Placeholder for CLS Embeddings Visualization](tsne.png)

_This plot shows the separability of the positive and negative classes in a reduced-dimension space using t-SNE._

* **CLS Embeddings Visualization (UMAP):**

![Placeholder for CLS Embeddings Visualization](umap.png)

_This plot shows the separability of the positive and negative classes in a reduced-dimension space using UMAP._

## Web Interface for Easy Access

For a user-friendly way to interact with the model, you can use the Streamlit web application. The interface lets you predict on a single protein sequence or upload a multi-sequence FASTA file for batch processing. The application also handles hardware constraints automatically, falling back to CPU if a CUDA device is unavailable or runs out of memory.

To run the web interface, follow these steps:

1. **Clone the repository:** Open your terminal and clone the `AVP-ESM` repository from GitHub.

    ```bash
    git clone https://github.com/kssrikar4/AVP-ESM.git
    cd AVP-ESM
    ```

2. **Install dependencies:** The application requires several Python libraries, which you can install from the repository's `requirements.txt` file.

    ```bash
    python -m venv py
    source py/bin/activate  # On Windows: `py\Scripts\activate`
    pip install -r requirements.txt
    ```

3. **Run the application:** Once the dependencies are installed, launch the web interface with:

    ```bash
    streamlit run app.py
    ```

Your default web browser should open automatically, displaying the Protein Virulence Predictor application.

## Licensing

This model is licensed under the **GNU Lesser General Public License v2.1 (LGPL-2.1)**.