kssrikar4 committed on
Commit
8aa0d9d
·
verified ·
1 Parent(s): 9432358

Update README.md

Files changed (1)
  1. README.md +118 -3
README.md CHANGED
@@ -1,3 +1,118 @@
1
- ---
2
- license: lgpl-2.1
3
- ---
1
+ ---
2
+ license: lgpl
3
+ base_model:
4
+ - facebook/esm2_t30_150M_UR50D
5
+ tags:
6
+ - virulence-prediction
7
+ - biology
8
+ - bioinformatics
9
+ - microbiology
10
+ pipeline_tag: token-classification
11
+ library_name: transformers
12
+ ---
13
+ # Active Virulence Prediction using Evolutionary Scale Modeling
14
+
15
+ This model is a fine-tuned `facebook/esm2_t30_150M_UR50D` protein language model for the binary classification of protein sequences. The model is designed to distinguish between virulence factors and non-virulence factors.
16
+
17
+ ## Intended Use
18
+
19
+ The primary purpose of this model is to predict whether a given protein sequence functions as a virulence factor. This can be applied in bioinformatics research, drug discovery, and pathogen analysis to identify and characterize potential therapeutic targets. The model is not intended for clinical diagnostic purposes.
20
+
21
+ ## Data Sources
22
+
23
+ The model was trained on a curated dataset of protein sequences from multiple sources, categorized as positive (virulence factors) and negative (non-virulence factors).
24
+
25
+ * **Positive Data:**
26
+
27
+ * **VFDB (Virulence Factor Database):** The Core dataset and Full dataset from mgc.ac.cn were utilized as sources for positive samples.
28
+
29
+ * **VPAgs-Dataset4ML:** The `positive.fasta` file containing 210 protective antigen sequences from data.mendeley.com was included.
30
+
31
+ * **VirulentPred 2.0:** The Positive Test dataset from bioinfo.icgeb.res.in contributed to the positive samples.
32
+
33
+ * **Negative Data:**
34
+
35
+ * **VPAgs-Dataset4ML:** The `negative.fasta` file containing 1,935 non-protective protein sequences from data.mendeley.com was used.
36
+
37
+ * **VirulentPred 2.0:** The Negative Test dataset was included for non-virulence factors.
38
+
39
+ * **InterPro:** A selection of proteins from InterPro (ebi.ac.uk/interpro), restricted to entries with particular conserved domains, was used to augment the negative dataset.
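As an illustration of how FASTA sources like these could be merged into one labeled set, here is a minimal sketch in plain Python. It is not the author's preprocessing code: the helper names `read_fasta` and `build_labeled_set`, the 1/0 label encoding, and the duplicate-removal step are all assumptions made for the example, and the in-memory strings merely stand in for files such as `positive.fasta` and `negative.fasta`.

```python
from io import StringIO
from typing import Iterable, Iterator, List, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file or any iterable of lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_labeled_set(positive: Iterable[str], negative: Iterable[str]) -> List[Tuple[str, int]]:
    """Label sequences 1 (positive) or 0 (negative).

    Assumption for this sketch: exact duplicate sequences are dropped and the
    first occurrence wins. The README does not say how duplicates were handled.
    """
    seen, dataset = set(), []
    for handle, label in ((positive, 1), (negative, 0)):
        for _, seq in read_fasta(handle):
            seq = seq.upper()
            if seq not in seen:
                seen.add(seq)
                dataset.append((seq, label))
    return dataset


# Tiny in-memory stand-ins for positive.fasta and negative.fasta (made-up sequences).
pos = StringIO(">vf1\nMKTAYIAK\nQRQISF\n>vf2\nMSLLTE\n")
neg = StringIO(">nv1\nMADEEK\n>dup\nmsllte\n")
dataset = build_labeled_set(pos, neg)
print(dataset)
```

With real files, the same functions accept open file handles, for example `build_labeled_set(open("positive.fasta"), open("negative.fasta"))`.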
40
+
41
+ ## Training Procedure
42
+
43
+ The model was fine-tuned using a multi-iteration active learning approach.
44
+
45
+ ### Model Architecture
46
+
47
+ * **Base Model:** `facebook/esm2_t30_150M_UR50D`
48
+
49
+ * **Head:** A linear classification head was added on top of the base model's final hidden state.
50
+
51
+ ### Training Strategy
52
+
53
+ The training process employed an active learning loop with a Least Confidence querying strategy.
54
+
55
+ 1. **Initial Training:** The model was initially trained on a small, randomly sampled subset of the labeled data.
56
+
57
+ 2. **Iterative Querying:** In each iteration, the model was used to predict on a large pool of unlabeled data.
58
+
59
+ 3. **Uncertainty Sampling:** The `query_size` most uncertain samples were identified.
60
+
61
+ 4. **Re-labeling and Retraining:** These newly selected samples were added to the labeled training set, and the model was retrained on the expanded dataset. This process was repeated for several iterations, progressively improving the model's performance by focusing on the most challenging examples.
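The querying step above can be sketched in a few lines. Least Confidence scores each pool sample as one minus its highest predicted class probability, then selects the `query_size` samples with the largest scores. The code below is a dependency-free illustration under that standard definition, not the training script used for this model; the function names and the example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def least_confidence_scores(pool_logits: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty per sample: 1 - max class probability (higher means less confident)."""
    return [1.0 - max(softmax(row)) for row in pool_logits]


def select_queries(pool_logits: Sequence[Sequence[float]], query_size: int) -> List[int]:
    """Return the indices of the `query_size` most uncertain pool samples."""
    scores = least_confidence_scores(pool_logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:query_size]


# Hypothetical two-class logits for four unlabeled sequences.
pool_logits = [[4.0, -4.0], [0.1, 0.0], [-2.0, 2.0], [0.5, -0.5]]
queried = select_queries(pool_logits, query_size=2)
print(queried)  # [1, 3]: the two samples whose top probability is closest to 0.5
```

For a binary classifier the score lies between 0 and 0.5, so samples near 0.5 are the ones sitting closest to the decision boundary.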
62
+
63
+ ## Evaluation
64
+
65
+ The model's performance was evaluated in two ways: an intermediate evaluation on a validation set used during the active learning process, and a final evaluation on a held-out test set. The two reports serve different purposes, described below.
66
+
67
+ ### Intermediate Evaluation on Validation Set
68
+ This report reflects the model's performance on the validation set. This data was used inside the training loop to monitor progress and guide the active learning strategy, which directs the model toward the most uncertain examples. Because the model was exposed to this data during retraining, the perfect scores show that it fits that data well; they are not an estimate of performance on unseen sequences.
69
+
70
+ | Class | Precision | Recall | F1-Score | Support |
71
+ | :--- | :--- | :--- | :--- | :--- |
72
+ | Non-Virulent | 1.00 | 1.00 | 1.00 | 2554 |
73
+ | Virulent | 1.00 | 1.00 | 1.00 | 2545 |
74
+ | **Accuracy** | **-** | **-** | **1.00** | **5099** |
75
+ | **Macro Avg** | **1.00** | **1.00** | **1.00** | **5099** |
76
+ | **Weighted Avg** | **1.00** | **1.00** | **1.00** | **5099** |
77
+
78
+ * **Intermediate Evaluation Confusion Matrix:**
79
+ ![Placeholder for Confusion Matrix](validation.png)
80
+ _A detailed breakdown of correct and incorrect predictions on the validation set._
81
+
82
+ ### Final Evaluation on Held-out Test Set
83
+ This report presents the model's final, unbiased performance on data it has never seen. These metrics are the most reliable indicators of the model's performance on new, unseen protein sequences.
84
+
85
+ * **Final Test Accuracy:** 0.9600
86
+ * **Final Test F1 Score (Macro):** 0.9600
87
+
88
+ | Class | Precision | Recall | F1-Score | Support |
89
+ | :--- | :--- | :--- | :--- | :--- |
90
+ | Negative | 0.97 | 0.95 | 0.96 | 6491 |
91
+ | Positive | 0.95 | 0.97 | 0.96 | 6492 |
92
+ | **Accuracy** | **-** | **-** | **0.96** | **12983** |
93
+ | **Macro Avg** | **0.96** | **0.96** | **0.96** | **12983** |
94
+ | **Weighted Avg** | **0.96** | **0.96** | **0.96** | **12983** |
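As a consistency check on the table above, the F1 columns can be recomputed from the reported precision and recall using F1 = 2PR / (P + R). The sketch below uses only the rounded values printed in the table, so it verifies internal consistency to two decimals and nothing more; the variable names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) copied from the held-out test report.
report = {
    "Negative": (0.97, 0.95, 6491),
    "Positive": (0.95, 0.97, 6492),
}

per_class_f1 = {name: f1(p, r) for name, (p, r, _) in report.items()}

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: per-class F1 weighted by class support.
total_support = sum(s for _, _, s in report.values())
weighted_f1 = sum(f1(p, r) * s for p, r, s in report.values()) / total_support

print(round(macro_f1, 2), round(weighted_f1, 2), total_support)  # 0.96 0.96 12983
```

Because the two classes have nearly equal support (6491 and 6492), the macro and weighted averages coincide at this precision.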
95
+
96
+ * **Final Test Confusion Matrix:**
97
+ ![Placeholder for Confusion Matrix](test.png)
98
+ _A detailed breakdown of correct and incorrect predictions on the test set._
99
+
100
+ ## Visualization
101
+
102
+ The following visualizations provide further insight into the model's training process and performance.
103
+
104
+ * **Active Learning Performance:**
105
+ ![Placeholder for Active Learning Performance Plot](alp.png)
106
+ _This plot shows the model's accuracy improvement over successive active learning iterations._
107
+
108
+ * **CLS Embeddings Visualization (t-SNE):**
109
+ ![Placeholder for CLS Embeddings Visualization](tsne.png)
110
+ _This plot shows the separability of the positive and negative classes in a reduced-dimension space using t-SNE._
111
+
112
+ * **CLS Embeddings Visualization (UMAP):**
113
+ ![Placeholder for CLS Embeddings Visualization](umap.png)
114
+ _This plot shows the separability of the positive and negative classes in a reduced-dimension space using UMAP._
115
+
116
+ ## Licensing
117
+
118
+ This model is licensed under the **GNU Lesser General Public License v2.1 (LGPL-2.1)**.