Update README.md
README.md
@@ -22,7 +22,7 @@ they strictly follow biological genomic patterns and depend heavily on certain s
 In contrast, natural language models are more resilient and can tolerate linguistic errors such as typos and grammar mistakes.
 Thus, effective RNA sequence curation is crucial to minimize the impact of noisy data and enhance modeling performance.
 Specifically, our data curation protocol is as follows.
-- Sequence truncation and filtering: We truncated RNA sequences exceeding
+- Sequence truncation and filtering: We truncated RNA sequences exceeding 1026 nucleotides to comply with the model's maximum length capacity and
 filtered out sequences shorter than 20 nucleotides to eliminate noise, such as RNA fragment sequences.
 - RNA secondary structure annotation: Given the significant impact of RNA secondary structures on sequence function,
 we annotated the local RNA structures of all RNA sequences using ViennaRNA (with parameters "maxBPspan"=30) [25].
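The curation steps described in this hunk can be sketched in a few lines of Python. The snippet below is only an illustration of the described protocol, not the authors' released pipeline: it truncates sequences to 1026 nucleotides, drops anything shorter than 20 nucleotides, and annotates local structure with the ViennaRNA Python bindings, assuming the "maxBPspan"=30 setting corresponds to the `max_bp_span` model detail. The `curate` helper and its input/output format are invented for the example.

```python
# Hypothetical curation sketch based on the README's description; not the authors' released pipeline.
import RNA  # ViennaRNA Python bindings

MAX_LEN = 1026   # model's maximum input length
MIN_LEN = 20     # shorter sequences are treated as noise (e.g. RNA fragments)

def curate(sequences):
    """Truncate/filter raw RNA sequences and annotate local secondary structure."""
    curated = []
    for seq in sequences:
        seq = seq.upper().replace("T", "U")   # normalise alphabet to RNA
        if len(seq) < MIN_LEN:
            continue                          # discard fragment-like sequences
        seq = seq[:MAX_LEN]                   # truncate to the model's maximum length

        # Local structure prediction: restrict the base-pair span to 30 nt,
        # assumed to mirror the "maxBPspan"=30 setting mentioned above.
        md = RNA.md()
        md.max_bp_span = 30
        fc = RNA.fold_compound(seq, md)
        structure, mfe = fc.mfe()             # dot-bracket annotation + minimum free energy

        curated.append({"sequence": seq, "structure": structure, "mfe": mfe})
    return curated

print(curate(["AUGGCUACGUAGCUAGCUAGGCAUCGAUCGUAGC"]))
```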
@@ -36,7 +36,7 @@ In this study, we developed PlantRNA-FM, a specialised language model based on t
 PlantRNA-FM has 35 million parameters, including 12 transformer network layers, 24 attention heads,
 and an embedding dimension of 480. We applied layer normalisation and residual connections both before and after the encoder block.
 As our focus is on RNA understanding rather than generation, we only utilised the encoder component of the transformer architecture.
-PlantRNA-FM is capable of processing sequences up to
+PlantRNA-FM is capable of processing sequences up to 1026 nucleotides in length, making it compatible with consumer-grade GPUs,
 such as the Nvidia RTX 4090, with a batch size of 16. The model was trained on four A100 GPUs over a period of three weeks,
 completing 3 epochs.

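For orientation, the stated hyperparameters (12 encoder layers, 24 attention heads, embedding dimension 480, maximum length 1026) can be mocked up as a plain PyTorch encoder. This is not the PlantRNA-FM source code: the class name, vocabulary size, feed-forward width, and norm placement below are assumptions, and the tokeniser and pre-training objectives are omitted.

```python
# Illustrative encoder-only configuration using the hyperparameters stated above;
# NOT the PlantRNA-FM implementation, just a sketch of a comparably sized architecture.
import torch
import torch.nn as nn

VOCAB_SIZE = 32      # placeholder; the real tokeniser vocabulary is not specified here
MAX_LEN = 1026       # maximum sequence length supported by the model
D_MODEL = 480        # embedding dimension
N_LAYERS = 12        # transformer encoder layers
N_HEADS = 24         # attention heads

class RnaEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True,  # pre-norm here; the exact norm arrangement in PlantRNA-FM may differ
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)

    def forward(self, tokens):                   # tokens: (batch, seq_len) of token ids
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        return self.encoder(x)                   # (batch, seq_len, D_MODEL) contextual embeddings

model = RnaEncoder()
print(sum(p.numel() for p in model.parameters()))  # ~33-34 million with these settings, close to the quoted 35 million
```

With these settings the parameter count lands in the mid-30-million range, consistent with the roughly 35 million parameters quoted above; differences would come mostly from the true vocabulary size and feed-forward width.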