yangheng committed
Commit 107fa64 · verified · 1 Parent(s): f795e57

Update README.md

Files changed (1): README.md +2 -2
README.md CHANGED
@@ -22,7 +22,7 @@ they strictly follow biological genomic patterns and depend heavily on certain s
 In contrast, natural language models are more resilient and can tolerate linguistic errors such as typos and grammar mistakes.
 Thus, effective RNA sequence curation is crucial to minimize the impact of noisy data and enhance modeling performance.
 Specifically, our data curation protocol is as follows.
-- Sequence truncation and filtering: We truncated RNA sequences exceeding 512 nucleotides to comply with the model's maximum length capacity and
+- Sequence truncation and filtering: We truncated RNA sequences exceeding 1026 nucleotides to comply with the model's maximum length capacity and
 filtered out sequences shorter than 20 nucleotides to eliminate noise, such as RNA fragment sequences.
 - RNA secondary structure annotation: Given the significant impact of RNA secondary structures on sequence function,
 we annotated the local RNA structures of all RNA sequences using ViennaRNA (with parameters "maxBPspan"=30)25.
@@ -36,7 +36,7 @@ In this study, we developed PlantRNA-FM, a specialised language model based on t
 PlantRNA-FM has 35 million parameters, including 12 transformer network layers, 24 attention heads,
 and an embedding dimension of 480. We applied layer normalisation and residual connections both before and after the encoder block.
 As our focus is on RNA understanding rather than generation, we only utilised the encoder component of the transformer architecture.
-PlantRNA-FM is capable of processing sequences up to 512 nucleotides in length, making it compatible with consumer-grade GPUs,
+PlantRNA-FM is capable of processing sequences up to 1026 nucleotides in length, making it compatible with consumer-grade GPUs,
 such as the Nvidia RTX 4090, with a batch size of 16. The model was trained on four A100 GPUs over a period of three weeks,
 completing 3 epochs.
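The length-based curation step in the first hunk (truncate above 1026 nucleotides, drop sequences below 20) can be sketched as follows. This is a minimal illustration, not the released pipeline; the record format and the DNA-to-RNA normalisation are assumptions.

```python
# Sketch of the truncation-and-filtering step described in the README diff above.
MAX_LEN = 1026  # model's maximum input length after this update
MIN_LEN = 20    # shorter fragments are treated as noise

def curate(records):
    """Yield (seq_id, sequence) pairs after truncation and filtering."""
    for seq_id, seq in records:
        seq = seq.upper().replace("T", "U")  # assumption: normalise DNA-style input to RNA
        if len(seq) < MIN_LEN:
            continue                          # drop fragment-like sequences
        yield seq_id, seq[:MAX_LEN]           # truncate overly long sequences

# usage example with toy records
records = [("tx1", "AUGGCU" * 300), ("frag1", "AUGG")]
for seq_id, seq in curate(records):
    print(seq_id, len(seq))  # tx1 is truncated to 1026 nt; frag1 is filtered out
```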
 
 
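For the secondary-structure annotation step, the ViennaRNA Python bindings expose the base-pair span limit quoted in the README ("maxBPspan" = 30) as the model detail `max_bp_span`. A minimal sketch, assuming an MFE dot-bracket annotation is what was stored:

```python
# Local secondary-structure annotation with ViennaRNA, limiting base-pair span to 30 nt.
import RNA

def annotate_structure(seq, max_bp_span=30):
    """Return the MFE dot-bracket structure and free energy for one RNA sequence."""
    md = RNA.md()
    md.max_bp_span = max_bp_span       # only allow local pairs spanning <= 30 nt
    fc = RNA.fold_compound(seq, md)
    structure, mfe = fc.mfe()          # dot-bracket annotation + minimum free energy
    return structure, mfe

ss, mfe = annotate_structure("GGGAAAUCCAGCUUCGGCUGGAUUUCCC")
print(ss, mfe)
```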
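The architecture described in the second hunk (encoder-only, 12 layers, 24 attention heads, embedding dimension 480, maximum length 1026) can be approximated with a generic PyTorch encoder. This is a sketch under assumed vocabulary size and feed-forward width, not the released PlantRNA-FM code.

```python
# Rough sketch of an encoder-only transformer with the hyperparameters quoted above.
import torch
import torch.nn as nn

class RnaEncoder(nn.Module):
    def __init__(self, vocab_size=16, d_model=480, n_heads=24,
                 n_layers=12, max_len=1026, dim_ff=1920):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # vocab size is an assumption
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=dim_ff,
            batch_first=True, norm_first=True)  # pre-norm here; the text describes normalisation before and after the block
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        return self.encoder(x)

model = RnaEncoder()
out = model(torch.randint(0, 16, (2, 128)))                # (batch, length, d_model)
print(sum(p.numel() for p in model.parameters()))          # ~34M, close to the 35M quoted above
```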