---
tags:
- biology
- DNA
- genomics
---
This is the official pre-trained model introduced in [GROVER: A foundation DNA language with optimized vocabulary learns sequence context in the human genome](https://www.biorxiv.org/content/10.1101/2023.07.19.549677v2).

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")
```

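Once loaded, the model can be run on a tokenized DNA sequence to obtain per-token logits. The snippet below is a minimal sketch, not part of the original card; the example sequence is a placeholder, and the exact input formatting expected by GROVER should be checked against the paper:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")

# Placeholder DNA sequence; real inputs should include flanking context (see below)
sequence = "ATTGGCACTAGGCCATTAGCAT"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Logits over the BPE vocabulary for each token position: (batch, seq_len, vocab_size)
logits = outputs.logits
```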
Preliminary analysis shows that Byte Pair Encoding (BPE) re-tokenization changes significantly for sequences shorter than 50 nucleotides. Even for longer sequences, be careful with tokenization near the sequence edges.
We advise adding 100 nucleotides at the beginning and end of every sequence to guarantee that your sequence is represented with the same tokens as the original tokenization.
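As a sketch of the padding step described above (the helper name `add_flanks` and the all-`A`/all-`T` flanks are illustrative only; in practice the flanks should be the sequence's real genomic context):

```python
def add_flanks(sequence: str, left_flank: str, right_flank: str, n: int = 100) -> str:
    """Pad a sequence with n nucleotides of context on each side so that
    BPE tokenization of the core sequence matches the original
    whole-chromosome tokenization."""
    if len(left_flank) < n or len(right_flank) < n:
        raise ValueError("each flank must provide at least n nucleotides")
    # Take the n nucleotides immediately adjacent to the core sequence
    return left_flank[-n:] + sequence + right_flank[:n]

# Illustrative call with placeholder flanks
core = "ACGTACGT"
padded = add_flanks(core, "A" * 100, "T" * 100)
assert len(padded) == len(core) + 200
```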
We also provide the tokenized chromosomes with their respective nucleotide mappers (available in the folder `tokenized chromosomes`).