---
tags:
- dnabert
- bacteria
- kmer
- translation-initiation-site
- sequence-modeling
library_name: transformers
---

# BacteriaTIS-DNABERT-K6-89M

This model, `BacteriaTIS-DNABERT-K6-89M`, is a **DNA sequence classifier** based on **DNABERT**, fine-tuned for **Translation Initiation Site (TIS) classification** in bacterial genomes. It operates on **6-mer tokenized sequences** derived from a **60 bp window (30 bp upstream + 30 bp downstream)** around the TIS. The fine-tuned model has **89M trainable parameters**.

## Model Details
- **Base Model:** DNABERT
- **Task:** Translation Initiation Site (TIS) Classification
- **K-mer Size:** 6
- **Input Sequence Window:** 60 bp (30 bp upstream + 30 bp downstream) around the TIS in the ORF sequence
- **Number of Trainable Parameters:** 89M
- **Max Sequence Length:** 512 tokens
- **Precision Used:** AMP (Automatic Mixed Precision)

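Since the model uses k = 6 with a stride of 1, a 60 bp window tokenizes into 60 - 6 + 1 = 55 overlapping 6-mers, comfortably below the 512-token maximum. A quick sanity check of that arithmetic:

```python
window = 60  # bp: 30 upstream + 30 downstream of the TIS
k = 6        # k-mer size used by this model

# Number of overlapping k-mers produced with stride 1
num_tokens = window - k + 1
print(num_tokens)  # 55, well under the 512-token max sequence length
```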
---

### **Install Dependencies**
Ensure you have `torch` and `transformers` installed:
```bash
pip install torch transformers
```

### **Load Model & Tokenizer**
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_checkpoint = "Genereux-akotenou/BacteriaTIS-DNABERT-K6-89M"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```

### **Inference Example**
To classify a TIS, extract a 60 bp window (30 bp upstream + 30 bp downstream) around the start codon and convert it to overlapping 6-mers:
```python
def generate_kmer(sequence: str, k: int, overlap: int = 1):
    """Generate a space-separated k-mer encoding of a DNA sequence.

    `overlap` is the step between successive k-mers (1 = maximally overlapping).
    """
    return " ".join(sequence[j:j + k] for j in range(0, len(sequence) - k + 1, overlap))

# Example TIS-centered sequence (placeholder; a real input should be exactly 60 bp)
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
seq_kmer = generate_kmer(sequence, k=6)
```

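For instance, the helper turns a short toy sequence into space-separated overlapping 6-mers (redefined here so the snippet runs standalone):

```python
def generate_kmer(sequence: str, k: int, overlap: int = 1):
    """Generate a space-separated k-mer encoding of a DNA sequence."""
    return " ".join(sequence[j:j + k] for j in range(0, len(sequence) - k + 1, overlap))

print(generate_kmer("ATGGCCAT", k=6))  # ATGGCC TGGCCA GGCCAT
```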
### **Run Model**
```python
# Tokenize the k-mer string
inputs = tokenizer(
    seq_kmer,
    return_tensors="pt",
    max_length=tokenizer.model_max_length,
    padding="max_length",
    truncation=True
)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=-1).item()
```
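The model returns raw logits; a softmax turns them into class probabilities, and `model.config.id2label` (if set on the checkpoint) maps the predicted index to a label name. A minimal, framework-free sketch of that post-processing, using made-up logit values for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for a two-class head; NOT real model output
logits = [-1.2, 2.3]
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class)  # 1, the higher-logit class
```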

<!-- ### **Citation**
If you use this model in your research, please cite:
```tex
@article{paper title,
  title={DNABERT for Bacterial Translation Initiation Site Classification},
  author={Genereux Akotenou, et al.},
  journal={Hugging Face Model Hub},
  year={2024}
}
``` -->