---
tags:
- dnabert
- bacteria
- kmer
- translation-initiation-site
- sequence-modeling
library_name: transformers
---

# BacteriaTIS-DNABERT-K6-89M

This model, `BacteriaTIS-DNABERT-K6-89M`, is a **DNA sequence classifier** based on **DNABERT**, fine-tuned for **Translation Initiation Site (TIS) classification** in bacterial genomes. It operates on **6-mer tokenized sequences** derived from a **60 bp window (30 bp upstream + 30 bp downstream)** around the TIS. The fine-tuned model has **89M trainable parameters**.

## Model Details
- **Base Model:** DNABERT
- **Task:** Translation Initiation Site (TIS) Classification
- **K-mer Size:** 6
- **Input Sequence Window:** 60 bp (30 bp upstream + 30 bp downstream) around the TIS in the ORF sequence
- **Number of Trainable Parameters:** 89M
- **Max Sequence Length:** 512 tokens
- **Precision Used:** AMP (Automatic Mixed Precision)

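Since the model uses k = 6 with a stride of 1, a 60 bp window tokenizes into 60 - 6 + 1 = 55 overlapping 6-mers, comfortably below the 512-token maximum. A quick sanity check of that arithmetic:

```python
window = 60  # bp: 30 upstream + 30 downstream of the TIS
k = 6        # k-mer size used by this model

# Number of overlapping k-mers produced with stride 1
num_tokens = window - k + 1
print(num_tokens)  # 55, well under the 512-token max sequence length
```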
---

### **Install Dependencies**
Ensure you have `torch` and `transformers` installed:
```bash
pip install torch transformers
```

### **Load Model & Tokenizer**
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_checkpoint = "Genereux-akotenou/BacteriaTIS-DNABERT-K6-89M"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```

### **Inference Example**
To classify a TIS, extract a 60 bp window (30 bp upstream + 30 bp downstream) around the start codon and convert it to overlapping 6-mers:
```python
def generate_kmer(sequence: str, k: int, overlap: int = 1):
    """Generate a space-separated k-mer encoding of a DNA sequence.

    `overlap` is the step between successive k-mers (1 = maximally overlapping).
    """
    return " ".join(sequence[j:j + k] for j in range(0, len(sequence) - k + 1, overlap))

# Example TIS-centered sequence (placeholder; a real input should be exactly 60 bp)
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
seq_kmer = generate_kmer(sequence, k=6)
```

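For instance, the helper turns a short toy sequence into space-separated overlapping 6-mers (redefined here so the snippet runs standalone):

```python
def generate_kmer(sequence: str, k: int, overlap: int = 1):
    """Generate a space-separated k-mer encoding of a DNA sequence."""
    return " ".join(sequence[j:j + k] for j in range(0, len(sequence) - k + 1, overlap))

print(generate_kmer("ATGGCCAT", k=6))  # ATGGCC TGGCCA GGCCAT
```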
### **Run Model**
```python
# Tokenize the k-mer string
inputs = tokenizer(
    seq_kmer,
    return_tensors="pt",
    max_length=tokenizer.model_max_length,
    padding="max_length",
    truncation=True
)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=-1).item()
```
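The model returns raw logits; a softmax turns them into class probabilities, and `model.config.id2label` (if set on the checkpoint) maps the predicted index to a label name. A minimal, framework-free sketch of that post-processing, using made-up logit values for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for a two-class head; NOT real model output
logits = [-1.2, 2.3]
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class)  # 1, the higher-logit class
```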

<!-- ### **Citation**
If you use this model in your research, please cite:
```tex
@article{paper title,
  title={DNABERT for Bacterial Translation Initiation Site Classification},
  author={Genereux Akotenou, et al.},
  journal={Hugging Face Model Hub},
  year={2024}
}
``` -->