---
tags:
- dnabert
- bacteria
- kmer
- translation-initiation-site
- sequence-modeling
library_name: transformers
---

# BacteriaTIS-DNABERT-K6-89M

`BacteriaTIS-DNABERT-K6-89M` is a **DNA sequence classifier** based on **DNABERT**, fine-tuned for **Translation Initiation Site (TIS) classification** in bacterial genomes. It operates on **6-mer tokenized sequences** derived from a **60 bp window (30 bp upstream + 30 bp downstream)** around the candidate TIS. The fine-tuned model has **89M trainable parameters**.

## Model Details
- **Base Model:** DNABERT
- **Task:** Translation Initiation Site (TIS) Classification
- **K-mer Size:** 6
- **Input Sequence Window:** 60 bp (30 bp upstream + 30 bp downstream) around the TIS in the ORF sequence
- **Number of Trainable Parameters:** 89M
- **Max Sequence Length:** 512 tokens
- **Precision:** AMP (Automatic Mixed Precision)

---
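As a quick sanity check on these numbers: a 60 bp window tokenized into overlapping 6-mers yields 60 - 6 + 1 = 55 tokens, comfortably below the 512-token maximum. A minimal sketch (the window string here is an arbitrary stand-in, not real data):

```python
# Overlapping 6-mers from a 60 bp window: one token per position, 60 - 6 + 1 = 55 total.
window = "A" * 60  # stand-in for a real 60 bp TIS-centered window
k = 6
kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
print(len(kmers))  # 55 tokens, well below the 512-token limit
```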

### **Install Dependencies**
Ensure you have `transformers` and `torch` installed:
```bash
pip install torch transformers
```

### **Load Model & Tokenizer**
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_checkpoint = "Genereux-akotenou/BacteriaTIS-DNABERT-K6-89M"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```

### **Inference Example**
To classify a candidate TIS, extract a 60 bp window (30 bp upstream + 30 bp downstream of the TIS codon) and convert it to overlapping 6-mers:
```python
def generate_kmer(sequence: str, k: int, stride: int = 1):
    """Split a DNA sequence into space-separated, overlapping k-mers."""
    return " ".join(sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride))

# Example TIS-centered window (placeholder: a real input should be exactly 60 bp)
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
seq_kmer = generate_kmer(sequence, k=6)
```
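In practice the 60 bp window is cut out of a genome around a known or candidate start-codon position. A minimal, hypothetical helper, assuming a 0-based TIS position (the toy genome and position below are illustrative, not from the model card):

```python
def extract_tis_window(genome: str, tis_pos: int, flank: int = 30) -> str:
    """Return `flank` bp upstream + `flank` bp downstream of the
    start codon beginning at 0-based position `tis_pos`."""
    start = max(0, tis_pos - flank)
    return genome[start:tis_pos + flank]

# Illustrative only: a toy 100 bp "genome" with a candidate TIS at position 40
genome = "ACGT" * 25
window = extract_tis_window(genome, tis_pos=40)
print(len(window))  # 60
```

Windows near a contig boundary come back shorter than 60 bp; how such edge cases were handled in training is not stated in the card.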

### **Run Model**
```python
# Tokenize the k-mer input
inputs = tokenizer(
    seq_kmer,
    return_tensors="pt",
    max_length=tokenizer.model_max_length,
    padding="max_length",
    truncation=True
)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=-1).item()
```
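The integer class index can also be paired with a probability. The snippet below uses a dummy logits tensor purely to illustrate the post-processing; in real use the logits come from `outputs.logits`, and the label names (not documented in this card) live in `model.config.id2label`:

```python
import torch

# Dummy logits standing in for `outputs.logits` (shape: [batch, num_classes])
logits = torch.tensor([[-1.2, 2.3]])

# Softmax turns logits into class probabilities
probs = torch.softmax(logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()
confidence = probs[0, predicted_class].item()

# In real use, map the index through the model config, e.g.:
# label = model.config.id2label[predicted_class]
print(predicted_class, confidence)
```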

<!-- ### **Citation**
If you use this model in your research, please cite:
```tex
@article{paper title,
  title={DNABERT for Bacterial Translation Initiation Site Classification},
  author={Genereux Akotenou, et al.},
  journal={Hugging Face Model Hub},
  year={2024}
}
``` -->