Dichopsis committed
Commit baf2b66 · verified · 1 Parent(s): 611a17b

Update README.md

Files changed (1)
  1. README.md +47 -1
README.md CHANGED
@@ -4,4 +4,50 @@ metrics:
  - r_squared
  base_model:
  - InstaDeepAI/nucleotide-transformer-v2-500m-multi-species
- ---
+ ---
+
+ # PanDrugTransformer Model Card
+
+ ## Model Overview
+
+ **PanDrugTransformer** is a sequence-to-value regression model designed to predict `RT_transformed` from nucleotide sequences and drug context.
+
+ - **Architecture:** Custom transformer with cross-attention between nucleotide sequence and drug embedding, plus a regression head.
+ - **Base Model:** [`InstaDeepAI/nucleotide-transformer-v2-500m-multi-species`](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species)
+ - **Purpose:** Predict readthrough rates for given nucleotide sequences and drug conditions.
+
+ ## Training Procedure
+
+ - **Hyperparameter Optimization:** Optuna was used to tune the model's hyperparameters.
+ - **Final Training:** The best hyperparameters were then used for a full training run on the processed splits.
+ - **Evaluation Metrics:** R² (coefficient of determination) on the validation and test sets.
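
For readers unfamiliar with the evaluation metric, R² compares the model's squared error against that of a baseline that always predicts the mean of the targets: R² = 1 - SS_res / SS_tot. The following is a minimal, dependency-free sketch of that formula. The function name `r_squared` is illustrative and is not taken from this repository's training code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot
```

A perfect predictor scores 1.0, a predictor no better than the mean scores 0.0, and worse predictors score below zero.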
+
+ ## Data
+
+ - **Splits:** Model trained and evaluated on processed train/validation/test splits.
+ - **Features:** Each sample includes a nucleotide sequence and a drug name column (embedded for cross-attention).
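
The model card does not say how drug names are turned into embedding inputs. One common approach, sketched below purely as an assumption, is to map each distinct name in the drug column to an integer index that an embedding table can look up. The helper names `build_drug_vocab` and `encode_drugs`, and the placeholder name `"DrugB"`, are hypothetical and do not come from this repository.

```python
def build_drug_vocab(drug_names):
    """Map each distinct drug name to a stable integer index (sorted order)."""
    return {name: idx for idx, name in enumerate(sorted(set(drug_names)))}


def encode_drugs(drug_names, vocab):
    """Convert a column of drug names into embedding indices."""
    unknown = [name for name in drug_names if name not in vocab]
    if unknown:
        raise ValueError(f"Unknown drug name(s): {sorted(set(unknown))}")
    return [vocab[name] for name in drug_names]


# "DrugB" is a placeholder, not a drug listed in the model card.
drug_column = ["Clitocine", "DrugB", "Clitocine"]
vocab = build_drug_vocab(drug_column)
indices = encode_drugs(drug_column, vocab)
```

Sorting the names before indexing keeps the mapping reproducible across runs, which matters if the same indices must be reused at inference time.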
+
+ ## Usage Instructions
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # The custom architecture may need trust_remote_code=True to load; check the repository files.
+ model = AutoModel.from_pretrained("Dichopsis/TransStop")
+ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-500m-multi-species")
+ model.eval()  # Disable dropout for inference
+
+ # Example input
+ sequence = "CGTTGGTAGCCAATT"  # 15 nt: 6 nt upstream, stop codon (TAG), 6 nt downstream
+ drug_name = "Clitocine"  # Format as required by the model
+
+ inputs = tokenizer(sequence, return_tensors="pt")
+ # Pass the drug name alongside the tokenized sequence, as required by the model's API
+ with torch.no_grad():
+     outputs = model(**inputs, drug_name=drug_name)
+ prediction = outputs.logits.item()  # Regression output (a single float)
+ ```
+
+ ## Notes for Hugging Face Users
+
+ - **Drug Embedding:** The drug name is embedded and integrated via cross-attention.
+ - **Regression Head:** The model outputs a single continuous value.
+ - **Compatibility:** Requires a 15 nt nucleotide sequence (6 nt, stop codon, 6 nt) and a drug name as input.
+ - **Evaluation:** R² is reported for the validation and test splits.
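
Because the model expects a fixed 15 nt window, it can help to check inputs before tokenizing them. The sketch below assumes a DNA alphabet (A, C, G, T) and the three standard stop codons (TAA, TAG, TGA) at positions 7 to 9. The model card does not state which stop codons or alphabets the model accepts, so treat both as assumptions. The function name `validate_context` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard DNA stop codons


def validate_context(sequence):
    """Check the 6nt-STOP-6nt layout and return (upstream, stop, downstream)."""
    seq = sequence.upper()
    if len(seq) != 15:
        raise ValueError(f"Expected 15 nt, got {len(seq)}")
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"Unexpected characters: {sorted(invalid)}")
    stop = seq[6:9]
    if stop not in STOP_CODONS:
        raise ValueError(f"Positions 7-9 are {stop!r}, not a stop codon")
    return seq[:6], stop, seq[9:]


upstream, stop, downstream = validate_context("CGTTGGTAGCCAATT")
```

For the example sequence from the usage section, this splits the input into `CGTTGG`, `TAG`, and `CCAATT`.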