---
license: apache-2.0
metrics:
- r_squared
base_model:
- InstaDeepAI/nucleotide-transformer-v2-500m-multi-species
---

# PanDrugTransformer Model Card

## Model Overview

**PanDrugTransformer** is a sequence-to-value regression model that predicts stop-codon readthrough from a short nucleotide sequence context and a drug condition.

- **Architecture:** Custom transformer with cross-attention between nucleotide sequence and drug embedding, plus a regression head.
- **Base Model:** [`InstaDeepAI/nucleotide-transformer-v2-500m-multi-species`](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species)
- **Purpose:** Predict the readthrough rate for a given nucleotide sequence (6 nt upstream, the stop codon, 6 nt downstream) under a given drug condition.
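
To make the architecture bullet more concrete, here is a minimal single-head cross-attention sketch in plain Python. It assumes the drug embedding acts as the query and the sequence token embeddings act as keys and values, followed by a linear regression head. The card does not state the attention direction, the dimensions, or the number of heads, so every name and number below is illustrative, and the learned projection matrices are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def regression_head(context, head_weights, bias):
    """Linear layer mapping the attended context to one continuous value."""
    return dot(context, head_weights) + bias


# Hypothetical toy embeddings (assumption: 4 dimensions, 3 sequence tokens).
drug_embedding = [0.5, -0.2, 0.1, 0.3]
token_embeddings = [
    [0.1, 0.0, 0.2, -0.1],
    [0.4, -0.3, 0.0, 0.2],
    [-0.2, 0.1, 0.3, 0.0],
]

context, weights = cross_attend(drug_embedding, token_embeddings, token_embeddings)
prediction = regression_head(context, [0.2, -0.1, 0.4, 0.3], bias=0.05)
```

In this sketch the attention weights say how strongly the drug query draws on each sequence token. The real model may just as well let sequence tokens attend to the drug embedding instead.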

## Training Procedure

- **Hyperparameter Optimization:** Optuna was used to tune model parameters.
- **Final Training:** The best hyperparameters found during the search were used for a full training run on the processed splits.
- **Evaluation Metrics:** R² (coefficient of determination) on validation/test sets.
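
For reference, the coefficient of determination compares the model's squared error with that of a baseline that always predicts the mean target. The function below is a minimal sketch of the standard definition, R² = 1 − SS_res / SS_tot. The card does not say which implementation the authors used, so the function name and the toy values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Toy values, not results from this model.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # exact predictions
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicts the mean
```

A value of 1 means the predictions match the targets exactly, 0 means the model does no better than predicting the mean, and negative values mean it does worse.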

## Data

- **Splits:** Model trained and evaluated on processed train/validation/test splits.
- **Features:** Each sample includes a nucleotide sequence and a drug name column (embedded for cross-attention).
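
One common way to turn a drug name column into something an embedding layer can consume is to map each distinct name to an integer id. The sketch below assumes that approach. The column names `sequence` and `drug` and the placeholder name `DrugB` are hypothetical, and the card does not describe how the model actually encodes drug names.

```python
def build_drug_vocab(drug_names):
    """Assign a stable integer id to each distinct drug name (sorted order)."""
    return {name: idx for idx, name in enumerate(sorted(set(drug_names)))}


# Hypothetical rows; only "Clitocine" appears in this card.
rows = [
    {"sequence": "CGTTGGTAGCCAATT", "drug": "Clitocine"},
    {"sequence": "AAACCCTGAGGGTTT", "drug": "DrugB"},
    {"sequence": "CGTTGGTAACCAATT", "drug": "Clitocine"},
]

vocab = build_drug_vocab(row["drug"] for row in rows)
drug_ids = [vocab[row["drug"]] for row in rows]  # ids index an embedding table
```

Sorting the names before assigning ids keeps the mapping reproducible across runs, which matters if the ids are later used to look up trained embedding rows.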

## Usage Instructions

```python
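import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: the repository ships custom modeling code, which transformers
# only loads when trust_remote_code=True is passed.
model = AutoModel.from_pretrained("Dichopsis/TransStop", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-500m-multi-species")
model.eval()  # disable dropout for inference

# Example input: 6 nt upstream + stop codon (TAG) + 6 nt downstream = 15 nt
sequence = "CGTTGGTAGCCAATT"
drug_name = "Clitocine"  # Format as required by model

inputs = tokenizer(sequence, return_tensors="pt")
# Pass the drug name alongside the tokenized sequence, as required by the model's API
with torch.no_grad():  # no gradients needed for prediction
    outputs = model(**inputs, drug_name=drug_name)
prediction = outputs.logits.item()  # Regression output: predicted readthrough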
```

## Notes for Hugging Face Users

- **Drug Embedding:** Drug name is embedded and integrated via cross-attention.
- **Regression Head:** Model outputs a continuous value.
- **Compatibility:** Requires a 15 nt nucleotide sequence (6 nt upstream, the 3 nt stop codon, 6 nt downstream) and a drug name as input.
- **Evaluation:** R² reported for validation/test splits.
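
Because the model expects a fixed 15 nt layout, it can help to check inputs before tokenizing them. The helper below is a hypothetical sketch, not part of the model's API. It assumes a DNA alphabet (T rather than U), the three standard stop codons (TAA, TAG, TGA), and case-insensitive input.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def validate_context(sequence):
    """Return the upper-cased sequence if it matches the 6nt-STOP-6nt layout."""
    seq = sequence.strip().upper()
    if len(seq) != 15:
        raise ValueError(f"expected 15 nt, got {len(seq)}")
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    stop = seq[6:9]  # positions 7-9 (1-based) hold the stop codon
    if stop not in STOP_CODONS:
        raise ValueError(f"positions 7-9 are {stop!r}, not a stop codon")
    return seq


validated = validate_context("cgttggtagccaatt")
```

Failing early with a clear message is usually easier to debug than a silent prediction on a malformed sequence.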