---
license: apache-2.0
metrics:
- r_squared
base_model:
- InstaDeepAI/nucleotide-transformer-v2-500m-multi-species
---

# PanDrugTransformer Model Card

## Model Overview

**PanDrugTransformer** is a sequence-to-value regression model that predicts stop-codon readthrough from a nucleotide sequence and its drug context.

- **Architecture:** Custom transformer with cross-attention between the nucleotide sequence and a drug embedding, followed by a regression head.
- **Base Model:** [`InstaDeepAI/nucleotide-transformer-v2-500m-multi-species`](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species)
- **Purpose:** Predict readthrough rates for a given nucleotide sequence under a given drug condition.
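
To make the architecture bullet concrete, here is a minimal PyTorch sketch of a cross-attention regressor. It is not the released implementation: the class name `CrossAttentionRegressor`, the dimensions, the number of drugs, and the choice to let the drug embedding act as the query over the sequence token states are all assumptions made for illustration. Random tensors stand in for the encoder outputs of the base model.

```python
import torch
from torch import nn


class CrossAttentionRegressor(nn.Module):
    """Hypothetical sketch: a drug embedding attends over sequence token states."""

    def __init__(self, hidden_dim=64, num_drugs=4, num_heads=4):
        super().__init__()
        # One learned vector per drug (num_drugs is an illustrative value).
        self.drug_embedding = nn.Embedding(num_drugs, hidden_dim)
        self.cross_attention = nn.MultiheadAttention(
            hidden_dim, num_heads, batch_first=True
        )
        # Small MLP that maps the attended representation to one scalar.
        self.regression_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, token_states, drug_ids):
        # token_states: (batch, seq_len, hidden_dim), e.g. encoder outputs
        # drug_ids:     (batch,) integer drug indices
        query = self.drug_embedding(drug_ids).unsqueeze(1)  # (batch, 1, hidden_dim)
        attended, _ = self.cross_attention(query, token_states, token_states)
        return self.regression_head(attended.squeeze(1)).squeeze(-1)  # (batch,)


torch.manual_seed(0)
model = CrossAttentionRegressor()
token_states = torch.randn(2, 5, 64)  # stand-in for encoder hidden states
drug_ids = torch.tensor([0, 3])
predictions = model(token_states, drug_ids)
print(predictions.shape)  # torch.Size([2])
```

Using the drug as the query yields one pooled vector per sample, which feeds naturally into a scalar regression head. The reverse direction (sequence tokens querying the drug embedding) is equally plausible, and the source does not say which one the model uses.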

## Training Procedure

- **Hyperparameter Optimization:** Optuna was used to tune the model's hyperparameters.
- **Final Training:** The best hyperparameters were then used for a full training run on the processed splits.
- **Evaluation Metric:** R² (coefficient of determination) on the validation and test sets.
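
For readers unfamiliar with the metric, R² is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the targets. The plain-Python sketch below illustrates that definition. The function name `r_squared` and the example values are illustrative only and are not results from this model.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1 - ss_res / ss_tot


# Illustrative values only (not measurements from this model).
targets = [1.0, 2.0, 3.0]
print(r_squared(targets, [1.0, 2.0, 3.0]))  # perfect predictions give 1.0
print(r_squared(targets, [2.0, 2.0, 2.0]))  # always predicting the mean gives 0.0
```

A score of 1.0 means the predictions explain all variance in the targets, 0.0 is no better than predicting the mean, and negative values are possible when the predictions are worse than that baseline.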

## Data

- **Splits:** The model was trained and evaluated on processed train/validation/test splits.
- **Features:** Each sample consists of a nucleotide sequence and a drug name column; the drug name is embedded and used in cross-attention.
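
Before a drug name can be embedded, it has to be mapped to an integer index. The sketch below shows one common way to do this. The helper `build_drug_vocab`, the sample records, and the placeholder name `DrugB` are hypothetical. Only `Clitocine` and the first sequence appear in this card, and the actual preprocessing may differ.

```python
def build_drug_vocab(drug_names):
    """Map each distinct drug name to an integer id (sorted for reproducibility)."""
    return {name: index for index, name in enumerate(sorted(set(drug_names)))}


# Hypothetical samples: a 15-nt sequence plus a drug name per row.
samples = [
    {"sequence": "CGTTGGTAGCCAATT", "drug": "Clitocine"},
    {"sequence": "AAACCCTGAGGGTTT", "drug": "DrugB"},  # placeholder drug name
    {"sequence": "CGTTGGTAACCAATT", "drug": "Clitocine"},
]

drug_vocab = build_drug_vocab(sample["drug"] for sample in samples)
drug_ids = [drug_vocab[sample["drug"]] for sample in samples]

print(drug_vocab)  # {'Clitocine': 0, 'DrugB': 1}
print(drug_ids)    # [0, 1, 0]
```

Sorting the names keeps the mapping stable across runs, which matters because an embedding table learned with one index order is meaningless under another.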

## Usage Instructions

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Note: if the repository ships custom modeling code, loading may also
# require trust_remote_code=True (an assumption, not confirmed here).
model = AutoModel.from_pretrained("Dichopsis/TransStop")
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-500m-multi-species")
model.eval()  # disable dropout for inference

# Example input
sequence = "CGTTGGTAGCCAATT"  # 15 nt: 6 nt upstream, stop codon (TAG), 6 nt downstream
drug_name = "Clitocine"  # Format as required by the model

inputs = tokenizer(sequence, return_tensors="pt")

# Pass the drug name alongside the tokenized sequence, as required by the model's API
with torch.no_grad():
    outputs = model(**inputs, drug_name=drug_name)
prediction = outputs.logits.item()  # Regression output (predicted readthrough)
```

## Notes for Hugging Face Users

- **Drug Embedding:** The drug name is embedded and integrated with the sequence representation via cross-attention.
- **Regression Head:** The model outputs a single continuous value per input.
- **Compatibility:** Requires a 15-nt nucleotide sequence (6 nt, stop codon, 6 nt) and a drug name as input.
- **Evaluation:** R² is reported for the validation and test splits.
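
Because the model expects a fixed 15-nt layout, it can help to check inputs before tokenizing them. The helper below is a hypothetical sketch, not part of the model's API. It assumes a DNA alphabet (`A`, `C`, `G`, `T`), the three standard stop codons at positions 7 to 9, and that upper-casing the sequence is acceptable.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons, DNA alphabet


def validate_readthrough_input(sequence, drug_name):
    """Check the 6nt-STOP-6nt layout and return normalized (sequence, drug_name)."""
    sequence = sequence.strip().upper()
    if len(sequence) != 15:
        raise ValueError(f"expected 15 nucleotides, got {len(sequence)}")
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    if sequence[6:9] not in STOP_CODONS:
        raise ValueError(f"positions 7-9 are not a stop codon: {sequence[6:9]}")
    if not drug_name or not drug_name.strip():
        raise ValueError("drug_name must be a non-empty string")
    return sequence, drug_name.strip()


print(validate_readthrough_input("cgttggtagccaatt", " Clitocine "))
# ('CGTTGGTAGCCAATT', 'Clitocine')
```

The check does not confirm that a drug name is one the model was trained on, since the list of supported drugs is not given here.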