
IbnSinna-2B-Pharma: Drug Discovery Language Model
Advanced Pharmaceutical AI for Molecular Discovery
Binary Classification • Regression • Conditional Generation
Note: This model supersedes the earlier IbnSinna-2B-Drug model, which has been deprecated. Please use this version for all applications.
Model Overview
IbnSinna-2B-Pharma is a specialized language model fine-tuned for three core drug discovery tasks:
- Binary Classification: Predicting a wide range of molecular properties with Yes/No answers, including ADMET properties (e.g., clinical toxicity, BBB permeability), bioactivity (e.g., BACE and HIV inhibition), and pathway interactions (e.g., stress response and nuclear receptor binding).
- Regression: Predicting key quantitative physicochemical values, specifically aqueous solubility (logS) and hydration free energy.
- Conditional Generation: Designing novel molecules by providing a chemical scaffold as a structural constraint.
Built on Google's TXGemma-2B architecture and fine-tuned on comprehensive pharmaceutical datasets, this model serves as a powerful tool for computational drug discovery and molecular property prediction.
Important: This model was trained as a plain-text completion model (not chat-based). Prompts should end with "\nAnswer:" and the model will complete the answer directly.
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "OussamaEL/IbnSinna-2B-Pharma",
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained("OussamaEL/IbnSinna-2B-Pharma")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

def ask(prompt, max_new_tokens=32):
    text = prompt + "\nAnswer:"
    x = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    x = {k: v.to(model.device) for k, v in x.items()}
    with torch.no_grad():
        # Disable torch._dynamo for this generation step
        with torch._dynamo.config.patch(disable=True):
            y = model.generate(**x, max_new_tokens=max_new_tokens, do_sample=False,
                               pad_token_id=tokenizer.eos_token_id,
                               eos_token_id=tokenizer.eos_token_id)
    gen = y[0, x["input_ids"].shape[1]:]
    return tokenizer.decode(gen, skip_special_tokens=True).strip()

# Example: HIV Inhibitor Classification
prompt = "Analyze COc1ccc(N2C(=O)C3c4[nH]c5ccc(C)cc5c4C4CCC(C(C)(C)C)CC4C3C2=O)cc1 and determine if it is an HIV inhibitor."
print(ask(prompt))  # Output: "Yes" or "No"
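For screening more than one compound at a time, the same setup can be used with batched generation. The following is a minimal sketch (not part of the original examples), assuming the model, tokenizer, and padding token from the Quick Start above; decoder-only models should be left-padded when batching.

# Batched inference sketch; left padding keeps all completions aligned after the prompt.
tokenizer.padding_side = "left"

prompts = [
    "Analyze Oc1ccc(Cl)c(Cl)c1 and determine if it is a stress response pathway ARE activator.",
    "What is the predicted aqueous solubility (logS) of CCCCCCCO?",
]
batch = tokenizer([p + "\nAnswer:" for p in prompts], return_tensors="pt",
                  padding=True, truncation=True, max_length=512).to(model.device)
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=16, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
for i, p in enumerate(prompts):
    gen = out[i, batch["input_ids"].shape[1]:]
    print(p, "->", tokenizer.decode(gen, skip_special_tokens=True).strip())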
Model Capabilities
The model excels at three primary tasks:
1. Binary Classification
Predicting molecular properties with Yes/No outcomes:
- Clinical Toxicity: Predicting adverse effects and safety profiles (from ClinTox).
- BACE Inhibition: β-secretase 1 inhibitor prediction (from BACE).
- HIV Inhibition: Antiviral activity prediction (from HIV).
- Stress Response Pathways: Activation of pathways like ARE, HSE, and p53 (from Tox21).
- Nuclear Receptor Binding: Ligand prediction for receptors like ER, AR, and AhR (from Tox21).
- ADMET Properties: Blood-Brain Barrier (BBB) permeability (from BBBP).
2. Regression
Predicting quantitative molecular properties:
- Aqueous Solubility (logS): Water solubility prediction (from ESOL).
- Hydration Free Energy: Calculating the free energy of solvation in water (from FreeSolv).
3. Conditional Generation
Designing novel molecules based on constraints:
- Scaffold-based Design: Generating novel molecules that contain a specific user-provided core structure or scaffold.
Detailed Usage Examples
Classification Tasks
classification_prompts = [
    "Analyze Oc1ccc(Cl)c(Cl)c1 and determine if it is a stress response pathway ARE activator.",
    "Analyze COc1ccccc1NC(=O)CC(C)=O and determine if it is a nuclear receptor AhR ligand.",
    "Analyze CC[C@H]1C[C@H]2[C@@H]3CCC4=CC(=O)CC[C@@H]4[C@H]3CC[C@]2(C)[C@H]1O and determine if it is a nuclear receptor AR ligand."
]

for prompt in classification_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(gen, skip_special_tokens=True).strip().split()[0]  # Keep only the first word (Yes/No)
    print(f"Q: {prompt[:50]}...")
    print(f"A: {answer}\n")
Molecule Generation
generation_prompts = [
    "Design a molecule containing the core structure O=C1Nc2ccccc2Oc2ccccc21.",
    "Design a molecule containing the core structure c1ccc2c(c1)Cc1ccccc1N2.",
    "Generate a molecule based on the scaffold c1ccccc1."
]

for prompt in generation_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.7,
        do_sample=True,
        top_p=0.95
    )
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    smiles = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Scaffold: {prompt}")
    print(f"Generated: {smiles}\n")
Property Prediction
property_prompts = [
    "What is the predicted aqueous solubility (logS) of CCCCCCCO?",
    "What is the predicted aqueous solubility (logS) of Cc1ccc(O)cc1?"
]

for prompt in property_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    value = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Molecule: {prompt.split('of ')[-1].replace('?', '')}")
    print(f"Predicted logS: {value}\n")
Technical Specifications
Model Architecture
- Base Model: Google's TXGemma-2B-predict
- Parameters: ~2.6B (2,635,108,608 total)
- Fine-tuning Method: QLoRA (4-bit NF4 + bf16 compute), merged to a full FP16 model for release
- LoRA Config: r=8, alpha=16, dropout=0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Training Precision: 4-bit NF4 with bf16 compute dtype
- Released Precision: FP16 (merged model)
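For reference, the quantization and LoRA settings listed above correspond roughly to the following PEFT/bitsandbytes configuration. This is an illustrative sketch only; the actual training script is not included in this card.

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Quantization consistent with "4-bit NF4 + bf16 compute" above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA settings consistent with r=8, alpha=16, dropout=0.05 and the listed target modules.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)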
Training Details
- Optimizer: Paged AdamW 8-bit
- Learning Rate: 1e-4 with cosine scheduler
- Effective Batch Size: 4
- Warmup Steps: 20
- Max Sequence Length: 256-512 tokens (recommend 256 for short prompts)
- Training Framework: HuggingFace Transformers + PEFT + TRL
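Similarly, the optimizer and schedule above can be expressed as Transformers TrainingArguments. The values below mirror the listed settings; the output path and the split of the effective batch size into per-device size and gradient accumulation are assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ibnsinna-2b-pharma-sft",  # hypothetical output path
    optim="paged_adamw_8bit",             # Paged AdamW 8-bit
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    per_device_train_batch_size=1,        # assumption: 1 x 4 accumulation = effective batch size 4
    gradient_accumulation_steps=4,
    bf16=True,
)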
Dataset Information
The model was trained on a custom-built dataset aggregated from MoleculeNet.
- Source: MoleculeNet (Tox21, ClinTox, BBBP, HIV, BACE, ESOL, FreeSolv)
- Training Samples: 33,179
- Training Task Distribution (approximate):
- ~53% Classification
- ~44% Generation
- ~3% Regression
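The aggregated dataset is published separately (see Links below). Assuming it follows the standard Hugging Face datasets layout, it can presumably be loaded as shown here; split names and column fields are not documented in this card, so inspect the repository before use.

from datasets import load_dataset

ds = load_dataset("OussamaEL/drug-discovery-dataset")  # dataset repo from the Links section
print(ds)  # inspect available splits and columns before training or evaluation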
Limitations and Ethical Considerations
Intended Use
- ✅ Research and Development: Drug discovery research, lead optimization
- ✅ Educational Purposes: Teaching molecular modeling concepts
- ✅ Screening Tools: Initial compound screening and prioritization
- ❌ NOT for Clinical Decisions: Do not use for patient treatment decisions
- ❌ NOT for Final Validation: Predictions always require wet-lab validation
Known Limitations
- Domain Specificity: Trained on specific molecular tasks, may not generalize to all chemistry
- SMILES Representation: Limited to SMILES notation, doesn't handle 3D structures
- Dataset Biases: Inherits biases from training data sources
- Validation Required: All predictions require experimental validation
Ethical Guidelines
- Always validate predictions experimentally
- Consider potential biases in drug discovery pipelines
- Ensure equitable application across different populations
- Respect intellectual property in molecular design
Citation
If you use this model in your research, please cite:
@misc{ibnsinna_2b_pharma_2025,
  title={IbnSinna-2B-Pharma: Drug Discovery Language Model},
  author={Oussama El Allam},
  year={2025},
  month={August},
  url={https://huggingface.co/OussamaEL/IbnSinna-2B-Pharma},
  note={Fine-tuned TXGemma-2B for pharmaceutical and drug discovery applications},
  publisher={Hugging Face}
}
Acknowledgments
- Google DeepMind for the TXGemma base model
- The drug discovery and cheminformatics community
- Contributors to the training datasets
Links
- Model: IbnSinna-2B-Pharma
- Base Model: TXGemma-2B-predict
- Dataset: OussamaEL/drug-discovery-dataset
Made with 🧬 for the Drug Discovery Community
Advanced Pharmaceutical AI for Next-Generation Drug Development
Evaluation results (self-reported, pending):
- Classification Accuracy (BACE, toxicity, HIV inhibition)
- F1 Score
- RMSE (logS, logP, IC50 predictions)
- MAE
- SMILES Validity Rate
- Novel Molecule Generation Rate