
IbnSinna-2B-Pharma: Drug Discovery Language Model
Advanced Pharmaceutical AI for Molecular Discovery
Binary Classification • Regression • Conditional Generation
Note: This model supersedes the earlier IbnSinna-2B-Drug model, which has been deprecated. Please use this version for all applications.
Model Overview
IbnSinna-2B-Pharma is a specialized language model fine-tuned for three core drug discovery tasks:
- Binary Classification: Predicting a wide range of molecular properties with Yes/No answers, including ADMET properties (e.g., clinical toxicity, BBB permeability), bioactivity (e.g., BACE and HIV inhibition), and pathway interactions (e.g., stress response and nuclear receptor binding).
- Regression: Predicting key quantitative physicochemical values, specifically aqueous solubility (logS) and hydration free energy.
- Conditional Generation: Designing novel molecules by providing a chemical scaffold as a structural constraint.
Built on Google's TXGemma-2B architecture and fine-tuned on comprehensive pharmaceutical datasets, this model serves as a powerful tool for computational drug discovery and molecular property prediction.
Important: This model was trained as a plain-text completion model (not chat-based). Prompts should end with "\nAnswer:" and the model will complete the answer directly.
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "OussamaEL/IbnSinna-2B-Pharma",
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained("OussamaEL/IbnSinna-2B-Pharma")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

def ask(prompt, max_new_tokens=32):
    text = prompt + "\nAnswer:"
    x = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    x = {k: v.to(model.device) for k, v in x.items()}
    with torch.no_grad():
        # Disable torch._dynamo for this generation step
        with torch._dynamo.config.patch(disable=True):
            y = model.generate(**x, max_new_tokens=max_new_tokens, do_sample=False,
                               pad_token_id=tokenizer.eos_token_id,
                               eos_token_id=tokenizer.eos_token_id)
    gen = y[0, x["input_ids"].shape[1]:]
    return tokenizer.decode(gen, skip_special_tokens=True).strip()

# Example: HIV Inhibitor Classification
prompt = "Analyze COc1ccc(N2C(=O)C3c4[nH]c5ccc(C)cc5c4C4CCC(C(C)(C)C)CC4C3C2=O)cc1 and determine if it is an HIV inhibitor."
print(ask(prompt))  # Output: "Yes" or "No"
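For screening more than one compound at a time, the same setup can be used with batched generation. The following is a minimal sketch (not part of the original examples), assuming the model, tokenizer, and padding token from the Quick Start above; decoder-only models should be left-padded when batching.

# Batched inference sketch; left padding keeps all completions aligned after the prompt.
tokenizer.padding_side = "left"

prompts = [
    "Analyze Oc1ccc(Cl)c(Cl)c1 and determine if it is a stress response pathway ARE activator.",
    "What is the predicted aqueous solubility (logS) of CCCCCCCO?",
]
batch = tokenizer([p + "\nAnswer:" for p in prompts], return_tensors="pt",
                  padding=True, truncation=True, max_length=512).to(model.device)
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=16, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
for i, p in enumerate(prompts):
    gen = out[i, batch["input_ids"].shape[1]:]
    print(p, "->", tokenizer.decode(gen, skip_special_tokens=True).strip())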
Model Capabilities
The model excels at three primary tasks:
1. Binary Classification
Predicting molecular properties with Yes/No outcomes:
- Clinical Toxicity: Predicting adverse effects and safety profiles (from ClinTox).
- BACE Inhibition: β-secretase 1 inhibitor prediction (from BACE).
- HIV Inhibition: Antiviral activity prediction (from HIV).
- Stress Response Pathways: Activation of pathways like ARE, HSE, and p53 (from Tox21).
- Nuclear Receptor Binding: Ligand prediction for receptors like ER, AR, and AhR (from Tox21).
- ADMET Properties: Blood-Brain Barrier (BBB) permeability (from BBBP).
2. Regression
Predicting quantitative molecular properties:
- Aqueous Solubility (logS): Water solubility prediction (from ESOL).
- Hydration Free Energy: Calculating the free energy of solvation in water (from FreeSolv).
3. Conditional Generation
Designing novel molecules based on constraints:
- Scaffold-based Design: Generating novel molecules that contain a specific user-provided core structure or scaffold.
Detailed Usage Examples
Classification Tasks
classification_prompts = [
    "Analyze Oc1ccc(Cl)c(Cl)c1 and determine if it is a stress response pathway ARE activator.",
    "Analyze COc1ccccc1NC(=O)CC(C)=O and determine if it is a nuclear receptor AhR ligand.",
    "Analyze CC[C@H]1C[C@H]2[C@@H]3CCC4=CC(=O)CC[C@@H]4[C@H]3CC[C@]2(C)[C@H]1O and determine if it is a nuclear receptor AR ligand."
]

for prompt in classification_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(gen, skip_special_tokens=True).strip().split()[0]  # Keep only the first word (Yes/No)
    print(f"Q: {prompt[:50]}...")
    print(f"A: {answer}\n")
Molecule Generation
generation_prompts = [
    "Design a molecule containing the core structure O=C1Nc2ccccc2Oc2ccccc21.",
    "Design a molecule containing the core structure c1ccc2c(c1)Cc1ccccc1N2.",
    "Generate a molecule based on the scaffold c1ccccc1."
]

for prompt in generation_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.7,
        do_sample=True,
        top_p=0.95
    )
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    smiles = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Scaffold: {prompt}")
    print(f"Generated: {smiles}\n")
Property Prediction
property_prompts = [
    "What is the predicted aqueous solubility (logS) of CCCCCCCO?",
    "What is the predicted aqueous solubility (logS) of Cc1ccc(O)cc1?"
]

for prompt in property_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    value = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Molecule: {prompt.split('of ')[-1].replace('?', '')}")
    print(f"Predicted logS: {value}\n")
Technical Specifications
Model Architecture
- Base Model: Google's TXGemma-2B-predict
- Parameters: ~2.6B (2,635,108,608 total)
- Fine-tuning Method: QLoRA (4-bit NF4 + bf16 compute), merged to a full FP16 model for release
- LoRA Config: r=8, alpha=16, dropout=0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Training Precision: 4-bit NF4 with bf16 compute dtype
- Released Precision: FP16 (merged model)
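For reference, the quantization and LoRA settings listed above correspond roughly to the following PEFT/bitsandbytes configuration. This is an illustrative sketch only; the actual training script is not included in this card.

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Quantization consistent with "4-bit NF4 + bf16 compute" above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA settings consistent with r=8, alpha=16, dropout=0.05 and the listed target modules.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)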
Training Details
- Optimizer: Paged AdamW 8-bit
- Learning Rate: 1e-4 with cosine scheduler
- Effective Batch Size: 4
- Warmup Steps: 20
- Max Sequence Length: 256-512 tokens (recommend 256 for short prompts)
- Training Framework: HuggingFace Transformers + PEFT + TRL
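Similarly, the optimizer and schedule above can be expressed as Transformers TrainingArguments. The values below mirror the listed settings; the output path and the split of the effective batch size into per-device size and gradient accumulation are assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ibnsinna-2b-pharma-sft",  # hypothetical output path
    optim="paged_adamw_8bit",             # Paged AdamW 8-bit
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    per_device_train_batch_size=1,        # assumption: 1 x 4 accumulation = effective batch size 4
    gradient_accumulation_steps=4,
    bf16=True,
)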
Dataset Information
The model was trained on a custom-built dataset aggregated from MoleculeNet.
- Source: MoleculeNet (Tox21, ClinTox, BBBP, HIV, BACE, ESOL, FreeSolv)
- Training Samples: 33,179
- Training Task Distribution (approximate):
- ~53% Classification
- ~44% Generation
- ~3% Regression
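The aggregated dataset is published separately (see Links below). Assuming it follows the standard Hugging Face datasets layout, it can presumably be loaded as shown here; split names and column fields are not documented in this card, so inspect the repository before use.

from datasets import load_dataset

ds = load_dataset("OussamaEL/drug-discovery-dataset")  # dataset repo from the Links section
print(ds)  # inspect available splits and columns before training or evaluation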
Limitations and Ethical Considerations
Intended Use
- ✅ Research and Development: Drug discovery research, lead optimization
- ✅ Educational Purposes: Teaching molecular modeling concepts
- ✅ Screening Tools: Initial compound screening and prioritization
- ❌ NOT for Clinical Decisions: Do not use for patient treatment decisions
- ❌ NOT for Final Validation: Predictions always require wet-lab validation
Known Limitations
- Domain Specificity: Trained on specific molecular tasks, may not generalize to all chemistry
- SMILES Representation: Limited to SMILES notation, doesn't handle 3D structures
- Dataset Biases: Inherits biases from training data sources
- Validation Required: All predictions require experimental validation
Ethical Guidelines
- Always validate predictions experimentally
- Consider potential biases in drug discovery pipelines
- Ensure equitable application across different populations
- Respect intellectual property in molecular design
Citation
If you use this model in your research, please cite:
@misc{ibnsinna_2b_pharma_2025,
  title={IbnSinna-2B-Pharma: Drug Discovery Language Model},
  author={Oussama El Allam},
  year={2025},
  month={August},
  url={https://huggingface.co/OussamaEL/IbnSinna-2B-Pharma},
  note={Fine-tuned TXGemma-2B for pharmaceutical and drug discovery applications},
  publisher={Hugging Face}
}
Acknowledgments
- Google DeepMind for the TXGemma base model
- The drug discovery and cheminformatics community
- Contributors to the training datasets
Links
- Model: IbnSinna-2B-Pharma
- Base Model: TXGemma-2B-predict
- Dataset: OussamaEL/drug-discovery-dataset
Made with 🧬 for the Drug Discovery Community
Advanced Pharmaceutical AI for Next-Generation Drug Development
Evaluation results (self-reported, pending):
- Classification Accuracy (BACE, toxicity, HIV inhibition)
- F1 Score
- RMSE (logS, logP, IC50 predictions)
- MAE
- SMILES Validity Rate
- Novel Molecule Generation Rate