
πŸ§ͺ IbnSinna-2B-Pharma: Drug Discovery Language Model

Advanced Pharmaceutical AI for Molecular Discovery

πŸ”¬ Binary Classification β€’ πŸ“Š Regression β€’ 🧬 Conditional Generation



πŸ“Œ Note: This model supersedes the earlier IbnSinna-2B-Drug model, which has been deprecated. Please use this version for all applications.


🎨 Model Overview

IbnSinna-2B-Pharma is a specialized language model fine-tuned for three core drug discovery tasks:

  1. Binary Classification: Predicting a wide range of molecular properties with Yes/No answers, including ADMET properties (e.g., clinical toxicity, BBB permeability), bioactivity (e.g., BACE and HIV inhibition), and pathway interactions (e.g., stress response and nuclear receptor binding).
  2. Regression: Predicting key quantitative physicochemical values, specifically aqueous solubility (logS) and hydration free energy.
  3. Conditional Generation: Designing novel molecules by providing a chemical scaffold as a structural constraint.

Built on Google's TXGemma-2B architecture and fine-tuned on comprehensive pharmaceutical datasets, this model serves as a powerful tool for computational drug discovery and molecular property prediction.

Important: This model was trained as a plain text generator (not chat-based). Prompts should end with \nAnswer: and the model will complete the answer directly.

πŸš€ Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "OussamaEL/IbnSinna-2B-Pharma",
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained("OussamaEL/IbnSinna-2B-Pharma")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

def ask(prompt, max_new_tokens=32):
    text = prompt + "\nAnswer:"
    x = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    x = {k: v.to(model.device) for k, v in x.items()}
    with torch.no_grad():
        # Disable torch._dynamo for this generation step
        with torch._dynamo.config.patch(disable=True):
            y = model.generate(**x, max_new_tokens=max_new_tokens, do_sample=False,
                               pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
    gen = y[0, x["input_ids"].shape[1]:]
    return tokenizer.decode(gen, skip_special_tokens=True).strip()

# Example: HIV Inhibitor Classification
prompt = "Analyze COc1ccc(N2C(=O)C3c4[nH]c5ccc(C)cc5c4C4CCC(C(C)(C)C)CC4C3C2=O)cc1 and determine if it is an HIV inhibitor."
print(ask(prompt))  # Output: "Yes" or "No"

🎯 Model Capabilities

The model excels at three primary tasks:

1️⃣ Binary Classification

Predicting molecular properties with Yes/No outcomes:

  • Clinical Toxicity: Predicting adverse effects and safety profiles (from ClinTox).
  • BACE Inhibition: Ξ²-secretase 1 inhibitor prediction (from BACE).
  • HIV Inhibition: Antiviral activity prediction (from HIV).
  • Stress Response Pathways: Activation of pathways like ARE, HSE, and p53 (from Tox21).
  • Nuclear Receptor Binding: Ligand prediction for receptors like ER, AR, and AhR (from Tox21).
  • ADMET Properties: Blood-Brain Barrier (BBB) permeability (from BBBP).

2️⃣ Regression

Predicting quantitative molecular properties:

  • Aqueous Solubility (logS): Water solubility prediction (from ESOL).
  • Hydration Free Energy: Calculating the free energy of solvation in water (from FreeSolv).

3️⃣ Conditional Generation

Designing novel molecules based on constraints:

  • Scaffold-based Design: Generating novel molecules that contain a specific user-provided core structure or scaffold.

πŸ’» Detailed Usage Examples

Classification Tasks

classification_prompts = [
    "Analyze Oc1ccc(Cl)c(Cl)c1 and determine if it is a stress response pathway ARE activator.",
    "Analyze COc1ccccc1NC(=O)CC(C)=O and determine if it is a nuclear receptor AhR ligand.",
    "Analyze CC[C@H]1C[C@H]2[C@@H]3CCC4=CC(=O)CC[C@@H]4[C@H]3CC[C@]2(C)[C@H]1O and determine if it is a nuclear receptor AR ligand."
]

for prompt in classification_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(gen, skip_special_tokens=True).strip().split()[0]  # Get first word (Yes/No)
    print(f"Q: {prompt[:50]}...")
    print(f"A: {answer}\n")

Molecule Generation

generation_prompts = [
    "Design a molecule containing the core structure O=C1Nc2ccccc2Oc2ccccc21.",
    "Design a molecule containing the core structure c1ccc2c(c1)Cc1ccccc1N2.",
    "Generate a molecule based on the scaffold c1ccccc1."
]

for prompt in generation_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id
        )
    
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    smiles = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Scaffold: {prompt}")
    print(f"Generated: {smiles}\n")

Property Prediction

property_prompts = [
    "What is the predicted aqueous solubility (logS) of CCCCCCCO?",
    "What is the predicted aqueous solubility (logS) of Cc1ccc(O)cc1?"
]

for prompt in property_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                                 pad_token_id=tokenizer.eos_token_id)
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    value = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Molecule: {prompt.split('of ')[-1].replace('?', '')}")
    print(f"Predicted logS: {value}\n")

πŸ”§ Technical Specifications

Model Architecture

  • Base Model: Google's TXGemma-2B-predict
  • Parameters: ~2.6B (2,635,108,608 total)
  • Fine-tuning Method: QLoRA (4-bit NF4 + bf16 compute) β†’ Full model merge to FP16
  • LoRA Config: r=8, alpha=16, dropout=0.05
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Training Precision: 4-bit NF4 with bf16 compute dtype
  • Released Precision: FP16 (merged model)
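
For reference, the block below is an illustrative reconstruction of these settings with bitsandbytes and PEFT. It mirrors the reported hyperparameters (4-bit NF4, bf16 compute, r=8, alpha=16, dropout=0.05, the listed target modules) but is not the author's actual training script.

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute dtype
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)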

Training Details

  • Optimizer: Paged AdamW 8-bit
  • Learning Rate: 1e-4 with cosine scheduler
  • Effective Batch Size: 4
  • Warmup Steps: 20
  • Max Sequence Length: 256-512 tokens (recommend 256 for short prompts)
  • Training Framework: HuggingFace Transformers + PEFT + TRL
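
The training hyperparameters above map onto HuggingFace TrainingArguments roughly as sketched below; the per-device batch size / gradient accumulation split that yields the effective batch size of 4 is an assumption, as is the output directory name.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ibnsinna-2b-pharma",   # hypothetical output directory
    optim="paged_adamw_8bit",          # Paged AdamW 8-bit
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    per_device_train_batch_size=1,     # assumption: 1 x 4 accumulation steps = effective batch size 4
    gradient_accumulation_steps=4,
    bf16=True,
)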

Dataset Information

The model was trained on a custom-built dataset aggregated from MoleculeNet.

  • Source: MoleculeNet (Tox21, ClinTox, BBBP, HIV, BACE, ESOL, FreeSolv)
  • Training Samples: 33,179
  • Training Task Distribution (approximate):
    • ~53% Classification
    • ~44% Generation
    • ~3% Regression
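
For context, a record from one of these datasets can be turned into the prompt/answer format used throughout this card with a helper like the one below. This is a hypothetical illustration of the classification formatting; the exact preprocessing pipeline used for training is not published here.

def format_classification_example(smiles, label, property_name):
    # Mirrors the prompt style shown in the usage examples above
    prompt = f"Analyze {smiles} and determine if it is {property_name}."
    answer = "Yes" if label == 1 else "No"
    return prompt + "\nAnswer: " + answer

print(format_classification_example(
    "Oc1ccc(Cl)c(Cl)c1", 1, "a stress response pathway ARE activator"))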

⚠️ Limitations and Ethical Considerations

Intended Use

  • βœ… Research and Development: Drug discovery research, lead optimization
  • βœ… Educational Purposes: Teaching molecular modeling concepts
  • βœ… Screening Tools: Initial compound screening and prioritization
  • ❌ NOT for Clinical Decisions: Do not use for patient treatment decisions
  • ❌ NOT for Final Validation: Predictions always require wet-lab validation

Known Limitations

  1. Domain Specificity: Trained on specific molecular tasks; it may not generalize to all areas of chemistry
  2. SMILES Representation: Limited to SMILES notation; does not handle 3D structures
  3. Dataset Biases: Inherits biases from training data sources
  4. Validation Required: All predictions require experimental validation

Ethical Guidelines

  • Always validate predictions experimentally
  • Consider potential biases in drug discovery pipelines
  • Ensure equitable application across different populations
  • Respect intellectual property in molecular design

πŸ“š Citation

If you use this model in your research, please cite:

@misc{ibnsinna_2b_pharma_2025,
  title={IbnSinna-2B-Pharma: Drug Discovery Language Model},
  author={Oussama El Allam},
  year={2025},
  month={August},
  url={https://huggingface.co/OussamaEL/IbnSinna-2B-Pharma},
  note={Fine-tuned TXGemma-2B for pharmaceutical and drug discovery applications},
  publisher={Hugging Face}
}

🀝 Acknowledgments

  • Google DeepMind for the TXGemma base model
  • The drug discovery and cheminformatics community
  • Contributors to the training datasets

πŸ”— Links


Made with 🧬 for the Drug Discovery Community

Advanced Pharmaceutical AI for Next-Generation Drug Development
