LucaVirus Large Model (3.8M steps)

Model Description

LucaVirus Large is a transformer model specialized for analyzing viral genomic and protein sequences. It was trained for 3.8M steps and is the most capable LucaVirus version for modeling both gene and protein sequences in viral contexts. This repository is a clean, Hugging Face-compatible re-implementation of the original code in our GitHub repository (see below).

Model Details

  • Model Type: Transformer-based language model for biological sequences
  • Architecture: Custom LucaGPLM architecture
  • Training Steps: 3.8M
  • Vocabulary Size: 39 tokens (gene + protein alphabet)
  • Hidden Size: 2560 (4x larger than base)
  • Number of Layers: 12
  • Number of Attention Heads: 20
  • Max Sequence Length: 3074
  • Parameters: ~944M (about 3.5 GB of weights stored as F32)
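
The relationship between parameter count and checkpoint size can be checked with a short calculation. The sketch below assumes roughly 944M parameters stored as 32-bit floats, which are the figures reported for the published safetensors checkpoint; the function name is illustrative and not part of the model's code.

```python
# Rough weight-memory estimate for the checkpoint.
# Assumption: ~944M parameters, each stored as a 32-bit float (F32).
PARAM_COUNT = 944_000_000
BYTES_PER_F32 = 4


def weight_memory_gib(param_count: int, bytes_per_param: int = BYTES_PER_F32) -> float:
    """Return the memory needed for the weights alone, in GiB (2**30 bytes)."""
    return param_count * bytes_per_param / 2**30


print(f"F32 weights: {weight_memory_gib(PARAM_COUNT):.2f} GiB")  # prints "F32 weights: 3.52 GiB"
```

This covers the weights only; activations during inference need additional memory that grows with batch size and sequence length.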

Intended Use

This model is designed for:

  • Advanced feature extraction from viral sequences
  • Complex sequence classification tasks
  • Protein function prediction with high accuracy
  • Detailed genomic analysis of viral samples
  • Advanced research in computational biology and virology

Usage

Quick Start with AutoModel and AutoTokenizer

from transformers import AutoModel, AutoTokenizer
import torch

# Load the tokenizer and model; trust_remote_code=True lets transformers load the custom LucaGPLM code
tokenizer = AutoTokenizer.from_pretrained("Yuanfei/lucavirus-large-step3.8M", trust_remote_code=True)
model = AutoModel.from_pretrained("Yuanfei/lucavirus-large-step3.8M", trust_remote_code=True)
model.eval()  # inference mode: disables dropout

# Example usage with a viral DNA sequence
dna_sequence = "ATCGATCGATCGAAATTTCCCGGGAAATTTCCCGGG"
inputs = tokenizer(dna_sequence, seq_type="gene", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():  # gradients are not needed for feature extraction
    outputs = model(**inputs)

# Extract features
features = outputs.last_hidden_state  # Shape: (batch_size, seq_len, hidden_size=2560)
pooled_output = outputs.pooler_output  # Shape: (batch_size, hidden_size=2560)

print(f"Sequence length: {features.shape[1]}")
print(f"Feature dimension: {features.shape[2]}")
print(f"Pooled feature shape: {pooled_output.shape}")
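
Because the quick start passes truncation=True, any input longer than the maximum sequence length is cut off, which matters for viral genomes that run to tens of thousands of bases. One common workaround is to split a long sequence into overlapping windows and embed each window separately. The sketch below is a minimal illustration, not part of the LucaVirus code: split_into_windows and its defaults are hypothetical, and it assumes one token per character plus two special tokens per input, so that 3072 characters fit within the 3074-token limit.

```python
from typing import List

MAX_SEQ_LEN = 3074   # maximum sequence length listed in the model details
SPECIAL_TOKENS = 2   # assumption: one start and one end token per input


def split_into_windows(
    sequence: str,
    window: int = MAX_SEQ_LEN - SPECIAL_TOKENS,
    overlap: int = 256,
) -> List[str]:
    """Split a sequence into overlapping windows of at most `window` characters."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if len(sequence) <= window:
        return [sequence]

    step = window - overlap
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + window])
        # Stop once a window reaches the end of the sequence.
        if start + window >= len(sequence):
            break
    return windows


# Small demonstration with a tiny window so the overlap is visible
print(split_into_windows("ABCDEFGHIJ", window=4, overlap=1))  # ['ABCD', 'DEFG', 'GHIJ']
```

Each window can then be passed to the tokenizer and model as in the quick start. How the per-window features are combined (for example, by averaging the pooled outputs) depends on the downstream task.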

GitHub:

Citation

If you use this model in your research, please cite:

Pan, Y.-F., He, Y., Liu, Y.-Q., Shan, Y.-T., Liu, S.-N., Liu, X., Pan, X., Bai, Y., Xu, Z., Wang, Z., Ye, J., Holmes, E. C., Li, B., Chen, Y.-Q., Li, Z.-R., & Shi, M. (2025). Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus. bioRxiv, 2025.06.14.659722. https://doi.org/10.1101/2025.06.14.659722
