This model has been optimized using NVIDIA's TransformerEngine library. Slight numerical differences may be observed between the original model and the optimized model. For instructions on how to install TransformerEngine, please refer to the official documentation.

The original xformers-based models are available at chandar-lab/AMPLIFY.

AMPLIFY

AMPLIFY is an efficient, state-of-the-art protein language model pre-trained using masked language modeling on UniRef100, OAS, and SCOP (UR100P). It can generate residue and protein embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences, among other tasks. AMPLIFY is available in two sizes, 120M and 350M parameters. The _base variants are the Stage 1 models, which were not extended beyond 512 residues. The model architecture and pre-training procedure are detailed below. For more details, please refer to the accompanying paper.

Model Description

| Hyperparameter | AMPLIFY 120M | AMPLIFY 350M |
|---|---|---|
| hidden-size | 640 | 960 |
| num-hidden-layers | 24 | 32 |
| num-attention-heads | 10 | 15 |
| intermediate-size | 2560 | 3840 |
| max-position-embeddings | 2048 | 2048 |
| vocab-size | 27 | 27 |
| rope-theta | 10000 | 10000 |
| dropout-prob | 0 | 0 |
| embedding-init-range | 0.02 | 0.02 |
| norm-eps | 1.0e-05 | 1.0e-05 |
| hidden-act | swiglu | swiglu |
| pre-activation-layer-norm | true | true |
| layer-norm-after-embedding | false | false |
| layer-norm-before-last-layer | true | true |
| rms-norm | true | true |
| ffn-bias | false | false |
| attn-bias | false | false |
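
As a quick sanity check on the hyperparameters above, the following sketch derives the per-head dimension and the rotary position embedding (RoPE) frequencies for both sizes. It assumes the conventional definitions (hidden size split evenly across heads, and `theta ** (-2i / head_dim)` for the RoPE frequencies); the class `AmplifyConfigSketch` is a hypothetical container written for this example, not part of the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AmplifyConfigSketch:
    """Hypothetical container for a few hyperparameters from the table above."""

    hidden_size: int
    num_hidden_layers: int
    num_attention_heads: int
    intermediate_size: int
    max_position_embeddings: int = 2048
    vocab_size: int = 27
    rope_theta: float = 10000.0

    @property
    def head_dim(self) -> int:
        # Assumption: standard multi-head attention, hidden size split evenly across heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    def rope_inverse_frequencies(self) -> List[float]:
        # Assumption: conventional RoPE, one frequency per pair of head dimensions.
        d = self.head_dim
        return [self.rope_theta ** (-(2 * i) / d) for i in range(d // 2)]


AMPLIFY_120M = AmplifyConfigSketch(640, 24, 10, 2560)
AMPLIFY_350M = AmplifyConfigSketch(960, 32, 15, 3840)

for name, cfg in [("120M", AMPLIFY_120M), ("350M", AMPLIFY_350M)]:
    freqs = cfg.rope_inverse_frequencies()
    print(name, "head_dim =", cfg.head_dim, "rope frequencies =", len(freqs))
```

Under these assumptions both sizes use the same per-head dimension, so the larger model gains capacity from more heads and more layers rather than wider heads.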

Training Description

| Hyperparameter | Stage 1 | Stage 2 |
|---|---|---|
| dataset | UR100P | UR100P |
| max-steps | 1000000 | 25000 (120M) or 50000 (350M) |
| max-length | 512 | 2048 |
| optimizer | adamw | adamw |
| lr | 0.001 | 0.0001 |
| betas | (0.9, 0.95) | (0.9, 0.95) |
| eps | 1.0e-08 | 1.0e-08 |
| weight-decay | 0.01 | 0.01 |
| scheduler | cosinedecay | none |
| warmup-steps | 1,000 | none |
| final-step | 900,000 | none |
| gradient-clipping | 1.0 | 1.0 |
| tf32 | true | true |
| mixed-precision | bf16 | bf16 |
| padding | max-length | max-length |
| random-truncate | true | true |
| mask-probability | 0.15 | 0.15 |
| total-batch-size | 4096 | 4096 |
| deepspeed | true | true |
| zero-stage | 3 | 3 |
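
The Stage 1 learning-rate settings above (peak `lr` of 0.001, 1,000 warmup steps, cosine decay, final step 900,000) can be illustrated with a small schedule function. This is a minimal sketch under stated assumptions, not the authors' training code: it assumes the warmup is linear, that `final-step` is the step at which the cosine decay ends, and that the rate is held constant afterwards. The floor `min_lr` is a hypothetical parameter, since the table does not state the value reached at the end of the decay.

```python
import math


def lr_at_step(
    step: int,
    peak_lr: float = 1e-3,      # "lr" for Stage 1
    warmup_steps: int = 1_000,  # "warmup-steps" for Stage 1
    final_step: int = 900_000,  # "final-step" for Stage 1
    min_lr: float = 0.0,        # hypothetical floor, not given in the table
) -> float:
    """Linear warmup followed by cosine decay (illustrative only)."""
    if step < warmup_steps:
        # Assumption: the rate grows linearly from 0 to the peak during warmup.
        return peak_lr * step / warmup_steps
    if step >= final_step:
        # Assumption: the rate stays at the floor once the decay has finished.
        return min_lr
    progress = (step - warmup_steps) / (final_step - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 500, 1_000, 450_500, 900_000, 1_000_000):
    print(f"step {s:>9,}: lr = {lr_at_step(s):.6f}")
```

Stage 2 lists no scheduler and no warmup, so under the same reading it would simply use a constant learning rate of 0.0001.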

Get Started

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Load AMPLIFY and its tokenizer
model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

# Move the model to the GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set (the "test" split of UR100P)
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein name and sequence
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move the input to the GPU and run a forward pass without tracking gradients
    input_ids = input_ids.to("cuda")
    with torch.no_grad():
        output = model(input_ids)
    print("Output: ", output)

    # Stop after the first sample
    break
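
AMPLIFY is described above as producing both residue and protein embeddings. One common way to turn per-residue vectors into a single per-protein vector is mean pooling over the non-padded positions. The sketch below shows that step on a toy NumPy array. It is an illustration rather than the authors' method: how the per-residue hidden states are exposed by the model output is not stated in this card, so `mean_pool`, `toy_embeddings`, and `toy_mask` are hypothetical names operating on stand-in data.

```python
import numpy as np


def mean_pool(residue_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average per-residue vectors into one vector per sequence.

    residue_embeddings: array of shape (batch, length, hidden)
    attention_mask:     array of shape (batch, length), 1 = keep, 0 = ignore
    """
    mask = attention_mask[..., None].astype(residue_embeddings.dtype)
    summed = (residue_embeddings * mask).sum(axis=1)
    # Clip the counts so a fully masked sequence does not divide by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy stand-in for model outputs: 2 sequences, 4 positions, hidden size 3.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(2, 4, 3))
# The second sequence has two padded positions that must not affect its average.
toy_mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

protein_embeddings = mean_pool(toy_embeddings, toy_mask)
print(protein_embeddings.shape)
```

Whether special tokens added by the tokenizer should be included in the average is a design choice; excluding them only requires setting their mask entries to 0.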

Citations

If you find the models useful in your research, we ask that you cite the paper:

@article{Fournier2024.09.23.614603,
    title        = {Protein Language Models: Is Scaling Necessary?},
    author       = {Fournier, Quentin and Vernon, Robert M. and van der Sloot, Almer and Schulz, Benjamin and Chandar, Sarath and Langmead, Christopher James},
    year         = {2024},
    journal      = {bioRxiv},
    publisher    = {Cold Spring Harbor Laboratory},
    doi          = {10.1101/2024.09.23.614603},
    url          = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603},
    elocation-id = {2024.09.23.614603},
    eprint       = {https://www.biorxiv.org/content/early/2024/09/23/2024.09.23.614603.full.pdf}
}