---
library_name: transformers
license: mit
datasets:
- chandar-lab/UR100P
language:
- en
tags:
- biology
---

> [!NOTE]
> This model has been optimized using NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
> library. Slight numerical differences may be observed between the original model and the optimized
> model. For instructions on how to install TransformerEngine, please refer to the
> [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).

# AMPLIFY (TransformerEngine-Optimized) Overview

## Description:

AMPLIFY is an efficient, state-of-the-art protein language model (pLM). AMPLIFY can generate residue and protein
embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available
in two sizes, 120M and 350M parameters.

This version of the AMPLIFY model is optimized with NVIDIA's
[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from
Chandar Research Lab (CRL) and, within numerical precision, has identical weights and outputs.

This model is ready for commercial and non-commercial use.
## Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for
this application and use case; see the non-NVIDIA [AMPLIFY Model
Card](https://huggingface.co/chandar-lab/AMPLIFY_350M).

### License/Terms of Use:

AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).

### Deployment Geography:

Global

### Use Case:

Protein design, mutation prediction, and function analysis.

### Release Date:

Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M)

## References:

- [Protein Language Models: Is Scaling
  Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - detailed
  information on the model architecture and training data.

## Model Architecture:

**Architecture Type:** Transformer <br>
**Network Architecture:** ESM-2 <br>
**This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_350M) <br>
**Number of model parameters:** 3.5 x 10^8

## Input:

**Input Type:** Text (Protein Sequences) <br>
**Input Format:** String <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum
context length is 2048 residues.
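
As a rough illustration of this input contract, a tokenizer maps each canonical amino acid to an integer id and
enforces the 2048-residue limit. The vocabulary and special-token ids below are hypothetical placeholders; in practice,
use the tokenizer loaded with `AutoTokenizer.from_pretrained`, whose vocabulary may differ.

```python
# Minimal sketch of turning a protein sequence into token ids.
# The vocabulary and special-token ids here are hypothetical; the real
# AMPLIFY tokenizer is loaded via AutoTokenizer.from_pretrained.

CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i + 4 for i, aa in enumerate(CANONICAL_AA)}  # ids 0-3 reserved for specials
MAX_LEN = 2048  # maximum context length in residues

def encode(sequence: str, bos: int = 0, eos: int = 1) -> list:
    """Map a string of canonical amino acids to token ids, truncating to MAX_LEN."""
    ids = [VOCAB[aa] for aa in sequence[:MAX_LEN]]
    return [bos] + ids + [eos]

tokens = encode("MKTAYIAKQR")
print(tokens)
```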

## Output:

**Output Type:** Embeddings (Amino acid and sequence-level) <br>
**Output Format:** Numeric vector <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each
amino acid in the input protein sequence.
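
A sequence-level embedding is commonly obtained by pooling the per-residue embeddings, for example by averaging over
residue positions. A minimal NumPy sketch (the array shapes below are illustrative placeholders, not the model's actual
hidden size):

```python
import numpy as np

# Hypothetical per-residue embeddings for a 10-residue protein with an
# illustrative hidden size of 960; real dimensions depend on the model variant.
rng = np.random.default_rng(0)
residue_embeddings = rng.normal(size=(10, 960))

# Sequence-level embedding: average over the residue axis.
sequence_embedding = residue_embeddings.mean(axis=0)

print(sequence_embedding.shape)  # (960,)
```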

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware
(e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times
compared to CPU-only solutions.

## Software Integration:

**Runtime Engines:**

- Hugging Face Transformers

**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper

**Preferred Operating System(s):**

- Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific
data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at
both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure
compliance with safety and ethical standards before deployment.

## Model and Checkpoint Versions:

- [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M)
- [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M)

**Get Started**

```python
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load AMPLIFY and its tokenizer
model = AutoModel.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_350M", trust_remote_code=True)

# Move the model to the GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    output = model(input_ids)
    print("Output: ", output)

    break
```
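
One common way to use a masked protein language model for mutation suggestion is to mask a position and rank amino
acids by the model's probabilities at that position. The sketch below uses random placeholder logits standing in for
the model's output at a masked position (the vocabulary size is hypothetical); the softmax-and-rank logic is the
illustrative part.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array of logits."""
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Placeholder logits over a hypothetical 24-token vocabulary at one masked
# position; in practice these would come from the model's output logits.
rng = np.random.default_rng(42)
logits = rng.normal(size=24)
probs = softmax(logits)

# Rank token ids by probability; in practice, map ids back to amino acids
# with the tokenizer.
top5 = np.argsort(probs)[::-1][:5]
print(top5, probs[top5])
```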

## Training and Evaluation Datasets:

### Training Datasets:

**Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase
and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using
the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the
longest sequence is not always the most informative. There is often more biologically relevant information and
annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are
ranked to facilitate the selection of a biologically relevant representative for the cluster.
**Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Human

**Labeling Method:**

- Human

**Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for
use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These
repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.
**Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)

**Data Modality:**

- Text (Protein Sequences)

**Text Training Data Size:**

- 1 Billion to 10 Trillion Tokens

**Data Collection Method:**

- Hybrid: Human, Automated

**Labeling Method:**

- Hybrid: Human, Automated

**Properties:** The main levels of classification in SCOP are:

- Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and
  alpha+beta.
- Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same
  topological connections.
- Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and
  functional features, even if sequence similarity is low.
- Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through
  sequence comparison methods.
- Protein: Groups similar sequences with the same function.
- Species: Represents a distinct protein sequence.
### Evaluation Datasets:

**Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)

**Benchmark Score:** LR P@L of 20.9±15.7

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by
the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction
servers, which then return their predictions.
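
The LR P@L scores reported here measure long-range precision at L: of the L highest-scoring predicted long-range
contacts (where L is the sequence length), the fraction that are true contacts. A minimal NumPy sketch of the metric on
a synthetic contact map (the arrays are toy data, and details such as the separation threshold and symmetrization vary
between evaluations):

```python
import numpy as np

def precision_at_L(scores, contacts, min_sep=24):
    """Precision of the top-L long-range contact predictions.

    scores:   (L, L) predicted contact scores
    contacts: (L, L) boolean true contact map
    min_sep:  minimum sequence separation |i - j| for 'long-range'
    """
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)        # long-range residue pairs only
    order = np.argsort(scores[i, j])[::-1][:L]  # top-L pairs by predicted score
    return contacts[i[order], j[order]].mean()

# Toy example: near-perfect scores on a synthetic contact map give precision 1.0.
L = 64
rng = np.random.default_rng(0)
contacts = np.zeros((L, L), bool)
i, j = np.triu_indices(L, k=24)
true_idx = rng.choice(len(i), size=L, replace=False)
contacts[i[true_idx], j[true_idx]] = True
scores = contacts.astype(float) + 0.01 * rng.random((L, L))
print(precision_at_L(scores, contacts))  # → 1.0
```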

**Link:** [CASP14 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)

**Benchmark Score:** LR P@L of 16.6±13.6

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.

**Link:** [CASP15 (Critical Assessment of Methods of Protein Structure
Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)

**Benchmark Score:** LR P@L of 20.0±14.6

**Data Collection Method:**

- Human

**Labeling Method:**

- N/A

**Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental
structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
participating research groups and servers, who must submit their predicted structures within a specific time frame.

## Inference:

**Acceleration Engine:**

- Hugging Face Transformers

**Test Hardware:**

- A100
- H100
- H200
- GB200

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable
development for a wide array of AI applications. When downloaded or used in accordance with our terms of service,
developers should work with their internal model team to ensure this model meets requirements for the relevant industry
and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and
comply with applicable safety regulations and ethical standards.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns
[here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).