
AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks

AbBFN2 allows for flexible task adaptation by virtue of its ability to condition the generative process on an arbitrary subset of variables. Further, since AbBFN2 is based on the Bayesian Flow Network paradigm, it can jointly model both discrete and continuous variables. Using this architecture, we provide a rich syntax which can be used to interact with the model. Regardless of conditioning information, the model generates all 45 "data modes" at inference time and arbitrary conditioning can be used to define specific tasks.

License Summary

  1. The Licensed Models are only available under this License for Non-Commercial Purposes.
  2. You are permitted to reproduce, publish, share and adapt the Output generated by the Licensed Model only for Non-Commercial Purposes and in accordance with this License.
  3. You may not use the Licensed Models or any of its Outputs in connection with:
    1. any Commercial Purposes, unless agreed by Us under a separate licence;
    2. to train, improve or otherwise influence the functionality or performance of any other third-party derivative model that is commercial or intended for a Commercial Purpose and is similar to the Licensed Models;
    3. to create models distilled or derived from the Outputs of the Licensed Models, unless such models are for Non-Commercial Purposes and open-sourced under the same license as the Licensed Models; or
    4. in violation of any applicable laws and regulations.

Getting Started

You can interact with AbBFN2 in several ways; the instructions below pertain to the open-source repository.

Prerequisites

  • Docker installed on your system
  • Sufficient computational resources (TPU/GPU recommended)
  • Basic understanding of antibody structure and sequence notation

Installation

Hardware Configuration

First, configure your accelerator in the Makefile:

ACCELERATOR = GPU  # Options: CPU, TPU, or GPU

Note: Multi-host inference is not supported in this release. Please use single-host settings only.

Building the Docker Image

Run the following command to build the AbBFN2 Docker image:

make build

This process typically takes 5-20 minutes depending on your hardware.

For Apple Silicon users

Instead of using Docker, build the conda environment directly:

conda env create -f environment.yaml
conda activate abbfn2

Usage

AbBFN2 supports three main generation modes, each with its own configuration file in the experiments/configs/ directory.

In addition to the mode-specific settings, configuration files contain options for loading model weights. By default (load_from_hf: true), weights are downloaded from Hugging Face. Optionally, if you have the weights locally, set load_from_hf: false and provide the path in model_weights_path (e.g., /app/params.pkl).
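
For example, the sketch below shows the two weight-loading options set for local weights. This is illustrative only: the exact position of these keys follows the shipped configuration files, and /app/params.pkl is simply the example path mentioned above.

load_from_hf: false                   # Do not download weights from Hugging Face
model_weights_path: /app/params.pkl   # Path to locally stored weights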

1. Unconditional Generation

Generate novel antibody sequences without any constraints. AbBFN2 will generate natural-like antibody sequences matching its training distribution. Note that the metadata labels are also predictions made by the model. For a discussion of the accuracy of these labels, please refer to the AbBFN2 manuscript.

Configuration (unconditional.yaml):

cfg:
  sampling:
    num_samples_per_batch: 10   # Number of sequences per batch
    num_batches: 1              # Number of batches to generate
  sample_fn:
    num_steps: 300              # Number of sampling steps (recommended: 300-1000)

Run:

make unconditional # or python experiments/unconditional.py for Apple Silicon users.
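
The total number of generated sequences is num_samples_per_batch multiplied by num_batches. As a purely illustrative variant of unconditional.yaml (the values are our own choices within the documented ranges), generating 100 sequences with a higher step count would look like:

cfg:
  sampling:
    num_samples_per_batch: 25   # 25 sequences per batch
    num_batches: 4              # 4 batches, i.e. 100 sequences in total
  sample_fn:
    num_steps: 500              # Within the recommended 300-1000 range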

2. Conditional Generation/Inpainting

Generate antibody sequences conditioned on specific attributes. Conditional generation highlights the flexibility of AbBFN2 and allows it to adapt to different tasks depending on the exact conditioning data. While any arbitrary combination is possible, conditional generation is primarily used when conditioning on full sequences (referred to as sequence labelling in the manuscript), partial sequences (sequence inpainting), partial sequences and metadata (sequence design), or metadata only (conditional de novo generation). For categorical variables, the set of possible values is found in src/abbfn2/data_mode_handler/oas_paired/constants.py. For genes and CDR lengths, only values that appear at least 100 times in the training data are valid. When conditioning on species, human, mouse, or rat can be chosen.

Disclaimer: As discussed in the manuscript, the flexibility of AbBFN2 requires careful consideration of the exact combination of conditioning information for effective generation. For instance, conditioning on a kappa light chain locus V-gene together with a lambda locus J-gene family is unlikely to yield samples of high quality. Such paradoxical combinations can also exist in more subtle ways. Given the size of the space of possible conditioning information, we have only tested a small subset of such combinations.

Configuration (inpaint.yaml):

cfg:
  input:
    num_input_samples: 2        # Number of input samples
    dm_overwrites:              # Specify values of the data modes
      h_cdr1_seq: GYTFTSHA
      h_cdr2_seq: ISPYRGDT
      h_cdr3_seq: ARDAGVPLDY
  sampling:
    inpaint_fn:
      num_steps: 300       # Number of sampling steps (recommended: 300-1000)
    mask_fn:
      data_modes:               # Specify which data modes to condition on
        - "h_cdr1_seq"
        - "h_cdr2_seq"
        - "h_cdr3_seq"

Run:

make inpaint # or python experiments/inpaint.py for Apple Silicon users.
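
As an illustration of a different conditioning combination, the sketch below adapts inpaint.yaml for conditional de novo generation, i.e. conditioning on metadata only. The values shown are illustrative assumptions; valid categorical values and CDR lengths are those listed in src/abbfn2/data_mode_handler/oas_paired/constants.py.

cfg:
  input:
    num_input_samples: 2        # Number of input samples
    dm_overwrites:              # Metadata to condition on
      species: human            # One of human, mouse, rat
      light_locus: K            # Kappa light-chain locus
      h3_length: 12             # Heavy-chain CDR3 length (must be a value present in the training data)
  sampling:
    inpaint_fn:
      num_steps: 300            # Number of sampling steps (recommended: 300-1000)
    mask_fn:
      data_modes:               # Condition on metadata only
        - "species"
        - "light_locus"
        - "h3_length"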

3. Sequence Humanization

Convert non-human antibody sequences into humanized versions. This workflow runs a sequence humanization experiment given a paired, non-human starting sequence. AbBFN2 is used to introduce mutations into the framework regions of the starting antibody, possibly over several recycling iterations. During humanization, appropriate human V-gene families to target are also chosen automatically, though they can be set manually by the user.

Briefly, the humanization workflow uses the conditional generation capabilities of AbBFN2 in a sample-recycling approach. At each iteration, further mutations are introduced, starting from a more aggressive initial strategy that is likely to introduce a larger number of mutations; as the sequence becomes more human under the model, fewer mutations are introduced at subsequent steps. Note that in most cases we have found humanization to be achieved within a single recycling iteration. If the model introduces a change to the CDR loops, which can happen in rare cases, these changes are reverted. For a detailed description of the humanization workflow, please refer to the AbBFN2 manuscript.

Also note that while we provide the option to manually select V-gene families, the workflow allows the model to switch to more appropriate V-gene families during inference, so the final V-gene families may differ from those initially selected. Finally, because of the data AbBFN2 is trained on, humanization will be most reliable for murine or rat sequences; sequences from other species have not been tested.

Configuration (humanization.yaml):

cfg:
  input:
    l_seq: "DIVLTQSPASLAVSLGQRATISCKASQSVDYDGHSYMNWYQQKPGQPPKLLIYAASNLESGIPARFSGSGSGTDFTLNIHPVEEEDAATYYCQQSDENPLTFGTGTKLELK"
    h_seq: "QVQLQQSGPELVKPGALVKISCKASGYTFTSYDINWVKQRPGQGLEWIGWIYPGDGSIKYNEKFKGKATLTVDKSSSTAYMQVSSLTSENSAVYFCARRGEYGNYEGAMDYWGQGTTVTVSS"
    # h_vfams: null # Optionally, set target v-gene families
    # l_vfams: null
  sampling:
    recycling_steps: 10         # Number of recycling steps (recommended: 5-12)
    inpaint_fn:
      num_steps: 500            # Number of sampling steps (recommended: 300-1000)

Run:

make humanization # or python experiments/humanization.py for Apple Silicon users.
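
To fix the target V-gene families manually, uncomment h_vfams and l_vfams in humanization.yaml. The family identifiers below are hypothetical placeholders (the accepted values and their exact format follow src/abbfn2/data_mode_handler/oas_paired/constants.py), and, as noted above, the workflow may still settle on different families during recycling.

cfg:
  input:
    l_seq: "DIVLTQ..."          # Starting light chain (truncated here for brevity)
    h_seq: "QVQLQQ..."          # Starting heavy chain (truncated here for brevity)
    h_vfams: "IGHV1"            # Hypothetical target heavy-chain V-gene family
    l_vfams: "IGKV1"            # Hypothetical target light-chain V-gene family
  sampling:
    recycling_steps: 10         # Number of recycling steps (recommended: 5-12)
    inpaint_fn:
      num_steps: 500            # Number of sampling steps (recommended: 300-1000)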

Data Modes

The data modes supported by AbBFN2 are detailed below.

Heavy-Chain IMGT Regions

| Field | Type | Region (IMGT) | Description | Length Range (AA) |
|---|---|---|---|---|
| h_fwr1_seq | string | FWR1 | Framework region 1 | 18 – 41 |
| h_fwr2_seq | string | FWR2 | Framework region 2 | 6 – 30 |
| h_fwr3_seq | string | FWR3 | Framework region 3 | 29 – 58 |
| h_fwr4_seq | string | FWR4 | Framework region 4 | 3 – 12 |
| h_cdr1_seq | string | CDR1 | Complementarity-determining region 1 | 1 – 22 |
| h_cdr2_seq | string | CDR2 | Complementarity-determining region 2 | 1 – 25 |
| h_cdr3_seq | string | CDR3 | Complementarity-determining region 3 | 2 – 58 |
Light-Chain IMGT Regions

| Field | Type | Region (IMGT) | Description | Length Range (AA) |
|---|---|---|---|---|
| l_fwr1_seq | string | FWR1 | Framework region 1 | 18 – 36 |
| l_fwr2_seq | string | FWR2 | Framework region 2 | 11 – 27 |
| l_fwr3_seq | string | FWR3 | Framework region 3 | 25 – 48 |
| l_fwr4_seq | string | FWR4 | Framework region 4 | 3 – 13 |
| l_cdr1_seq | string | CDR1 | Complementarity-determining region 1 | 1 – 20 |
| l_cdr2_seq | string | CDR2 | Complementarity-determining region 2 | 1 – 16 |
| l_cdr3_seq | string | CDR3 | Complementarity-determining region 3 | 1 – 27 |
CDR Length Metrics

Possible values provided in src/abbfn2/data_mode_handler/oas_paired/constants.py.

| Field | Type | Description |
|---|---|---|
| h1_length | int | CDR1 length (heavy chain) |
| h2_length | int | CDR2 length (heavy chain) |
| h3_length | int | CDR3 length (heavy chain) |
| l1_length | int | CDR1 length (light chain) |
| l2_length | int | CDR2 length (light chain) |
| l3_length | int | CDR3 length (light chain) |
Gene and Family Annotations

Possible values provided in src/abbfn2/data_mode_handler/oas_paired/constants.py.

| Field | Type | Description |
|---|---|---|
| hv_gene | string | V gene segment (heavy) |
| hd_gene | string | D gene segment (heavy) |
| hj_gene | string | J gene segment (heavy) |
| lv_gene | string | V gene segment (light) |
| lj_gene | string | J gene segment (light) |
| hv_family | string | V gene family (heavy) |
| hd_family | string | D gene family (heavy) |
| hj_family | string | J gene family (heavy) |
| lv_family | string | V gene family (light) |
| lj_family | string | J gene family (light) |
| species | string | One of "human", "rat", "mouse" |
| light_locus | string | One of "K" (kappa) or "L" (lambda) |
TAP Physicochemical Metrics

| Field | Type | Description | Range |
|---|---|---|---|
| tap_psh | float | Patches of surface hydrophobicity | 72.0 – 300.0 |
| tap_pnc | float | Patches of negative charge | 0.0 – 10.0 |
| tap_ppc | float | Patches of positive charge | 0.0 – 7.5 |
| tap_sfvcsp | float | Structural Fv charge symmetry parameter | -55.0 – 55.0 |
| tap_psh_flag | string | Surface hydrophobicity flag | "red" / "amber" / "green" |
| tap_pnc_flag | string | Negative charge flag | "red" / "amber" / "green" |
| tap_ppc_flag | string | Positive charge flag | "red" / "amber" / "green" |
| tap_sfvcsp_flag | string | Charge symmetry flag | "red" / "amber" / "green" |
V-, D- and J-Gene Identity Scores

| Field | Type | Description | Range (%) |
|---|---|---|---|
| h_v_identity | float | Heavy-chain V segment identity | 64.0 – 100.0 |
| h_d_identity | float | Heavy-chain D segment identity | 74.0 – 100.0 |
| h_j_identity | float | Heavy-chain J segment identity | 74.0 – 100.0 |
| l_v_identity | float | Light-chain V segment identity | 66.0 – 100.0 |
| l_j_identity | float | Light-chain J segment identity | 77.0 – 100.0 |
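
These data modes can be combined arbitrarily in dm_overwrites and mask_fn.data_modes, subject to the compatibility caveats in the disclaimer above. As a final illustrative sketch (the values are our own assumptions), a sequence-design style configuration conditioning on a heavy-chain CDR3 together with metadata might look like:

cfg:
  input:
    num_input_samples: 2
    dm_overwrites:
      h_cdr3_seq: ARDAGVPLDY    # Heavy-chain CDR3 from the inpainting example above
      species: human            # Target species label
      tap_psh_flag: green       # Request a "green" surface-hydrophobicity flag
  sampling:
    inpaint_fn:
      num_steps: 300
    mask_fn:
      data_modes:
        - "h_cdr3_seq"
        - "species"
        - "tap_psh_flag"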

Citation

If you use AbBFN2 in your research, please cite our work:

@article{Guloglu_etal_AbBFN2,
  title={AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks},
  author={Bora Guloglu and Miguel Bragan\c{c}a and Alex Graves and Scott Cameron and Timothy Atkinson and Liviu Copoiu and Alexandre Laterre and Thomas D Barrett},
  journal={bioRxiv},
  year={2025},
  url={https://www.biorxiv.org/content/10.1101/2025.04.29.651170v1}
}

Acknowledgements

The development of this library was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
