
AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks

AbBFN2 allows for flexible task adaptation by virtue of its ability to condition the generative process on an arbitrary subset of variables. Further, since AbBFN2 is based on the Bayesian Flow Network paradigm, it can jointly model both discrete and continuous variables. Using this architecture, we provide a rich syntax which can be used to interact with the model. Regardless of conditioning information, the model generates all 45 "data modes" at inference time and arbitrary conditioning can be used to define specific tasks.

License Summary

  1. The Licensed Models are only available under this License for Non-Commercial Purposes.
  2. You are permitted to reproduce, publish, share and adapt the Output generated by the Licensed Model only for Non-Commercial Purposes and in accordance with this License.
  3. You may not use the Licensed Models or any of its Outputs in connection with:
    1. any Commercial Purposes, unless agreed by Us under a separate licence;
    2. to train, improve or otherwise influence the functionality or performance of any other third-party derivative model that is commercial or intended for a Commercial Purpose and is similar to the Licensed Models;
    3. to create models distilled or derived from the Outputs of the Licensed Models, unless such models are for Non-Commercial Purposes and open-sourced under the same license as the Licensed Models; or
    4. in violation of any applicable laws and regulations.

Getting Started

You can interact with AbBFN2 in several ways; the instructions below pertain to the open-source repository.

Prerequisites

  • Docker installed on your system
  • Sufficient computational resources (TPU/GPU recommended)
  • Basic understanding of antibody structure and sequence notation

Installation

Hardware Configuration

First, configure your accelerator in the Makefile:

ACCELERATOR = GPU  # Options: CPU, TPU, or GPU

Note: Multi-host inference is not supported in this release. Please use single-host settings only.

Building the Docker Image

Run the following command to build the AbBFN2 Docker image:

make build

This process typically takes 5-20 minutes depending on your hardware.

For Apple Silicon users

Instead of using Docker, build the conda environment directly:

conda env create -f environment.yaml
conda activate abbfn2

Usage

AbBFN2 supports three main generation modes, each with its own configuration file in the experiments/configs/ directory.

In addition to the mode-specific settings, configuration files contain options for loading model weights. By default (load_from_hf: true), weights are downloaded from Hugging Face. Optionally, if you have the weights locally, set load_from_hf: false and provide the path in model_weights_path (e.g., /app/params.pkl).
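
For example, the sketch below shows the two weight-loading options set for local weights. This is illustrative only: the exact position of these keys follows the shipped configuration files, and /app/params.pkl is simply the example path mentioned above.

load_from_hf: false                   # Do not download weights from Hugging Face
model_weights_path: /app/params.pkl   # Path to locally stored weights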

1. Unconditional Generation

Generate novel antibody sequences without any constraints. AbBFN2 will generate natural-like antibody sequences matching its training distribution. Note that the metadata labels are also predictions made by the model. For a discussion of the accuracy of these labels, please refer to the AbBFN2 manuscript.

Configuration (unconditional.yaml):

cfg:
  sampling:
    num_samples_per_batch: 10   # Number of sequences per batch
    num_batches: 1              # Number of batches to generate
  sample_fn:
    num_steps: 300              # Number of sampling steps (recommended: 300-1000)

Run:

make unconditional # or python experiments/unconditional.py for Apple Silicon users.
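
The total number of generated sequences is num_samples_per_batch multiplied by num_batches. As a purely illustrative variant of unconditional.yaml (the values are our own choices within the documented ranges), generating 100 sequences with a higher step count would look like:

cfg:
  sampling:
    num_samples_per_batch: 25   # 25 sequences per batch
    num_batches: 4              # 4 batches, i.e. 100 sequences in total
  sample_fn:
    num_steps: 500              # Within the recommended 300-1000 range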

2. Conditional Generation/Inpainting

Generate antibody sequences conditioned on specific attributes. Conditional generation highlights the flexibility of AbBFN2 and allows it to adapt to different tasks depending on the exact conditioning data. While any arbitrary combination is possible, conditional generation is primarily used when conditioning on full sequences (referred to as sequence labelling in the manuscript), partial sequences (sequence inpainting), partial sequences and metadata (sequence design), or metadata only (conditional de novo generation). For categorical variables, the set of possible values is found in src/abbfn2/data_mode_handler/oas_paired/constants.py. For genes and CDR lengths, only values that appear at least 100 times in the training data are valid. When conditioning on species, human, mouse, or rat can be chosen.

Disclaimer: As discussed in the manuscript, the flexibility of AbBFN2 requires careful consideration of the exact combination of conditioning information for effective generation. For instance, conditioning on a kappa light chain locus V-gene together with a lambda locus J-gene family is unlikely to yield samples of high quality. Such paradoxical combinations can also exist in more subtle ways. Given the size of the space of possible conditioning information, we have only tested a small subset of such combinations.

Configuration (inpaint.yaml):

cfg:
  input:
    num_input_samples: 2        # Number of input samples
    dm_overwrites:              # Specify values of the data modes
      h_cdr1_seq: GYTFTSHA
      h_cdr2_seq: ISPYRGDT
      h_cdr3_seq: ARDAGVPLDY
  sampling:
    inpaint_fn:
      num_steps: 300       # Number of sampling steps (recommended: 300-1000)
    mask_fn:
      data_modes:               # Specify which data modes to condition on
        - "h_cdr1_seq"
        - "h_cdr2_seq"
        - "h_cdr3_seq"

Run:

make inpaint # or python experiments/inpaint.py for Apple Silicon users.
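
As an illustration of a different conditioning combination, the sketch below adapts inpaint.yaml for conditional de novo generation, i.e. conditioning on metadata only. The values shown are illustrative assumptions; valid categorical values and CDR lengths are those listed in src/abbfn2/data_mode_handler/oas_paired/constants.py.

cfg:
  input:
    num_input_samples: 2        # Number of input samples
    dm_overwrites:              # Metadata to condition on
      species: human            # One of human, mouse, rat
      light_locus: K            # Kappa light-chain locus
      h3_length: 12             # Heavy-chain CDR3 length (must be a value present in the training data)
  sampling:
    inpaint_fn:
      num_steps: 300            # Number of sampling steps (recommended: 300-1000)
    mask_fn:
      data_modes:               # Condition on metadata only
        - "species"
        - "light_locus"
        - "h3_length"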

3. Sequence Humanization

Convert non-human antibody sequences into humanized versions. This workflow runs a sequence humanization experiment given a paired, non-human starting sequence. AbBFN2 is used to introduce mutations into the framework regions of the starting antibody, possibly over several recycling iterations. During humanization, appropriate human V-gene families to target are also chosen automatically, though they can be set manually by the user.

Briefly, the humanization workflow uses the conditional generation capabilities of AbBFN2 in a sample-recycling approach. At each iteration, further mutations are introduced, starting from a more aggressive initial strategy that is likely to introduce a larger number of mutations; as the sequence becomes more human under the model, fewer mutations are introduced at subsequent steps. Note that in most cases we have found humanization to be achieved within a single recycling iteration. If the model introduces a change to the CDR loops, which can happen in rare cases, these changes are reverted. For a detailed description of the humanization workflow, please refer to the AbBFN2 manuscript.

Also note that while we provide the option to manually select V-gene families, the workflow allows the model to switch to more appropriate V-gene families during inference, so the final V-gene families may differ from those initially selected. Finally, because of the data AbBFN2 is trained on, humanization will be most reliable for murine or rat sequences; sequences from other species have not been tested.

Configuration (humanization.yaml):

cfg:
  input:
    l_seq: "DIVLTQSPASLAVSLGQRATISCKASQSVDYDGHSYMNWYQQKPGQPPKLLIYAASNLESGIPARFSGSGSGTDFTLNIHPVEEEDAATYYCQQSDENPLTFGTGTKLELK"
    h_seq: "QVQLQQSGPELVKPGALVKISCKASGYTFTSYDINWVKQRPGQGLEWIGWIYPGDGSIKYNEKFKGKATLTVDKSSSTAYMQVSSLTSENSAVYFCARRGEYGNYEGAMDYWGQGTTVTVSS"
    # h_vfams: null # Optionally, set target v-gene families
    # l_vfams: null
  sampling:
    recycling_steps: 10         # Number of recycling steps (recommended: 5-12)
    inpaint_fn:
      num_steps: 500            # Number of sampling steps (recommended: 300-1000)

Run:

make humanization # or python experiments/humanization.py for Apple Silicon users.
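
To fix the target V-gene families manually, uncomment h_vfams and l_vfams in humanization.yaml. The family identifiers below are hypothetical placeholders (the accepted values and their exact format follow src/abbfn2/data_mode_handler/oas_paired/constants.py), and, as noted above, the workflow may still settle on different families during recycling.

cfg:
  input:
    l_seq: "DIVLTQ..."          # Starting light chain (truncated here for brevity)
    h_seq: "QVQLQQ..."          # Starting heavy chain (truncated here for brevity)
    h_vfams: "IGHV1"            # Hypothetical target heavy-chain V-gene family
    l_vfams: "IGKV1"            # Hypothetical target light-chain V-gene family
  sampling:
    recycling_steps: 10         # Number of recycling steps (recommended: 5-12)
    inpaint_fn:
      num_steps: 500            # Number of sampling steps (recommended: 300-1000)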

Data Modes

The data modes supported by AbBFN2 are detailed below.

Heavy-Chain IMGT Regions

| Field | Type | Region (IMGT) | Description | Length Range (AA) |
|---|---|---|---|---|
| h_fwr1_seq | string | FWR1 | Framework region 1 | 18 – 41 |
| h_fwr2_seq | string | FWR2 | Framework region 2 | 6 – 30 |
| h_fwr3_seq | string | FWR3 | Framework region 3 | 29 – 58 |
| h_fwr4_seq | string | FWR4 | Framework region 4 | 3 – 12 |
| h_cdr1_seq | string | CDR1 | Complementarity-determining region 1 | 1 – 22 |
| h_cdr2_seq | string | CDR2 | Complementarity-determining region 2 | 1 – 25 |
| h_cdr3_seq | string | CDR3 | Complementarity-determining region 3 | 2 – 58 |
Light-Chain IMGT Regions

| Field | Type | Region (IMGT) | Description | Length Range (AA) |
|---|---|---|---|---|
| l_fwr1_seq | string | FWR1 | Framework region 1 | 18 – 36 |
| l_fwr2_seq | string | FWR2 | Framework region 2 | 11 – 27 |
| l_fwr3_seq | string | FWR3 | Framework region 3 | 25 – 48 |
| l_fwr4_seq | string | FWR4 | Framework region 4 | 3 – 13 |
| l_cdr1_seq | string | CDR1 | Complementarity-determining region 1 | 1 – 20 |
| l_cdr2_seq | string | CDR2 | Complementarity-determining region 2 | 1 – 16 |
| l_cdr3_seq | string | CDR3 | Complementarity-determining region 3 | 1 – 27 |
CDR Length Metrics

Possible values provided in src/abbfn2/data_mode_handler/oas_paired/constants.py.

| Field | Type | Description |
|---|---|---|
| h1_length | int | CDR1 length (heavy chain) |
| h2_length | int | CDR2 length (heavy chain) |
| h3_length | int | CDR3 length (heavy chain) |
| l1_length | int | CDR1 length (light chain) |
| l2_length | int | CDR2 length (light chain) |
| l3_length | int | CDR3 length (light chain) |
Gene and Family Annotations

Possible values provided in src/abbfn2/data_mode_handler/oas_paired/constants.py.

| Field | Type | Description |
|---|---|---|
| hv_gene | string | V gene segment (heavy) |
| hd_gene | string | D gene segment (heavy) |
| hj_gene | string | J gene segment (heavy) |
| lv_gene | string | V gene segment (light) |
| lj_gene | string | J gene segment (light) |
| hv_family | string | V gene family (heavy) |
| hd_family | string | D gene family (heavy) |
| hj_family | string | J gene family (heavy) |
| lv_family | string | V gene family (light) |
| lj_family | string | J gene family (light) |
| species | string | One of "human", "rat", "mouse" |
| light_locus | string | One of "K" (kappa) or "L" (lambda) |
TAP Physicochemical Metrics

| Field | Type | Description | Range |
|---|---|---|---|
| tap_psh | float | Patches of surface hydrophobicity | 72.0 – 300.0 |
| tap_pnc | float | Patches of negative charge | 0.0 – 10.0 |
| tap_ppc | float | Patches of positive charge | 0.0 – 7.5 |
| tap_sfvcsp | float | Structural Fv charge symmetry parameter | -55.0 – 55.0 |
| tap_psh_flag | string | Surface hydrophobicity flag | "red" / "amber" / "green" |
| tap_pnc_flag | string | Negative charge flag | "red" / "amber" / "green" |
| tap_ppc_flag | string | Positive charge flag | "red" / "amber" / "green" |
| tap_sfvcsp_flag | string | Charge symmetry flag | "red" / "amber" / "green" |
V-, D- and J-Gene Identity Scores

| Field | Type | Description | Range (%) |
|---|---|---|---|
| h_v_identity | float | Heavy-chain V segment identity | 64.0 – 100.0 |
| h_d_identity | float | Heavy-chain D segment identity | 74.0 – 100.0 |
| h_j_identity | float | Heavy-chain J segment identity | 74.0 – 100.0 |
| l_v_identity | float | Light-chain V segment identity | 66.0 – 100.0 |
| l_j_identity | float | Light-chain J segment identity | 77.0 – 100.0 |
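
These data modes can be combined arbitrarily in dm_overwrites and mask_fn.data_modes, subject to the compatibility caveats in the disclaimer above. As a final illustrative sketch (the values are our own assumptions), a sequence-design style configuration conditioning on a heavy-chain CDR3 together with metadata might look like:

cfg:
  input:
    num_input_samples: 2
    dm_overwrites:
      h_cdr3_seq: ARDAGVPLDY    # Heavy-chain CDR3 from the inpainting example above
      species: human            # Target species label
      tap_psh_flag: green       # Request a "green" surface-hydrophobicity flag
  sampling:
    inpaint_fn:
      num_steps: 300
    mask_fn:
      data_modes:
        - "h_cdr3_seq"
        - "species"
        - "tap_psh_flag"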

Citation

If you use AbBFN2 in your research, please cite our work:

@article{Guloglu_etal_AbBFN2,
  title={AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks},
  author={Bora Guloglu and Miguel Bragan\c{c}a and Alex Graves and Scott Cameron and Timothy Atkinson and Liviu Copoiu and Alexandre Laterre and Thomas D Barrett},
  journal={bioRxiv},
  year={2025},
  url={https://www.biorxiv.org/content/10.1101/2025.04.29.651170v1}
}

Acknowledgements

The development of this library was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
