Nicheformer
Nicheformer is a transformer-based model designed for understanding and predicting cellular niches and their interactions. The model uses masked language modeling to learn representations of cellular contexts and their relationships.
Model Description
Nicheformer is built on a transformer architecture with the following key features:
- Architecture: Transformer encoder with customizable number of layers and attention heads
- Pre-training: Masked Language Modeling (MLM) objective with dynamic masking
- Input Processing: Handles cell type, assay, and modality information
- Positional Encoding: Supports both learnable and fixed positional embeddings
- Masking Strategy:
  - 80% of selected tokens are replaced with [MASK]
  - 10% are replaced with random tokens
  - 10% remain unchanged
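As a rough illustration of the masking strategy above, the following sketch applies the 80/10/10 rule to a plain list of token ids. It is not taken from the Nicheformer code base: `MASK_ID`, the function name `mask_tokens`, and the `-100` ignore label (a common PyTorch convention) are assumptions made for the example. The 15% selection rate and the vocabulary size are the values listed in the Model Architecture section of this card.

```python
import random

MASK_ID = 0          # hypothetical id of the [MASK] token (assumption)
VOCAB_SIZE = 25000   # vocabulary size listed in this card
MASK_PROB = 0.15     # masking probability listed in this card
IGNORE_LABEL = -100  # assumed "no loss here" label, as commonly used with PyTorch


def mask_tokens(token_ids, rng, special_ids=frozenset()):
    """Return (masked_ids, labels) for one sequence.

    Each non-special token is selected with probability MASK_PROB.
    A selected token is replaced with MASK_ID (80%), replaced with a
    random token id (10%), or left unchanged (10%). Labels hold the
    original id at selected positions and IGNORE_LABEL elsewhere.
    """
    masked = list(token_ids)
    labels = [IGNORE_LABEL] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= MASK_PROB:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # may coincide with the original id
        # otherwise keep the original token
    return masked, labels


# Dynamic masking: call the function again each time a sequence is seen,
# so the selected positions differ between epochs.
rng = random.Random(0)
masked_ids, labels = mask_tokens(list(range(100, 120)), rng)
```

Because the mask is sampled at call time rather than stored with the data, running this once per batch is one simple way to get dynamic masking.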
Model Architecture
- Transformer encoder layers: 12
- Hidden dimension: 512
- Attention heads: 16
- Feedforward dimension: 1024
- Maximum sequence length: 1500
- Vocabulary size: 25000
- Masking probability: 15%
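The hyperparameters above can be collected in a small configuration object, which also makes the implied per-head dimension explicit. This is only a sketch: the class name `NicheformerConfigSketch` and its field names are hypothetical and do not correspond to the configuration class shipped with the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NicheformerConfigSketch:
    """Hypothetical container for the values listed in this card."""
    num_layers: int = 12
    hidden_dim: int = 512
    num_heads: int = 16
    ffn_dim: int = 1024
    max_seq_len: int = 1500
    vocab_size: int = 25000
    masking_prob: float = 0.15

    def __post_init__(self):
        # Multi-head attention splits the hidden dimension evenly across heads.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_heads


config = NicheformerConfigSketch()
print(config.head_dim)  # 512 // 16 = 32
```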
Usage
import anndata as ad
import numpy as np
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer (trust_remote_code is required for the custom model code)
model = AutoModelForMaskedLM.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)

# Set technology mean for HF tokenizer
technology_mean_path = 'technology_mean.npy'
technology_mean = np.load(technology_mean_path)
tokenizer._load_technology_mean(technology_mean)

# Load your single-cell data
adata = ad.read_h5ad("your_data.h5ad")

# Tokenize the data
inputs = tokenizer(adata)

# Get embeddings from the last layer
embeddings = model.get_embeddings(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    layer=-1,
    with_context=False,
)
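For larger datasets, passing every cell through the model in a single call may not fit in memory. One option is to process the tokenized inputs in chunks. The helper below is a generic sketch, not part of the Nicheformer API: `batch_slices` is a hypothetical name, and it only produces index ranges.

```python
def batch_slices(n_items, batch_size):
    """Yield slice objects that cover range(n_items) in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))


# Example: 10 cells in batches of 4 gives the ranges 0-4, 4-8 and 8-10.
chunks = list(batch_slices(10, 4))
```

Assuming `inputs["input_ids"]` and `inputs["attention_mask"]` are tensors indexable along the cell dimension, each slice `s` could be used as `inputs["input_ids"][s]` and `inputs["attention_mask"][s]` in separate `get_embeddings` calls, with the results concatenated afterwards.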
Training Data
The model was trained on single-cell gene expression data from various tissues and organisms. It supports:
- Modalities: spatial and dissociated
- Species: human and mouse
- Technologies:
  - MERFISH
  - CosMx
  - Xenium
  - 10x Genomics (various versions)
  - CITE-seq
  - Smart-seq v4
Limitations
- The model is specifically designed for gene expression data and may not generalize to other types of biological data
- Performance may vary depending on the quality and type of input data
- The model works best with data from supported species and technologies
License
This model is released under the MIT License. See the LICENSE file for more details.
Contact
For questions and issues, please open an issue on the GitHub repository or contact the maintainers.
nicheformer
This is the official repository for Nicheformer: a foundation model for single-cell and spatial omics.
Citation
If you use our tool or build upon our concepts in your own work, please cite it as:
Schaar, A.C., Tejada-Lapuerta, A., et al. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv (2024). doi: https://doi.org/10.1101/2024.04.15.589472
Contact
For questions and help requests, you can reach out on GitHub or email the corresponding author ([email protected]).