Project Description

This repository contains the trained model for our paper, "Fine-tuning a Sentence Transformer for DNA & Protein tasks", which is currently under review at BMC Bioinformatics. The model, called simcse-dna, is based on the original implementation of SimCSE [1]. The original model was adapted for DNA downstream tasks by training it on a small sample of k-mer tokens generated from the human reference genome, and it can be used to generate sentence embeddings for DNA tasks.
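
For illustration, overlapping 6-mer tokens (the token size used in the usage example below) can be derived from a raw DNA sequence with a sliding window. This is a minimal sketch; the exact tokenization pipeline used for training is described in the paper:

def kmer_tokens(sequence, k=6):
    """Split a DNA sequence into overlapping k-mer tokens (sliding window)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Example: represent a raw sequence as a whitespace-separated "sentence" of 6-mers
sequence = "ATGCGTACGTTAGC"
sentence = " ".join(kmer_tokens(sequence))
print(sentence)  # ATGCGT TGCGTA GCGTAC ...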

Prerequisites


Please see the original SimCSE repository for installation details. The model will also be hosted on Zenodo (DOI: 10.5281/zenodo.11046580).
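
If you only want to run the usage example below, a minimal environment with PyTorch and Hugging Face Transformers should suffice (this is an assumption for quick-start purposes, not the full environment from the SimCSE repository):

pip install torch transformers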

Usage

Run the following code to get the sentence embeddings:


import torch
from transformers import AutoModel, AutoTokenizer

# Load the trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dsfsi/simcse-dna")
model = AutoModel.from_pretrained("dsfsi/simcse-dna")


# `sentences` is your list of DNA sequences represented as 6-mer tokens,
# e.g. sentences = ["ATGCGT TGCGTA GCGTAC", "GTTAGC TTAGCA TAGCAT"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

The retrieved embeddings can then be used as input features for a machine learning classifier, as sketched below.
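
The following is a minimal sketch of that step using scikit-learn's logistic regression (one of the classifiers evaluated below); scikit-learn, the train/test split, and the `labels` variable are illustrative assumptions, not part of this repository:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: the pooled sentence embeddings from the snippet above
# y: hypothetical task labels, one per input sentence
X = embeddings.cpu().numpy()
y = np.array(labels)  # `labels` is assumed to come from your own dataset

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))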

Performance on evaluation tasks

More information about the datasets, and how to access them, is provided in the paper (TBA).

Table: Accuracy scores (with 95% confidence intervals) across datasets T1–T8 for each classifier (LR = logistic regression, LGBM = LightGBM, XGB = XGBoost, RF = random forest) and embedding method.

| Model | Embed.   | T1          | T2          | T3          | T4          | T5          | T6          | T7          | T8          |
|-------|----------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| LR    | Proposed | 0.65 ± 0.01 | 0.67 ± 0.0  | 0.85 ± 0.01 | 0.64 ± 0.01 | 0.80 ± 0.0  | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.70 ± 0.01 |
| LR    | DNABERT  | 0.62 ± 0.01 | 0.65 ± 0.0  | 0.84 ± 0.04 | 0.69 ± 0.01 | 0.85 ± 0.01 | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.60 ± 0.01 |
| LR    | NT       | 0.66 ± 0.0  | 0.67 ± 0.0  | 0.84 ± 0.01 | 0.73 ± 0.0  | 0.85 ± 0.01 | 0.81 ± 0.0  | 0.62 ± 0.01 | 0.99 ± 0.0  |
| LGBM  | Proposed | 0.64 ± 0.01 | 0.66 ± 0.0  | 0.90 ± 0.02 | 0.61 ± 0.01 | 0.78 ± 0.0  | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.81 ± 0.01 |
| LGBM  | DNABERT  | 0.62 ± 0.01 | 0.65 ± 0.01 | 0.90 ± 0.02 | 0.65 ± 0.01 | 0.83 ± 0.0  | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.75 ± 0.01 |
| LGBM  | NT       | 0.63 ± 0.01 | 0.66 ± 0.0  | 0.91 ± 0.02 | 0.72 ± 0.0  | 0.85 ± 0.0  | 0.80 ± 0.0  | 0.59 ± 0.01 | 0.97 ± 0.0  |
| XGB   | Proposed | 0.60 ± 0.01 | 0.62 ± 0.0  | 0.90 ± 0.02 | 0.60 ± 0.0  | 0.77 ± 0.0  | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.85 ± 0.01 |
| XGB   | DNABERT  | 0.59 ± 0.01 | 0.62 ± 0.01 | 0.90 ± 0.01 | 0.64 ± 0.01 | 0.82 ± 0.01 | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.79 ± 0.01 |
| XGB   | NT       | 0.61 ± 0.01 | 0.64 ± 0.0  | 0.90 ± 0.02 | 0.89 ± 0.03 | 0.85 ± 0.01 | 0.81 ± 0.01 | 0.60 ± 0.01 | 0.98 ± 0.0  |
| RF    | Proposed | 0.61 ± 0.0  | 0.66 ± 0.01 | 0.90 ± 0.02 | 0.61 ± 0.01 | 0.77 ± 0.0  | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.86 ± 0.0  |
| RF    | DNABERT  | 0.60 ± 0.0  | 0.66 ± 0.01 | 0.90 ± 0.02 | 0.63 ± 0.01 | 0.82 ± 0.0  | 0.49 ± 0.0  | 0.33 ± 0.0  | 0.81 ± 0.01 |
| RF    | NT       | 0.62 ± 0.01 | 0.67 ± 0.01 | 0.90 ± 0.01 | 0.71 ± 0.01 | 0.85 ± 0.0  | 0.79 ± 0.0  | 0.55 ± 0.01 | 0.97 ± 0.0  |

Table: F1-scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method.

| Model | Embed.   | T1          | T2          | T3          | T4          | T5          | T6          | T7          | T8          |
|-------|----------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| LR    | Proposed | 0.78 ± 0.0  | 0.80 ± 0.01 | 0.20 ± 0.05 | 0.64 ± 0.01 | 0.79 ± 0.0  | 0.13 ± 0.37 | 0.16 ± 0.0  | 0.70 ± 0.01 |
| LR    | DNABERT  | 0.75 ± 0.01 | 0.78 ± 0.0  | 0.47 ± 0.09 | 0.69 ± 0.01 | 0.84 ± 0.01 | 0.13 ± 0.37 | 0.16 ± 0.0  | 0.59 ± 0.01 |
| LR    | NT       | 0.56 ± 0.01 | 0.54 ± 0.0  | 0.78 ± 0.01 | 0.73 ± 0.0  | 0.85 ± 0.01 | 0.81 ± 0.0  | 0.62 ± 0.01 | 0.99 ± 0.0  |
| LGBM  | Proposed | 0.76 ± 0.01 | 0.79 ± 0.0  | 0.60 ± 0.11 | 0.63 ± 0.01 | 0.77 ± 0.0  | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.82 ± 0.0  |
| LGBM  | DNABERT  | 0.74 ± 0.0  | 0.78 ± 0.0  | 0.60 ± 0.08 | 0.66 ± 0.01 | 0.82 ± 0.01 | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.75 ± 0.01 |
| LGBM  | NT       | 0.59 ± 0.01 | 0.56 ± 0.0  | 0.89 ± 0.02 | 0.72 ± 0.01 | 0.85 ± 0.0  | 0.80 ± 0.0  | 0.59 ± 0.01 | 0.97 ± 0.0  |
| XGB   | Proposed | 0.72 ± 0.01 | 0.75 ± 0.0  | 0.59 ± 0.08 | 0.60 ± 0.0  | 0.76 ± 0.0  | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.85 ± 0.01 |
| XGB   | DNABERT  | 0.71 ± 0.01 | 0.75 ± 0.01 | 0.58 ± 0.05 | 0.64 ± 0.01 | 0.82 ± 0.01 | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.79 ± 0.01 |
| XGB   | NT       | 0.59 ± 0.01 | 0.57 ± 0.01 | 0.72 ± 0.01 | 0.85 ± 0.01 | 0.85 ± 0.01 | 0.81 ± 0.01 | 0.60 ± 0.01 | 0.99 ± 0.0  |
| RF    | Proposed | 0.73 ± 0.0  | 0.79 ± 0.0  | 0.58 ± 0.08 | 0.61 ± 0.01 | 0.75 ± 0.0  | 0.53 ± 0.17 | 0.24 ± 0.05 | 0.86 ± 0.0  |
| RF    | DNABERT  | 0.72 ± 0.0  | 0.79 ± 0.0  | 0.59 ± 0.09 | 0.63 ± 0.01 | 0.80 ± 0.01 | 0.53 ± 0.17 | 0.24 ± 0.05 | 0.82 ± 0.01 |
| RF    | NT       | 0.59 ± 0.01 | 0.56 ± 0.01 | 0.89 ± 0.02 | 0.71 ± 0.01 | 0.84 ± 0.0  | 0.79 ± 0.0  | 0.55 ± 0.01 | 0.97 ± 0.0  |

Authors


  • Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa M. Hayes
  • Contact: [email protected]

Citation


BibTeX reference: TBA

References

[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv preprint arXiv:2104.08821 (2021).
