MTL and in silico perturbation
Hi,
I came across warnings when running in silico perturbation with the MTL classifier:
Some weights of the model checkpoint at /model_saved/GeneformerMultiTask/ were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classification_heads.0.bias', 'classification_heads.0.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at /home/jiaming/data2/Geneformer-v2/03_Results/MTL_Classfier_DDP/Heart-test/test1/model_saved/GeneformerMultiTask/ and are newly initialized: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
It seems that the model is not loaded correctly and drops the weights obtained from fine-tuning?
The code is as follows:
from geneformer import EmbExtractor, InSilicoPerturber

# Step 1: extract the goal-state embeddings (state_embs_dict)
emb = EmbExtractor(
    model_type="CellClassifier",
    num_classes=2,
    emb_mode="cls",
    max_ncells=None,
    emb_layer=0,
    forward_batch_size=64,
    nproc=16,
    summary_stat="exact_mean",
    model_version="V2",
    token_dictionary_file=token_dictionary_file,
)
state_embs_dict = emb.get_state_embs(
    cell_states_to_model=cell_states_to_model,
    model_directory=model_directory,
    input_data_file=input_data_file,
    output_directory=output_dir_emb,
    output_prefix="state_embs_dic",
    output_torch_embs=False,
)

# Step 2: run the in silico perturbation with the fine-tuned MTL classifier
isp = InSilicoPerturber(
    perturb_type=perturb_type,
    perturb_rank_shift=None,
    genes_to_perturb="all",
    combos=0,
    anchor_gene=None,
    model_type="MTLCellClassifier",
    num_classes=2,
    emb_mode="cls",
    cell_states_to_model=cell_states_to_model,
    state_embs_dict=state_embs_dict,
    cell_inds_to_perturb={"start": start, "end": end},
    max_ncells=None,
    emb_layer=0,
    forward_batch_size=64,
    nproc=16,
    model_version="V2",
    token_dictionary_file=token_dictionary_file,
    clear_mem_ncells=64,
)
isp.perturb_data(
    model_directory=model_directory,
    input_data_file=input_data_file,
    output_directory=output_dir_insilico,
    output_prefix=output_prefix,
)
The model I loaded was fine-tuned with a one-task V2 MTL classifier (classifying 2 cell states).
Thanks for your question. Since you are loading an MTL model as MaskedLM, this is expected, so you can ignore this warning.
Regarding your code, you should use Pretrained as the model type when generating the state embeddings dictionary (state_embs_dict), to match what you use below for the in silico perturbation.
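For example, only the embedding step changes. A minimal sketch, reusing the variables from your snippet (num_classes is set to 0 here, since no classification head is used with the pretrained model type); the InSilicoPerturber call stays as you have it:

# Sketch: extract the state embeddings with the pretrained model type
emb = EmbExtractor(
    model_type="Pretrained",  # instead of "CellClassifier"
    num_classes=0,  # no classification head is used with Pretrained
    emb_mode="cls",
    max_ncells=None,
    emb_layer=0,
    forward_batch_size=64,
    nproc=16,
    summary_stat="exact_mean",
    model_version="V2",
    token_dictionary_file=token_dictionary_file,
)
state_embs_dict = emb.get_state_embs(
    cell_states_to_model=cell_states_to_model,
    model_directory=model_directory,
    input_data_file=input_data_file,
    output_directory=output_dir_emb,
    output_prefix="state_embs_dic",
    output_torch_embs=False,
)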
Thank you for your response. When running the in silico pipeline, I loaded a fine-tuned MTL classifier, so I’m unclear why we should specify “Pretrained” as the model type when generating the state embeddings dictionary.
I’ve also tested the V2 model in DDP-MTL mode and noticed a slowdown. Although DDP usually speeds things up, the V2 model is much larger (~400 MB) than the previous 95M-parameter, 12-layer model (145 MB). Could I use the original 95M pretrained model within the V2 MTL framework, and if so, would you recommend doing that?
Wish you all well
Besides, you mentioned, “Since you are loading an MTL model as MaskedLM, this is expected, so you can ignore this warning.” However, I’m actually trying to use my fine-tuned MTL model for perturbations, so it looks like the perturbation step isn’t using my updated weights. Has the pipeline dropped my fine-tuned parameters when generating embeddings?
The MTL heads are not used for the embeddings, so you can use Pretrained, as indicated in the documentation and examples. The fine-tuned encoder weights are still used; only the classification heads are unneeded. The larger 316M-parameter model is more computationally intensive. You may consider using the 104M-parameter model, also provided in this repository.
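If you want to confirm that the fine-tuned encoder weights are kept, a minimal sanity check (just a sketch, reusing the model_directory variable from your snippet) is to load the checkpoint both as MaskedLM, as the perturber does, and as a bare encoder, and compare the weights; only the pooler and classification heads should differ:

import torch
from transformers import BertForMaskedLM, BertModel

# Load the fine-tuned MTL checkpoint the way the perturber does (as MaskedLM)
# and as a bare encoder, then verify the fine-tuned encoder weights match.
mlm = BertForMaskedLM.from_pretrained(model_directory)
encoder = BertModel.from_pretrained(model_directory, add_pooling_layer=False)

mlm_state = mlm.state_dict()
for name, param in encoder.state_dict().items():
    assert torch.equal(param, mlm_state["bert." + name]), f"mismatch in {name}"
print("Fine-tuned encoder weights are retained; only the task heads are dropped.")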