error running the notebook: extract_and_plot_cell_embeddings

#535
by tomererez - opened

I am trying to run it with the following:

1. Input data: the dataset recommended in the notebook (https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset)
2. token_dictionary_file="geneformer/gene_dictionaries_30m/token_dictionary_gc30M.pkl"
3. Model: fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224

I assumed the tokenizer and the model would be compatible because both are from the 30M series, and I see that this combination works in your example notebook.

However, I am getting an error saying that the <cls> and <eos> tokens are not in the dictionary.
Should I add them and change the first and last tokens of the dataset, or am I running an invalid model-data-tokenizer combination?

The exact error is:
Traceback (most recent call last):
  File "C:\Users\tomer.erez\PycharmProjects\trio-formers\genef_from_hf\Geneformer\examples\ext_plot_emb.py", line 51, in <module>
    embs = embex.extract_embs(f"{genef_path}/fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224", # example 30M fine-tuned model
  File "C:\Users\tomer.erez\AppData\Local\anaconda3\envs\genef\lib\site-packages\geneformer\emb_extractor.py", line 599, in extract_embs
    embs = get_embs(
  File "C:\Users\tomer.erez\AppData\Local\anaconda3\envs\genef\lib\site-packages\geneformer\emb_extractor.py", line 71, in get_embs
    assert cls_present, "<cls> token missing in token dictionary"
AssertionError: <cls> token missing in token dictionary

Thanks for your question. The first model does not have a <cls> token, so you can change the emb_mode to “cell” in the EmbExtractor to resolve this.
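As a rough illustration of the check behind that assertion, the sketch below inspects a pickled token dictionary and picks an embedding mode accordingly. The helper names (load_token_dictionary, choose_emb_mode) and the toy dictionaries are hypothetical and exist only for this example. It assumes the dictionary is a plain {token: id} mapping and that "cls" and "cell" are the two modes of interest. It is not part of the Geneformer API.

```python
import pickle
import tempfile
from pathlib import Path


def load_token_dictionary(path):
    """Load a pickled {token_name: token_id} dictionary (hypothetical helper)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def choose_emb_mode(token_dict):
    """Return "cls" only if the dictionary defines a <cls> token, else "cell"."""
    return "cls" if "<cls>" in token_dict else "cell"


# Toy dictionaries for illustration only; real ones map many gene IDs to token IDs.
toy_without_cls = {"<pad>": 0, "<mask>": 1, "ENSG00000000003": 2}
toy_with_cls = {"<pad>": 0, "<mask>": 1, "<cls>": 2, "<eos>": 3}

# Round-trip the first toy dictionary through a pickle file, as one would
# with a real token dictionary file on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_token_dictionary.pkl"
    with open(path, "wb") as fh:
        pickle.dump(toy_without_cls, fh)
    mode_without_cls = choose_emb_mode(load_token_dictionary(path))

mode_with_cls = choose_emb_mode(toy_with_cls)
print(mode_without_cls, mode_with_cls)  # prints: cell cls
```

The resulting string could then be passed as emb_mode when constructing the EmbExtractor, alongside the same token_dictionary_file, so that a dictionary without <cls> falls back to "cell".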

ctheodoris changed discussion status to closed

Thanks for the question and answer. I also encountered this problem and solved it by following the suggestions here: https://huggingface.co/ctheodoris/Geneformer/discussions/541#684fce493f3300bc9f8eddd0
