error running the notebook: extract_and_plot_cell_embeddings
I am trying to run it with the following:
1. input data: the one recommended in the notebook (https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset)
2. token_dictionary_file="geneformer/gene_dictionaries_30m/token_dictionary_gc30M.pkl"
3. model: fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224
I thought the tokenizer and the model would be compatible because they are both from the 30M series, and I see that this combination works in your example notebook.
However, I am getting an error saying that the <cls> and <eos> tokens are not in the dictionary.
Should I add those tokens and fix the dataset by changing the first and last tokens, or am I running an invalid model-data-tokenizer combination?
The exact error is:
Traceback (most recent call last):
  File "C:\Users\tomer.erez\PycharmProjects\trio-formers\genef_from_hf\Geneformer\examples\ext_plot_emb.py", line 51, in <module>
    embs = embex.extract_embs(f"{genef_path}/fine_tuned_models/gf-6L-30M-i2048_CellClassifier_cardiomyopathies_220224", # example 30M fine-tuned model
  File "C:\Users\tomer.erez\AppData\Local\anaconda3\envs\genef\lib\site-packages\geneformer\emb_extractor.py", line 599, in extract_embs
    embs = get_embs(
  File "C:\Users\tomer.erez\AppData\Local\anaconda3\envs\genef\lib\site-packages\geneformer\emb_extractor.py", line 71, in get_embs
    assert cls_present, "<cls> token missing in token dictionary"
AssertionError: <cls> token missing in token dictionary
Thanks for your question. The first model does not have a <cls> token, so you can change the emb_mode to "cell" in the EmbExtractor to resolve this.
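For anyone landing here later, a rough sketch of that change is below. It is only an illustration, not the notebook's exact code: pick_emb_mode and build_extractor are hypothetical helper names, and model_type="CellClassifier" and num_classes=3 are assumptions based on the example notebook for this model, so check them against your installed Geneformer version. The idea is to look in the token dictionary for a <cls> entry and fall back to emb_mode="cell" when it is absent.

```python
import pickle
from pathlib import Path


def pick_emb_mode(token_dictionary: dict) -> str:
    """Return "cls" only if the dictionary defines a <cls> token, else "cell"."""
    return "cls" if "<cls>" in token_dictionary else "cell"


def build_extractor(token_dictionary_file: str):
    """Hypothetical helper: build an EmbExtractor whose emb_mode fits the dictionary.

    Assumption: model_type and num_classes follow the example notebook and
    may need adjusting for other models or Geneformer versions.
    """
    # Imported lazily so pick_emb_mode can be used without geneformer installed.
    from geneformer import EmbExtractor

    with Path(token_dictionary_file).open("rb") as fh:
        token_dictionary = pickle.load(fh)

    return EmbExtractor(
        model_type="CellClassifier",  # assumption: fine-tuned classifier model
        num_classes=3,                # assumption: three disease classes in this dataset
        emb_mode=pick_emb_mode(token_dictionary),
        token_dictionary_file=token_dictionary_file,
    )
```

With the 30M token dictionary from the question, pick_emb_mode should return "cell" if, as the answer above says, that dictionary has no <cls> token, and the resulting extractor can then be passed the same arguments to extract_embs as before.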
Thanks for the question and answer. I also encountered this problem and solved it by following the suggestions here: https://huggingface.co/ctheodoris/Geneformer/discussions/541#684fce493f3300bc9f8eddd0