some issues about input data and cell type annotation
Hi Christina,
While reading through some past discussions, I came across issue #130, where the poster mentioned that using scaled data led to poor results in cell type annotation tasks. This really concerns me, as I’ve performed extensive fine-tuning using a large dataset that inevitably includes both raw counts and scaled data prior to tokenization. I previously assumed this wouldn’t cause any issues, since the tokenizer uses an expression rank-based strategy, meaning the impact on gene ranking should be minimal regardless of whether the input is raw or scaled counts. However, after reading that discussion, I’m no longer so certain.
I also have a question about padding. In that same discussion, the poster mentioned manually padding genes to the same length. Is this necessary when using Geneformer? I had assumed that padding was handled automatically by the tokenizer, but now I’m unsure.
Lastly, I’m interested in training a classifier for cell type annotation. It seems there is no pre-trained model specifically fine-tuned for this task in your repository, nor is there example code available. Does this mean we need to collect a large cell-type-labeled dataset ourselves (or could the 30M dataset you uploaded be used for this purpose?) in order to fine-tune a model specifically for cell type annotation?
Thanks for your questions.
As discussed in the tokenizer documentation and examples, the input for the tokenizer is raw counts data. We do not recommend using normalized or scaled data, because these transformations are not always linear. If, say, all raw counts in a cell were multiplied by 10 for all genes, then yes, the rank value encoding would be maintained. However, many transformations do not follow this simple pattern.
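To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not Geneformer's actual rank value encoding, which also involves corpus-wide normalization). It shows that uniformly multiplying a cell's counts preserves the within-cell gene ranking, while per-gene scaling (a typical "scaled data" output) can change it:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy cells-by-genes count matrix; genes deliberately have very different scales.
counts = rng.poisson(lam=[5.0, 50.0, 2.0, 20.0], size=(100, 4)).astype(float)
cell = counts[0]

# Uniform (linear) scaling of a cell leaves the rank order of its genes untouched.
assert np.array_equal(
    np.argsort(-cell, kind="stable"),
    np.argsort(-(cell * 10), kind="stable"),
)

# Per-gene z-scoring uses gene-specific means and variances,
# so the within-cell ordering of genes can change.
scaled = (counts - counts.mean(axis=0)) / counts.std(axis=0)
print(np.argsort(-cell, kind="stable"))       # gene ranking from raw counts
print(np.argsort(-scaled[0], kind="stable"))  # gene ranking from scaled data; generally differs
```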
Padding for training or inference is automatically handled at those steps when using our built-in functions, so there is no need to pad the tokenized cells manually.
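For reference, a rough sketch of how this fits together during fine-tuning, based on the repository's example notebooks: the data collator pads each batch on the fly, so the tokenized cells can remain variable-length. The checkpoint path, label setup, and constructor arguments below are assumptions and may differ across Geneformer versions, so please check the example notebooks for your version.

```python
# Sketch based on the example notebooks; no manual padding of the tokenized
# cells is needed, since the collator pads each batch dynamically.
from datasets import load_from_disk
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from geneformer import DataCollatorForCellClassification

train_ds = load_from_disk("my_tokenized.dataset")  # output of the TranscriptomeTokenizer

model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer",  # pretrained checkpoint on the Hugging Face Hub
    num_labels=5,             # number of cell type classes in your labeled data (assumption)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned_model", per_device_train_batch_size=12),
    data_collator=DataCollatorForCellClassification(),  # pads variable-length input_ids per batch
    train_dataset=train_ds,   # assumes a "label" column with the cell type classes
)
trainer.train()
```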