ctheodoris/Geneformer · label of dataset and prediction label mapping when using MTL

14 days ago

Hi!

Thanks for your previous answer. I have some other questions about MTL.

Task name vs. column name
The MTL trainer now requires a task_name that exactly matches a column in the raw dataset (for example, disease). In the previous version, when you specified disease as the key, the tokenizer would rename that column to label and generate an id_class.pkl file mapping each integer (0, 1, …) back to its original state. In v2, I don’t see an id_class.pkl being created, nor any automatic renaming of disease to label.

If I want to reuse my existing tokenized files without re-tokenizing, do I need to manually rename the label column back to disease so that --task_name disease will still work?

Or is there another recommended way to point v2 at pre-tokenized data?

Reconstructing the label–ID mapping
Since v2 doesn’t emit an id_class.pkl, what’s the best practice for recreating or preserving the mapping between numeric labels and their original strings? For example, when using the trained model to predict on new data, how can I be sure that “0” corresponds to “healthy” and “1” to “disease”?
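One tool-independent way to preserve such a mapping is to build it explicitly and save it next to the fine-tuned model. The following is only a minimal sketch under stated assumptions: the function names, the JSON file name, and the choice of sorted label order are all hypothetical and are not taken from the Geneformer code, whose own ordering may differ.

```python
import json
import tempfile
from pathlib import Path


def build_label_mapping(labels):
    """Assign integer IDs to the sorted unique label strings (assumed ordering)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def save_mapping(mapping, path):
    """Write the label -> ID mapping as human-readable JSON."""
    Path(path).write_text(json.dumps(mapping, indent=2))


def load_id_to_label(path):
    """Reload the saved mapping and invert it to ID -> label."""
    mapping = json.loads(Path(path).read_text())
    return {idx: label for label, idx in mapping.items()}


# Hypothetical raw labels, as they might appear in a "disease" column.
raw_labels = ["healthy", "disease", "healthy", "disease"]

mapping = build_label_mapping(raw_labels)

# Hypothetical output location; a temporary directory is used for the demo.
out_path = Path(tempfile.mkdtemp()) / "disease_label_mapping.json"
save_mapping(mapping, out_path)

id_to_label = load_id_to_label(out_path)
print(mapping)
print(id_to_label)
```

With alphabetical ordering, "disease" receives ID 0 and "healthy" receives ID 1, the opposite of the example in the question. This is why the saved file, rather than intuition, should be treated as the reference when decoding predictions.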

Thank you again and best regards

ZYSK-huggingface

14 days ago

edited 14 days ago

Besides, I would like to know whether MTL produces a prediction output so that I can check the predicted result for each cell barcode, and whether this fine-tuned MTL model can be used with the 'evaluate' function defined by 'Classifier'?

ctheodoris

Owner 13 days ago

Thanks for your questions. The ID class labels you are referring to are unrelated to tokenization; they are just labels for the classes during fine-tuning. V1 vs. V2 refers only to the model update, and the fine-tuning code has not been changed for the new models. The MTL code includes a task mappings file in its output that relates to the labels. The MTL code also has its own evaluate function; please check the documentation.
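As a rough illustration of how a mappings file like this could be used to decode per-cell predictions, here is a minimal sketch. Everything specific in it is an assumption: the file name, the pickle format, the nested `{task_name: {label: id}}` layout, and the barcodes are hypothetical stand-ins, so the actual file in the MTL output directory should be inspected before relying on any of this.

```python
import pickle
import tempfile
from pathlib import Path


def load_id_to_label(mapping_path, task_name):
    """Load an assumed {task: {label: id}} pickle and invert one task to id -> label."""
    with open(mapping_path, "rb") as handle:
        task_mappings = pickle.load(handle)
    label_to_id = task_mappings[task_name]
    return {idx: label for label, idx in label_to_id.items()}


def decode_predictions(barcodes, predicted_ids, id_to_label):
    """Pair each cell barcode with the label string of its predicted ID."""
    if len(barcodes) != len(predicted_ids):
        raise ValueError("barcodes and predicted_ids must have the same length")
    return {
        barcode: id_to_label[int(pred)]
        for barcode, pred in zip(barcodes, predicted_ids)
    }


# Stand-in mappings file written only for this demo (hypothetical name and layout).
demo_path = Path(tempfile.mkdtemp()) / "demo_task_mappings.pkl"
with open(demo_path, "wb") as handle:
    pickle.dump({"disease": {"healthy": 0, "disease": 1}}, handle)

id_to_label = load_id_to_label(demo_path, "disease")

# Hypothetical barcodes and predicted class IDs for three cells.
decoded = decode_predictions(["AAAC-1", "AAAG-1", "AACT-1"], [0, 1, 0], id_to_label)
print(decoded)
```

Decoding through the saved mapping, rather than hard-coding "0 is healthy", keeps the interpretation of predictions tied to whatever assignment was used during fine-tuning.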

ctheodoris changed discussion status to closed 13 days ago