questions on MTL fine-tuning and insilico

#547
by ZYSK-huggingface - opened

Hi, I noticed that you've recently released updated code and model versions. To use the new V2 model, do I need to re-tokenize my existing data that was previously tokenized using the 95M dictionary?

Additionally, I saw that the MTL classifier training now supports DDP for multi-GPU training. I'd like to confirm that classifiers trained with DDP and non-DDP settings are functionally equivalent. This is important because I plan to use the trained model for in silico perturbations later on, and the available classifier options for perturbation include "MTLCellClassifier" and "MTLCellClassifier-Quantized".

If I want to use the "MTLCellClassifier-Quantized" option, should there be a separate quantization step? I didn't notice any quantization-related steps in the previous training scripts.

Finally, I'd like to know whether it is appropriate to use the MTL framework with only a single task.

Thanks a lot for your clarification!

And another question about the InSilicoPerturber:

Since the input .pt files (tokenized data) don’t include any cell_id, I assume cell_inds_to_perturb just refers to the index position of each cell in the dataset, right?

In that case, the cell_id we build during MTL training (e.g., cell_{idx}) won’t affect perturbation at all, correct?

Just want to confirm that everything in the perturbation pipeline works off positional indices, not some unique ID.

Thanks a lot!

ZYSK-huggingface changed discussion title from questions on MTL fine-tuning to questions on MTL fine-tuning and insilico

Thank you for your questions.

  • The 95M dictionary is compatible with the V2 model.
  • Distributed training only affects speed and memory usage, so classifiers trained with or without DDP are functionally equivalent. However, if you change the hyperparameters (e.g., use a larger batch size), that does affect training. Distributed training does not affect inference with a previously trained model.
  • Quantization is built into fine-tuning and inference by selecting the appropriate option for the given argument (e.g., "MTLCellClassifier-Quantized" for in silico perturbation); there is no separate quantization step (see the perturbation sketch after this list).
  • You can use the MTL code for a single task (see the single-task sketch after this list).
  • cell_inds_to_perturb is there to facilitate parallelization by selecting a range of positional indices for each job; it is agnostic of cell labels or IDs (see the perturbation sketch after this list).
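
For reference, here is a minimal sketch of the single-task case. The argument names (task_columns, pretrained_path, etc.) and the run_optuna_study call follow the MTL classification example but are written from memory, so treat them as assumptions and check them against the released example before running.

```python
# Minimal sketch: MTL classifier fine-tuning with a single task.
# Argument names and values below are assumptions based on the MTL
# classification example; verify against the released notebook.
from geneformer import MTLClassifier

mc = MTLClassifier(
    task_columns=["cell_type"],        # a single task column; more can be added later
    study_name="single_task_example",
    pretrained_path="/path/to/pretrained_V2_model",
    train_path="/path/to/train_set",
    val_path="/path/to/val_set",
    test_path="/path/to/test_set",
    model_save_path="/path/to/model_output",
    results_dir="/path/to/results",
    epochs=1,
    n_trials=1,                        # number of hyperparameter trials
)

mc.run_optuna_study()                  # hyperparameter search + training
```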
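
And a minimal sketch of in silico perturbation with the quantized MTL classifier, using cell_inds_to_perturb to slice the dataset by positional index so several jobs can each handle a different range. The {"start": ..., "end": ...} format and the specific argument values are assumptions to verify against the in silico perturbation example.

```python
# Minimal sketch: in silico perturbation with the quantized MTL classifier.
# cell_inds_to_perturb selects a positional slice of the tokenized dataset
# for this job; cell labels/IDs are not used. Values are illustrative.
from geneformer import InSilicoPerturber

isp = InSilicoPerturber(
    perturb_type="delete",
    genes_to_perturb="all",
    model_type="MTLCellClassifier-Quantized",        # quantization selected here; no separate step
    num_classes=3,                                   # set to your task's number of classes
    emb_mode="cls",
    cell_inds_to_perturb={"start": 0, "end": 1000},  # positional index range for this job
    forward_batch_size=100,
    nproc=4,
)

isp.perturb_data(
    model_directory="/path/to/fine_tuned_mtl_model",
    input_data_file="/path/to/tokenized_data.dataset",
    output_directory="/path/to/output",
    output_prefix="job_0",
)
```

A second job would simply use a different index range (e.g., {"start": 1000, "end": 2000}) on the same dataset; since the ranges are positional, any cell_id built during MTL training has no effect here.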
ctheodoris changed discussion status to closed
