questions on MTL fine-tuning and in silico perturbation
Hi, I noticed that you've recently released updated code and model versions. To use the new V2 model, do I need to re-tokenize my existing data that was previously tokenized using the 95M dictionary?
Additionally, I saw that the MTL classifier training now supports DDP for multi-GPU training. I'd like to confirm whether classifiers trained with and without DDP are functionally equivalent. This matters because I plan to use the trained model for in silico perturbation later on, and the available classifier options for perturbation include "MTLCellClassifier" and "MTLCellClassifier-Quantized".
If I want to use the "MTLCellClassifier-Quantized" option, should there be a separate quantization step? I didn't notice any quantization-related steps in the previous training scripts.
Finally, I'd like to know whether it is appropriate to use the MTL framework when there is only one task.
Thanks a lot for your clarification!
And another question about the InSilicoPerturber:
Since the input .pt files (tokenized data) don’t include any cell_id, I assume cell_inds_to_perturb just refers to the index position of each cell in the dataset, right?
In that case, the cell_id we build during MTL training (e.g., cell_{idx}) won’t affect perturbation at all, correct?
Just want to confirm that everything in the perturbation pipeline works off positional indices, not some unique ID.
Thanks a lot!
Thank you for your questions.
- The 95M dictionary is compatible with the V2 model, so you do not need to re-tokenize your existing data.
- Distributed training only affects speed and memory usage, so classifiers trained with and without DDP are functionally equivalent. However, if you also change hyperparameters (e.g., use a larger effective batch size), that does affect training. DDP has no effect on inference with a previously trained model.
- Quantization is built into fine-tuning and inference: select the appropriate option for the given argument (e.g., the "MTLCellClassifier-Quantized" model type), and no separate quantization step is needed (see the sketch below).
- You can use the MTL code for 1 task.
- cell_inds_to_perturb is there to facilitate parallelization by assigning a range of positional indices to each job. It is agnostic of cell labels, so the cell_{idx} IDs built during MTL training do not affect perturbation (see the sketch below).
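
To illustrate the last two points, here is a minimal sketch of how one job in a parallelized run might be configured with the quantized MTL classifier. It assumes the current InSilicoPerturber interface; all paths are placeholders, and the exact keys expected by cell_inds_to_perturb (assumed here to be "start" and "end"), as well as any additional arguments your setup requires (e.g., num_classes, emb_mode), should be confirmed against the docstring of your installed Geneformer version.

```python
# Sketch: one job of a parallelized in silico perturbation run.
# model_type and cell_inds_to_perturb are the options discussed above;
# paths and index values are placeholders for your own data.
from geneformer import InSilicoPerturber

isp = InSilicoPerturber(
    perturb_type="delete",
    genes_to_perturb="all",
    # Quantization is handled internally when this option is selected;
    # no separate quantization step is run beforehand.
    model_type="MTLCellClassifier-Quantized",
    # Positional indices into the tokenized dataset, not cell IDs or labels;
    # each job gets its own non-overlapping range.
    cell_inds_to_perturb={"start": 0, "end": 1000},
    max_ncells=None,  # None = do not subsample within the assigned range
    forward_batch_size=200,
    nproc=8,
)

isp.perturb_data(
    "path/to/fine_tuned_mtl_model",
    "path/to/tokenized_dataset",
    "path/to/output_dir",
    "output_prefix",
)
```

Launching several such jobs with adjacent, non-overlapping start/end ranges covers the full dataset, since selection is purely positional.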