Question about Geneformer Perturbation and Extended Data Fig. 2C/2E
Hello,
Thank you for all your work on Geneformer. I have a question regarding perturbation analysis, specifically how to recreate the plots shown in Extended Data Figures 2C and 2E.
I have a list of genes I’d like to perturb, with the goal of evaluating whether other genes shift toward a specific goal state. I’m particularly interested in reproducing a plot similar to Extended Data Fig. 2C, where a defined gene set is perturbed and compared to a random gene list to observe whether there’s a shift toward the desired state.
Could you please clarify how Fig. 2C was generated, and which specific options or parameters were used in the insilico_perturber to achieve this?
Thank you so much for your time and help — I really appreciate it, and I look forward to hearing from you.
--
Thank you for your question. Here, we either overexpressed the given gene(s) (moving to front of rank value encoding) or overexpressed random genes to get the random distribution and calculated the shift from the start to the goal state. Using the InSilicoPerturber, you can use the "overexpress" option with the genes_to_perturb being the ones you'd like to overexpress. Then, you can repeat this with random genes in random cells to get the random distribution. Then, you can aggregate the shifts using the "aggregate_data" option in the InSilicoPerturberStats. You can plot them with a KDE plot and calculate significance using Wilcoxon rank sums (with multiple hypothesis correction if you are testing many perturbations).
Thank you for the prompt reply. In the case where we have 4 genes that we want to perturb as a list. If I perturb these genes together setting the start and goal state, it outputs a single shift to end value. If I also do this with a random gene list for comparison, it also outputs a single shift to end value.
How do I get multiple shifts values to be able to plot a KDE plot ?
If the aggregate_data option does not extract all values, you can use the read_dictionaries function from the stats module to extract the values from the intermediate output files.
Thank you for the response. I am able to load the pickle files after perturbation using the goal state shift. I get two lists in a dictionary:
start_cell_state with a list of cosine shifts across the perturbed cells
goal_cell_state with a list of cosine shifts across the perturbed cells
How are these lists meant to be interpreted ? I am unsure why there is also a cosine shift for the goal cell state.
You should use the goal state values, like this:
https://huggingface.co/ctheodoris/Geneformer/blob/main/geneformer/in_silico_perturber_stats.py#L320