Some questions about gene type prediction task

by iLOVE2D - opened Jun 5, 2023

Jun 5, 2023

Hi, congrats on your great work! I took a look at the supplementary table for the gene type prediction task, and the second dataset is a little ambiguous to me. I cannot find that dataset ("15K embryonic stem cells (ESCs)", ref. 29) in PanglaoDB. Could you please offer more information? Thanks a lot.

ctheodoris

Owner Jun 6, 2023

Thank you for your interest in Geneformer. The dataset used for fine-tuning the model to distinguish bivalent promoters was from PanglaoDB, SRA553822-SRS2119548. In the example_input_files directory, we added the labels for the genes in the 56 highly conserved regions reported in Bernstein et al. 2006.

ctheodoris changed discussion status to closed Jun 6, 2023

iLOVE2D

Jun 6, 2023

Hi, so does "15K" refer to an identifier for this dataset rather than the number of cells? Thanks a lot.

iLOVE2D changed discussion status to open Jun 6, 2023

ctheodoris

Owner Jun 6, 2023

edited Jun 22, 2023

15K refers to the number of cells. [Update: we have stored the embryonic stem cell .dataset in the dataset repository: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files]
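For readers who want only that directory rather than the whole repository, here is a minimal sketch, not the maintainers' documented procedure. It splits the URL above into a repository id, revision, and subdirectory using the standard library, then passes them to `snapshot_download` from the `huggingface_hub` package (assumed to be installed). The helper names `parse_dataset_tree_url` and `fetch_subdirectory` and the `local_dir` default are hypothetical, chosen for illustration.

```python
from urllib.parse import urlparse

# URL quoted in the reply above.
EXAMPLE_URL = (
    "https://huggingface.co/datasets/ctheodoris/Genecorpus-30M"
    "/tree/main/example_input_files"
)


def parse_dataset_tree_url(url):
    """Split a Hub dataset 'tree' URL into (repo_id, revision, subdir).

    Hypothetical helper: assumes the path layout
    /datasets/<owner>/<name>/tree/<revision>/<subdir...>.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 5 or parts[0] != "datasets" or parts[3] != "tree":
        raise ValueError(f"not a dataset tree URL: {url}")
    repo_id = f"{parts[1]}/{parts[2]}"
    revision = parts[4]
    subdir = "/".join(parts[5:])
    return repo_id, revision, subdir


def fetch_subdirectory(url, local_dir="genecorpus_examples"):
    """Download only the files under the URL's subdirectory.

    Imported lazily so the parsing helper works without huggingface_hub.
    Requires network access; local_dir is an illustrative default.
    """
    from huggingface_hub import snapshot_download

    repo_id, revision, subdir = parse_dataset_tree_url(url)
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        # Restrict the download to the subdirectory, or fetch all if empty.
        allow_patterns=[f"{subdir}/*"] if subdir else None,
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(parse_dataset_tree_url(EXAMPLE_URL))
```

Calling `fetch_subdirectory(EXAMPLE_URL)` would then download the contents of `example_input_files` into the local directory. The names of the files inside it are not stated in this thread, so check the directory listing before loading anything.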

ctheodoris changed discussion status to closed Jun 6, 2023
