OODyssey

Team Members: Anatoly Buchin, Antoine Argante, Meriem Bensouda, Sreenath Srikrishnan

For the Tahoe-100M hackathon, we built a set of increasingly difficult benchmarks and tested how well existing and new models perform on them.

We chose this area because, while there is no shortage of model publications, it is currently unclear how well the published models generalize to unseen data and clinically relevant scenarios.

We put a lot of effort into splitting the data thoughtfully. We created several categories of held-out data that are increasingly out-of-distribution (OOD). From easiest to hardest:

Plate 14, which replicates data from other plates and should therefore be identical to some of the training data, except for technical variation.
Drugs (cell lines) for which the model has seen other drugs (cell lines) with the same mechanism of action (from the same organ).
Drugs (cell lines) whose mechanism of action (organ) is completely novel to the model.
External datasets (such as Sciplex3 and TCGA) that share some drugs with Tahoe but contain a lot of technical variation relative to it.
Drug combination datasets (extremely hard): We found one dataset (GSE206741) that combines two of the drugs found in Tahoe. To do well on this test set, the model must understand not only the effect of each drug on its own but also how the two drugs interact.
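To make the difference between the second and third categories concrete, here is a minimal sketch of how drugs could be assigned to a training set, a "seen mechanism of action" test set, and a "novel mechanism of action" test set. It is not the code we used for the hackathon: the function name, the drug names and the mechanism labels are all hypothetical placeholders, and the same logic would apply to cell lines grouped by organ.

```python
import random
from collections import defaultdict


def split_drugs_by_moa(drug_to_moa, held_out_moas, n_held_out_per_moa=1, seed=0):
    """Return (train, seen_moa_test, novel_moa_test) as sets of drug names.

    Hypothetical helper: drugs whose mechanism of action (MoA) is in
    `held_out_moas` go entirely to the novel-MoA test set; for every other
    MoA, a few drugs are held out while at least one stays in training.
    """
    rng = random.Random(seed)
    by_moa = defaultdict(list)
    for drug, moa in sorted(drug_to_moa.items()):
        by_moa[moa].append(drug)

    train, seen_moa_test, novel_moa_test = set(), set(), set()
    for moa, drugs in by_moa.items():
        if moa in held_out_moas:
            # Hardest case: the model never sees this MoA during training.
            novel_moa_test.update(drugs)
            continue
        rng.shuffle(drugs)
        # Easier case: hold out some drugs, but keep at least one per MoA
        # in training so the MoA itself is not novel.
        k = min(n_held_out_per_moa, len(drugs) - 1)
        seen_moa_test.update(drugs[:k])
        train.update(drugs[k:])
    return train, seen_moa_test, novel_moa_test


# Placeholder labels, not real Tahoe annotations.
drug_to_moa = {
    "drug_a": "moa_1", "drug_b": "moa_1", "drug_c": "moa_1",
    "drug_d": "moa_2", "drug_e": "moa_2",
    "drug_f": "moa_3", "drug_g": "moa_3",
}
train, seen_test, novel_test = split_drugs_by_moa(drug_to_moa, held_out_moas={"moa_3"})
```

Splitting at the level of drugs (or cell lines) rather than individual cells is the point of the exercise: a random split over cells would leave every drug and every mechanism represented in training, so it would not measure OOD generalization.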

We then benchmarked different models against these test sets:

For a simple baseline, we used PCA to embed the data and then fit a logistic regression to predict the organ or drug label. The other models we trained and compared were Transcriptformer, contrastiveVI, and scVI.
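As an illustration of the baseline, the following sketch chains PCA and logistic regression with scikit-learn. The library choice, the number of components, and the synthetic data are assumptions made for this example only; they stand in for a real cells-by-genes expression matrix with organ or drug labels and do not reproduce our results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix: 3 classes (e.g. organs),
# 100 cells each, 50 "genes". Each class is shifted by its own random center.
n_classes, n_cells_per_class, n_genes = 3, 100, 50
centers = rng.normal(scale=3.0, size=(n_classes, n_genes))
X = np.vstack([c + rng.normal(size=(n_cells_per_class, n_genes)) for c in centers])
y = np.repeat(np.arange(n_classes), n_cells_per_class)

# Random split for illustration only. In the benchmark, the test cells
# would come from one of the held-out categories described above.
idx = rng.permutation(len(y))
train_idx, test_idx = idx[:200], idx[200:]

# PCA embedding followed by a logistic regression on the components.
# n_components=10 is an arbitrary choice for this sketch.
baseline = make_pipeline(
    PCA(n_components=10, random_state=0),
    LogisticRegression(max_iter=1000),
)
baseline.fit(X[train_idx], y[train_idx])
accuracy = baseline.score(X[test_idx], y[test_idx])
print(f"baseline accuracy: {accuracy:.2f}")
```

Wrapping both steps in one pipeline ensures PCA is fitted on the training cells only, so no information from the held-out cells leaks into the embedding.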

As an example, here are the results of predicting the organ of a held-out cell line when the other cell lines of that organ were included in the training data:

We found that Transcriptformer (a zero-shot model) does well on held-out cell lines, including those from fully held-out organs (perhaps because it has seen similar data during its own training or because it generalizes well). In contrast, the other models did not show significant improvements over the simple baseline.
