Screening a million compounds for the price of a few thousand?

April 05, 2017

Anne Carpenter

Biologists are developing increasingly complex, physiologically relevant assay systems and scaling them up for screens. From co-cultured cells to C. elegans to 3D organoids and tumor spheroids, these assay systems can be challenging, expensive, and lower-throughput, and they may rely on materials, such as human primary cells, that are in short supply.

Might there be a shortcut that lets you screen a huge chemical library without the expense? If you have image-based screens of a large compound set on hand, the Simm et al. paper offers a solution. The project was a collaboration among researchers from Janssen, KU Leuven, Johannes Kepler University Linz, Open Analytics, and ExaScience Life Lab/IMEC, with a little bit of help from the Carpenter lab! Instead of describing the precise study in the paper, let me describe the drug discovery lab of the future:

High-throughput screening, at a fraction of the cost

Imagine you’ve finally validated your complex assay system. It’s the perfect readout for the disease of interest. You begin by running your assay on a modestly sized set of compounds, say 5,000 (Figure, Step 1). This set of compounds is routinely used at your institution, so you already have Cell Painting data for every compound in it (Cell Painting is a microscopy assay that uses 6 stains imaged in 5 channels and labels 8 subcellular constituents).

Next, the computational experts on your team build a model that can predict your assay’s outcome based on some combination of the 1,400+ image-based features from Cell Painting (Figure, Step 2). In the absolute simplest case, a single image-based feature will correlate with results from your assay, but more commonly this will require machine learning to combine image features.
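
As a rough illustration of the simplest case in Step 2, the sketch below scores each image feature by how well it alone separates actives from inactives, using a rank-based AUC. The function names and the toy data are invented for illustration and are not from the Simm et al. paper; a real project would more likely combine many features with a machine learning model.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random active outscores a random inactive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def best_single_feature(profiles, labels):
    """Return (feature_index, auc) for the most predictive single feature."""
    best = (None, 0.0)
    for j in range(len(profiles[0])):
        column = [row[j] for row in profiles]
        a = auc(column, labels)
        a = max(a, 1.0 - a)  # an anti-correlated feature is just as useful
        if a > best[1]:
            best = (j, a)
    return best


# Toy data (invented): 6 compounds x 3 image features.
profiles = [
    [0.1, 5.0, 0.9],
    [0.2, 4.0, 0.1],
    [0.3, 6.0, 0.8],
    [0.7, 5.5, 0.2],
    [0.8, 4.5, 0.7],
    [0.9, 5.2, 0.3],
]
labels = [0, 0, 0, 1, 1, 1]  # 1 = active in the complex assay

feature, score = best_single_feature(profiles, labels)
print(feature, score)
```

In practice the AUC should be measured on compounds held out from model building; otherwise, with 1,400+ features to choose from, some feature will look predictive by chance.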

Successful model in hand, you proceed (Figure, Step 3) to do a “virtual” screen: that is, you use the model to computationally predict hits from a huge library of compounds. Your institution (or maybe even the NIH!) already invested in testing 1,000,000 compounds in the Cell Painting assay. It’s not a terribly expensive assay, and they knew the data would be used in hundreds of future screens.
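
A minimal sketch of what the virtual screen in Step 3 might look like in code: score every library compound with the trained model and keep the top-ranked ones for physical testing. The linear model, its weights, and the compound IDs here are hypothetical stand-ins; any scoring function that maps a profile to a number would slot in the same way.

```python
def virtual_screen(library, score_fn, top_n):
    """Rank library compounds by predicted activity; return the top_n IDs."""
    scored = [(score_fn(profile), cid) for cid, profile in library.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:top_n]]


# Hypothetical linear model over three image features (weights invented).
WEIGHTS = [2.0, -1.0, 0.5]


def linear_score(profile):
    """Weighted sum of image features as a stand-in for a trained model."""
    return sum(w * x for w, x in zip(WEIGHTS, profile))


# Toy stand-in for the million-compound Cell Painting library.
library = {
    "cpd-A": [0.9, 0.1, 0.4],
    "cpd-B": [0.2, 0.8, 0.1],
    "cpd-C": [0.7, 0.3, 0.9],
    "cpd-D": [0.1, 0.2, 0.3],
}

predicted_hits = virtual_screen(library, linear_score, top_n=2)
print(predicted_hits)
```

The value of top_n is a budget decision: it is the number of compounds you are willing to test physically in Step 4.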

This yields a list of compounds that are likely to be positive hits in your assay. So (Figure, Step 4), you physically test those compounds and find that a majority of them indeed have the desired activity and, as a bonus, that they have diverse chemical structures. There! You found most of the hits that you would have gotten in a million-compound screen, for the cost of screening a few thousand.

Now, bringing us back to reality: the Simm et al. study found that only 5.6% of their assays yielded a model good enough to pass Step 2. Still, even if this works only 5.6% of the time, it could literally save millions of dollars. And the future looks even brighter. One could readily increase the proportion of assays that are predictable by this strategy in a number of ways:

  • Use a more information-rich imaging assay as the source of profiles. Simm et al. had stains labeling only three cellular components (DNA, the cell body via CellMask Deep Red, and the glucocorticoid receptor), whereas the Cell Painting assay offers far more information content in each image (DNA, ER, mitochondria, Golgi, F-actin, nucleoli, cytoplasmic RNA, and plasma membrane).
  • Relax the quality cutoff. Simm et al. set a very stringent AUC threshold of 0.9 for Step 2; in reality, models with an AUC > 0.7 will still be very valuable (37.3% met that criterion).
  • Develop more advanced computational modeling methods. This is an area of active research.
  • Use multiple imaging assays instead of just one to produce profiles. This could include additional Cell Painting-like assays or simply combining data from all available historical imaging assays. Data could be leveraged from within a single institution or by combining publicly available data.
  • Add a second profiling modality. Gene expression measurements are also reasonably high-throughput, via L1000.
  • Use multiple compound doses, alternate cell lines, or alternate time points.
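
To make the effect of relaxing the quality cutoff concrete, here is a small sketch that counts how many assays would count as predictable at two different AUC thresholds. The AUC values are invented for ten hypothetical assays and are not the Simm et al. results.

```python
def fraction_passing(assay_aucs, cutoff):
    """Fraction of assays whose model AUC is strictly above the cutoff."""
    if not assay_aucs:
        raise ValueError("no assays supplied")
    return sum(1 for a in assay_aucs if a > cutoff) / len(assay_aucs)


# Invented per-assay model AUCs (illustrative only).
assay_aucs = [0.55, 0.62, 0.68, 0.71, 0.74, 0.79, 0.83, 0.88, 0.91, 0.95]

strict = fraction_passing(assay_aucs, 0.9)   # stringent cutoff
relaxed = fraction_passing(assay_aucs, 0.7)  # relaxed cutoff
print(strict, relaxed)
```

The trade-off is that a lower cutoff admits weaker models, so a smaller share of their predicted hits should be expected to confirm when tested physically.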

It is entirely possible that running large-scale compound screens becomes a thing of the past, or at least less common.