Isak and I, with some great mentoring from Ian Smith (who rocks) and Vlado Dancik at the Broad Institute, did well in the second round of the drug combination synergy prediction DREAM challenge (subchallenge 1A). We sent in three sets of predictions around midnight last night, and this morning the results were posted. One of our submissions scored best on the global correlation score, out of 167 submissions! Each team can send in three sets of predictions in each round. There are 2 or 5 scoring metrics, depending on how you count.
I guess the scores aren't visible to non-contestants, so I'll post a screenshot of the top entries, sorted by score.
When we first started working on this project, we were disappointed at what appeared to be relatively low scores, and 0.29 for a global correlation score doesn't strike me as great, even if we did WIN THIS ROUND (still celebrating...). (I woke Isak up to tell him, and even so he's going to be late for school because of how late we were up last night working on this.) The global correlation score asks how well your predictions do over and above what you could achieve by predicting from only the cell line or the drug combination, the two main variables in the dataset. So it's a much higher bar than the plain correlation of actual with predicted values. While I understand the math behind the mean correlation score (which is taken across drug combinations), I haven't figured out the math behind the global correlation.
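For the curious, here is a rough sketch of the idea behind a mean correlation taken across drug combinations: compute the Pearson correlation between observed and predicted synergy within each combination, then average those values. This is only my simplified reading, not the challenge's official scoring code, which may weight or filter combinations differently. The function names `pearson` and `mean_correlation` and the toy rows are made up for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns None if either side is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def mean_correlation(rows):
    """Unweighted mean of per-combination correlations (an assumption,
    not the official metric). rows: (combination, observed, predicted)."""
    groups = defaultdict(lambda: ([], []))
    for combination, observed, predicted in rows:
        groups[combination][0].append(observed)
        groups[combination][1].append(predicted)

    correlations = []
    for observed, predicted in groups.values():
        if len(observed) < 2:
            continue  # a correlation needs at least two points
        r = pearson(observed, predicted)
        if r is not None:
            correlations.append(r)
    return sum(correlations) / len(correlations) if correlations else float("nan")


# Toy data: combination "A" is predicted in the right order, "B" in reverse.
toy_rows = [
    ("A", 1, 2), ("A", 2, 4), ("A", 3, 6),
    ("B", 1, 3), ("B", 2, 2), ("B", 3, 1),
]
toy_score = mean_correlation(toy_rows)
```

In the toy data, one combination is ranked perfectly and the other is ranked exactly backwards, so the two per-combination correlations cancel out in the average.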
We used the random forest machine learning algorithm, with 100 trees, as implemented in the Python scikit-learn library. As input, we provided the cell line, the drug combination, and the mutations and expression values of certain genes for each cell line. We selected some of the genes based on a Fisher test between synergy and mutation across the training data, but also selected genes based on the number of mutations across the cell lines, keeping only genes with sufficiently different patterns of mutations. For gene expression, we took a representative sample of highly varying genes that had different expression values.
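Here is a minimal sketch of what a setup like this can look like in scikit-learn. It is not our actual submission code. The data is random, the names (`cell_lines`, `combinations`, `mutations`, `expression`, `synergy`) are placeholders, and several choices are assumptions for illustration: one-hot encoding for the cell line and drug combination, a regressor rather than a classifier, and "highly varying genes" approximated by simply taking the top few genes by variance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_samples = 200

# Placeholder data standing in for the challenge's training set.
cell_lines = rng.choice(["CL1", "CL2", "CL3"], size=n_samples)
combinations = rng.choice(["D1.D2", "D1.D3", "D2.D3"], size=n_samples)
mutations = rng.integers(0, 2, size=(n_samples, 5))    # 0/1 per selected gene
expression = rng.normal(size=(n_samples, 50))          # expression of 50 genes
synergy = rng.normal(size=n_samples)                   # target to predict

# Assumption: keep the 4 expression genes with the highest variance.
top_genes = np.argsort(expression.var(axis=0))[::-1][:4]
expression_selected = expression[:, top_genes]

# Assumption: one-hot encode the two categorical inputs.
encoder = OneHotEncoder(handle_unknown="ignore")
categorical = encoder.fit_transform(
    np.column_stack([cell_lines, combinations])
).toarray()

# Final feature matrix: categories, mutations, selected expression values.
X = np.hstack([categorical, mutations, expression_selected])

# Random forest with 100 trees, as described above.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, synergy)
predictions = model.predict(X[:5])
```

Since the target here is random noise, the predictions mean nothing; the point is only the shape of the pipeline, from mixed categorical and numeric inputs to a fitted 100-tree forest.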
In other news, my neighbor Elizabeth Harkavy got into MIT! Here is a picture of Elizabeth playing orienteering capture-the-flag: