Spektron’s Tox21 Data Challenge Results
The Tox21 Data Challenge was a 2014–2015 toxicological modeling challenge organized by the US EPA, NIH, and NCATS (National Center for Advancing Translational Sciences). The challenge was to model a library of 10K+ molecules for 12 different toxicological endpoints (i.e., “tasks”). More information about the challenge and the endpoints can be found at the challenge website: https://tripod.nih.gov/tox21/challenge/. Information about the underlying Tox21 program from which the data was constructed can be found here: https://ntp.niehs.nih.gov/results/tox21/index.html
We present a summary and description of the Q-MAP™ modeling results against the twelve Tox21 data challenge endpoints. We also compare results from competitors, to the extent they are known. All results herein have been compiled against the predefined test set used in the Tox21 data challenge. The Tox21 data challenge is a commonly referenced modeling challenge in predictive toxicology. As such, achieving good performance against the Tox21 test set is a powerful statement about our platform’s predictive modeling capabilities.
A Note of Caution on “the Tox21 Test Set”
There is a point of caution when assessing performance on the Tox21 test set: there are actually three commonly used versions. Papers and websites discussing Tox21 performance results often do not clarify which version of the test set they used to generate results. The three test sets differ considerably in their constitution, however, and performance statistics can vary significantly among them.
1 – Final Evaluation Test Set
This set of 645 molecules is the official holdout test set for the Tox21 challenge. It is also the most challenging of the three. Nearly every team participating in the challenge obtained worse results on this dataset than on the other test sets. We have the most information on our competitors for this particular test set, so we will use it as our benchmark for the Tox21 challenge. This test set has some missing class labels, though not nearly as many as are missing from the official Tox21 training set.
2 – Leaderboard Test Set
This dataset was used to show performance on the leaderboard as the Tox21 challenge was underway. This test set got considerable attention and is referred to as “the Tox21 test set” in some papers and websites.
3 – DeepChem / MoleculeNet.ai Test Set
This test set was put together by the same group at Stanford that created the DeepChem chemical modeling library and the MoleculeNet.ai data sharing consortium. Performance statistics against this test set seem to be better for most modeling approaches than against the other two test sets. Perhaps for this reason, this test set seems to be popular with modeling groups. Confusingly, molecules in this test set are not labeled by name or ID code, only by SMILES strings.
Points to Note for the Tox21 Data in general
Highly Imbalanced Training Sets
The Tox21 training and test sets are highly imbalanced. Some of the 12 tasks have > 25 inactive molecules for every 1 active molecule. This presents serious challenges for modeling methods that need relatively well-balanced training sets.
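The degree of imbalance for a task can be quantified directly from its class labels; a minimal sketch (the label list below is invented for illustration):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of majority-class count to minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical task with 26 inactives (0) for every active (1)
labels = [0] * 260 + [1] * 10
print(imbalance_ratio(labels))  # 26.0
```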
Missing Activities (Class Labels)
There are many missing class labels for the Tox21 training set. That is, we do not know all the activities for all molecules. In fact, most molecules in the training set lack activities for most tasks. This means that any given task trained in isolation will have a much smaller training set than the entire training set.
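A sketch of what this means in practice, using a toy activity matrix (the values are invented; the real training set has thousands of molecules and 12 tasks):

```python
# Toy activity matrix: rows = molecules, columns = tasks.
# None marks a missing class label.
activities = [
    [1,    None, 0],
    [None, 0,    0],
    [0,    None, None],
    [None, 1,    0],
]

def effective_training_sizes(matrix):
    """Count the labeled molecules available to each task (column)."""
    n_tasks = len(matrix[0])
    return [sum(row[t] is not None for row in matrix) for t in range(n_tasks)]

print(effective_training_sizes(activities))  # [2, 2, 3]
```

Each task trained in isolation sees only its own column's labeled rows, hence the smaller effective training sets noted above.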
Spektron Models Tested
For this comparison, we tested three (3) different machine learning methods with three different molecular featurization algorithms. A featurization algorithm is a way in which molecular information is encoded into a format that is digestible for machine learning algorithms.
Modeling Method | Machine Learning Algorithm | Description |
---|---|---|
1 | Deep Learning GCN (Graph Convolutional Network) | A deep learning ANN which represents molecules as graphs of atoms and bond connections |
2 | XGBoost | A “boosting” classifier using “off the shelf” structural features |
3 | ANN (single hidden layer Artificial Neural Network) | Single hidden layer ANNs using Q-MAP™ features |
Performance Results Using the Official Tox21 Test Set
This comparison shows results using only the “official” Tox21 test set. The EPA provided predictions on the official test set for 116 competing teams. Note that of these 116 competitors, only 28 teams submitted predictions for all 12 tasks. For this analysis, we restrict our comparisons to teams that generated predictions for all 12 tasks.
Since the emphasis is on screening toxic molecules, the most important statistic in the challenge is sensitivity. But it is also important to maintain a sufficiently high specificity, or else one has a “dumb” classifier with no predictive skill. Hence, balanced accuracy (the average of sensitivity and specificity) is also quite important.
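Both statistics, and balanced accuracy, follow directly from the confusion-matrix counts; a minimal sketch (the labels and predictions below are invented):

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and balanced accuracy for binary labels (1 = toxic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
# sensitivity 0.75, specificity ~0.667, balanced accuracy ~0.708
print(classification_metrics(y_true, y_pred))
```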
Sensitivity
Sensitivity is also known as the “true positive rate”. For this challenge, it is the probability of correctly predicting that a toxic molecule is toxic. This statistic is extremely important for drug discovery because a high sensitivity implies a low number of false negatives. Note that for the Tox21 challenge, the number of toxic molecules for each of the 12 tasks is much smaller than the number of non-toxic molecules. Thus, it is easy for “positive” samples to get overwhelmed by the sheer abundance of “negative” non-toxic samples in such an imbalanced dataset. Numbers in this table denote the mean sensitivity of each model across all 12 tasks.
Ranking | Team | Mean Sensitivity |
1 | Spektron Deep Learning GCN | 0.7398 |
2 | swamidass lab | 0.7285 |
3 | Spektron XGBoost+ECFP | 0.6619 |
4 | Spektron ANN+Q-MAP™ Features | 0.6341 |
5 | tabula rasa | 0.5768 |
6 | golda | 0.5735 |
7 | vif innovations, llc | 0.5688 |
8 | t | 0.5604 |
9 | amaziz | 0.5473 |
10 | pass | 0.5369 |
11 | rcc.org.rs | 0.5214 |
12 | nci | 0.5206 |
13 | pass affinities | 0.4779 |
14 | kibutz | 0.4651 |
15 | mml | 0.4543 |
16 | structuralbionformatics@charite | 0.4365 |
17 | capuzzi_sc | 0.4282 |
18 | winter is coming | 0.3456 |
19 | sc464303 | 0.33 |
20 | bioinf@jku-ensemble2 | 0.3078 |
21 | bioinf@jku-ensemble4 | 0.3071 |
22 | bioinf@jku-ensemble1 | 0.2329 |
23 | bioinf@jku-ensemble3 | 0.2269 |
24 | bioinf@jku | 0.219 |
25 | roadrunner | 0.1645 |
26 | mlworks | 0.147 |
27 | the toxic avenger | 0.1324 |
28 | dmlab | 0.0775 |
29 | toxfit | 0.0435 |
30 | cgl | 0.0405 |
31 | frozenarm | 0.0318 |
Predictions for all models on all 12 tasks can be seen in the following plot:
Balanced Accuracy
Balanced accuracy is defined as the average of the sensitivity and specificity. This metric captures how well a model balances the two error types. A model that is high in sensitivity but low in specificity (or vice versa) is not a skilled model that one could rely on for predicting unknown molecules.
Ranking | Team | Mean Balanced Accuracy |
1 | Spektron Deep Learning GCN | 0.7331 |
2 | t | 0.6855 |
3 | amaziz | 0.6752 |
4 | Spektron XGBoost+ECFP | 0.6693 |
5 | golda | 0.668 |
6 | tabula rasa | 0.6649 |
7 | Spektron ANN+Q-MAP™ Features | 0.6618 |
8 | nci | 0.6567 |
9 | mml | 0.6537 |
10 | kibutz | 0.6463 |
11 | capuzzi_sc | 0.6435 |
12 | rcc.org.rs | 0.6389 |
13 | structuralbionformatics@charite | 0.6304 |
14 | pass | 0.6203 |
15 | pass affinities | 0.62 |
16 | winter is coming | 0.6189 |
17 | bioinf@jku-ensemble2 | 0.6161 |
18 | bioinf@jku-ensemble4 | 0.6159 |
19 | vif innovations, llc | 0.6146 |
20 | sc464303 | 0.5967 |
21 | bioinf@jku-ensemble1 | 0.593 |
22 | swamidass lab | 0.5923 |
23 | bioinf@jku-ensemble3 | 0.5904 |
24 | bioinf@jku | 0.5857 |
25 | roadrunner | 0.5562 |
26 | the toxic avenger | 0.5536 |
27 | mlworks | 0.5448 |
28 | dmlab | 0.5254 |
29 | toxfit | 0.5147 |
30 | cgl | 0.5131 |
31 | frozenarm | 0.5106 |
Predictions for all models on all 12 tasks can be seen in the following plot:
AUC (Area Under Curve)
AUC is a commonly used metric in binary classification problems. It denotes the integrated area under a ROC (Receiver Operating Characteristic) curve. AUC can help to convey a sense of how close a classifier is to being incorrect; i.e., how much margin a classifier has. This interpretation can become skewed, however, when the relative sizes of the two classes are highly imbalanced, as is the case with some of the tasks in the Tox21 data challenge. One can see that unskilled classifiers heavily skewed towards specificity over sensitivity still achieve respectable AUC values despite having very little predictive skill.
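AUC can also be computed without explicitly tracing the ROC curve, via an equivalent rank statistic; a minimal sketch (the scores below are invented):

```python
def roc_auc(y_true, scores):
    """AUC equals the probability that a randomly chosen positive receives a
    higher score than a randomly chosen negative (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two toxic (1) and two non-toxic (0) molecules
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```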
Ranking | Team | Mean AUC |
1 | Spektron Deep Learning GCN | 0.7782 |
2 | bioinf@jku-ensemble3 | 0.7779 |
3 | bioinf@jku-ensemble4 | 0.7779 |
4 | bioinf@jku-ensemble2 | 0.7688 |
5 | t | 0.7679 |
6 | bioinf@jku-ensemble1 | 0.7669 |
7 | bioinf@jku | 0.7652 |
8 | amaziz | 0.7639 |
9 | the toxic avenger | 0.7572 |
10 | dmlab | 0.7405 |
11 | tabula rasa | 0.7319 |
12 | kibutz | 0.7315 |
13 | Spektron XGBoost+ECFP | 0.7309 |
14 | winter is coming | 0.7265 |
15 | structuralbionformatics@charite | 0.721 |
16 | nci | 0.7204 |
17 | mml | 0.7195 |
18 | golda | 0.7183 |
19 | rcc.org.rs | 0.7158 |
20 | capuzzi_sc | 0.7151 |
21 | frozenarm | 0.7039 |
22 | toxfit | 0.6987 |
23 | cgl | 0.6961 |
24 | pass | 0.6724 |
25 | roadrunner | 0.668 |
26 | sc464303 | 0.6654 |
27 | vif innovations, llc | 0.6564 |
28 | pass affinities | 0.6389 |
29 | mlworks | 0.594 |
30 | swamidass lab | 0.5772 |
Predictions for all models on all 12 tasks can be seen in the following plot:
Specificity
Specificity is also known as the “true negative rate”. For this challenge, it is the probability of correctly predicting that a non-toxic molecule is not toxic. It is important that a model does not falsely declare molecules to be toxic at an excessive rate; otherwise, in our molecular design process, we would discard many harmless molecule designs and artificially restrict our set of drug candidates. Note that many of the teams with extremely poor sensitivities have extremely good specificities. This indicates that these models are unskilled, despite the high specificity.
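The point about unskilled high-specificity models can be made concrete with a toy imbalanced task (all numbers below are invented):

```python
# An imbalanced toy task: 270 non-toxic (0) and 10 toxic (1) molecules.
y_true = [0] * 270 + [1] * 10
# A "classifier" that calls almost everything non-toxic.
y_pred = [0] * 279 + [1]  # correctly flags only one toxic molecule

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

specificity = tn / (tn + fp)                         # 1.0: looks impressive
sensitivity = tp / (tp + fn)                         # 0.1: nearly useless
balanced_accuracy = (sensitivity + specificity) / 2  # 0.55: exposes the model
plain_accuracy = (tp + tn) / len(y_true)             # ~0.97: misleading
```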
Ranking | Team | Mean Specificity |
1 | frozenarm | 0.9894 |
2 | toxfit | 0.9859 |
3 | cgl | 0.9857 |
4 | the toxic avenger | 0.9747 |
5 | dmlab | 0.9732 |
6 | bioinf@jku-ensemble3 | 0.9539 |
7 | bioinf@jku-ensemble1 | 0.9531 |
8 | bioinf@jku | 0.9523 |
9 | roadrunner | 0.948 |
10 | mlworks | 0.9425 |
11 | bioinf@jku-ensemble4 | 0.9247 |
12 | bioinf@jku-ensemble2 | 0.9243 |
13 | winter is coming | 0.8923 |
14 | sc464303 | 0.8633 |
15 | capuzzi_sc | 0.8588 |
16 | mml | 0.8532 |
17 | kibutz | 0.8274 |
18 | structuralbionformatics@charite | 0.8243 |
19 | t | 0.8105 |
20 | amaziz | 0.8032 |
21 | nci | 0.7928 |
22 | golda | 0.7626 |
23 | pass affinities | 0.7622 |
24 | rcc.org.rs | 0.7563 |
25 | tabula rasa | 0.753 |
26 | Spektron Deep Learning GCN | 0.7264 |
27 | pass | 0.7037 |
28 | Spektron ANN+Q-MAP™ Features | 0.6895 |
29 | Spektron XGBoost+ECFP | 0.6764 |
30 | vif innovations, llc | 0.6604 |
31 | swamidass lab | 0.4561 |
For the purposes of virtual screening of molecules for toxicity, the sensitivity score of a classifier is a crucial indicator of predictive power. Sensitivity is defined as the ratio of true positive predictions to all actual positives, including those incorrectly predicted as negative (false negatives). False negatives are problematic for a toxicity virtual screen because they allow toxic molecules to pass through the screen, falsely identified as non-toxic.
Predictions for all models on all 12 tasks can be seen in the following plot:
Discussion of Results
We achieve superior performance on the metrics that matter most for toxicological screening: sensitivity and balanced accuracy. We also achieve leading performance on AUC, a performance metric of lesser importance here. Our performance in this challenge is a strong testament to the capabilities of our modeling platform.
GCN (Graph Convolutional Network) Deep Learning Neural Networks + Neural Graph Fingerprint Features
Spektron’s best individual model results were obtained with a Graph Convolutional Network, a powerful deep-learning-based method developed within the last few years.
Key Takeaways on the GCN DL Neural Network + Neural Graph Fingerprint
Encoding Method
This method used “graphs” of atom connections and bond types between atoms. One does not directly engineer machine learning features with this method. Rather, features that convey the most explanatory power are learned as part of the training process and are used later to featurize molecules outside of the training set. The process is analogous to CNNs (convolutional neural networks) that have revolutionized computer vision algorithms in recent years. The actual features used are not chosen (engineered) by a modeler – they are another parameter learned during training.
Molecular Geometry Not Explicitly Used
This methodology uses SMILES strings as inputs. The SMILES strings are parsed and converted to a graph of nodes and edges fed into the convolutional neural network.
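As a toy illustration of the idea (not the actual parser; real SMILES handling requires a cheminformatics library such as RDKit), a simple linear SMILES string can be turned into an atom/bond graph like so:

```python
def linear_smiles_to_graph(smiles):
    """Toy graph builder for unbranched SMILES with single-letter atoms
    (e.g. "CCO", "C=CO"). Real parsers handle rings, branches, charges, etc."""
    atoms, edges = [], []
    bond = 1  # 1 = single bond, 2 = double bond
    for ch in smiles:
        if ch == "=":
            bond = 2
        else:
            if atoms:  # connect the new atom to the previous one
                edges.append((len(atoms) - 1, len(atoms), bond))
            atoms.append(ch)
            bond = 1
    return atoms, edges

print(linear_smiles_to_graph("C=CO"))
# (['C', 'C', 'O'], [(0, 1, 2), (1, 2, 1)])
```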
Multitask Learning
One key difference between this method and many other forms of predictive modeling is that all 12 tasks were trained concurrently. In multitask learning, one task (endpoint) is able to receive information from other tasks in the training set. For this method to work advantageously, the machine learning algorithms used must have a hierarchical structure where low-level information can be shared across multiple tasks. This only works advantageously if the tasks are related to each other. For example, training multiple liver toxicities in a multitask manner may prove fruitful, but training very different tasks (PANSS efficacies, liver toxicity, blood-brain barrier permeability, …) would likely not yield a better model than what could be obtained by training tasks independently.
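In practice, concurrent training over a sparsely labeled matrix is handled by masking missing labels out of the loss. A minimal sketch of such a masked multitask loss (illustrative only, not our production code):

```python
import math

def masked_multitask_loss(y_true, y_prob):
    """Mean binary cross-entropy across all tasks, skipping missing labels (None).
    Masking is what lets every task train concurrently on a sparsely
    labeled activity matrix."""
    total, count = 0.0, 0
    for row_true, row_prob in zip(y_true, y_prob):
        for t, p in zip(row_true, row_prob):
            if t is None:
                continue  # missing label contributes nothing to the loss
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count

# Two molecules, two tasks; one label is missing
loss = masked_multitask_loss([[1, None], [0, 1]], [[0.9, 0.5], [0.2, 0.8]])
```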
XGBoost Classifier + ECFP Features
The XGBoost classifier showed encouraging performance on the official Tox21 data challenge test set, despite the fact that very little effort was put into this method: no hyperparameter optimization was performed, and we used the algorithm’s default settings. For this method, we used a featurization known as ECFP (Extended-Connectivity Fingerprints). This algorithm encodes the presence or absence of atoms / structural features and their connections to other atoms/features within a bond radius. The output is expressed as a binary bit string vector for each molecule. Information on ECFP features can be found here: ECFP / FCFP Features
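To convey the flavor of the algorithm, here is a toy sketch of the circular-neighborhood hashing idea; it is not RDKit’s ECFP implementation, and the atom/edge lists are hand-written:

```python
def toy_circular_fingerprint(atoms, edges, radius=2, n_bits=64):
    """Toy version of the ECFP idea: iteratively hash each atom together with
    its neighbors' identifiers, then fold every identifier seen into a
    fixed-length binary bit vector."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    ids = {i: hash(sym) for i, sym in enumerate(atoms)}  # radius-0 identifiers
    seen = set(ids.values())
    for _ in range(radius):
        # grow each atom's identifier by one bond radius
        ids = {i: hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
               for i in range(len(atoms))}
        seen.update(ids.values())
    bits = [0] * n_bits
    for identifier in seen:
        bits[identifier % n_bits] = 1  # fold into the bit string
    return bits

# Ethanol as an atom/edge list: C-C-O
fp = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```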
ECFP features share many common characteristics with the “neural fingerprint” features that we used to generate our best results. The XGBoost + ECFP classifier did not employ multitask learning. It is theoretically possible to employ multitask learning with such a method. This represents an opportunity for further study.
Single Hidden Layer ANN + Q-MAP™ Features
This was our worst performing model. Nevertheless, it still performed quite well in the most important metrics (sensitivity, balanced accuracy), beating nearly all of the 28 Tox21 data challenge teams for which we have full prediction data. The ANN+Q-MAP™ method achieved good sensitivity by using “downsample balancing” to ensure equal class size between actives and inactives in the training set. This meant that for some highly imbalanced tasks, the effective size of the training set was greatly reduced. Due to the massive size of the Tox21 training set, however, we still had amply sized training sets for all 12 tasks.
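A sketch of the downsample-balancing step (the samples are placeholders; the real inputs are featurized molecules):

```python
import random

def downsample_balance(samples, labels, seed=0):
    """Randomly discard majority-class samples so both classes end up equal in size."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    n = min(len(pos), len(neg))
    pos = rng.sample(pos, n)
    neg = rng.sample(neg, n)
    return pos + neg, [1] * n + [0] * n

# 100 hypothetical inactives vs 4 actives -> 4 of each after balancing
X, y = downsample_balance(list(range(104)), [0] * 100 + [1] * 4)
```

Note how the effective training set shrinks to twice the minority-class size, which is why highly imbalanced tasks ended up with much smaller training sets.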
We anticipate that Q-MAP™-based featurization methods will produce errors relatively uncorrelated with those of the GCN and ECFP algorithms. If so, it may prove advantageous to employ this featurization method in ensembles with the other methods.