Spektron’s Tox21 Data Challenge Results

The Tox21 Data Challenge was a 2014 – 2015 toxicological modeling challenge organized by the US EPA, NIH, and NCATS (National Center for Advancing Translational Sciences). The challenge was to model a library of 10K+ molecules for 12 different toxicological endpoints (i.e. “tasks”). More information about the challenge and the endpoints can be found at the challenge website: https://tripod.nih.gov/tox21/challenge/. Information about the underlying Tox21 program from which the data was constructed can be found here: https://ntp.niehs.nih.gov/results/tox21/index.html

We present a summary and description of the Q-MAP™ modeling results against the twelve Tox21 data challenge endpoints. We also compare these results against those of competitors (to the extent they are known). All results herein have been compiled against the predefined test set used in the Tox21 data challenge. The Tox21 data challenge is a commonly referenced modeling challenge in predictive toxicology. As such, good performance against the Tox21 test set is a powerful statement about our platform’s predictive modeling skill.

A Note of Caution on “the Tox21 Test Set”

There is a point of caution when assessing performance on the Tox21 test set. There are actually three different commonly used versions, and papers and websites discussing Tox21 performance results often do not state which version they used to generate their results. The three test sets differ considerably in their constitution, and performance statistics can vary significantly between them.

1 – Final Evaluation Test Set

This set of 645 molecules is the official holdout test set for the Tox21 challenge.  It is also the most challenging data set.  Nearly every team participating in the challenge obtained worse results on this dataset vs. the other test sets.  We have the most information from our competitors on this particular test set, so we will use this as our benchmark for the Tox21 challenge.  This test set has some missing class labels, but not nearly as many are missing as for the official Tox21 training set.

2 – Leaderboard Test Set

This dataset was used to show performance on the leaderboard as the Tox21 challenge was underway.  This test set got considerable attention and is referred to as “the Tox21 test set” in some papers and websites.

3 – DeepChem / MoleculeNet.ai Test Set

This test set was put together by the same group at Stanford that created the DeepChem chemical modeling library and the MoleculeNet.ai data sharing consortium. Performance statistics against this test set seem to be better for most modeling approaches than against the other two test sets. Perhaps for this reason, this test set seems to be popular with modeling groups. Confusingly, molecules in this test set are not labeled by name or ID code, only by SMILES strings.

Points to Note for the Tox21 Data in general

Highly Imbalanced Training Sets

The Tox21 training and test sets are highly imbalanced.  Some of the 12 tasks have > 25 inactive molecules for every 1 active molecule.  This presents serious challenges for modeling methods that need relatively well-balanced training sets.

Missing Activities (Class Labels)

There are many missing class labels for the Tox21 training set.  That is, we do not know all the activities for all molecules.  In fact, most molecules in the training set lack activities for most tasks.  This means that any given task trained in isolation will have a much smaller training set than the entire training set.
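To make the effect of missing labels concrete, here is a minimal sketch of how the usable training set shrinks when a task is trained in isolation. The toy label matrix and the function name `per_task_training_sets` are hypothetical and are not taken from the Tox21 files; `None` stands for a missing activity.

```python
# Hypothetical toy data: one row per molecule, one column per task.
# 1 = active, 0 = inactive, None = activity not measured (missing label).
labels = [
    [1, None, 0],
    [0, None, None],
    [None, 0, 0],
    [0, 1, None],
]


def per_task_training_sets(label_rows):
    """Return, for each task, the indices of molecules that have a label."""
    n_tasks = len(label_rows[0])
    return [
        [i for i, row in enumerate(label_rows) if row[t] is not None]
        for t in range(n_tasks)
    ]


usable = per_task_training_sets(labels)
for task, rows in enumerate(usable):
    print(f"task {task}: {len(rows)} of {len(labels)} molecules usable")
```

In this toy example no single task can use all four molecules, even though every molecule carries at least one label.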

Spektron Models Tested

For this comparison, we tested three (3) different machine learning methods with three different molecular featurization algorithms.  A featurization algorithm is a way in which molecular information is encoded into a format that is digestible for machine learning algorithms.

Modeling Method | Machine Learning Algorithm | Description
1 | Deep Learning GCN (Graph Convolutional Network) | A deep learning ANN which represents molecules by graphs of atoms and bond connections
2 | XGBoost | A “boosting” classifier using “off the shelf” structural features
3 | ANN (single hidden layer Artificial Neural Network) | Single layer ANNs using Q-MAP™ features

Performance Results Using the Official Tox21 Test Set

This comparison shows results using only the “official” Tox21 test set.  The EPA provided predictions on the official test set for 116 competing teams.  Note that of these 116 competitors, only 28 teams submitted predictions for all 12 tasks.  For this analysis, we restrict our comparisons to teams that generated predictions for all 12 tasks.

Since the emphasis is on screening toxic molecules, the most important statistic in the challenge is sensitivity.  But it is also important to maintain a sufficiently high specificity, or else one has a “dumb” classifier with no predictive skill.  Hence, balanced accuracy (the average of sensitivity and specificity) is also quite important.


Sensitivity

Sensitivity is also known as the “true positive rate”. For this challenge, it is the probability of correctly predicting that a toxic molecule has toxicity. This statistic is extremely important for drug discovery because a high sensitivity requires a low number of false negatives. Note that for the Tox21 challenge, the number of toxic molecules for each of the 12 tasks is much smaller than the number of non-toxic molecules. Thus, it is easy for “positive” samples to get overwhelmed by the sheer abundance of “negative” non-toxic samples in such an imbalanced dataset. Numbers in this table denote the mean sensitivity of each model across all 12 tasks.

Ranking Team Mean Sensitivity
1 Spektron Deep Learning GCN 0.7398
2 swamidass lab 0.7285
3 Spektron XGBoost+ECFP 0.6619
4 Spektron ANN+Q-MAP™ Features 0.6341
5 tabula rasa 0.5768
6 golda 0.5735
7 vif innovations, llc 0.5688
8 t 0.5604
9 amaziz 0.5473
10 pass 0.5369
11 rcc.org.rs 0.5214
12 nci 0.5206
13 pass affinities 0.4779
14 kibutz 0.4651
15 mml 0.4543
16 structuralbionformatics@charite 0.4365
17 capuzzi_sc 0.4282
18 winter is coming 0.3456
19 sc464303 0.33
20 bioinf@jku-ensemble2 0.3078
21 bioinf@jku-ensemble4 0.3071
22 bioinf@jku-ensemble1 0.2329
23 bioinf@jku-ensemble3 0.2269
24 bioinf@jku 0.219
25 roadrunner 0.1645
26 mlworks 0.147
27 the toxic avenger 0.1324
28 dmlab 0.0775
29 toxfit 0.0435
30 cgl 0.0405
31 frozenarm 0.0318


Predictions for all models on all 12 tasks can be seen in the following plot:


Balanced Accuracy

Balanced accuracy is defined as the average of the sensitivity and specificity.  This metric is important for noting the balance of a model.  A model that is high in sensitivity but low in specificity (or vice versa) is not a skilled model that one could rely on for predicting on unknown molecules.
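The definitions can be written out in a few lines. This is a minimal sketch with hypothetical confusion-matrix counts; only the last call uses figures from this document (the mean sensitivity and mean specificity reported for the Spektron Deep Learning GCN). Because balanced accuracy is a simple average, the mean balanced accuracy across tasks equals the average of the mean sensitivity and mean specificity.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of actual actives predicted active."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: fraction of actual inactives predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(sens, spec):
    """Average of sensitivity and specificity."""
    return (sens + spec) / 2.0


# Hypothetical task with 40 actives and 600 inactives.
skilled = balanced_accuracy(sensitivity(tp=30, fn=10), specificity(tn=450, fp=150))

# A "dumb" classifier that calls every molecule inactive.
dumb = balanced_accuracy(sensitivity(tp=0, fn=40), specificity(tn=600, fp=0))

# Figures reported in this document for the Spektron Deep Learning GCN.
gcn = balanced_accuracy(0.7398, 0.7264)

print(skilled, dumb, round(gcn, 4))
```

The all-inactive classifier scores a perfect specificity but only 0.5 balanced accuracy, which is why the metric is useful for spotting unskilled models on imbalanced data.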

Ranking Team Mean Balanced Accuracy
1 Spektron Deep Learning GCN 0.7331
2 t 0.6855
3 amaziz 0.6752
4 Spektron XGBoost+ECFP 0.6693
5 golda 0.668
6 tabula rasa 0.6649
7 Spektron ANN+Q-MAP™ Features 0.6618
8 nci 0.6567
9 mml 0.6537
10 kibutz 0.6463
11 capuzzi_sc 0.6435
12 rcc.org.rs 0.6389
13 structuralbionformatics@charite 0.6304
14 pass 0.6203
15 pass affinities 0.62
16 winter is coming 0.6189
17 bioinf@jku-ensemble2 0.6161
18 bioinf@jku-ensemble4 0.6159
19 vif innovations, llc 0.6146
20 sc464303 0.5967
21 bioinf@jku-ensemble1 0.593
22 swamidass lab 0.5923
23 bioinf@jku-ensemble3 0.5904
24 bioinf@jku 0.5857
25 roadrunner 0.5562
26 the toxic avenger 0.5536
27 mlworks 0.5448
28 dmlab 0.5254
29 toxfit 0.5147
30 cgl 0.5131
31 frozenarm 0.5106


Predictions for all models on all 12 tasks can be seen in the following plot:

AUC (Area Under Curve)

AUC is a commonly used metric in binary classification problems. This metric denotes the integrated area under a ROC (Receiver Operating Characteristic) curve. AUC can help to convey a sense of how close a classifier is to being incorrect; i.e. how much margin a classifier has. This interpretation can become skewed, however, when the relative size of the two classes is highly imbalanced, as is the case with some of the tasks in the Tox21 data challenge. One can see that unskilled classifiers, which are heavily skewed towards specificity over sensitivity, can achieve respectable AUC values despite having very little predictive skill.
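One way to see how a classifier can pair a respectable AUC with poor sensitivity is that AUC depends only on how molecules are ranked, not on the decision threshold. The sketch below computes AUC through its rank interpretation (the probability that a randomly chosen active is scored above a randomly chosen inactive, with ties counted as one half). The scores, labels, and the 0.5 threshold are hypothetical and are not taken from any team's submission.

```python
def roc_auc(y_true, scores):
    """AUC as P(score of random active > score of random inactive)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical task: 2 actives, 6 inactives, all scores below 0.5.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.30, 0.35, 0.10, 0.10, 0.05, 0.05, 0.02]

auc = roc_auc(y_true, scores)

# Hard predictions at an assumed threshold of 0.5: nothing is called active.
preds = [1 if s >= 0.5 else 0 for s in scores]
tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
tn = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 0)
sens = tp / sum(y_true)
spec = tn / (len(y_true) - sum(y_true))

print(round(auc, 4), sens, spec)
```

Here the ranking is nearly perfect, so the AUC is high, yet at the chosen threshold the classifier never flags a toxic molecule.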

Ranking Team Mean AUC
1 Spektron Deep Learning GCN 0.7782
2 bioinf@jku-ensemble3 0.7779
3 bioinf@jku-ensemble4 0.7779
4 bioinf@jku-ensemble2 0.7688
5 t 0.7679
6 bioinf@jku-ensemble1 0.7669
7 bioinf@jku 0.7652
8 amaziz 0.7639
9 the toxic avenger 0.7572
10 dmlab 0.7405
11 tabula rasa 0.7319
12 kibutz 0.7315
13 Spektron XGBoost+ECFP 0.7309
14 winter is coming 0.7265
15 structuralbionformatics@charite 0.721
16 nci 0.7204
17 mml 0.7195
18 golda 0.7183
19 rcc.org.rs 0.7158
20 capuzzi_sc 0.7151
21 frozenarm 0.7039
22 toxfit 0.6987
23 cgl 0.6961
24 pass 0.6724
25 roadrunner 0.668
26 sc464303 0.6654
27 vif innovations, llc 0.6564
28 pass affinities 0.6389
29 mlworks 0.594
30 swamidass lab 0.5772


Predictions for all models on all 12 tasks can be seen in the following plot:


Specificity

Specificity is also known as the “true negative rate”. For this challenge, it is the probability of correctly predicting that a non-toxic molecule does not have toxicity. It is important that a model does not falsely declare molecules to be toxic at an excessive rate. Otherwise, in our molecular design process, we will discard many harmless molecule designs and artificially restrict our set of drug candidates. Note that many of the teams with extremely poor sensitivities have extremely good specificities. This indicates that these models are unskilled, despite the high specificity.

Ranking Team Mean Specificity
1 frozenarm 0.9894
2 toxfit 0.9859
3 cgl 0.9857
4 the toxic avenger 0.9747
5 dmlab 0.9732
6 bioinf@jku-ensemble3 0.9539
7 bioinf@jku-ensemble1 0.9531
8 bioinf@jku 0.9523
9 roadrunner 0.948
10 mlworks 0.9425
11 bioinf@jku-ensemble4 0.9247
12 bioinf@jku-ensemble2 0.9243
13 winter is coming 0.8923
14 sc464303 0.8633
15 capuzzi_sc 0.8588
16 mml 0.8532
17 kibutz 0.8274
18 structuralbionformatics@charite 0.8243
19 t 0.8105
20 amaziz 0.8032
21 nci 0.7928
22 golda 0.7626
23 pass affinities 0.7622
24 rcc.org.rs 0.7563
25 tabula rasa 0.753
26 Spektron Deep Learning GCN 0.7264
27 pass 0.7037
28 Spektron ANN+Q-MAP™ Features 0.6895
29 Spektron XGBoost+ECFP 0.6764
30 vif innovations, llc 0.6604
31 swamidass lab 0.4561


For the purposes of virtual screening of molecules for toxicity, the sensitivity score for a classifier is a crucial indicator of predictive power. Sensitivity is defined as the ratio of true positive predictions to all actual positives, including those predicted as negatives (false negatives). False negatives are problematic for a toxicity virtual screen because they allow toxic molecules to pass through the screen by falsely identifying them as non-toxic.

Predictions for all models on all 12 tasks can be seen in the following plot:

Discussion of Results

We achieve superior performance on the metrics that matter most for toxicological screening: sensitivity and balanced accuracy. We also achieve leading performance on AUC, a performance metric of lesser importance. Our performance in this challenge is a strong testament to the capabilities of our modeling platform.

GCN (Graph Convolutional Network) Deep Learning Neural Networks + Neural Graph Fingerprint Features

Spektron’s best individual model results were obtained with a Graph Convolutional Network.  This is a powerful deep learning based method that has been developed in the last two years.

Key Takeaways on the GCN DL Neural Network + Neural Graph Fingerprint

Encoding Method

This method used “graphs” of atom connections and bond types between atoms.  One does not directly engineer machine learning features with this method.  Rather, features that convey the most explanatory power are learned as part of the training process and are used later to featurize molecules outside of the training set.  The process is analogous to CNNs (convolutional neural networks) that have revolutionized computer vision algorithms in recent years.  The actual features used are not chosen (engineered) by a modeler – they are another parameter learned during training.

Molecular Geometry Not Explicitly Used

This methodology uses SMILES strings as inputs.  The SMILES strings are parsed and converted to a graph of nodes and edges fed into the convolutional neural network.
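The SMILES-to-graph step can be illustrated with a deliberately small parser. This is a sketch under stated assumptions, not the parser used for the results above: it handles only organic-subset atoms, aromatic lowercase atoms, explicit bond symbols, branches, and single-digit ring closures, and it treats an implicit bond between two aromatic atoms as aromatic (order 1.5). Bracket atoms, charges, and stereochemistry are not supported, and the function name `smiles_to_graph` is hypothetical.

```python
BOND_ORDER = {"-": 1.0, "=": 2.0, "#": 3.0, ":": 1.5}
TWO_LETTER = ("Cl", "Br")


def smiles_to_graph(smiles):
    """Parse a small SMILES subset into (atoms, bonds).

    atoms is a list of element symbols (nodes); bonds is a list of
    (atom_index, atom_index, bond_order) tuples (edges).
    """
    atoms, bonds = [], []
    prev = None         # index of the atom the next atom attaches to
    pending = None      # explicit bond order waiting for its second atom
    branch_stack = []   # attachment points saved at "("
    open_rings = {}     # ring digit -> (atom index, bond order or None)

    def add_bond(a, b, order):
        if order is None:
            # Implicit bond: aromatic if both ends are aromatic atoms.
            both_aromatic = atoms[a].islower() and atoms[b].islower()
            order = 1.5 if both_aromatic else 1.0
        bonds.append((a, b, order))

    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch.isdigit():
            if ch in open_rings:
                start, order = open_rings.pop(ch)
                add_bond(start, prev, pending if pending is not None else order)
            else:
                open_rings[ch] = (prev, pending)
            pending = None
        elif ch.isalpha():
            pair = smiles[i:i + 2]
            symbol = pair if pair in TWO_LETTER else ch
            atoms.append(symbol)
            if prev is not None:
                add_bond(prev, len(atoms) - 1, pending)
            prev, pending = len(atoms) - 1, None
            i += len(symbol) - 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += 1
    return atoms, bonds


acetic_acid = smiles_to_graph("CC(=O)O")
benzene = smiles_to_graph("c1ccccc1")
print(acetic_acid)
print(len(benzene[0]), "atoms,", len(benzene[1]), "bonds")
```

Note that the resulting graph carries only connectivity and bond types; no 3D coordinates appear anywhere, which is the point of the heading above. A production pipeline would normally rely on a cheminformatics toolkit rather than a hand-written parser.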

Multitask Learning

One key difference between this method and many other forms of predictive modeling is that all 12 tasks were trained concurrently.  In multitask learning, one task (endpoint) is able to receive information from other tasks in the training set.  For this method to work advantageously, the machine learning algorithms used must have a hierarchical structure where low-level information can be shared across multiple tasks.  This only works advantageously if the tasks are related to each other.  For example, training multiple liver toxicities in a multitask manner may prove fruitful, but training very different tasks (PANSS efficacies, liver toxicity, blood-brain barrier permeability, …) would likely not yield a better model than what could be obtained by training tasks independently.
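The hierarchical structure described above can be sketched as a shared hidden layer feeding one output per task. This is an illustrative toy, not the GCN used for the reported results: the layer sizes, random weights, and function names are assumptions. The masked loss shows one common way (not confirmed by the source) to train all tasks together when most labels are missing, by skipping tasks that have no label for a given molecule.

```python
import math
import random

N_FEATURES, N_HIDDEN, N_TASKS = 8, 4, 12  # assumed toy sizes

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


shared_w = rand_matrix(N_HIDDEN, N_FEATURES)  # low-level layer shared by all tasks
head_w = rand_matrix(N_TASKS, N_HIDDEN)       # one output row ("head") per task


def dense(x, weights):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def predict(x):
    """Return one activity probability per task from a shared hidden layer."""
    hidden = [math.tanh(h) for h in dense(x, shared_w)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, head_w)]


def masked_bce(probs, labels):
    """Mean binary cross-entropy over tasks that have a label (None = missing)."""
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y is None:
            continue
        p = min(max(p, 1e-7), 1.0 - 1e-7)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        count += 1
    return total / count if count else 0.0


x = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # hypothetical molecule features
probs = predict(x)
labels = [1, None, 0] + [None] * 9            # only 2 of 12 tasks are labeled
loss = masked_bce(probs, labels)
print(len(probs), round(loss, 4))
```

Every labeled task contributes gradient to the shared layer during training, which is how one endpoint can benefit from information carried by related endpoints.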

XGBoost Classifier + ECFP Features

The XGBoost classifier showed encouraging performance on the official Tox21 data challenge test set. This is despite the fact that very little effort was put into this method: no hyperparameter optimization was performed, and we used the algorithm's default settings. For this method, we used a featurization known as ECFP (Extended-Connectivity Fingerprints). This is an algorithm that encodes the presence or absence of atoms / structural features and their connection to other atoms / features within a bond radius. The output is expressed as a binary bit string vector for each molecule. Information on ECFP features can be found here: ECFP / FCFP Features
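The idea behind circular fingerprints can be sketched in plain Python. This is a simplified, Morgan-style illustration and not the real ECFP algorithm: it uses only the element symbol as the initial atom identifier, hashes with CRC32, and does not remove duplicate substructures. The function name, the graph format (a list of atom symbols plus `(i, j, bond_order)` tuples), and the default sizes are assumptions made for this example.

```python
import zlib


def circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Simplified circular fingerprint as a list of 0/1 bits."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        neighbors[a].append((order, b))
        neighbors[b].append((order, a))

    def stable_hash(obj):
        # CRC32 of the repr gives the same value on every run.
        return zlib.crc32(repr(obj).encode("utf-8"))

    bits = [0] * n_bits
    ids = [stable_hash(symbol) for symbol in atoms]  # radius-0 identifiers
    for ident in ids:
        bits[ident % n_bits] = 1

    for _ in range(radius):
        # Each atom's new identifier combines its own identifier with the
        # sorted (bond order, identifier) pairs of its direct neighbors.
        ids = [
            stable_hash((ids[i], sorted((order, ids[j]) for order, j in neighbors[i])))
            for i in range(len(atoms))
        ]
        for ident in ids:
            bits[ident % n_bits] = 1
    return bits


# Ethanol written with two different atom orderings (hand-built graphs).
ethanol_a = circular_fingerprint(["C", "C", "O"], [(0, 1, 1.0), (1, 2, 1.0)])
ethanol_b = circular_fingerprint(["O", "C", "C"], [(0, 1, 1.0), (1, 2, 1.0)])
print(sum(ethanol_a), "bits set out of", len(ethanol_a))
```

Because each identifier depends only on an atom's neighborhood, the fingerprint does not change when the atoms are listed in a different order. The resulting bit vector is the kind of fixed-length input a classifier such as XGBoost can consume.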

ECFP features share many common characteristics with the “neural fingerprint” features that we used to generate our best results.  The XGBoost + ECFP classifier did not employ multitask learning.  It is theoretically possible to employ multitask learning with such a method.  This represents an opportunity for further study.

Single Hidden Layer ANN + Q-MAP™ Features

This was our worst performing model. Nevertheless, it still performed quite well in the most important metrics, beating all but one of the 28 Tox21 data challenge teams for which we have full prediction data on sensitivity and all but four on balanced accuracy. The ANN+Q-MAP™ method achieved good sensitivity by using “downsample balancing” to ensure equal class size between actives and inactives in the training set. This meant that for some highly imbalanced tasks, the effective size of the training set was greatly reduced. Due to the massive size of the Tox21 training set, however, we still had amply sized training sets for all 12 tasks.
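Downsample balancing can be sketched as random undersampling of the larger class. This is a minimal illustration with hypothetical data; the function name `downsample_balance`, the seed, and the 25:1 class ratio are assumptions chosen to mirror the imbalance described earlier, not the actual training code.

```python
import random


def downsample_balance(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    actives = [i for i, y in enumerate(labels) if y == 1]
    inactives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((actives, inactives), key=len)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical task with 4 actives and 100 inactives (25 inactives per active).
molecules = [f"mol_{i}" for i in range(104)]
activity = [1] * 4 + [0] * 100

balanced_x, balanced_y = downsample_balance(molecules, activity)
print(len(balanced_x), sum(balanced_y))
```

In this example 104 molecules shrink to 8, which shows why the approach relies on a large starting training set.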

It is anticipated that Q-MAP™ based featurization methods will exhibit independence from GCN and ECFP algorithms.  In this way, we might find it to be advantageous to employ this featurization method in ensembles with other methods.