Automated high confidence compound identification of electron ionization mass spectra for nontargeted analysis


Journal of Chromatography A 1660 (2021) 462656

Joseph Bendik a,1, Richa Kalia a,1, Jeet Sukumaran b, William H. Richardot c,d, Eunha Hoh d,
Scott T. Kelley a,b,∗

a Department of Biology, San Diego State University, San Diego, CA, USA
b Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, CA 92104, USA
c San Diego State University Research Foundation, San Diego, CA, USA
d School of Public Health, San Diego State University, San Diego, CA, USA

Article info

Article history:
Received 30 July 2021
Revised 26 October 2021
Accepted 27 October 2021
Available online 31 October 2021

Keywords:
ChromaTOF
PyAutoGUI
Mass spectral comparison
Nontargeted analysis
Suspect screening
Machine learning

Abstract
Nontargeted analysis based on mass spectrometry is a rising practice in environmental monitoring for
identifying contaminants of emerging concern. Nontargeted analysis performed using comprehensive
two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS)
generates large numbers of possible analytes. Moreover, the default spectral library similarity score-based
search algorithm used by LECO® ChromaTOF® does not ensure that high similarity scores result in
correct library matches. An additional manual screening is therefore necessary, but it introduces human
errors, especially when dealing with large amounts of data. To improve the speed and accuracy of chemical
identification, we developed CINeMA.py (Classification Is Never Manual Again). This programming suite
automates GC×GC/TOF-MS data interpretation by determining the confidence of a match between the
observed analyte mass spectrum and the library hit generated by the LECO® ChromaTOF® software from
the NIST Electron Ionization Mass Spectral (NIST EI-MS) library. Our script allows the user to evaluate the
confidence of the match using an algorithmic method that mimics the manual curation process and two
different machine learning approaches (neural networks and random forest). The script allows the user
to adjust various parameters (e.g., similarity threshold) and study their effects on prediction accuracy.
To test CINeMA.py, we used data from two different environmental contaminant studies: an EPA study
on household dust and a study on stormwater runoff. Using a reference set based on the analysis
performed by highly trained users of the ChromaTOF and GC×GC/TOF-MS systems, the random forest model
had the highest prediction accuracies of 86% and 83% on the EPA and Stormwater data sets, respectively.
The algorithmic approach had the second-best prediction accuracy (82% and 79%), while the neural
network had the lowest accuracy (63% and 67%). All the approaches required less than 1 min to classify
986 observed analytes, whereas manual data analysis required hours or days to complete. Our methods
were also able to detect high confidence matches missed during the manual review. Overall, CINeMA.py
provides users with a powerful suite of tools that should significantly speed up data analysis while
reducing the possibilities of manual errors and discrepancies among users, and may be applicable to other
GC/EI-MS instrument-based nontargeted analysis.
© 2021 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license.
∗ Corresponding author at: Department of Biology, San Diego State University, San Diego, CA, USA
(S.T. Kelley).
1 These authors contributed equally to this work.

1. Introduction
Environmental monitoring for chemical contaminants typically requires using targeted analysis, in
which a priori information (mass spectra, retention times, etc.) on specific chemicals is used
to detect compounds of interest. While these methods are sensitive and quantitative for a known set of
compounds, they miss undefined compounds regardless of their abundance. Nontargeted analysis (NTA),
including suspect screening, was developed to detect multiple compounds simultaneously, including
novel compounds, and involves comprehensive sample preparation and chromatography followed by full
mass spectrometry analysis [1–3]. Comprehensive two-dimensional gas chromatography coupled with
time-of-flight mass spectrometry (GC×GC/TOF-MS) has proven to be one of the useful techniques for
performing NTA of environmental samples [4–7]. GC×GC/TOF-MS has a superior ability to identify
compounds due to the enhanced sensitivity and separation power of the GC×GC chromatography system
and acquisition of full scan mass spectra at low concentrations via the fast acquisition rate of the
TOF-MS compared to one dimensional GC coupled to a quadrupole MS [8]. As a result, the GC×GC/TOF-MS
based NTA generates thousands of chromatographic features, and each feature has a full scan mass
spectrum and chromatographic information.
Data analysis is required to process thousands of features, and in
NTA projects there are a signiﬁcant number of analytes that need
to be studied [9].
In GC×GC/TOF-MS based NTA, the raw data is analyzed using data processing software such as LECO®
ChromaTOF®. ChromaTOF’s “automatic peak search” first identifies features based upon certain
conditions (e.g., S/N ratio and GC retention time). Additionally, ChromaTOF’s peak table alignment
feature, “Statistical Compare”, enables users to make comparisons between groups of samples (e.g.,
samples vs. controls) to efficiently isolate compounds of interest. “Statistical Compare” aligns peaks
across sample groups based upon 1st and 2nd dimension GC retention times, as well as mass spectral
similarity. To identify compounds of interest, each peak is compared against the National Institute of
Standards and Technology electron ionization mass spectral (NIST EI-MS) library (or a custom MS
library, depending on the user). This comparison generates a ranked list of suggested compounds
(Library Hits), each with a “similarity score”. ChromaTOF computes this score with the NIST similarity
measure, which is based on the relative abundances of the matched pairs of masses and the abundance
ratios of adjacent matching peaks [10,11]. Afterwards, each library hit must be manually reviewed to
further evaluate the confidence of a match between the library hit mass spectrum and the observed mass
spectrum (after deconvolution), known as the Peak True mass spectrum in ChromaTOF.
Currently, the observed mass spectra and library hit mass spectra are either manually reviewed in
ChromaTOF, or the data can be exported as a PDF. Fig. 1A shows the workflow for manual data
analysis [12]. Once the best matches are obtained using the spectral library search algorithms,
analytical reference standards are procured, and their respective retention times and mass spectra
are obtained under the same GC×GC/TOF-MS instrumental conditions. The verification success rates were
94% and 96% in our studies [4,13]. This supports the notion that manual review works for determining
high confidence identifications. However, manual review can be time consuming and error prone when the
data size is large, and results can be inconsistent among users. For instance, reviewing thousands of
compounds’ mass spectra and their matching mass spectra from an MS library (e.g., the NIST EI-MS
library) can take many hours or even days depending on user experience. This high level of manual data
handling leads to numerous errors, necessitating multiple independent reviews of all results. Thus,
automation of these tasks would be extremely valuable for improving accuracy and increasing analysis
throughput [14].
To improve the speed and accuracy of identification based on mass spectral matching, we developed two
programs: chromaTOF_auto.py and CINeMA.py (Classification Is Never Manual Again). The
chromaTOF_auto.py script automates GC×GC/TOF-MS data download from the LECO® ChromaTOF® software,
while CINeMA.py facilitates the confirmation of analyte matches between the NIST mass spectral library
and the experimental mass spectra using two different approaches: an algorithmic method based
directly on the manual curation method, and machine learning approaches using neural networks and
random forests trained on manually curated data sets. These machine learning techniques have been used
for similar mass spectrometry applications in previous studies, demonstrating their potential ability
to aid compound identification [15–17]. Our results show that our scripts greatly reduce the time
needed for GC×GC/TOF-MS based nontargeted analysis, while still maintaining high accuracy.
2. Materials and methods
2.1. Automated data collection
All samples used to develop and test our scripts were analyzed using a Pegasus 4D GC×GC/TOF-MS (LECO,
St. Joseph, MI). LECO® ChromaTOF® software (version 4.50.8.0 optimized for Pegasus) was used for data
processing. The stormwater runoff samples (hereafter the Stormwater data set) were collected by the San
Francisco Estuary Institute (SFEI) from Napa, Sonoma, and Santa Rosa counties in California following
the 2017 Northern California wildfires [13]. The household dust samples (hereafter the EPA data set)
were provided as part of the U.S. Environmental Protection Agency (EPA)’s Nontargeted Analysis
Collaborative Trial (ENTACT), an inter-laboratory study designed to compare the various workflow
techniques implemented within the NTA research community [18,19]. In brief, participants were given a
series of samples in a blind trial, some of which had been spiked with a cocktail of various compounds,
and were instructed to conduct NTA. The EPA data set contained 986 observed analytes from the LECO®
ChromaTOF® analysis and the Stormwater data set contained 892 observed analytes. In the EPA data set,
409 compounds were manually reviewed as high confidence matches and 577 as low confidence. In the
Stormwater data set, 373 were reviewed as high confidence and 519 as low confidence.
The LECO® ChromaTOF® software assigns each chromatographic peak a name based upon mass spectral
similarity to compounds within the 2011 NIST EI-MS library. After isolating all compounds of interest
during review, the user sorts the “peak table” in ChromaTOF by “comment” and “peak number” so that the
compounds of interest are in sequential order. The “peak true” (deconvoluted mass spectra) data of all
compounds of interest were then exported in MSP format (peak_true.msp). Next, the mass spectra of each
compound’s assigned name from the 2011 NIST EI-MS library (library hit) were exported using the
chromaTOF_auto.py script. The chromaTOF_auto.py script is based on PyAutoGUI, a Python module that
controls the mouse and keyboard to automate any graphical user interface. PyAutoGUI reproduces human
actions such as moving, clicking and dragging the mouse, pressing and holding keys, and pressing
keyboard hotkey combinations [20]. Using this script, an analyst can extract the GC×GC/TOF-MS library
hit data from the LECO® ChromaTOF® software for further analysis in significantly less time and
with negligible human effort. The chromaTOF_auto.py script does not modify, manipulate, or extend the
software or databases of the LECO® ChromaTOF® software.
Fig. 1B shows the workflow for automated data download with chromaTOF_auto.py. The LECO® ChromaTOF®
workspace is arranged, from left to right, as follows: the directory for accessing tools and options
(Acquisition Que, GC and MS Methods, Acquired Samples, etc.), the peak table, and the library hit mass
spectrum (Fig. S1). The chromaTOF_auto.py script saves the library hit files sequentially in the
directory most recently used by the user, naming the files 1.msp, 2.msp, etc. for easy access.
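
The automation loop described in this section can be illustrated with a short Python sketch. It is not the chromaTOF_auto.py source: the function name, the screen coordinates, the fixed row height, and in particular the Ctrl+S export shortcut are hypothetical placeholders, and the real click sequence depends on the ChromaTOF window layout. The `gui` argument is any object with PyAutoGUI-style `click`, `hotkey`, `write`, and `press` methods, so the `pyautogui` module itself can be passed in.

```python
import time


def export_library_hits(gui, n_peaks, first_row_xy, row_height,
                        start_delay=10.0, step_pause=1.0, sleep=time.sleep):
    """Step through the peak table and save each library hit as 1.msp, 2.msp, ...

    All GUI specifics here (coordinates, shortcut, dialog handling) are
    assumptions made for illustration only.
    """
    # Delay timer: gives the analyst time to arrange the window first.
    sleep(start_delay)
    x, y0 = first_row_xy
    saved = []
    for i in range(n_peaks):
        gui.click(x, y0 + i * row_height)  # select the i-th peak (assumed fixed row height)
        gui.hotkey("ctrl", "s")            # HYPOTHETICAL export shortcut
        name = f"{i + 1}.msp"
        gui.write(name)                    # type the sequential file name
        gui.press("enter")                 # confirm the save dialog
        sleep(step_pause)                  # let the GUI catch up before the next peak
        saved.append(name)
    return saved
```

With PyAutoGUI installed, a call might look like `export_library_hits(pyautogui, 986, (400, 300), 18)`, where the coordinates are again placeholders. Passing the GUI object in as a parameter keeps the loop testable without a display.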
2.2. Data parsing

The data obtained from the GC×GC/TOF-MS data analysis by the LECO® ChromaTOF® software on both the EPA
and Stormwater data sets was parsed using CINeMA.py to extract: (1) Analyte name (“Name”); (2)
Mass-to-charge ratio (m/z) of the ions and their respective intensities; (3) Similarity Score between
the observed analyte and library hit #1 from the LECO® ChromaTOF® software (only present in library
hits); and (4) Total number of ions in the mass spectrum. This data was necessary for the script
CINeMA.py to analyze the confidence of a match between the observed analytes and library hits. In
addition, since the lowest mass spectral acquisition ion was m/z 50, the manual review of matches
ignores all ions below m/z 50 present in the library hit. CINeMA.py parsed all the files in the given
data directory into the required data structure to train, test, or make predictions, using either the
algorithmic model or the machine learning models [21–23]. Depending on the user action (predict,
train, or test), CINeMA.py requires the data directory to have a specific organizational structure
(Fig. 2).

Fig. 1. Workflow for manual (A) or automated (B) data analysis. An environmental sample, once
collected, is processed using GC×GC/TOF-MS and analyzed using the LECO® ChromaTOF® software. The LECO®
ChromaTOF® software outputs a list of observed analytes present in the sample. For a manual analysis,
the processed data for each observed analyte and its respective library hits are manually downloaded
by the analyst. Next, the analyst reviews the downloaded data to evaluate the confidence of the match
(High or Low) between the mass spectrum of each observed analyte and its corresponding library hit.
For the automated analysis, the user creates a directory to save the observed analytes’ (SA) library
hit files and then downloads them sequentially using the chromaTOF_auto.py script.

CINeMA.py results were benchmarked against those obtained via manual analysis to establish the
reliability of the CINeMA.py results and the effectiveness of CINeMA.py in reducing GC×GC/TOF-MS data
analysis time. The peak_true.msp file contains data for all the observed analytes together, as shown
in Fig. S2. To verify the completeness of the analyte data, the script parses the peak_true.msp file
using a state machine, as shown in Fig. S3. Finally, each compound’s library hit is output to an
individual file, as shown in Fig. S4.
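
A state-machine parser of this kind can be sketched in a few lines of Python. The sketch below is not the CINeMA.py parser: it assumes a NIST-style MSP layout (a `Name:` line, optional `Key: value` lines, a `Num Peaks:` line, then `m/z intensity` pairs separated by semicolons), and the function name and the `complete` flag are illustrative. The completeness check compares the declared peak count with the number of peaks actually read, and ions below m/z 50 are dropped afterwards, mirroring the manual review.

```python
def parse_msp(text, min_mz=50):
    """Parse MSP text into a list of records using a two-state machine."""
    records, record, state = [], None, "HEADER"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information in this sketch
        if state == "PEAKS" and ":" not in line:
            # Peak line: "mz intensity" pairs separated by semicolons.
            for pair in line.split(";"):
                fields = pair.split()
                if len(fields) >= 2:
                    record["peaks"].append((float(fields[0]), float(fields[1])))
            continue
        # Any "Key: value" line is header material; "Name" opens a new record.
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "name":
            record = {"name": value, "num_peaks": None, "peaks": []}
            records.append(record)
            state = "HEADER"
        elif key == "num peaks" and record is not None:
            record["num_peaks"] = int(value)
            state = "PEAKS"
        elif record is not None:
            record[key] = value  # keep other header fields as strings
    for rec in records:
        # Completeness check: declared peak count must match what was read.
        rec["complete"] = rec["num_peaks"] == len(rec["peaks"])
        # Drop ions below the acquisition range, as in the manual review.
        rec["peaks"] = [(mz, i) for mz, i in rec["peaks"] if mz >= min_mz]
    return records
```

A record whose `complete` flag is false signals a truncated or malformed export and can be reported before any classification is attempted.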

2.3. Algorithmic model
The algorithmic model, outlined in Fig. 3, begins by checking the similarity score against a
threshold, which is set to 600 (out of 999) by default in this study but can be changed by the user.
This NIST similarity score is output by the LECO® ChromaTOF® software and measures how closely the
observed analyte mass spectrum matches the library hit from the 2011 NIST EI-MS library. The user can
alter the similarity score threshold using the command line inputs for CINeMA.py. The algorithm then
compares the library hit mass spectra with the observed mass spectra from the LECO® ChromaTOF®
software. A match is deemed a “high confidence” match if all of the following are true: the
similarity score is greater than or equal to the user-provided threshold, the three most abundant ions
of the library hit are present in the observed mass spectrum (and vice versa), the molecular ion is
present, and the correlation percentage between the library hit spectrum and the observed mass
spectrum is at least 80%.

2.4. Machine learning models
Two types of machine learning approaches were used to determine if the best library hit is a high- or
low-confidence match to the observed mass spectra: a random forest algorithm and a neural network.
Random forest and neural networks were both selected for this study primarily because of their
effectiveness when working with classification problems such as this. Neural networks can analyze
complex relationships between inputs, which makes them a good choice for detecting differences in mass
spectra that can contain large amounts of ion intensity data. However, neural networks usually require
vast numbers of samples for training. Conversely, random forest works well with smaller amounts of
data with more clearly defined features, such as the spectral features a reviewer looks for during a
manual review. In addition, feature importance can be easily obtained with random forest, allowing the
user to visualize the aspects of their manual review that the machine considers the most important.

Fig. 2. Data directory structures. (A) Under the sample directory there is a subdirectory called
‘hits’ and the peak_true.msp file that contains the data for observed analytes. The user should use
the ‘hits’ directory to save all the library hit files obtained using chromaTOF_auto.py. Each sample
subdirectory should contain a compounds.tsv file, which contains the m/z ratio for the molecular ion
in the library hit file. (B) For training or testing the accuracy of a machine learning model with a
new data set, the root directory should contain subdirectories named after the samples. Each sample
subdirectory should contain a ground_truth.tsv file, which contains the manual interpretation of the
confidence of a match between observed analytes and library hits obtained from GC×GC/TOF-MS data
analysis by the LECO® ChromaTOF® software.

Fig. 3. Algorithmic model. If the similarity score from the LECO® ChromaTOF® software is less than the
similarity score threshold, the algorithm classifies the match as a low confidence match. Otherwise,
the model normalizes the spectrum data for both the observed analyte (SA) and the library hit (LH) and
checks the following set of conditions: (1) presence of the three most abundant ions (Top 3 ions) of
the library hit in the observed analyte, (2) presence of the molecular ion of the library hit in the
observed analyte, (3) presence of the top three ions of the observed analyte in the library hit, and
(4) correlation (≥ 80%) between the spectra of the library hit and the observed analyte. If all these
conditions are met, it interprets the match as a “high confidence match”.
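
The decision rule described in Fig. 3 can be sketched in Python as follows. This is a minimal illustration, not the CINeMA.py implementation: spectra are assumed to be dictionaries mapping integer m/z to intensity, an ion counts as "present" if its m/z appears in the other spectrum, and the "correlation percentage" is assumed here to be a Pearson correlation (times 100) over the union of m/z values. The function names and default arguments are illustrative.

```python
from math import sqrt


def normalize(spectrum):
    """Scale intensities so the base peak equals 100."""
    top = max(spectrum.values())
    return {mz: 100.0 * i / top for mz, i in spectrum.items()}


def top_ions(spectrum, n=3):
    """Return the m/z values of the n most abundant ions."""
    return sorted(spectrum, key=spectrum.get, reverse=True)[:n]


def correlation_percent(a, b):
    """Pearson correlation (x100) over the union of m/z values (assumed metric)."""
    mzs = sorted(set(a) | set(b))
    xs = [a.get(mz, 0.0) for mz in mzs]
    ys = [b.get(mz, 0.0) for mz in mzs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return 100.0 * cov / var if var else 0.0


def classify(observed, hit, similarity, molecular_ion,
             sim_threshold=600, corr_threshold=80.0):
    """Return "high" or "low" following the four checks of Fig. 3."""
    if similarity < sim_threshold:
        return "low"
    sa, lh = normalize(observed), normalize(hit)
    checks = (
        all(mz in sa for mz in top_ions(lh)),  # (1) top 3 of hit in observed
        molecular_ion in sa,                   # (2) molecular ion observed
        all(mz in lh for mz in top_ions(sa)),  # (3) top 3 of observed in hit
        correlation_percent(sa, lh) >= corr_threshold,  # (4) correlation
    )
    return "high" if all(checks) else "low"
```

Because both thresholds are plain arguments, the effect of a stricter or looser similarity cutoff can be explored by calling `classify` with different `sim_threshold` values on the same spectra.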


Fig. 4. Neural network model’s structure. The first 1000 inputs are the library hit ion intensities
and the next 1000 are the observed analytes’ ion intensities. There are three hidden layers of 1000,
100 and 10 neurons, each with softsign activation functions. The last layer of the network uses a
softmax activation function and is composed of two neurons for high or low predictions. The model was
trained with 5 epochs and a batch size of 128.
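
The 2000-element input described in the Fig. 4 caption can be assembled as shown in the sketch below. The caption fixes only the sizes and the order (library hit first, observed analyte second); placing each intensity at the index equal to its integer m/z is an assumption made here for illustration, as are the function names.

```python
def intensity_vector(spectrum, size=1000):
    """Place each ion's intensity at index = integer m/z (assumed layout)."""
    vec = [0.0] * size
    for mz, intensity in spectrum.items():
        idx = int(round(mz))
        if 0 <= idx < size:
            vec[idx] = float(intensity)
    return vec


def network_input(library_hit, observed, size=1000):
    """Concatenate library-hit and observed-analyte vectors: 2 * size inputs."""
    return intensity_vector(library_hit, size) + intensity_vector(observed, size)
```

For example, `network_input({128: 999, 51: 50}, {128: 900})` yields a list of 2000 values in which only positions 51, 128, and 1128 are nonzero.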

The input data for random forest consisted of the same mass spectra features checked when using the
algorithmic model: similarity score, correlation percentage, molecular ion presence, and the number of
top ions present in the hit that are also present in the observed analyte (and vice versa). The random
forest model was built in Python using the Scikit-Learn package [24,25]. The hyperparameters for the
model were tuned to optimize the accuracy metric, resulting in 100 trees and a max depth of 4. The
input data for the neural network consisted of the ion intensities for each observed analyte and its
best hit, so that the network could detect whether the two spectra are similar enough to be considered
a high-confidence match. This model was built in Python using the Keras and TensorFlow packages
[26,27]. Fig. 4 illustrates the structure of the neural network model. Activation functions, the
number of epochs, and the batch size were selected for the neural network based on the accuracy
metric, with the aid of GridSearchCV in the Scikit-Learn package. Model performance was examined
through confusion matrices, receiver operating characteristic (ROC) curves, and 10-fold cross
validation. All models were trained on one of the two data sets and tested on the other, using an
expert’s manual review of high and low confidence for the data labels. Additionally, to provide more
data for model training, these data sets were also combined into one large data set and then trained
and tested in three ways: (1) Train on 80% of the combined set and test on the remaining 20%; (2)
Train on 80% of the EPA data set plus 100% of the Stormwater data set, and then test on the remaining
20% of the EPA data set; (3) Train on 80% of the Stormwater data set plus 100% of the EPA data set,
and then test on the remaining 20% of the Stormwater data set. Random splits were performed in all
train-test split cases. CINeMA.py also allows the analyst to train and save their own machine learning
model on a given data set. The saved models can then be used for testing or making predictions on new
data sets.
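
As a worked check of the three combined-set schemes, the training and test set sizes can be computed from the data set sizes (986 EPA and 892 Stormwater analytes). The sketch below assumes that the 20% test share is rounded up to a whole compound; that rounding rule is an assumption, chosen because it reproduces the compound counts listed in Table 1.

```python
import math


def split_sizes(n, test_fraction=0.2):
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(n * test_fraction)
    return n - n_test, n_test


EPA, STORMWATER = 986, 892

epa_train, epa_test = split_sizes(EPA)
sw_train, sw_test = split_sizes(STORMWATER)
schemes = {
    "80/20 on combined set": split_sizes(EPA + STORMWATER),
    "80% EPA + all Stormwater, test 20% EPA": (epa_train + STORMWATER, epa_test),
    "80% Stormwater + all EPA, test 20% Stormwater": (sw_train + EPA, sw_test),
}
for name, (n_train, n_test) in schemes.items():
    print(f"{name}: train={n_train}, test={n_test}")
```

Under this assumption the three schemes give 1502/376, 1680/198, and 1699/179 training/test compounds, matching the corresponding rows of Table 1.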

2.5. Report generation
CINeMA.py generates reports in the form of two files, report.tsv and report.pdf. The report.tsv file
contains the peak number, the name of the observed analyte, and the predicted match between the
library hit and the observed analyte. The report.pdf file contains mirror plots of each observed
analyte’s mass spectrum against its corresponding library hit’s mass spectrum [28]. Fig. 5 shows
example plots of high and low confidence matches. The plots allow the analyst to visually compare the
observed analyte’s mass spectrum and the corresponding library hit’s mass spectrum if desired. The
mirror plot of the two mass spectra makes visual comparison easier than with the two separate plots
produced by the LECO® ChromaTOF® software. When training a neural network model, CINeMA.py produces
model_performance.pdf containing loss curves for each fold during cross-validation, shown in Fig. 6.
When testing either of the machine learning models, the script produces measures.pdf containing the
confusion matrix and the ROC curve, as in Fig. 7 [29]. By considering low-confidence matches as
“negatives” and high-confidence matches as “positives”, the user can use the confusion matrix to
calculate performance metrics such as accuracy, sensitivity, specificity, and balanced accuracy. When
training the random forest model with feature input data, the script produces importance.pdf
containing a bar plot with the relative importance of each feature (Fig. 8). Source code for
chromaTOF_auto.py and CINeMA.py, along with tutorials and test data, is available on GitHub.

Fig. 5. Mirror plots comparing observed analyte and library hit mass spectra. The mirror plots are
provided by CINeMA.py for all matches from the non-targeted analysis to the given library spectra,
allowing straightforward manual confirmation. The top spectrum (positive values in blue) is the
spectrum from the observed analyte in the sample, while the bottom mirrored spectrum (negative values
in red) is the spectrum of the corresponding library hit for the observed analyte. (A) An example of
a high confidence library match. (B) An example of a low confidence match (For interpretation of the
references to color in this figure legend, the reader is referred to the web version of this article.).

Fig. 6. Example training loss generated during one fold of the 10-fold cross validation on the EPA
data set. The blue curve (top) indicates the loss on the training samples, and the orange curve
indicates the loss on the samples held out for validation in that fold. This result shows the neural
network model loss using ion intensity data, trained for 5 epochs with a batch size of 128 (For
interpretation of the references to color in this figure legend, the reader is referred to the web
version of this article.).

Fig. 8. Example feature importance for the random forest model trained on the EPA data set and tested
on the Stormwater data set.
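
The performance metrics named in Section 2.5 (accuracy, sensitivity, specificity, and balanced accuracy) follow directly from the four cells of a confusion matrix. The sketch below treats high-confidence matches as positives and low-confidence matches as negatives, as in the text; the function names and the label strings "high" and "low" are illustrative and are not taken from CINeMA.py.

```python
def confusion_counts(truth, predicted, positive="high"):
    """Count true/false positives and negatives from paired label lists."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Derive the summary metrics from the four confusion-matrix cells."""
    sensitivity = tp / (tp + fn)  # share of true "high" matches recovered
    specificity = tn / (tn + fp)  # share of true "low" matches recovered
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }
```

Balanced accuracy is worth reporting alongside plain accuracy here because both data sets contain more low-confidence than high-confidence matches.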

3. Results and discussion

The automated data collection workflow implemented in chromaTOF_auto.py needed only a few minutes on
an Intel® Core™ i7-6700 quad-core CPU with 8 GB RAM running 64-bit Windows® 10 to download the library
hit data (*.msp) files from a given GC×GC/TOF-MS data analysis output of the LECO® ChromaTOF®
software. Because of its computational speed, chromaTOF_auto.py initially caused the ChromaTOF® GUI to
crash. To overcome this issue, we included a delay timer in the chromaTOF_auto.py script, allowing the
user to set up the screen as described above before the automation takes over to download the library
hit files.
For generating predictions, CINeMA.py was able to produce results within a minute. When testing the
algorithmic model on the complete data sets, an accuracy of 81.54% was achieved on the 986 compounds
in the EPA set and an accuracy of 78.70% on the 892 compounds in the Stormwater set. For the machine
learning models, the highest accuracy and area under the ROC curve (AUC) on the complete EPA set were
achieved using the random forest model on the algorithm’s feature data (accuracy 85.60%, AUC 0.887).
The highest accuracy and AUC on the Stormwater set were also achieved using the random forest model on
the algorithm’s feature data (accuracy 82.85%, AUC 0.899; Table 1). The neural network did not perform
as well as the other models based on the testing accuracies, AUC scores, and cross-validation
accuracies (Tables 1 and 2). Combining data sets did, however, somewhat improve the testing accuracy
and AUC score for this model (Table 1). Agreement rates between the human user’s decision and a
model’s decision were similar for “High” and “Low” confidence, with slightly higher agreement by the
algorithmic model for “High” than for “Low” (Table S1). This demonstrates that the models work equally
well for compound identification regardless of “High” or “Low” confidence matching.
Fig. 7. Example efficacy outputs following random forest model training on the EPA data set and
testing on the Stormwater data set. (A) Confusion matrix. (B) Receiver operating characteristic (ROC)
curve.

Table 1
Random forest and neural network model performances across the EPA dust and Stormwater data sets.
Includes the number of compounds present in the training and test sets, the accuracy on the test set,
and the Area Under the ROC Curve (AUC) score.

Model / data split                                        #Training Compounds  #Test Compounds  Testing Accuracy  AUC
RF Features
Train EPA Test Stormwater                                 986                  892              82.85%            0.899
Train Stormwater Test EPA                                 892                  986              85.60%            0.887
Train 80% EPA Test 20% EPA                                788                  198              89.39%            0.943
Train 80% Stormwater Test 20% Stormwater                  713                  179              81.01%            0.883
Train 80% (EPA + Stormwater) Test 20% (EPA + Stormwater)  1502                 376              82.45%            0.873
Train EPA + 80% Stormwater Test 20% Stormwater            1699                 179              79.33%            0.887
Train Stormwater + 80% EPA Test 20% EPA                   1680                 198              89.39%            0.936
NN Intensities
Train EPA Test Stormwater                                 986                  892              66.82%            0.706
Train Stormwater Test EPA                                 892                  986              63.49%            0.641
Train 80% EPA Test 20% EPA                                788                  198              75.25%            0.846
Train 80% Stormwater Test 20% Stormwater                  713                  179              69.83%            0.742
Train 80% (EPA + Stormwater) Test 20% (EPA + Stormwater)  1502                 376              70.48%            0.761
Train EPA + 80% Stormwater Test 20% Stormwater            1699                 179              69.83%            0.732
Train Stormwater + 80% EPA Test 20% EPA                   1680                 198              77.78%            0.836

Table 2
10-fold cross validation mean accuracy +/- standard deviation on the two neural network models across
the EPA dust and Stormwater data sets.

NN Intensities                                            Mean accuracy (+/- standard deviation)
Train EPA Test Stormwater                                 74.44% (+/- 3.28%)
Train Stormwater Test EPA                                 71.30% (+/- 4.46%)
Train 80% EPA Test 20% EPA                                72.21% (+/- 5.31%)
Train 80% Stormwater Test 20% Stormwater                  70.83% (+/- 4.79%)
Train 80% (EPA + Stormwater) Test 20% (EPA + Stormwater)  73.50% (+/- 2.68%)
Train EPA + 80% Stormwater Test 20% Stormwater            75.40% (+/- 2.99%)
Train Stormwater + 80% EPA Test 20% EPA                   73.10% (+/- 2.70%)

To identify reasons for discrepancy between classifications (human vs. computer), we manually reviewed
“incorrect” classifications. The main source of discrepancy between human classifications and the
algorithm’s classifications appeared to be instances in which the observed mass spectra and library
hit mass spectra were very similar but on the cusp between high and low confidence. This often
occurred when the library hit mass spectra contained numerous ions with low relative abundance. Since
NTA of environmental samples often involves the detection of trace contaminants, compounds present at
low concentrations may not produce enough low abundance ions to be detected by the mass spectrometer.
As the algorithm is confined by a strict set of rules (e.g., correlation percentage ≥ 80%), some
compounds may be classified as “low” where a human user would take additional factors into account and
classify them as “high”.
Additionally, both the algorithm and random forest model corrected human errors. As shown in Fig. S5,
some compounds for which the observed mass spectra and library hit mass spectra were near perfect
matches were erroneously classified as “not a match (low)” by the human user but classified correctly
as a “match (high)” by the algorithm. Conversely, there were instances in which the observed mass
spectra and library hit mass spectra did not match well but were classified as “high” by the human
user and as “low” by the algorithm. Such errors were due to fatigue experienced by the human user
comparing hundreds of mass spectral matches in succession.
While the random forest model had the highest accuracy scores, there are still some benefits to using
a simplified algorithm over the machine learning techniques. The simplified algorithm is capable of
working with extremely small data sets and does not require an outside source of data for training.
Both types of machine learning techniques require data for training and, especially in the case of
neural networks, large amounts of data may be necessary. The algorithmic approach avoids this issue,
so users may prefer it over training their own machine learning model. This data requirement may also
explain the low performance metrics of the neural network compared to the other models, as the number
of samples contained in the data sets was relatively small for this type of model. Furthermore, the
algorithm is easily tunable, allowing the user to specify their own similarity score and correlation
percentage thresholds when testing their own data sets. This ability to easily tune the algorithm
makes it applicable for use with programs other than ChromaTOF, as their spectral matching components
may use a scale different from ChromaTOF’s similarity score (0–999).

4. Conclusions
Overall, the random forest model provided the best accuracy
value for both data sets, and we showed that compounds missed
by the algorithm were often recognized by machine learning.
Furthermore, by ranking feature importance, a machine learning
approach can highlight ways to improve the algorithmic approach
by illustrating which feature thresholds can be tuned in the
algorithm. The neural network model with intensities has the
potential to predict unknown rules and patterns for analyzing the data
set, which the feature-using models lack. Feature models are based
on man-made rules and likely have room for improvement, since
it can be difficult to hardcode all possible rules. Thus, in
principle, with larger data sets a neural network approach using ion
intensities has the potential to find patterns and rules that cannot
be coded via an algorithm. Furthermore, this approach can be improved
by increasing the size and accuracy of training data sets. In future work,
we will continue to explore the potential of neural networks with
intensity data to enhance the accuracy of NTA.
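As a rough illustration of how a feature-importance ranking can point to the thresholds worth tuning, the sketch below uses permutation importance, a model-agnostic technique swapped in here for simplicity. It is not the random forest importance measure used in the study. The stand-in `model`, the feature names, and the toy data are all hypothetical, and the labels are generated by the stand-in model itself so that the baseline accuracy is 1.0.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction equals the label."""
    hits = sum(predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, n_features,
                           n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def model(row):
    """Stand-in classifier (hypothetical); ignores the third feature."""
    similarity, correlation, _unused = row
    return similarity >= 700 and correlation >= 80


names = ["similarity", "correlation", "unused_feature"]
rows = [[850, 92, 3], [900, 88, 7], [780, 95, 1], [820, 85, 9],
        [600, 90, 4], [650, 93, 6], [880, 40, 2], [910, 55, 8]]
labels = [model(r) for r in rows]

importances = permutation_importance(model, rows, labels, len(names))
ranking = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Features whose shuffling lowers accuracy the most sit at the top of the ranking and are natural candidates for threshold tuning. A feature the model never consults, such as `unused_feature` here, scores exactly zero.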
In terms of speed, CINeMA.py is able to provide prediction results
within a minute. Manual data analysis by multiple people required
hours or even days for the same data sets of observed analytes.
CINeMA.py's capacity to rapidly evaluate the confidence of a
match between observed analytes and library matches represents
a significant improvement over manual analysis, which can take
substantial time depending on the data size and can be error-prone
during heavy data handling. CINeMA.py gives the user the flexibility
not only to automate the interpretation of the confidence
of the match between observed analytes and their corresponding library
matches, but also to experiment with various test parameters to
study their effects on the analysis. In addition, the user can choose to
use the algorithmic model, any of the machine learning models, or
both to analyze their data and compare their predictions. The user
can also train the machine learning models with
relevant data sets to improve predictions on new data sets. Because
the machine learning approaches can find rules and patterns that
cannot be coded via standard algorithmic approaches, these techniques
have potential for compound identification in the future. Although
our study was conducted primarily on LECO's ChromaTOF
platform, these approaches are applicable to other GC–MS based
nontargeted and/or suspect screening analyses for high matching
compound identification by EI mass spectral similarity comparison.
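Comparing the predictions of the algorithmic model with those of a machine learning model, as described above, might look like the following sketch. The function name, the dictionary layout, and the analyte identifiers are assumptions made for illustration and do not reflect CINeMA.py's actual output format.

```python
def compare_predictions(algorithm_calls, model_calls):
    """Group analytes by whether the two approaches agree.

    Both arguments map an analyte identifier to True (confident match)
    or False. Only analytes present in both mappings are compared.
    """
    report = {"both_accept": [], "both_reject": [],
              "model_only": [], "algorithm_only": []}
    for analyte in sorted(set(algorithm_calls) & set(model_calls)):
        by_algorithm = algorithm_calls[analyte]
        by_model = model_calls[analyte]
        if by_algorithm and by_model:
            report["both_accept"].append(analyte)
        elif by_model:
            report["model_only"].append(analyte)
        elif by_algorithm:
            report["algorithm_only"].append(analyte)
        else:
            report["both_reject"].append(analyte)
    return report


# Hypothetical calls for four analytes.
algorithm_calls = {"a1": True, "a2": False, "a3": False, "a4": True}
model_calls = {"a1": True, "a2": True, "a3": False, "a4": False}

report = compare_predictions(algorithm_calls, model_calls)
print(report)
```

The `model_only` group corresponds to compounds missed by the algorithm but recognized by machine learning, and is the natural place to start a manual review.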

[7] K.A. Phillips, A. Yau, K.A. Favela, K.K. Isaacs, A. McEachran, C. Grulke,
A.M. Richard, A.J. Williams, J.R. Sobus, R.S. Thomas, J.F. Wambaugh, Suspect
Screening analysis of chemicals in consumer products, Environ. Sci. Technol.
52 (2018) 3125–3135, doi:10.1021/acs.est.7b04781.
[8] J.M.D. Dimandja, Peer reviewed: GC X GC, Anal. Chem. 76 (2004) 167–174
10.1021/ac041549+.
[9] I.A. Titaley, O.M. Ogba, L. Chibwe, E. Hoh, P.H.Y. Cheong, S.L.M. Simonich, Automating data analysis for two-dimensional gas chromatography/time-of-ﬂight
mass spectrometry non-targeted analysis of comparative samples, J. Chromatogr. A 1541 (2018) 57–62, doi:10.1016/j.chroma.2018.02.016.
[10] LECO accurate mass library. />accurate- mass- library- 209- 272/viewdocument?Itemid=1761.
[11] S.E. Stein, Estimating probabilities of correct identiﬁcation from results of mass
spectral library searches, J. Am. Soc. Mass Spectrom. 5 (1994) 316–323, doi:10.
1016/1044- 0305(94)85022- 4.
[12] E.G. Xu, W.H. Richardot, S. Li, L. Buruaem, H.H. Wei, N.G. Dodder, S.F. Schick,
T. Novotny, D. Schlenk, R.M. Gersberg, E. Hoh, Assessing toxicity and in vitro
bioactivity of smoked cigarette leachate using cell-based assays and chemical
analysis, Chem. Res. Toxicol. 32 (2019) 1670–1679, doi:10.1021/acs.chemrestox.
9b00201.
[13] D. Chang, W.H. Richardot, E.L. Miller, N.G. Dodder, M.D. Sedlak, E. Hoh, R. Sutton, Framework for non-targeted investigation of contaminants released by
wildﬁres into stormwater runoff: case study in the Northern San Francisco Bay
area, Integr. Environ. Assess. Manag. (2021) Online ahead of print, doi:10.1002/
ieam.4461.
[14] H. Mol, Non-targeted is our target, The Anal. Scientist (2013) https://
theanalyticalscientist.com/techniques- tools/non- targeted- is- our- target.
[15] E.D. Strozier, D.D. Mooney, D.A. Friedenberg, T.P. Klupinski, C.A. Triplett, Use of
comprehensive two-dimensional gas chromatography with time-of-ﬂight mass
spectrometric detection and random forest pattern recognition techniques for
classifying chemical threat agents and detecting chemical attribution signatures, Anal. Chem. 88 (2016) 7068–7075, doi:10.1021/acs.analchem.6b00725.
[16] F. Allen, A. Pon, R. Greiner, D. Wishart, Computational prediction of electron ionization mass spectra to assist in GC/MS compound identiﬁcation, Anal.

Chem. 88 (2016) 7689–7697, doi:10.1021/acs.analchem.6b01622.
[17] D.D. Matyushin, A.Y. Sholokhova, A.K. Buryak, Deep learning driven GC-MS
library search and its application for metabolomics, Anal. Chem. 92 (2020)
11818–11825, doi:10.1021/acs.analchem.0c02082.
[18] E.M. Ulrich, J.R. Sobus, C.M. Grulke, A.M. Richard, S.R. Newton, M.J. Strynar,
K. Mansouri, A.J. Williams, EPA’s non-targeted analysis collaborative trial (ENTACT): genesis, design, and initial ﬁndings, Anal. Bioanal. Chem. 411 (2019)
853–866, doi:10.10 07/s0 0216- 018- 1435- 6.
[19] S.R. Newton, J.R. Sobus, E.M. Ulrich, R.R. Singh, A. Chao, J. McCord, S. LaughlinToth, M. Strynar, Examining NTA performance and potential using fortiﬁed
and reference house dust as part of EPA’s non-targeted analysis collaborative trial (ENTACT), Anal. Bioanal. Chem. 412 (2020) 4221–4233, doi:10.1007/
s00216- 020- 02658- w.
[20] A. Sweigart, PyAutoGUI, GitHub Repository, 2014. />asweigart/pyautogui.
[21] V. Keleshev, Docopt, GitHub Repository, 2012. />docopt.
[22] C.R. Harris, K.J. Millman, S.J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N.J. Smith, R. Kern, M. Picus, S. Hoyer,
M.H. van Kerkwijk, M. Brett, A. Haldane, J.F. del Río, M. Wiebe, P. Peterson,
P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke,
T.E. Oliphant, Array programming with NumPy, Nature 585 (2020) 357–362,
doi:10.1038/s41586- 020- 2649- 2.
[23] W. McKinney, Data structures for statistical computing in python, in: Proceedings of the 9th Python in Science Conference, 1, 2010, pp. 56–61, doi:10.25080/
majora- 92bf1922- 00a.
[24] F. Pedregosa, O. Grisel, R. Weiss, A. Passos, M. Brucher, G. Varoquax, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, M. Brucher, Scikit-learn: machine learning in python, J. Mach.
Learn. Res. 12 (2011) 2825–2830.
[25] G. Varoquaux, Joblib, GitHub Repository, 2009. />[26] F. Chollet, Keras, GitHub Repository, 2015. />[27] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane,
R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner,
I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O.
Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow:
large-scale machine learning on heterogeneous distributed systems, (2016).
/>[28] J.D. Hunter, Matplotlib: a 2D graphics environment, Comput. Sci. Eng. 9 (2007)

90–95, doi:10.1109/MCSE.2007.55.
[29] M. Waskom, Seaborn, GitHub Repository, 2013. />seaborn.

Declaration of Competing Interest
The authors declare they have no known competing ﬁnancial
interests.
CRediT authorship contribution statement
Joseph Bendik: Software, Investigation, Formal analysis, Validation, Visualization, Writing – original draft. Richa Kalia: Software, Investigation, Visualization, Formal analysis, Writing – original draft. Jeet Sukumaran: Software, Methodology. William H.
Richardot: Validation, Data curation, Resources. Eunha Hoh:
Methodology, Validation, Funding acquisition, Writing – review &
editing. Scott T. Kelley: Conceptualization, Writing – original draft,
Writing – review & editing, Supervision, Project administration.
Funding
This work was funded in part by a California Tobacco-Related
Disease Research Program grant (27IP-0028C).
Acknowledgments
We would like to thank Dr. Nathan Dodder, Ying Xu, Bryan Ho,
and Basilin Benson for their valuable insights during the study design.
Supplementary materials
Supplementary material associated with this article can be
found, in the online version, at doi:10.1016/j.chroma.2021.462656.
References
[1] L. Chibwe, I.A. Titaley, E. Hoh, S.L.M. Simonich, Integrated framework for identifying toxic transformation products in complex environmental mixtures, Environ. Sci. Technol. Lett. 4 (2017) 32–43, doi:10.1021/acs.estlett.6b00455.
[2] J. Hollender, E.L. Schymanski, H.P. Singer, P.L. Ferguson, Nontarget screening
with high resolution mass spectrometry in the environment: ready to go? Environ. Sci. Technol. 51 (2017) 11505–11512, doi:10.1021/acs.est.7b02184.
[3] J.R. Sobus, J.F. Wambaugh, K.K. Isaacs, A.J. Williams, A.D. McEachran,
A.M. Richard, C.M. Grulke, E.M. Ulrich, J.E. Rager, M.J. Strynar, S.R. Newton,
Integrating tools for non-targeted analysis research and chemical safety evaluations at the US EPA, J. Expo. Sci. Environ. Epidemiol. 28 (2018) 411–426,
doi:10.1038/s41370- 017- 0012- y.
[4] C.D. Tran, N.G. Dodder, P.J.E. Quintana, K. Watanabe, J.H. Kim, M.F. Hovell,

C.D. Chambers, E. Hoh, Organic contaminants in human breast milk identiﬁed by non-targeted analysis, Chemosphere 238 (2020) 124677, doi:10.1016/
j.chemosphere.2019.124677.
[5] M.B. Alonso, K.A. Maruya, N.G. Dodder, J. Lailson-Brito, A. Azevedo, E. SantosNeto, J.P.M. Torres, O. Malm, E. Hoh, Nontargeted screening of halogenated
organic compounds in bottlenose dolphins (tursiops truncatus) from Rio de
Janeiro, Brazil, Environ. Sci. Technol. 51 (2017) 1176–1185, doi:10.1021/acs.est.
6b04186.
[6] C.A. Manzano, N.G. Dodder, E. Hoh, R. Morales, Patterns of personal exposure
to urban pollutants using personal passive samplers and GC × GC/ToF-MS, Environ. Sci. Technol. 53 (2019) 614–624, doi:10.1021/acs.est.8b06220.
