Building machine-learning-based models for retention time and resolution predictions in ion pair chromatography of oligonucleotides


Journal of Chromatography A 1671 (2022) 462999

Contents lists available at ScienceDirect

Journal of Chromatography A
journal homepage: www.elsevier.com/locate/chroma

Building machine-learning-based models for retention time and
resolution predictions in ion pair chromatography of oligonucleotides
Martin Enmark, Jakob Häggström, Jörgen Samuelsson∗, Torgny Fornstedt∗
Department of Engineering and Chemical Sciences, Karlstad University, SE-651 88 Karlstad, Sweden

Article info

Article history:
Received 8 December 2021
Revised 22 March 2022
Accepted 25 March 2022
Available online 27 March 2022
Keywords:
Machine-learning
Support vector regression (SVR) model
Oligonucleotides
Ion-pair chromatography
Resolution

Abstract
Support vector regression (SVR) models were created and used to predict, with high accuracy, the retention times of oligonucleotides separated using gradient ion-pair chromatography. The experimental dataset consisted of fully phosphorothioated oligonucleotides. Two models were trained and validated using two pseudo-orthogonal gradient modes and three gradient slopes. The results show that the spread in retention time differs between the two gradient modes, indicating varying degrees of sequence-dependent separation. Peak widths from the experimental dataset were calculated and correlated with the guanine-cytosine content and retention time of the sequence for each gradient slope. These data were used to predict the resolution of the n – 1 impurity among 250,000 random 12- and 16-mer sequences, showing that one of the investigated gradient modes has a much higher probability of exceeding a resolution of 1.5, particularly for the 16-mer sequences. Sequences having a high guanine-cytosine content and a terminal C are more likely not to reach critical resolution. The trained SVR models can be used both to identify characteristics of different separation methods and to assist in the choice of method conditions, i.e., to optimize resolution for arbitrary sequences. The methodology presented in this study can be expected to be applicable to predicting retention times of other oligonucleotide synthesis and degradation impurities, provided that enough training data are available.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY license.
1. Introduction
Ion-pair chromatography (IPC) is an important technique for separating synthetic oligonucleotides, which are a class of DNA- or RNA-based molecules with widespread and well-known applications in diagnostics [1,2], research [3], and, recently, therapeutics [4,5]. Oligonucleotides used for antisense therapy [6] are typically produced using stepwise solid-phase synthesis via the β-cyanoethyl phosphoramidite method [7]. Depending on the length, sequence, and miscellaneous chemical modifications of these antisense active pharmaceutical ingredients (APIs) [8], the final synthesis product will contain a large fraction of impurities. The polymeric nature of the oligonucleotides and the many impurities challenge analytical separations, and phosphorothioated (PS) oligonucleotides are especially difficult to analyze [9–12]. In this study, we focus on the shortmer impurities with respect to the parent full-length product (FLP), and in particular on the n – 1 impurity generated by, e.g., failed coupling in the last coupling step, i.e., trityl-off.
∗ Corresponding authors: J. Samuelsson and T. Fornstedt.

Amphipathic [13] oligonucleotides are predominantly separated and analyzed using IPC [9,14,15]. The most-used stationary phase is the C18 column, typically pH-stable variants such as the XBridge C18 and other reversed-phase chemistries [11,12,15,16]. Many different combinations of ion-pairing reagents (IPRs) have been evaluated [9,15]. For the separation of PS oligonucleotides, methods using tributyl ammonium acetate (TBuAA) as the IPR have proven successful [11,15,17]. In this study, we will use TBuAA in two previously investigated gradient modes [18]. In that study, we showed that using the phenyl column resulted in slightly improved n – 1 selectivity compared to the C18 column in the IPR gradient mode. In the co-solvent gradient elution mode, the co-solvent fraction increases over time, while the IPR concentration typically remains constant. In the IPR gradient mode, the IPR concentration decreases over time while the co-solvent fraction remains constant. Both modes elute oligonucleotides by decreasing the apparent electrostatic potential generated by the adsorption of the IPR. We have previously shown that the IPR gradient increases the selectivity for oligonucleotide impurities of the same charge, for example phosphodiester (P=O)1 impurities of fully phosphorothioated oligonucleotides, especially using a phenyl column [18]. Other chromatographic modes not using IPRs, such as HILIC, have also been investigated for the separation of PS-modified oligonucleotides [19].
Retention time prediction models for the IPC separation of oligonucleotides are few; noteworthy works include those of Gilar et al. [20], Studzinska and Buszewski [21], Sturm et al. [22], Liang et al. [23], and Kohlbacher et al. [24]. Such models are well established for peptides and are routinely employed, for example, in shotgun proteomics to design targeted proteomics experiments and to reduce false-positive hits in mass spectrometry analysis. The many different approaches used can roughly be divided into (i) index-based, (ii) modeling-based, and (iii) machine-learning (ML)-based methods [25]. In index-based methods, the effect of each amino acid in a sequence is estimated using the multilinear regression of a large set of peptides with known retention times [26,27]. In modeling-based methods, the physicochemical properties of the peptide are used to predict the retention times [27]. In ML-based methods, a training set of peptides is used to estimate the parameters of a predefined mathematical model; many different approaches have been used for this, such as artificial neural networks [28] and support vector regression (SVR) [29,30].
Gilar et al. have developed an empirical logarithmic model (hereafter denoted as LM) to predict the retention of synthetic oligonucleotides [20]. Their modeling-based method has five input variables, i.e., the amount of each nucleotide (T, C, G, and A) as well as the total number of nucleotides in the oligo. Studzinska and Buszewski used quantitative structure–retention relationships (QSRRs) to predict the retention based on descriptors such as van der Waals surface area, solvent-accessible area, dipole moment, total energy, and hydration energy [21]. All these parameters were numerically estimated and fitted to simple functions. Neither of these methods delivers excellent predictivity. The great advantage of the LM model is that it is easy to use, requires few data points for calibration, and has been shown to be rather good for predicting the retention of non-phosphorothioated oligonucleotides. However, due to the selection of descriptors, the model cannot address potential structural changes such as groove and hairpin formation, nor whether the retention depends on sequence and not just on composition. The same could also be true of the QSRR method, which shares the problems of descriptor selection and of finding accurate descriptors of more complicated molecules such as oligonucleotides. Sturm et al. used SVR for retention predictions [22], mainly using sequence-based descriptors as well as descriptors correlating to stacking energies occurring in hairpin formation. Sturm et al. showed that their model had better predictive power than the LM model and could also predict the retention change due to hairpin formation. Since the experimental systems, including the solutes, investigated by Gilar et al. and Sturm et al. are similar, it is relevant to compare both approaches for phosphorothioated oligonucleotides separated in different experimental systems. Later, Liang et al. used a similar SVR model to investigate how to optimize the selectivity in gradient elution [23]. In all of the above studies, the authors investigated non-phosphorothioated oligonucleotides using triethylamine as the IPR and the co-solvent gradient mode. Given the successful use of SVR models in [22,23], we decided to investigate whether such models can also be used to predict the retention of phosphorothioated oligonucleotides eluted using tributylamine as the IPR.
The aim of this study is to build SVR IPC retention time prediction models based on the oligonucleotide sequence for two different gradient modes, i.e., the conventional co-solvent gradient and the IPR gradient modes. As training and testing solutes, around 100 heteromeric, fully phosphorothioated oligonucleotides will be used. As the IPR, TBuAA will be used to reduce diastereomer separation. Finally, and most importantly, the retention time prediction models will be used to predict the probability of successfully separating the impurities from synthetic oligonucleotides and to compare the two gradient modes, (i) the co-solvent gradient and (ii) the IPR gradient, using three gradient slopes.
2. Materials and methods
2.1. Chemicals and materials
The IPRs TBuAA and triethylammonium acetate (TEtAA) were prepared from tributylamine (≥99.5%, CAS number 102-82-9) and triethylamine (≥99.5%, CAS number 121-44-8) with acetic acid (≥99.8%, CAS number 64-19-7), all purchased from Sigma-Aldrich (St. Louis, MO, USA). The mobile phases were prepared using HPLC gradient-grade acetonitrile (CAS number 75-05-8) from VWR (Radnor, PA, USA) and deionized water with a resistivity of 18.2 MΩ cm from a Milli-Q water purification system (Merck Millipore, Darmstadt, Germany). An XBridge Phenyl column, 150 × 3.0 mm, 3.5 μm, 100 Å pore size, from Waters (Milford, MA, USA) was used in all experiments. Fully phosphorothioated oligonucleotides were purchased in 0.25-μmol scale from Integrated DNA Technologies (Leuven, Belgium) and delivered desalted and lyophilized. The purchased FLP oligonucleotides were not purified before use. A list of all oligonucleotide sequences can be found in Supplementary material Table S1.
2.2. Instrumentation
Experiments were conducted on an Agilent 1260 Infinity II HPLC system (Agilent Technologies, Palo Alto, CA, USA), configured with a binary pump, a 100-μL injection loop, a diode-array UV detector, a single quadrupole MS, and a column thermostat.
2.3. Procedures
2.3.1. Selection of oligonucleotides
The first part of the dataset was selected to explore the effects of length, nucleobase composition, and sequence. It contains three different sequences for each of the lengths 8, 12, and 16. These were designed in silico by first generating one million sequences of length 8, 12, and 16 by randomly picking adenine (A), thymine (T), cytosine (C), or guanine (G) at each position in the sequence. The retention time of all sequences was then calculated using the LM model described by Gilar et al. [20]. This allowed us to estimate the variance in retention time for each population of 8-, 12-, and 16-mers. Then, we randomly picked three sequences of each length, at the population mean – 2 standard deviations, at the mean, and at the mean + 2 standard deviations, labeled SnA, SnB, and SnC, respectively, where n = 8, 12, or 16. These sequences can be found in Supplementary material Table S1. Since the LM predicts that the contribution to retention time increases according to the nucleobase in the order C < G < A < T, the base composition of the sequences varies from high proportions of guanine-cytosine content (GC-content) in the SnA sequences to high proportions of A and T in the SnC sequences. The second part of the dataset was selected to test whether the secondary oligonucleotide structure influences the retention time. The 16-mer sequences referred to as reference hairpin (RHA) and model hairpin (MHA) by Stellwagen et al. [31] were selected; Stellwagen et al. investigated the effect of monovalent cations on the thermal stability of MHA, as measured by capillary electrophoresis. In this case, the MHA should contain more than 10% hairpin structures at 50 °C, at least in a solution containing 100 mM tetrabutyl ammonium, no organic solvent, and a high amount of other background electrolytes. They also found that the DNA melting point decreases with increasing lipophilicity of the IPR [31]. In our study, we therefore included permutated variants of RHA and MHA that minimize
Table 1
Summary of experimental gradient conditions.

Elution mode          Parameter                  G1      G2      G3
Co-solvent gradient   Initial MeCN (v%)          38
                      TBuAA (mM)                 5
                      Slope (v% MeCN min–1)      2.22    1.23    0.81
IPR gradient          Initial TEtAA (mM)         0.1
                      MeCN (v%)                  41.5
                      Slope (mM TEtAA min–1)     0.32    0.16    0.08

the co-solvent gradient experiments. A list of all oligonucleotide sequences as well as their retention times can be found in Supplementary material Table S1; the peak widths were obtained from the n – 1, n – 2, and n – 3 peaks by first interpolating the actual peak and then determining the corresponding width at half height.
3. Calculations

All general computations were performed using Python with the NumPy supporting libraries, and all graphics were generated using Matplotlib.
The first step in finding an ML model is processing the data. Our dataset consists of the output data, i.e., the retention times, and the corresponding oligonucleotides, represented by strings of different combinations of A, T, G, and C, serving as input data. Since ML models require numerical input, the oligonucleotides must be encoded. In our implementation, we encoded the oligonucleotides in terms of different frequencies based on their primary and secondary structural properties, as described by Sturm et al. [22]. These features were divided into groups, as done by Sturm et al., where COUNT contains the frequency of each nucleotide in the sequence, CONTACT contains the frequencies of all possible dinucleotides in terms of their order (e.g., the numbers of CG, CA, CT, CC, etc. occurring in the sequence), SCONTACT contains the frequencies of all dinucleotides disregarding their order (e.g., the numbers of CG + GC, CA + AC, CC, etc.), and finally HAIRPIN contains the numbers of stem, loop, and free bases [22]. The secondary structure of the sequences was calculated using the seqfold module [34], assuming a temperature of 50 °C.
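The sequence-based feature groups described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of raw counts rather than normalized frequencies, and the ordering of the feature vector are assumptions made for illustration, and the HAIRPIN group is left out because it requires a secondary-structure prediction.

```python
from itertools import combinations_with_replacement, product

BASES = "ACGT"  # fixed base order so every sequence maps to the same columns


def count_features(seq):
    """COUNT: number of occurrences of each nucleotide."""
    return {b: seq.count(b) for b in BASES}


def contact_features(seq):
    """CONTACT: ordered dinucleotide counts (CG and GC are distinct)."""
    feats = {a + b: 0 for a, b in product(BASES, repeat=2)}
    for i in range(len(seq) - 1):
        feats[seq[i:i + 2]] += 1
    return feats


def scontact_features(seq):
    """SCONTACT: dinucleotide counts ignoring order (CG and GC pooled)."""
    feats = {a + b: 0 for a, b in combinations_with_replacement(BASES, 2)}
    for i in range(len(seq) - 1):
        key = "".join(sorted(seq[i:i + 2]))  # "GC" and "CG" both become "CG"
        feats[key] += 1
    return feats


GROUPS = {
    "count": count_features,
    "contact": contact_features,
    "scontact": scontact_features,
}


def encode(seq, groups=("count",)):
    """Concatenate the selected feature groups into one numeric vector."""
    vector = []
    for name in groups:
        vector.extend(GROUPS[name](seq).values())
    return vector


# Example with sequence S16A from the text (written as given there).
s16a = "ACGACCGGGCGGAGTC"
s16a_counts = count_features(s16a)
s16a_vector = encode(s16a, groups=("count", "contact", "scontact"))
```

Keeping each group in its own function makes it straightforward to evaluate every combination of groups, which is how the feature selection is described in the text.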
The next step in the search for a model was the training, followed by finding the best-performing features and hyperparameters. This was done by performing a nested cross-validation, the purpose of which was to estimate how well the model responded to new data and thereby reduce the risk of model overfitting. First, the dataset was split into k subsets. Then, one subset was omitted from the training to act as validation data (1/3 of all data), while the rest of the dataset was used for training (2/3 of all data). The chosen training set was then further split into n subsets, and the same procedure as described before was repeated. This approach is visualized in Fig. 1. The model that performed best on average in the inner cross-validation was chosen to be tested on the outer validation set. The result was then evaluated based on the average performance in the outer validation, and the main metric used in this implementation was the root mean squared error (RMSE).
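The nested cross-validation described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic (random count-like features with a made-up linear response), the RBF kernel and the grid values are placeholders, and the helper name `nested_cv_rmse` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR


def nested_cv_rmse(X, y, param_grid, n_outer=3, n_inner=3, seed=0):
    """Return the mean outer-fold RMSE and the individual fold RMSEs."""
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    fold_rmses = []
    for train_idx, val_idx in outer.split(X):
        # Inner loop: hyperparameter search on the training part only.
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        search = GridSearchCV(
            SVR(kernel="rbf"),  # kernel choice is an assumption
            param_grid,
            cv=inner,
            scoring="neg_root_mean_squared_error",
        )
        search.fit(X[train_idx], y[train_idx])
        # Outer loop: the refitted best model is scored on unseen data.
        pred = search.predict(X[val_idx])
        fold_rmses.append(float(np.sqrt(np.mean((pred - y[val_idx]) ** 2))))
    return float(np.mean(fold_rmses)), fold_rmses


# Synthetic stand-in data: four count-like features per "sequence".
rng = np.random.default_rng(0)
X = rng.integers(0, 8, size=(60, 4)).astype(float)
y = X @ np.array([0.20, 0.10, 0.15, 0.30]) + rng.normal(0.0, 0.05, size=60)

# Placeholder grid over C, epsilon, and gamma.
grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1], "gamma": [0.01, 0.1]}
mean_rmse, fold_rmses = nested_cv_rmse(X, y, grid)
```

Because the hyperparameters are chosen without ever seeing the outer validation fold, the outer RMSE is a less optimistic estimate of predictive error than the inner cross-validation score.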
This procedure was performed for each sub-dataset, where every unique combination of the described feature groups was evaluated. The inner cross-validation was done using GridSearchCV from the sklearn ML library, which performs a k-fold cross-validation for a given model (SVR) and lists of hyperparameters (regularization parameter C, epsilon tube ε, and kernel coefficient γ). Once GridSearchCV had fitted a model for each combination of hyperparameters, the best-performing model was chosen and further evaluated on the outer validation set, which was randomly split using the sklearn class KFold [35]. The number of folds in both the outer and inner cross-validations was chosen to be three. Furthermore, results might vary due to the stochastic nature of the algorithms when performing a fit and due to the randomized split of the datasets, so the process was performed another three times to reduce the variance of the results. As a comparison, the LM model developed by Gilar et al. (equation 7 in [20]) was fitted to each sub-dataset. A nonlinear least-squares regression was performed to find the optimal weights using the lmfit module [36]. The LM requires no hyperparameter optimization and was therefore only evaluated on the outer validation split. When the best-performing

features were found, a ﬁnal training was then done using the best-

hairpin formation (i.e., RHB and MHB). Finally, a sequence mimicking the MALAT-1 transcript targeting ASO described by Nilsson
et al. [32] was included in the dataset. The 8-, 12-, and 16-mer sequences synthesized are hereafter referred to as FLPs of length n.
2.3.2. Experimental
All samples were prepared by dissolving the lyophilized oligonucleotides by vortexing them in deionized water prepared using a Milli-Q water purification system (Merck Millipore). The stock concentration was 1 mg mL–1 and the injection concentration was 0.2 mg mL–1. A 3-μL volume of this solution was injected onto the column. Mobile phases were prepared by weight using the density of water and acetonitrile (MeCN) at room temperature. For the co-solvent gradient experiments, 10 and 80 v% MeCN solutions were prepared, while for the IPR concentration gradient experiments, two 41.5 v% solutions were made. While stirring, acetic acid was added, followed by tributylamine (to both eluents for the co-solvent gradient experiments) or by either tributylamine or triethylamine (for the IPR concentration gradient experiments). All mobile phases were stirred for at least 12 hours before use to ensure that all IPR molecules were fully dissolved. Before use, the sw pH of all mobile phases (solvent/water) was determined using a pH electrode calibrated in aqueous buffer. The measured pH value of the mobile phase ranged between 7 and 8 depending on the mobile phase composition: 7 at low concentration of MeCN and 8 at high concentration of MeCN. All experiments were performed using still-air column temperature control at 50 °C. The flow rate was 0.5 mL min–1, which provided sufficiently good MS signals, i.e., good enough nebulization in the spray chamber. Three gradient slopes were evaluated for each of the two gradient methods, and their details can be found in Table 1. A re-equilibration time of about three column volumes was used after the end of each gradient. A 0.01 mg mL–1 sample of uracil was prepared in deionized water and used as the void volume marker.
The UV signal was recorded at 260 nm. Mass spectrometry analysis was performed using negative polarity in API-ES ionization mode. More details of the mass spectrometry settings can be found in Roussis et al. [33]. Retention times were obtained from both UV and MS signals. The retention time of the full-length sequence was determined from the peak apex of the UV signal. Retention times of shortmer impurity sequences were obtained by the selective ion monitoring of charge states 3 and 4. For the 8-mer samples, retention times for n = 8, 7, 6, 5, and 4 were obtained in a single injection, whereas for the 16-mer samples, retention times for n = 16, …, 12 and 11, …, 8 were obtained in two separate injections. This allowed the repeatability of experiments to be monitored. Retention times were adjusted for the additional dwell volume introduced by the tubing to the MS. To determine the correct time for samples having overlapping m/z values for different charge states, it was assumed that the retention time of the n – x-mer was always less than that of the n-mer.
Regarding the amount of data used: in total, retention times for 98 unique sequences were collected and determined for all gradient slopes in the IPR-gradient experiments, and 96 for the G1 and G2 gradient slopes and 91 for the G3 gradient slope, for

from its n – 1 impurity. We will also demonstrate how the choice
of elution method, conditions, and sequence characteristics affect
the probability of success.
4.1. Retention times
The first observation for both the co-solvent gradient and the IPR gradient was that the retention times of sequences with n = 8, 12, and 16 increased with increasing proportions of A and T (samples SnA through SnC in Supplementary material Table S1). The retention time also increased with decreasing gradient slope. Very short oligonucleotides, i.e., n < 5, were only marginally affected by the gradient compared with longer sequences, i.e., n = 16, as the system dwell volume had less of an effect on strongly retained oligonucleotides. The oligonucleotide 3′-ACGACCGGGCGGAGTC-5′ (S16A) had similar retention times using either method for all three gradients, as it was used to normalize the effects of gradient slope and starting point between the methods. This normalization had the unexpected effect that the shorter oligonucleotides, i.e., the S8x and S12x samples, were eluted significantly earlier using the IPR gradient than the co-solvent gradient. Clearly, the two methods cannot be normalized for oligonucleotides of different lengths without also changing the shape of the gradient. 16-mer sequences other than S16A had different retention times in the two modes, indicating that there were different sequence-specific contributions to retention. The hairpin-forming sequence MHA had about a 0.15-min shorter retention time than its permutated sequence MHB in the co-solvent gradient system and about a 0.3-min difference in the IPR gradient system using the shallowest gradient (G3). The second hairpin-forming sequence, RHA, had retention almost identical to that of its permutated variant RHB in both systems at the same gradient slope.
In Fig. 2a, we can see the difference between the two gradient modes. The shortest oligonucleotides display better selectivity, i.e., a large change in the y-direction with the addition or removal of a nucleobase subunit, in the co-solvent gradient method, whereas the opposite trend holds for the longest oligonucleotides in the IPR gradient method (the larger change is in the x-direction). However, as can be seen in Fig. 2b, the eluted peaks in the IPR gradient are wider than in the co-solvent gradient. How this affects resolution will be investigated further; see Section 4.3 below.
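How a retention-time difference and the peak widths combine into resolution can be illustrated with the standard expression for Gaussian peaks measured at half height, Rs = sqrt(2 ln 2) · Δt / (w1 + w2) ≈ 1.18 · Δt / (w1 + w2). The text does not state which resolution formula was used, so this choice, the function names, and the numbers below are assumptions for illustration only.

```python
import math

# Conversion factor between half-height widths and the usual base-width
# definition of resolution for Gaussian peaks (about 1.18).
HALF_HEIGHT_FACTOR = math.sqrt(2.0 * math.log(2.0))


def resolution_half_height(t_r1, t_r2, w_half1, w_half2):
    """Resolution of two peaks from retention times and half-height widths."""
    return HALF_HEIGHT_FACTOR * abs(t_r2 - t_r1) / (w_half1 + w_half2)


def is_resolved(t_r1, t_r2, w_half1, w_half2, critical=1.5):
    """True if the pair reaches the critical resolution used in the text."""
    return resolution_half_height(t_r1, t_r2, w_half1, w_half2) >= critical


# Hypothetical numbers (minutes): same retention difference, different widths.
rs_narrow = resolution_half_height(10.00, 10.30, 0.10, 0.10)
rs_wide = resolution_half_height(10.00, 10.30, 0.15, 0.15)
```

With the same 0.30-min retention difference, the narrower pair exceeds the critical resolution while the wider pair does not, which is why higher selectivity alone does not guarantee better resolution.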

Fig. 1. Flowchart showing the steps required to train an SVR model to predict retention times.

performing features on two thirds of the dataset to visualize the results in plots. In addition, the models that were trained on the whole dataset were saved for later use.
To evaluate the characteristics of the SVR model, we generated 250,000 unique random sequences with n = 12 and 16. We then calculated their retention times and fitted them using a normal distribution. The peak width at half height (w0.5,i) was assumed to be described by the GC-content (sum of the fractions of C and G) of the sequence and its retention time plus a constant. The solution to the resulting linear matrix equation (Supplementary material S4) was determined using the least-squares method. The half-height widths of the UV trace of the FLP and of the mass traces of the n – 1 to n – 7-mers of the 16-mer FLPs in the dataset, as well as of the n – 1 to n – 3-mers of the 12-mer FLPs, were used as input.
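The peak-width model described above, a linear function of GC-content and retention time plus a constant, can be sketched as an ordinary least-squares fit. The exact matrix equation is given in the Supplementary material and is not reproduced here; the function names, the coefficient values, and the five data points below are made up for illustration.

```python
import numpy as np


def gc_fraction(seq):
    """GC-content as the summed fractions of G and C in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def fit_width_model(seqs, t_r, w_half):
    """Least-squares fit of w_half = a * GC + b * t_R + c; returns (a, b, c)."""
    design = np.column_stack(
        [[gc_fraction(s) for s in seqs], t_r, np.ones(len(seqs))]
    )
    coef, *_ = np.linalg.lstsq(design, np.asarray(w_half, dtype=float), rcond=None)
    return coef


def predict_width(coef, seq, t_r):
    """Predicted half-height width for one sequence and retention time."""
    a, b, c = coef
    return a * gc_fraction(seq) + b * t_r + c


# Made-up example: widths generated from known coefficients, then recovered.
seqs = ["ACGTACGTACGT", "GGCCGGCCGGCC", "ATATATATATAT",
        "GCATGCATGCAT", "ACTTACTTACTT"]
t_r = np.array([5.0, 4.0, 7.0, 6.0, 6.5])            # minutes (hypothetical)
true_coef = np.array([-0.02, 0.01, 0.05])             # hypothetical a, b, c
w_half = np.array([predict_width(true_coef, s, t) for s, t in zip(seqs, t_r)])

coef = fit_width_model(seqs, t_r, w_half)
```

In practice one such fit would be made per gradient mode and slope, since the text correlates the widths with GC-content and retention time separately for each gradient slope.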
The SVR model can be downloaded from the Supplementary
material.
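The random-sequence evaluation used in this section, generating unique random sequences, predicting their retention times, fitting a normal distribution, and inspecting the base composition of the tails, can be sketched as below. The retention function is a toy stand-in for the trained SVR model with hypothetical per-base weights that merely follow the order C < G < A < T, and the sample size is reduced to keep the example fast.

```python
import random
import statistics

BASES = "ACGT"
# Hypothetical per-base contributions; NOT fitted values from the study.
TOY_WEIGHTS = {"C": 0.1, "G": 0.2, "A": 0.3, "T": 0.4}


def unique_random_sequences(n_seqs, length, seed=0):
    """Draw n_seqs distinct random sequences of the given length."""
    if n_seqs > len(BASES) ** length:
        raise ValueError("more sequences requested than exist")
    rng = random.Random(seed)
    seqs = set()
    while len(seqs) < n_seqs:
        seqs.add("".join(rng.choice(BASES) for _ in range(length)))
    return sorted(seqs)


def toy_retention(seq):
    """Stand-in for a trained model: sum of per-base contributions."""
    return sum(TOY_WEIGHTS[b] for b in seq)


def base_composition(seqs):
    """Pooled fraction of each base over a list of sequences."""
    pooled = "".join(seqs)
    return {b: pooled.count(b) / len(pooled) for b in BASES}


def tail_compositions(seqs, times, n_sd=1.5):
    """Base composition of sequences beyond n_sd standard deviations."""
    mu = statistics.fmean(times)      # normal fit: mean ...
    sd = statistics.pstdev(times)     # ... and standard deviation
    early = [s for s, t in zip(seqs, times) if t < mu - n_sd * sd]
    late = [s for s, t in zip(seqs, times) if t > mu + n_sd * sd]
    return base_composition(early), base_composition(late)


seqs = unique_random_sequences(2000, 12)
times = [toy_retention(s) for s in seqs]
early_comp, late_comp = tail_compositions(seqs, times)
```

Replacing `toy_retention` with predictions from a trained model gives the kind of tail analysis reported in the text, where early-eluting sequences are compared with the 25% baseline for each base.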

4.2. Machine learning model to predict retention times
The first step in finding the best ML model was to evaluate the model performance as a function of the features included, i.e., count, contact, scontact, and hairpin (see Section 3 for more details about the features). Across all combinations of gradient modes and slopes, count alone gave the smallest RMSE for three out of six systems (for a summary of all models, see Supplementary material Table S2). For the remaining three systems, other combinations of features gave only marginally improved model RMSE. This result could already be anticipated from the retention data, with permutations of the strong hairpin structures MHA and RHA only marginally affecting the retention time. We therefore decided to continue using the model with only the count feature.
In the study by Sturm et al. [22], all features were found to be required to properly predict the retention times. However, this finding cannot be directly extrapolated to our study since there are two main experimental differences between the experiments conducted by Sturm et al. and by us. Firstly, they used another IPR (TEA) and, secondly, they used unmodified oligonucleotides, whereas we used TBuAA as the IPR and fully phosphorothioated oligonucleotides as solutes. As a consequence, Sturm et al. conducted their separations with much lower amounts of acetonitrile (MeCN), a 0–16% MeCN gradient, as compared to 38–70% as in this

4. Results and discussion
The shortmer population (n – 1, n – 2, …, n – n + 1) constitutes the largest number of impurities generated by the solid-phase synthesis. Successful separation and quantification of the individual shortmers are necessary for the quality control of APIs. Generally, the separation of the n – 1-mer is the most relevant and most challenging problem. Therefore, it is beneficial to have a tool that can assist in the selection of chromatographic methods and the corresponding conditions necessary to achieve critical resolution of the pair, here defined as ≥ 1.5. In Section 4.1, we will present experimental retention data obtained using two methods for three gradient slopes and discuss the characteristics of the two systems. The determined retention data will then be used to train ML models, whose performance and characteristics will be discussed in Section 4.2. Finally, in Section 4.3, we will use the ML model to estimate the probability of resolving an arbitrary oligonucleotide

Fig. 2. (a) Normalized experimental retention times obtained in the co-solvent and IPR gradients using gradient G3 (Table 1) and (b) chromatogram showing the separation of sequence MHB (Supplementary material Table S1) at gradient G3 (Table 1).
Table 2
Summary of model performance on the training and validation sets.

Gradient mode         Gradient slope   Model   RMSE training set (min)   RMSE validation set (min)   R2 training set   Q2 validation set
IPR gradient          G1               SVR     0.055                     0.076                       0.999             0.998
                      G1               LM      0.280                     0.278                       0.977             0.974
                      G2               SVR     0.037                     0.120                       0.999             0.998
                      G2               LM      0.583                     0.640                       0.949             0.937
                      G3               SVR     0.105                     0.181                       0.999             0.997
                      G3               LM      0.976                     1.270                       0.925             0.852
Co-solvent gradient   G1               SVR     0.073                     0.091                       0.998             0.997
                      G1               LM      0.130                     0.129                       0.993             0.993
                      G2               SVR     0.088                     0.123                       0.999             0.996
                      G2               LM      0.178                     0.260                       0.995             0.984
                      G3               SVR     0.073                     0.127                       0.999             0.998
                      G3               LM      0.383                     0.478                       0.985             0.975


study. Previously, it was shown that in separations conducted at higher amounts of MeCN, the separation system's ability to separate charge differences is increased while its ability to separate compounds with the same charge is decreased [18]. As a result, the feature count becomes more important and the features contact and scontact, which capture next-neighbor effects, contribute less to the model, which was also observed in our study.
We also compared the SVR model with the LM. The results indicated that the SVR model gave lower RMSE in all cases (see Table 2). The relative difference in RMSE between the SVR and the LM models increased with decreasing gradient slope for both gradient modes. SVR was also markedly better at accurately predicting retention times for the IPR gradient at all gradient slopes. This could be expected since the LM model was developed for co-solvent gradient elution, native oligonucleotide samples, and a different IPR and stationary phase. Furthermore, the LM model was developed to give a rough estimate of the amount of acetonitrile required to elute an oligonucleotide based on its length and relative proportions of nucleobases, for which it would still be useful given the current datasets. Another way to estimate the model fit is to calculate the coefficients R2 and Q2, where R2 is estimated from the training set and Q2 is estimated from data not used in the training set; R2 therefore estimates the goodness of fit and Q2 the goodness of prediction. From Table 2, we can see that: (i) R2 was greater than or equal to Q2 in all cases, as expected; (ii) both R2 and Q2 were larger for the SVR model than for the LM model; (iii) the LM model was much worse in predicting the IPR gradient than the co-solvent gradient; and (iv) the SVR model was only slightly worse in predicting the IPR gradient than the co-solvent gradient.
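The metrics discussed above can be computed with a few lines of code. The text does not give the formulas, so the sketch below assumes the common definitions: RMSE as the root of the mean squared residual, and R2 and Q2 both as 1 − SS_res/SS_tot, with R2 evaluated on training data and Q2 on data not used for training. The numbers are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def coefficient_of_determination(y_true, y_pred):
    """1 - SS_res / SS_tot; called R2 on training data, Q2 on held-out data."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up retention times (min): experimental values and model predictions.
t_exp = [1.0, 2.0, 3.0, 4.0]
t_pred = [1.1, 1.9, 3.2, 3.8]

example_rmse = rmse(t_exp, t_pred)
example_q2 = coefficient_of_determination(t_exp, t_pred)
```

RMSE is reported in minutes and therefore grows with longer retention times at shallower gradients, whereas R2 and Q2 are dimensionless, which is one reason to report both.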
Plots of predicted versus experimental retention times for the validation subset of the experimental data obtained at gradient G3 are shown in Fig. 3a and c for the co-solvent and IPR gradient modes, respectively. The validation subset shown in these plots contains one third of the sequences in the complete dataset. The corresponding box plots of the relative error for the SVR and LM models are shown in Fig. 3b and d.
The characteristics of the SVR models were evaluated by calculating the retention times of 250,000 unique random 12- and 16-mers. The distributions of retention times can be found in Fig. 4. The spread of the distributions increased with increasing oligonucleotide length and decreasing gradient slope for both gradient modes, which could be expected. In general, the spread of retention times was higher for the IPR gradient mode, suggesting that the hydrophobicity of the base pairs has a larger impact in this mode. The larger variance observed for 16-mers could already be predicted from Fig. 2a. Analyzing the base composition of sequences by fitting a normal distribution to the retention data shows that, for both gradient modes, 12-mer sequences with retention times more than 1.5 standard deviations below the mean had a higher proportion of G and especially C compared with the baseline of 25% each (see Supplementary material Table S3). On the other hand, 12-mer sequences with retention times more than 1.5 standard deviations above the mean had larger than baseline (25%) proportions of A and especially T for both gradient

Fig. 3. Experimental (tR,exp) and predicted (tR,pred) retention times in the validation dataset obtained using the SVR model (dots) or LM model (crosses) for the co-solvent gradient mode (a, b) and the IPR gradient mode (c, d), respectively. In b) and d), the relative errors of the predictions are summarized in boxplots: the line in the boxplot is the median and the whiskers are the first and third quartiles.

modes. For the 16-mer sequences, the differences in base composition were less pronounced below 1.5 standard deviations for both gradient modes, but the GC-content remained above 50%. Among the most strongly retained 16-mers, over 40% of the nucleobases in the sequence are T for both modes and all gradient slopes.

only a weak correlation for the co-solvent gradient but a more pronounced correlation for the IPR gradient. In both gradient modes, the peak widths increased with increasing retention time (see Supplementary material Fig. S1). The peak widths obtained in the IPR gradient mode were greater than in the co-solvent gradient mode, both in absolute terms and in having a larger variance between sequences. One possible explanation is that the gradient compression experienced by each solute differed because the solutes have different sensitivities to the gradient change. Also, the effective gradient slope (G) could be different between the two gradient modes. However, since the retention time shift of sample S16A was shown to be about the same for gradient slopes G1, G2, and G3 between the two modes, a significant difference in effective gradient slope was unlikely. Another explanation could be that the peak broadening due to partial diastereomer separation was greater in the IPR gradient mode than in the co-solvent mode. This explanation is plausible since we

4.3. Predictions of the probability of resolving the FLP from the n – 1
impurity
Of particular interest for the quality control of synthetic
oligonucleotides is determining the purity of the FLP, which requires suﬃcient (i.e., Rs > 1.5) resolution when using UV detection. To calculate the resolution, we need accurate predictions of
retention time and peak width. In addition to retention times, we
therefore also investigated how the peak widths correlated with
the retention times in each gradient mode; we found there was
6


Fig. 4. Distributions of the predicted retention times for 250,000 unique 12- and 16-mer sequences (blue and orange fill, respectively) calculated using the SVR model.
Subplots a)–c) show the co-solvent gradient and d)–f) the IPR gradient. Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f).

have shown that the diastereomer separation increased at lower
and constant co-solvent concentration in the IPR gradient mode
as compared with co-solvent gradient elution [18]. This would explain why the peak width increased with both decreasing gradient
slope and increasing retention time. We have previously shown that
the diastereomer separation involving C and G was greater than
that involving A and T [17] and therefore attempted to correlate
the GC-content, together with the retention time and a constant, to
the observed peak width. This simple correlation provides a reasonable approximation of peak width, as summarized in Supplementary material S2.
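
A linear correlation of this kind can be written as w = a·GC + b·tR + c and fitted by ordinary least squares. The sketch below is a minimal illustration under that assumption; the function names, the use of NumPy, and the synthetic data and coefficients are all hypothetical and do not reproduce the fitted values in the Supplementary material. In the study, the correlation was made separately for each gradient slope.

```python
import numpy as np


def fit_peak_width_model(gc_fraction, t_r, width):
    """Least-squares fit of width = a*GC + b*tR + c; returns (a, b, c)."""
    X = np.column_stack([gc_fraction, t_r, np.ones(len(t_r))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(width, dtype=float), rcond=None)
    return coef


def predict_peak_width(coef, gc_fraction, t_r):
    """Evaluate the fitted linear model for new GC fractions and retention times."""
    a, b, c = coef
    return a * np.asarray(gc_fraction) + b * np.asarray(t_r) + c


# Synthetic, noise-free data generated from made-up coefficients.
gc = np.array([0.25, 0.50, 0.75, 0.40, 0.60])
tr = np.array([8.0, 9.5, 7.0, 10.0, 8.5])
w = 0.10 * gc + 0.02 * tr + 0.05
coef = fit_peak_width_model(gc, tr, w)
w_new = predict_peak_width(coef, [0.5], [9.0])
```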
The predicted versus experimentally calculated resolutions for
12- and 16-mer samples are presented in Table 4. Except for the
steepest gradient slope investigated using both gradient modes, the
prediction error is less than 10%. We also observe that the absolute mean error of prediction decreases with decreasing gradient
slope. Investigating the details, we see that the n – 1 impurities
of samples S12A and S12C are always resolved at a resolution of
more than 1.5 regardless of the investigated gradient slope or mode.
For the 16-mer sequences, the critical resolution is reached at a
steeper gradient slope using the IPR gradient mode compared to the
co-solvent gradient mode. Interestingly, the two 12-mer samples
always have higher resolution than the 16-mers using the co-solvent gradient at any
gradient slope, whereas the GC-rich sample S12A has a lower resolution than 2 out of the 8 investigated 16-mers using the IPR
gradient mode. This again highlights that the IPR gradient mode
has a higher degree of separation based on sequence rather than
length as compared to the co-solvent gradient. An accurate estimation of resolution based on sequence composition and retention
time allowed us to calculate the peak widths of all 250,000 random unique 12- and 16-mers as well as their n – 1 impurities at
each gradient slope.
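
To make the combination of retention time and peak width concrete, the sketch below computes a resolution from two peaks. It assumes the common definition Rs = 2·|tR,1 − tR,2| / (w1 + w2) with baseline peak widths; the exact definition used in this study is given in the experimental part and may differ (for example, by using widths at half height). The function name and the numerical values are made up for illustration.

```python
def resolution(t_r_flp, w_flp, t_r_imp, w_imp):
    """Rs = 2*|tR,1 - tR,2| / (w1 + w2), with baseline peak widths (assumed definition)."""
    return 2.0 * abs(t_r_flp - t_r_imp) / (w_flp + w_imp)


# Made-up values in minutes, for illustration only.
rs = resolution(t_r_flp=10.0, w_flp=0.22, t_r_imp=9.6, w_imp=0.18)
is_resolved = rs > 1.5  # critical resolution used in the text
```

In the workflow described above, the retention times would come from the SVR model and the widths from the linear peak width correlation, for both the FLP and its n – 1 impurity.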
The resulting distributions of calculated resolutions are shown
in Fig. 5. For the co-solvent gradient mode, the 12/11-mer separation always has a higher resolution than does the 16/15-mer
separation regardless of the sequences. In addition, all 12-mer sequences are predicted to reach a resolution of 1.5 at all investigated gradient slopes. The resolution of the 12-mer sequences using the co-solvent gradient mode was generally similar to or slightly
better than could be achieved with the IPR gradient mode. This
could also be anticipated from Fig. 2a, where the selectivity between shorter oligonucleotides is greater for the co-solvent gradient than the IPR gradient. For the 16/15-mer separation,
no sequences could be separated with a resolution of at least 1.5
using the steepest co-solvent gradient investigated. At the second
and third steepest gradients, i.e., G2 and G3, 42 and 28% of the
random sequences, respectively, could not be separated (see Table 3). For the
IPR gradient mode, the resolution distributions of the 16/15-mer and 12/11-mer separations overlap at all gradient slopes, with the
overlap increasing with decreasing gradient slope. The results indicate that some 12/11-mers are more difficult to resolve than some
16/15-mers using IPR gradients. This could be expected from the
experimental resolution data showing that a GC-rich 12-mer can
have lower resolution compared to some 16-mers (Table 4). For the
16-mer FLPs, 31, 9, and 4% of all random unique sequences are expected not to reach the critical resolution of 1.5 at gradient slopes
of G1, G2, and G3, respectively.
Investigating the characteristics of the 16-mer sequences that
do not reach a resolution of at least 1.5, we found, for the co-solvent gradient, that they had a marginally higher frequency of C,
both throughout the sequence and in the 5′ terminal nucleobase,
which when missing creates the n – 1-mer (see Table 3). For the
IPR gradient, there was a similar but more pronounced trend. The
sequences that did not reach critical resolution at G2 and G3 contained 27 and 40% C, respectively, as well as above average A. For the 5′ terminal nucleobase, there was a 41 or 82% probability that it was a C at
gradient slopes G2 and G3, respectively. This could be understood from two
earlier observations: first, a sequence containing a large proportion
of C will lead to a wider peak; second, the loss of a terminal C
will give a smaller than average relative decrease in retention time.
These effects combined lead to difficulties obtaining sufficient resolution.
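
The kind of summary reported in Table 3 can be sketched as below: the share of sequences below the critical resolution, their overall base composition, and the composition of the 5′ terminal nucleobase. This is a minimal illustration; the function names are hypothetical, the input data are made up, and the code assumes that sequences are written 5′→3′ so that the first character is the 5′ terminal nucleobase.

```python
from collections import Counter


def percentages(symbols):
    """Percentage of A, T, C and G among the given symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {b: (100.0 * counts[b] / total if total else 0.0) for b in "ATCG"}


def failure_summary(sequences, resolutions, critical=1.5):
    """Summarize sequences whose predicted resolution is below the critical value."""
    failed = [s for s, rs in zip(sequences, resolutions) if rs < critical]
    return {
        "failed_percent": 100.0 * len(failed) / len(sequences),
        "composition": percentages("".join(failed)),
        # Assumption: index 0 is the 5' terminal nucleobase.
        "terminal_5prime": percentages(s[0] for s in failed),
    }


# Toy input: four made-up sequences with made-up resolutions.
seqs = ["CCGTACGT", "ATTATTAA", "CGCCGGTA", "TTATATAT"]
rs = [1.2, 2.1, 1.4, 2.5]
summary = failure_summary(seqs, rs)
```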
Investigating the FLPs of the experimental dataset (Supplementary material Table S1), we found that one of the sequences
that did not reach the critical resolution using the IPR gradient at the steepest gradient slope G1 was the RHB sample (3′-CGCGTGGTCCTGGTCC-5′). This sequence has a composition of 37.5%
C, 37.5% G, 25% T, and 0% A as well as a terminal C at the 5′ end.
The experimental resolution for the n – 1-mer was calculated to be
about 1.3 at G1, see Table 4. Decreasing the gradient slope to G3 increased the resolution to about 1.9. The resolution
using the co-solvent gradient was even lower, about 1 at G1

Fig. 5. Distributions of the predicted n – 1 resolutions for 250,000 unique 12- and 16-mer sequences (blue and orange fill) calculated using the SVR model. Subplots a)–c)
show the co-solvent gradient and d)–f) the IPR gradient. Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f). Vertical dashed line at a resolution of 1.5.

Fig. 6. Experimental and simulated chromatograms of the RHB sample at the steep gradient slope G1 (a, c) and the shallow gradient slope G3 (b, d), respectively. Co-solvent
gradient mode (a, b) and IPR gradient mode (c, d).

and just 1.5 at G3. Experimental and simulated chromatograms of
RHB are shown in Fig. 6. The simulated peaks were constructed by
generating a normal distribution with a variance calculated from
the nucleobase composition and retention time. The areas of the
FLP and n – 1 peaks were manually normalized by adjusting the height
separately for each peak, and the peaks were then stitched together to obtain the
final chromatogram. The retention times and peak widths of the experimental and simulated chromatograms are in good agreement,
although there is a slight underestimation of the calculated resolution
in the co-solvent gradient at gradient slope G1, as also indicated in
Table 4.
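
A simulated chromatogram of this type can be sketched as the sum of two normal-distribution peaks. The example below is a minimal illustration and not the procedure used to produce Fig. 6: it assumes that the baseline peak width equals four standard deviations, scales each peak by an area rather than by a manually adjusted height, and uses made-up retention times, widths and areas. All function names are hypothetical.

```python
import math


def gaussian_peak(t, t_r, sigma, area=1.0):
    """Normal-distribution peak with the given area, centred at t_r."""
    norm = area / (sigma * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-0.5 * ((t - t_r) / sigma) ** 2)


def simulate_chromatogram(times, peaks):
    """Sum of Gaussian peaks; each peak is (t_r, baseline_width, area)."""
    # Assumption: baseline width w = 4 * sigma.
    return [
        sum(gaussian_peak(t, t_r, w / 4.0, area) for t_r, w, area in peaks)
        for t in times
    ]


# Made-up FLP and n - 1 peaks (minutes), for illustration only.
dt = 0.001
times = [8.0 + dt * i for i in range(4001)]  # 8.0 to 12.0 min
signal = simulate_chromatogram(times, [(10.0, 0.20, 1.0), (9.7, 0.18, 0.1)])
```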


Table 3
Details of predicted 16-mer failure sequences (Rs < 1.5); fx is the percentage of nucleobase x.

                     Gradient  Below critical resolution,  Sequence composition (%)  Terminal nucleobase composition (%)
Gradient mode        slope     Rs < 1.5, frequency (%)     fA   fT   fC   fG         fA   fT   fC   fG
Co-solvent gradient  G1        100                         25   25   25   25         25   25   25   25
                     G2        42                          22   22   28   28         24   24   26   26
                     G3        28                          23   23   27   27         24   24   26   26
IPR gradient         G1        31                          28   23   24   25         27   2    34   37
                     G2        9                           32   22   27   19         31   0    41   28
                     G3        4                           21   22   40   17         3    0    82   15

Table 4
Experimentally measured resolutions vs predictions for FLP and n – 1 using the SVR model for retention times and the
linear model for peak widths, respectively.

                    Co-solvent gradient                        IPR gradient
                    G1           G2           G3               G1           G2           G3
Sample  5′-end      Exp   Pred   Exp   Pred   Exp   Pred       Exp   Pred   Exp   Pred   Exp   Pred
S12A    C           1.81  1.81   2.40  2.53   2.76  2.77       2.17  2.17   2.45  2.44   2.38  2.38
S12C    C           1.73  1.71   2.45  2.63   2.92  2.92       2.26  2.39   2.63  2.70   2.82  2.81
S16A    C           1.10  1.30   1.43  1.63   1.69  2.06       1.56  1.89   1.97  2.33   2.21  2.29
S16B    A           1.03  0.73   1.49  1.57   1.75  1.76       1.49  1.49   1.90  1.98   2.52  2.52
S16C    A           1.02  1.03   1.45  1.69   1.75  1.83       1.45  1.91   1.84  2.14   2.21  2.44
MALAT   G           0.86  0.61   1.23  1.48   1.51  1.50       1.27  1.59   1.62  1.72   1.92  2.15
MHA     A           1.05  0.83   1.57  1.59   1.85  1.96       1.54  1.76   1.98  2.35   2.35  2.78
MHB     A           1.14  0.83   1.65  1.59   1.93  1.96       1.85  1.76   2.31  2.35   2.74  2.78
RHA     G           1.09  0.82   1.54  1.49   1.77  1.68       1.63  1.55   2.06  2.06   2.30  2.32
RHB     C           0.98  0.92   1.32  1.34   1.53  1.51       1.29  1.31   1.68  1.66   1.90  1.91
Abs. mean error %   15.8         7.9          4.2              10.9         7.0          4.8


5. Conclusions

This study aimed at constructing an ML model capable of predicting the retention times of phosphorothioated oligonucleotides with high accuracy. The model was shown to predict retention times with low RMSE as well as high Q² and R² for all investigated conditions. For the investigated experimental systems, the effect of secondary oligonucleotide structure was shown to be minimal, allowing us to construct a simpler model.
The ML models were used for predicting the chromatographic characteristics of 250,000 random 12- and 16-mers. It was found that the variance in retention time was higher when using the IPR gradient mode than the co-solvent gradient mode. However, a slight skewness in the distribution of retention times for a uniform distribution of A, T, G, C indicates that the SVR model has captured sequence-specific contributions to the retention time, which could indicate the presence of next neighbor effects. Sequences containing high proportions of C and G gave the shortest retention times, whereas high proportions of A and T gave the longest retention times in both gradient modes.
Finally, the resolution of each of the 250,000 random sequences to its n – 1-mer was calculated using the retention time from the ML model and the peak width from the linear combination of oligonucleotide GC-content and retention time. Results indicate that the co-solvent gradient mode can be expected to easily resolve all 12-mer sequences from the 11-mers, typically with greater resolution than the IPR gradient can. On the other hand, the probability of successfully resolving longer 16-mer sequences from 15-mers was significantly higher using the IPR gradient mode. For both methods, decreasing the gradient slope increased the probability of achieving critical resolution. Among the 16-mers that still could not be resolved using the IPR gradient mode, the frequency of C at the terminal nucleobase was very high.
The ML models constructed in this study could help select the appropriate gradient mode and gradient slope that would lead to successful separation before performing an experiment. The models could be expanded to account for retention shifts introduced by other oligonucleotide modifications such as 3′-MOE, methyl-C or LNAs if sufficient data is provided. They could also cover other impurities related to the FLP if trained with such retention data. Other impurities could for example include (P=O) or abasics. Other chromatographic systems including other column chemistries, particle sizes, temperatures, and mobile phases could also be added to have an even greater number of possible systems to choose from. The methodology could also be used to optimize the method run time in silico before running experiments.

Availability
Implementations and code used in this study can be found at:
21- 1579.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

CRediT authorship contribution statement
Martin Enmark: Conceptualization, Methodology, Software,
Validation, Formal analysis, Investigation, Resources, Data curation,
Writing – original draft, Writing – review & editing, Visualization, Supervision. Jakob Häggström: Methodology, Software, Formal analysis, Investigation, Data curation, Writing – original draft.
Jörgen Samuelsson: Conceptualization, Validation, Writing – original draft, Writing – review & editing, Supervision. Torgny Fornstedt: Conceptualization, Writing – review & editing, Supervision,
Project administration, Funding acquisition.

Acknowledgements
This work was supported by the Swedish Knowledge Foundation via the project “Improved Methods for Process and Quality Controls using Digital Tools” (grant number 20210021) and by the Swedish Research Council (VR) via the project “Fundamental Studies on Molecular Interactions aimed at Preparative Separations and Biospecific Measurements” (grant number 2015-04627).

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.chroma.2022.462999.

References
[1] S. Yang, R.E. Rothman, PCR-based diagnostics for infectious diseases: uses, limitations, and future applications in acute-care settings, Lancet Infect. Dis. 4 (2004) 337–348, doi:10.1016/S1473-3099(04)01044-8.
[2] L. Becherer, N. Borst, M. Bakheit, S. Frischmann, et al., Loop-mediated isothermal amplification (LAMP) – review and classification of methods for sequence-specific detection, Anal. Methods 12 (2020) 717–746, doi:10.1039/C9AY02246E.
[3] M.J. Heller, DNA Microarray Technology: Devices, Systems, and Applications, Annu. Rev. Biomed. Eng. 4 (2002) 129–153, doi:10.1146/annurev.bioeng.4.020702.153438.
[4] W. Yin, M. Rogge, Targeting RNA: A Transformative Therapeutic Strategy, Clin. Translat. Sci. 12 (2019) 98–112, doi:10.1111/cts.12624.
[5] T.C. Roberts, R. Langer, M.J.A. Wood, Advances in oligonucleotide drug delivery, Nat. Rev. Drug Discovery 19 (2020) 673–694, doi:10.1038/s41573-020-0075-7.
[6] C.F. Bennett, E.E. Swayze, RNA targeting therapeutics: molecular mechanisms of antisense oligonucleotides as a therapeutic platform, Annu. Rev. Pharmacol. Toxicol. 50 (2010) 259–293, doi:10.1146/annurev.pharmtox.010909.105654.
[7] E. Paredes, V. Aduda, K.L. Ackley, H. Cramer, 6.11 - Manufacturing of Oligonucleotides, in: S. Chackalamannil, D. Rotella, S.E. Ward (Eds.), Comprehensive Medicinal Chemistry III, Elsevier, Oxford, 2017, pp. 233–279, doi:10.1016/B978-0-12-409547-2.12423-0.
[8] S. Benizri, A. Gissot, A. Martin, B. Vialet, et al., Bioconjugated oligonucleotides: recent developments and therapeutic applications, Bioconjugate Chem. 30 (2019) 366–383, doi:10.1021/acs.bioconjchem.8b00761.
[9] N.M. El Zahar, N. Magdy, A.M. El-Kosasy, M.G. Bartlett, Chromatographic approaches for the characterization and quality control of therapeutic oligonucleotide impurities, Biomed. Chromatogr. 32 (2018), doi:10.1002/bmc.4088.
[10] D. Capaldi, A. Teasdale, S. Henry, N. Akhtar, et al., Impurities in Oligonucleotide Drug Substances and Drug Products, Nucleic Acid Ther. 27 (2017) 309–322, doi:10.1089/nat.2017.0691.
[11] M. Enmark, J. Bagge, J. Samuelsson, L. Thunberg, et al., Analytical and preparative separation of phosphorothioated oligonucleotides: columns and ion-pair reagents, Anal. Bioanal. Chem. 412 (2020) 299–309, doi:10.1007/s00216-019-02236-9.
[12] S.G. Roussis, M. Pearce, C. Rentel, Small alkyl amines as ion-pair reagents for the separation of positional isomers of impurities in phosphate diester oligonucleotides, J. Chromatogr. A 1594 (2019) 105–111, doi:10.1016/j.chroma.2019.02.026.
[13] S.T. Crooke, J.L. Witztum, C.F. Bennett, B.F. Baker, RNA-Targeted Therapeutics, Cell Metab. 27 (2018) 714–739, doi:10.1016/j.cmet.2018.03.004.
[14] M. Catani, C.D. Luca, J.M.G. Alcântara, N. Manfredini, et al., Oligonucleotides: current trends and innovative applications in the synthesis, characterization, and purification, Biotechnol. J. (2022) 1900226, doi:10.1002/biot.201900226.
[15] A. Goyon, P. Yehl, K. Zhang, Characterization of therapeutic oligonucleotides by liquid chromatography, J. Pharm. Biomed. Anal. 182 (2020) 113105, doi:10.1016/j.jpba.2020.113105.
[16] S. Studzińska, S. Bocian, L. Siecińska, B. Buszewski, Application of phenyl-based stationary phases for the study of retention and separation of oligonucleotides, J. Chromatogr. B 1060 (2017) 36–43, doi:10.1016/j.jchromb.2017.05.033.
[17] M. Enmark, M. Rova, J. Samuelsson, E. Örnskov, et al., Investigation of factors influencing the separation of diastereomers of phosphorothioated oligonucleotides, Anal Bioanal. Chem. 411 (2019) 3383–3394, doi:10.1007/s00216-019-01813-2.
[18] M. Enmark, S. Harun, J. Samuelsson, E. Örnskov, et al., Selectivity limits of and opportunities for ion pair chromatographic separation of oligonucleotides, J. Chromatogr. A 1651 (2021) 462269, doi:10.1016/j.chroma.2021.462269.
[19] A. Demelenne, M.-J. Gou, G. Nys, C. Parulski, et al., Evaluation of hydrophilic interaction liquid chromatography, capillary zone electrophoresis and drift tube ion-mobility quadrupole time of flight mass spectrometry for the characterization of phosphodiester and phosphorothioate oligonucleotides, J. Chromatogr. A 1614 (2020) 460716, doi:10.1016/j.chroma.2019.460716.
[20] M. Gilar, K.J. Fountain, Y. Budman, U.D. Neue, et al., Ion-pair reversed-phase high-performance liquid chromatography analysis of oligonucleotides: Retention prediction, J. Chromatogr. A 958 (2002) 167–182, doi:10.1016/S0021-9673(02)00306-0.
[21] S. Studzińska, B. Buszewski, Different approaches to quantitative structure–retention relationships in the prediction of oligonucleotide retention, J. Sep. Sci. 38 (2015) 2076–2084, doi:10.1002/jssc.201401395.
[22] M. Sturm, S. Quinten, C.G. Huber, O. Kohlbacher, A statistical learning approach to the modeling of chromatographic retention of oligonucleotides incorporating sequence and secondary structure data, Nucleic Acids Res. 35 (2007) 4195–4202, doi:10.1093/nar/gkm338.
[23] C. Liang, J.-Q. Qiao, H.-Z. Lian, A novel strategy for retention prediction of nucleic acids with their sequence information in ion-pair reversed phase liquid chromatography, Talanta 185 (2018) 592–601, doi:10.1016/j.talanta.2018.04.030.
[24] O. Kohlbacher, S. Quinten, M. Sturm, B.M. Mayr, et al., Structure–Activity Relationships in Chromatography: Retention Prediction of Oligonucleotides with Support Vector Regression, Angew. Chem. Int. Ed. 45 (2006) 7009–7012, doi:10.1002/anie.200602561.
[25] L. Moruz, L. Käll, Peptide retention time prediction, Mass Spec. Rev. 36 (2017) 615–623, doi:10.1002/mas.21488.
[26] M. Gilar, A. Jaworski, P. Olivova, J.C. Gebler, Peptide retention prediction applied to proteomic data analysis, Rapid Commun. Mass Spectrom. 21 (2007) 2813–2821, doi:10.1002/rcm.3150.
[27] O.V. Krokhin, R. Craig, V. Spicer, W. Ens, et al., An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC: its application to protein peptide mapping by off-line HPLC-MALDI MS, Mol. Cell. Proteomics. 3 (2004) 908–919, doi:10.1074/mcp.M400031-MCP200.
[28] K. Petritis, L.J. Kangas, P.L. Ferguson, G.A. Anderson, et al., Use of Artificial Neural Networks for the Accurate Prediction of Peptide Liquid Chromatography Elution Times in Proteome Analyses, Anal. Chem. 75 (2003) 1039–1048, doi:10.1021/ac0205154.
[29] A.A. Klammer, X. Yi, M.J. MacCoss, W.S. Noble, Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions, Anal. Chem. 79 (2007) 6111–6118, doi:10.1021/ac070262k.
[30] J. Samuelsson, F.F. Eiriksson, D. Åsberg, M. Thorsteinsdóttir, et al., Determining gradient conditions for peptide purification in RPLC with machine-learning-based retention time predictions, J. Chromatogr. A 1598 (2019) 92–100, doi:10.1016/j.chroma.2019.03.043.
[31] E. Stellwagen, J.M. Muse, N.C. Stellwagen, Monovalent Cation Size and DNA Conformational Stability, Biochemistry 50 (2011) 3084–3094, doi:10.1021/bi1015524.
[32] J.R. Nilsson, T. Baladi, A. Gallud, D. Baždarević, et al., Fluorescent base analogues in gapmers enable stealth labeling of antisense oligonucleotide therapeutics, Sci Rep. 11 (2021) 11365, doi:10.1038/s41598-021-90629-1.
[33] S.G. Roussis, C. Koch, D. Capaldi, C. Rentel, Rapid oligonucleotide drug impurity determination by direct spectral comparison of ion-pair reversed-phase high-performance liquid chromatography electrospray ionization mass spectrometry data, Rapid Commun. Mass Spectrom. 32 (2018) 1099–1106, doi:10.1002/rcm.8125.
[34] J. Timmons, leshane, Lattice-Automation/seqfold 0.7.7, Zenodo (2021), doi:10.5281/zenodo.4579886.
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al., Scikit-learn: machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[36] M. Newville, T. Stensitzki, D.B. Allen, A. Ingargiola, LMFIT: non-linear least-square minimization and curve-fitting for python, Zenodo (2014), doi:10.5281/zenodo.11813.
