Modeling of retention behaviors of ecotoxicity of anilines and phenols by chemometrics models
Environmental hazard is the risk of damage to the environment from, e.g., air pollution, water pollution, toxins, and radioactivity. We studied an extended series of 65 toxic compounds (anilines and phenols) and their chromatographic retention (log k) using quantitative structure–retention relationship (QSRR) methods, which involve correlation analysis and model building. A suitable set of molecular descriptors was calculated, and a genetic algorithm (GA) was employed to select the descriptors that gave the best-fit models. Partial least squares (PLS), kernel partial least squares (KPLS) and a Levenberg-Marquardt artificial neural network (L-M ANN) were used to construct linear and nonlinear QSRR models. The proposed methods are useful for this type of study and could be expected to apply to other similar research fields.
Fathead minnows (Pimephales promelas) are found in every drainage in Minnesota and are the most common species of minnow in the state. They live in many kinds of lakes and streams, but are especially common in shallow, weedy lakes; bog ponds; low-gradient, turbid (cloudy) streams; and ditches. These habitats often have no predators and low oxygen levels, and fatheads are noted for their ability to withstand low oxygen levels. Fatheads commonly occur with white suckers, bluntnose minnows, common shiners, northern redbelly dace, creek chubs, and young-of-the-year black bullheads. Fathead minnows are considered opportunistic feeders. They eat just about anything that they come across, such as algae, protozoa (like amoeba), plant matter, insects (adults and larvae), rotifers, and copepods. In lakes and deeper streams, fatheads are common prey for crappies, rock bass, perch, walleyes, largemouth bass, and northern pike. They are also eaten by snapping turtles, herons, kingfishers, and terns. Eggs of the fathead are eaten by painted turtles and certain large leeches. Although humans do not eat fatheads, they harvest them as bait.
Environmental hazard is a generic term for any situation or state of events which poses a threat to the surrounding natural environment and adversely affects people's health. The term covers topics such as pollution and natural disasters such as storms and earthquakes. Hazards can be categorized into five types: chemical, physical, mechanical, biological and psychosocial. Environmental hazard and risk assessment of chemical substances requires comprehensive information on the exposure, fate and ecotoxicology of the contaminants; however, complete data sets are rarely available. One reason for these deficiencies is that testing capacities are limited, which impedes the thorough experimental investigation of all existing and new chemicals.
To fill at least some of the data gaps, mathematical modeling techniques are used to provide sufficiently accurate substitutes. The models can be used to estimate the parameters related to the fate and effects of chemicals and hence to identify contaminants of special environmental concern and to obtain a ranking of potentially hazardous pollutants. In this way, the priority compounds can then be subjected to detailed testing, and the limited resources for experimental investigations can be directed effectively to the chemicals that are most likely to have an environmental impact. Interest in mathematical modeling techniques also arises from their application as alternatives to animal experiments, in the interests of time-effectiveness, cost-effectiveness and animal welfare.
Alternative methods support the policy of the "Three Rs" (replacement, reduction and refinement of the use of laboratory animals), and several regulatory organizations have been established to investigate and promote alternative methods. Chemical modeling techniques are based on the premise that the structure of a compound determines all its properties. The study of the type of chemical structure of a foreign substance that will interact with a living system and produce a well-defined biological endpoint is commonly referred to as quantitative structure-retention relationships (QSRR) [4-5]. The use of QSRR for toxicity estimation of new chemicals or for regulatory toxicological assessment is increasing, especially in aquatic toxicology. As an alternative to QSRR models, quantitative retention relationships (QRRR) represent another kind of modeling technique, in which chromatographic retention parameters are used as descriptor and/or predictor variables of a given biological response of chemicals. QSRR models using retention factors (log k) obtained by conventional RP-HPLC, micellar liquid chromatography (MLC) and biopartitioning micellar chromatography (BMC) have been reported [6-10].
The aim of the present study is to evaluate the ability of optimal descriptors, combined with linear regression (partial least squares, PLS) and nonlinear regression (kernel partial least squares, KPLS, and a Levenberg-Marquardt artificial neural network, L-M ANN), in QSRR analysis of the logarithm of the retention factor in BMC (log k) for the toxicity of anilines and phenols to fathead minnows. The stability and predictive power of these models were validated using leave-group-out cross-validation (LGO-CV) and an external test set.
Results and Discussion
Results of the GA-PLS model
The best model is selected on the basis of the highest leave-group-out cross-validated squared correlation coefficient (R2), the lowest root mean square error (RMSE) and the lowest relative error (RE). These parameters are among the most widely used measures of how well a model fits the data. The best GA-PLS model contains 23 selected descriptors in an 11-latent-variable space. The R2 values for the training and test sets were 0.788 and 0.709, and the mean RE values were 15.49 and 22.88, respectively. The predicted values of log k are plotted against the experimental values for the training and test sets in Figure 1. In general, the number of components (latent variables) in PLS analysis is smaller than the number of independent variables. The PLS model uses a larger number of descriptors, which allows it to extract more structural information from the descriptors and so reach a lower prediction error.
Figure 1: Plots of predicted log k against the experimental values by GA-PLS model.
Results of the GA-KPLS model
In this paper, a radial basis kernel function, $k(x, y) = \exp(-\lVert x - y \rVert^2 / c)$, was selected as the kernel function, with $c = r m \sigma^2$, where $r$ is a constant that can be determined by considering the process to be predicted (here $r$ was set to 1), $m$ is the dimension of the input space and $\sigma^2$ is the variance of the data [11-12]. This means that the value of $c$ depends on the system under study. The GA-KPLS feature selection method chose 16 descriptors in a 9-latent-variable space. The R2 values for the training and test sets were 0.811 and 0.754, and the mean RE values were 13.08 and 19.70, respectively. These results show that the statistics of the GA-KPLS model are superior to those of the GA-PLS model. Figure 2 shows the plot of the GA-KPLS predicted versus experimental values of log k for all of the molecules in the data set.
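The kernel described above can be sketched in a few lines. This is only an illustration under stated assumptions: it takes the usual Gaussian form with a negative exponent and the width c = r·m·σ², and the function names (`kernel_width`, `rbf_kernel`, `kernel_matrix`) are hypothetical, not taken from the study.

```python
import math


def kernel_width(r, m, variance):
    """Kernel width c = r * m * sigma^2 (assumed form of the width)."""
    return r * m * variance


def rbf_kernel(x, y, c):
    """Radial basis kernel k(x, y) = exp(-||x - y||^2 / c)."""
    if len(x) != len(y):
        raise ValueError("descriptor vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / c)


def kernel_matrix(rows, c):
    """Symmetric matrix of kernel values between all pairs of molecules."""
    return [[rbf_kernel(a, b, c) for b in rows] for a in rows]
```

With r = 1 and 16 descriptors, `kernel_width(1, 16, variance)` gives the width for a given data variance; the diagonal of the kernel matrix is always 1 because each molecule has zero distance to itself.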
Results of the L-M ANN model
With the aim of improving the predictive performance of the nonlinear QSRR model, L-M ANN modeling was performed. The networks were generated using the sixteen descriptors appearing in the GA-KPLS models as their inputs and log k as their output. For ANN generation, the data set was separated into three groups: calibration and prediction sets (together forming the training set) and a test set. All molecules were randomly placed in these sets. A three-layer network with a sigmoid transfer function was designed for each ANN. Before training the networks, the input and output values were normalized between -1 and 1.
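The data preparation described above (random assignment to three sets and scaling to the range -1 to 1) might look like the following minimal sketch. The helper names, the fixed seed and the min-max form of the scaling are assumptions made for illustration; the default fractions follow the 60/20/20 proportions reported in the validation section.

```python
import random


def split_three_way(items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split items into calibration, prediction and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_cal = round(fractions[0] * n)
    n_pred = round(fractions[1] * n)
    calibration = shuffled[:n_cal]
    prediction = shuffled[n_cal:n_cal + n_pred]
    test = shuffled[n_cal + n_pred:]  # whatever remains
    return calibration, prediction, test


def scale_to_unit_range(values):
    """Min-max scale a list of numbers linearly onto [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

For 65 compounds the default fractions yield 39, 13 and 13 members. In practice the scaling limits would be taken from the training data and reused for the test set, a detail omitted here for brevity.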
Figure 2: Plots of predicted log k versus the experimental values by GA-KPLS model.
The network was then trained on the training set using the back-propagation strategy to optimize the weights and bias values. The proper number of nodes in the hidden layer was determined by training the network with different numbers of hidden nodes. The root-mean-square error (RMSE) measures how close the outputs are to the target values. To avoid overfitting, training of the network for the prediction of log k must stop when the RMSE of the prediction set begins to increase while the RMSE of the calibration set continues to decrease. Therefore, training was stopped when overtraining began. All of the above steps were carried out using basic back-propagation, conjugate gradient and Levenberg-Marquardt weight update functions. The RMSE values for the training and test sets were lowest when three neurons were used in the hidden layer. Finally, the number of iterations was optimized with the other variables at their optimum values; the RMSE for the prediction set reached its minimum after 18 iterations. The R2 values for the calibration, prediction and test sets were 0.976, 0.945 and 0.887, and the mean relative errors were 4.14, 5.21 and 8.39, respectively. Comparison of these values and the other statistical parameters with those of the GA-PLS and GA-KPLS models reveals the superiority of the L-M ANN model. The key strength of neural networks, unlike regression analysis, is their ability to map the selected features flexibly by manipulating their functional dependence implicitly. The statistical parameters reveal the high predictive ability of the L-M ANN model. Taken together, these data show a significant improvement of the QSRR model as a result of nonlinear statistical treatment. Plots of predicted versus experimental log k values obtained by L-M ANN for the training and test sets are shown in Figure 3a and 3b.
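The stopping rule described above can be illustrated with a small helper that scans the prediction-set RMSE recorded after each iteration. It is a sketch only: the function name and the `patience` parameter are hypothetical, and the study does not state how many non-improving iterations were tolerated.

```python
def early_stopping_iteration(prediction_rmse, patience=1):
    """Return the 1-based iteration with the lowest prediction-set RMSE.

    The scan stops once the RMSE has failed to improve for `patience`
    consecutive iterations, i.e. when overtraining is assumed to begin.
    Returns 0 for an empty history.
    """
    best_iteration = 0
    best_rmse = float("inf")
    waited = 0
    for iteration, rmse in enumerate(prediction_rmse, start=1):
        if rmse < best_rmse:
            best_rmse = rmse
            best_iteration = iteration
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break  # prediction error has started to rise
    return best_iteration
```

A larger `patience` makes the rule less sensitive to a single noisy iteration, at the cost of training slightly longer before stopping.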
There is close agreement between the experimental and predicted log k values, and the data show very little scatter around a straight line with slope and intercept close to one and zero, respectively. As can be seen in this section, the L-M ANN is more reproducible than the other models for modeling the log k of the compounds.
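The agreement between predicted and experimental values can be checked numerically by fitting a straight line through the predicted-versus-experimental points. The sketch below uses ordinary least squares and the squared Pearson correlation as R2; the function name is hypothetical, and the study does not state which exact R2 definition was used.

```python
def fit_predicted_vs_experimental(experimental, predicted):
    """Least-squares line predicted = slope * experimental + intercept.

    Returns (slope, intercept, r_squared), where r_squared is the squared
    Pearson correlation between the two series.
    """
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two series of equal length with >= 2 points")
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(experimental, predicted))
    if sxx == 0 or syy == 0:
        raise ValueError("both series must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared
```

A well-calibrated model gives a slope near one, an intercept near zero and an R2 near one.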
Figure 3: Plot of predicted log k obtained by L-M ANN against the experimental values (a) for training set and (b) test set.
Model validation and statistical parameters
The accuracy of the proposed models was assessed using evaluation techniques such as the leave-group-out cross-validation (LGO-CV) procedure and validation through an external test set. In addition, the chance correlation procedure is a useful method for investigating the accuracy of the resulting model, since it shows whether or not the results were obtained by chance. Cross-validation is a popular technique used to explore the reliability of statistical models. In this technique, a number of modified data sets are created by deleting, in each case, one object or a small group of objects (leave-some-out). For each data set, an input-output model is developed with the chosen modeling technique. Each model is evaluated by measuring its accuracy in predicting the responses of the remaining data (the object or group that was not used in the development of the model). In particular, the LGO-CV procedure was used in this study: a QSRR model was constructed on the basis of the reduced data set and subsequently used to predict the removed data, and this procedure was repeated until a complete set of predicted values was obtained. The data set was divided into three subsets: calibration and prediction sets (together forming the training set) and a test set. The calibration set was used for model generation. The prediction set was used to guard against overfitting of the network, whereas the test set, whose molecules play no role in model building, was used to evaluate the predictive ability of the models for an external set.
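The leave-group-out procedure described above can be sketched as follows. The group assignment scheme (shuffle, then deal indices into groups), the seed, and the `fit`/`predict` callables are illustrative assumptions; any modeling technique can be plugged in through those two callables.

```python
import random


def leave_group_out_folds(n_samples, n_groups, seed=0):
    """Yield (training_indices, held_out_indices) for each group in turn."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_groups nearly equal groups.
    groups = [indices[g::n_groups] for g in range(n_groups)]
    for group in groups:
        held_out = set(group)
        training = [i for i in indices if i not in held_out]
        yield training, sorted(group)


def cross_validated_predictions(x, y, fit, predict, n_groups, seed=0):
    """Predict every sample from a model built without that sample's group."""
    predictions = [None] * len(y)
    for training, held_out in leave_group_out_folds(len(y), n_groups, seed):
        model = fit([x[i] for i in training], [y[i] for i in training])
        for i in held_out:
            predictions[i] = predict(model, x[i])
    return predictions
```

After the loop finishes, every compound has exactly one prediction made by a model that never saw it, which is the "complete set of predicted values" the procedure aims for.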
In other words, the best model is found using the training set, and its predictive power is then checked with the test set as an external data set. In this work, 60% of the database was used for the calibration set, 20% for the prediction set and 20% for the test set, assigned randomly (in each program run, of the 65 compounds, 39 are in the calibration set, 13 in the prediction set and 13 in the test set). The results clearly show a significant improvement of the QSRR model as a result of nonlinear statistical treatment and a substantial independence of the model prediction from the structure of the test molecule. In the above analysis, the descriptive power of a given model has been measured by its ability to predict log k of unknown compounds. For the constructed models, two general statistical parameters were selected to evaluate the ability of the model to predict log k values. For this purpose, the predicted log k of each sample in the prediction step was compared with the experimental log k. The root mean square error of prediction (RMSE) is a measure of the average difference between predicted and experimental values at the prediction stage. The RMSE can be interpreted as the average prediction error, expressed in the same units as the original response values. The RMSE was obtained using the following formula:
$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 \right]^{1/2}$$
The second statistical parameter was the relative error of prediction (RE), which shows the predictive ability for each component and is calculated as:
$$\mathrm{RE}\,(\%) = \frac{100}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left| y_i \right|}$$
where $y_i$ is the experimental log k value of the anilines and phenols in sample $i$, $\hat{y}_i$ represents the predicted log k value in sample $i$, $\bar{y}$ is the mean of the experimental log k values in the prediction set and $n$ is the total number of samples used in the test set.
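The two error measures can be computed as in the following minimal sketch. It assumes the usual definitions (root of the mean squared difference for RMSE, and the mean absolute difference relative to the experimental value, in percent, for RE); the function names are hypothetical.

```python
import math


def rmse(experimental, predicted):
    """Root mean square error between experimental and predicted values."""
    n = len(experimental)
    if n == 0 or n != len(predicted):
        raise ValueError("need two non-empty series of equal length")
    squared = sum((p - e) ** 2 for e, p in zip(experimental, predicted))
    return math.sqrt(squared / n)


def mean_relative_error_percent(experimental, predicted):
    """Mean of |predicted - experimental| / |experimental|, in percent."""
    n = len(experimental)
    if n == 0 or n != len(predicted):
        raise ValueError("need two non-empty series of equal length")
    if any(e == 0 for e in experimental):
        # log k can be zero; the relative error is undefined there.
        raise ValueError("relative error is undefined for a zero value")
    total = sum(abs(p - e) / abs(e) for e, p in zip(experimental, predicted))
    return 100.0 * total / n
```

Because log k values can lie close to zero, the relative error can become very large for individual compounds even when the absolute error is small, so the two measures are best read together.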
The GA-PLS, GA-KPLS and L-M ANN models were applied to the prediction of the log k values of anilines and phenols in relation to their ecotoxicity. High correlation coefficients and low prediction errors confirmed the good predictive ability of the models. All methods seemed to be useful, although a comparison between them revealed the superiority of the L-M ANN over the other models. Application of the developed model to a test set of 13 compounds demonstrates that the new model is reliable, with good predictive accuracy and a simple formulation. The QSRR procedure provided a precise and relatively fast method for determining the log k of different series of these compounds and for predicting, with sufficient accuracy, the log k of new substituted compounds.