Document Type: Original Articles
Authors
1 Department of Comparative Bioscience, Faculty of Veterinary Medicine, University of Tehran, Tehran, Iran
2 Department of Comparative Bioscience, Faculty of Veterinary Medicine, University of Tehran, Tehran, Iran
3 Department of Mechanical Engineering, University of Tehran, Tehran, Iran
4 Laboratory of Bioinformatics and Drug Design, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
5 National Cell Bank of Iran, Pasteur Institute of Iran, Tehran, Iran
DOI: 10.22059/ijvm.2026.407791.1005987
Abstract
Background: Safety assessment should characterize the hazardous effects of a test substance on macromolecules in organs and organisms. Genotoxicity testing is one of the most important parts of toxicity assessment, and the in vitro micronucleus assay is the most common method. Toxicity tests generally require a heavy investment of time and money. A computational (in silico) model for screening chemicals before confirmatory testing could therefore save considerable time and expense.
Objectives: This study aimed to develop an interpretable quantitative structure-activity relationship (QSAR) machine learning model for predicting micronucleus induction in cells exposed to chemicals such as drugs, pesticides, and cosmetics.
Methods: A comprehensive dataset of 390 chemicals and their micronucleus test results was compiled from reputable sources. The dataset was imbalanced, which was addressed with the Synthetic Minority Oversampling Technique (SMOTE). Chemical descriptors were calculated with RDKit. Several machine learning models were trained, and each was evaluated for predictive performance on the test set. The models were then compared with one another, and the best-performing model was selected.
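The oversampling step named in the Methods can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. This is not the authors' implementation. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the sketch uses only the Python standard library rather than any particular oversampling package.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Hypothetical helper for illustration: `minority` is a list of
    equal-length numeric feature vectors (e.g. descriptor values).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy example (made-up data): 4 minority samples versus 10 majority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority_count = 10
new_points = smote(minority, n_new=majority_count - len(minority), k=3)
balanced_minority = minority + new_points
```

Oversampling of this kind is conventionally applied to the training split only, so that the test set contains no synthetic samples. The abstract does not state how the split and oversampling were ordered in this study.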
Results: Toxicity predictions were made with Random Forest, Quadratic Discriminant Analysis, and Stacking models, which achieved accuracies of 79%, 74%, and 74% and AUC scores of 79%, 73%, and 74%, respectively.
Conclusion: Random Forest (RF) was the best model. Using the top ten descriptors, it yielded an accuracy of 79%, an F1 score of 75%, an AUC-ROC of 79%, and a precision of 88%. The key descriptors and the final RF model were interpreted with SHAP and LIME to provide insight that can inform the assessment of new chemicals. The most important features were Kappa3, MinEStateIndex, and MolLogP.
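The abstract reports precision and F1 but not recall. Because F1 is the harmonic mean of precision and recall, the recall implied by the reported figures can be estimated, as sketched below. The function names are hypothetical, and since the reported values are rounded to whole percentages, the result is only an approximation and not a figure stated by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def recall_from_f1_and_precision(f1, precision):
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1 and precision")
    return f1 * precision / denominator


# Rounded values reported in the abstract for the Random Forest model.
reported_precision = 0.88
reported_f1 = 0.75

implied_recall = recall_from_f1_and_precision(reported_f1, reported_precision)
print(f"implied recall: {implied_recall:.2f}")
```

Under these rounded inputs the implied recall is about 0.65. If that holds, the model's positive predictions are usually correct, but it misses a noticeable share of micronucleus-inducing compounds, which matters for a screening tool.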
Keywords