This study addresses the challenge of predicting lipophilicity (logD) and acidity/basicity (pKa) for saturated fluorine-containing derivatives, two properties that are crucial in drug discovery. The authors compiled a specialized dataset of fluorinated and non-fluorinated compounds and evaluated more than 40 machine learning models, including linear models, tree-based models, and neural networks. A substructure mask explanation (SME) approach confirmed the importance of fluorine substitution for these properties. The models and datasets were made publicly available as a GitHub repository, pip and conda packages, and a KNIME node, so researchers can use them for molecular design.
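Comparing many candidate models usually comes down to scoring each one on held-out data with a shared metric. The sketch below shows one common way to do that, k-fold cross-validation ranked by RMSE, using only the Python standard library. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`cross_val_rmse`, `rank_models`), the two stand-in models, and the synthetic data are all hypothetical, and the summary does not say which validation scheme or metric the study used.

```python
import math
import random
from statistics import fmean


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_rmse(fit, X, y, k=5, seed=0):
    """Mean held-out RMSE; fit(X_train, y_train) must return predict(x)."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            rmse([y[i] for i in test_idx], [predict(X[i]) for i in test_idx])
        )
    return fmean(scores)


def fit_mean_baseline(X, y):
    """Stand-in model 1: always predict the training mean."""
    mean_y = fmean(y)
    return lambda x: mean_y


def fit_one_feature_linear(X, y):
    """Stand-in model 2: ordinary least squares on the first feature only."""
    xs = [row[0] for row in X]
    mx, my = fmean(xs), fmean(y)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x[0]


def rank_models(models, X, y, k=5):
    """Return (name, mean RMSE) pairs sorted from best to worst."""
    scores = {name: cross_val_rmse(fit, X, y, k) for name, fit in models.items()}
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Synthetic, noise-free data for demonstration only (not from the study).
    X = [[float(i)] for i in range(20)]
    y = [2.0 * i + 1.0 for i in range(20)]
    candidates = {"mean": fit_mean_baseline, "linear": fit_one_feature_linear}
    for name, score in rank_models(candidates, X, y):
        print(name, score)
```

In practice each entry in `candidates` would wrap a real learner (a linear model, a tree ensemble, a neural network) behind the same `fit` interface, so every model is scored on identical folds.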
Takeaways:
- The study focuses on predicting logD and pKa for fluorinated compounds; both properties significantly affect pharmacological activity, bioavailability, metabolism, and toxicity.
- Standard prediction methods for these properties struggle with fluorine-containing molecules due to limited experimental data.
- The authors compiled a dataset of fluorinated and non-fluorinated derivatives with experimental logD and pKa values.
- More than 40 machine learning models, including linear models, tree-based models, and neural networks, were trained or fine-tuned for optimal prediction accuracy.
- A substructure mask explanation (SME) technique validated the role of fluorinated groups in influencing these properties.
- The models and datasets were open-sourced as a GitHub repository, pip and conda packages, and a KNIME node, making them accessible for further research and application.
- The study was supported by Blackthorn AI Ltd., Enamine Ltd., and the Ministry of Education and Science of Ukraine.
- Some authors are employees of Blackthorn AI Ltd. and Enamine Ltd., which presents a potential conflict of interest.
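The substructure mask explanation (SME) idea mentioned in the takeaways can be sketched in a few lines: a substructure's attribution is taken as the model's prediction for the whole molecule minus its prediction with that substructure masked out. The code below is a minimal sketch under strong assumptions, not the authors' implementation. A molecule is reduced to a list of fragment labels, "masking" is approximated by dropping a fragment from that list, and `toy_predict` with its contribution table is a hypothetical additive stand-in for a trained model. The numeric values are arbitrary placeholders, not measured or reported data.

```python
import math


def toy_predict(fragments, contributions, intercept=0.0):
    """Hypothetical additive model: intercept plus one term per fragment."""
    return intercept + sum(contributions.get(frag, 0.0) for frag in fragments)


def sme_attributions(fragments, predict):
    """Attribution of each fragment = predict(full) - predict(fragment masked).

    Masking is approximated here by removing the fragment from the list;
    a graph-based model would mask the corresponding atoms instead.
    """
    full = predict(fragments)
    attributions = []
    for i, frag in enumerate(fragments):
        masked = fragments[:i] + fragments[i + 1:]
        attributions.append((frag, full - predict(masked)))
    return attributions


if __name__ == "__main__":
    # Arbitrary placeholder contributions, chosen only to make the demo run.
    placeholder = {"CF3": 0.5, "NH2": -1.0, "ring": 2.0}
    molecule = ["CF3", "NH2", "ring"]
    for frag, value in sme_attributions(
        molecule, lambda frags: toy_predict(frags, placeholder)
    ):
        print(frag, value)
```

For a purely additive model like this one, each attribution simply recovers the fragment's own contribution. The approach becomes informative with non-additive models, where masking a fluorinated group can shift the prediction by more or less than any single fixed term.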