Conclusion and Future Directions for Transformer-Based Chemical Search

by Penicillin, March 6th, 2025

Too Long; Didn't Read

CheSS introduces a transformer-based chemical similarity search leveraging prompt engineering to identify structurally distinct yet functionally similar molecules. This approach expands the scope of computational drug discovery, offering potential for repurposing known compounds and identifying new molecular classes with desirable functionality.

Authors:

(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin;

(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin;

(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin (corresponding author);

(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin.

  1. Abstract & Introduction
  2. Methods
  3. Results and Discussion
  4. Determining Whether Canonicalization Impacts Search Behavior
  5. Explanation of Search Behavior & Drawbacks, Future Improvements, and Potential for Misuse
  6. Conclusion, Acknowledgements, Author Contributions, & more.
  7. Supplementary Figures

4. Conclusion

In this study, we created a chemical similarity search pipeline that uses a transformer encoder-based chemical language model to generate embeddings on which similarity scores can be computed. Building on this pipeline, we designed a prompt engineering strategy that extends existing chemical semantic searches by identifying structurally dissimilar molecules with similar function. We demonstrated the utility of this search method by identifying non-obvious functional compounds related to several different query molecules. This method may aid in repurposing known compounds or in discovering new structural classes of molecules with desirable functionality. Despite potential drawbacks, we believe that CheSS and the canonicalization prompt engineering method discussed herein will be of broad interest to the chemical community as it begins to explore how machine learning can be used beyond staid similarity queries.
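As a rough illustration of the embedding-based ranking step summarized above, the sketch below scores a query embedding against a small database and returns the closest entries. It is not the CheSS implementation: cosine similarity is assumed as the score (this section does not name the metric), the function names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the three-dimensional vectors are made-up placeholders standing in for embeddings that a chemical language model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, database, top_k=3):
    """Return the top_k (identifier, score) pairs, highest score first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in database.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; a real pipeline would obtain these
# from a transformer encoder applied to each molecule's SMILES string.
toy_database = {
    "mol_A": [1.0, 0.0, 0.0],
    "mol_B": [0.6, 0.8, 0.0],
    "mol_C": [0.0, 1.0, 0.0],
}
toy_query = [1.0, 0.1, 0.0]

results = rank_by_similarity(toy_query, toy_database, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Under the canonicalization strategy described in this paper, one would presumably embed several different SMILES representations of the same query molecule and run this ranking once per representation, since each representation yields a different query embedding.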

5. Acknowledgements

The authors acknowledge the Texas Advanced Computing Center at The University of Texas at Austin for providing high-performance computing resources. This work was supported by the Welch Foundation (C.W.K.), the Blumberg Centennial Professorship in Molecular Evolution (C.O.W.), and the Reeder Centennial Fellowship in Systematic and Evolutionary Biology at The University of Texas at Austin (C.O.W.).

6. Author Contributions

Conceptualization, C.W.K. and A.L.F.; Methodology, C.W.K.; Software, C.W.K. and A.L.F.; Validation, C.W.K. and A.L.F.; Formal Analysis, C.W.K.; Investigation, C.W.K.; Resources, A.D.E. and C.O.W.; Data Curation, C.W.K. and A.L.F.; Writing - Original Draft, C.W.K. and A.L.F.; Writing - Review & Editing, C.W.K., A.L.F., A.D.E., and C.O.W.; Visualization, C.W.K. and A.L.F.; Supervision, A.D.E. and C.O.W.; Funding Acquisition, A.D.E. and C.O.W.

7. Declaration of Interests

The authors declare no competing interests.

8. Data and Code Availability

Project code and discussed search results can be found at https://github.com/kosonocky/CheSS.



This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.