Authors:
(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]);
(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]);
(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin, corresponding author ([email protected]);
(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]).
Chemical similarity searches are widely used in silico methods for identifying new drug-like molecules. These methods have historically relied on structure-based comparisons to compute molecular similarity. Here, we use a chemical language model to create a vector-based chemical search. We extend prior implementations with a prompt engineering strategy that uses two different chemical string representation algorithms: one for the query and one for the database. We explore this method by reviewing the search results from five drug-like query molecules (penicillin G, nirmatrelvir, zidovudine, lysergic acid diethylamide, and fentanyl) and three dye-like query molecules (acid blue 25, avobenzone, and 2-diphenylaminocarbazole). We find that this novel method identifies molecules that are functionally similar to the query, as indicated by the associated patent literature, and that many of these molecules are structurally distinct from the query, making them unlikely to be found with traditional chemical similarity search methods. This method may aid in the discovery of novel structural classes of molecules that achieve target functionality.
Keywords: Drug Discovery · Machine Learning · Chemical Similarity Search · Prompt Engineering · SMILES
Small molecules have numerous and widespread applications in modern society, including the treatment of heritable disease, the inhibition of pathogens, and the generation of functional materials for electronics and consumer goods. Molecular function emerges from structure, but predicting function from first principles is not always straightforward, as function also depends on the target molecule [1]. Traditionally, exploration of natural products has led to the identification of vital pharmaceuticals and specialty chemicals [2–5]. These first-generation molecules act as starting points from which new molecules are engineered to enhance desired functionality [6]. Structural neighbors often share similar functionality, as the relevant chemistry may be unchanged or improved [7]. However, molecules with low structural similarity can nevertheless act on the same target, as is the case with morphine and fentanyl at the µ-opioid receptor [8].
There are numerous contemporary approaches to applying machine learning to chemistry [9–16]. Among these, the application of language models has led to surprising success in predicting biochemical features such as drug-likeness and protein-ligand interactions [11, 13]. These methods require string representations of molecules, most commonly the Simplified Molecular-Input Line-Entry System (SMILES) [17]. Language models are often trained in an unsupervised manner, with the training objective tied to sequence reconstruction, i.e., feeding the model a masked or partial input with the goal of reconstructing the original sequence. It was recently demonstrated that a chemical language model, though trained only on SMILES strings, correctly predicted complex biophysical and quantum-chemical properties [18]. This points to the possibility that these models develop a chemical latent space that allows for the emergence of higher-order biochemical comprehension.
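To make the feature-extraction step concrete, here is a minimal sketch of computing a fixed-length vector from a SMILES string with a pretrained masked chemical language model. The model checkpoint and the mean-pooling strategy are illustrative assumptions, not the configuration used in this work.

```python
# Minimal sketch: embedding a SMILES string with a masked chemical
# language model. Model choice and pooling are assumptions for
# illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "seyonec/ChemBERTa-zinc-base-v1"  # assumed CLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_smiles(smiles: str) -> torch.Tensor:
    """Return a fixed-length feature vector for one SMILES string."""
    tokens = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # mean-pool over tokens

vec = embed_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
print(vec.shape)
```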
Recently, computationally generated chemical libraries have grown to surpass 37 billion commercially available compounds [19]. This marked growth has given rise to a new field of computational pre-screening of chemicals to support resource-efficient discovery in the laboratory [20]. One primary class of computational pre-screening methods is chemical similarity searches. These methods have historically used structure-based comparisons, a notable example being the fingerprint Tanimoto search, which computes a ranked list of molecules ordered by molecular substructure similarity to a given query [21, 22].
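For reference, such a structure-based fingerprint Tanimoto search can be sketched in a few lines with RDKit; the query and database molecules below are placeholders chosen for illustration.

```python
# Minimal sketch of a fingerprint Tanimoto search, the classical
# structure-based baseline described above. Molecules are placeholders.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str):
    """Morgan (ECFP4-like) bit-vector fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

query_fp = morgan_fp("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
database = {
    "salicylic acid": "O=C(O)c1ccccc1O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

# Rank database molecules by Tanimoto similarity to the query.
ranked = sorted(
    ((name, DataStructs.TanimotoSimilarity(query_fp, morgan_fp(smi)))
     for name, smi in database.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```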
Chemical language models (CLMs) have been applied to drug discovery, in particular to de novo molecule generation and chemical similarity searches. De novo methods generate novel molecules with the decoder portion of a language model after fine-tuning toward a specific molecule or target [14–16, 18]. Building on the recent success of generative models such as GPT, de novo molecule generation has shown promise but is limited by a lack of generalizability and of guaranteed synthesizability [23, 24]. In contrast, a chemical similarity search based on a CLM offers computational speed, generalizability, and tight control over the database, which can be restricted to synthesizable compounds. Sellner et al. recently created a novel transformer-based chemical similarity search with an optimized loss function that approximates previous structure-based methods [12]. However, this method tends to identify molecules with high structural similarity to the query molecule, when instead we would often like to find structurally distinct functional analogues. Such a CLM-based search does not currently exist.
Here, we describe a chemical similarity search based on a CLM that identifies molecules with similar function to a given query molecule. This method works by calculating the CLM-computed feature vector similarity between a query SMILES string and a chemical database. Keeping the SMILES canonicalization algorithm constant between the query and the database resulted in a chemical language search that approximated recent transformer-based chemical search methods. However, we found that when the query SMILES string was canonicalized with a different algorithm than was used for the database, the reliance on structural similarity diminished while functional similarity was retained. This behavior seemed reasonable given literature reports that models learn to better represent chemical space when SMILES strings are randomized during training, and that predominantly English-trained models show an emergent understanding of underrepresented languages [25–28]. We utilize alternatively canonicalized queries as a novel prompt engineering strategy to identify structurally distinct functional analogues of small molecules. Our method fundamentally differs from existing work in that SMILES augmentation is applied to the query of a chemical similarity search rather than to model training or fine-tuning for a specific task. We tested our method across three canonicalizations and eight query molecules and found that, with increasingly divergent canonicalizations, we were able to identify structurally distinct functional analogues.
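The following is a minimal sketch of the cross-canonicalization search idea described above. It assumes the embed_smiles helper from the earlier sketch and uses Open Babel's canonical SMILES writer as an example of a second canonicalization algorithm; the specific canonicalizers and model used in this work may differ.

```python
# Minimal sketch of a CLM similarity search in which the query and the
# database are canonicalized with *different* algorithms. Assumes the
# embed_smiles() helper from the earlier sketch; Open Babel stands in as
# an example alternative canonicalizer.
import torch
from openbabel import pybel  # e.g., pip install openbabel-wheel
from rdkit import Chem

def rdkit_canonical(smiles: str) -> str:
    """Canonical SMILES via RDKit (used here for the database)."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

def obabel_canonical(smiles: str) -> str:
    """Canonical SMILES via Open Babel (used here for the query)."""
    return pybel.readstring("smi", smiles).write("can").split()[0]

def search(query_smiles: str, database_smiles: list[str], top_k: int = 5):
    """Rank database molecules by cosine similarity to the query vector."""
    q_vec = embed_smiles(obabel_canonical(query_smiles))
    db_vecs = torch.stack(
        [embed_smiles(rdkit_canonical(s)) for s in database_smiles]
    )
    scores = torch.nn.functional.cosine_similarity(
        q_vec.unsqueeze(0), db_vecs, dim=1
    )
    best = torch.topk(scores, k=min(top_k, len(database_smiles)))
    return [(database_smiles[int(i)], float(scores[i])) for i in best.indices]
```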
This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.