Authors:
(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin;
(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin;
(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin (corresponding author);
(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin.
A CheSS search using each of three different query canonicalizations was conducted on eight molecules of known function: penicillin G, nirmatrelvir, zidovudine, lysergic acid diethylamide (LSD), fentanyl, acid blue 25 free acid (acid blue 25 FA), avobenzone, and 2-diphenylaminocarbazole (2-dPAC) (Fig. 2). These molecules were chosen because they are of roughly equal chemical complexity but otherwise represent two distinct classes of molecules: drug-like bioactive molecules and non-drug-like photochemical molecules (herein referred to as dye-like). In addition, the molecules were all structurally dissimilar from one another, as determined by a pairwise fingerprint Tanimoto coefficient (Tc) of less than 0.60 (Fig. S1). That said, acid blue 25 FA was more similar to the drug-like molecules, whereas the dyes avobenzone and 2-dPAC were both highly dissimilar from all other query molecules (Fig. S1).
For each query molecule, several similarity metrics were calculated between the three canonicalizations (pairwise comparisons, Fig. 3). Gestalt pattern matching, a string similarity metric, showed that each query canonicalized into different strings, with a mean pairwise value of 0.47 across canonicalizations (n=8) (Fig. 3). Because the CLM does not directly receive strings as inputs, but instead receives the tokenized representations (integer-mapped subsections) of strings, the token vectors were analyzed to understand how these strings would be presented to the model. Tanimoto similarity, a metric comparing shared elements between two sets, was applied to the token vectors and indicated that the query strings were converted using markedly different input tokens (mean pairwise value of 0.69 across canonicalizations) (Fig. 3). Token vector lengths also varied, with some queries differing by almost a factor of 2 depending on canonicalization (Fig. 3). Changes in the token vectors cause differences in featurization, that is, in the model’s interpretation of the input. Accordingly, different embeddings were obtained depending on canonicalization (mean pairwise feature cosine similarity of 0.66), indicating that the model interpreted different canonicalizations of the same molecule as quite distinct inputs (Fig. 3).
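The pairwise metrics described above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the authors' analysis code: the character-level tokenizer and the four-dimensional feature vectors are assumptions standing in for the CLM's real tokenizer and embeddings, and `difflib.SequenceMatcher` is used as one readily available implementation of Gestalt pattern matching.

```python
import math
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Gestalt-style string similarity via difflib, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def tanimoto(a, b) -> float:
    """Tanimoto coefficient between the sets of elements in two token vectors."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Two valid SMILES strings for the same molecule (toluene).
smiles_a = "Cc1ccccc1"
smiles_b = "c1ccc(C)cc1"

# ASSUMPTION: a toy character-level tokenizer, not the CLM's real vocabulary.
tokens_a = [ord(ch) for ch in smiles_a]
tokens_b = [ord(ch) for ch in smiles_b]

# ASSUMPTION: made-up 4-dimensional embeddings, for illustration only.
features_a = [0.2, 0.7, 0.1, 0.4]
features_b = [0.6, 0.1, 0.5, 0.3]

print("string similarity:", gestalt_similarity(smiles_a, smiles_b))
print("token Tanimoto:", tanimoto(tokens_a, tokens_b))
print("token length ratio:", len(tokens_b) / len(tokens_a))
print("feature cosine:", cosine_similarity(features_a, features_b))
```

Even in this toy case, two strings encoding the same molecule differ in string similarity, token set, and token vector length, which is the effect the comparison in Fig. 3 quantifies for the real canonicalizations.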
To explore how different canonicalizations impact feature-based search behavior, similarity metrics were obtained comparing each canonicalized query to its respective top 20 CheSS search results. Queries canonicalized with RDKit Atom 0 yielded compounds high in structural similarity, as evidenced by fingerprint Tanimoto similarity, a measure of molecular substructure similarity, with a mean coefficient of 0.62 (n=160) (Fig. 4d). In contrast, the mean fingerprint Tanimoto coefficients for RDKit Atom n and OEChem were 0.45 and 0.32, respectively. The differences between these searches were also apparent in that, for RDKit Atom 0 canonicalized queries, 22% of the top results could have been found from a fingerprint Tanimoto search using a cutoff as high as 0.80, indicating that nearly a quarter of the results were one or two atomic changes away from the query molecule (Fig. 4d). In contrast, only 6% and 2% of the top results for RDKit Atom n and OEChem, respectively, could have been found from this same search, indicating significant structural divergence (Fig. 4d).
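The two summary statistics used in this comparison (the mean coefficient, and the share of results that a thresholded fingerprint search would also have recovered) can be computed as in the sketch below. The function name `summarize_tc`, the example coefficients, and the choice of an inclusive cutoff (Tc ≥ 0.80) are assumptions made for illustration; the values are not taken from the paper's data.

```python
from statistics import mean


def summarize_tc(tc_values, cutoff=0.80):
    """Return (mean Tc, fraction of results with Tc >= cutoff)."""
    if not tc_values:
        raise ValueError("tc_values must not be empty")
    hits = sum(1 for tc in tc_values if tc >= cutoff)
    return mean(tc_values), hits / len(tc_values)


# ASSUMPTION: hypothetical fingerprint Tanimoto coefficients between one
# query and its top results; real values would come from a fingerprint library.
example_tc = [0.91, 0.84, 0.66, 0.58, 0.47, 0.33]

mean_tc, fraction_recoverable = summarize_tc(example_tc)
print(f"mean Tc = {mean_tc:.2f}")
print(f"fraction with Tc >= 0.80 = {fraction_recoverable:.2f}")
```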
At a more granular level, these structural differences are well illustrated by the penicillin G query, in which β-lactam-containing results gradually diminish as canonicalization diverges (Figs. 4a, 4b, 4c): all of the top 8 hits for the RDKit Atom 0 canonicalization contained β-lactams, while progressively fewer lactams were found for RDKit Atom n and OEChem. This trend in diverging structure was partially explained by Gestalt pattern matching similarity, in which the mean scores for RDKit Atom 0, RDKit Atom n, and OEChem were 0.86, 0.65, and 0.44, respectively, indicating that the average RDKit Atom 0 top result was a simple string permutation away from the original query, and thus also a simple structural modification away, but this was not the case for the alternate canonicalizations (Fig. 4d).
The token vector Tanimoto similarity demonstrated that RDKit Atom 0 canonicalization returned molecules that shared a high number of tokens with the query (mean of 0.80), whereas this was not the case with RDKit Atom n and OEChem (means of 0.65 and 0.60, respectively), indicating that the model’s ability to memorize tokens to determine feature similarity was reduced by different canonicalizations (Fig. 4d). These results point to the possibility that the model utilizes non-obvious relationships to determine the alternative canonicalization’s location in feature space.
Interestingly, the token vector length ratios for all canonicalizations’ results fell within about 20% of each query’s token vector length, indicating that the model heavily utilized token vector length to determine feature space location and thus similarity (Fig. 4d). This means that token vector length may constrain CLM-based similarity searches to confined regions of chemical space, with alternative canonicalizations acting as ways to bypass this predominant search criterion and thereby explore more distant regions of chemical space through variations in token vector length, ultimately allowing for more comprehensive and far-reaching similarity searches (Figs. 4e, 4f).
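As a rough sketch of the length-ratio check described above, the snippet below computes the ratio of a result's token vector length to the query's and tests whether it falls within a tolerance of 20%. The helper names, the tolerance parameter, and the token vectors are hypothetical and only illustrate the calculation.

```python
def length_ratio(query_tokens, result_tokens) -> float:
    """Ratio of the result's token vector length to the query's."""
    if not query_tokens:
        raise ValueError("query_tokens must not be empty")
    return len(result_tokens) / len(query_tokens)


def within_tolerance(query_tokens, result_tokens, tol=0.20) -> bool:
    """True if the result's length is within `tol` of the query's length."""
    return abs(length_ratio(query_tokens, result_tokens) - 1.0) <= tol


# ASSUMPTION: placeholder token vectors; only their lengths matter here.
query = list(range(40))
results = [list(range(n)) for n in (36, 41, 47, 62)]

for tokens in results:
    print(len(tokens), round(length_ratio(query, tokens), 2),
          within_tolerance(query, tokens))
```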
To begin to determine the functional significance of the search results, patent and literature searches were conducted on the top 20 results from each search. In general, functionally drug-like queries returned high levels of drug-like molecules and few dye-like molecules, and conversely, queries on functionally dye-like molecules returned more dye-like molecules and many fewer drug-like molecules (Figs. 5a, 5b). An exception to the latter statement was the set of molecules identified using OEChem queries, but this skew was due almost solely to results from the avobenzone search (avobenzone in turn has known biological activity [46, 47]). In general, the baseline of random drug-like molecules returned for drug-based queries exists, but is relatively low.
The functional similarity of query results was contrasted with fingerprint Tanimoto similarity (Fig. 5c). The categorization of functionality was either positive (known relevant functionality to the query) or negative (no known relevant functionality), and the structural similarity was either positive (Tc ≥ 0.60) or negative (Tc < 0.60). Criteria / ontologies for similar functionality for each molecule were as follows: penicillin G: antibiotic [3]; nirmatrelvir: protease inhibitor or antiviral [36]; zidovudine: antiviral [37]; LSD: 5-HT receptor agonist or dopaminergic agonist [38, 39]; fentanyl: opioid analgesic or muscarinic receptor agonist [40–42]; acid blue 25 FA: dye or electroluminescent; avobenzone: UV absorption or electroluminescent [43, 44]; 2-dPAC: electroluminescent [45]. To illustrate, for OEChem-canonicalized nirmatrelvir (a SARS-CoV-2 main protease inhibitor), several top results (7, 8, 16, and 17) were classified as positive, as these compounds were known protease inhibitors and / or antivirals (Figs. 6h, 6i, 6j). A table of the top 20 results for each search, complete with links to their PubChem pages, relevant patents, and functional descriptors, is provided in Table S2.
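The two-way categorization described above (functional positive or negative, crossed with structural positive or negative at Tc ≥ 0.60) can be sketched as a simple tally. The labels, function names, and example records below are hypothetical; in the study, the functional call came from manual patent and literature searches rather than from code.

```python
from collections import Counter

TC_THRESHOLD = 0.60  # structural similarity is positive when Tc >= 0.60


def quadrant(tc: float, has_known_function: bool) -> str:
    """Label one search result by functional and structural similarity."""
    functional = "function+" if has_known_function else "function-"
    structural = "structure+" if tc >= TC_THRESHOLD else "structure-"
    return f"{functional}/{structural}"


def tally(results) -> Counter:
    """Count results per quadrant from (Tc, has_known_function) pairs."""
    return Counter(quadrant(tc, known) for tc, known in results)


# ASSUMPTION: made-up records of (Tc to query, known relevant function?).
example_results = [(0.82, True), (0.61, False), (0.35, True), (0.28, False)]

for label, count in sorted(tally(example_results).items()):
    print(label, count)
```

The "function+/structure-" quadrant corresponds to the structurally dissimilar molecules with shared functionality that the following paragraphs count for each canonicalization.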
As expected, CheSS searches with the query molecules canonicalized with RDKit Atom 0 resulted in the identification of molecules with similar structures and functions. There were 31 structurally dissimilar molecules (Tc < 0.60) with shared functionality to the query, 30 of which were nonetheless obvious structural derivatives. Penicillin G returned β-lactam antibiotics, nirmatrelvir returned Hepatitis C Virus (HCV) protease inhibitors, zidovudine returned antiviral pyrimidine nucleosides, LSD returned psychoactive ergolines, fentanyl returned narcotic piperidine analogues, acid blue 25 FA returned anthraquinone dyes, avobenzone returned dibenzoylmethane permutants, and 2-dPAC returned electroluminescent carbazole and triphenylamine derivatives.
In contrast to RDKit Atom 0 queries, RDKit Atom n queries returned molecules that had greater structural diversity, but that still contained many structural analogues. There were 26 structurally dissimilar molecules (Tc < 0.60) with shared functionality to the query, 22 of which were relatively obvious structural derivatives, including purine and pyrimidine analogs of zidovudine (Fig. 6a). However, these hits also included a quite distinct dopaminergic agonist for LSD (Fig. 6b), a hydrophobic fluorescence probe for acid blue 25 FA (Fig. 6c), a refractive copolymer for avobenzone (Fig. 6d), and a fentanyl-like agonist for its known coronary muscarinic receptor (Fig. 6e) [41, 42]. 2-dPAC did not return any molecules with known relevant functionality using this canonicalization, though the results had similar aromaticity to the query. Interestingly, tetracycline antibiotics have been used to inhibit metalloproteases, and the penicillin G query returned two non-tetracycline metalloprotease inhibitors (Fig. 6f) [48].
OEChem generally returned molecules that were structurally highly dissimilar to the query. There were 35 structurally dissimilar molecules (Tc < 0.60) with shared functionality to the query, and in contrast to the previous two canonicalizations, only 7 of these were obvious structural derivatives. The more diverse compounds with shared functionality included two non-HCV protease inhibitors (Figs. 6g, 6h) and a Respiratory Syncytial Virus antiviral for nirmatrelvir (Fig. 6i); a dopaminergic and serotonergic agonist (Fig. 6j) and a 5-HT1 receptor agonist for LSD (Fig. 6k); porphyrins (Fig. 6l) and other conjugated dyes for acid blue 25 FA (Fig. 6m); photovoltaic and electroluminescent molecules for avobenzone; and highly conjugated electroluminescent molecules for 2-dPAC (Fig. 6n). Taken together, these hits provide anecdotal evidence for the hypothesis that changing query canonicalizations can lead to the discovery of novel, but functional, chemical compounds. These insights may provide interesting avenues for drug repurposing, and they suggest that molecules with previously unknown functions may serve as leads for novel drug discovery.
This paper is available on arXiv under the CC BY-NC-SA 4.0 license.