Alternative Canonicalizations Expand Chemical Similarity Search Capabilities

by Penicillin, March 6th, 2025

Too Long; Didn't Read

By applying different SMILES canonicalizations, CheSS alters search behavior, identifying structurally diverse but functionally similar molecules. RDKit Atom 0 returns structurally close analogues, while RDKit Atom n and OEChem uncover novel candidates, potentially aiding drug repurposing and expanding chemical discovery.

Authors:

(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]);

(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]);

(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin and Corresponding Author ([email protected]);

(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]).

  1. Abstract & Introduction
  2. Methods
  3. Results and Discussion
  4. Determining Whether Canonicalization Impacts Search Behavior
  5. Explanation of Search Behavior & Drawbacks, Future Improvements, and Potential for Misuse
  6. Conclusion, Acknowledgements, Author Contributions, & more.
  7. Supplementary Figures

3.1 Determining Whether Canonicalization Impacts Search Behavior

A CheSS search using each of three different query canonicalizations was conducted on eight molecules of known function: penicillin G, nirmatrelvir, zidovudine, lysergic acid diethylamide (LSD), fentanyl, acid blue 25 free acid (acid blue 25 FA), avobenzone, and 2-diphenylaminocarbazole (2-dPAC) (Fig. 2). These molecules were chosen because they are of roughly equal chemical complexity but otherwise represent two distinct classes of molecules: drug-like bioactive molecules and non-drug-like photochemical molecules (herein referred to as dye-like). In addition, the molecules were all sufficiently structurally dissimilar from one another, as determined by each pair having a fingerprint Tanimoto coefficient (Tc) less than 0.60 (Fig. S1). That said, acid blue 25 FA was more similar to the drug-like molecules, whereas the dyes avobenzone and 2-dPAC were both highly dissimilar from all other query molecules (Fig. S1).
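For reference, the Tanimoto coefficient used as the dissimilarity criterion above can be written in its standard set form. The symbols $A$, $B$, $a$, $b$, and $c$ are introduced here only for illustration, and the worked numbers are hypothetical rather than taken from Fig. S1; the fingerprint type itself is not specified in this section.

```latex
% A and B are the sets of "on" bits in the two fingerprints;
% a = |A|, b = |B|, and c = |A \cap B| (bits shared by both).
\[
  T_c(A, B) \;=\; \frac{|A \cap B|}{|A \cup B|} \;=\; \frac{c}{a + b - c}
\]
% Hypothetical worked example: a = 40, b = 50, c = 30
\[
  T_c \;=\; \frac{30}{40 + 50 - 30} \;=\; \frac{30}{60} \;=\; 0.50 \;<\; 0.60
\]
```

Under the criterion in the text, such a pair would count as structurally dissimilar.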


Figure 2: Query molecules and canonical SMILES representations. Query molecules made achiral during canonicalization. (a). Penicillin G; (b). Nirmatrelvir; (c). Zidovudine; (d). LSD; (e). Fentanyl; (f). Acid blue 25 FA; (g). Avobenzone; (h). 2-dPAC. (i). Penicillin G SMILES strings for the three canonicalizations used herein. Unabridged SMILES for each query are listed in Table S1.

3.1.1 Statistical Analysis of Query Canonicalizations

For each query molecule, several similarity metrics were calculated between the three canonicalizations (pairwise comparisons, Fig. 3). Gestalt pattern matching, a string similarity metric, showed that each query canonicalized into different strings, with a mean pairwise value of 0.47 across canonicalizations (n=8) (Fig. 3). Because the CLM does not receive strings directly as inputs, but instead receives their tokenized representations (integer-mapped subsections of the strings), the token vectors were analyzed to understand how these strings would be presented to the model. Tanimoto similarity, a metric comparing shared elements between two sets, was applied to the token vectors and indicated that the query strings were converted into markedly different input tokens (mean pairwise value of 0.69 across canonicalizations) (Fig. 3). Similarly, token vector lengths varied, with some queries differing by almost a factor of 2 depending on canonicalization (Fig. 3). Changes in the token vectors cause differences in featurization, that is, in the model’s interpretation of the input, and different embeddings were indeed obtained depending on canonicalization (mean pairwise feature cosine similarity of 0.66), indicating that the model interpreted different canonicalizations of the same molecule as quite distinct inputs (Fig. 3).
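The four pairwise metrics described above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' pipeline: `difflib.SequenceMatcher` implements a Ratcliff/Obershelp-style gestalt match, but the regex tokenizer below is a hypothetical stand-in for the CLM's real tokenizer, the three ethanol SMILES are toy inputs rather than the paper's queries, and the function names are assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations
import math
import re

# Hypothetical stand-in tokenizer: bracket atoms, two-letter halogens,
# otherwise single characters. The CLM's actual tokenizer differs.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def gestalt(a, b):
    """Gestalt pattern matching similarity between two strings (0 to 1)."""
    return SequenceMatcher(None, a, b).ratio()


def token_tanimoto(a, b):
    """Tanimoto similarity between the sets of tokens of two strings."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb)


def length_ratio(a, b):
    """Ratio of the shorter token vector length to the longer one."""
    la, lb = len(tokenize(a)), len(tokenize(b))
    return min(la, lb) / max(la, lb)


def cosine(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mean_pairwise(items, metric):
    """Mean of a metric over all unordered pairs of items."""
    pairs = list(combinations(items, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Toy example: three valid SMILES strings for the same molecule (ethanol).
variants = ["CCO", "OCC", "C(O)C"]
print("mean gestalt:       ", round(mean_pairwise(variants, gestalt), 2))
print("mean token Tanimoto:", round(mean_pairwise(variants, token_tanimoto), 2))
print("mean length ratio:  ", round(mean_pairwise(variants, length_ratio), 2))
```

The cosine function would be applied to the model's embedding vectors for each canonicalization, which are not reproduced here.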

3.1.2 Distribution of Top Hits

To explore how different canonicalizations impact feature-based search behavior, similarity metrics were obtained comparing each canonicalized query to its respective top 20 CheSS search results. Queries canonicalized with RDKit Atom 0 yielded compounds high in structural similarity, as evidenced by fingerprint Tanimoto similarity, a measure of molecular substructure similarity, with a mean coefficient of 0.62 (n=160) (Fig. 4d). In contrast, the mean fingerprint Tanimoto coefficients for RDKit Atom n and OEChem were 0.45 and 0.32, respectively. The difference between these searches can also be seen in how many results a conventional search would have recovered: for RDKit Atom 0 canonicalized queries, 22% of the results could have been found from a fingerprint Tanimoto search using a cutoff as high as 0.80, indicating that nearly a quarter of the results were 1-2 atomic changes away from the query molecule (Fig. 4d). In contrast, only 6% and 2% of the top results for RDKit Atom n and OEChem, respectively, could have been found from this same search, indicating significant structural divergence (Fig. 4d).
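The two summary statistics in this paragraph, the mean fingerprint Tanimoto coefficient and the fraction of hits recoverable at a 0.80 cutoff, can be sketched as follows. This is an illustration under stated assumptions: fingerprints are represented as plain Python sets of on-bit indices, the toy bit sets are invented, and `summarize_hits` is a hypothetical helper name, not part of CheSS.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)


def summarize_hits(query_fp, hit_fps, cutoff=0.80):
    """Return (mean Tc, fraction of hits with Tc >= cutoff) for one search."""
    scores = [tanimoto(query_fp, fp) for fp in hit_fps]
    mean_tc = sum(scores) / len(scores)
    frac_above = sum(score >= cutoff for score in scores) / len(scores)
    return mean_tc, frac_above


# Invented toy fingerprints; a real search would compare molecular fingerprints.
query = {1, 2, 3, 4, 5}
hits = [{1, 2, 3, 4, 5}, {1, 2, 3, 4}, {1, 2, 9}, {7, 8}]

mean_tc, frac_above = summarize_hits(query, hits)
print(f"mean Tc = {mean_tc:.2f}, fraction with Tc >= 0.80 = {frac_above:.2f}")
```

A hit counted in `frac_above` is one that an ordinary fingerprint search at that cutoff would also have returned, which is the comparison the text draws between canonicalizations.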


At a more granular level, these structural differences are well-illustrated for a penicillin G query, in that there is a gradual diminishing of β-lactam-containing results as canonicalization diverges (Figs. 4a, 4b, 4c): all of the top 8 hits for the RDKit Atom 0 canonicalizations contained β-lactams, while progressively fewer lactams were found for RDKit Atom n and OEChem. This trend in diverging structure was partially explained by Gestalt pattern matching similarity, in which the mean scores for RDKit Atom 0, RDKit Atom n, and OEChem were 0.86, 0.65, and 0.44 respectively, indicating that the average RDKit Atom 0 top result was a simple string permutation away from the original query, and thus also a simple structural modification away, but this was not the case for the alternate canonicalizations (Fig. 4d).


Figure 3: Similarity metrics between the three canonicalized representations for each query molecule. Gestalt similarity demonstrates different canonicalizations result in markedly different strings. Token Tanimoto & length ratios indicate these strings were tokenized into different inputs to the CLM. Feature cosine similarity between ChemBERTa embedded vectors demonstrate that the differently canonicalized queries’ token vectors were interpreted differently by the model resulting in increased spread across feature space. Deviations from 1.0 for each metric represent divergence between canonicalizations.


The token vector Tanimoto similarity demonstrated that RDKit Atom 0 canonicalization returned molecules with a high number of shared tokens to the query (mean of 0.80), whereas this was not the case with RDKit Atom n and OEChem (means 0.65, 0.60), indicating that the model’s ability to memorize tokens to determine feature similarity was reduced by different canonicalizations (Fig. 4d). These results point to the possibility that the model utilizes non-obvious relationships to determine the alternative canonicalization’s location in feature space.


Interestingly, the token vector length ratios for all canonicalizations’ results fell within about 20% of each query’s token vector length, indicating that the model heavily utilized token vector length to determine feature space location and thus similarity (Fig. 4d). This means that token vector length may constrain CLM-based similarity searches to confined regions of chemical space. Alternative canonicalizations, which vary the token vector length, may act as a way to bypass this predominant search criterion and thereby explore more distant regions of chemical space, ultimately allowing for more comprehensive and far-reaching similarity searches (Figs. 4e, 4f).
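To make the length constraint concrete, the sketch below computes the band of token vector lengths that lies within 20% of a query's length and checks which hits fall inside it. The 20% figure is the approximate observation reported above, not a hard rule of the model, and the query and hit lengths are hypothetical numbers chosen only for illustration.

```python
def length_window(query_len, tolerance=0.20):
    """Return the (low, high) token vector lengths within the tolerance of the query."""
    return query_len * (1.0 - tolerance), query_len * (1.0 + tolerance)


def within_window(query_len, hit_lens, tolerance=0.20):
    """Flag each hit whose length ratio to the query is within the tolerance."""
    return [abs(hit_len / query_len - 1.0) <= tolerance for hit_len in hit_lens]


# Hypothetical token vector lengths for one molecule under three canonicalizations.
query_lengths = {"RDKit Atom 0": 40, "RDKit Atom n": 55, "OEChem": 78}
for name, length in query_lengths.items():
    low, high = length_window(length)
    print(f"{name}: hits expected between {low:.0f} and {high:.0f} tokens")

# Hypothetical hit lengths checked against a 50-token query.
print(within_window(50, [45, 58, 61, 42]))
```

Because each canonicalization yields a different query length, each one shifts the band of lengths that the search tends to return, which is one way to read the claim that alternative canonicalizations reach different regions of chemical space.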

3.1.3 Patent Search Reveals Functionality of Molecules

To begin to determine the functional significance of the search results, patent and literature searches were conducted on the top 20 results from each search. In general, functionally drug-like queries returned high levels of drug-like molecules and few dye-like molecules; conversely, queries on functionally dye-like molecules returned more dye-like molecules and many fewer drug-like molecules (Figs. 5a, 5b). The exception to the latter statement was the set of molecules identified using OEChem queries, but this skew was due almost solely to results from the avobenzone search (avobenzone itself has known biological activity [46, 47]). In general, the baseline of random drug-like molecules returned for drug-based queries exists, but is relatively low.


Figure 4: Search behavior depends on canonicalization. (a-c). Different canonicalizations return structurally distinct molecules, demonstrated by β-lactam ring-containing molecules in the top search results. (a). RDKit Atom 0 had 8/8 top results containing β-lactam rings. (b). RDKit Atom n had 7/8 top results containing β-lactam rings. (c). OEChem had 3/8 top results containing β-lactam rings. (d). Similarity metrics for all CheSS searches between each canonicalized query and respective top 20 results (n=160 for each canonicalization). Asterisks indicate the level of statistical significance for two-sided independent t-tests (ns, P<1.0; *, P<0.05; **, P<0.01; ***, P<0.001; **** P<0.0001). (e-f). The index rank of each alternate canonicalization’s top 20 results for penicillin G compared to the index rank that these same molecules scored in the other canonicalizations’ searches. Molecules functionally similar to the query indicated by a black dot, as determined by the patent search, and structurally similar to the query (Tc ≥ 0.60) indicated by a dashed line. Rank plots for each query and comparisons between RDKit Atom n and OEChem are listed in Fig. S7. Queries with alternative canonicalizations were able to find molecules that would not have been found when the same canonicalization as the database was used, which were often functionally similar to the query.


The functional similarity of query results was contrasted with fingerprint Tanimoto similarity (Fig. 5c). The categorization of functionality was either positive (known relevant functionality to the query) or negative (unknown relevant functionality), and the structural similarity was either positive (Tc ≥ 0.60) or negative (Tc < 0.60). Criteria / ontologies for similar functionality for each molecule were as follows: Penicillin G: antibiotic [3]; nirmatrelvir: protease inhibitor or antiviral [36]; zidovudine: antiviral [37]; LSD: 5-HT receptor agonist or dopaminergic agonist [38, 39]; fentanyl: opioid analgesic or muscarinic receptor agonist [40–42]; acid blue 25 FA: dye or electroluminescent; avobenzone: UV-Absorption, electroluminescent [43, 44]; 2-dPAC: electroluminescent [45]. To illustrate, for OEChem-canonicalized nirmatrelvir (a SARS-CoV-2 main protease inhibitor) several top results (7, 8, 16, and 17) were classified as positive, as these compounds were known protease inhibitors and / or antivirals (Figs. 6g, 6h, 6i). A table of the top 20 results for each search, complete with links to their PubChem pages, relevant patents and functional descriptors is listed in Table S2.
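The two-by-two structure-function categorization described above can be sketched in a few lines. This is a minimal illustration: the 0.60 cutoff comes from the text, while the category labels, function names, and toy hits are assumptions made for this example. In the paper the functional flag comes from manual patent and literature searches, which is represented here simply as a boolean.

```python
from collections import Counter

TC_CUTOFF = 0.60  # structural similarity threshold stated in the text


def categorize(tc, has_known_function):
    """Label one hit as structure +/- and function +/- relative to the query."""
    structure = "+" if tc >= TC_CUTOFF else "-"
    function = "+" if has_known_function else "-"
    return f"structure{structure}/function{function}"


def tally(hits):
    """Count hits per category; hits are (Tc, has_known_function) pairs."""
    return Counter(categorize(tc, known) for tc, known in hits)


# Invented toy hits for a single search.
toy_hits = [(0.82, True), (0.35, True), (0.41, False), (0.60, True)]
for category, count in sorted(tally(toy_hits).items()):
    print(category, count)
```

The "structure-/function+" category corresponds to the structurally dissimilar molecules with shared functionality that are counted for each canonicalization in the remainder of this section.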


Figure 5: Patent-derived functional analysis for each canonicalization’s results. (a). Mean drug-like & dye-like molecules returned in the top 20 results from drug-like queries (95% CI, n=8). (b). Mean drug-like & dye-like molecules returned in the top 20 results from dye-like queries (95% CI, n=8). (c). Structure-function categorization across all queries for each canonicalization (n=8 for each canonicalization). Structural similarity determined by fingerprint Tanimoto similarity (+ indicates Tc ≥ 0.60, and − indicates Tc < 0.60). Functional similarity determined by patent search (+ indicates similar function, − indicates no known relevant function to query). Criteria for similar function for each molecule were as follows: Penicillin G: antibiotic [3]; nirmatrelvir: protease inhibitor or antiviral [36]; zidovudine: antiviral [37]; LSD: 5-HT receptor agonist or dopaminergic agonist [38, 39]; fentanyl: opioid analgesic or muscarinic receptor agonist [40–42]; acid blue 25 FA: dye or electroluminescent; avobenzone: UV-Absorption, electroluminescent [43, 44]; 2-dPAC: electroluminescent [45]. (d). Mean non-derivative functional analogues returned in the top 20 results (95% CI, n=8). Asterisks indicate the level of statistical significance for two-sided independent t-tests (ns, P<1.0; *, P<0.05; **, P<0.01; ***, P<0.001; **** P<0.0001).


As expected, CheSS searches with the query molecules canonicalized with RDKit Atom 0 resulted in the identification of molecules with similar structures and functions. There were 31 structurally dissimilar molecules (Tc < 0.60) with shared functionality to the query, 30 of which were nonetheless obvious structural derivatives. Penicillin G returned β-lactam antibiotics, nirmatrelvir returned Hepatitis C Virus (HCV) protease inhibitors, zidovudine returned antiviral pyrimidine nucleosides, LSD returned psychoactive ergolines, fentanyl returned narcotic piperidine analogues, acid blue 25 FA returned anthraquinone dyes, avobenzone returned dibenzoylmethane permutants, and 2-dPAC returned electroluminescent carbazole and triphenylamine derivatives.


Figure 6: Structures of molecules discussed herein. (a). RDKit Atom n zidovudine top result #2; (b). RDKit Atom n LSD top result #7; (c). RDKit Atom n acid blue 25 FA top result #9; (d). RDKit Atom n avobenzone top result #13; (e). RDKit Atom n fentanyl top result #20; (f). RDKit Atom n penicillin G top result #9; (g). OEChem nirmatrelvir top result #16; (h). OEChem nirmatrelvir top result #17; (i). OEChem nirmatrelvir top result #8; (j). OEChem LSD top result #8; (k). OEChem LSD top result #14; (l). OEChem acid blue 25 FA top result #5; (m). OEChem acid blue 25 FA top result #20; (n). OEChem 2-dPAC top result #15.


In contrast to RDKit Atom 0 queries, RDKit Atom n queries returned molecules that had greater structural diversity, but that still contained many structural analogues. There were 26 structurally dissimilar molecules (Tc < 0.60) with shared functionality to the query, some 22 of which were relatively obvious structural derivatives, including purine and pyrimidine analogs of zidovudine (Fig. 6a). However, these hits also included a quite distinct dopaminergic agonist for LSD (Fig. 6b), a hydrophobic fluorescence probe for acid blue 25 FA (Fig. 6c), a refractive copolymer for avobenzone (Fig. 6d), and a fentanyl-like agonist for its known coronary muscarinic receptor (Fig. 6e) [41, 42]. 2-dPAC did not return any molecules with known relevant functionality using this canonicalization, though the results had similar aromaticity to the query. Interestingly, tetracycline antibiotics have been used to inhibit metalloproteases, and the penicillin G query returned two non-tetracycline metalloprotease inhibitors (Fig. 6f) [48].


OEChem generally returned molecules that were structurally highly dissimilar to the query. There were 35 structurally dissimilar molecules (Tc < 0.60) with shared functionality to the query, and in contrast to the previous two canonicalizations, only 7 of these were obvious structural derivatives. The more diverse compounds with shared functionality included two non-HCV protease inhibitors (Figs. 6g, 6h); a Respiratory Syncytial Virus antiviral for nirmatrelvir (Fig. 6i); a dopaminergic and serotonergic agonist (Fig. 6j) and a 5-HT1 receptor agonist for LSD (Fig. 6k); porphyrins (Fig. 6l) and other conjugated dyes for acid blue 25 FA (Fig. 6m); photovoltaic and electroluminescent molecules for avobenzone; and highly conjugated electroluminescent molecules for 2-dPAC (Fig. 6n). Taken together, these hits provide anecdotal evidence for the hypothesis that changing query canonicalizations can lead to the discovery of novel, but functional, chemical compounds. These insights may provide interesting avenues for drug repurposing, and molecules with previously unknown functions may serve as leads for novel drug discovery.


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.