Supplementary Figures for Transformer-Based Chemical Similarity Search

by Penicillin, March 6th, 2025

Too Long; Didn't Read

Supplementary figures showcase key findings from CheSS searches, including Tanimoto coefficients, feature cosine similarities, and token vector comparisons. These visuals highlight how different SMILES canonicalizations impact molecular search behavior, emphasizing structural vs. functional similarity in chemical discovery.

Authors:

(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]);

(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]);

(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin and Corresponding Author ([email protected]);

(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin ([email protected]).

  1. Abstract & Introduction
  2. Methods
  3. Results and Discussion
  4. Determining Whether Canonicalization Impacts Search Behavior
  5. Explanation of Search Behavior & Drawbacks, Future Improvements, and Potential for Misuse
  6. Conclusion, Acknowledgements, Author Contributions, & more.
  7. Supplementary Figures

Supplementary Figures

Table S1: Different canonical SMILES string representations for each query molecule.


Figure S1: Fingerprint Tanimoto coefficients between each of the query molecules. The drug-like molecules, as well as acid blue 25 FA, are more similar to one another than they are to avobenzone & 2-dPAC. All of the molecules are fairly dissimilar to one another, with the highest similarity being 0.39 between LSD and penicillin G, and the lowest similarity being 0.11 between avobenzone and 2-dPAC.
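
The Tanimoto coefficient referenced in this caption can be illustrated with a minimal sketch. It assumes each fingerprint is represented as a set of on-bit indices; the bit sets below are made up for illustration and are not fingerprints of any query molecule, and the fingerprint type used in the paper is not specified here.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(bits_a & bits_b) / union


# Hypothetical on-bit sets (illustrative only).
fp_x = {1, 4, 7, 9, 12}
fp_y = {1, 4, 8, 12, 15, 20}

tc_xy = tanimoto(fp_x, fp_y)  # 3 shared bits out of 8 distinct bits
print(f"Tc = {tc_xy:.3f}")
```

A value of 1.0 means identical bit sets and 0.0 means no shared bits, so the reported range of 0.11 to 0.39 corresponds to query molecules that share only a minority of their fingerprint bits.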


Figure S2: Feature cosine similarity of each RDKit-canonicalized query as a function of the chosen root atom number. (a-h). In order: penicillin G, nirmatrelvir, zidovudine, LSD, fentanyl, acid blue 25 FA, avobenzone, 2-dPAC. The canonicalized variant with the lowest feature similarity to the Atom 0 representation was chosen as the “RDKit Atom n query”. The root atoms providing the most dissimilar feature vectors to the Atom 0 representations were 13 for penicillin G, 21 for nirmatrelvir, 15 for zidovudine, 9 for LSD, 17 for fentanyl, 26 for acid blue 25 FA, 8 for avobenzone, and 18 for 2-dPAC.
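
The selection rule in this caption (pick the root atom whose feature vector is least similar to the Atom 0 vector) can be sketched as follows. This is an illustration under stated assumptions: `features_by_root` is a hypothetical mapping from root atom index to a feature vector, and the toy vectors stand in for the model's actual embeddings, which are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_dissimilar_root(features_by_root):
    """Return the root atom whose vector has the lowest cosine similarity to root 0."""
    reference = features_by_root[0]
    similarity = {
        root: cosine_similarity(reference, vec)
        for root, vec in features_by_root.items()
        if root != 0
    }
    return min(similarity, key=similarity.get)


# Toy feature vectors keyed by root atom index (hypothetical values).
toy_features = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.2, 0.9, 0.1],
    3: [0.6, 0.6, 0.0],
}

chosen_root = most_dissimilar_root(toy_features)
print(chosen_root)
```

In this toy data, root atom 2 points furthest away from the Atom 0 vector, so it would play the role of the "Atom n" variant.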


Figure S3: Fingerprint Tanimoto coefficients between the query molecule and the top 20 most similar molecules to the query (by feature cosine similarity) for each canonicalization. (a-c). In order: RDKit Atom 0, RDKit Atom n, OEChem. These demonstrate that the RDKit Atom 0 search is providing results similar to a fingerprint structural search, whereas this is less so the case in RDKit Atom n, and even less so in the OEChem search. The exception to this is 2-dPAC, in which none of the molecules would have reasonably been found with a fingerprint search.


Figure S4: Gestalt similarity between the strings of the top 20 most similar molecules to the query (by feature cosine similarity) and the canonicalized query string of each respective canonicalization. (a-c). In order: RDKit Atom 0, RDKit Atom n, OEChem. These demonstrate that the RDKit Atom 0 search is providing results very similar to a simple string similarity search, whereas this is less so the case in RDKit Atom n, and even less so in the OEChem search.
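
One common way to compute a Gestalt (Ratcliff/Obershelp-style) string similarity in Python is `difflib.SequenceMatcher` from the standard library. Whether the authors used this exact routine is an assumption, and the SMILES strings below are illustrative examples rather than entries from Table S1.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: twice the matched characters over the combined length."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Illustrative strings only (not taken from the paper's tables).
query = "CC(=O)Oc1ccccc1C(=O)O"
results = ["CC(=O)Oc1ccccc1C(=O)OC", "OC(=O)c1ccccc1O", "CCO"]

# Rank the candidate strings by their string similarity to the query.
ranked = sorted(results, key=lambda s: gestalt_similarity(query, s), reverse=True)
for smiles in ranked:
    print(f"{gestalt_similarity(query, smiles):.3f}  {smiles}")
```

Because this measure looks only at characters, two different SMILES strings for the same molecule can score poorly against each other, which is why the choice of canonicalization matters for this comparison.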


Figure S5: Token vector Tanimoto ratios between the query molecule’s tokenized SMILES vector and the tokenized SMILES vectors of the top 20 most similar molecules to the query (by feature cosine similarity) for each canonicalization. (a-c). In order: RDKit Atom 0, RDKit Atom n, OEChem. These demonstrate that searches using RDKit Atom 0 to canonicalize the SMILES string will return molecules with a high number of shared tokens to the query, whereas this is less so the case with RDKit Atom n and OEChem. Despite these differences, nearly all of the top results share at least 50% of the tokens with the query.
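
A token-level Tanimoto ratio can be sketched as below. Two assumptions are made loudly here: the regex tokenizer is a simplified, hypothetical stand-in for the model's real tokenizer, and the ratio is computed over token multisets (shared token counts over combined token counts), which may differ from the exact definition used for the figure.

```python
import re
from collections import Counter

# Hypothetical, simplified SMILES tokenizer: bracket atoms, Br, Cl,
# two-digit ring closures, otherwise single characters.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list:
    return _TOKEN.findall(smiles)


def token_tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Multiset Tanimoto: sum of per-token minimum counts over sum of maximum counts."""
    counts_a = Counter(tokenize(smiles_a))
    counts_b = Counter(tokenize(smiles_b))
    shared = sum((counts_a & counts_b).values())  # & keeps the minimum count per token
    total = sum((counts_a | counts_b).values())   # | keeps the maximum count per token
    return shared / total if total else 0.0


print(tokenize("C[NH3+]Cl"))
print(token_tanimoto("CCO", "CCN"))
```

Under this definition, a result sharing "at least 50% of the tokens" with the query corresponds to a ratio of 0.5 or higher.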


Figure S6: Token vector length ratio between the query molecule’s tokenized SMILES vector and the tokenized SMILES vectors of the top 20 most similar molecules to the query (by feature cosine similarity) for each canonicalization. These demonstrate that the tokenized vector lengths of almost all of the results fall within 20% of the query’s length, indicating that the length of the tokenized SMILES vector is a significant factor in how the top results are determined.
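
The "within 20% of the query's length" check amounts to simple arithmetic on token counts. The sketch below assumes the ratio is taken as result length over query length (the direction is not stated in the caption), and the token counts are invented for illustration.

```python
def length_ratio(query_len: int, result_len: int) -> float:
    """Result token count divided by query token count (direction is an assumption)."""
    return result_len / query_len


def within_tolerance(ratio: float, tolerance: float = 0.20) -> bool:
    """True when the ratio deviates from 1.0 by no more than the tolerance."""
    return abs(ratio - 1.0) <= tolerance


# Hypothetical token counts: one query and three search results.
query_len = 40
result_lens = [36, 47, 55]

flags = [within_tolerance(length_ratio(query_len, n)) for n in result_lens]
print(flags)
```

Here the first two results fall inside the 20% band around the query length and the third does not.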


Figure S7: The index rank of each alternate canonicalization’s top 20 results for each query compared to the index rank that these same molecules scored in the other canonicalizations’ searches. (a-h). In order: penicillin G, nirmatrelvir, zidovudine, LSD, fentanyl, acid blue 25 FA, avobenzone, 2-dPAC. Molecules functionally similar to the query, as determined by the patent search, are indicated by a black dot, and molecules structurally similar to the query (Tc ≥ 0.60) are indicated by a dashed line. These demonstrate that queries that underwent alternative canonicalization were able to identify functional molecules that would have been impractical to find using the standard canonicalization.
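
The rank comparison described in this caption can be sketched as a lookup of one search's top results in another search's full ranking. The molecule identifiers and rankings below are hypothetical, and 1-based ranks are an assumption of this sketch.

```python
def cross_rank(top_results, other_ranking):
    """Map each top result to its 1-based rank in another search (None if absent)."""
    position = {mol: rank for rank, mol in enumerate(other_ranking, start=1)}
    return {mol: position.get(mol) for mol in top_results}


# Hypothetical ranked result lists from two canonicalizations of the same query.
atom_n_ranking = ["mol_C", "mol_F", "mol_A", "mol_Z"]
atom_0_ranking = ["mol_A", "mol_B", "mol_C", "mol_D", "mol_E", "mol_F"]

top_k = 3
ranks_in_atom_0 = cross_rank(atom_n_ranking[:top_k], atom_0_ranking)
print(ranks_in_atom_0)
```

A top result from one canonicalization that sits very deep in another canonicalization's ranking (or is missing from it) is the kind of molecule the caption describes as impractical to find with the standard canonicalization.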


Figure S8: Structures of top 20 results for each query.


Table S2: CheSS Top Results Information. Includes query, canonicalization, search rank, PubChem CID, Patent ID/DOI, functional descriptor, categorized drug/dye-likeness based on functionality, same functionality categorization, fingerprint Tanimoto coefficient between query & result, categorized Structurally Distinct Functional Analogue (SDFA), categorized Non-Derivative Functional Analogue (NDFA).


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.