SMILES Variations Influence Chemical Search Behavior and Functional Discovery

by Penicillin, March 6th, 2025

Too Long; Didn't Read

CheSS search behavior is influenced by SMILES canonicalization, with tokenization differences altering embeddings and uncovering novel functional analogues. While promising for drug discovery, limitations exist, and dual-use concerns highlight the need for careful implementation of chemical similarity search tools.


Authors:

(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin;

(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin;

(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin (corresponding author);

(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin.

  1. Abstract & Introduction
  2. Methods
  3. Results and Discussion
  4. Determining Whether Canonicalization Impacts Search Behavior
  5. Explanation of Search Behavior & Drawbacks, Future Improvements, and Potential for Misuse
  6. Conclusion, Acknowledgements, Author Contributions, & more.
  7. Supplementary Figures

3.2 Explanation of Search Behavior

We find that alternative canonicalizations influence the search behavior of CheSS through changes in tokenization, which cause the CLM to weight higher-order relationships more heavily when creating embeddings. RDKit and OEChem canonicalizations differ starkly, most notably in how they represent aromatic rings: OEChem prefers the Kekulé form (C1=CC=CC=C1), while RDKit prefers lowercase atoms with assumed aromaticity (c1ccccc1). These differences, among others, result in markedly different tokenization, in both the composition and the length of the tokenized vectors (Fig. 4). The CLM has demonstrated a bias toward embedding molecules close to the query when their token vectors are similar to the query's in length and, to a lesser extent, in token composition (Fig. 4). A query with a different canonicalization therefore tokenizes into a radically different token vector, which makes the CLM more likely to return molecules whose SMILES representations are highly dissimilar from the original, same-canonicalized query. Despite this behavior, and potentially because of it, CheSS searches with alternative canonicalizations found diverse chemical structures with similar functional properties, as demonstrated by patent and literature searches (Fig. 5).
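The tokenization difference described above can be made concrete with a minimal sketch. It assumes a simplified regex-based SMILES tokenizer as a stand-in; this is not the tokenizer used by the CLM in CheSS, and the names SMILES_TOKEN, tokenize, and token_jaccard are hypothetical. Only the two benzene strings are taken from the text.

```python
import re

# Simplified regex SMILES tokenizer. This is an illustrative assumption,
# not the tokenizer used by the CLM discussed in the paper.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$]"
)


def tokenize(smiles):
    """Split a SMILES string into atom, bond, ring-closure, and branch tokens."""
    return SMILES_TOKEN.findall(smiles)


def token_jaccard(a, b):
    """Jaccard overlap between the sets of distinct tokens in two token lists."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


# The two representations of benzene quoted in the text.
kekule = tokenize("C1=CC=CC=C1")   # Kekulé form
aromatic = tokenize("c1ccccc1")    # lowercase aromatic form

print(len(kekule), kekule)
print(len(aromatic), aromatic)
print(token_jaccard(kekule, aromatic))
```

Under this toy tokenizer the same molecule yields 11 tokens in the Kekulé form and 8 in the aromatic form, and the two token sets share only the ring-closure digit. A real subword tokenizer would split the strings differently, but the same two effects, a change in length and a change in composition, are what the paragraph above attributes the altered search behavior to.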


When CLMs are forced to go beyond simple token patterns to determine similarity, more nuanced relationships may appear. Given the nature of transformers, it is not possible to discern exactly what these relationships are, but our analysis suggests that the CLM may, for example, key on the apposition of functional groups in space, much as receptors perceive ligands.


3.3 Drawbacks, Future Improvements, and Potential for Misuse

The database used by CheSS consisted of the ∼10M molecules used as a training set for ChemBERTa, and thus it is difficult to predict the behavior of queries that differ greatly from the molecules in this dataset. In addition, very small molecules may not differ in their canonicalized representations, leading to more homogeneity between queries. Nonetheless, the method itself can be extended to any dataset, and may invite discussion of which datasets and CLMs are most useful for moving between different canonicalizations to identify functional analogues. We have not yet explored whether non-canonicalized, yet valid, variations of SMILES strings lead to similar results.


We also note that the threat of dual use for chemical machine learning models has been a topic of discussion amongst researchers [49]. While the model used in our implementation of CheSS was unsupervised and not trained for identifying toxic molecules, a successful chemical similarity search tool carries inherent risks. We therefore advise caution in considering public implementations of these tools and recommend restricting searches to avoid queries with the potential for malicious use.


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.