Authors:
(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin;
(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin;
(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin (corresponding author);
(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin.
CheSS Overview. The Chemical Semantic Search (CheSS) is a molecular search framework that uses language model-encoded feature vectors to compute similarity scores across molecular space. A database of molecules is encoded as strings using SMILES format [17]. A chemical language model is then used to generate a feature vector for each molecule in the database as well as the query molecule. The cosine similarity between the query vector and each database vector is computed, resulting in a vector of feature cosine similarities.
Language Model. ChemBERTa was used as the language model to generate embeddings [29]. It is a Bidirectional Encoder Representations from Transformers (BERT) model with 12 hidden layers of size 768, trained on 10M random, non-redundant, achiral SMILES strings selected from PubChem [30]. ChemBERTa was chosen over newer, higher-parameter BERT models because it is easy to implement and its training dataset is publicly available. ChemBERTa does not support isomeric SMILES (chirality), and all SMILES were canonicalized before input.
Database. The CheSS molecular feature database was built from the ∼10M random achiral molecules used to train ChemBERTa [29]. For each molecule, the SMILES string was canonicalized using RDKit [31]. We excluded all SMILES strings that tokenized to more than 512 tokens, the maximum supported by ChemBERTa, which resulted in a database of 9,999,809 molecules. The database SMILES strings were encoded into feature vectors with ChemBERTa. The final-layer [CLS] token vector representations were used as the feature vectors, as described in the original BERT paper [30]. These feature vectors were then L2-normalized and stored in chunks of 100k SMILES string-feature vector pairs for later cosine similarity calculations.
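The storage and lookup steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors, the chunk size of 2, and the helper names `l2_normalize`, `chunked`, and `search` are all hypothetical stand-ins for the 768-dimensional ChemBERTa features and 100k-pair chunks. It shows why normalizing once at build time is convenient: the cosine similarity against a normalized query reduces to a dot product.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, List[float]]  # (SMILES string, feature vector)


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def chunked(pairs: Sequence[Pair], chunk_size: int) -> Iterator[Sequence[Pair]]:
    """Yield consecutive slices of at most chunk_size pairs."""
    for start in range(0, len(pairs), chunk_size):
        yield pairs[start:start + chunk_size]


def search(query_vec: Sequence[float],
           chunks: Sequence[Sequence[Pair]],
           top_k: int = 3) -> List[Tuple[float, str]]:
    """Rank stored molecules by cosine similarity to the query."""
    q = l2_normalize(query_vec)
    scored = []
    for chunk in chunks:
        for smiles, vec in chunk:
            # Both vectors are unit length, so the dot product is the cosine.
            scored.append((sum(a * b for a, b in zip(q, vec)), smiles))
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy vectors (assumed values, not real ChemBERTa embeddings).
raw = [("CCO", [1.0, 2.0, 0.0]),
       ("CCN", [2.0, 1.0, 0.0]),
       ("c1ccccc1", [0.0, 0.0, 3.0])]
database = [(smiles, l2_normalize(vec)) for smiles, vec in raw]
chunks = list(chunked(database, chunk_size=2))
hits = search([1.0, 2.0, 0.0], chunks, top_k=2)
```

In this toy example the query vector matches the first entry, so `hits` lists `CCO` first and `CCN` second.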
Canonicalization Query Types. ChemBERTa was trained on SMILES strings canonicalized using RDKit [29, 31]. Different canonicalization algorithms produce different, but equally valid, standardized strings representing the same molecule, which we used to create three highly different queries for the same molecule. The first query type used RDKit with its default Python implementation settings; this is the algorithm that was used to canonicalize both the database and the model's training data. When converting molecules to SMILES, RDKit allows specification of the atom at which the SMILES string is rooted. The default is Atom 0, and rooting at a different atom results in a different representation. The feature cosine similarity was calculated between the default RDKit SMILES and the “Atom n” RDKit SMILES for each atom in the query molecule, as demonstrated in Figure S2. From these, we took the most dissimilar “Atom n” SMILES string to be the second query type for each molecule. To obtain a third dissimilar SMILES representation, OEChem 2.3.0, a markedly different canonicalization algorithm than RDKit, was used [32]. These SMILES strings were obtained from the PubChem website.
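The selection of the second query type amounts to taking the minimum over per-atom similarities. The sketch below is an illustration only: the function name `most_dissimilar_rooting`, the hand-written rooted strings for ethanol, and the similarity values in `toy_scores` are all assumptions made for the example, not data from this work. In practice the rooted strings could be generated with RDKit's `Chem.MolToSmiles(mol, rootedAtAtom=i)` and the similarity would be the feature cosine similarity from the language model.

```python
from typing import Callable, Sequence, Tuple


def most_dissimilar_rooting(default_smiles: str,
                            rooted_smiles: Sequence[str],
                            similarity: Callable[[str, str], float]
                            ) -> Tuple[str, float]:
    """Return the rooted SMILES least similar to the default SMILES."""
    if not rooted_smiles:
        raise ValueError("need at least one rooted SMILES string")
    scored = [(similarity(default_smiles, s), s) for s in rooted_smiles]
    score, smiles = min(scored, key=lambda pair: pair[0])
    return smiles, score


# Hand-written illustrative strings for ethanol rooted at each atom.
rooted = ["CCO", "C(C)O", "OCC"]

# Made-up placeholder scores standing in for model-derived cosine similarities.
toy_scores = {"CCO": 1.0, "C(C)O": 0.93, "OCC": 0.71}


def toy_similarity(default: str, candidate: str) -> float:
    """Hypothetical similarity lookup; a real one would embed both strings."""
    return toy_scores[candidate]


second_query, score = most_dissimilar_rooting("CCO", rooted, toy_similarity)
```

Passing the similarity function as an argument keeps the selection logic independent of how the embeddings are produced.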
Similarity Metrics. Several similarity metrics were used throughout: feature cosine similarity, Gestalt pattern matching similarity, fingerprint Tanimoto similarity, token vector length similarity, and token similarity [33, 34]. Feature cosine similarity is the cosine of the angle between two feature vectors A and B (Eq. 1):

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} \tag{1}$$
A cosine similarity of 1 indicates the normalized vectors are the same, 0 means they are orthogonal to one another, and −1 means they are opposite of one another.
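As a small worked example of the three cases just described, the following sketch computes the cosine similarity for parallel, orthogonal, and opposite vectors. The function name `cosine_similarity` and the two-dimensional inputs are illustrative assumptions; the actual feature vectors are much higher dimensional.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])  # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # reversed direction
```

Up to floating-point rounding, these evaluate to 1, 0, and −1 respectively. Note that the first pair differs in length but not in direction, which is why only the normalized vectors are "the same".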
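Gestalt pattern matching similarity, also known as Ratcliff/Obershelp pattern recognition, measures the similarity of two strings S_1 and S_2 as twice the number of matching characters K_m divided by the total number of characters in both strings (Eq. 2):

$$D = \frac{2 K_m}{|S_1| + |S_2|} \tag{2}$$

Matching characters are identified first from the longest common substring, and then recursively in the non-matching regions on either side of that substring. The metric ranges from 1 for a perfect match to 0 for completely dissimilar strings. We used the difflib Python implementation of the Gestalt pattern matching algorithm to calculate Gestalt similarity.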
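A minimal sketch of how a Gestalt similarity can be obtained from difflib is shown below. The wrapper name `gestalt_similarity` and the example strings are assumptions for illustration; the source states only that difflib was used, so the exact call and options here may differ from the authors' code.

```python
from difflib import SequenceMatcher


def gestalt_similarity(s1: str, s2: str) -> float:
    """Ratcliff/Obershelp-style similarity: 2 * matches / total length."""
    # isjunk=None disables junk filtering so every character counts.
    return SequenceMatcher(None, s1, s2).ratio()


# Two SMILES strings for ethanol written from opposite ends.
# The longest common substring is "CC" (2 matching characters),
# and the strings have 3 + 3 = 6 characters in total.
ethanol_pair = gestalt_similarity("CCO", "OCC")

identical = gestalt_similarity("CCO", "CCO")
disjoint = gestalt_similarity("CCC", "NNN")
```

For the ethanol pair the ratio is 2 × 2 / 6 ≈ 0.667, which illustrates that two valid strings for the same molecule can look only moderately similar at the character level.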
Fingerprint Tanimoto similarity was used to calculate the structural similarity between pairs of molecules. This method encodes substructures into a binary vector, and then calculates the Tanimoto similarity between these encoded vectors. The Tanimoto / Jaccard similarity is the number of shared elements (intersection) between two sets A and B over the total number of unique elements in both sets (union) (Eq. 3):

$$T(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{3}$$
This metric ranges from 1 (all elements shared) to 0 (no elements shared). The RDKit default implementation of fingerprint Tanimoto similarity was used herein.
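To make the set arithmetic concrete, the sketch below computes the Tanimoto similarity for two toy fingerprints represented as sets of "on" bit positions. It does not use RDKit and is not the implementation used here; the function name `tanimoto`, the bit positions, and the choice to return 0 when both sets are empty are assumptions of this example.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared elements divided by total unique elements."""
    union = a | b
    if not union:
        # Both fingerprints empty: returning 0.0 is an arbitrary
        # convention chosen for this sketch.
        return 0.0
    return len(a & b) / len(union)


# Toy fingerprints: positions of the bits set to 1 (assumed values).
fp_a = {1, 5, 9, 12}
fp_b = {5, 9, 20}

similarity = tanimoto(fp_a, fp_b)  # 2 shared bits / 5 unique bits
```

Here two bits are shared out of five unique bits, giving a similarity of 0.4.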
All SMILES were encoded into token vectors before being passed into the model. These tokenized vectors were used for additional comparisons to better understand search behavior. The first metric used from these was the ratio of token lengths between two vectors. The second metric was the token Tanimoto / Jaccard similarity (Eq. 3) between the two molecules’ token vectors, and was used to determine the ratio of shared tokens between the two vectors. This metric ranges from 1 (all tokens shared) to 0 (no tokens shared).
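The two token-level metrics can be sketched as follows. Several details are assumptions of this example rather than facts from the text: the token IDs are invented, the length ratio is taken as shorter over longer so that it falls between 0 and 1, and nothing is assumed about whether special tokens such as [CLS] are included.

```python
from typing import Sequence


def token_length_ratio(tokens_a: Sequence[int], tokens_b: Sequence[int]) -> float:
    """Ratio of token counts, shorter over longer (assumed direction)."""
    shorter, longer = sorted((len(tokens_a), len(tokens_b)))
    if longer == 0:
        raise ValueError("both token vectors are empty")
    return shorter / longer


def token_jaccard(tokens_a: Sequence[int], tokens_b: Sequence[int]) -> float:
    """Shared unique tokens divided by all unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        raise ValueError("both token vectors are empty")
    return len(set_a & set_b) / len(union)


# Invented token IDs for two hypothetical tokenized SMILES strings.
tokens_1 = [12, 16, 16, 19, 13]
tokens_2 = [12, 16, 22, 13]

length_ratio = token_length_ratio(tokens_1, tokens_2)  # 4 / 5
shared_ratio = token_jaccard(tokens_1, tokens_2)       # 3 / 5
```

Because the Jaccard similarity works on sets, the repeated token 16 in the first vector counts once, so this metric reflects vocabulary overlap rather than token order or frequency.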
Patent & Literature Search. To determine the known functionality of the molecules examined herein, a patent and literature search was performed. The patents and literature articles for each molecule, if available, were obtained from PubChem [35]. A comprehensive list of all molecules considered herein, and their associated patents, is provided in Table S2.
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.