Principal Component Analysis SMILES
Import the Package
Run the analysis
Principal component analysis (PCA) is a well-studied technique for identifying the features of a data set that account for most of its variance. Data points that fall within a similar range along those features can be treated as a cluster. In chemistry, when we have a large set of SMILES data, PCA can help identify the major features that group molecules together in ways that are not obvious from the strings alone.
To accomplish this we need numbers, not a list of strings, so we convert the SMILES to fingerprints. First we capture each atom's local chemical environment and convert it to a series of numbers. We use Morgan fingerprinting, which is controlled by the "Morgan radius": a non-negative integer that sets how many bonds away from each atom to look.
At a Morgan radius of 0 only each atom itself is considered (its element and how many connections it has). As you increase the radius you also include the atoms bonded to it, then the atoms bonded to those, and so on. Doing this lets us start to evaluate the chemical environment around every atom.
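The effect of the radius can be sketched on a toy molecule. This is a minimal illustration in plain Python, not the fingerprinting code used by the package: the atom table, the bond table, and the helper `atoms_within_radius` are all hypothetical names written by hand for propan-1-ol, and the sketch assumes the convention that radius 0 is the atom alone.

```python
from collections import deque

# Hypothetical toy molecule: heavy atoms of propan-1-ol (SMILES "CCCO"),
# written by hand as an adjacency list. Indices are illustrative only.
ATOMS = {0: "C", 1: "C", 2: "C", 3: "O"}
BONDS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def atoms_within_radius(bonds, center, radius):
    """Return the sorted atom indices at most `radius` bonds from `center`."""
    seen = {center: 0}  # atom index -> bond distance from the center
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if seen[atom] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in bonds[atom]:
            if neighbor not in seen:
                seen[neighbor] = seen[atom] + 1
                queue.append(neighbor)
    return sorted(seen)


# Grow the environment around the oxygen atom one bond at a time.
for radius in range(3):
    env = atoms_within_radius(BONDS, center=3, radius=radius)
    print(radius, [ATOMS[i] + str(i) for i in env])
```

For the oxygen atom, radius 0 yields only the oxygen, radius 1 adds the carbon bonded to it, and radius 2 adds the next carbon along the chain. A real Morgan fingerprint hashes each of these environments into an identifier; the sketch only shows which atoms each radius reaches.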
The bit length of the fingerprint sets how much feature space you capture. The bit length is fixed, and in theory more bits allow more distinct features to be represented. There is a trade-off, though: beyond a point extra bits add redundancy and make the feature space too wide. Our default is 512 bits, which is enough to get an overall look at a list of SMILES data without knowing anything about it beforehand.
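To make the role of the bit length concrete, here is a small sketch of folding feature identifiers into a fixed-length bit vector. It is an assumption-laden toy: `fold_features`, the feature labels, and the use of `zlib.crc32` as the hash are all illustrative choices, not what a fingerprinting library does internally.

```python
import zlib


def fold_features(features, n_bits=512):
    """Fold arbitrary feature identifiers into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for feature in features:
        # crc32 is used only because it is deterministic across runs;
        # it is a stand-in, not the hash a real fingerprint library uses.
        position = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[position] = 1
    return bits


# Hypothetical environment labels for one molecule (illustrative only).
features = ["O|r0", "C-O|r1", "C-C-O|r2", "C|r0", "C-C|r1"]

wide = fold_features(features, n_bits=512)
narrow = fold_features(features, n_bits=4)

print(len(wide), sum(wide))      # vector length and number of bits set
print(len(narrow), sum(narrow))  # fewer bits means more collisions are possible
```

With only 4 bits, five features cannot all land on different positions, so at least two must share a bit. A longer vector makes such collisions less likely at the cost of a wider, sparser feature space.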
For principal component analysis, we want to evaluate the variance of the chemical data and start grouping together molecules with similar bit patterns. To do this we fit a vector (the first principal component) through the data, along with a second component orthogonal to it.
We then evaluate the variance of the data along each vector. The vectors are placed along the directions in which the data varies most, and how many vectors to keep is a subjective choice. Each axis of the resulting plot is a linear combination of the original bits, and the plot reduces the data into two main categories: outliers and normal values.
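The projection described above can be sketched with plain NumPy. This is a minimal sketch under stated assumptions, not the package's implementation: `pca_project` is a hypothetical helper, the 8-bit fingerprints are made-up values, and the components are obtained from a singular value decomposition of the mean-centered matrix.

```python
import numpy as np


def pca_project(bit_matrix, n_components=2):
    """Project rows of a 0/1 fingerprint matrix onto the top principal components."""
    X = np.asarray(bit_matrix, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA measures variance around the mean
    # Rows of vt are orthonormal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(X_centered, full_matrices=False)
    components = vt[:n_components]
    scores = X_centered @ components.T  # coordinates of each molecule
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return scores, components, explained_ratio


# Hypothetical 8-bit fingerprints for six molecules (made-up values).
fingerprints = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
]

scores, components, explained_ratio = pca_project(fingerprints)
print(scores.round(2))           # one (PC1, PC2) pair per molecule
print(explained_ratio.round(2))  # share of total variance per component
```

In this toy data the first three molecules share one bit pattern and the last three share another, so they fall on opposite sides of the first component. The explained-variance ratios give a rough guide to how many components are worth keeping.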
To add some context in terms of the bit representation:
But this is not enough, because the chemical space from a PCA on large sets of SMILES can have a lot of variance. To gain further insight from the PCA, we apply k-means clustering on the subspace to get a hierarchical view of the major core scaffolds in our ligand set. The result can contain a variety of clusters depending on the data.
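Clustering on the PCA subspace can be sketched as follows. This is a plain NumPy version of Lloyd's k-means written for illustration, not the package's code: the `kmeans` helper, the deterministic farthest-first initialization, and the 2-D coordinates are all assumptions made for the example.

```python
import numpy as np


def kmeans(points, n_clusters, n_iter=100):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    X = np.asarray(points, dtype=float)
    # Deterministic farthest-first start: begin with the first point, then
    # repeatedly add the point farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(1, n_clusters):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical 2-D PCA coordinates for six molecules (made-up values).
pca_scores = [[-1.2, 0.1], [-1.0, -0.2], [-1.1, 0.3],
              [1.1, 0.2], [0.9, -0.1], [1.2, -0.3]]

labels, centers = kmeans(pca_scores, n_clusters=2)
print(labels)
print(centers.round(2))
```

The number of clusters is an input here, so in practice it has to be chosen, for example by trying several values and inspecting the resulting groups.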
References
Jolliffe, I. T., editor. “Principal Component Analysis and Factor Analysis.” Principal Component Analysis, Springer, 2002, pp. 150–66. Springer Link, https://doi.org/10.1007/0-387-22440-8_7.
Ding, Chris, and Xiaofeng He. “K-Means Clustering via Principal Component Analysis.” Twenty-First International Conference on Machine Learning - ICML ’04, ACM Press, 2004, p. 29. DOI.org (Crossref), https://doi.org/10.1145/1015330.1015408.
Morgan, H. L. “The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service.” Journal of Chemical Documentation, vol. 5, no. 2, May 1965, pp. 107–13. DOI.org (Crossref), https://doi.org/10.1021/c160017a018.