GlobalChem: Your Chemical Knowledge Graph
  • Welcome to the GlobalChem Documentation!
  • Quick Start
  • Extensions
  • API
    • GlobalChem
    • Graph Algorithm
  • Mother Nature
    • Mother Nature Commands
    • Discord Roles
  • Cheminformatics
    • SMILES Validation
    • Decoding Fingeringprints and SMILES to IUPAC
    • SMILES to PDF And Back
    • Drug Design Filters
    • Deep Layer Scatter
    • Identifier SMARTS
    • Protonating SMILES
    • Sunbursting SMILES
    • Visualizing SMARTS
    • One-Hot Encoding SMILES
    • Principal Component Analysis SMILES
    • GlobalChem Graph to Networkx Graph
    • Amino Acid Sequence to SMILES
    • Scaffold Graph Adapter
  • Bioinformatics
    • GlobalChem Protein
    • GlobalChem RNA
    • GlobalChem DNA
    • GlobalChem Bacteria
    • GlobalChem Monoclonal Antibody
  • Quantum Chemistry
    • Z-Matrix Store
    • Psi4Parser & Orbital Visualizer
  • ForceFields
    • GlobalChem Molecule
    • CGenFF Molecule
    • GAFF2 Molecule
    • CGenFF Dissimilarity Score
  • Development Operations
    • Open Source Database Monitor
  • Graphing Templates
    • Plotly
Powered by GitBook
On this page
  1. Cheminformatics

One-Hot Encoding SMILES

PreviousVisualizing SMARTSNextPrincipal Component Analysis SMILES

Last updated 3 years ago

One-Hot encoding is the processing of moving something that is built on some sort of language can be indexed appropriately with numbers. In machine learning, it is common to convert strings to numeric characters so the machine can work with the numbers. Most often there is an encoder to convert the strings, perform the machine learning, and then decode the numbers to check the performance of the machine learning.

Imports

from global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions

gc = GlobalChem()
cheminformatics = GlobalChemExtensions().cheminformatics()

Convert A List of SMILES to GlobalChem Encoder

smiles_list = list(gc.get_node_smiles('pihkal').values())

encoded_smiles = cheminformatics.encode_smiles(smiles_list, max_length=200)

print ('Encoded SMILES: %s' % encoded_smiles[0])

decoded_smiles = cheminformatics.decode_smiles(encoded_smiles)

print ('Decoded SMILES: %s' % decoded_smiles[0])
Encoded SMILES: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Decoded SMILES: CCC(N)CC1=CC(=C(OC)C(=C1)OC)OC

Algorithm

In the code there is an index mapping of the the SMILES to the index and then convert to numpy array. Currently, the mapping is set to this to include polymeric SMILES, atom classes, and virtual atoms.

__SMILES_MAPPING__ = [
        ' ',
        '#', '%', '(', ')', '+', '-', '.', '/',
        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
        '=', '@',
        'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P',
        'R', 'S', 'T', 'V', 'X', 'Z',
        '[', '\\', ']',
        'a', 'b', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's',
        't', 'u',
        '&', ':', '*'
    ]
Basic Molecular Representation for Machine LearningMedium
Logo