SMILES notation 
===============

   This introduction to SMILES notation is based on the more detailed
description in the following paper:

D. Weininger, "SMILES, a chemical language and information system", Journal
of Chemical Information and Computer Sciences, 28 (1988), pp. 31-36.

   SMILES (Simple Molecular Input Line Entry System) notation allows the
two-dimensional graph of a molecule (and certain aspects of its
three-dimensional structure) to be written as a concise, one-dimensional
string of characters. This allows computers to store large numbers of
chemical structures in a small space and to process them extremely quickly.
The beauty of SMILES, compared to other encoding systems you could devise,
is that humans can also quite easily look at a SMILES string and determine
what molecule it represents and, conversely, easily construct a SMILES
string that represents a given structure.

   A SMILES notation is a sequence of characters terminated by white space.
Hydrogens may be omitted or included. Aromatic structures can be specified
directly or in their Kekulé form.


Atoms
=====

   Atoms are represented by their atomic symbols. Atoms in the "organic
subset", that is B, C, N, O, P, S, F, Cl, Br, and I, may be written as the
bare symbol. All other atoms must be enclosed in square brackets, which also
separates each symbol from the next. For an atom written without brackets,
enough hydrogen atoms are implied to fill up any unused bonds. When the atom
symbol is enclosed in square brackets, the number of attached hydrogens is
stated explicitly. Charges on an atom, if present, are also specified inside
the square brackets.

   For example

        SMILES                                  Molecule

          C                                     methane
          N                                     ammonia
          O                                     water
         [Au]                                   elemental gold
         [OH-]                                  hydroxide anion
         [OH3+]                                 hydronium cation
         [Fe+2] or [Fe++]                       iron (II) cation
         [NH4+]                                 ammonium cation

   Atoms in aromatic rings are specified by lower case letters, e.g. 'c' for
an aromatic carbon atom.
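
   As a rough illustration of the rules above, the following Python sketch
picks the atom symbols out of a SMILES string. The function name
`atom_tokens` is made up for this example, and the list of lower case
(aromatic) symbols it accepts is an assumption rather than something stated
in the text. It only recognises atoms and silently skips everything else.

```python
import re

# Alternatives are tried left to right, so the two-letter symbols Br and Cl
# must come before the one-letter ones; otherwise "Cl" would be read as "C"
# followed by a stray "l".
ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]"      # anything in square brackets, e.g. [Au] or [NH4+]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter organic-subset atoms
    r"|[bcnops]"       # aromatic atoms (assumed set of lower case symbols)
)

def atom_tokens(smiles):
    """Return the atom symbols of a SMILES string in order of appearance."""
    return ATOM_PATTERN.findall(smiles)

print(atom_tokens("CCl"))       # ['C', 'Cl']
print(atom_tokens("[NH4+]"))    # ['[NH4+]']
print(atom_tokens("[Fe+2]"))    # ['[Fe+2]']
```

   Treating a bracketed atom as a single token keeps the hydrogen count and
charge together with the symbol they belong to.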


Bonds
=====

   Single, double, triple, and aromatic bonds are represented by the symbols
'-', '=', '#', and ':' respectively. Single and aromatic bond symbols may
be, and usually are, omitted.

   Examples

        CC                                      ethane
        C=C                                     ethene
        CCO                                     ethanol
        C#N                                     hydrogen cyanide
        [H][H]                                  molecular hydrogen
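
   The way omitted hydrogens follow from the bonds can be sketched in a few
lines of Python. This is only an illustration: the valence table is an
assumption (it is not given in the text above and uses a single usual
valence per element), the name `implied_hydrogens` is made up, and the
function handles only unbranched, acyclic chains of unbracketed atoms.

```python
import re

# Assumed usual valences for a few organic-subset atoms.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def implied_hydrogens(smiles):
    """Return (atom, implied hydrogen count) pairs for a simple chain."""
    tokens = re.findall(r"Cl|Br|[CNOFI]|[-=#]", smiles)
    atoms, bonds_used = [], []
    pending = 1                    # an omitted bond symbol means a single bond
    for tok in tokens:
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
            continue
        if atoms:
            # The bond uses up valence on the previous atom and on this one.
            bonds_used[-1] += pending
            bonds_used.append(pending)
        else:
            bonds_used.append(0)   # first atom: no bond yet
        atoms.append(tok)
        pending = 1
    return [(atom, VALENCE[atom] - used) for atom, used in zip(atoms, bonds_used)]

print(implied_hydrogens("CCO"))   # [('C', 3), ('C', 2), ('O', 1)]
print(implied_hydrogens("C#N"))   # [('C', 1), ('N', 0)]
```

   For ethanol this gives the familiar CH3-CH2-OH, and for hydrogen cyanide
a single hydrogen on the carbon.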


Branches
========

   Branches off the main chain are enclosed in parentheses.

   For example (these and the following more complicated structures are
drawn out in the file 'SMILESegs'):

        CCN(CC)CC                               triethylamine
        CC(=O)O                                 ethanoic acid

   Branches may be nested, for example

        C=CC(CCC)C(C(C)C)CCC

   is a perfectly valid SMILES string.
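
   One way a program might follow the parentheses is with a stack of branch
points, as in this Python sketch. The function name `parse_branches` is
made up for the example. It ignores bond symbols and assumes an acyclic
SMILES with no ring closure numbers.

```python
import re

def parse_branches(smiles):
    """Return (atoms, bonds) for an acyclic SMILES that may contain branches.

    bonds is a list of (i, j) index pairs into atoms; bond orders are ignored.
    """
    tokens = re.findall(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[()]", smiles)
    atoms, bonds = [], []
    previous = None     # index of the atom the next atom will be bonded to
    stack = []          # branch points still waiting for their ')'
    for tok in tokens:
        if tok == "(":
            stack.append(previous)    # remember where the branch starts
        elif tok == ")":
            previous = stack.pop()    # go back to the branch point
        else:
            atoms.append(tok)
            if previous is not None:
                bonds.append((previous, len(atoms) - 1))
            previous = len(atoms) - 1
    return atoms, bonds

atoms, bonds = parse_branches("CC(=O)O")
print(atoms)   # ['C', 'C', 'O', 'O']
print(bonds)   # [(0, 1), (1, 2), (1, 3)]
```

   Both oxygens of ethanoic acid end up bonded to atom 1, the carbonyl
carbon, because the ')' returns the parser to the atom where the branch
began. Nested branches simply push more than one entry onto the stack.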


Cyclic structures
=================

   Rings are first converted to linear structures by breaking one single (or
aromatic) bond in each ring. The SMILES for the resulting linear structure
is then written as normal, except that a ring closure number is added after
each of the two atoms that had the bond between them broken.

   For example, (remembering lower case atom symbols imply aromaticity)

        C1CCCCC1                                cyclohexane
        c1ccccc1                                benzene
        C1C=CC=C1                               cyclopentadiene
        Oc1ccccc1                               phenol
        Brc1cc(Br)cc(Br)c1                      1,3,5-tribromobenzene

   There may be more than one way of writing the structure as a SMILES. For
example 1-methyl-3-bromo-cyclohexene may be written as

        CC1=CC(Br)CCC1

   or as

        CC1=CC(CCC1)Br

   An individual atom may be involved in closing more than one ring, as in
cubane. In this case all the ring closure numbers associated with the atom
are written after it.

   So cubane may be written as

        C12C3C4C1C5C4C3C25

   Ring closure digits may be reused once the ring they label has been
closed. More than 9 may still be needed, however. In this case (i.e. for
ring closure numbers of 10 or greater) the two digits are preceded by a '%'
symbol.

   To illustrate both these things

        C%12CCCCC%12N=NC%12CCCCC%12

   represents two cyclohexane rings joined by a two-atom N=N linker.
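
   The pairing of ring closure numbers, including reuse of a number and the
'%' form, can be sketched in Python as follows. The function name
`ring_closure_bonds` is made up, and the set of lower case aromatic symbols
it accepts is an assumption. It reports only the bonds made by ring closure
numbers, not the ordinary chain bonds.

```python
import re

def ring_closure_bonds(smiles):
    """Return the pairs of atom indices joined by ring closure numbers."""
    tokens = re.findall(
        r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d\d|\d", smiles
    )
    open_rings = {}     # ring closure number -> index of the atom that opened it
    closures = []
    atom_index = -1
    for tok in tokens:
        if tok[0] == "%" or tok.isdigit():
            number = int(tok.lstrip("%"))
            if number in open_rings:
                # Second appearance: close the ring and free the number.
                closures.append((open_rings.pop(number), atom_index))
            else:
                open_rings[number] = atom_index
        else:
            atom_index += 1
    return closures

print(ring_closure_bonds("C1CCCCC1"))
# [(0, 5)]
print(ring_closure_bonds("C%12CCCCC%12N=NC%12CCCCC%12"))
# [(0, 5), (8, 13)]
print(ring_closure_bonds("C12C3C4C1C5C4C3C25"))
# [(0, 3), (2, 5), (1, 6), (0, 7), (4, 7)]
```

   For cubane the five ring closure bonds, added to the seven bonds along
the chain of eight atoms, give the twelve edges of the cube.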


Disconnected structures
=======================

   Disconnected structures are written as individual SMILES separated by a
full stop. For example sodium phenoxide can be written as

        [Na+].[O-]c1ccccc1

   or even

        c1cc([O-].[Na+])ccc1

   Note, however, that no association of ions is implied by the order in
which disconnected structures appear in the SMILES.


Isomerism
=========

   The stereochemistry at chiral centres can be specified in SMILES. The
chiral atom should be enclosed in square brackets with either one or two '@'
symbols following it. One '@' implies that the branches that follow it in
the SMILES string occur in an anticlockwise arrangement when viewed from the
atom that precedes the chiral atom. Two '@' symbols mean the branches occur
in a clockwise arrangement. This is undoubtedly totally unclear, so here is
an example

        OC(=O)[C@@]([H])(N)Cc1ccc(O)cc1 L-Tyrosine

   here the [H], N and Cc1ccc(O)cc1 are arranged clockwise when viewed along
the bond from the carboxyl group, OC(=O), to the chiral carbon atom. The
other isomer is

        OC(=O)[C@]([H])(N)Cc1ccc(O)cc1  D-Tyrosine

   which has the three groups in an anticlockwise arrangement. These can be
written more simply as

        OC(=O)[C@@H](N)Cc1ccc(O)cc1     L-Tyrosine

        OC(=O)[C@H](N)Cc1ccc(O)cc1      D-Tyrosine

   As an explanation for the use of the '@' symbol, and as an aid to
remembering which is which: an '@' symbol is an 'a' with an anticlockwise
circle around it.
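
   Because the same molecule can be written starting from different atoms,
the order of the groups around a chiral centre can change between two valid
SMILES strings. Swapping any two groups reverses the clockwise or
anticlockwise sense, so '@' and '@@' must be exchanged after an odd number
of swaps and kept after an even number. The Python sketch below works this
out. The function names and the group labels are made up for the example.

```python
def permutation_parity(original, reordered):
    """Return 0 if reordered is an even permutation of original, else 1."""
    positions = [original.index(item) for item in reordered]
    swaps = 0
    for i in range(len(positions)):
        # Put the right value in slot i, counting each swap that is needed.
        while positions[i] != i:
            j = positions[i]
            positions[i], positions[j] = positions[j], positions[i]
            swaps += 1
    return swaps % 2

def chirality_after_reorder(symbol, original, reordered):
    """Return the '@' or '@@' that keeps the same isomer in the new order."""
    if permutation_parity(original, reordered) == 0:
        return symbol
    return "@" if symbol == "@@" else "@@"

# Group order in OC(=O)[C@@]([H])(N)Cc1ccc(O)cc1 (L-Tyrosine).
written = ["COOH", "H", "N", "CH2Ar"]

# Starting the string from the nitrogen instead.
print(chirality_after_reorder("@@", written, ["N", "H", "CH2Ar", "COOH"]))  # @@

# Swapping only the last two branches.
print(chirality_after_reorder("@@", written, ["COOH", "H", "CH2Ar", "N"]))  # @
```

   So L-Tyrosine could equally be written N[C@@H](Cc1ccc(O)cc1)C(=O)O or
OC(=O)[C@]([H])(Cc1ccc(O)cc1)N.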

   The cis/trans isomerism of double bonds can also be specified. The
symbols '/' and '\' are used for this. They are written on the bonds to the
substituents, so they precede and/or follow the atoms which are doubly
bonded. For example

        Cl\C=C/Cl                       cis dichloro-ethene

        Cl\C=C\Cl                       trans dichloro-ethene

   or for a double bond with two groups at each end

        Cl/C(Br)=C(/I)F

   This will have Cl and I trans to each other, with the Br at the same end
as the Cl, and the F at the same end as the I.
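
   The pattern in these examples is that two identical marks give a trans
arrangement and two different marks give a cis arrangement. The short
Python sketch below applies that rule. The function name is made up, and it
assumes a SMILES with exactly one marked double bond, where the first marked
group is written before its doubly bonded atom and the second after its
doubly bonded atom, as in all three examples above.

```python
import re

def double_bond_geometry(smiles):
    """Return 'cis' or 'trans' for a SMILES with one marked double bond."""
    marks = re.findall(r"[/\\]", smiles)
    if len(marks) != 2:
        raise ValueError("expected exactly two direction marks")
    # Same mark on both sides: the marked groups are trans.
    # Different marks: the marked groups are cis.
    return "trans" if marks[0] == marks[1] else "cis"

# Raw strings stop Python treating the backslash as an escape character.
print(double_bond_geometry(r"Cl\C=C/Cl"))        # cis
print(double_bond_geometry(r"Cl\C=C\Cl"))        # trans
print(double_bond_geometry("Cl/C(Br)=C(/I)F"))   # trans (Cl relative to I)
```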


   Using the above rules almost all organic structures can be written in
SMILES notation. As a final demonstration, even a complicated molecule such
as morphine

        O1C2C(O)C=CC3C2(C4)c5c1c(O)ccc5CC3N(C)C4

   can be written in a simple (it is when you get used to it!) and concise
way.


---
Simon Kilvington, 3/11/94
