
Autoencoder network for learning a continuous representation of molecular structures.


A Keras implementation of Aspuru-Guzik's molecular autoencoder paper

Abstract from the paper

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This generative model allows efficient search and optimization through open-ended spaces of chemical compounds.

We train deep neural networks on hundreds of thousands of existing chemical structures to construct two coupled functions: an encoder and a decoder. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to the discrete representation from this latent space.

Continuous representations allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the design of drug-like molecules as well as organic light-emitting diodes.

Link to the paper: https://arxiv.org/abs/1610.02415


Install using

pip install -r requirements.txt
or build a docker container:
docker build .

The Docker container can also be built with a different TensorFlow binary, for example in order to use the GPU:

docker build --build-arg TF_BINARY_URL=<url-of-tensorflow-binary> .

You'll need to ensure the proper CUDA libraries are installed for this version to work.

Getting the datasets

A small 50k molecule dataset is included in data/smiles_50k.h5 to make it easier to get started playing around with the model. A much larger 500k ChEMBL 21 extract is also included, along with a model trained on that extract.

All h5 files in this repo are stored with git-lfs rather than included directly in the repo.

To download original datasets to work with, you can use the download_dataset.py script:
  • python download_dataset.py --dataset zinc12
  • python download_dataset.py --dataset chembl22
  • python download_dataset.py --uri <url-of-csv-file> --outfile data/my-file.csv

Preparing the data

To train the network you need a lot of SMILES strings. The preprocess.py script assumes you have an HDF5 file containing a table, one column of which is named structure and holds one SMILES string (no longer than 120 characters) per row. The script then:
  • Normalizes the length of each string to 120 by appending whitespace as needed.
  • Builds a list of the unique characters used in the dataset. (The "charset")
  • Substitutes each character in each SMILES string with the integer ID of its location in the charset.
  • Converts each character position to a one-hot vector of len(charset).
  • Saves this matrix to the specified output file.
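
For illustration, here is a minimal NumPy sketch of the padding and one-hot encoding described above. The helper names and example molecules are made up for the example and are not the preprocessing script's actual API:

```python
import numpy as np

MAX_LEN = 120  # fixed SMILES length used during preprocessing

def build_charset(smiles_list):
    # Collect the unique characters in the dataset, plus the space used for padding
    return sorted(set("".join(smiles_list)) | {" "})

def smiles_to_one_hot(smiles, charset):
    # Pad to MAX_LEN with whitespace, then one-hot encode each character position
    index = {c: i for i, c in enumerate(charset)}
    padded = smiles.ljust(MAX_LEN)
    one_hot = np.zeros((MAX_LEN, len(charset)), dtype=np.float32)
    for pos, char in enumerate(padded):
        one_hot[pos, index[char]] = 1.0  # substitute the character with its charset ID
    return one_hot

smiles = ["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # benzene and aspirin, as toy examples
charset = build_charset(smiles)
matrix = np.stack([smiles_to_one_hot(s, charset) for s in smiles])
print(matrix.shape)  # (2, 120, len(charset))
```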


python preprocess.py data/smiles_50k.h5 data/processed.h5

Training the network

The preprocessed data can be fed into the train.py script:

python train.py data/processed.h5 model.h5 --epochs 20

If a model file already exists it will be opened and resumed. If it doesn't exist, it will be created.

By default, the latent space is 292-D, per the paper, and is configurable with a command-line flag. If you use a non-default latent dimensionality, don't forget to pass the same flag to the other scripts (e.g. sample.py) when you operate on that model checkpoint file, or the dimensions won't match.
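
As a rough orientation, the network described in the paper is a variational autoencoder whose encoder applies 1-D convolutions over the character positions and whose decoder uses stacked GRUs to emit a softmax over the charset at every position. Below is a heavily simplified Keras sketch of that shape; it is not the repo's actual model code, the variational sampling layer and KL loss term are omitted, and the layer sizes are only loosely based on the paper:

```python
from tensorflow.keras import layers, models

MAX_LEN, CHARSET_SIZE, LATENT_DIM = 120, 35, 292  # CHARSET_SIZE depends on your dataset

# Encoder: 1-D convolutions over the character axis, then a dense latent code
encoder_in = layers.Input(shape=(MAX_LEN, CHARSET_SIZE))
x = layers.Conv1D(9, 9, activation="relu")(encoder_in)
x = layers.Conv1D(9, 9, activation="relu")(x)
x = layers.Conv1D(10, 11, activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dense(435, activation="relu")(x)
z = layers.Dense(LATENT_DIM, activation="relu")(x)  # a real VAE samples z from a mean/log-variance pair here

# Decoder: repeat the latent code across positions and decode with GRUs
d = layers.Dense(LATENT_DIM, activation="relu")(z)
d = layers.RepeatVector(MAX_LEN)(d)
d = layers.GRU(501, return_sequences=True)(d)
d = layers.GRU(501, return_sequences=True)(d)
d = layers.GRU(501, return_sequences=True)(d)
decoder_out = layers.TimeDistributed(layers.Dense(CHARSET_SIZE, activation="softmax"))(d)

autoencoder = models.Model(encoder_in, decoder_out)
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
autoencoder.summary()
```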

Sampling from a trained model

The sample.py script can be used to run the full autoencoder (for testing), or just the encoder or decoder half, using the --target parameter. The data file must include a charset field.


python sample.py data/processed.h5 model.h5 --target autoencoder

python sample.py data/processed.h5 model.h5 --target encoder --save_h5 encoded.h5

python sample.py target/encoded.h5 model.h5 --target decoder


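When sampling the decoder, the model's output is a probability distribution over the charset at each of the 120 character positions; turning that back into a SMILES string is just an argmax per position, a lookup into the charset, and stripping the padding. A small illustrative sketch (the helper name and charset below are hypothetical):

```python
import numpy as np

def decode_one_hot(probs, charset):
    # probs has shape (120, len(charset)); take the most likely character at each position
    indices = probs.argmax(axis=-1)
    return "".join(charset[i] for i in indices).strip()  # drop the whitespace padding

charset = ["c", "1", "C", "O", "(", ")", "=", " "]            # hypothetical charset
fake_output = np.eye(len(charset))[[2, 2, 4, 6, 3, 5, 3, 7]]  # stand-in for decoder output
print(decode_one_hot(fake_output, charset))                   # -> CC(=O)O
```
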
After 30 epochs on a 500,000 molecule extract from ChEMBL 21 (~7 hours on an NVIDIA GTX 1080), I'm seeing a loss of 0.26 and a reconstruction accuracy of 0.98.

Projecting the dataset onto a 2D latent space gives a figure that looks reasonably similar to Figure 3 from the paper, though there are some strange striations and it's not quite as well spread out as the examples in the paper.
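
One simple way to reproduce that kind of plot is to encode the dataset and then project the latent vectors down to two dimensions with PCA. The sketch below assumes the encoded vectors were saved to encoded.h5 under a dataset named latent_vectors; that dataset name is an assumption, so check what the sampling script actually writes:

```python
import h5py
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

with h5py.File("encoded.h5", "r") as f:
    latent = f["latent_vectors"][:]             # assumed shape: (n_molecules, 292)

xy = PCA(n_components=2).fit_transform(latent)  # project the 292-D codes onto two axes
plt.scatter(xy[:, 0], xy[:, 1], s=1, alpha=0.3)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Latent space projection")
plt.show()
```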
