This paper addresses molecule generation, a challenging problem in cheminformatics. Deep generative approaches to it generally fall into two types: one encodes molecular graphs as strings of text and learns a character-based language model over them, while the other operates directly on the molecular graph. The string-based approach has two limitations: it tends to generate invalid molecules and duplicate molecules.
To address these limitations, the authors propose a language model over small molecular substructures called fragments, loosely inspired by the well-known paradigm of Fragment-Based Drug Design. Put simply, they generate molecules fragment by fragment instead of atom by atom.
The authors show experimentally that their model substantially outperforms other language model-based competitors and reaches the state-of-the-art performance typical of graph-based approaches.
The method consists of three main steps:
- Break molecules into sequences of fragments,
- Encode them as SMILES words,
- Learn their corresponding language model.
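
The three steps above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the fragment sequences are hand-written placeholders (real fragmentation would need a chemistry toolkit), the fragment strings are simplified and carry no attachment-point information, and a simple bigram count model stands in for the learned language model. All names (`fragmented_molecules`, `train_bigram`, `sample`) are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Step 1 (assumed already done): each molecule is a sequence of fragments.
# These sequences are illustrative placeholders, not data from the paper.
fragmented_molecules = [
    ["c1ccccc1", "C(=O)", "N", "CC"],
    ["c1ccccc1", "O", "CC"],
    ["c1ccncc1", "C(=O)", "O"],
    ["c1ccccc1", "C(=O)", "O"],
]

# Step 2: treat every distinct fragment SMILES string as one "word".
vocab = sorted({frag for seq in fragmented_molecules for frag in seq})
word_to_id = {frag: i for i, frag in enumerate(vocab)}

BOS, EOS = "<s>", "</s>"  # sequence start / end markers


def train_bigram(sequences):
    """Step 3 (toy stand-in): count fragment-to-fragment transitions."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = [BOS] + seq + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample(counts, rng, max_len=10):
    """Generate one sequence fragment by fragment, not atom by atom."""
    out, prev = [], BOS
    for _ in range(max_len):
        choices = counts[prev]
        nxt = rng.choices(list(choices), weights=list(choices.values()))[0]
        if nxt == EOS:
            break
        out.append(nxt)
        prev = nxt
    return out


model = train_bigram(fragmented_molecules)
generated = sample(model, random.Random(0))
print(generated)
```

Because every sampled token is a whole fragment drawn from the vocabulary, the model never emits a malformed fragment string, which gives some intuition for why working at the fragment level can reduce invalid outputs. Joining the sampled fragments back into a single molecule is a separate step that this sketch leaves out.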
Read the full paper for more details.
Paper PDF: https://arxiv.org/pdf/2002.12826.pdf