Neural Networks and Nucleotides: AI in Genomic Manufacturing

Plant breeding is pivotal in ensuring stable food for the growing global population. To meet increasing food demands efficiently, plant breeding must achieve high rates of genetic gain. Genomic selection is a powerful tool, leveraging genome-wide DNA variation and phenotypic data to predict the performance of unobserved individuals. Empirical studies have demonstrated GS’s superiority over conventional methods, enhancing selection gains and reducing breeding cycles across various crops. Furthermore, deep learning techniques, a subset of artificial intelligence, are increasingly explored in genomic prediction, showing promise in improving prediction accuracy, particularly with the expanding volume of genetic data. This intersection of genomics and DL holds the potential for revolutionizing various fields, including precision medicine and agriculture.

Deep Learning Architectures: A Genomic Perspective:

Recent advancements in genomic deep learning architectures have enabled more efficient and accurate biological data processing. CNNs excel in capturing genomic motifs, while RNNs handle sequential data like DNA sequences. Autoencoders, including Variational Autoencoders (VAEs), are valuable for feature extraction and dimensionality reduction. Emerging architectures, like hybrid models combining CNNs and RNNs, tackle specific genomic tasks effectively. Transformer-based LLMs, such as GPT, overcome the limitations of CNNs and RNNs by efficiently processing long sequences and capturing global dependencies. However, the high cost of training and serving LLMs remains challenging, especially for genomics tasks with extensive data requirements and privacy concerns.

Genomic Applications:

Deep learning is a powerful tool in various genomic applications, including gene expression characterization, regulatory genomics, functional genomics, and structural genomics. In gene expression characterization, deep learning models like denoising autoencoders and variational autoencoders have been employed to extract features from gene expression data, leading to an understanding of biological processes and better performance in tasks such as clustering and prediction. Moreover, deep learning methods have shown promise in predicting gene expression levels from DNA sequences, incorporating epigenetic data for enhanced accuracy, and even utilizing generative models to explore hypothetical gene expression profiles under different perturbations.

In regulatory genomics, deep learning techniques have been applied to identify regulatory motifs such as promoters, enhancers, and splice sites, with CNNs being particularly effective in capturing sequence features. Subcellular localization prediction of proteins has also benefited from deep learning, with models like CNNs and RNNs achieving high accuracy by effectively learning from biological sequence data. Additionally, deep learning methods in structural genomics have shown promise in protein structure classification and homology detection, leveraging techniques such as LSTM networks and CNNs to extract features from amino acid sequences and accurately classify protein folds. Overall, deep learning revolutionizes genomic research by providing powerful tools for analyzing complex biological data and uncovering novel insights into genetic mechanisms.

Materials and methods:

The study employed two datasets from the 1000 Genomes project, consisting of 10,000 and 65,535 single-nucleotide polymorphisms (SNPs) on specific chromosomal regions. They trained generative models, including Wasserstein GAN with gradient penalty (WGAN-GP), Restricted Boltzmann Machines (RBM), and Variational Autoencoders (VAE) to generate artificial genomic sequences. WGAN-GP and VAE were implemented with convolutional layers, while RBM utilized out-of-equilibrium learning. The evaluation included assessing the models’ ability to mimic real data via PCA and calculating the nearest neighbor adversarial accuracy (AATS) to measure overfitting and underfitting. Privacy leakage was quantified using a privacy score computed from AATS values of test and training datasets.

Generating large-scale genomic data:

The study trained WGAN and CRBM models on 1000 genome data containing 65,535 SNPs to generate artificial genomic sequences. While the VAE model couldn’t be trained effectively, WGAN and CRBM generated sequences that well captured real population structure and allele frequencies. However, WGAN-generated sequences had more fixed alleles with low frequencies than CRBM. LD decay analysis showed that both models had lower LD than real genomes. CRBM outperformed WGAN in 3-point correlation analysis but showed anomalies in AATS values, potentially indicating sequences outside the real data space. Further analysis revealed higher frequencies of chains of true data points compared to synthetic ones.


Deep learning shows promise in genomic research for its ability to capture nonlinear patterns and integrate diverse data sources without explicit feature engineering. However, its superiority over conventional models in predictive power has yet to be definitive. While generative neural networks can efficiently simulate large-scale genomic data, challenges like computational complexity and model optimization persist. Privacy concerns also necessitate further investigation. Despite these hurdles, advancements in model training and privacy safeguards could lead to artificial genome banks, expanding access to genomic data. Deep learning holds the potential to revolutionize genomics but requires careful navigation of challenges to achieve meaningful breakthroughs in predictive accuracy and interoperability.


[Announcing Gretel Navigator] Create, edit, and augment tabular data with the first compound AI system trusted by EY, Databricks, Google, and Microsoft