A New AI Research From Google Declares The Completion of The First Human Pangenome Reference

Researchers have spent decades piecing together a map of the human genome, a comprehensive record of our genetic instructions. The first draft was completed in 2000, but it was missing key pieces, and even after the reference genome was finished in 2022, there was still a ways to go. Google's genomics team has spent the past three years working with the Human Pangenome Reference Consortium, a group of 119 researchers from 60 institutions worldwide, to develop a new and more comprehensive map of the human genome.

The pangenome is a better representation of human genetic variation because it combines reference sequences from 47 genomes rather than relying on a single one. Building on Google's deep learning technology and previous genomics advances, researchers used techniques based on convolutional neural networks (CNNs) and transformers to overcome the difficulties of producing accurate pangenome sequences and applying them to genomic analysis. The result is a wealth of data now available to academics, clinicians, and geneticists everywhere.

Applications

  • Using a single linear reference genome, such as GRCh38 or CHM13, introduces mapping biases that the pangenome reference aims to reduce, improving downstream analysis procedures.
  • A major benefit of a graph-based pangenome reference is that it can accurately represent polymorphic structural variants (SVs).
  • Researchers compared the utility of the pangenome reference against a typical reference genome by mapping simulated RNA sequencing (RNA-seq) data to both (Methods). The pangenome-based pipeline using vg mpmap achieved lower false mapping rates than the linear reference pipelines using vg mpmap or STAR. The pangenome pipeline also showed less allelic bias and more mapped coverage on heterozygous variants than the linear reference pipelines, which could aid research into allele-specific expression.
  • Using the pangenome, researchers re-analyzed ChIP-seq data for H3K4me1 and H3K27ac, along with ATAC-seq data, from monocyte-derived macrophages of 30 individuals of African ancestry and 30 individuals of European ancestry.
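To make the allelic-bias point above concrete, here is a minimal, hypothetical sketch (not the consortium's actual metric) of how reference bias at a heterozygous site might be quantified: the fraction of allele-informative reads supporting the reference allele, which should sit near 0.5 for an unbiased pipeline. The read counts are invented for illustration.

```python
# Hypothetical illustration of allelic (reference) bias at a heterozygous site.
# A mapper biased toward the linear reference places more reads on the
# reference allele; an unbiased pipeline sees roughly a 50/50 split.

def reference_bias(ref_read_count: int, alt_read_count: int) -> float:
    """Fraction of allele-informative reads supporting the reference allele."""
    total = ref_read_count + alt_read_count
    if total == 0:
        raise ValueError("no allele-informative reads")
    return ref_read_count / total

# Linear-reference pipeline: reads carrying the alternate allele mismatch the
# reference and map less often, inflating the reference fraction.
linear = reference_bias(ref_read_count=70, alt_read_count=50)

# Pangenome pipeline: both alleles exist in the graph, so coverage is
# closer to balanced.
pangenome = reference_bias(ref_read_count=60, alt_read_count=58)

print(f"linear bias: {linear:.3f}, pangenome bias: {pangenome:.3f}")
```

A bias value well above 0.5, as in the linear case here, is what makes allele-specific expression analysis unreliable with a single reference.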

Pangenomes are constructed using graphs

After sequencing instruments read millions of small fragments of an individual’s genome, a program called a mapper or aligner determines where each fragment best matches a single, linear human reference sequence. This is the standard analysis workflow for high-throughput DNA sequencing.
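The workflow above can be caricatured in a few lines. This is a toy sketch, not a real mapper: production aligners (e.g., BWA or STAR) use sophisticated indexes and tolerate mismatches, but the core idea is the same, each read gets assigned a position on one linear reference string. The reference string and reads here are invented.

```python
# Toy sketch of the standard workflow: map short reads onto a single linear
# reference by exact substring search.

REFERENCE = "ACGTACGTTAGCCGATTACA"  # stand-in for a linear reference genome

def map_read(read: str, reference: str = REFERENCE) -> int:
    """Return the 0-based position of the first exact match,
    or -1 if the read does not occur in the reference."""
    return reference.find(read)

print(map_read("TAGCC"))   # 8
print(map_read("TTTTT"))   # -1: a sequence absent from the reference is lost
```

The second call shows the limitation discussed next: any sequence an individual carries that is absent from the reference simply fails to map.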

Different people’s DNA will have different sequences, and sequences absent from the reference genome cannot be studied. Because a pangenome must represent the sequences of many individuals at once, the consortium turned to graph data structures to solve this problem. The nodes of a genome graph hold the population’s known collection of sequences, while the paths through the nodes concisely describe each individual’s DNA sequence.
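A minimal sketch of this idea, with hypothetical node and path names rather than the consortium's actual data structures: nodes hold sequence segments shared across the population, and each individual haplotype is a path through the nodes, so both alleles at a variant site coexist in one structure.

```python
# Minimal sequence-graph sketch: nodes hold sequence segments, and each
# individual's sequence is reconstructed by following a path through them.

nodes = {
    1: "ACGT",   # shared prefix
    2: "A",      # reference allele at a variant site
    3: "G",      # alternate allele at the same site
    4: "TTAC",   # shared suffix
}

paths = {
    "sample_ref": [1, 2, 4],  # haplotype carrying the reference allele
    "sample_alt": [1, 3, 4],  # haplotype carrying the alternate allele
}

def sequence(path_name: str) -> str:
    """Reconstruct an individual's sequence by concatenating its path."""
    return "".join(nodes[n] for n in paths[path_name])

print(sequence("sample_ref"))  # ACGTATTAC
print(sequence("sample_alt"))  # ACGTGTTAC
```

Note how the shared segments (nodes 1 and 4) are stored once, which is what makes the representation concise even across many genomes.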

Limitations and Emerging Sequencing Technologies to Overcome Them

Graphs introduce a wide variety of complications: they require precise reference sequences and new techniques that can exploit their data structure. However, exciting progress has been made by applying modern sequencing technologies, including consensus sequencing and phased assembly approaches.

  • Larger pieces of the genome (10,000 to millions of DNA characters long) can be more easily stitched into assembled genomes, making long-read sequencing technology crucial for generating high-quality reference sequences.
  • High-throughput sequencing methods developed in the 2000s are based on short reads, which cover only 100 to 300 DNA characters at a time. Despite the advantages of long reads for building a reference genome, many informatics methods developed for short reads lacked counterparts for long-read technology.
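The stitching mentioned in the first bullet can be sketched with a toy overlap merge (not a real assembler, and the reads are invented): two reads are joined when a suffix of one exactly matches a prefix of the other. Longer reads produce longer, more unique overlaps, which is one intuition for why long-read data simplifies assembly.

```python
# Toy illustration of assembly: stitch two reads into a longer contig by
# merging on an exact suffix/prefix overlap. Real assemblers handle errors,
# repeats, and millions of reads; this only shows the basic operation.

def merge(a: str, b: str, min_overlap: int = 3):
    """Merge b onto a if a suffix of a exactly matches a prefix of b;
    return None if no overlap of at least min_overlap exists."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

contig = merge("ACGTTAGC", "TAGCCGAT")
print(contig)  # ACGTTAGCCGAT
```

With 100-character reads, a short exact overlap can occur by chance at many genomic repeats; with reads tens of thousands of characters long, the overlap is far more likely to be unique.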

Using Transformers to Enhance Pangenome Sequences

Just as advances in sequencing technology paved the way for novel pangenome methodologies, recent advances in informatics have enabled improved sequencing techniques. To create DeepConsensus, Google applied transformer architectures, originally developed to analyze human language, to the study of DNA sequences. This gave the accuracy needed to keep up with the terabytes of sequencer output without requiring a decoder. Differentiable loss functions that can account for the insertions and deletions seen in sequencing data paved the way for this.

DeepConsensus improves both the yield and the accuracy of instrument reads. Because the primary sequence data were generated with PacBio sequencing, researchers could use DeepConsensus to enhance all 47 genome assemblies. Using DeepConsensus, consortium members produced genome assemblies with base-level accuracy of 99.9997%.
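As a greatly simplified stand-in for what consensus polishing does (DeepConsensus itself uses a transformer over aligned subreads, not a vote), the sketch below takes a per-column majority vote across equal-length, pre-aligned subreads of the same molecule. The subread strings are invented; the point is only that repeated noisy observations can be combined into a more accurate read.

```python
# Simplified consensus sketch: combine noisy subreads of the same molecule
# by per-column majority vote. Assumes subreads are already aligned and of
# equal length, which real data (with insertions/deletions) is not.

from collections import Counter

def majority_consensus(subreads):
    assert len({len(s) for s in subreads}) == 1, "sketch needs equal lengths"
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*subreads)
    )

subreads = [
    "ACGTACGT",
    "ACGAACGT",   # one substitution error
    "ACGTACCT",   # a different error elsewhere
]
print(majority_consensus(subreads))  # ACGTACGT
```

The equal-length assumption is exactly what breaks on real long-read data, where errors are often insertions and deletions; handling those is why DeepConsensus needed alignment-aware, differentiable loss functions rather than a simple vote.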

According to the study’s authors, the value will come from the project’s potential to spread scientific knowledge to new demographics and researchers’ commitment to hearing all perspectives as they work toward the project’s lofty goal of creating a unified global reference database. Researchers are developing approaches that should be useful for studying other species. Indeed, several organizations are breaking ground in this area. In tandem with efforts to amass a larger set of diverse and accurate human reference genomes, scientists expect the pangenome reference to undergo further optimization and rapid improvement, opening up many new possibilities for research and clinical practice.


Check out the Paper and Blog.
