Researchers from NVIDIA and Harvard University have introduced a machine learning-driven toolkit called AtacWorks that has the potential to bring about remarkable advancements in genome sequencing.
What is genome sequencing?
Genome sequencing was introduced by British biochemist Frederick Sanger and his team in 1977. The world was fascinated by how this new technology could uncover human similarity and genetic diversity in new ways.
A genome is a map of all the nucleotides in our body, and genome sequencing is a technique used to generate this nucleotide map to decode our DNA. The human genome consists of over 3 billion nucleotides.
Genome sequencing has helped scientists figure out the location of various genes and how they work together to ensure the growth and maintenance of organisms. It has served as an essential tool in the study of hereditary diseases and genetic abnormalities.
Current challenges and limitations of genome sequencing techniques
The traditional technique, ATAC-seq, measures the intensity of signals across the genome and plots the data in a graph. However, ATAC-seq can perform DNA sequencing efficiently only if it has access to many cells. We need a large number of cells (in the order of thousands) to carry out reasonably efficient genome sequencing. The fewer cells available, the noisier the data, and the more challenging it is to analyze rare cell types.
In addition, the traditional process is time-consuming. This poses a significant challenge to studying genetic mutations in organisms, like viruses, that rapidly mutate.
Introducing AtacWorks: the latest game-changer in genome sequencing
A machine learning driven toolkit called, AtacWorks, was created by researchers from NVIDIA and Harvard University to help address some of the challenges we face in genome sequencing.
AtacWorks is a Pytorch based Convolutional Neural Network (CNN) trained to differentiate between data and noise and pick out peaks in a noisy data set. AtacWorks can be combined with ATAC-seq data to obtain the same quality data using lesser data points (lesser number of cells in this case). Researchers have found that AtacWorks can produce the same quality data from 1 million data points as was earlier done using 50 million data points.
In addition, AtacWorks helps speed up analysis by using tensor core GPUs. This makes it possible to complete the full genome analysis in just 30 minutes – a radical difference compared to the traditional 15-hour time frame.
In the research paper published in Nature Communications, Harvard researchers applied AtacWorks to a dataset of stem cells that produce red and white blood cells. Stem cells are rare cell types and are often found in very small numbers in the human body at a time.
Using a sample of just 50 stem cells, the researchers were able to identify distinct regions of the DNA of a stem cell that causes it to evolve into a red blood cell or a white blood cell. They were also able to isolate DNA sequences that correspond to red blood cells.
This remarkable breakthrough made by machine learning in genome sequencing has the potential to lead to the discovery of new drugs and explore evolution through the study of new mutations.