Researchers at Intel Labs Create a New Data Science Pipeline That Accelerates Single-cell RNA-Seq Analysis

This article is a summary written by Marktechpost staff based on the research article 'Intel Labs Accelerates Single-cell RNA-Seq Analysis'. All credit for this research goes to the researchers on this project.

Nearly 40 trillion cells make up the human body. Because of these numbers, cells have traditionally been studied in bulk, with millions of cells analyzed at once. Single-cell analysis instead examines what makes each cell unique. By finding new cell types, revealing the mechanisms that distinguish them, and showing how cells respond to particular diseases or treatments, the field is beginning to solve the puzzle of cell differentiation. It is still relatively new, and its possible uses range widely, from cancer research to Covid-19-related research.

As data measurement methods progress, the volume of single-cell data is growing rapidly, and the number of individual datasets is growing at a comparable pace. The most common way to analyze this data is to run a data science pipeline. Because many parameters may need to be adjusted along the way, an interactive pipeline that runs in near real time is helpful.

Many kinds of single-cell studies exist to help understand how cells differentiate. One of them, single-cell RNA sequencing (scRNA-seq), is an advanced technology that measures gene expression in individual cells, so that differences in expression between cells can be analyzed.

scRNA-seq analysis typically starts from a matrix holding the expression level of each gene in each cell. Data preprocessing filters and normalizes this matrix to obtain the expression of every gene in each cell of the dataset. Machine learning is frequently used at this stage to correct artifacts introduced during data collection. After dimensionality reduction, clustering groups cells with similar gene activity, and the clusters are then visualized. Scanpy is a popular tool for this type of analysis, having been downloaded more than 800,000 times.
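
The preprocessing stage described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not Scanpy's implementation: the function name `preprocess`, the `min_genes` threshold, and the `target_sum` value are hypothetical choices made for the example.

```python
import numpy as np


def preprocess(counts, min_genes=1, target_sum=1e4):
    """Filter cells, normalize per-cell totals, then log-transform.

    `counts` is a cells x genes matrix of non-negative raw counts.
    Assumes min_genes >= 1 so that every kept cell has a non-zero total.
    """
    counts = np.asarray(counts, dtype=np.float64)

    # Keep only cells (rows) in which at least `min_genes` genes are detected.
    keep = (counts > 0).sum(axis=1) >= min_genes
    kept = counts[keep]

    # Scale each remaining cell so that its counts sum to `target_sum`.
    totals = kept.sum(axis=1, keepdims=True)
    normalized = kept / totals * target_sum

    # log(1 + x) compresses the wide dynamic range of expression values.
    return np.log1p(normalized), keep
```

Calling `preprocess(counts)` returns the transformed matrix together with a boolean mask of the cells that survived filtering, so later steps can map results back to the original cells.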

Using the off-the-shelf (baseline) Scanpy implementation, the typical pipeline takes roughly 5 hours on a single CPU instance (n1-highmem-64) on GCP for a dataset containing 1.3 million mouse brain cells. NVIDIA has reported an end-to-end runtime of 686 seconds on a single A100 GPU using NVIDIA RAPIDS.

Source: https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-Labs-Accelerates-Single-cell-RNA-Seq-Analysis/post/1390715

A new study from Intel Labs, carried out in collaboration with the Intel oneDAL team and Katana Graph, improves the pipeline’s performance by applying more parallel algorithms and tuning them to the architecture. The entire pipeline can now be completed in just 626 seconds on a single CPU instance (n1-highmem-64) on GCP. Katana Graph contributed efficient implementations of the Louvain and Leiden clustering algorithms for this work.

The researchers sped up data preprocessing by using Numba, a just-in-time (JIT) compiler, together with a warm file cache and multi-threading. As a result, preprocessing ran more than 70 times faster than the baseline.
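
To illustrate how Numba and multi-threading can combine in a preprocessing step, here is a minimal sketch, not the researchers' code. It computes per-cell statistics over a sparse count matrix stored in CSR form (a `data` array plus an `indptr` array of row boundaries). The function name `per_cell_stats` and the choice of statistics are assumptions made for the example, and the `try`/`except` fallback exists only so the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback: plain Python loops, so the sketch runs without Numba.
    prange = range

    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@njit(parallel=True)
def per_cell_stats(data, indptr):
    """Return (total counts, number of detected genes) for each cell.

    `data` holds the stored values of a CSR matrix (cells x genes) and
    `indptr[i]:indptr[i + 1]` delimits the entries belonging to cell i.
    """
    n_cells = indptr.shape[0] - 1
    totals = np.zeros(n_cells, dtype=np.float64)
    n_genes = np.zeros(n_cells, dtype=np.int64)

    # prange lets Numba spread the independent per-cell work across threads.
    for i in prange(n_cells):
        total = 0.0
        detected = 0
        for k in range(indptr[i], indptr[i + 1]):
            total += data[k]
            if data[k] > 0.0:
                detected += 1
        totals[i] = total
        n_genes[i] = detected
    return totals, n_genes
```

Each cell is processed independently, which is what makes the outer loop safe to parallelize. The first call includes JIT compilation time, so any timing comparison should be made on a second call.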

To speed up K-means clustering, KNN (K Nearest Neighbors), and PCA (Principal Component Analysis), they used the Intel scikit-learn plugin.
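
The article names only "the Intel scikit-learn plugin". Assuming this refers to the Intel Extension for Scikit-learn (the `sklearnex` package), the sketch below shows how such a plugin is typically enabled before running PCA, KNN, and K-means. The helper `reduce_and_cluster` and all of its parameter values are hypothetical, and the code falls back to stock scikit-learn if the extension is not installed.

```python
import numpy as np

try:
    # Assumption: the "plugin" is Intel Extension for Scikit-learn.
    # Patching must happen before the estimators are imported.
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass  # stock scikit-learn is used instead

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors


def reduce_and_cluster(X, n_components=2, n_neighbors=3, n_clusters=2, seed=0):
    """Run PCA, find nearest neighbors, and cluster cells with K-means."""
    # Dimensionality reduction: project cells onto the top principal components.
    embedding = PCA(n_components=n_components, random_state=seed).fit_transform(X)

    # KNN: indices of the nearest cells in the reduced space (self included).
    knn = NearestNeighbors(n_neighbors=n_neighbors).fit(embedding)
    _, neighbor_idx = knn.kneighbors(embedding)

    # K-means: assign each cell to one of `n_clusters` groups.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(embedding)
    return embedding, neighbor_idx, labels
```

The appeal of this approach is that the calling code stays ordinary scikit-learn; only the two patching lines change.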

Scanpy had long relied on an inefficient tSNE (t-distributed Stochastic Neighbor Embedding) implementation from scikit-learn. Building an efficient tSNE implementation resulted in a nearly 40-fold speedup.

The team also lowered the pipeline’s memory requirements, which allowed them to use low-memory n2-highcpu-64 instances rather than high-memory n2-highmem-64 instances. The entire pipeline now completes in 459 seconds (7.65 minutes) on a single n2-highcpu-64 instance on GCP. That is about 40 times faster than the 5-hour CPU baseline the team started from, and nearly 1.5 times faster than the Nvidia A100 result.
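
The runtime figures quoted in this article can be checked with a few lines of arithmetic. The variable names are illustrative, and the baseline is taken as exactly 5 hours even though the article says "roughly".

```python
baseline_s = 5 * 3600   # Scanpy baseline on n1-highmem-64, taken as exactly 5 hours
a100_s = 686            # reported Nvidia A100 runtime in seconds
optimized_s = 459       # optimized pipeline on n2-highcpu-64 in seconds

speedup_vs_baseline = baseline_s / optimized_s   # about 39.2, i.e. "about 40 times"
speedup_vs_a100 = a100_s / optimized_s           # about 1.49, i.e. "nearly 1.5 times"
optimized_minutes = optimized_s / 60             # 7.65 minutes

print(f"{speedup_vs_baseline:.1f}x vs baseline, "
      f"{speedup_vs_a100:.2f}x vs A100, "
      f"{optimized_minutes:.2f} min")
```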

As the team explains, the speedup and the reduced memory requirements have cut cloud expenses significantly. Running the pipeline on an n2-highcpu-64 instance on GCP costs only $0.29. Baseline Scanpy on n1-highmem-64 is 66 times more expensive, and the Nvidia A100 GPU is 2.4 times more expensive than this option.
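
Taking the $0.29 figure as the cost of one pipeline run (an interpretation, since the article gives no per-hour prices), the two ratios imply the following approximate costs. These are derived values, not numbers reported by the researchers.

```python
optimized_cost = 0.29                  # USD per run on n2-highcpu-64 (as interpreted above)

baseline_cost = 66 * optimized_cost    # about 19.14 USD for baseline Scanpy on n1-highmem-64
a100_cost = 2.4 * optimized_cost       # about 0.70 USD for the Nvidia A100 run

print(f"baseline: ${baseline_cost:.2f}, A100: ${a100_cost:.2f}")
```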

The researchers hope that the shorter runtimes will enable a much better understanding of different cells, opening the door to medical advances with broad benefits.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advances in technology and their real-life applications.