Google AI Introduces ‘TensorStore,’ An Open-Source C++ And Python Library Designed For Reading And Writing Large Multi-Dimensional Arrays

Many modern applications of computer science and machine learning use multidimensional datasets that span a single expansive coordinate system. Two examples are forecasting the weather from air measurements over a geographical grid and making medical imaging predictions from multi-channel image intensity values in a 2D or 3D scan. Such datasets can be challenging to work with because users may read and write data at unpredictable intervals and at different scales, and they frequently want to run analyses on many machines simultaneously. Under these circumstances, even a single dataset might need petabytes of storage.

Google’s TensorStore has already been used to solve fundamental engineering problems in scientific computing related to managing and processing enormous neuroscience datasets. TensorStore is an open-source C++ and Python software library developed by Google Research to address the problem of storing and manipulating n-dimensional data. The library supports several storage systems, such as Google Cloud Storage and local and network filesystems, and offers a unified API for reading and writing diverse array types. It also provides read/writeback caching and transactions with strong atomicity, consistency, isolation, and durability (ACID) guarantees. Optimistic concurrency ensures safe access from multiple processes and machines.

TensorStore provides a simple Python API for loading and working with massive arrays of data. Arbitrarily large underlying datasets can be loaded and manipulated without holding the entire dataset in memory, because no actual data is read or kept in memory until a specific slice is requested. The indexing and manipulation syntax is largely the same as that used for NumPy operations. TensorStore also supports advanced indexing features, including transforms, alignment, broadcasting, and virtual views (data type conversion, downsampling, and arrays generated lazily on the fly).
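To make the lazy-loading idea concrete, here is a toy sketch in plain Python. It is not TensorStore's API or implementation: `ChunkedStore`, `LazyView`, and `read_element` are hypothetical names invented for this illustration. The view only records which rows and columns were selected, and storage is touched only when `read()` is called.

```python
class ChunkedStore:
    """Hypothetical 2-D backing store that counts how many elements are read."""

    def __init__(self, rows, cols):
        self.shape = (rows, cols)
        self.elements_read = 0

    def read_element(self, r, c):
        # Values are computed on demand, standing in for a slow storage read.
        self.elements_read += 1
        return r * self.shape[1] + c


class LazyView:
    """Records a rectangular selection; touches storage only in read()."""

    def __init__(self, store, row_range=None, col_range=None):
        self.store = store
        self.row_range = range(store.shape[0]) if row_range is None else row_range
        self.col_range = range(store.shape[1]) if col_range is None else col_range

    @property
    def shape(self):
        return (len(self.row_range), len(self.col_range))

    def __getitem__(self, key):
        row_slice, col_slice = key
        # Slicing a range yields another range, so composing views does no I/O.
        return LazyView(
            self.store, self.row_range[row_slice], self.col_range[col_slice]
        )

    def read(self):
        # Only here is the backing store actually accessed.
        return [
            [self.store.read_element(r, c) for c in self.col_range]
            for r in self.row_range
        ]


store = ChunkedStore(1_000_000, 1_000_000)  # 10**12 logical elements, never materialized
view = LazyView(store)[80:82, 99:102]       # selection recorded, nothing read yet
data = view.read()                          # reads exactly 6 elements
```

The slicing syntax deliberately mirrors NumPy's `array[80:82, 99:102]`. The difference is that the expensive work is deferred until the caller explicitly asks for the data.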

Processing and analyzing large numerical datasets demands a lot of computing power. Usually, this is achieved by parallelizing operations across a large number of CPU or accelerator cores distributed over several machines. A core objective of TensorStore has therefore been to enable parallel processing of individual datasets while maintaining both high performance (i.e., reading and writing to TensorStore does not become a bottleneck during computation) and safety (by preventing corruption or inconsistencies resulting from concurrent access patterns). TensorStore also has an asynchronous API that lets a read or write operation continue in the background while a program completes other tasks, as well as customizable in-memory caching (which reduces interactions with slower storage systems for frequently accessed data). Optimistic concurrency keeps parallel operations safe when many machines are accessing the same dataset, and it maintains compatibility with various underlying storage layers without severely affecting performance. TensorStore has also been integrated with parallel computing frameworks like Apache Beam and Dask so that distributed computing with TensorStore is compatible with many existing data processing workflows.
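The article does not describe how TensorStore implements optimistic concurrency, but the general technique can be sketched in a few lines of Python. In this minimal illustration, every name (`VersionedStore`, `write_if_generation_matches`, `increment`, `worker`) is hypothetical and not part of TensorStore. Each writer reads a value together with its generation number, computes an update without holding any lock, and commits only if the generation is unchanged; otherwise it retries.

```python
import threading


class VersionedStore:
    """Hypothetical key-value store where each value carries a generation number."""

    def __init__(self):
        self._data = {}
        # Models the storage system's atomic conditional write; writers never
        # hold this lock while computing their update.
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            return self._data.get(key, (0, 0))  # (value, generation)

    def write_if_generation_matches(self, key, value, expected_generation):
        with self._lock:
            _, current = self._data.get(key, (0, 0))
            if current != expected_generation:
                return False  # another writer committed first
            self._data[key] = (value, current + 1)
            return True


def increment(store, key):
    # Optimistic read-modify-write: retry until the conditional write succeeds.
    while True:
        value, generation = store.read(key)
        if store.write_if_generation_matches(key, value + 1, generation):
            return


def worker(store, key, times):
    for _ in range(times):
        increment(store, key)


store = VersionedStore()
threads = [
    threading.Thread(target=worker, args=(store, "counter", 100)) for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value, final_generation = store.read("counter")  # no increments are lost
```

Because a conflicting write is detected and retried rather than prevented by long-held locks, this style of coordination only requires the storage layer to support a conditional write, which is one reason it can work across different storage backends.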

Exciting TensorStore use cases include PaLM and other sophisticated large language models. With hundreds of billions of parameters, these neural networks test the limits of computational infrastructure while demonstrating unexpected proficiency in generating and processing natural language. Reading and writing the model parameters efficiently is a challenge during training. Although training is spread across numerous machines, parameters must be saved regularly to a single checkpoint on a long-term storage system without slowing down the training process. TensorStore has already been used to address these issues. It has been integrated with frameworks like T5X and Pathways and used to manage checkpoints for large-scale (“multipod”) models trained with JAX.

Brain mapping is another intriguing use case. Synapse-resolution connectomics aims to trace the intricate network of individual synapses in animal and human brains. This calls for petabyte-sized datasets, which are produced by imaging the brain at extremely high resolution over fields of view of millimeters or more. Because they occupy millions of gigabytes within a single coordinate system, current datasets pose significant storage, manipulation, and processing challenges. With Google Cloud Storage serving as the underlying object storage system, TensorStore has been used to address the computational challenges posed by some of the largest and most popular connectomic datasets.

To get started, Google Research has provided the TensorStore package that can be installed using simple commands. They have also released several tutorials and API documentation for further reference.




Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development, and enjoys learning more about the technical field by participating in challenges.