Researchers at Facebook and Google introduce a new technique called ‘LazyTensor’ that combines eager execution and domain-specific compilers (DSCs) to gain the advantages of both. The method allows full use of the host programming language’s features throughout the Tensor portion of users’ programs.
Domain-specific optimizing compilers have shown notable performance and portability benefits in the past few years. However, they require programs to be represented in their specialized IRs.
Imperative, sequential program execution, called ‘eager execution’, is a define-by-run interface that is expressive and easy to debug. It also forms the basis of the most widely adopted programming languages. Optimizing DSCs are a proven way to improve the performance of machine learning (ML) models. However, they suffer from a “language subset problem” that limits expressivity: some host-language features are unsupported in the subset of the user’s program that interacts with the domain-specific compiler.
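The language subset problem can be illustrated with a toy trace-based compiler in plain Python. Everything here (the `Sym` class, the `trace` helper) is invented for illustration and is not part of any real framework: the point is that host-language control flow over symbolic values falls outside the traceable subset.

```python
# Hypothetical sketch of the "language subset problem": a toy tracer that
# records tensor ops symbolically, so data-dependent Python control flow
# (which needs a concrete value) cannot be captured.

class Sym:
    """A symbolic placeholder recorded by the tracer; it holds no concrete value."""
    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def __add__(self, other):
        out = Sym(f"t{len(self.graph)}", self.graph)
        self.graph.append(("add", self.name, getattr(other, "name", other), out.name))
        return out

    def __bool__(self):
        # A host-language `if` on a symbolic value has nothing to branch on,
        # so this feature is unsupported inside the traced subset.
        raise TypeError("control flow on symbolic values is outside the traceable subset")

def trace(fn):
    graph = []
    fn(Sym("x", graph))
    return graph

# A function using only traceable tensor ops works fine:
print(trace(lambda x: x + 1))   # records a single "add" node

# A function branching on a tensor value does not:
def branchy(x):
    if x + 1:           # Python truthiness requires a concrete value
        return x + 2
    return x + 3

try:
    trace(branchy)
except TypeError as e:
    print("tracing failed:", e)
```

An eager runtime has no such restriction, because every value is concrete at the moment the `if` executes.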
Therefore, the researchers have worked together and proposed a novel technique called LazyTensor. The method combines an eager programming model of Tensor programs with domain-specific compilers without restricting the expressivity of the user’s programming language. It is a general approach that can be applied to any define-by-run machine learning framework.
They have successfully implemented this technique in two programming languages for two ML frameworks: PyTorch and Swift for TensorFlow. They also managed to reuse the majority of the implementation while targeting completely different languages and Tensor APIs.
A Tensor is a generalization of vectors and matrices that can be understood as a multidimensional array; it is the core abstraction used in training ML models. DSCs operate on this Tensor abstraction to target domain-specific hardware (for example, TPUs) and extract more performance from a given hardware footprint. DSCs take source programs as input in a compiler-specific intermediate representation (IR), such as XLA HLO IR, whose syntax is highly verbose and whose memory allocation is inflexible.
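The “multidimensional array” view of a Tensor can be made concrete with plain Python nested lists; the `shape` helper below is purely illustrative and assumes well-formed (rectangular) nesting.

```python
# Illustrative only: a tensor as a multidimensional array, where the
# rank is the depth of nesting and the shape lists each dimension's size.

vector = [1.0, 2.0, 3.0]                    # rank 1, shape (3,)
matrix = [[1.0, 2.0], [3.0, 4.0]]           # rank 2, shape (2, 2)
tensor3 = [[[1.0], [2.0]], [[3.0], [4.0]]]  # rank 3, shape (2, 2, 1)

def shape(t):
    """Return the shape of a well-formed nested-list tensor."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

print(shape(vector))   # (3,)
print(shape(matrix))   # (2, 2)
print(shape(tensor3))  # (2, 2, 1)
```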
Contrastingly, eager execution or “define-by-run” libraries provide users with the full power and expressivity of a general-purpose programming language. Additionally, they are easier to debug and more flexible.
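What “define-by-run” means in practice can be sketched in plain Python; the vector helpers below are illustrative stand-ins, not a real framework API. Each operation executes immediately, so results can be inspected and branched on with ordinary host-language tools.

```python
# A minimal sketch of eager ("define-by-run") execution: every op runs
# as soon as it is called, so intermediate values are always concrete.

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def vec_scale(a, s):
    return [x * s for x in a]

h = vec_add([1.0, 2.0], [3.0, 4.0])
print(h)                        # [4.0, 6.0] — available at once, no compile step
assert all(x > 0 for x in h)    # ordinary Python control flow over results
out = vec_scale(h, 0.5)
print(out)                      # [2.0, 3.0]
```

Because nothing is deferred, a debugger or `print` can stop at any line, which is exactly the flexibility the article attributes to eager libraries.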
A solution that combines both strengths
Firstly, the team built a Tensor API. This provides benefits such as using the full host language for function abstraction, control flow, and data structures. The LazyTensor system then builds on this Tensor API, an underlying eager runtime, and the XLA domain-specific compiler.
There are three essential components of LazyTensor:
- A custom Tensor type with an identical API to an existing Tensor type.
- A mapping from the high-level Tensor operations to XLA HLO sequences implementing the semantics of the requested operation.
- A runtime that lowers sequences of Tensor operations into XLA HLO IR and orchestrates the compilation and execution of the resulting program.
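The three components above can be sketched in a few dozen lines of plain Python. This is a highly simplified, hypothetical illustration: the real system lowers to XLA HLO and invokes a compiler, whereas here the “IR” is just a list of recorded instructions and “compilation and execution” is a direct interpretation of that list.

```python
# Toy sketch of the LazyTensor idea: record ops instead of running them,
# then lower and execute the accumulated trace only when a value is needed.

class LazyTensor:
    """Component 1: a tensor type mirroring an eager API, but deferred."""
    def __init__(self, trace, node):
        self.trace = trace   # shared list of recorded instructions (the "IR")
        self.node = node     # index of this tensor's defining instruction

    @staticmethod
    def constant(value):
        return LazyTensor([("const", value)], 0)

    def _record(self, op, other):
        # Component 2: map each high-level op to an IR node, not an execution.
        self.trace.append((op, self.node, other.node))
        return LazyTensor(self.trace, len(self.trace) - 1)

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)

    def materialize(self):
        # Component 3: "compile and execute" the trace on demand.
        values = []
        for instr in self.trace:
            if instr[0] == "const":
                values.append(instr[1])
            elif instr[0] == "add":
                values.append(values[instr[1]] + values[instr[2]])
            elif instr[0] == "mul":
                values.append(values[instr[1]] * values[instr[2]])
        return values[self.node]

a = LazyTensor.constant(2.0)
b = a + a                  # nothing executes; an "add" node is recorded
c = b * a                  # a "mul" node is recorded
print(c.materialize())     # 8.0 — the trace is lowered and run only here
```

Deferring execution this way is what lets the real system hand whole operation sequences to XLA for optimization, while the user-facing API still looks and behaves like an eager Tensor type.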
The researchers carried out experiments across many dimensions and applications (such as Code Reuse, Training Transformers on Cloud TPUs, and Scaling ResNet-50 on TPUs) to test LazyTensor’s performance.
The evaluation validates the reusability of LazyTensor across several programming languages. PyTorch LazyTensor allows the HuggingFace Transformers library to run on Cloud TPUs using XLA, demonstrating substantial performance improvements on TPUs compared to GPU hardware. Additionally, the researchers have shown that LazyTensor can scale to large TPU supercomputers.