Recent technological developments have made machine learning a crucial workload in various settings. For instance, it is a part of intricate and complex optimization loops in data centers which serve as a constraint when it comes to achieving quality compilation results. On the other hand, when it comes to interactive devices like mobile phones, it is close to being ready to roll out ML-enabled apps. Since ML workloads frequently have a different structure than traditional applications, many existing systems must be re-evaluated. Allocating memory buffers for on-chip memories in systems aimed toward ML accelerators is one such context that requires re-evaluation.
When it comes to languages like C++, traditional memory allocation is done dynamically, and requests are made while an application runs. Contrary to this approach, ML systems typically operate according to a static control flow graph where logical allocation and deallocation times, buffer size, and other parameters are often known previously. The allocator’s job is to allocate memory such that the total amount of used memory never exceeds the total amount of memory physically available on the device. Finding a solution to this NP-hard optimization problem is exceedingly difficult. However, finding an answer to this allocation problem is necessary in order to achieve full performance.
Existing ML compiler frameworks manage these on-chip memory allocation issues in one of two ways: either by employing solver-based approaches or ad hoc heuristics, each having its pros and cons. While solver-based solutions effectively handle more complicated cases, they are expensive and unsuitable in situations where memory allocation is critical. Heuristic methods work for simple cases but fall short for more complex ones.
When a team from Google Research noticed that several of their crucial models took too long to build while developing the Pixel 6 smartphone, they decided to analyze this problem statement further. To bridge the gap between heuristics and solver-based techniques, the team developed a new methodology and allocator known as TelaMalloc. A heuristic is chosen from a search space whenever a buffer is allocated. The choice made by the selected heuristic is then updated in the solver, and its output is then utilized to direct subsequent decisions. On real ML workloads, Google’s unique method speeds up allocation time by up to two orders of magnitude, supporting certain fundamental models that would otherwise go unsupported. TelaMalloc is currently in production and will be shipped with Google’s Pixel 6 and TPUv4.
The allocation problem is represented as a search space, where each step ensures that the decision does not make the problem unsolvable. Several heuristics are taken into account when choosing this step. To accomplish this, the researchers chose a constraint programming (CP) solver to direct backtracking in the search space. This made it quite straightforward to spot early on when a choice made the problem unsolvable. An underlying framework known as Telamon served as the foundation for all interactions with the CP solver. This framework offers a callback into a search heuristic that can make one variable assignment decision at a time instead of asking the solver to devise a solution to the complete constraint problem. The CP solver’s state is updated once the heuristic has made a decision. If the problem is rendered intractable at any point, Telamon backtracks to investigate an alternative path through the search space.
The researchers benchmarked TelaMalloc against a state-of-the-art ILP-based solution. Since TelaMalloc operates on server CPUs for XLA and on-device for Pixel 6, it was tested for both use cases. The team combined real-world models with microbenchmarks to try for a variety of workloads. A collection of on-device allocator inputs that could be used on standard servers and desktops was also developed for thorough testing. The traces were chosen to make them identical to those used by the real-world Pixel 6’s hardware. The researchers further verified that allocation speedups seen on a desktop CPU translate to speedups on the actual device.
In conclusion, Google’s unique allocator, TelaMalloc, uses a heuristics and a solver-based strategy to address the memory allocation issue in machine learning accelerators. Their method has resulted in a huge improvement, up to two orders of magnitude speedup for certain key models. One key achievement of the allocator is its ability to compile models on-the-fly, supporting even those models that otherwise would have been impossible.
Check out the Paper. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing and Web Development. She enjoys learning more about the technical field by participating in several challenges.