RISELab Team at UC Berkeley Open-Sources SkyPilot: A Novel Framework That Targets Cloud Cost Optimization for Machine Learning and Data Science

Two of the biggest problems for enterprises large and small are data analysis and data storage. The rate at which Big Data is produced has increased dramatically, and one of a company’s key responsibilities is to store that data safely and cost-effectively, which is where the Cloud comes in.

Using the Cloud for machine learning and data science is challenging in itself, and adding cost-reduction measures makes it significantly harder.

Researchers at UC Berkeley’s RISELab have launched SkyPilot, an open-source framework for running machine learning workloads across several cloud providers through a single interface. The project’s primary goal is cost minimization, so it uses an algorithm to determine the most cost-effective zone, region, and cloud provider for the requested resources.

More than a dozen companies currently use it for a wide variety of purposes, such as model training on GPUs and TPUs (3x cost reduction), distributed hyperparameter tuning, and bioinformatics batch jobs on hundreds of CPU spot instances (6.5x cost savings on a recurring basis).

Based on a job’s resource requirements (CPU, GPU, or TPU), SkyPilot determines which zones, regions, or clouds have the compute to run it and then sends the job to the cheapest one.
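The selection step described above can be illustrated with a minimal sketch. This is not SkyPilot’s actual optimizer or API: the `Offer` class, the `cheapest_offer` function, and every price and zone entry in `CATALOG` are hypothetical and exist only to show the idea of filtering for feasible placements and then picking the cheapest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One place a job could run (hypothetical catalog entry)."""
    cloud: str
    region: str
    zone: str
    accelerator: str
    count: int
    hourly_price: float   # USD per hour, illustrative values only
    available: bool = True


def cheapest_offer(offers, accelerator, count):
    """Return the cheapest available offer that satisfies the request.

    Returns None when no zone, region, or cloud can run the job.
    """
    feasible = [
        o for o in offers
        if o.available and o.accelerator == accelerator and o.count >= count
    ]
    return min(feasible, key=lambda o: o.hourly_price, default=None)


# Hypothetical catalog: the prices are made up for the example.
CATALOG = [
    Offer("aws", "us-east-1", "us-east-1a", "V100", 1, 3.06),
    Offer("gcp", "us-central1", "us-central1-a", "V100", 1, 2.48),
    Offer("gcp", "us-west1", "us-west1-b", "V100", 1, 1.90, available=False),
    Offer("azure", "eastus", "eastus-1", "V100", 1, 3.06),
    Offer("aws", "us-west-2", "us-west-2a", "A100", 8, 32.77),
]

if __name__ == "__main__":
    best = cheapest_offer(CATALOG, accelerator="V100", count=1)
    print(best)
```

In this toy catalog the cheapest V100 entry is skipped because it is marked unavailable, so the job goes to the next-cheapest zone. A real system would also have to account for factors such as quota, data transfer costs, and changing spot prices.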

SkyPilot is also being used to train massive models on Google’s TPUs. Through the TPU Research Cloud (TRC) program, researchers can apply for free access to TPUs and, once approved, use SkyPilot to get started quickly (both TPU devices and TPU pods are supported).

SkyPilot isn’t the first open-source project from RISELab aimed at reducing cloud expenses. The research center previously released Skyplane, which optimizes the transfer of massive datasets across cloud providers to reduce transfer times and costs, as previously reported on InfoQ.

SkyPilot’s designers recommend it for building multi-cloud applications that take advantage of the best hardware each cloud offers and widen the pool of available resources, such as powerful NVIDIA V100 and A100 GPUs. SkyPilot provides a cloud-agnostic interface that lets these applications run on several clouds from day one. This is in contrast to tools like Terraform, which, while powerful, focus on lower-level infrastructure rather than jobs and require cloud-specific templates. Because jobs can be provisioned and run consistently on several clouds out of the box, developers can concentrate on application-specific logic rather than cloud operations.

The framework’s Managed Spot feature makes it possible to use cheaper spot instances by automatically recovering jobs from preemptions. A separate feature, Autostop, automatically cleans up idle clusters. To help developers understand how the project works, the team published a set of Jupyter notebooks.
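To make the idea of automatic recovery from preemptions concrete, here is a small, self-contained simulation. It does not use SkyPilot and does not reflect how Managed Spot is implemented. The names `Preempted`, `run_on_spot`, and `run_with_recovery` are hypothetical, and the sketch assumes the job checkpoints its progress so that a relaunched instance can resume rather than restart.

```python
class Preempted(Exception):
    """Raised when the simulated spot instance is reclaimed."""


def run_on_spot(total_steps, checkpoint, preempt_at):
    """Run the job from the checkpointed step on a simulated spot instance.

    `preempt_at` is a set of step numbers at which the instance is
    reclaimed; each scheduled preemption fires only once.
    """
    step = checkpoint["step"]
    while step < total_steps:
        if step in preempt_at:
            preempt_at.discard(step)
            raise Preempted(step)
        step += 1
        checkpoint["step"] = step  # persist progress after every step


def run_with_recovery(total_steps, preempt_at, max_relaunches=10):
    """Relaunch the job after each preemption until it finishes.

    Returns (steps_completed, number_of_relaunches).
    """
    checkpoint = {"step": 0}
    pending = set(preempt_at)  # copy so the caller's set is untouched
    relaunches = 0
    while True:
        try:
            run_on_spot(total_steps, checkpoint, pending)
            return checkpoint["step"], relaunches
        except Preempted:
            if relaunches >= max_relaunches:
                raise
            relaunches += 1  # "provision" a new instance and resume


if __name__ == "__main__":
    print(run_with_recovery(total_steps=10, preempt_at={3, 7}))
```

Because progress is saved after every step, each relaunch picks up where the previous instance stopped, so the two simulated preemptions cost two relaunches but no repeated work.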

SkyPilot currently works with Amazon Web Services, Google Cloud Platform, and Microsoft Azure, and it offers both a command line interface (CLI) and a Python API. The team plans to add support for smaller cloud providers.
