Machine learning is being incorporated into almost every industry, and there have been calls for standard machine learning benchmarks similar to the SPEC benchmarks created primarily for CPUs. Such benchmarks would prove pivotal in comparing the many machine learning solutions available in the marketplace.
MLCommons is an open engineering consortium that has been working to create machine learning benchmarks for training and inference through its MLPerf benchmark suite. MLCommons is best described as an industry-academic partnership that aims to advance the development of, and access to, the latest AI and machine learning datasets and benchmarks. The benchmarks it creates have been discussed and disclosed repeatedly to keep people aware of the refinements made along the way. Recently, the consortium unveiled MLPerf Inference v1.0 and released around 2,000 results into its database. It also disclosed a new power measurement technique that provides additional metadata on these results.
Why is a Standard Benchmark Important?
Creating this benchmark could prove to be a turning point in the application of artificial intelligence. When running inference at the edge, often less than 10% of a chip's peak TOPS (trillions of operations per second) is actually used. This means a lot of static power is wasted, which drags down the chip's effective efficiency. Better utilization, on the other hand, would promote efficiency, since power would not be wasted on idle resources. At the same time, tuning for efficiency could lead to slower results, which would not be acceptable for some vendors. A back-of-the-envelope sketch of this trade-off follows.
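To make the intuition concrete, here is a minimal Python sketch of why low utilization wastes energy. All figures (peak TOPS, static and dynamic power, utilization) are hypothetical illustration values, not taken from any MLPerf result:

```python
# Hypothetical chip: dynamic power scales roughly with utilization,
# but static power is drawn whether or not the compute units are busy.
PEAK_TOPS = 10.0          # trillions of operations per second at full load
UTILIZATION = 0.10        # fraction of peak actually used during edge inference
STATIC_POWER_W = 3.0      # watts drawn even when compute units are idle
DYNAMIC_POWER_W = 7.0     # watts attributable to computation at full load

effective_tops = PEAK_TOPS * UTILIZATION
total_power = STATIC_POWER_W + DYNAMIC_POWER_W * UTILIZATION
tops_per_watt = effective_tops / total_power

print(f"Effective throughput: {effective_tops:.1f} TOPS")
print(f"Total power draw:     {total_power:.1f} W")
print(f"Efficiency:           {tops_per_watt:.2f} TOPS/W")
# At 100% utilization this chip would deliver 10 TOPS at 10 W (1.0 TOPS/W);
# at 10% it manages only ~0.27 TOPS/W, mostly static power burned while idle.
```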
The Test
The results that the organization has released center on one key activity: inference, the ability of a trained network to process unseen incoming data. The test provides a standard way to measure how much electricity, in watts (power) or joules (energy), is drawn for a given task assigned to a machine learning model. The test has also been exercised across a wide range of machine learning areas and models, so that it covers a broad slice of the machine learning market.
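As a rough illustration of the watts-versus-joules distinction, the sketch below integrates sampled power readings over a run to obtain total energy and energy per inference. The power samples and inference count are made-up values, not actual MLPerf measurements:

```python
# Power (watts) sampled once per second over a 10-second benchmark run.
power_samples_w = [42.1, 43.0, 44.2, 45.1, 44.8, 44.5, 43.9, 44.0, 43.7, 43.4]
sample_interval_s = 1.0
inferences_completed = 5_000

# Energy (joules) is power integrated over time: J = W x s.
energy_j = sum(p * sample_interval_s for p in power_samples_w)
run_time_s = len(power_samples_w) * sample_interval_s
avg_power_w = energy_j / run_time_s
energy_per_inference_j = energy_j / inferences_completed

print(f"Average power:        {avg_power_w:.1f} W")
print(f"Total energy:         {energy_j:.1f} J")
print(f"Energy per inference: {energy_per_inference_j * 1000:.2f} mJ")
```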

The Tasks that the Platform Measures
The platform measures the amount of energy used for an array of machine learning tasks, including:
- Image classification
- Object detection
- Medical image segmentation
- Language processing
- Speech-to-text
- Recommendation engines
Collection of Results
Prominent market players like Alibaba, Gigabyte, HPE, Inspur, Intel, Lenovo, NVIDIA, Qualcomm, Supermicro, and Xilinx have provided results for the platform, lending it authenticity and credibility. The startup Krai has also submitted results that proved significant for the platform's development: Krai created a benchmark suite and ran it on many low-cost edge devices, both with and without GPU acceleration, contributing a large chunk (around 50%) of the total data.
Submission of Results
Results can be submitted in several divisions: Datacenter, Edge, Mobile, or Tiny. For both Datacenter and Edge, there are two further categories:
- Closed category: best for apples-to-apples comparisons, since all submissions use the same reference model and framework
- Open category: allows any model and optimization, showcasing peak performance
The presented metrics depend on the scenario: single-stream, multi-stream, server, or offline. The benchmark can run on various platforms, including CPUs, GPUs, FPGAs, or dedicated AI silicon. There is, however, no combined benchmark score, for the simple reason that not every system is subjected to every test. The sketch below summarizes how each scenario maps to its headline metric.
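The mapping below reflects the four scenarios named above; the dictionary and helper function are purely illustrative, not part of any MLPerf tooling:

```python
# Each MLPerf Inference scenario reports a different headline metric,
# which is why raw numbers across scenarios are not directly comparable.
SCENARIO_METRICS = {
    "SingleStream": "90th-percentile latency (one query at a time)",
    "MultiStream":  "number of streams sustained within a latency bound",
    "Server":       "queries per second under a latency constraint",
    "Offline":      "raw throughput (samples per second), no latency bound",
}

def describe(scenario: str) -> str:
    """Return the headline metric for a given scenario name."""
    return SCENARIO_METRICS.get(scenario, "unknown scenario")

for name in SCENARIO_METRICS:
    print(f"{name:>12}: {describe(name)}")
```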
Instead, a wide range of results has been submitted, and all the data points have been released by MLCommons, which holds the MLPerf trademark. A little arithmetic is required to make sense of these data points.
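A minimal sketch of that arithmetic: dividing a submission's throughput by its average power normalizes the released data points into performance per watt. The two systems and their numbers below are hypothetical, not actual submissions:

```python
# queries/second divided by joules/second leaves queries (inferences) per joule.
systems = {
    "system_a": {"throughput_qps": 12_000.0, "avg_power_w": 350.0},
    "system_b": {"throughput_qps":  4_500.0, "avg_power_w": 110.0},
}

for name, s in systems.items():
    queries_per_joule = s["throughput_qps"] / s["avg_power_w"]
    print(f"{name}: {queries_per_joule:.1f} inferences per joule")
# The bigger system is faster in absolute terms, but the smaller one wins on
# efficiency -- which is exactly why no single combined score is published.
```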
Metrics System and Measurement
The metrics used for Inference v1.0 are system level, in contrast to simple chip-level figures: they account for the host's control logic, storage, memory, power delivery, and other components beyond the accelerator itself. Providers can also submit specific additional details, such as processor-only power, which are not necessarily included in the headline figure. The result is a holistic measurement of the whole system running the model.
Source: https://mlcommons.org/en/news/mlperf-inference-v10/
Results: https://mlcommons.org/en/inference-datacenter-10/
Paper: https://arxiv.org/abs/1911.02549
Amreen Bawa is a consulting intern at MarktechPost. Along with pursuing BA Hons in Social Sciences from Panjab University, Chandigarh, she is also a keen learner and writer, having special interest in the application and scope of artificial intelligence in various facets of life.