Bonum Commune Communitatis: Standardization of Machine Learning-Based Video Coding Solutions (How Machine Learning Is Used in Video Encoding, Part 6)

At this point in the series, I think we can all agree that machine learning will play a key role in the future of video coding. Whether it replaces a component of a standard codec or is used in an end-to-end manner, machine learning will be there for sure.

Standardization is crucial for video coding, as for any application on the Internet, because the system is used all over the world on hundreds, if not thousands, of different device types. The data must therefore follow a strict, standardized format to overcome this extreme heterogeneity. That is why standardization committees like the Moving Picture Experts Group (MPEG), the Joint Collaborative Team on Video Coding (JCT-VC), the Joint Video Experts Team (JVET), and the Joint Photographic Experts Group (JPEG) exist.

JPEG and MPEG are organized under ISO/IEC JTC 1/SC 29 (Coding of audio, picture, multimedia, and hypermedia information). MPEG focuses on setting standards for multimedia coding, such as video and audio compression, file formats for applications, and transmission. JPEG, on the other hand, focuses on the same aspects for still images. The roles of JCT-VC and JVET are a bit different, as each was formed to design a specific video coding standard: JCT-VC for High Efficiency Video Coding (HEVC) and JVET for Versatile Video Coding (VVC).

Over the last decades, these standardization committees have focused on improving the performance of video coding solutions. Now that machine learning is used more and more commonly in video coding, they have started to form new groups dedicated to these approaches.

JPEG-AI became an official work item in 2021. It aims to provide a learning-based image compression method that delivers better visual quality and significantly higher compression efficiency than existing image coding standards. Image coding for machines is also in scope, targeting applications such as image processing and computer vision tasks.

MPEG has an open group on neural network compression, since the efficient transmission of machine learning models will play a key role in video streaming. It is a relatively new group, motivated by the increasing importance of machine learning-based tools for applications such as video encoding, classification, and descriptor extraction from video content. The first version of the neural network compression standard was already released in 2021, and version 2 is on the way.

MPEG also has an exploration group for video coding for machines. Existing video codecs are designed for human consumption. Today, however, most videos are analyzed by machines, and standard codecs are not well suited to delivering video to machines. The MPEG activity on Video Coding for Machines (VCM) aims to standardize a bitstream format generated by compressing both a video stream and previously extracted features that will be used in machine vision tasks.

There is also the Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) organization, which is independent of MPEG and aims to develop standards for AI-based data coding.

AI-based End-to-End Video Coding (MPAI-EEV) is a sub-group of MPAI that focuses on end-to-end video coding with machine learning. The goal is to reduce the size of video by using ML-based end-to-end data coding technologies, without the constraints of previous video coding standards. Another MPAI project, AI-Enhanced Video Coding (MPAI-EVC), focuses on improving the performance of traditional video codecs by replacing individual components with machine learning-based methods.

ML-based video coding standards and their starting dates

That was all for this blog post series. We started by introducing what video coding is and how video is delivered via HTTP Adaptive Streaming. We then covered how machine learning can be used to improve video codec performance, enhance the visual quality of decoded videos, and provide end-to-end encoding solutions. We concluded the series with this post by introducing ongoing standardization work on ML-based video coding. I hope you enjoyed reading it, and I hope it gave you a glimpse of the wide world of video coding.

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.
