The Sine Qua Non of Video Delivery: Video Encoding (How Machine Learning Is Used in Video Encoding, Part 1)

Video has become an essential part of the Internet, and the numbers support this statement. Video content was responsible for 82% of total Internet traffic in 2022, and the share is expected to rise. Nowadays, it is hard to get through a day without seeing multiple videos in our online feeds, whether a short video on Twitter, a live stream on Instagram, or a daily recap on a news website.

A video is essentially a sequence of pictures, called frames, displayed one after another in rapid succession. If you remember the flipbooks from your childhood, they are the best real-world example of a video. In the digital world, the concept is the same: we capture consecutive pictures using a camera and show them one after the other.

A typical video consists of 30 frames per second. If the video is one minute long, it contains 1800 frames (30 * 60). Keeping in mind that a single image of 1920×1080 pixels (1080p) takes up roughly 4 MB, a minute-long 1080p video stored frame by frame would take roughly 7200 MB (1800 * 4). However, nowadays, the average size of a 1-minute video is around 25-30 MB. So, what makes this magical reduction possible? The answer is video encoding.
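The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 4 MB per frame and the 30 MB encoded size are the article's approximate figures, and the function name raw_video_size_mb is made up for illustration.

```python
def raw_video_size_mb(fps: int, seconds: int, frame_size_mb: float) -> float:
    """Size of a video if every frame were stored as an independent image."""
    frame_count = fps * seconds
    return frame_count * frame_size_mb


FPS = 30            # frames per second
DURATION_S = 60     # one minute
FRAME_MB = 4.0      # approximate size of one 1080p image (article's figure)
ENCODED_MB = 30.0   # upper end of the article's 25-30 MB range

raw_mb = raw_video_size_mb(FPS, DURATION_S, FRAME_MB)
ratio = raw_mb / ENCODED_MB

print(f"frames: {FPS * DURATION_S}")
print(f"size without encoding: {raw_mb:.0f} MB")
print(f"reduction achieved by encoding: about {ratio:.0f}x")
```

Under these assumptions, the encoder has to shrink the data by a factor in the hundreds, which is why the steps described next matter so much.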

Difference between images and video

Video encoding is a multi-step process that aims to reduce the video file size. The key aspects here are motion and the similarity among frames, which enable us to represent the same content using less data than storing each frame individually. The main goal of the video encoder is to find those similarities and exploit them to reduce the data size.

Encoding starts with the detection of motion between frames. Finding this motion is called motion estimation, and using it to predict the current frame from a reference frame is called motion compensation. The goal is to represent the current frame as the difference from the reference frame. To achieve more accurate motion detection, each frame is first split into smaller pieces, called blocks. Afterward, the blocks in the current frame are matched with blocks in the reference frame, and the displacement of each block is represented as a motion vector. Finally, the reference frame and the motion vectors, together with a small residual that corrects the remaining prediction error, are stored instead of the entire frame, which is enough to reconstruct the video.

Block matching and motion vectors
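Block matching can be sketched in plain Python as follows. This is a toy illustration, not how a production encoder works: it assumes grayscale frames stored as lists of lists, an exhaustive search, and the sum of absolute differences (SAD) as the matching cost. The helper names (make_frame, sad, best_motion_vector) are hypothetical.

```python
def make_frame(top, left, size=12, patch=4):
    """Build a size x size grayscale frame: black background plus one textured patch."""
    frame = [[0] * size for _ in range(size)]
    for r in range(patch):
        for c in range(patch):
            frame[top + r][left + c] = 50 + 10 * r + c  # distinct non-zero values
    return frame


def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def best_motion_vector(cur, ref, by, bx, size, search):
    """Search ref within +/- search pixels of (by, bx) for the best match.

    Returns (dy, dx, cost): the block at (by, bx) in cur is best predicted
    by the block at (by + dy, bx + dx) in ref.
    """
    height, width = len(ref), len(ref[0])
    best = (0, 0, sad(cur, ref, by, bx, by, bx, size))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = by + dy, bx + dx
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(cur, ref, by, bx, ry, rx, size)
            if cost < best[2]:
                best = (dy, dx, cost)
    return best


reference = make_frame(top=3, left=2)  # object position in the reference frame
current = make_frame(top=4, left=4)    # object moved 1 px down, 2 px right

dy, dx, cost = best_motion_vector(current, reference, by=4, bx=4, size=4, search=3)
print(f"motion vector: ({dy}, {dx}), matching cost: {cost}")
```

In this toy case the match is perfect, so the block can be described by its motion vector alone. In real video the match is rarely exact, and the leftover difference is what the later steps compress.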

Once the motion compensation step is done, we have the key information to compress the video into a smaller size. However, there is still redundant information that takes too much space and can be eliminated without significantly distorting the visual quality.

Our eyes are more sensitive to brightness (luminance) than to color (chrominance). So, if someone slightly changed the colors of some pixels in the video, we probably would not notice the difference. However, a change in the brightness of pixels is easily detected. That’s why we allocate more data to luminance than to color in the video. This step is called chroma subsampling.

In video, color is represented using luminance (Y), blue-difference chroma (Cb), and red-difference chroma (Cr) values instead of the Red-Green-Blue (RGB) channels used in other digital use cases. In the chroma subsampling step, frames are split into Y and CbCr channels, and the sampling scheme is written as a:b:c for a region that is a pixels wide and two rows high, where b is the number of pixels in the first row that have their own color values, and c is the number of pixels in the second row that have their own color values. In 4:2:0, for example, the second row carries no color values of its own and reuses those of the first row. Typical a:b:c setups are 4:4:4 (raw content, cinema), 4:2:2 (high-end digital content, broadcast), and 4:2:0 (video streaming).

Chroma subsampling examples
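To make the a:b:c notation concrete, here is a small Python sketch that counts how many values each scheme stores, using the definition given above. It also shows one possible way to produce 4:2:0 chroma, by averaging each 2×2 group of pixels. The averaging is an assumption made for illustration (other filters exist), and the function names are hypothetical.

```python
def samples_per_region(a, b, c):
    """Values stored for one region that is a pixels wide and two rows high."""
    luma = 2 * a            # every pixel keeps its own Y value
    chroma = 2 * (b + c)    # b + c positions, each with one Cb and one Cr value
    return luma + chroma


def relative_size(scheme):
    """Data size of a scheme such as '4:2:0' relative to full 4:4:4 sampling."""
    a, b, c = (int(part) for part in scheme.split(":"))
    return samples_per_region(a, b, c) / samples_per_region(a, a, a)


def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging 2x2 groups.

    Assumes the plane has an even width and height.
    """
    out = []
    for y in range(0, len(plane) - 1, 2):
        row = []
        for x in range(0, len(plane[0]) - 1, 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            row.append(total // 4)
        out.append(row)
    return out


for scheme in ("4:4:4", "4:2:2", "4:2:0"):
    print(f"{scheme}: {relative_size(scheme):.0%} of the 4:4:4 data")

cb_plane = [
    [100, 102, 50, 50],
    [104, 98, 50, 50],
    [0, 0, 200, 200],
    [0, 0, 200, 200],
]
print(subsample_420(cb_plane))
```

With this counting, 4:2:0 stores half as many values as 4:4:4 while keeping every luminance value untouched.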

Chroma subsampling enables us to use less data to represent color information, significantly reducing the required data size. At this point, we have all the data we need to reconstruct the video. However, we can still reduce the data size with a few more tricks. Transformation and quantization form the final step in the encoding pipeline, which works on the numerical values that end up in the bit stream rather than on frame-level information. Here, the goal is to reduce the number of unique values that need to be stored. Moreover, we can control the quality level of the video in this step, for example by changing the quantization factor: a larger factor gives a smaller file but lower quality.
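The effect of the quantization factor can be illustrated with a short Python sketch. The article does not name the transform, so this example assumes a one-dimensional discrete cosine transform (DCT), a common choice in video codecs, applied to a single row of eight pixel values. The sample values and function names are made up for illustration.

```python
import math


def dct(values):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(values))
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of dct(): rebuilds the samples from the coefficients."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            total += scale * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out


def quantize(coeffs, step):
    """Map each coefficient to the nearest multiple of step (stored as an integer)."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Approximate the original coefficients from the stored integers."""
    return [level * step for level in levels]


row = [52, 55, 61, 66, 70, 61, 64, 73]  # one row of pixel values (made up)
coeffs = dct(row)

for step in (4, 40):
    levels = quantize(coeffs, step)
    restored = idct(dequantize(levels, step))
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(f"step={step}: levels={levels}, "
          f"unique values={len(set(levels))}, max pixel error={worst:.1f}")
```

A larger step maps more coefficients to the same integer, often zero, so there are fewer distinct values to store, at the cost of a larger reconstruction error.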

At this point, we have a compressed bit stream instead of the video content. What happens when we want to play the encoded video? We first need to decode it, and this is done by the video decoder. In fact, you only get a working system when you combine both the enCOder and the DECoder. This is why the tool we use to compress video is called a video codec. Video codecs are carefully designed by standardization committees with contributions from hundreds of scientists from both academia and industry.

So now we have an idea of what video is and how it is prepared using a video encoder. Next comes the question of how it is delivered to our screens. We will answer that in our next blog post.

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.
