
How AI contributes to video production and delivery

According to the Ericsson Mobility Report 2020, video will account for about 76% of global mobile network traffic by 2025. For comparison, in 2021 video's share of internet traffic was 63%.

As the amount of high-quality content grows, file sizes keep increasing.

As a result, new methods of video compression and delivery are being developed, including ones that use artificial intelligence.

Every day we learn something new about how AI technologies are trained and woven into our lives. Artificial intelligence entered video streaming back in 2015 in the form of machine learning and gradually improved its algorithms by processing large amounts of data. Today AI, machine learning, and deep learning are all involved in the production and delivery of video.

Let's look at what each of these concepts is and how they are related.

AI is a broad industry that includes both machine learning and deep learning.

Machine learning is a subfield of AI.

Deep learning is a subset of machine learning.

AI is the ability of a computer to perform human tasks that require intelligent thought: making decisions, recognizing images, understanding what a speaker said, or winning at chess.

To do all this, computers use algorithms.

Machine learning approaches the same tasks differently: its algorithms learn on their own, from experience.

That is, the algorithms are trained by processing large datasets without pre-programmed instructions. The more data, the more experience, and the better the training.

Deep learning is a sub-branch of machine learning. Here the computer learns, but in a slightly different way: it uses neural networks, algorithms that mimic the logic of neurons in the human brain.

Large amounts of data pass through these neural networks, and the output is ready-made answers. Neural networks are more complex than classic machine learning models, and it is not always clear which factors influenced a particular answer.

In deep learning, big data is processed by a digital network that resembles the system of neurons in the human brain. Information is categorized, AI looks for patterns, adapts and learns.

Neural networks are called “black boxes” because it is not always clear what is happening inside them. Decisions involve several layers of the network, and it is not always possible to tell which one was activated to produce a given decision.

How can these technologies be useful in streaming? 

Let's try to answer this question in this review.

How AI is useful for encoding

Because streaming content varies so much, it requires different handling: you cannot encode a football match and a “talking heads” broadcast the same way.

Until 2015, there was only a fixed bitrate ladder that ignored the characteristics of the video. As a result, some content came out at low quality, while other content wasted bandwidth on data it did not need.

Netflix changed this by designing and implementing an optimized encoding scheme for each title: by analyzing each particular source, you can work out how to encode that file most effectively.
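As a rough sketch of the per-title idea, the snippet below runs over a set of trial encodes and keeps, at each bitrate rung, the resolution that scores best. The bitrates, resolutions, and quality scores are invented for illustration, not real measurements:

```python
# Toy per-title ladder selection: for each bitrate rung, keep the resolution
# whose trial encode scored highest. The numbers are made up for the example.
trial_encodes = [
    # (bitrate_kbps, resolution, quality_score)
    (1000, "1080p", 78), (1000, "720p", 84),
    (3000, "1080p", 92), (3000, "720p", 88),
]

ladder = {}
for bitrate, resolution, score in trial_encodes:
    if bitrate not in ladder or score > ladder[bitrate][1]:
        ladder[bitrate] = (resolution, score)

print(ladder)  # at 1000 kbps 720p wins; at 3000 kbps 1080p is worth it
```

The point is that the winning resolution depends on the content and the bitrate, which a fixed ladder cannot capture.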

What Netflix did

They encoded all outgoing files at multiple resolutions and bitrates and used the PSNR (peak signal-to-noise ratio) metric for analysis.
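PSNR is simple enough to show in a few lines. This is a minimal pure-Python sketch over flattened grayscale pixel lists; real tools compute it per frame over full-resolution images:

```python
import math

def psnr(ref, dist, max_val=255.0):
    """Peak signal-to-noise ratio, in dB, between reference and distorted pixels."""
    mse = sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)
    if mse == 0:
        return math.inf  # identical signals: no distortion at all
    return 10 * math.log10(max_val ** 2 / mse)

reference = [100] * 16          # a flat 4x4 gray patch, flattened
distorted = [110] + [100] * 15  # one pixel off by 10 after encoding
print(round(psnr(reference, distorted), 2))  # → 40.17 (higher dB = closer to original)
```

A higher PSNR means the encode is numerically closer to the source, which is exactly what makes it a crude measure: it ignores how humans actually perceive quality.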

Today that methodology looks obsolete, and even at the time it did not give ideal results. A year later, the Video Multimethod Assessment Fusion (VMAF) machine learning method appeared; VMAF performs a deeper analysis built on quality data provided by human viewers.

By including people's scores in its analysis, VMAF learned how to improve results, making it possible to adapt video not only to the big screen but also to mobile devices.

VMAF is open source, so there are many uses for it, from video surveillance systems to animation and sports broadcasts.

How Netflix, YouTube and Facebook started using machine learning

Machine learning has shown that encoding can be more efficient.

In essence, VMAF for Netflix is about finding the highest-quality streams among different encodes of each source file. The data volumes are very large, so only a small group of candidate files is encoded.

What's happening at YouTube

The platform receives 300 hours of video files every minute.

How do you deliver all of them in good quality, right away? A neural network takes over this task: from about 200,000 test encodes it isolates the best files, leaving 14,000 clips.

When you upload a clip, its data enters the neural network, YouTube applies the encoding parameters the network has selected, and the video is delivered in that form.

Netflix is doing something similar with their Dynamic Optimizer technology.

Facebook is addressing the high demand for video encoding by combining a cost-benefit model with a machine learning model that predicts which videos will be actively viewed and encodes them first.
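The scheduling idea can be sketched as a priority queue ordered by predicted benefit per unit of encoding cost. The scoring formula and numbers below are made up for the illustration, not Facebook's actual model:

```python
# Toy cost-benefit encoding queue: videos predicted to be watched more,
# relative to how expensive they are to encode, get scheduled first.
jobs = [
    {"id": "clip_a", "predicted_views": 100, "encode_cost": 4},
    {"id": "clip_b", "predicted_views": 5000, "encode_cost": 10},
    {"id": "clip_c", "predicted_views": 40, "encode_cost": 1},
]

def benefit_per_cost(job):
    # In a real system the view prediction would come from an ML model.
    return job["predicted_views"] / job["encode_cost"]

queue = sorted(jobs, key=benefit_per_cost, reverse=True)
print([j["id"] for j in queue])  # → ['clip_b', 'clip_c', 'clip_a']
```

The effect is that compute is spent where it pays off most, instead of encoding uploads strictly in arrival order.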

The first machine learning video codec

In 2018, WaveOne developed a machine learning codec that compressed videos 20 times better than H.265 and VP9.

On standard-definition video (SD / VGA, 640 × 480), the difference reached 60%.

The developers argued that classic compression technologies are well engineered but not adaptive, so they are losing relevance as demand grows for video in areas such as object detection, social media, virtual reality, and live streaming.

Their development was the first machine learning method to show positive results.

How did compression work before?

The main idea was to remove redundant data, replacing it with a short, compact description.

This happened in two steps: motion compression and residual compression.

  1. Motion compression

The codec finds an object, analyzes where this object will be in the next frame. The algorithm then encodes the shape of the object and the direction in which it will move.

It is believed that this option does not work for live broadcasts.

  2. Residual compression: removing the superfluous between frames

That is, the algorithm does not waste space and time writing out every pixel. It simplifies its task by defining an area of a particular color, analyzing how many frames that area will stay unchanged for, and producing residual compression.
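The two steps above can be sketched with a toy block matcher: exhaustive search finds the motion vector, and the residual is whatever the moved block fails to predict. The frames here are tiny grayscale grids; real codecs do this per block over full frames:

```python
def best_motion_vector(prev, curr, bx, by, size, radius):
    """Offset (dx, dy) into `prev` that best matches the block at (bx, by)
    in `curr`, by minimal sum of absolute differences (SAD)."""
    h, w = len(prev), len(prev[0])
    best, best_sad = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sy, sx = by + dy, bx + dx
            if sy < 0 or sx < 0 or sy + size > h or sx + size > w:
                continue  # candidate block falls outside the frame
            sad = sum(abs(curr[by + i][bx + j] - prev[sy + i][sx + j])
                      for i in range(size) for j in range(size))
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

# A bright 2x2 object moves one pixel to the right between frames.
prev = [[0] * 5 for _ in range(5)]
curr = [[0] * 5 for _ in range(5)]
for y in (1, 2):
    prev[y][1] = prev[y][2] = 9
    curr[y][2] = curr[y][3] = 9

mv, sad = best_motion_vector(prev, curr, bx=2, by=1, size=2, radius=2)
print(mv, sad)  # → (-1, 0) 0: found one pixel to the left in prev, residual is zero
```

When the SAD is zero, the motion vector alone reproduces the block and there is no residual to store; anything left over is what residual compression encodes.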

Machine learning has improved both of these methods

When compressing motion, machine learning algorithms have learned to detect redundancies that conventional codecs cannot see. For example, they can predict from spatiotemporal patterns what a person's face will look like as they turn their head from a frontal view to a profile.

With residual compression, machine learning algorithms are better at deciding what to do for a particular frame: compress motion or produce residual compression.

Traditional algorithms cannot make this decision, while machine learning compresses both signals at the same time and, by analyzing frame complexity, distributes the bandwidth between them in the best possible way.

On the left is the optical flow from the H.265 codec; on the right, the one from WaveOne.


Is everything perfect? Unfortunately, no. MIT Technology Review noticed the technology's shortcomings, highlighting how much time is spent on encoding and decoding:

The average speed of the new decoder is 10 frames per second on an Nvidia Tesla V100 with VGA-sized video.

For the encoder, the rate is roughly two frames per second, which rules out live broadcasting entirely. Even for offline encoding, its capabilities are not as broad as one would hope.

To watch a video even at minimum SD quality, compressed by a machine-learning codec, you would need a computing cluster with several graphics accelerators, which is hardly practical for watching videos on a PC.

To view the same video in HD, you would need a whole computing farm.

That's how it all started. 

What's going on in this field today?

In 2022, Collabora shared its work on efficient video compression for video conferencing.

When there is a person's face in the video, it is transmitted as clearly as possible: the required bandwidth is reduced by a factor of ten, while the quality remains at the level of H.264.

The implementation is written in Python using the PyTorch framework.

With this technology, facial details can be reconstructed even when strong compression was used during transmission.

Using a high-quality image of a person and data based on how facial expressions change in a video, machine learning calculates how a face should look.

The video is sent at a low bitrate but has already been reconstructed by the time the recipient sees it. A super-resolution model is also used to improve quality.
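A back-of-the-envelope calculation shows why sending keypoints beats sending frames. The comparison below is against raw, uncompressed frames, and the frame size and landmark count are illustrative assumptions, not Collabora's actual numbers (their tenfold figure is measured against H.264):

```python
# Hypothetical sizes: one raw reference frame up front, then only a handful
# of facial-keypoint floats per frame to drive reconstruction on the receiver.
FRAME_BYTES = 640 * 480 * 3      # one uncompressed VGA RGB frame
KEYPOINT_BYTES = 68 * 2 * 4      # 68 landmarks, (x, y) as 32-bit floats

def bytes_sent(n_frames, keypoint_mode):
    if keypoint_mode:
        return FRAME_BYTES + n_frames * KEYPOINT_BYTES  # reference + deltas
    return n_frames * FRAME_BYTES  # send every raw frame

full = bytes_sent(300, False)  # ~10 seconds of raw video at 30 fps
kp = bytes_sent(300, True)
print(round(full / kp, 1))  # keypoint mode is a couple hundred times smaller
```

The gap versus a real codec is far smaller, but the principle is the same: once the receiver can synthesize the face, you only need to transmit how it moves.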

Watch the demonstration video to see how drastically this technology improves image quality.

Buffering, video streaming and machine learning

Buffering is something the user would prefer not to think about. To solve the buffering problem, the MIT Computer Science and Artificial Intelligence Laboratory developed a neural network-based technology called Pensieve.

Based on network and playback data, Pensieve decides which streams to fetch so that buffering and other playback difficulties do not occur.
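Pensieve learns this decision policy with a neural network; as a rough hand-written stand-in, here is the kind of buffer-based rule an adaptive-bitrate player might use, with a made-up bitrate ladder:

```python
# Toy buffer-based ABR rule: pick the next chunk's bitrate from how many
# seconds of video are already buffered. Pensieve replaces hand-tuned
# thresholds like these with a learned policy.
BITRATES_KBPS = [300, 750, 1200, 2850]  # hypothetical ladder

def pick_bitrate(buffer_seconds):
    if buffer_seconds < 5:
        return BITRATES_KBPS[0]   # nearly empty buffer: play it safe
    if buffer_seconds < 10:
        return BITRATES_KBPS[1]
    if buffer_seconds < 20:
        return BITRATES_KBPS[2]
    return BITRATES_KBPS[3]       # healthy buffer: go for top quality

print(pick_bitrate(3), pick_bitrate(15), pick_bitrate(30))  # → 300 1200 2850
```

Fixed thresholds like these behave poorly when network conditions shift, which is exactly the gap a learned policy is meant to close.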

The technology may become simpler and more accessible in the future. We hope so.

If you are interested in learning more about the buffering problem, read the article “Video buffering: what causes it and how to prevent it” on our blog.

The future of video without video?

Finally, let's talk a little about one more service. It allows you to create video content without a camera, microphone, lights, or actors.

All you need is an internet connection and your script: the service turns text into videos with visual characters. Create a presentation, a training video, a news release, or content for social media posts.

Right now this service feels more like entertainment, but in a few years it may become a basic tool for marketers, making it possible to create video content on a limited budget.

If you found this interesting and don't want to miss the latest streaming news, follow us on social media.