Intro

A gentle introduction to video technology. Although it's aimed at software developers and engineers, we want to make it easy for anyone to learn. This idea was born during a mini workshop for newcomers to video technology.

The goal is to introduce some digital video subjects with simple text, visual elements and practical examples where possible, and to make this knowledge available everywhere. Please feel free to correct it, suggest things and improve it.

There will be hands-on sections which require you to have docker installed and this repository cloned.

git clone https://github.com/leandromoreira/digital_video_introduction.git
cd digital_video_introduction
./setup.sh

WARNING: when you see a ./s/ffmpeg or ./s/mediainfo command, it means we're running a containerized version of that program, which already includes all the needed requirements.

All the hands-on exercises should be performed from the folder where you cloned this repository. For the jupyter examples you must start the server with ./s/start_jupyter.sh, then copy the URL and open it in your browser.

Index

Basic video/image terminology

An image can be thought of as a 2D matrix. If we think about colors, we can extend this idea and see the image as a 3D matrix where the additional dimension is used to provide color info.

If we choose to represent these colors using the primary colors (red, green and blue), we can then define three planes: the first for red, the second for green and the last for the blue color.

an image is a 3d matrix RGB

Each point in this matrix, which we'll call a pixel (picture element), holds the intensity (usually a numeric value) of the given color. A totally red color means 0 of green, 0 of blue and the maximum of red; the pink color can be formed (using 0 to 255 as the possible range) with Red=255, Green=192 and Blue=203.
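
As a minimal sketch of this idea (assuming you have python and numpy installed, as the jupyter hands-on does), we can build a tiny RGB image as a 3D matrix and check the pink pixel we just described:

import numpy as np

# a 2x2 image with 3 color planes: (height, width, channels)
image = np.zeros((2, 2, 3), dtype=np.uint8)

image[0][0] = [255, 0, 0]      # totally red: max red, 0 green, 0 blue
image[0][1] = [255, 192, 203]  # pink: Red=255, Green=192 and Blue=203

print(image[0][1])  # [255 192 203]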

Other ways to encode a color image

There are many other models to represent an image with colors. We could, for instance, use an indexed palette where we'd spend only a single byte for each pixel instead of the 3 required by the RGB model. In this model, instead of a 3D matrix we'd use a 2D matrix, saving memory but having far fewer color options.

NES palette

For instance, look at the picture below: the first face is fully colored, while the others are the red, green and blue planes (shown in gray tones).

RGB channels intensity

We can see that the red color is the one that contributes the most (the brightest parts in the second face) to the final color, while the blue color's contribution can mostly be seen only in Mario's eyes (last face) and part of his clothes. Also notice how all the planes contribute less (darkest parts) to Mario's mustache.

Each color intensity requires a certain amount of bits; this quantity is known as the bit depth. Let's say we spend 8 bits (accepting values from 0 to 255) per color (plane); therefore we have a color depth of 24 (8 * 3) bits, and we can also infer that we could use 2 to the power of 24 different colors.
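
A quick sanity check of this bit depth math in python:

bits_per_plane = 8
planes = 3                  # R, G and B
color_depth = bits_per_plane * planes
print(color_depth)          # 24
print(2 ** color_depth)     # 16777216 different colors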

It's also great to learn how an image is captured from the world into bits.

Another property of an image is its resolution, which is the number of pixels in each dimension. It is often presented as width × height, for example the 4×4 image below.

image resolution

Hands-on: play around with image and color

You can play around with images and colors using jupyter (python, numpy, matplotlib, etc.).

You can also learn how image filters (edge detection, sharpen, blur...) work.
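
As a minimal sketch of how such filters work, here's an edge detection pass written with numpy; the tiny image and the Laplacian-like kernel are just illustrative assumptions:

import numpy as np

# a Laplacian-like kernel, commonly used for edge detection
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]])

# a tiny grayscale image: a bright square on a dark background
image = np.zeros((6, 6))
image[2:4, 2:4] = 255

# naive convolution: apply the kernel at every position where it fits
height, width = image.shape
edges = np.zeros((height - 2, width - 2))
for y in range(height - 2):
    for x in range(width - 2):
        edges[y][x] = np.sum(image[y:y+3, x:x+3] * kernel)

print(edges)  # the non-zero values highlight the square's borders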

Another property we can see while working with images or videos is the aspect ratio, which simply describes the proportional relationship between the width and height of an image or pixel.

When people say a movie or picture is 16:9, they usually are referring to the Display Aspect Ratio (DAR). We can also have differently shaped pixels; we call this shape the Pixel Aspect Ratio (PAR).

display aspect ratio

pixel aspect ratio

DVD is DAR 4:3

Although the real resolution of a DVD is 704x480, it still keeps a 4:3 aspect ratio because it has a PAR of 10:11 (704 × 10/11 = 640 and 640/480 = 4/3).
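
A quick check of this math in python:

from fractions import Fraction

width, height = 704, 480
par = Fraction(10, 11)

dar = Fraction(width, height) * par
print(dar)  # 4/3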

Finally, we can define a video as a succession of n frames over time, which can be seen as another dimension; n is then the frame rate or frames per second (FPS).

video

The amount of bits per second needed to show a video is its bit rate. For example, a video with 30 frames per second, 24 bits per pixel and a resolution of 480x240 will need 82,944,000 bits per second, or 82.944 Mbps (30 x 480 x 240 x 24), if we don't employ any kind of compression.
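
The same calculation in python:

fps = 30
width, height = 480, 240
bits_per_pixel = 24

bit_rate = fps * width * height * bits_per_pixel
print(bit_rate)              # 82944000 bits per second
print(bit_rate / 1_000_000)  # 82.944 Mbps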

When the bit rate is constant it's called constant bit rate (CBR), but it can also vary, which is then called variable bit rate (VBR).

constrained vbr

In the early days, engineers came up with a technique for doubling the perceived frame rate of a video display without consuming extra bandwidth; this technique is known as interlaced video. It basically sends half of the screen in one "frame" and the other half in the next "frame".

Today screens render mostly using the progressive scan technique. Progressive is a way of displaying, storing, or transmitting moving images in which all the lines of each frame are drawn in sequence.

interlaced vs progressive

Now we have an idea about what an image is, how its colors are arranged, how many bits per second we spend to show a video, whether its bit rate is constant (CBR) or variable (VBR), with a given resolution and a given frame rate, and we've learned many other terms such as interlaced video and PAR.

Hands-on: Check video properties

You can check most of the explained properties with ffmpeg or mediainfo.

Redundancy removal

Colors, Luminance and our eyes

Hands-on: Check YUV histogram

You can check the YUV histogram with ffmpeg.

yuv color histogram

Frame types

I Frame (intra, keyframe)

P Frame (predicted)

B Frame (bi-predictive)

Temporal redundancy (inter prediction)

Hands-on: See the motion vectors

We can generate a video showing the inter prediction (motion vectors) with ffmpeg.

inter prediction (motion vectors) with ffmpeg

Or we can use the Intel Video Pro Analyzer (which is paid but there is a free trial version which limits you to only the first 10 frames).

inter prediction intel video pro analyzer

Spatial redundancy (intra prediction)

Hands-on: Check intra predictions

You can generate a video with macro blocks and their predictions with ffmpeg. Please check the ffmpeg documentation to understand the meaning of each block color.

intra prediction (macro blocks) with ffmpeg

Or we can use the Intel Video Pro Analyzer (which is paid but there is a free trial version which limits you to only the first 10 frames).

intra prediction intel video pro analyzer

How does a video codec work?

What? Why? How?

What? It's software / hardware that compresses or decompresses digital video. Why? Market and society demand higher quality videos with limited bandwidth or storage. Remember when we calculated the needed bandwidth for a video with 30 frames per second, 24 bits per pixel and a resolution of 480x240? It was 82.944 Mbps with no compression applied. Compression is the only way to deliver HD/FullHD/4K on TVs and over the Internet. How? We'll take a brief look at the major techniques here.

History

Before we jump in the inner works of a generic codec, let's look back to understand a little better about some old video codecs.

The video codec H261 was born in 1990 (technically 1988); it was designed to work with data rates of 64 kbit/s. It already used ideas such as chroma subsampling, macroblocks, etc. In 1995, the H263 video codec standard was published, and it continued to be extended until 2001.

In 2003 the first version of H.264/AVC was completed. In the same period, a company called On2 Technologies (formerly known as the Duck Corporation) released their video codec VP3 as a royalty-free lossy video compression format. In 2010, Google bought this company and released VP8 in the same year. In December of 2012, Google released VP9, which is supported by roughly ¾ of the browser market (mobile included).

AV1 is a new royalty-free, open source video codec being designed by the Alliance for Open Media (AOMedia), which is composed of companies such as Google, Mozilla, Microsoft, Amazon, Netflix, AMD, ARM, NVidia, Intel and Cisco, among others. The first version of the reference codec, 0.1.0, was published on April 7, 2016.

codec history timeline

If you want to learn more about the history of the codecs you must learn the basics behind video compression patents.

A generic codec

We're going to introduce the main mechanics behind a generic video codec but most of these concepts are useful and used in modern codecs such as VP9, AV1 and HEVC. Be sure to understand that we're going to simplify things a LOT. Sometimes we'll use a real example (mostly H264) to demonstrate a technique.

1st step - picture partitioning

The first step is to divide the frame into several partitions, sub-partitions and beyond.

picture partitioning

But why? There are many reasons; for instance, when we split the picture we can work the predictions more precisely, using small partitions for the moving parts while using bigger partitions for the static background.

Usually, the codecs organize these partitions into slices (or tiles), macroblocks (or coding tree units) and many sub-partitions. The max size of these partitions varies: HEVC sets 64x64 while AVC uses 16x16, but the sub-partitions can reach sizes as small as 4x4.

Remember that we learned how frames are typed?! Well, you can apply those ideas to blocks too; therefore we can have I-Slices, B-Slices, I-Macroblocks, etc.

Hands-on: Check partitions

We can also use the Intel Video Pro Analyzer (which is paid, but there is a free trial version which limits you to only the first 10 frames). Here are VP9 partitions being analyzed.

VP9 partitions view intel video pro analyzer

2nd step - predictions

3rd step - transform

4th step - quantization

5th step - entropy coding

After we quantized the data (image blocks/slices/frames) we can still compress it in a lossless way. There are many ways (algorithms) to compress data. We're going to briefly experience some of them; for a deeper understanding you can read the amazing book Understanding Compression: Data Compression for Modern Developers.

Delta coding:

I love the simplicity of this method (it's amazing). Let's say we need to compress the following numbers: [0,1,2,3,4,5,6,7]. If we just subtract from each number the previous one, we get the array [0,1,1,1,1,1,1,1], which is highly compressible.

Both encoder and decoder must know the rule of delta formation.
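
A minimal sketch of delta encoding and decoding in python:

def delta_encode(numbers):
    # keep the first number, then store only the difference to the previous one
    return [numbers[0]] + [cur - prev for prev, cur in zip(numbers, numbers[1:])]

def delta_decode(deltas):
    # rebuild each number by adding the delta to the previous number
    numbers = [deltas[0]]
    for delta in deltas[1:]:
        numbers.append(numbers[-1] + delta)
    return numbers

print(delta_encode([0, 1, 2, 3, 4, 5, 6, 7]))  # [0, 1, 1, 1, 1, 1, 1, 1]
print(delta_decode([0, 1, 1, 1, 1, 1, 1, 1]))  # [0, 1, 2, 3, 4, 5, 6, 7]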

VLC coding:

Let's suppose we have a stream with the symbols a, e, r and t, whose probabilities (from 0 to 1) are represented by this table.

             a    e    r    t
probability  0.3  0.3  0.2  0.2

We can assign unique binary codes (preferably small ones) to the most probable symbols, and bigger codes to the least probable ones.

             a    e    r    t
probability  0.3  0.3  0.2  0.2
binary code  0    10   110  1110

Let's compress the stream eat. Assuming we would spend 8 bits for each symbol, we would spend 24 bits without any compression. But if we replace each symbol with its code, we can save space.

The first step is to encode the symbol e, which is 10; the second symbol is a, which is appended (not in the mathematical sense), giving [10][0]; and finally the third symbol t, which makes our final compressed bitstream [10][0][1110] or 1001110, which requires only 7 bits (3.4 times less space than the original).

Notice that each code must be a unique prefix code; Huffman coding can help you to find these codes. Though it has some issues, there are video codecs that still offer this method, and it's the algorithm behind many applications which require compression.

Both encoder and decoder must know the symbol table with its codes, therefore you need to send the table too.
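
A minimal sketch of this scheme in python, using the code table above:

# the prefix-free code table from above
codes = {"a": "0", "e": "10", "r": "110", "t": "1110"}

def vlc_encode(stream):
    return "".join(codes[symbol] for symbol in stream)

def vlc_decode(bits):
    reverse = {code: symbol for symbol, code in codes.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # a complete code was read
            symbols.append(reverse[current])
            current = ""
    return "".join(symbols)

encoded = vlc_encode("eat")
print(encoded, len(encoded))  # 1001110 7
print(vlc_decode(encoded))    # eat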

Arithmetic coding:

Let's suppose we have a stream with the symbols: a, e, r, s and t and their probability is represented by this table.

             a    e    r     s     t
probability  0.3  0.3  0.15  0.05  0.2

With this table in mind, we can build ranges containing all the possible symbols, sorted by the most frequent ones.

initial arithmetic range

Now let's encode the stream eat: we pick the first symbol e, which is located within the subrange 0.3 to 0.6 (not included), and we take this subrange and split it again using the same proportions as before, but within this new range.

second sub range

Let's continue to encode our stream eat: now we take the second symbol a, which is within the new subrange 0.3 to 0.39, and then we take our last symbol t, and doing the same process again we get the final subrange 0.354 to 0.372.

final arithmetic range

We just need to pick a number within the final subrange 0.354 to 0.372; let's choose 0.36, but we could choose any number within this subrange. With only this number we'll be able to recover our original stream eat. If you think about it, it's as if we were drawing a line within ranges of ranges to encode our stream.

final range traverse

The reverse process (a.k.a. decoding) is equally easy: with our number 0.36 and our original range, we can run the same process, but now using this number to reveal the stream encoded behind it.

With the first range we notice that our number fits within the e slice, therefore it's our first symbol; now we split this subrange again, doing the same process as before, and we'll notice that 0.36 fits the symbol a; after we repeat the process, we come to the last symbol t (forming our original encoded stream eat).

Both encoder and decoder must know the symbol probability table, therefore you need to send the table.
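
A minimal sketch of this range narrowing in python, using the probability table above (floating point is used here for simplicity; real codecs use integer arithmetic):

# cumulative ranges, sorted by the most frequent symbols: a, e, t, r, s
ranges = {"a": (0.0, 0.3), "e": (0.3, 0.6), "t": (0.6, 0.8),
          "r": (0.8, 0.95), "s": (0.95, 1.0)}

def arithmetic_encode(stream):
    low, high = 0.0, 1.0
    for symbol in stream:
        span = high - low
        sym_low, sym_high = ranges[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high  # any number within [low, high) encodes the stream

def arithmetic_decode(number, length):
    symbols, low, high = [], 0.0, 1.0
    for _ in range(length):
        span = high - low
        for symbol, (sym_low, sym_high) in ranges.items():
            if low + span * sym_low <= number < low + span * sym_high:
                symbols.append(symbol)
                low, high = low + span * sym_low, low + span * sym_high
                break
    return "".join(symbols)

print(arithmetic_encode("eat"))    # roughly (0.354, 0.372)
print(arithmetic_decode(0.36, 3))  # eat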

Pretty neat, isn't it? People are damn smart to come up with such a solution; some video codecs use this technique (or at least offer it as an option).

The idea is to losslessly compress the quantized bitstream. For sure this article is missing tons of details, reasons, trade-offs, etc., but you should learn more as a developer. Newer codecs are trying to use different entropy coding algorithms, like ANS.

Hands-on: CABAC vs CAVLC

You can generate two streams, one with CABAC and the other with CAVLC, and compare the time it took to generate each of them as well as the final size.

6th step - bitstream format

After all these steps we need to pack the compressed frames alongside the context of these steps. We need to explicitly inform the decoder about the decisions taken by the encoder: bit depth, color space, resolution, prediction info (motion vectors, direction of prediction), profile, level, frame rate, frame type, frame number and many more.

We're going to study, superficially, the H264 bitstream. Our first step is to generate a minimal [1] H264 bitstream; we can do that using our own repository and ffmpeg.

./s/ffmpeg -i /files/i/minimal.png -pix_fmt yuv420p /files/v/minimal_yuv420.h264

[1] ffmpeg adds, by default, all the encoding parameters as a SEI NAL; soon we'll define what a NAL is.

This command will generate a raw h264 bitstream with a single frame, 64x64, with the color space yuv420, using the following image as the frame.

used frame to generate minimal h264 bitstream

H264 bitstream

The AVC (H264) standard defines that the information will be sent in macro frames (in the network sense) called NAL (Network Abstraction Layer) units. The main goal of the NAL is the provision of a "network-friendly" video representation; this standard must work on TVs (stream based), the Internet (packet based) and others.

NAL units H264

There is a synchronization marker to define the boundaries between NAL units. Each synchronization marker holds the value 0x00 0x00 0x01, except for the very first one, which is 0x00 0x00 0x00 0x01. If we run hexdump on the generated h264 bitstream, we can identify at least three NALs at the beginning of the file.

synchronization marker on NAL units

As we said before, the decoder needs to know not only the picture data but also the details of the video, frame, colors, used parameters and others. The first byte of each NAL defines its category and type.

NAL type id  Description
0            Undefined
1            Coded slice of a non-IDR picture
2            Coded slice data partition A
3            Coded slice data partition B
4            Coded slice data partition C
5            IDR (Coded slice of an IDR picture)
6            SEI (Supplemental enhancement information)
7            SPS (Sequence parameter set)
8            PPS (Picture parameter set)
9            Access unit delimiter
10           End of sequence
11           End of stream
...          ...

Usually the first NAL of a bitstream is an SPS; this type of NAL is responsible for informing the decoder of the general encoding variables like profile, level, resolution and others.

If we skip the first synchronization marker, we can decode the first byte to know the type of the first NAL.

For instance, the first byte after the synchronization marker is 01100111: the first bit (0) is the field forbidden_zero_bit; the next 2 bits (11) are the field nal_ref_idc, which indicates whether this NAL is a reference field or not; and the remaining 5 bits (00111) are the field nal_unit_type, in this case an SPS (7) NAL unit.

The second byte (binary=01100100, hex=0x64, dec=100) of an SPS NAL is the field profile_idc, which shows the profile that the encoder has used; in this case we used the constrained high profile, a high profile without support for B (bi-predictive) slices.

SPS binary view
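
As a minimal sketch, here's how we could split this file at the synchronization markers and decode the first byte of each NAL in python (assuming you ran the ffmpeg command above and the output lives at v/minimal_yuv420.h264 on the host):

# a rough split on the 0x000001 markers, good enough for this tiny file
with open("v/minimal_yuv420.h264", "rb") as f:
    data = f.read()

for nal in data.split(b"\x00\x00\x01"):
    if not nal.strip(b"\x00"):  # skip the leading zeros before the first marker
        continue
    first_byte = nal[0]
    forbidden_zero_bit = first_byte >> 7    # 1 bit, must be 0
    nal_ref_idc = (first_byte >> 5) & 0b11  # 2 bits, reference indicator
    nal_unit_type = first_byte & 0b11111    # 5 bits, e.g. 7 means SPS
    print(forbidden_zero_bit, nal_ref_idc, nal_unit_type)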

When we read the H264 bitstream spec for an SPS NAL, we'll find many values for parameter name, category and description; for instance, let's look at the pic_width_in_mbs_minus_1 and pic_height_in_map_units_minus_1 fields.

Parameter name                   Category  Description
pic_width_in_mbs_minus_1         0         ue(v)
pic_height_in_map_units_minus_1  0         ue(v)

ue(v): unsigned integer Exp-Golomb-coded

If we do some math with the values of these fields, we end up with the resolution. We can represent a width of 1920 using pic_width_in_mbs_minus_1 with the value 119 ((119 + 1) * macroblock_size = 120 * 16 = 1920); again we're saving space, as instead of encoding 1920 we did it with 119.

If we continue to examine our created video with a binary view (e.g. xxd -b -c 11 v/minimal_yuv420.h264), we can skip to the last NAL, which is the frame itself.

h264 idr slice header

We can see its first 6 bytes: 01100101 10001000 10000100 00000000 00100001 11111111. As we already know, the first byte tells us what type of NAL it is; in this case (00101) it's an IDR slice (5), and we can inspect it further:

h264 slice header spec

Using the spec info we can decode the type of slice (slice_type), the frame number (frame_num) and other important fields.

In order to get the values of some fields (ue(v), me(v), se(v) or te(v)) we need to decode them with a special decoder based on Exponential-Golomb coding. This method is very efficient at encoding variable values, mostly when there are many default values.
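
A minimal sketch of an Exponential-Golomb (ue(v)) decoder in python:

def read_ue(bits, position=0):
    # count the leading zeros; they tell how many more bits the value occupies
    leading_zeros = 0
    while bits[position + leading_zeros] == "0":
        leading_zeros += 1
    # interpret everything up to the end of the code as a binary number, minus 1
    end = position + 2 * leading_zeros + 1
    value = int(bits[position:end], 2) - 1
    return value, end  # also return where the next field starts

print(read_ue("1"))        # (0, 1): the code '1' encodes 0
print(read_ue("010"))      # (1, 3): '010' encodes 1
print(read_ue("0001000"))  # (7, 7): the slice_type value we see below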

The values of slice_type and frame_num of this video are: 7 (I slice) and 0 (the first frame).

We can see the bitstream as a protocol. If you want or need to learn more about this bitstream, please refer to the ITU H264 spec. Here's a macro diagram which shows where the picture data (compressed YUV) resides.

h264 bitstream macro diagram

We can also explore other bitstreams, like the VP9 bitstream, H265 (HEVC) or even our new best friend, the AV1 bitstream; they all look similar.

Hands-on: Inspect the H264 bitstream

We can generate a single frame video and use mediainfo to inspect its H264 bitstream. In fact, you can even see the source code that parses the h264 (AVC) bitstream.

mediainfo details h264 bitstream

We can also use the Intel Video Pro Analyzer, which is paid, but there is a free trial version which limits you to only the first 10 frames, and that's okay for learning purposes.

intel video pro analyzer details h264 bitstream

How can H265 achieve a better compression ratio than H264?

[WIP]

Adaptive streaming

What? Why? How?

Creating multiple playlists with mobile networks in mind

HLS and Dash

Building a bit rate ladder

We could create our bit rate options based on many

Encoding parameters: the whys

[WIP]

Audio codec

[WIP]

How to use jupyter

Make sure you have docker installed, then just run ./s/start_jupyter.sh and follow the instructions on the terminal.

References

The richest content is here: this is where all the info we saw in this text was extracted from, based on or inspired by. You can deepen your knowledge with these amazing links, books, videos, etc.
