Images and Videos

Image Encoding#

An image can be represented using a two-dimensional matrix, where each point in the matrix is called a pixel. The color of each pixel is represented using the three primary colors: red, green, and blue.


Each pixel can be stored at different bit depths. Commonly used quantization depths are 16, 24, and 32 bits:

24-bit color mode: Each pixel is encoded with 24 bits (bits per pixel, bpp) of RGB data: three 8-bit unsigned integers (0 to 255) give the intensity of red, green, and blue.
16-bit color mode: Red and blue get 5 bits each and green gets 6 bits (RGB565), because the human eye is more sensitive to green hues. In a variant (RGB555), each primary color occupies 5 bits and the remaining 1 bit is unused.
32-bit color mode: Same as the 24-bit color mode, with the extra 8 bits used to represent the pixel's transparency (alpha).
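The 16-bit layout can be sketched in a few lines. This is only an illustration: it assumes the common RGB565 ordering (red in the top 5 bits, green in the middle 6, blue in the low 5), and the helper names `pack_rgb565` and `unpack_rgb565` are hypothetical.

```python
def pack_rgb565(r: int, g: int, b: int) -> int:
    """Pack 8-bit R, G, B values into one 16-bit RGB565 value.

    The low bits of each channel are dropped: 3 for red and blue, 2 for green.
    """
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def unpack_rgb565(value: int) -> tuple[int, int, int]:
    """Expand an RGB565 value back to 8 bits per channel (dropped bits stay 0)."""
    r = (value >> 11) & 0x1F  # 5 bits of red
    g = (value >> 5) & 0x3F   # 6 bits of green
    b = value & 0x1F          # 5 bits of blue
    return (r << 3, g << 2, b << 3)


print(hex(pack_rgb565(255, 0, 0)))  # 0xf800
print(unpack_rgb565(0xFFFF))        # (248, 252, 248)
```

Green keeps one more bit than red and blue, so it loses less precision in the round trip.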

Another important attribute of an image is its resolution, represented as width x height.


When processing images or videos, another attribute is the aspect ratio, which describes the proportional relationship between the width and height of the image or pixel. Common ratios include 4:3, 16:9, and 21:9, typically referring to the display aspect ratio (DAR). Similarly, pixels can have different aspect ratios, referred to as pixel aspect ratio (PAR).


YUV Color Model#

RGB matches how the human eye perceives color, while YUV exploits the eye's greater sensitivity to brightness. Y represents luminance (luma), and U and V represent chrominance (chroma), carried as Cb (blue difference) and Cr (red difference), respectively. A black-and-white picture needs only Y and can omit U and V. For this reason, YUV is typically written as YCbCr.


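To save bandwidth, most YUV formats average fewer than 24 bits per pixel by storing the chroma components at a lower resolution than luma. The main sampling formats are YCbCr 4:2:0, YCbCr 4:2:2, YCbCr 4:1:1, and YCbCr 4:4:4. This way of writing YUV sampling is called A:B:C notation, and each format describes how chroma is sampled relative to luma: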

4:4:4 indicates full sampling.
4:2:2 indicates 2:1 horizontal sampling, vertical fully sampled.
4:2:0 indicates 2:1 horizontal sampling, 2:1 vertical sampling.
4:1:1 indicates 4:1 horizontal sampling, vertical fully sampled.
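The claim that these formats average fewer than 24 bits per pixel can be checked with a short calculation. The sketch below assumes the usual reading of the notation: in a block 4 pixels wide and 2 rows high, the second number is the count of chroma samples in the first row and the third is the count in the second row. The function name `average_bits_per_pixel` is hypothetical.

```python
def average_bits_per_pixel(j: int, a: int, b: int, bit_depth: int = 8) -> float:
    """Average bits per pixel for J:a:b chroma subsampling.

    The reference block is j pixels wide and 2 rows high. Every pixel has a
    luma sample; Cb and Cr each have a samples in row one and b in row two.
    """
    luma_samples = 2 * j
    chroma_samples = 2 * (a + b)  # Cb plus Cr
    return bit_depth * (luma_samples + chroma_samples) / (2 * j)


for name, (j, a, b) in {
    "4:4:4": (4, 4, 4),
    "4:2:2": (4, 2, 2),
    "4:2:0": (4, 2, 0),
    "4:1:1": (4, 1, 1),
}.items():
    print(name, average_bits_per_pixel(j, a, b))
# 4:4:4 24.0
# 4:2:2 16.0
# 4:2:0 12.0
# 4:1:1 12.0
```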

Conversion Between RGB and YUV#

# Step one: Calculate luminance
Y = 0.299R + 0.587G + 0.114B
# Once luminance is obtained, color (chrominance blue and red) can be separated:
Cb = 0.564(B - Y)
Cr = 0.713(R - Y)
# Conversion can also be done using YCbCr to obtain RGB
R = Y + 1.402Cr
B = Y + 1.772Cb
G = Y - 0.344Cb - 0.714Cr
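The formulas above translate directly into code. The following is a minimal sketch with hypothetical function names; it keeps Cb and Cr as signed values centered on zero, exactly as the formulas are written, and does not add the +128 offset or the clamping that 8-bit storage formats usually apply.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert RGB to Y, Cb, Cr using the coefficients given above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return y, cb, cr


def ycbcr_to_rgb(y: float, cb: float, cr: float) -> tuple[float, float, float]:
    """Convert Y, Cb, Cr back to RGB using the inverse formulas above."""
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = y - 0.344 * cb - 0.714 * cr
    return r, g, b


y, cb, cr = rgb_to_ycbcr(200, 100, 50)
print(ycbcr_to_rgb(y, cb, cr))  # close to (200, 100, 50), up to rounding of the coefficients
```

Because the coefficients are rounded to three decimals, the round trip is only approximate. A gray pixel (R = G = B) yields Cb and Cr close to zero, which is why a black-and-white picture needs only Y.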

Concepts of Sampling Rate, Bit Rate, Frames, and Fields#

An image is obtained by sampling and quantizing an analog signal, and a video consists of a series of images. The higher the resolution and quantization bit depth at capture, the more information each image carries and the clearer it looks. Video also has a sampling frequency: the number of images captured per unit of time, usually given in frames per second. The useful range is bounded by human vision. At 10 to 20 frames per second, the eye perceives fast motion as lacking smoothness; at 20 to 30 frames per second, motion appears smooth. Beyond that, the difference becomes hard to perceive, which is why film is commonly shot at 24 or 30 frames per second.

The number of bits needed to represent one second of video is called the bit rate, also known as the data rate. For uncompressed video, Bit Rate = Width x Height x Bits per Pixel x Frames per Second. For example, without any compression, a video at 30 frames per second, 24 bits per pixel, and a resolution of 480x240 requires 82,944,000 bits per second, or 82.944 Mbps (30 x 480 x 240 x 24).
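The arithmetic can be reproduced with a one-line function. This is a small sketch for uncompressed video only; the function name is hypothetical.

```python
def uncompressed_bit_rate(width: int, height: int, bits_per_pixel: int, fps: int) -> int:
    """Bits per second needed for raw video with no compression."""
    return width * height * bits_per_pixel * fps


bps = uncompressed_bit_rate(480, 240, 24, 30)
print(bps)              # 82944000
print(bps / 1_000_000)  # 82.944
```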

When the bit rate is nearly constant, it is called Constant Bit Rate (CBR), but it can also vary, in which case it is called Variable Bit Rate (VBR). The following diagram shows a constrained VBR, where not too many bits are used when the frame is black.


In video sampling, a complete image obtained through progressive scanning is called a frame; typical frame rates are 25 frames per second (PAL) and 30 frames per second (NTSC). With interlaced scanning, each frame is divided into two fields, one holding the odd lines and the other the even lines, giving typical field rates of 50 Hz (PAL) and 60 Hz (NTSC). Engineers introduced this technique, called interlaced video, in the early days to double the perceived frame rate of displays without consuming additional bandwidth: it sends half of the screen in one field and the other half in the next.


Introduction to H264#

H.264 is a standard for the video coding layer; its purpose is to compress video data to a smaller size.

Concepts#

SODB: String of data bits -> The raw encoded data
RBSP: Raw byte sequence payload -> Appends the RBSP trailing bits to the end of the SODB: one "1" bit followed by as many "0" bits as needed for byte alignment.
EBSP: Extended byte sequence payload -> Adds emulation prevention bytes (0x03) to the RBSP. In the Annex B byte stream format, a start code (StartCodePrefix) is placed before each NALU. If the NALU corresponds to a slice that starts a frame, the start code is the 4-byte form 0x00000001; otherwise, it is the 3-byte form 0x000001. To keep the NALU body from containing a byte pattern that looks like a start code, the encoder inserts a 0x03 byte whenever two consecutive 0x00 bytes would be followed by a byte in the range 0x00 to 0x03. The decoder removes these 0x03 bytes again, a step also known as de-shelling.
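The RBSP-to-EBSP step and its inverse can be sketched as below. This is a simplified illustration with hypothetical function names: it inserts 0x03 after two zero bytes whenever the next byte is 0x00 to 0x03, and it ignores edge cases at the very end of the payload.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """RBSP -> EBSP: insert 0x03 so the body never contains a start code."""
    out = bytearray()
    zeros = 0  # number of consecutive 0x00 bytes just written
    for byte in rbsp:
        if zeros >= 2 and byte <= 0x03:
            out.append(0x03)
            zeros = 0
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


def remove_emulation_prevention(ebsp: bytes) -> bytes:
    """EBSP -> RBSP: drop every 0x03 that follows two 0x00 bytes."""
    out = bytearray()
    zeros = 0
    for byte in ebsp:
        if zeros >= 2 and byte == 0x03:
            zeros = 0
            continue  # skip the emulation prevention byte
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


print(add_emulation_prevention(b"\x00\x00\x01").hex(" "))  # 00 00 03 01
```

The escaped payload `00 00 03 01` no longer contains the start code pattern `00 00 01`, and decoding restores the original bytes.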

The functionality of H.264 is divided into two layers: Video Coding Layer (VCL) and Network Abstraction Layer (NAL).

VCL data is the sequence of video data that has been compressed and encoded. It can only be transmitted or stored after being encapsulated into NAL units.


The encoded video sequence of H.264 consists of a series of NAL units, each containing an RBSP, as shown in Table 1. Coded slices (including data partition slices and IDR slices) are VCL NAL units, while the rest, such as parameter sets and end-of-sequence markers, are non-VCL NAL units. A typical RBSP unit sequence is shown in Figure 2.


Each unit is transmitted as an independent NAL unit. The information header of the unit (one byte) defines the type of the RBSP unit, while the rest of the NAL unit is the RBSP data.

NAL Unit#

Each NAL unit is a variable-length byte string with a defined syntax: a one-byte header that indicates the data type, followed by an integer number of payload bytes. A NAL unit can carry an encoded slice, an A/B/C type data partition, or a sequence or picture parameter set.

The NALU header consists of one byte with three fields. From the most significant bit to the least significant: F is the forbidden bit, occupying 1 bit; R is the importance indication (NRI), occupying 2 bits; and T is the payload data type, occupying 5 bits.

When carried over RTP, NAL units are transmitted in order according to the RTP sequence number. The three fields are used as follows:


The 5-bit NALU type field can represent 32 different NALU types. Types 1-12 are defined by H.264, while types 24-31 are left for use outside H.264. The RTP payload specification uses some of these values to define packet aggregation and fragmentation; the remaining values are reserved by H.264.
The importance indication bits mark how important a NAL unit is for reconstruction; the higher the value, the more important it is. A value of 0 indicates that this NAL unit is not used for prediction and can be discarded by the decoder without error propagation; a value greater than 0 indicates that this NAL unit is needed for drift-free reconstruction, and the higher the value, the greater the impact of losing it.
The forbidden bit defaults to 0 in encoding. When the network detects a bit error in the unit, it can set the bit to 1 so that the receiver discards the unit. This mainly helps across mixed network environments (such as a combination of wired and wireless links).

Common byte sequences at the start of a NALU in an H.264 stream (start code followed by the NALU header byte) include:

00 00 00 01 67 (SPS)
00 00 00 01 68 (PPS)
00 00 00 01 65 (IDR frame)
00 00 00 01 61 (P frame)

The bytes 67, 68, 65, and 61 above (and 41, which also occurs in streams) are NALU header bytes. Each one breaks down into three fields:

F: Forbidden bit, 0 indicates normal, 1 indicates error, generally 0.

NRI: Importance level, 11 indicates very important.

TYPE: Indicates the type of the NALU.

From the table below, it can be seen that type 7 corresponds to the sequence parameter set (SPS), 8 to the picture parameter set (PPS), 5 to an IDR slice (an I frame), and 1 to a non-IDR slice (a non-I frame).

{% asset_img h264-header.png h264-header %}

Thus, 0x61 and 0x41 are both P frames (type value 1) and differ only in importance level: 0x61 has NRI 11 (binary), while 0x41 has NRI 10 (binary).
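The three fields can be pulled out of a header byte with shifts and masks. The sketch below uses a hypothetical helper, `parse_nal_header`; the dictionary keys follow the H.264 syntax element names for the F, NRI, and TYPE fields.

```python
def parse_nal_header(header: int) -> dict:
    """Split a one-byte NALU header into its F, NRI, and TYPE fields."""
    return {
        "forbidden_zero_bit": (header >> 7) & 0x01,  # F, 1 bit
        "nal_ref_idc": (header >> 5) & 0x03,         # NRI, 2 bits
        "nal_unit_type": header & 0x1F,              # TYPE, 5 bits
    }


for byte in (0x67, 0x68, 0x65, 0x61, 0x41):
    print(hex(byte), parse_nal_header(byte))
# 0x67 {'forbidden_zero_bit': 0, 'nal_ref_idc': 3, 'nal_unit_type': 7}
# 0x68 {'forbidden_zero_bit': 0, 'nal_ref_idc': 3, 'nal_unit_type': 8}
# 0x65 {'forbidden_zero_bit': 0, 'nal_ref_idc': 3, 'nal_unit_type': 5}
# 0x61 {'forbidden_zero_bit': 0, 'nal_ref_idc': 3, 'nal_unit_type': 1}
# 0x41 {'forbidden_zero_bit': 0, 'nal_ref_idc': 2, 'nal_unit_type': 1}
```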

H264 (NAL Introduction and I Frame Determination)#


We continue to analyze the data corresponding to the stream in the top image layer by layer. The byte following 00 00 00 01 is the NALU header, which contains the NALU type. Written in binary and read from left to right, it breaks down as follows:
(1) The first bit is the forbidden bit; a value of 1 indicates a syntax error.

(2) The 2nd and 3rd bits are the reference level.

(3) Bits 4-8 are the NAL unit type.

For example, after 00000001, there are 67, 68, and 65.

The binary code for 0x67 is:

0110 0111

Bits 4-8 are 00111, converting to decimal gives 7, referring to the second image: 7 corresponds to the sequence parameter set SPS.

The binary code for 0x68 is:

0110 1000

Bits 4-8 are 01000, converting to decimal gives 8, referring to the second image: 8 corresponds to the picture parameter set PPS.

The binary code for 0x65 is:

0110 0101

Bits 4-8 are 00101, converting to decimal gives 5, referring to the second image: 5 corresponds to the IDR slice (I frame).

Thus, the test for whether a NALU is an I frame (IDR slice) is:

(NALU header & 0001 1111) == 5, i.e., (NALU header & 31) == 5. For example, 0x65 & 31 = 5.
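Applied to a whole stream, the same test needs the NALUs to be separated first. The sketch below splits an Annex B byte stream on start codes and flags IDR units. The names `split_annexb` and `is_idr` are hypothetical, and the sample bytes are made up for illustration rather than taken from a real stream.

```python
import re

# The 3-byte start code; the 4-byte form is the same pattern with one extra
# leading zero byte, so this pattern finds both.
START_CODE = re.compile(b"\x00\x00\x01")


def split_annexb(stream: bytes) -> list[bytes]:
    """Return the NALUs (header byte + payload) found in an Annex B stream."""
    starts = [m.end() for m in START_CODE.finditer(stream)]
    nalus = []
    for i, start in enumerate(starts):
        end = starts[i + 1] - 3 if i + 1 < len(starts) else len(stream)
        # Trailing zeros are the extra byte of a following 4-byte start code
        # (or padding); a NALU itself never ends in 0x00.
        nalu = stream[start:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
    return nalus


def is_idr(nalu: bytes) -> bool:
    """True when the NALU type in the header byte is 5 (IDR slice)."""
    return (nalu[0] & 0x1F) == 5


stream = bytes.fromhex("00000001 6742 00000001 68CE 000001 6588 00000001 619A")
for nalu in split_annexb(stream):
    print(hex(nalu[0]), is_idr(nalu))
# 0x67 False
# 0x68 False
# 0x65 True
# 0x61 False
```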

Detailed Explanation of RTP Packaging and Sending H264#

RFC 3984, since obsoleted by RFC 6184, is the specification of the RTP payload format for H.264 video streams.

Structure of H264 Stream


Single NALU#

In this mode, the payload that follows the 12-byte RTP header is relatively simple. Single NAL unit packets must be placed into the RTP stream in NAL unit decoding order. This mode is generally used for NALUs smaller than the MTU. A raw H.264 NALU typically consists of three parts: [Start Code] [NALU Header] [NALU Payload]. The start code marks the beginning of a NALU and must be "00 00 00 01" or "00 00 01". The NALU header is a single byte, and everything that follows is the content of the NALU. During packaging, the start code is removed and the remaining data is placed into the RTP packet.

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |F|NRI|  type   |                                               |
  +-+-+-+-+-+-+-+-+                                               |
  |                                                               |
  |               Bytes 2..n of a Single NAL unit                 |
  |                                                               |
  |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                               :...OPTIONAL RTP padding        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

For example, if an H.264 NALU is as follows:
[00 00 00 01 67 42 A0 1E 23 56 0E 2F …]

This is a sequence parameter set NAL unit. [00 00 00 01] is the four-byte start code, 67 is the NALU header, and 42 starts the NALU content. Encapsulated into an RTP packet, it will look like:
[RTP Header] [67 42 A0 1E 23 56 0E 2F]

Thus, only the 4-byte start code needs to be removed.
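The example above can be reproduced with a short sketch. It builds a minimal 12-byte RTP header (version 2, no padding, extension, or CSRC entries) and appends the NALU without its start code. The function names, the dynamic payload type 96, and the sequence number, timestamp, and SSRC values are all assumptions for illustration.

```python
import struct


def strip_start_code(nalu: bytes) -> bytes:
    """Remove a leading 4-byte or 3-byte Annex B start code, if present."""
    for code in (b"\x00\x00\x00\x01", b"\x00\x00\x01"):
        if nalu.startswith(code):
            return nalu[len(code):]
    return nalu


def rtp_header(seq: int, timestamp: int, ssrc: int,
               payload_type: int = 96, marker: bool = False) -> bytes:
    """Build a fixed 12-byte RTP header."""
    byte0 = 0x80  # version 2, no padding, no extension, CSRC count 0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def single_nalu_packet(annexb_nalu: bytes, seq: int, timestamp: int, ssrc: int) -> bytes:
    """RTP header followed by the NALU header byte and payload."""
    return rtp_header(seq, timestamp, ssrc) + strip_start_code(annexb_nalu)


packet = single_nalu_packet(bytes.fromhex("00000001 6742A01E23560E2F"),
                            seq=1, timestamp=0, ssrc=0x1234)
print(len(packet))           # 20
print(packet[12:].hex(" "))  # 67 42 a0 1e 23 56 0e 2f
```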

Packet Aggregation#

When NALUs are particularly small, several of them can be encapsulated in one RTP packet. To cope with the significant MTU differences between wired and wireless networks, the RTP payload format defines the following packet aggregation types:

STAP-A: Aggregated NALUs have the same timestamp, no DON (decoding order number);
STAP-B: Aggregated NALUs have the same timestamp, with DON;
MTAP16: Aggregated NALUs have different timestamps, with timestamp differences recorded in 16 bits;
MTAP24: Aggregated NALUs have different timestamps, with timestamp differences recorded in 24 bits;

During packet aggregation, the RTP timestamp is the minimum value of all NALU timestamps.

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|             one or more aggregation units                     |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3.  RTP payload format for aggregation packets

STAP-A example:

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 1 Data                           |
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 2 Data                           |
:                                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7.  An example of an RTP packet including an STAP-A
containing two single-time aggregation units
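The STAP-A layout in the figure can be sketched as follows: one STAP-A header byte (type 24), then for each NALU a 16-bit size in network byte order followed by the NALU itself. The function names are hypothetical, the sample SPS and PPS bytes are made up, and the header's F and NRI values are taken here as the maximum over the aggregated NALUs.

```python
import struct

STAP_A = 24


def build_stap_a(nalus: list[bytes]) -> bytes:
    """Build an STAP-A RTP payload from NALUs that share one timestamp."""
    f_bit = max(n[0] & 0x80 for n in nalus)  # set if any NALU has F set
    nri = max(n[0] & 0x60 for n in nalus)    # highest importance wins
    payload = bytearray([f_bit | nri | STAP_A])
    for nalu in nalus:
        payload += struct.pack("!H", len(nalu)) + nalu
    return bytes(payload)


def parse_stap_a(payload: bytes) -> list[bytes]:
    """Split an STAP-A RTP payload back into its NALUs."""
    if payload[0] & 0x1F != STAP_A:
        raise ValueError("not an STAP-A payload")
    nalus, pos = [], 1
    while pos + 2 <= len(payload):
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        nalus.append(payload[pos:pos + size])
        pos += size
    return nalus


sps = bytes.fromhex("6742A01E")
pps = bytes.fromhex("68CE3880")
stap = build_stap_a([sps, pps])
print(stap.hex(" "))  # 78 00 04 67 42 a0 1e 00 04 68 ce 38 80
```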

FU-A Fragmentation Format#

Larger H.264 NALUs are sent via RTP fragmentation. When the length of a NALU exceeds the MTU, the NALU must be split across several RTP packets, each carrying a Fragmentation Unit (FU). In the FU-A format, the FU-A payload follows the 12-byte RTP header.

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU indicator  |   FU header   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 14.  RTP payload format for FU-A

The FU indicator has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

A Type value of 28 in the FU indicator identifies an FU-A. The NRI field must be set to the NRI value of the fragmented NAL unit.

The format of the FU header is as follows:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|R|  Type   |
+---------------+

S: 1 bit, when set to 1 indicates this is the first fragment of the NALU. When the following FU payload is not the start of the fragmented NAL unit payload, the start bit is set to 0.

E: 1 bit, when set to 1 indicates this is the last fragment of the NALU, meaning the last byte of the payload is also the last byte of the fragmented NAL unit. When the following FU payload is not the last fragment of the fragmented NAL unit, the end bit is set to 0.

R: 1 bit, reserved. It must be set to 0, and the receiver must ignore it.

Type: 5 bits, the NAL unit payload type, defined in the table below.

Summary of Unit Types and Payload Structures

  Type   Packet     Type name
  ---------------------------------------------------------
  0      undefined  -
  1-23   NAL unit   Single NAL unit packet per H.264
  24     STAP-A     Single-time aggregation packet
  25     STAP-B     Single-time aggregation packet
  26     MTAP16     Multi-time aggregation packet
  27     MTAP24     Multi-time aggregation packet
  28     FU-A       Fragmentation unit
  29     FU-B       Fragmentation unit
  30-31  undefined  -

Fragmentation and Reassembly

Fragmentation: When the sender splits an original NAL unit into FU-A fragments, the original NAL header maps onto the FU indicator and FU header as follows:

The first three bits of the original NAL header (F and NRI) become the first three bits of the FU indicator, and the last five bits of the original NAL header (Type) become the last five bits of the FU header.

The remaining bits (Type = 28 in the FU indicator; S, E, and R in the FU header) are set according to each fragment's position.
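The sender side can be sketched as below. The function name `fragment_fu_a` and the `max_payload` parameter (the largest RTP payload the sender is willing to emit, in bytes) are assumptions for illustration. The original NAL header byte is not sent as such; its bits are carried in the FU indicator and FU header.

```python
FU_A = 28


def fragment_fu_a(nalu: bytes, max_payload: int) -> list[bytes]:
    """Split one NALU (header byte + payload) into FU-A RTP payloads."""
    if len(nalu) <= max_payload:
        raise ValueError("NALU fits in a single NAL unit packet; no fragmentation needed")
    chunk = max_payload - 2  # room left after FU indicator and FU header
    if chunk <= 0:
        raise ValueError("max_payload too small for FU-A")

    header, body = nalu[0], nalu[1:]
    fu_indicator = (header & 0xE0) | FU_A  # F and NRI from the original header
    nal_type = header & 0x1F               # original type goes in the FU header

    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    fragments = []
    for i, piece in enumerate(pieces):
        start = 0x80 if i == 0 else 0x00              # S bit on the first fragment
        end = 0x40 if i == len(pieces) - 1 else 0x00  # E bit on the last fragment
        fragments.append(bytes([fu_indicator, start | end | nal_type]) + piece)
    return fragments


nalu = bytes([0x65]) + bytes(range(10))
for frag in fragment_fu_a(nalu, max_payload=6):
    print(frag.hex(" "))
# 7c 85 00 01 02 03
# 7c 05 04 05 06 07
# 7c 45 08 09
```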

Reassembly: When the receiver gets FU-A fragments, it must combine all the fragment packets back into the original NAL unit. The FU-A headers relate to the restored NAL header as follows:

The eight bits of the restored NAL header are composed of the first three bits of the FU indicator and the last five bits of the FU header, i.e.:

nal_header = (fu_indicator & 0xe0) | (fu_header & 0x1f)
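The receiver side can be sketched in the same way. The function below assumes the fragments of one NALU have already been collected in RTP sequence order with none missing; the name `reassemble_fu_a` and the hand-built sample fragments are hypothetical.

```python
FU_A = 28


def reassemble_fu_a(fragments: list[bytes]) -> bytes:
    """Rebuild the original NALU from an ordered list of FU-A RTP payloads."""
    if any((f[0] & 0x1F) != FU_A for f in fragments):
        raise ValueError("not an FU-A payload")
    first, last = fragments[0], fragments[-1]
    if not (first[1] & 0x80):
        raise ValueError("first fragment lacks the S bit")
    if not (last[1] & 0x40):
        raise ValueError("last fragment lacks the E bit")

    # F and NRI come from the FU indicator, the type from the FU header.
    nal_header = (first[0] & 0xE0) | (first[1] & 0x1F)
    body = b"".join(f[2:] for f in fragments)  # drop the two FU bytes
    return bytes([nal_header]) + body


fragments = [
    bytes([0x7C, 0x85, 0x01, 0x02]),  # S = 1, type 5
    bytes([0x7C, 0x05, 0x03, 0x04]),  # middle fragment
    bytes([0x7C, 0x45, 0x05]),        # E = 1
]
print(reassemble_fu_a(fragments).hex(" "))  # 65 01 02 03 04 05
```

A real receiver would also need to handle loss and reordering, for example by checking RTP sequence numbers before joining fragments.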
