fwrite


Images and Videos

Image Encoding#

An image can be represented using a two-dimensional matrix, where each point in the matrix is called a pixel. The color of each pixel is represented using the three primary colors: red, green, and blue.

Each pixel can be represented using different bit depths, with common quantization bit depths being 16 bits, 24 bits, and 32 bits.

  • 24-bit color mode: Each pixel is encoded with 24 bits (bits per pixel, bpp) RGB values: using three 8-bit unsigned integers (0 to 255) to represent the intensity of red, green, and blue.
  • 16-bit color mode: Red and blue are each allocated 5 bits and green gets 6 bits (RGB565), because the human eye is more sensitive to green hues. In some variants (RGB555), each primary color occupies 5 bits and the remaining 1 bit is unused.
  • 32-bit color mode: Similar to the 24-bit color mode, with the remaining 8 bits used to represent the pixel's transparency (Alpha).
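
As a rough illustration of the bit layouts above, the following Python sketch packs 8-bit channel values into 16-bit, 24-bit, and 32-bit pixels. The function names and the channel order (alpha, red, green, blue from the most significant bits down) are assumptions made for this example; real pixel formats differ in channel order and byte order.

```python
def pack_rgb565(r: int, g: int, b: int) -> int:
    """Pack 8-bit R, G, B into 16 bits: 5 bits red, 6 bits green, 5 bits blue."""
    # Drop the low bits of each channel to fit the smaller field.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def pack_rgb888(r: int, g: int, b: int) -> int:
    """Pack 8-bit R, G, B into 24 bits, one byte per channel."""
    return (r << 16) | (g << 8) | b


def pack_argb8888(a: int, r: int, g: int, b: int) -> int:
    """Pack 8-bit alpha plus RGB into 32 bits (alpha in the top byte)."""
    return (a << 24) | pack_rgb888(r, g, b)


if __name__ == "__main__":
    print(hex(pack_rgb565(255, 0, 0)))          # 0xf800
    print(hex(pack_rgb888(255, 128, 0)))        # 0xff8000
    print(hex(pack_argb8888(255, 255, 128, 0))) # 0xffff8000
```

Note that the 16-bit packing is lossy: the low 2 or 3 bits of each channel are discarded.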

Another important attribute of an image is its resolution, represented as width x height.

When processing images or videos, another attribute is the aspect ratio, which describes the proportional relationship between the width and height of the image or pixel. Common ratios include 4:3, 16:9, and 21:9, typically referring to the display aspect ratio (DAR). Similarly, pixels can also have different aspect ratios, referred to as pixel aspect ratio (PAR).

YUV Color Model#

RGB matches how the human eye perceives color, while YUV exploits the fact that vision is more sensitive to brightness than to color. Y represents luminance (luma), and U and V represent chrominance (chroma), carried as Cb (blue difference) and Cr (red difference), respectively. A black-and-white picture needs only Y and can omit U and V. This is why YUV is commonly written as YCbCr.

To save bandwidth, most YUV formats average fewer than 24 bits per pixel by sampling chroma at a lower resolution than luma. The main sampling formats include YCbCr 4:2:0, YCbCr 4:2:2, YCbCr 4:1:1, and YCbCr 4:4:4. Each format is written as an A:B:C ratio that describes how the chroma components are sampled relative to luma:

  • 4:4:4 indicates full chroma sampling.
  • 4:2:2 indicates 2:1 horizontal chroma sampling, with vertical fully sampled.
  • 4:2:0 indicates 2:1 horizontal chroma sampling, with 2:1 vertical sampling.
  • 4:1:1 indicates 4:1 horizontal chroma sampling, with vertical fully sampled.
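
To see why these formats average fewer than 24 bits per pixel, the Python sketch below computes the average bits per pixel from the horizontal and vertical chroma subsampling factors. It assumes 8 bits per sample and two chroma planes (Cb and Cr); the names `SCHEMES` and `average_bits_per_pixel` are invented for this example.

```python
# (horizontal factor, vertical factor) of chroma subsampling per scheme
SCHEMES = {
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
    "4:1:1": (4, 1),
}


def average_bits_per_pixel(h_factor: int, v_factor: int, bit_depth: int = 8) -> float:
    """Average bits per pixel: one luma sample per pixel plus shared chroma."""
    # Cb and Cr each contribute 1 / (h_factor * v_factor) samples per pixel.
    chroma_samples_per_pixel = 2 / (h_factor * v_factor)
    return bit_depth * (1 + chroma_samples_per_pixel)


if __name__ == "__main__":
    for name, (h, v) in SCHEMES.items():
        print(name, average_bits_per_pixel(h, v))
    # 4:4:4 24.0
    # 4:2:2 16.0
    # 4:2:0 12.0
    # 4:1:1 12.0
```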

Conversion between RGB and YUV#

# First, calculate luminance
Y = 0.299R + 0.587G + 0.114B
# Once luminance is obtained, color (chrominance blue and red) can be separated:
Cb = 0.564(B - Y)
Cr = 0.713(R - Y)
# Conversion can also be done using YCbCr to obtain RGB
R = Y + 1.402Cr
B = Y + 1.772Cb
G = Y - 0.344Cb - 0.714Cr
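
The formulas above translate directly into code. The Python sketch below is a minimal version that uses the same coefficients; it assumes floating-point values and leaves Cb and Cr centered on zero (no +128 offset and no clamping, which a real 8-bit pipeline would typically add). Because the coefficients are rounded, the round trip is only approximately exact.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert RGB to Y, Cb, Cr using the coefficients from the text."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return y, cb, cr


def ycbcr_to_rgb(y: float, cb: float, cr: float) -> tuple[float, float, float]:
    """Convert Y, Cb, Cr back to RGB."""
    r = y + 1.402 * cr
    g = y - 0.344 * cb - 0.714 * cr
    b = y + 1.772 * cb
    return r, g, b


if __name__ == "__main__":
    y, cb, cr = rgb_to_ycbcr(255, 0, 0)       # pure red
    print(round(y, 3), round(cb, 3), round(cr, 3))
    print([round(c) for c in ycbcr_to_rgb(y, cb, cr)])  # [255, 0, 0]
```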

Concepts of Sampling Rate, Bit Rate, Frames, and Fields#

An image is obtained by sampling and quantizing an analog signal, while a video consists of a series of images. The higher the resolution and quantization bit depth at capture time, the more information each image carries and the clearer the picture. Video also has a sampling frequency (the frame rate): the number of images captured per unit of time. The useful range is bounded by human vision. At 10 to 20 frames per second the eye perceives motion as choppy; at 20 to 30 frames per second it appears smooth; beyond that the difference becomes harder to perceive, which is why films are typically shot at 24 or 30 frames per second.

The number of bits needed to represent one second of video is called the bit rate, also known as the data rate. For uncompressed video, bit rate = width × height × bit depth × frames per second. For example, with no compression, a video at 30 frames per second, 24 bits per pixel, and a resolution of 480x240 requires 82,944,000 bits/second, or 82.944 Mbps (30 × 480 × 240 × 24).
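
The arithmetic in this example can be checked with a few lines of Python. The function name is made up for this sketch, and it only covers the uncompressed case described above.

```python
def uncompressed_bit_rate(width: int, height: int, bits_per_pixel: int, fps: int) -> int:
    """Bits per second needed to carry raw video with no compression."""
    return width * height * bits_per_pixel * fps


if __name__ == "__main__":
    rate = uncompressed_bit_rate(480, 240, 24, 30)
    print(rate)              # 82944000
    print(rate / 1_000_000)  # 82.944 (Mbps)
```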

When the bit rate is nearly constant, it is referred to as constant bit rate (CBR); when it is allowed to vary, it is called variable bit rate (VBR). With a constrained VBR, for example, few bits are spent while the frame is black.

In video sampling, a complete image obtained through progressive scanning is called a frame, with a typical frame rate of 25 frames per second (PAL) or 30 frames per second (NTSC). With interlaced scanning, each frame is divided into two fields, one holding the odd lines and one the even lines, with typical field rates of 50 Hz (PAL) or 60 Hz (NTSC). Engineers introduced this technique, known as interlaced video, in the early days to double the perceived frame rate of displays without consuming additional bandwidth: it sends half of the lines in one field and the other half in the next.

Introduction to H264#

H.264 is a video coding standard; its purpose is to compress video so that it takes less storage space and bandwidth.

Concepts#

  • SODB: String of data bits -> The raw encoded data.
  • RBSP: Raw byte sequence payload -> The SODB followed by RBSP trailing bits (one "1" bit, then as many "0" bits as needed) so that the data ends on a byte boundary.
  • EBSP: Encapsulated byte sequence payload -> The RBSP with emulation prevention bytes (0x03) inserted. The reason: in the Annex B byte stream format, a start code (StartCodePrefix) must precede each NALU. If the NALU is a slice that starts a frame, the start code is the 4-byte form, 0x00000001; otherwise it is the 3-byte form, 0x000001. To keep the NALU body from containing a byte pattern that looks like a start code, the encoder inserts a 0x03 byte whenever two consecutive 0x00 bytes are followed by a byte in the range 0x00 to 0x03. The decoder removes these 0x03 bytes, a step also known as de-encapsulation.
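
The emulation prevention step can be sketched in Python as follows. This is a minimal illustration rather than a complete implementation: the function names are invented here, the rule used is "insert 0x03 after two zero bytes when the next byte is 0x00 to 0x03", and the special case of an RBSP that ends in a zero byte is not handled.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """RBSP -> EBSP: insert 0x03 so no start-code-like pattern remains."""
    out = bytearray()
    zeros = 0  # number of consecutive 0x00 bytes just written
    for byte in rbsp:
        if zeros >= 2 and byte <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zeros = 0
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


def remove_emulation_prevention(ebsp: bytes) -> bytes:
    """EBSP -> RBSP: drop every 0x03 that follows two 0x00 bytes."""
    out = bytearray()
    zeros = 0
    for byte in ebsp:
        if zeros >= 2 and byte == 0x03:
            zeros = 0
            continue  # skip the emulation prevention byte
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


if __name__ == "__main__":
    raw = bytes.fromhex("25000001ff")
    escaped = add_emulation_prevention(raw)
    print(escaped.hex())  # 2500000301ff
    print(remove_emulation_prevention(escaped) == raw)  # True
```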

H.264's functionality is divided into two layers: Video Coding Layer (VCL) and Network Abstraction Layer (NAL).

VCL data is the sequence of video data that has been compressed and encoded. This VCL data must be encapsulated into NAL units before it can be transmitted or stored.

The encoded video sequence in H.264 consists of a series of NAL units, each containing an RBSP, as shown in Table 1. Coded slices (including data partition slices and IDR slices) are VCL NAL units; the rest, such as parameter sets and end-of-sequence markers, are non-VCL NAL units. A typical RBSP unit sequence is shown in Figure 2.

Each unit is transmitted as an independent NAL unit. The unit's header (one byte) defines the type of RBSP unit, while the remainder of the NAL unit is the RBSP data.

NAL Units#

Each NAL unit is a variable-length byte string with a defined syntax: a one-byte header that indicates the data type, followed by an integer number of payload bytes. A NAL unit can carry a coded slice, a type A/B/C data partition, or a sequence or picture parameter set.

The NALU header consists of one byte, laid out as follows:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

When carried over RTP, NAL units are transmitted in order according to the RTP sequence number. In the header, Type is the payload data type and occupies 5 bits; NRI is the importance indicator and occupies 2 bits; and the leading F is the forbidden bit, occupying 1 bit. Specifically:

  1. The 5-bit NALU type field can represent 32 different NALU types. Types 1-12 are defined by H.264, while types 24-31 are used for purposes outside H.264. The RTP payload specification uses some of these values to define packet aggregation and fragmentation, while the other values are reserved for H.264.
  2. The importance indicator bit is used to mark the importance of a NAL unit during reconstruction; the higher the value, the more important it is. A value of 0 indicates that this NAL unit is not used for prediction and can be discarded by the decoder without error propagation; a value greater than 0 indicates that this NAL unit is to be used for drift-free reconstruction, and the higher the value, the greater the impact of losing this NAL unit.
  3. The forbidden bit has a default value of 0 in encoding; when the network detects a bit error in this unit, it can be set to 1 so that the receiver discards this unit, mainly used to adapt to different types of network environments (e.g., a combination of wired and wireless environments).

Common start code and NALU header byte sequences in H.264 streams include:

  • 00 00 00 01 67 (SPS)
  • 00 00 00 01 68 (PPS)
  • 00 00 00 01 65 (IDR frame)
  • 00 00 00 01 61 (P frame)

The bytes 67, 68, 65, and 61 above, as well as others such as 41, are NALU header bytes. Each one breaks down into three fields:

F: The forbidden bit, 0 indicates normal, 1 indicates error, generally 0.

NRI: Importance level, 11 indicates very important.

TYPE: Indicates what type this NALU is.

From the table below, it can be seen that type 7 is the sequence parameter set (SPS), 8 is the picture parameter set (PPS), 5 is a slice of an IDR picture (I frame), and 1 is a slice of a non-IDR picture.

{% asset_img h264-header.png h264-header %}

Thus, 61 and 41 are both P frames (type value 1) that differ only in importance level: the NRI of 0x61 is binary 11, while the NRI of 0x41 is binary 10.
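
The field breakdown above can be reproduced with a few bit operations. The Python sketch below is illustrative only; the `NalHeader` type and `parse_nal_header` function are names chosen for this example, with field names borrowed from the H.264 syntax element names.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # F, 1 bit
    nal_ref_idc: int         # NRI, 2 bits
    nal_unit_type: int       # Type, 5 bits


def parse_nal_header(byte: int) -> NalHeader:
    """Split a one-byte NALU header into F, NRI, and Type."""
    return NalHeader(
        forbidden_zero_bit=(byte >> 7) & 0x01,
        nal_ref_idc=(byte >> 5) & 0x03,
        nal_unit_type=byte & 0x1F,
    )


if __name__ == "__main__":
    for header in (0x67, 0x68, 0x65, 0x61, 0x41):
        print(hex(header), parse_nal_header(header))
    # 0x61 and 0x41 share nal_unit_type 1 but have nal_ref_idc 3 and 2.
```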

H264 (NAL Introduction and I Frame Judgment)#

Let's continue to analyze the stream data layer by layer. The byte that follows each 00 00 00 01 start code is the NALU header. Converted to binary and read from left to right, it breaks down as follows:

(1) The 1st bit is the forbidden bit; a value of 1 indicates a syntax error.

(2) The 2nd and 3rd bits are the reference level (NRI).

(3) The 4th to 8th bits are the NAL unit type.

For example, after 00 00 00 01 come the bytes 67, 68, and 65.

The binary code for 0x67 is:

0110 0111

Bits 4-8 are 00111, which is 7 in decimal. Per the type table, 7 corresponds to the sequence parameter set (SPS).

The binary code for 0x68 is:

0110 1000

Bits 4-8 are 01000, which is 8 in decimal. Per the type table, 8 corresponds to the picture parameter set (PPS).

The binary code for 0x65 is:

0110 0101

Bits 4-8 are 00101, which is 5 in decimal. Per the type table, 5 corresponds to the IDR slice (I frame).

Therefore, the test for an I frame is:

(NALU header & 0001 1111) == 5, i.e., (NALU header & 31) == 5. For example, 0x65 & 31 = 5.
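
Putting the start code split and the I-frame test together, a minimal Python sketch might look like the following. It assumes an Annex B byte stream held entirely in memory; the function names are invented for this example and the sample bytes are made up purely to exercise the code.

```python
def split_annexb(stream: bytes) -> list[bytes]:
    """Split an Annex B byte stream into NALUs (start codes removed)."""
    nalus = []
    pos = stream.find(b"\x00\x00\x01")
    while pos != -1:
        begin = pos + 3
        nxt = stream.find(b"\x00\x00\x01", begin)
        end = len(stream) if nxt == -1 else nxt
        # A trailing 0x00 here is the first byte of a 4-byte start code.
        nalu = stream[begin:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
        pos = nxt
    return nalus


def is_idr(nalu: bytes) -> bool:
    """True when the NALU header carries type 5 (IDR slice)."""
    return (nalu[0] & 0x1F) == 5


if __name__ == "__main__":
    # Made-up payload bytes; only the start codes and header bytes matter.
    stream = bytes.fromhex(
        "00000001 6742A01E 00000001 68CE3880 000001 65888420 000001 419A02"
    )
    for nalu in split_annexb(stream):
        print(hex(nalu[0]), nalu[0] & 0x1F, is_idr(nalu))
```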

RTP Packaging and Sending H264 Detailed Explanation#

RFC 3984 (since obsoleted by RFC 6184) specifies the RTP payload format for transmitting H.264 video.

Figure: Structure of an H.264 stream

Single NALU#

In single NAL unit mode, the payload that follows the 12-byte RTP header is simple: it is exactly one NAL unit. Single NAL unit packets must be sent in NAL unit decoding order. This mode is generally used for NALUs shorter than the MTU. A raw H.264 NAL unit typically consists of three parts: [Start Code] [NALU Header] [NALU Payload]. The start code marks the beginning of a NAL unit and must be "00 00 00 01" or "00 00 01". The NALU header is a single byte, and the rest is the content of the NAL unit.

When packaging, the start code "00 00 01" or "00 00 00 01" is removed, and the remaining data is packaged into the RTP packet.

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |F|NRI|  type   |                                               |
  +-+-+-+-+-+-+-+-+                                               |
  |                                                               |
  |               Bytes 2..n of a Single NAL unit                 |
  |                                                               |
  |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                               :...OPTIONAL RTP padding        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

For example, if an H.264 NALU is like this:
[00 00 00 01 67 42 A0 1E 23 56 0E 2F …]

This is a sequence parameter set NAL unit. [00 00 00 01] is the four-byte start code, 67 is the NALU header, and 42 starts the NALU content.

Encapsulated into an RTP packet, it will look like:
[RTP Header] [67 42 A0 1E 23 56 0E 2F]

This means just removing the four-byte start code is sufficient.
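
The single NAL unit case can be sketched in Python as below. This is an illustration under stated assumptions, not a complete sender: the fixed 12-byte RTP header is built with version 2 and no padding, extension, or CSRC entries; payload type 96 is an arbitrary dynamic value; and the sequence number, timestamp, and SSRC in the usage lines are placeholders.

```python
import struct


def strip_start_code(nalu: bytes) -> bytes:
    """Remove a leading 4-byte or 3-byte Annex B start code, if present."""
    for code in (b"\x00\x00\x00\x01", b"\x00\x00\x01"):
        if nalu.startswith(code):
            return nalu[len(code):]
    return nalu


def rtp_header(seq: int, timestamp: int, ssrc: int,
               payload_type: int = 96, marker: bool = False) -> bytes:
    """Build a fixed 12-byte RTP header (V=2, P=0, X=0, CC=0)."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def single_nal_packet(annexb_nalu: bytes, seq: int, timestamp: int, ssrc: int) -> bytes:
    """RTP header followed by the NALU with its start code removed."""
    return rtp_header(seq, timestamp, ssrc) + strip_start_code(annexb_nalu)


if __name__ == "__main__":
    sps = bytes.fromhex("00000001 6742A01E23560E2F")
    packet = single_nal_packet(sps, seq=1, timestamp=90000, ssrc=0x1234)
    print(len(packet))         # 20
    print(packet[12:].hex())   # 6742a01e23560e2f
```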

Packet Aggregation#

When NALUs are particularly small, several of them can be encapsulated in one RTP packet. To cope with the large differences in MTU between wired and wireless networks, the RTP payload format defines several packet aggregation types:

  • STAP-A: Aggregated NALUs have the same timestamp, without DON (decoding order number);
  • STAP-B: Aggregated NALUs have the same timestamp, with DON;
  • MTAP16: Aggregated NALUs have different timestamps, with the timestamp difference recorded in 16 bits;
  • MTAP24: Aggregated NALUs have different timestamps, with the timestamp difference recorded in 24 bits;

During packet aggregation, the RTP timestamp is set to the minimum of the timestamps of all aggregated NALUs.

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|             one or more aggregation units                     |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3.  RTP payload format for aggregation packets

STAP-A Example:

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 1 Data                           |
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 2 Data                           |
:                                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7.  An example of an RTP packet including an STAP-A
containing two single-time aggregation units
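
The STAP-A layout in the figure can be sketched in Python as follows. This is a minimal example of the payload only (the RTP header is left out), with invented function names. It assumes the aggregated NALUs already have their start codes removed, sets the F bit if any aggregated NALU has it set, and uses the highest NRI among them.

```python
import struct

STAP_A = 24  # packet type value for STAP-A


def build_stap_a(nalus: list[bytes]) -> bytes:
    """Aggregate NALUs into one STAP-A payload: header, then (size, NALU) pairs."""
    f_bit = max(n[0] >> 7 for n in nalus)
    nri = max((n[0] >> 5) & 0x03 for n in nalus)
    payload = bytearray([(f_bit << 7) | (nri << 5) | STAP_A])
    for nalu in nalus:
        payload += struct.pack("!H", len(nalu))  # 16-bit size, network byte order
        payload += nalu
    return bytes(payload)


def parse_stap_a(payload: bytes) -> list[bytes]:
    """Recover the aggregated NALUs from a STAP-A payload."""
    if payload[0] & 0x1F != STAP_A:
        raise ValueError("not a STAP-A payload")
    nalus, pos = [], 1
    while pos + 2 <= len(payload):
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        nalus.append(payload[pos:pos + size])
        pos += size
    return nalus


if __name__ == "__main__":
    sps, pps = bytes.fromhex("6742A01E"), bytes.fromhex("68CE3880")
    stap = build_stap_a([sps, pps])
    print(stap.hex())                        # 7800046742a01e000468ce3880
    print(parse_stap_a(stap) == [sps, pps])  # True
```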

FU-A Fragmentation Format#

Large H.264 NALUs are carried using fragmentation. When the length of a NALU exceeds the MTU, the NALU must be split across several RTP packets, each carrying a Fragmentation Unit (FU). In FU-A mode, the FU-A payload follows the 12-byte RTP header:

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU indicator  |   FU header   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 14.  RTP payload format for FU-A

The FU indicator has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

A Type value of 28 in the FU indicator identifies the packet as FU-A. The NRI field must be set to the NRI value of the fragmented NAL unit.

The FU header format is as follows:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|R|  Type   |
+---------------+

S: 1 bit, when set to 1 indicates this is the first fragment of the NALU. When the following FU payload is not the start of the fragmented NAL unit payload, the start bit is set to 0.

E: 1 bit, when set to 1 indicates this is the last fragment of the NALU, meaning the last byte of the payload is also the last byte of the fragmented NAL unit. When the following FU payload is not the last fragment of the fragmented NAL unit, the end bit is set to 0.

R: 1 bit, reserved bit must be set to 0, the receiver must ignore this bit.

Type: 5 bits, the NAL unit payload type. The values are listed in the table below.

Summary of Unit Types and Payload Structures

  Type   Packet     Type name
  ---------------------------------------------------------
  0      undefined  -
  1-23   NAL unit   Single NAL unit packet per H.264
  24     STAP-A     Single-time aggregation packet
  25     STAP-B     Single-time aggregation packet
  26     MTAP16     Multi-time aggregation packet
  27     MTAP24     Multi-time aggregation packet
  28     FU-A       Fragmentation unit
  29     FU-B       Fragmentation unit
  30-31  undefined  -

Fragmentation and Reassembly

Fragmentation: When the sender splits an original NAL unit into FU-A fragments, the original NAL header maps onto the FU-A headers as follows:

The first three bits of the original NAL header (F and NRI) become the first three bits of the FU indicator, and the last five bits of the original NAL header (Type) become the last five bits of the FU header.

The remaining bits depend on the situation: the Type field of the FU indicator is 28 for FU-A, and the S, E, and R bits of the FU header mark the position of the fragment.

Reassembly: When the receiver gets FU-A fragments, it combines all of the fragments to restore the original NAL unit. The relationship between the FU-A headers and the restored NAL header is as follows:

The eight bits of the restored NAL header consist of the first three bits of the FU indicator and the last five bits of the FU header, i.e.:

nal_header = (fu_indicator & 0xe0) | (fu_header & 0x1f)
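
Both directions can be sketched in Python. The example below is a simplified illustration with invented function names: it produces only the RTP payloads (no RTP headers), and the reassembly side assumes the fragments arrive complete and in order, which a real receiver would have to verify using RTP sequence numbers.

```python
FU_A = 28  # Type value of the FU indicator for FU-A


def fragment_fu_a(nalu: bytes, max_payload: int) -> list[bytes]:
    """Split one NALU (no start code) into FU-A payloads of at most max_payload bytes."""
    if max_payload < 3:
        raise ValueError("max_payload must leave room for at least one data byte")
    if len(nalu) <= max_payload:
        raise ValueError("NALU fits in a single NAL unit packet; no fragmentation needed")
    header, body = nalu[0], nalu[1:]
    fu_indicator = (header & 0xE0) | FU_A      # F and NRI from the NALU, Type = 28
    chunk = max_payload - 2                    # 2 bytes go to FU indicator + FU header
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    packets = []
    for index, piece in enumerate(pieces):
        start = 0x80 if index == 0 else 0x00              # S bit
        end = 0x40 if index == len(pieces) - 1 else 0x00  # E bit
        fu_header = start | end | (header & 0x1F)         # R bit stays 0
        packets.append(bytes([fu_indicator, fu_header]) + piece)
    return packets


def reassemble_fu_a(packets: list[bytes]) -> bytes:
    """Rebuild the original NALU from in-order FU-A payloads."""
    if not packets[0][1] & 0x80 or not packets[-1][1] & 0x40:
        raise ValueError("missing first or last fragment")
    nal_header = (packets[0][0] & 0xE0) | (packets[0][1] & 0x1F)
    return bytes([nal_header]) + b"".join(p[2:] for p in packets)


if __name__ == "__main__":
    nalu = bytes([0x65]) + bytes(range(1, 11))  # IDR header + 10 made-up bytes
    fragments = fragment_fu_a(nalu, max_payload=6)
    print([f[:2].hex() for f in fragments])     # ['7c85', '7c05', '7c45']
    print(reassemble_fu_a(fragments) == nalu)   # True
```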

References#

Introduction to Video Technology
