Concept#
- SODB: String of data bits -> The most primitive encoded data
- RBSP: Raw byte sequence payload -> The SODB followed by the RBSP trailing bits: one stop bit "1" and then as many "0" bits as are needed for byte alignment.
- EBSP: Encapsulated byte sequence payload -> The RBSP with emulation prevention bytes (0x03) added. The reason: in the Annex B byte stream format, a start code (StartCodePrefix) must precede each NALU. If the slice carried by the NALU is the start of a frame, the 4-byte code 0x00000001 is used; otherwise the 3-byte code 0x000001 is used. To keep the NALU body from containing a sequence that looks like a start code, the encoder inserts a 0x03 byte whenever two consecutive 0x00 bytes are followed by a byte of value 0x00, 0x01, 0x02, or 0x03. The decoder removes the 0x03 again. This is known as emulation prevention.
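The emulation prevention step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function names `add_emulation_prevention` and `remove_emulation_prevention` are made up for this example, and corner cases such as trailing zero words at the end of the RBSP are not handled.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """RBSP -> EBSP: insert 0x03 after 00 00 when the next byte is <= 0x03."""
    out = bytearray()
    zeros = 0  # consecutive 0x00 bytes written so far
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(ebsp: bytes) -> bytes:
    """EBSP -> RBSP: drop every 0x03 that directly follows 00 00."""
    out = bytearray()
    zeros = 0
    for b in ebsp:
        if zeros >= 2 and b == 0x03:
            zeros = 0  # skip the inserted byte
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


raw = bytes([0x25, 0x00, 0x00, 0x01, 0x88])
escaped = add_emulation_prevention(raw)
print(escaped.hex(" "))
print(remove_emulation_prevention(escaped) == raw)
```

After escaping, the payload no longer contains the 3-byte pattern 00 00 01, so a parser that scans for start codes cannot mistake payload bytes for a NALU boundary.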
H.264 functionality is divided into two layers: the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL).
VCL data is the compressed, encoded video data sequence. It must be encapsulated into NAL units before it can be transmitted or stored.
An encoded H.264 video sequence consists of a series of NAL units, each containing an RBSP, as shown in Table 1. Coded slices (including data-partitioned slices and IDR slices) are VCL NAL units; the rest, such as parameter sets and the end-of-sequence marker, are non-VCL NAL units. A typical RBSP unit sequence is shown in Figure 2.
Each RBSP unit is transmitted as an independent NAL unit. The one-byte header of the unit defines the type of the RBSP unit, and the rest of the NAL unit is the RBSP data.
NAL Unit#
Each NAL unit is a variable-length byte string with a defined syntax: a one-byte header that indicates the data type, followed by a payload of a whole number of bytes. A NAL unit can carry an encoded slice, a type A/B/C data partition, or a sequence or picture parameter set.
The NALU header consists of one byte with three fields, from the most significant bit down: F, R (NRI), and T (Type).
NAL units are transmitted in order of RTP sequence number. In the header, T is the payload data type, occupying 5 bits; R is the importance indication, occupying 2 bits; F is the forbidden bit, occupying 1 bit. Specifically:
- The NALU type field can represent 32 different NALU types. Types 1 to 12 are defined by H.264, while types 24 to 31 are left for use outside H.264. The RTP payload specification uses some of these values for packet aggregation and fragmentation; the remaining values are reserved for H.264.
- The importance indication bit is used to mark the importance of a NAL unit during reconstruction; the higher the value, the more important it is. A value of 0 indicates that this NAL unit is not used for prediction and can be discarded by the decoder without error propagation; a value greater than 0 indicates that this NAL unit is to be used for drift-free reconstruction, and the higher the value, the greater the impact of losing this NAL unit.
- The forbidden bit has a default value of 0 in encoding. When the network detects a bit error in this unit, it can be set to 1 so that the receiver discards this unit, mainly to adapt to different types of network environments (such as a combination of wired and wireless environments).
Common frame header data for H.264:
00 00 00 01 67 (SPS)
00 00 00 01 68 (PPS)
00 00 00 01 65 (IDR frame)
00 00 00 01 61 (P frame)
The bytes 67, 68, 65, and 61 above (and 41, which also appears in practice) are NALU header bytes, made up of three fields:
F: Forbidden bit; 0 indicates normal, 1 indicates an error; generally 0.
NRI: Importance level; binary 11 indicates very important.
TYPE: Indicates the type of the NALU.
See the table below: 7 corresponds to the sequence parameter set (SPS), 8 to the picture parameter set (PPS), 5 to a slice of an IDR picture (an I frame), and 1 to a slice of a non-IDR picture.
Thus both 61 and 41 are P frames (type value 1) with different importance levels: their NRI values are binary 11 and binary 10 respectively.
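The field split described above can be checked with a short Python sketch. The helper name `parse_nal_header` is hypothetical; the shifts and masks simply follow the 1/2/5-bit layout of F, NRI, and Type.

```python
def parse_nal_header(header: int) -> tuple[int, int, int]:
    """Split a one-byte NALU header into (F, NRI, Type)."""
    forbidden = (header >> 7) & 0x01  # 1st bit: forbidden bit
    nri = (header >> 5) & 0x03        # 2nd and 3rd bits: importance level
    nal_type = header & 0x1F          # 4th to 8th bits: NAL unit type
    return forbidden, nri, nal_type


for byte in (0x67, 0x68, 0x65, 0x61, 0x41):
    f, nri, nal_type = parse_nal_header(byte)
    print(f"0x{byte:02X}: F={f} NRI={nri:02b} Type={nal_type}")
```

Running it shows that 0x61 and 0x41 share Type 1 and differ only in NRI.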
H264 (NAL Introduction and I Frame Judgment)#
We continue to analyze the data corresponding to the bitstream in the topmost figure layer by layer. The byte that follows each 00 00 00 01 start code is the NALU header. Written in binary, it is read from left to right as follows:
(1) The first bit is the forbidden bit; a value of 1 indicates a syntax error.
(2) The 2nd and 3rd bits are the reference level.
(3) The 4th to 8th bits represent the NAL unit type.
For example, after 00000001, there are 67, 68, and 65.
The binary code for 0x67 is:
0110 0111
Bits 4-8 are 00111, which converts to decimal 7. Refer to the second figure: 7 corresponds to the sequence parameter set SPS.
The binary code for 0x68 is:
0110 1000
Bits 4-8 are 01000, which converts to decimal 8. Refer to the second figure: 8 corresponds to the picture parameter set PPS.
The binary code for 0x65 is:
0110 0101
Bits 4-8 are 00101, which converts to decimal 5. Refer to the second figure: 5 corresponds to the slice in an IDR picture (I frame).
Therefore, the test for an I frame (more precisely, a slice of an IDR picture) is:
(NALU header & 0001 1111) = 5, i.e., (NALU header & 31) = 5; for example, 0x65 & 31 = 5.
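As a rough sketch of how this check is applied to a stream, the Python below splits an Annex B byte stream at its start codes and tests each NAL unit. The names `split_annexb` and `is_idr` are invented for this example, and the sample payload bytes are placeholders rather than real encoder output. The splitter assumes a well-formed stream in which no NAL unit ends in a 0x00 byte.

```python
def split_annexb(stream: bytes) -> list[bytes]:
    """Return the NAL units of an Annex B stream with start codes removed."""
    prefix = b"\x00\x00\x01"
    nalus = []
    i = stream.find(prefix)
    while i != -1:
        start = i + 3                      # first byte after 00 00 01
        nxt = stream.find(prefix, start)
        end = len(stream) if nxt == -1 else nxt
        # A zero byte before the next prefix is the leading 00 of a
        # 4-byte start code, so it is trimmed from this unit.
        nalu = stream[start:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
        i = nxt
    return nalus


def is_idr(nalu: bytes) -> bool:
    """True when the low five bits of the header byte equal 5."""
    return (nalu[0] & 0x1F) == 5


stream = bytes.fromhex("00000001 6742 00000001 68ce 000001 6588")
for nalu in split_annexb(stream):
    print(f"header=0x{nalu[0]:02X} type={nalu[0] & 0x1F} idr={is_idr(nalu)}")
```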
RTP Packaging and Sending H264 Detailed Explanation#
RFC 3984 (since obsoleted by RFC 6184) is the specification of the RTP payload format for H.264 video.
H264 Bitstream Structure
Single NALU#
In the simplest case, the 12-byte RTP header is followed directly by one NAL unit. NAL units carried this way in the RTP stream must follow the NAL unit decoding order. The single NAL unit mode is generally used for NALUs shorter than the MTU. A raw H.264 NAL unit typically consists of three parts: [Start Code] [NALU Header] [NALU Payload], where the Start Code marks the beginning of a NAL unit and must be "00 00 00 01" or "00 00 01". The NALU header is only one byte, and the rest is the content of the NAL unit.
When packaging, remove the start code "00 00 01" or "00 00 00 01" and encapsulate the remaining data in the RTP packet.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|NRI|  type   |                                               |
   +-+-+-+-+-+-+-+-+                                               |
   |                                                               |
   |               Bytes 2..n of a Single NAL unit                 |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
For example, if an H.264 NALU is as follows:
[00 00 00 01 67 42 A0 1E 23 56 0E 2F …]
This is a sequence parameter set NAL unit. [00 00 00 01] is the four-byte start code, 67 is the NALU header, and 42 begins the NALU content.
Encapsulating it into an RTP packet would look like this:
[RTP Header] [67 42 A0 1E 23 56 0E 2F]
That is, just remove the 4-byte start code.
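A minimal Python sketch of this single NAL unit packaging follows. The 12-byte header layout is the standard RTP fixed header; the function names, the dynamic payload type 96, and the sequence number, timestamp, and SSRC values are assumptions chosen only for illustration.

```python
import struct


def strip_start_code(nalu: bytes) -> bytes:
    """Drop a leading 00 00 00 01 or 00 00 01 start code, if present."""
    for code in (b"\x00\x00\x00\x01", b"\x00\x00\x01"):
        if nalu.startswith(code):
            return nalu[len(code):]
    return nalu


def rtp_header(seq: int, timestamp: int, ssrc: int,
               payload_type: int = 96, marker: bool = False) -> bytes:
    """Build a 12-byte RTP header: version 2, no padding, extension, or CSRC."""
    byte0 = 2 << 6                                   # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def single_nal_packet(annexb_nalu: bytes, seq: int, timestamp: int, ssrc: int) -> bytes:
    """RTP header followed by the NAL unit without its start code."""
    return rtp_header(seq, timestamp, ssrc) + strip_start_code(annexb_nalu)


sps = bytes.fromhex("00000001 6742A01E23560E2F")
packet = single_nal_packet(sps, seq=1, timestamp=90000, ssrc=0x1234)
print(len(packet), packet[12:].hex(" "))
```

The payload starts with 67, the NALU header, so the receiver can read F, NRI, and Type from the first payload byte without any extra framing.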
Packet Aggregation#
When NALUs are particularly small, several NAL units can be encapsulated in one RTP packet.
To cope with the large difference in MTU between wired and wireless networks, the RTP payload format defines the following packet aggregation schemes:
- STAP-A: Aggregated NALUs have the same timestamp, no DON (decoding order number);
- STAP-B: Aggregated NALUs have the same timestamp, with DON;
- MTAP16: Aggregated NALUs have different timestamps, with the timestamp difference recorded in 16 bits;
- MTAP24: Aggregated NALUs have different timestamps, with the timestamp difference recorded in 24 bits;
- During packet aggregation, the RTP timestamp is the minimum value of all NALU timestamps;
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|NRI|  Type   |                                               |
   +-+-+-+-+-+-+-+-+                                               |
   |                                                               |
   |             one or more aggregation units                     |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3. RTP payload format for aggregation packets
STAP-A Example:
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         NALU 1 Data                           |
   :                                                               :
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               | NALU 2 Size                   | NALU 2 HDR    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         NALU 2 Data                           |
   :                                                               :
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 7. An example of an RTP packet including an STAP-A
containing two single-time aggregation units
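The STAP-A layout in the figure can be sketched in Python as follows. The function names are hypothetical. The sketch assumes each input NAL unit has already had its start code removed, that each 16-bit size field is in network byte order and counts the NAL unit including its header byte, and that the STAP-A header takes the OR of the F bits and the maximum NRI of the aggregated units, as the RTP payload specification requires.

```python
import struct

STAP_A = 24  # payload type value for STAP-A


def build_stap_a(nalus: list[bytes]) -> bytes:
    """Aggregate start-code-free NAL units into one STAP-A payload."""
    f_bit = 0
    nri = 0
    for n in nalus:
        f_bit |= n[0] & 0x80                # F is set if any unit has it set
        nri = max(nri, (n[0] >> 5) & 0x03)  # NRI is the maximum over all units
    payload = bytearray([f_bit | (nri << 5) | STAP_A])
    for n in nalus:
        payload += struct.pack("!H", len(n))  # 16-bit NALU size
        payload += n                          # NALU header + data
    return bytes(payload)


def parse_stap_a(payload: bytes) -> list[bytes]:
    """Recover the NAL units from a STAP-A payload."""
    if payload[0] & 0x1F != STAP_A:
        raise ValueError("not a STAP-A payload")
    nalus, pos = [], 1
    while pos + 2 <= len(payload):
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        nalus.append(payload[pos:pos + size])
        pos += size
    return nalus


sps, pps = bytes.fromhex("6742"), bytes.fromhex("68ce")
stap = build_stap_a([sps, pps])
print(stap.hex(" "))
print(parse_stap_a(stap) == [sps, pps])
```

Sending SPS and PPS together like this is a common use of aggregation, since both are only a few bytes long.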
FU-A Fragmentation Format#
Larger H.264 NAL units are sent as RTP fragments. When the length of a NALU exceeds the MTU, the NALU must be split into Fragmentation Units (FUs) for packaging. In that case the 12-byte RTP header is followed by the FU-A structure:
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | FU indicator  |   FU header   |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |                         FU payload                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 14. RTP payload format for FU-A
The FU indicator has the following format:
   +---------------+
   |0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+
   |F|NRI|  Type   |
   +---------------+
A Type value of 28 in the FU indicator denotes FU-A. The NRI field must be set to the NRI value of the fragmented NAL unit.
The format of the FU header is as follows:
   +---------------+
   |0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+
   |S|E|R|  Type   |
   +---------------+
S: 1 bit, when set to 1 indicates this is the first fragment of the NALU. When the following FU payload is not the start of the fragmented NAL unit payload, the start bit is set to 0.
E: 1 bit, when set to 1 indicates this is the last fragment of the NALU, i.e., the last byte of the payload is also the last byte of the fragmented NAL unit. When the following FU payload is not the last fragment of the fragmented NAL unit, the end bit is set to 0.
R: 1 bit, reserved; it must be set to 0, and the receiver must ignore it.
Type: 5 bits, the type of the fragmented NAL unit, copied from its original header. The type values used in the first payload byte are shown in the table below.
Summary of Unit Types and Payload Structures
   Type   Packet     Type name
   ---------------------------------------------------------
   0      undefined  -
   1-23   NAL unit   Single NAL unit packet per H.264
   24     STAP-A     Single-time aggregation packet
   25     STAP-B     Single-time aggregation packet
   26     MTAP16     Multi-time aggregation packet
   27     MTAP24     Multi-time aggregation packet
   28     FU-A       Fragmentation unit
   29     FU-B       Fragmentation unit
   30-31  undefined
Unpacking and Packing
Packing: When the sender fragments an original NAL unit into FU-A packets, the original NAL header relates to the FU-A headers as follows:
The first three bits of the original NAL header become the first three bits of the FU indicator, and the last five bits of the original NAL header become the last five bits of the FU header.
The remaining bits (Type = 28 in the FU indicator; S, E, and R in the FU header) are set per fragment.
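The packing rule can be sketched in Python as below. The function name `fragment_fu_a` and the `max_payload` parameter (a per-packet payload limit in bytes, counting the two FU bytes) are assumptions for this example. The sketch refuses NAL units that already fit in one packet, because a single fragment would need both S and E set, which the RTP payload specification does not allow.

```python
FU_A = 28  # Type value of the FU indicator for FU-A


def fragment_fu_a(nalu: bytes, max_payload: int) -> list[bytes]:
    """Split one start-code-free NAL unit into FU-A payloads."""
    if max_payload < 3:
        raise ValueError("max_payload must leave room for FU bytes plus data")
    if len(nalu) <= max_payload:
        raise ValueError("NAL unit fits in a single NAL unit packet")
    header = nalu[0]
    fu_indicator = (header & 0xE0) | FU_A  # F and NRI from the original header
    nal_type = header & 0x1F               # original type goes into the FU header
    body = nalu[1:]                        # the original header byte is not repeated
    chunk = max_payload - 2                # room left after FU indicator + FU header
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    fragments = []
    for idx, piece in enumerate(pieces):
        start = 0x80 if idx == 0 else 0               # S bit on the first fragment
        end = 0x40 if idx == len(pieces) - 1 else 0   # E bit on the last fragment
        fu_header = start | end | nal_type            # R bit stays 0
        fragments.append(bytes([fu_indicator, fu_header]) + piece)
    return fragments


idr = bytes([0x65]) + bytes(range(1, 11))  # placeholder 11-byte NAL unit
for frag in fragment_fu_a(idr, max_payload=6):
    print(frag.hex(" "))
```

Each returned payload still needs its own RTP header; all fragments of one NAL unit share the same RTP timestamp and use consecutive sequence numbers.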
Unpacking: When the receiver gets FU-A fragments, it must combine all fragments of a NAL unit to restore the original NAL unit. The FU-A headers relate to the restored NAL header as follows:
The eight bits of the restored NAL header are composed of the first three bits of the FU indicator and the last five bits of the FU header, i.e.:
nal_header = (fu_indicator & 0xe0) | (fu_header & 0x1f)
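A short Python sketch of the unpacking side follows. The function name `reassemble_fu_a` is hypothetical, and the sketch assumes the caller has already ordered the fragments by RTP sequence number and checked that none are missing.

```python
def reassemble_fu_a(fragments: list[bytes]) -> bytes:
    """Rebuild the original NAL unit from its FU-A payloads, given in order."""
    if not fragments:
        raise ValueError("no fragments")
    first, last = fragments[0], fragments[-1]
    if not first[1] & 0x80:
        raise ValueError("first fragment lacks the S bit")
    if not last[1] & 0x40:
        raise ValueError("last fragment lacks the E bit")
    fu_indicator, fu_header = first[0], first[1]
    # First three bits from the FU indicator, last five from the FU header.
    nal_header = (fu_indicator & 0xE0) | (fu_header & 0x1F)
    body = b"".join(f[2:] for f in fragments)  # skip the two FU bytes
    return bytes([nal_header]) + body


fragments = [
    bytes([0x7C, 0x85, 1, 2]),  # S=1, Type=5
    bytes([0x7C, 0x05, 3, 4]),  # middle fragment
    bytes([0x7C, 0x45, 5]),     # E=1
]
print(reassemble_fu_a(fragments).hex(" "))
```

To write the result to an Annex B file, the receiver would prepend a start code to the restored NAL unit.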