Concepts
Figure 1. For a particular source frame
Figure 2. But when source complexity varies….
Figure 3 illustrates open loop (or VBR) operation of a video encoder.
The user supplies two key inputs – the uncompressed video
source and a value for QP. As the source sequence progresses, you
will get compressed video of fairly constant quality, but the bitrate
may vary dramatically. Because the complexity of pictures is continually
changing in a real video sequence, it is not so obvious what value
of QP to pick. If you fix QP for an "easy" part of the
sequence having slow motion and uniform areas, then the bitrate
will go up dramatically when you reach the "hard" (i.e.,
more complex) parts.
In reality, constraints imposed by decoder buffer size and network
bandwidth force us to encode video at a more nearly constant bitrate.
To do this, Figure 4 suggests that we must dynamically vary QP based
upon estimates of the source complexity, so that each picture (or
group of pictures) gets an appropriate allocation of bits to work
with. Rather than specifying QP as input, the user specifies demanded
bitrate instead.
Figure 3. Open Loop Encoding (VBR)
Figure 4. Closed Loop Rate Control (CBR)
Elements of H.264 Rate Control
With a focus on the recommended approach [4, 5, 6] for H.264,
Figure 5 identifies important elements within the rate controller. Most
of these elements are common to other rate control schemes. Note
that Figure 5 is conceptual and is not a literal representation
of any software implementation. Many details are glossed over –
for example, that B and P pictures are treated differently, and
that some estimates are averages of sampled data over multiple pictures.
Figure 5. Elements of H.264 Rate Controller
$$\mathrm{ResidualBits} = C_1 \cdot \frac{\mathrm{MAD}}{QP} + C_2 \cdot \frac{\mathrm{MAD}}{QP^2} \qquad (2)$$

The model above is quadratic in 1/QP, but it may take a simpler form
(with C2 = 0) or a more complicated form involving exponentials or
other basis curves for fitting. This equation (note that our term
ResidualBits is synonymous with the term Texture Bits used by other
authors [2]) corresponds to equation 2-84 of [6] and to equation 1 of
[2]. The free coefficients C1 and C2 may be
estimated empirically, by providing hooks in the encoder for extracting
the residual coefficients, as well as the number of residual bits
needed to transmit them.
Having established the model in (2), we can solve for the demanded
QP when the target value of ResidualBits is supplied by the Bit
Allocation modules in Figure 5.
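As a minimal sketch of this step, the quadratic model can be inverted in closed form: multiplying through by QP^2 gives an ordinary quadratic in QP whose positive root is the demanded value. The function names, the clamping range of 0 to 51, and the coefficient values in the usage note are assumptions made for illustration only; a real encoder would take C1 and C2 from its own model update and might work with the quantizer step size rather than QP directly.

```python
import math


def model_bits(qp, mad, c1, c2):
    """Residual bits predicted by the quadratic model for a given QP."""
    return c1 * mad / qp + c2 * mad / (qp * qp)


def solve_qp(target_bits, mad, c1, c2, qp_min=0.0, qp_max=51.0):
    """Solve target_bits = c1*mad/qp + c2*mad/qp**2 for qp.

    Hypothetical helper: the clamping range and the fallback to the
    linear model (c2 = 0) are illustrative choices.
    """
    if target_bits <= 0 or mad <= 0:
        raise ValueError("target_bits and mad must be positive")
    # Rearranged: target_bits*qp^2 - c1*mad*qp - c2*mad = 0
    a = target_bits
    b = -c1 * mad
    c = -c2 * mad
    disc = b * b - 4.0 * a * c
    if c2 == 0 or disc < 0:
        qp = c1 * mad / target_bits          # linear model
    else:
        qp = (-b + math.sqrt(disc)) / (2.0 * a)  # larger root
    return min(max(qp, qp_min), qp_max)
```

For example, with the made-up values C1 = 1000, C2 = 5000 and MAD = 4, feeding `model_bits(28, 4, 1000, 5000)` back into `solve_qp` recovers a QP of 28, which is a quick consistency check on the algebra.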
Complexity Estimation
As indicated above, we need a simple metric that reflects the encoding
complexity associated with the residuals. The MAD of the prediction
error is a convenient surrogate for this purpose:
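A minimal sketch of the metric, assuming the source and predicted pixel values are available as equal-length flat sequences (the function name and argument layout are hypothetical):

```python
def prediction_error_mad(source, prediction):
    """Mean absolute difference between source and predicted pixels."""
    if len(source) != len(prediction) or len(source) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(s - p) for s, p in zip(source, prediction))
    return total / len(source)
```

A perfect prediction gives a MAD of zero, and the value grows as the residual that must be transformed and quantized grows.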
Figure 6. Comparison to MPEG2 Test Model 5
Similarities include the use of the virtual buffer model, the calculation
of layered bit targets for the GOP and picture, and the overall
goal of generating a quantization parameter (in this case, called
Mquant) for a basic unit. The Mquant for the basic unit (always
a single macroblock) is adjusted in proportion to its estimated
complexity.
Differences include:
• The Basic Unit is always the macroblock in this scheme. It is possible to get significant variations of quantization parameter across different macroblocks in the same picture.
• Differences between I, P and B picture types arise in the allocation of target bits. Otherwise, they are treated similarly.
• MPEG-2 does not have the same multiplicity of prediction modes. In the absence of advanced intra prediction, it need not be so rigorous in relating quantization parameter (which controls residual quality) to measured properties of the residual itself.
• Macroblock-level spatial complexity is estimated from the source activity, regardless of whether the complexity is handled by transmitting motion vectors (inter-prediction) or residual coefficients.
• Allocation of bits to a picture considers the picture type, GOP structure and demanded bitrate, but not the picture's measured complexity. However, within the picture, the buffer fullness and relative spatial activity of each macroblock are used to allocate the picture bits among the macroblocks.
It is easy to recognize this Test Model 5 approach as an ancestor of
the H.264 approach, which accommodates the more general prediction
methods of H.264 and provides more flexibility to scale the granularity
of control.
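The macroblock-level adjustment in Test Model 5 can be sketched as follows. This assumes the formulas as they are commonly stated for TM5 [7]: a reference quantizer proportional to virtual buffer fullness, scaled by a normalized activity that ranges between 0.5 and 2. The function and parameter names are illustrative and do not correspond to any particular encoder's API.

```python
def normalized_activity(act, avg_act):
    """TM5-style normalized activity; 1.0 for an average macroblock."""
    return (2.0 * act + avg_act) / (act + 2.0 * avg_act)


def tm5_mquant(buffer_fullness, reaction, act, avg_act):
    """Quantizer for one macroblock.

    buffer_fullness: virtual buffer fullness in bits before this macroblock
    reaction:        reaction parameter (in TM5, 2 * bitrate / picture rate)
    act, avg_act:    this macroblock's activity and the previous picture's average
    """
    q_ref = buffer_fullness * 31.0 / reaction
    mquant = q_ref * normalized_activity(act, avg_act)
    # Round to nearest and clip to the legal MPEG-2 range 1..31.
    return min(max(int(mquant + 0.5), 1), 31)
```

A busy macroblock (activity well above average) therefore receives a coarser quantizer than a flat one at the same buffer fullness, which is how picture bits are spread among macroblocks.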
H.264 Rate-Distortion Optimization and Global Rate Control
H.264 provides 7 modes for inter (temporal) prediction, 9 modes
for intra (spatial) prediction of 4x4 blocks, 4 modes for intra
prediction of 16x16 macroblocks, and one skip mode. Each 16x16
macroblock can be broken down in numerous ways. Thus, mode selection
for each macroblock is a critical and time-consuming step that enables
much of the dramatic bitrate reduction.
Selection of the optimal mode is done by an algorithm called rate-distortion
optimization (RDO) [8], which essentially involves
1) an exhaustive pre-calculation of all feasible modes to determine
the bits and distortion of each; 2) evaluation of a metric that
considers both bitrate and distortion; and 3) selection of the mode
that minimizes the metric.
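The three steps can be sketched as below, assuming the metric is the usual Lagrangian cost J = D + λ·R, with D the distortion, R the bits, and λ a multiplier that in practice is derived from QP. The candidate structure, the mode names and the numbers in the usage note are invented for illustration.

```python
def select_mode(candidates, lam):
    """Return the candidate with the lowest cost D + lam * R.

    candidates: iterable of (name, distortion, bits) tuples, assumed to
                have been produced by trial-encoding each feasible mode
    lam:        Lagrange multiplier trading bits against distortion
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])


# Hypothetical trial-encoding results for one macroblock.
modes = [
    ("skip", 900.0, 1),
    ("inter16x16", 400.0, 40),
    ("intra4x4", 150.0, 180),
]
```

With these made-up numbers, a small multiplier such as 0.5 favours the low-distortion intra mode, 5 selects the inter mode, and 50 selects skip. This illustrates why QP, through the multiplier, has to be known before mode selection can run.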
QP is input to the RDO process, which does not regulate QP or modify
the quality of the residual coefficients. RDO is complementary to
rate control; these two aspects of the problem are decoupled because
a fully coupled optimization would require a more expensive iterative
solution.
The interplay with RDO, described in [4] as
a "chicken and egg" dilemma, influences implementation
of a rate control algorithm. The MAD is needed by the rate control
algorithm, but it is available only after the RDO has used a QP
value to generate it. Thus, the rate control algorithm must use
an estimate for MAD based upon complexity of prior pictures in the
sequence.
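One way to form such an estimate is a first-order linear predictor, similar in spirit to the one proposed in [4]: the MAD of the current picture is predicted as a1 times the MAD of the previous picture plus a2, with a1 and a2 refitted by least squares as actual values become available. The sketch below assumes this form; the function names, the default coefficients (a1 = 1, a2 = 0) and the fallback behaviour are illustrative choices.

```python
def predict_mad(prev_mad, a1=1.0, a2=0.0):
    """Predict the current MAD from the previous picture's actual MAD."""
    return a1 * prev_mad + a2


def fit_mad_predictor(prev_mads, actual_mads):
    """Least-squares fit of actual = a1 * prev + a2 over past pictures.

    Falls back to (1.0, 0.0) when there is too little data to fit.
    """
    n = len(prev_mads)
    if n != len(actual_mads):
        raise ValueError("histories must have equal length")
    if n < 2:
        return 1.0, 0.0
    mean_x = sum(prev_mads) / n
    mean_y = sum(actual_mads) / n
    sxx = sum((x - mean_x) ** 2 for x in prev_mads)
    if sxx == 0.0:
        return 1.0, 0.0
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(prev_mads, actual_mads))
    a1 = sxy / sxx
    a2 = mean_y - a1 * mean_x
    return a1, a2
```

The predicted MAD is then what the rate controller uses when solving for QP, and the actual MAD measured after encoding feeds the next refit.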
ExpertH264 Implementation of Rate Control
PixelTools has implemented the H.264 rate control recommendations
in a recent release of ExpertH264. For this release, we have provided
picture level control without frame skip. Especially for offline
applications for encoding to stored media, this algorithm provides
excellent tracking of bitrates for GOPs of a wide variety of sizes.
Typical results track GOP bitrate within 1% without B pictures or
2-3% with B pictures, with good stabilization of QP to prevent noticeable
swings in quality. You can try this for yourself by requesting a
free demo of ExpertH264 from PixelTools
Corporation.
In subsequent releases, we plan to allow flexibility for smaller
basic units, which will allow closer bitrate tracking on the individual
picture level, as well as for smaller virtual buffer capacities.
We will also support both frame skip and stuffing bits in a subsequent
release – depending upon the end requirements, use of one
or both of these techniques will reduce variations in bitrate.
The algorithm is a separate module having several interfaces that
can be called by the encoder, and with callbacks to the encoder
for retrieving key information such as residual bits and residual
coefficients. Construction of the complexity metric (i.e., prediction
error MAD) is part of the rate control algorithm. C interfaces and
utility functions include:
• init_rateControl
• frameRateControl
• updateBFrameState
• initRateControlParams
• getQB
• getMbMAD
• gopRateControl
• updateModel
• initialQP
Thus, developers of hardware and software encoders can consider
integrating this algorithm into their own environments. For example,
after the encoding step, a call to updateModel refreshes the empirical
coefficients such as C1 and C2 in equation (2). Similarly frameRateControl
is called prior to encoding each picture and supplies the quantization
parameter.
Terminology
The following glossary is intended to help with a common understanding
of rate control issues.
Prediction. Both H.264 and MPEG-*
may predict a macroblock by traditional inter (temporal) prediction,
i.e., a motion estimation from previous reference pictures followed
by transmission of the motion vector. Additionally, H.264 supports
advanced intra (spatial) prediction of a macroblock from encoded
values for neighboring pixels that have already been encoded (e.g.,
in raster-scan order).
Residual. The difference between the
source and prediction signals is called the residual,
or the prediction error. A spatial transform is then applied to
the residual to produce transformed coefficients that carry any
spatial detail that is not captured in the prediction itself or
its reference pictures.
Distortion. Distortion refers to the
difference between the original source image x, and the reconstructed
image y after it has been decoded. In H.264, the sum of squared differences
is used to quantify distortion as $(1/N) \sum_i |y_i - x_i|^2$, for any set of N pixels.
Complexity. As the saying goes, I can't
define complexity, but I know it when I see it! A single source
picture is complex if it is "busy" and has lots of spatial
detail. The term spatial activity is synonymous
with source complexity for this case. However, for a video sequence,
the meaning of complexity is, well, more complex! For example, if
a video sequence consists of one busy object that translates slowly
across the field of view, it may not require very many bits because
the temporal prediction can easily capture the motion using a single
reference picture and a series of motion vectors. It is difficult
to define an inclusive video complexity metric that is also easy
to calculate. See MAD.
MAD: Mean Absolute Difference of Prediction
Error. For rate control, what is more important is the encoding
complexity of the residuals that are left over after the inter or
intra prediction process is finished. The Mean Absolute Difference
of Prediction Error is usually closely related to encoding complexity.
Suppose xi is the source value for the ith pixel and pi is the corresponding
predicted value; then, for a set of N pixels, MAD = $(1/N) \sum_i |x_i - p_i|$.
Spatial Activity. This term is used to quantify the amount of spatial
variation within a part of the picture, normally a block of N pixels.
Suppose the N pixel values are xi, i = 1,..,N. Then the activity for
those N pixels is $(1/N) \sum_i (x_i - \langle x \rangle)^2$, where
$\langle x \rangle = (1/N) \sum_i x_i$. In other words, the spatial activity
is the sample variance of a block's values. It is the measure for local
complexity used in MPEG-2.
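A minimal sketch of this calculation for one block, assuming the pixel values are supplied as a flat sequence (the function name is hypothetical):

```python
def spatial_activity(block):
    """Variance of a block's pixel values, using the 1/N normalization."""
    n = len(block)
    if n == 0:
        raise ValueError("block must contain at least one pixel")
    mean = sum(block) / n
    return sum((x - mean) ** 2 for x in block) / n
```

A uniform block has zero activity, while a block split evenly between black (0) and white (255) pixels has the largest activity possible for 8-bit samples.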
Bitrate. Bitrate refers to the bits per
second consumed by a sequence of pictures, i.e., bitrate = (average
bits per picture) x (frames per second). In practice, it is equated
to the reliable network bandwidth that is provisioned or available
for the stream.
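As a quick worked example of the arithmetic, bits per picture multiplied by pictures per second gives bits per second, and the same relation run backwards gives the average per-picture budget a rate controller has to work with. The helper names are illustrative.

```python
def bitrate_bps(avg_bits_per_picture, frames_per_second):
    """Bits per second consumed by the stream."""
    return avg_bits_per_picture * frames_per_second


def avg_picture_budget(target_bitrate_bps, frames_per_second):
    """Average number of bits available per picture at a target bitrate."""
    return target_bitrate_bps / frames_per_second
```

For instance, a 2 Mbit/s target at 25 frames per second leaves an average of 80,000 bits per picture, to be shared unevenly among I, P and B pictures.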
Quantization Parameter (QP). Residuals
are transformed into the spatial frequency domain by an integer
transform that approximates the familiar Discrete Cosine Transform
(DCT). The Quantization Parameter determines the step size for associating
the transformed coefficients with a finite set of steps. Large values
of QP represent big steps that crudely approximate the spatial transform,
so that most of the signal can be captured by only a few coefficients.
Small values of QP more accurately approximate the block's spatial
frequency spectrum, but at the cost of more bits. In H.264, each
unit increase of QP lengthens the step size by 12% and reduces the
bitrate by roughly 12%.
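The 12% figure follows from the fact that the H.264 step size doubles for every increase of 6 in QP, so each unit step multiplies it by 2^(1/6), about 1.12. The sketch below uses the closed-form approximation Qstep ≈ 0.625 · 2^(QP/6); the standard actually tabulates the step sizes, so individual values differ slightly from this formula, and the function name is hypothetical.

```python
def approx_qstep(qp):
    """Approximate H.264 quantizer step size for QP in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return 0.625 * 2.0 ** (qp / 6.0)


# Ratio between adjacent QP values and across a span of six.
unit_ratio = approx_qstep(29) / approx_qstep(28)
six_ratio = approx_qstep(34) / approx_qstep(28)
```

Here `unit_ratio` is 2^(1/6) and `six_ratio` is 2, up to floating-point rounding, which matches the rule of thumb quoted above.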
Group of Pictures (GOP). The
Group of Pictures concept is inherited from MPEG and refers to an
I-picture, followed by all the P and B pictures until the next I
picture. A typical MPEG GOP structure might be IBBPBBPBBI. Although
H.264 does not strictly require more than one I picture per video
sequence, the recommended rate control approach does require a repeating
GOP structure to be effective. Thus, H.264 rate control will not
work properly if the IntraPeriod parameter is set to 0.
Basic unit. The authors of references
[4] and [5] introduced this
useful term that expresses the granularity on which QP is adjusted
in the feedback control loop. If the basic unit is a picture, then
the rate controller's adjustments to QP are uniform across the picture.
In MPEG-2, the basic unit is a macroblock. Initially, most H.264
applications will probably use the picture as basic unit, but ultimately
a full or partial row of macroblocks is expected to yield the best
compromise between uniform bitrate and uniform quality.
Summary
This white paper presents the basics of rate control for H.264 and
compares them to the Test Model 5 approach of MPEG-2. Implementers
needing a detailed description of the algorithm should see [5]
or [6]. The structure shown in our Figure 5, the discussion of its
modules, and the terminology glossary should provide a useful companion
to help in understanding the densely packed equations found in these
references.
References
1. C. Poynton, Digital Video
and HDTV, Elsevier Science 2003, pp. 491-2
2. A. Vetro, "MPEG-4 Rate Control for Multiple
Video Objects," IEEE Transactions on Circuits and Systems for
Video Technology, Vol. 9, No. 1, February 1999
3. G. Sullivan, T. Wiegand and K.P. Lim, "Joint
Model Reference Encoding Methods and Decoding Concealment Methods;
Section 2.6: Rate Control" JVT-I049, San Diego, September 2003
4. Z. Li et al., "Adaptive Basic Unit Layer
Rate Control for JVT," JVT-G012, 7th Meeting: Pattaya, Thailand,
March 2003
5. Z. Li et al., "Proposed Draft of Adaptive
Rate Control," JVT-H017, 8th Meeting: Geneva, May 2003
6. G. Sullivan, T. Wiegand and K.P. Lim, "Joint
Model Reference Encoding Methods and Decoding Concealment Methods;
Section 2.6: Rate Control" JVT-I049, San Diego, September 2003
7. MPEG 2 Test Model 5, Rev. 2, Section 10: Rate
Control and Quantization Optimization, ISO/IEC/JTC1SC29WG11, April
1993
8. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini
and G. Sullivan, "Rate-Constrained Coder Control and Comparison
of Video Coding Standards," IEEE Transactions on Circuits &
Systems for Video Technology, 13, #7, July 2003