may be necessary to restrict computational processing in order to maximize
battery life. Current hardware video applications typically control processor
utilization by dropping frames during encoding, leading to intermittent and
“jerky” motion in the decoded video sequence. The computational complexity of
video CODECs thus becomes a major constraint on coding performance, and it
is therefore important to develop methods of managing it.
In the literature, existing contributions to reducing the computational
complexity of hardware video CODECs focus on speeding up specific
operations during encoding or decoding, such as fast algorithms for
computationally expensive functions (especially motion estimation, the Integer
Transform (IT) and the Inverse Integer Transform (IIT)). The computation
reduction achieved by this type of algorithm can vary significantly depending on the
scene content of the video sequence, so the amount of complexity reduction
and its influence on video quality and bitrate are difficult to predict.
The aim of this work is to design and implement an H.264/AVC integer
transform and quantization module that flexibly manages the
computational complexity of the H.264/AVC video encoder and controls the
trade-off between complexity and rate-distortion performance. Unlike the
popular 8×8 discrete cosine transform used in previous standards, the 4×4
transforms in H.264 can be computed exactly in integer arithmetic, thus
avoiding inverse transform mismatch problems. The new transforms can also
be computed without multiplications, just additions and shifts, in 16-bit
arithmetic, thus minimizing computational complexity, especially for low-end
processors. By using short tables, the new quantization formulas use
multiplications but avoid divisions. In contrast to conventional solutions, the
approaches proposed in this thesis are designed to provide comprehensive
and predictable control of the computational complexity of the video encoder
rather than simply decreasing it. Given a target complexity, the video quality
and rate-distortion performance will be optimized within this computation
limit. These approaches aim to provide nearly the same subjective video
quality as encoders without complexity reduction. This is based on the fact
that a small loss in video quality caused by a reduction in computational
complexity may not be perceived by audiences, owing to the characteristics of the
human visual system.
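To make the add-and-shift structure concrete, the following C sketch shows the standard H.264 4×4 forward core transform in its butterfly form (W = Cf·X·CfT); the function name and array layout are illustrative choices, not part of the design proposed in this thesis.

```c
#include <stdint.h>

/* H.264 4x4 forward core transform (butterfly form of W = Cf * X * Cf^T).
 * Uses only additions, subtractions and shifts in 16-bit integer
 * arithmetic; no multiplications or divisions are needed. */
static void forward_transform_4x4(const int16_t in[4][4], int16_t out[4][4])
{
    int16_t tmp[4][4];

    /* Horizontal (row) transform. */
    for (int i = 0; i < 4; i++) {
        int16_t a = in[i][0] + in[i][3];
        int16_t b = in[i][1] + in[i][2];
        int16_t c = in[i][1] - in[i][2];
        int16_t d = in[i][0] - in[i][3];
        tmp[i][0] = a + b;
        tmp[i][1] = (int16_t)(d << 1) + c;   /* 2d + c */
        tmp[i][2] = a - b;
        tmp[i][3] = d - (int16_t)(c << 1);   /* d - 2c */
    }

    /* Vertical (column) transform. */
    for (int j = 0; j < 4; j++) {
        int16_t a = tmp[0][j] + tmp[3][j];
        int16_t b = tmp[1][j] + tmp[2][j];
        int16_t c = tmp[1][j] - tmp[2][j];
        int16_t d = tmp[0][j] - tmp[3][j];
        out[0][j] = a + b;
        out[1][j] = (int16_t)(d << 1) + c;
        out[2][j] = a - b;
        out[3][j] = d - (int16_t)(c << 1);
    }
}
```

The inverse transform has an analogous butterfly structure, using halving right shifts instead of the doubling left shifts shown here.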
By developing complexity management techniques to decrease and control
the computational complexity, the coding performance of hardware-only
video encoders will be improved, resulting in good perceived video quality
in real-time multimedia systems and prolonged battery life in mobile
video devices at comparable video quality.
1.2 Objectives
In order to fulfill the aim of this work, the first step that needs to be
performed is examining and analyzing the computational cost of each
operation involved in the encoding process. Based on these investigations,
techniques are proposed to reduce and manage the most computationally
complex functions, including IT and related functions as well as the
Quantization function.
The first innovative approach to the complexity problem is developed for the
IT function. The objective is to achieve near-constant complexity reduction
of the IT function throughout the entire video sequence at the expense of a
controlled loss in video quality. This approach should also be applied to
other related functions of IT, including IIT, quantization and inverse
quantization. To provide a more comprehensive and significant
computation reduction in the encoder, the second objective is to
investigate approaches that can decrease and control the complexity of the
entire coding process. This method should be able to achieve the best rate-
distortion performance within the complexity limit.
The objectives of this work are:
- To develop an integer transform and quantization algorithm for H.264/AVC.
- Matlab simulation of the proposed algorithm.
- A proposed architecture for the integer transform.
- Hardware (VLSI) implementation of the proposed architecture.
- A prototype digital design using an FPGA prototype board.
- Optimization: modifying the architecture for efficient implementation on Xilinx® FPGA(s).
- Comparison of the results.
1.3 Organization of the Thesis
The organization of this thesis is as follows:
Chapter 2 presents a comprehensive description of the block-based video
coding technologies and most popular video coding standards. It starts with a
brief introduction to the fundamental terms used in block-based video
coding. The video coding standard (H.264/AVC), on which the video
CODEC used in the experimental work of this thesis is based, is then
overviewed.
Chapter 3 covers the H.264 integer transform, explaining the mathematics of the
integer transform and the inverse integer transform.
Chapter 4 presents quantization in H.264; the quantization offset and the
concept of the dead-zone are also discussed.
Chapter 5 presents the core work of this research, i.e., the
VLSI implementation of the H.264 integer transform and quantization module.
The proposed and optimized hardware architectures are explained and
compared.
Chapter 6 elaborates the experimental results and simulations.
Chapter 7 concludes this research work.
CHAPTER 2
Overview of Block-Based Video Coding
2.1 Background
Digital video has taken the place of traditional analogue video in a wide
range of applications due to its compatibility with other types of data (such
as voice and text). However, at the same time, the high bitrate required to
represent digital video dramatically increases the burden on storage space,
processing ability and transmission bandwidth. For example, using a video
format of 352×240 pixels with 3 bytes of color data per pixel, playing at 30
frames per second, 7.6 Megabytes of disc space are needed for one second of
video, so only around 10 minutes of video can be stored on a 4.6
Gigabyte DVD. When transmitted in real time over the internet, such video
requires a channel of 60.8 Mbps, roughly 120 times the bandwidth (500
kbps) of an Asymmetric Digital Subscriber Line (ADSL) for broadband
internet service. Even if high-bandwidth technology is able to provide
sufficient transmission speed and the storage problems of digital video are
overcome, the processing power needed to handle such massive amounts of
data would make video processing hardware very expensive. Although
significant progress in storage, transmission and processing technology is
being made, it is primarily compression technology that has made the
widespread use of digital video possible.
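As a quick check of the figures in this example, a few lines of C reproduce the arithmetic (the constants are exactly those quoted above):

```c
#include <stdio.h>

int main(void)
{
    const double width = 352, height = 240;   /* frame size in pixels   */
    const double bytes_per_pixel = 3;         /* 24-bit colour          */
    const double fps = 30;                    /* frames per second      */

    double bytes_per_sec = width * height * bytes_per_pixel * fps;
    double mbps = bytes_per_sec * 8.0 / 1e6;  /* raw bitrate in Mbps    */

    printf("Raw rate: %.1f MB/s (%.1f Mbps)\n", bytes_per_sec / 1e6, mbps);
    printf("4.6 GB DVD holds ~%.1f minutes\n", 4.6e9 / bytes_per_sec / 60.0);
    printf("Ratio to 500 kbps ADSL: ~%.0fx\n", mbps * 1e6 / 500e3);
    return 0;
}
```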
Generally speaking, there is a large amount of statistical and subjective
redundancy in digital video sequences. Video compression techniques are
designed to reduce the size of information for storage and transmission.
By exploiting both statistical and subjective redundancy, a compact
representation of the video data is achieved while the important information is retained.
The performance of compression depends not only on the amount of
redundancy in the video sequence but also on compression techniques used
for coding. There are two classes of techniques for image and video
compression: lossless coding and lossy coding. Lossless coding techniques
compress the image and video data without any loss of information and the
compressed data can be decoded to exactly reproduce the original data; however,
these techniques achieve only a very low compression ratio and result in large files.
Consequently, they are appropriate for applications requiring no loss
introduced by compression, for example, medical image storage. On the
other hand, lossy coding methods sacrifice some image and video quality to
achieve a significant decrease in file size and a high compression ratio.
Lossy coding techniques are widely used in digital image and video
applications due to the high compression ratios they provide.
The growing interest in digital image and video applications has led
academics and industry to work together to standardize compression
techniques in order to meet the requirement of various applications. Several
series of standards have been successfully developed by two organizations:
the International Organization for Standardization / International Electrotechnical
Commission (ISO/IEC) and the International Telecommunication Union,
Telecommunication Standardization Sector (ITU-T). These standards
address a wide range of video applications in terms of bitrate, image quality,
complexity and so on. In the following sections, the fundamental video
coding techniques and popularly used compression standards are introduced.
2.2 Block-Based Video Coding
Most of the popular video coding standards utilize block-based video coding
techniques. Within an image or a single frame of a video, there are usually
similarities between neighboring pixels, referred to as spatial redundancy.
The compression of a single frame is accomplished by replacing the
image information with a smaller representation. Many techniques have
been developed to allow the replacement of whole segments of an image
with a set of data representing the image in a transformed state, among
which, the H.264/AVC Integer Transform (IT) [1] is widely used in block-
based coding. The IT is commonly applied to 8×8 or 4×4 blocks.
The movement and detail in a video scene tend to vary gradually, and thus
adjacent video frames are often similar. This is temporal redundancy. If an
area of a frame is compared with a previous frame, it is possible that an
identical or similar area can be found. Temporal redundancy is reduced by
replacing the area with a corresponding area derived from one or more
reference frames. The basic area for comparison and replacement is typically
a macroblock in block-based coding. If the replacement is not exact,
IT-based coding can be performed to further reduce the spatial redundancy.
The block-based video coding technique can achieve a good compression
ratio while it is also computationally efficient, which has led to its wide-
spread use in many coding standards, such as H.263 [2], MPEG-2 [3],
MPEG-4 [4] and H.264/AVC.
2.3 Video Coding Fundamentals
2.3.1 Motion Estimation & Compensation
A video sequence typically contains temporal redundancy: that is, two
successive pictures are often very similar except for changes induced by
object movement, illumination, camera movement and so on. Motion
estimation and compensation is used to reduce this type of redundancy in
moving pictures. The Block-Matching Algorithm (BMA) for motion
estimation has proved to be very efficient in terms of quality and bit
rate; it has therefore been adopted by many standards-based CODECs. In this
section, the basic principle of block-matching motion estimation and
compensation is introduced and fast motion search algorithms are addressed.
Principle of Block Matching Motion Estimation and Compensation
In block matching motion estimation, a single image is subdivided into non-
overlapping N×N blocks, where N is usually 16, 8 or 4. Each block in the
current frame is compared with blocks of the same size in reference pictures
in order to find the best match according to an error criterion. The
location of a block is defined by the co-ordinates (x, y) of its top-left
corner. The vector pointing from the current block to
the best-match block is chosen as the Motion Vector (MV). The residual
“difference” between current and reference frames is computed by the
process of motion compensation, and then coded and transmitted with
motion vectors.
Two error measurements are commonly used for block matching criteria:
Sum of Squared Error (SSE) [5] and Sum of Absolute Difference (SAD) [6],
which are described in Equations 2.1 and 2.2 respectively.
$$SSE(d_x, d_y) = \sum_{x,y=0}^{N-1} \left( f_c(x, y) - f_r(x + d_x, y + d_y) \right)^2 \qquad (2.1)$$

$$SAD(d_x, d_y) = \sum_{x,y=0}^{N-1} \left| f_c(x, y) - f_r(x + d_x, y + d_y) \right| \qquad (2.2)$$
where $f_c(x, y)$ is the luminance pixel value of the N×N block in the current frame and
$f_r(x + d_x, y + d_y)$ is the block at displacement $(d_x, d_y)$ in the reference frame. For a
block size of 16×16, SSE requires 16 × 16 = 256 multiplications and 2 × 16
× 16 = 512 additions, whereas SAD needs only 2 × 16 × 16 = 512 additions.
Compared with SSE, SAD is much less computationally demanding and is
consequently more widely utilized. Figure 2-1 illustrates the process of
block-based motion search. An N×N block in the current frame ($f_c$), located
by the co-ordinates (x, y), is compared with same-size blocks within the search
range of the reference frame ($f_r$). The SAD is computed for each candidate
position and the best motion vector is the one with the minimum SAD.
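To illustrate how cheap the SAD criterion is to compute, the following C sketch implements Equation 2.2 together with an exhaustive full search over a ±range window; the function names are illustrative and frame-boundary checks are omitted for brevity.

```c
#include <stdlib.h>
#include <limits.h>

#define N 16  /* block size */

/* SAD (Equation 2.2) between the NxN block of the current frame at
 * (x, y) and the block displaced by (dx, dy) in the reference frame;
 * 'stride' is the frame width in pixels. */
static int sad_block(const unsigned char *cur, const unsigned char *ref,
                     int stride, int x, int y, int dx, int dy)
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += abs(cur[(y + j) * stride + (x + i)] -
                       ref[(y + j + dy) * stride + (x + i + dx)]);
    return sum;
}

/* Exhaustive block-matching search over a +/-range window; stores the
 * motion vector with the minimum SAD in (*best_dx, *best_dy). */
static void full_search(const unsigned char *cur, const unsigned char *ref,
                        int stride, int x, int y, int range,
                        int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int cost = sad_block(cur, ref, stride, x, y, dx, dy);
            if (cost < best) { best = cost; *best_dx = dx; *best_dy = dy; }
        }
}
```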
2.3.2 DCT-Based Transform
Through motion estimation and compensation, the temporal redundancy in
the current frame is reduced and a residual frame is generated by subtracting
each block in the current frame from its best match in the reference frame.
There is still a certain amount of spatial redundancy in the residual
frame, so the Discrete Cosine Transform (DCT) is applied, followed by
quantization, to further reduce it. The DCT is a mathematical
method that transforms image data from the spatial domain to the frequency
domain. An N×N block of samples is converted into transform coefficients.
The block size for the DCT in image and video coding is usually
chosen to be 8, because an 8×8 block captures the spatial correlation
between pixels well without placing a heavy burden on processing ability
or memory storage [7]. Figure 2-2 illustrates the basis functions of an 8×8
block. The top-left one is the “DC” basis function, which represents zero
spatial frequency. The spatial frequency of other “AC” basis functions
increases horizontally along the top row and vertically down the left column.
Any 8×8 block can be represented as a weighted sum of these basis
functions, where the weights are the 8×8 transformed coefficients.
Figure 2-2 Basis functions of DCT transform
The general N×N two-dimensional DCT is defined by the
following equation, where f is the N×N block of original pixels and F is the
matrix of transformed coefficients.
$$F(u, v) = \frac{2}{N} A(u) A(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i, j) \cos\left(\frac{(2i+1)u\pi}{2N}\right) \cos\left(\frac{(2j+1)v\pi}{2N}\right) \qquad (2.3)$$

where $A(x) = \frac{1}{\sqrt{2}}$ for $x = 0$, and $A(x) = 1$ otherwise.
The DCT transform involves costly matrix multiplication. Mathematically,
the DCT is perfectly reversible without any loss of information. If an inverse
DCT is applied to the transform coefficients, the reconstructed image data is
exactly the same as the original. The transformed coefficients represent how
much of each basis function is present in the original block. A large
coefficient value means that the original image data contain significant
variation at the corresponding spatial frequency. An N×N block contains
spatially correlated samples whose values do not vary dramatically, so it can
be represented mainly by low-frequency basis functions. Consequently, after
the DCT, most coefficients have small values; the high-valued ones are
usually those representing the lower-frequency basis functions, and they
cluster around the DC coefficient position. Although no compression has
been achieved at this stage, the DCT prepares the data well for the next
coding stage by producing many low-value coefficients. The block diagram
of a DCT-based CODEC is shown in Figure 2-3.
Figure 2-3 Block Diagram of DCT-based CODEC a) Encoder b) Decoder
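For concreteness, a direct (unoptimized) C implementation of Equation 2.3 might look as follows; practical CODECs use fast factorizations rather than this O(N⁴) form, but the naive version mirrors the equation exactly.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

/* Naive 2-D forward DCT of an NxN block, implementing Equation 2.3
 * directly: f holds the input samples, F the output coefficients. */
static void dct_2d(const double f[N][N], double F[N][N])
{
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++) {
            double au = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double av = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    sum += f[i][j]
                         * cos((2 * i + 1) * u * M_PI / (2.0 * N))
                         * cos((2 * j + 1) * v * M_PI / (2.0 * N));
            F[u][v] = (2.0 / N) * au * av * sum;
        }
}
```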
2.3.3 Quantization
There are typically many near-zero coefficients in the transformed block.
Quantization is used to discard these less important coefficients by dividing
each coefficient by an integer. Only significant DCT coefficients are left
after quantization so that compression of the image or residual is obtained.
Quantization is an irreversible process; the data loss it causes cannot
be recovered. The number of discarded coefficients can be varied through the
quantization step size (Q). A large Q tends to throw away most of the
coefficients, retaining only the most important large-value ones; conversely,
a small Q keeps more coefficients in the quantized block. The value of Q
determines the number of zero coefficients generated by quantization and
affects both the video quality and the final compression rate.
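A minimal sketch of such scalar quantization with a round-to-nearest rule is shown below; note that H.264's actual scheme folds the transform scaling into table-driven multiplications and avoids the division used here, as discussed in Chapter 4.

```c
/* Uniform scalar quantization with step size Q: quant maps a transform
 * coefficient to a level; dequant reconstructs an approximation. The
 * information lost is coef - dequant(quant(coef, Q), Q). */
static int quant(int coef, int Q)
{
    /* round-to-nearest integer division, preserving the sign */
    return (coef >= 0) ? (coef + Q / 2) / Q : -((-coef + Q / 2) / Q);
}

static int dequant(int level, int Q)
{
    return level * Q;
}
```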
2.3.4 Entropy Coding
After quantization, only a few non-zero coefficients remain, typically at low
frequencies around the DC coefficient. The quantized DCT coefficients are
further coded by three steps: reordering, run-level coding and entropy
coding. The quantized coefficients are reordered into a one-dimensional
array by scanning them in zigzag order. The DC coefficient is at the first position
of the array, followed by the remaining AC coefficients from low frequency
to high frequency. Since most of the high frequency coefficients tend to be
zero, this arrangement separates the non-zero and zero coefficients. The
rearranged coefficient array is coded as a series of run-level pairs: “run” is
the number of consecutive zeros before the next non-zero coefficient, whose
value is represented by “level”. The run-level pairs and other coding information
(such as motion vector and prediction types) are further compressed by
entropy coding such as Variable Length Codes (VLC). The more frequently
occurring pairs are represented by shorter codes whereas the infrequently
occurring pairs are represented by longer codes. The most popular statistical
algorithms used in entropy coding are Huffman or modified Huffman coding,
arithmetic coding and Context-Adaptive Binary Arithmetic Coding (CABAC).
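The reordering and run-level steps can be sketched in a few lines of C; the table below is the standard zigzag scan order for a 4×4 block, and the buffer handling is simplified for illustration.

```c
/* Zigzag scan order for a 4x4 block: the n-th entry is the row-major
 * index of the n-th coefficient visited. */
static const int zigzag4x4[16] = {
     0,  1,  4,  8,
     5,  2,  3,  6,
     9, 12, 13, 10,
     7, 11, 14, 15
};

/* Reorder a quantized 4x4 block in zigzag order and emit (run, level)
 * pairs; returns the number of pairs written to runs[] / levels[]. */
static int run_level_encode(const int block[16], int runs[16], int levels[16])
{
    int npairs = 0, run = 0;
    for (int n = 0; n < 16; n++) {
        int coef = block[zigzag4x4[n]];
        if (coef == 0) {
            run++;                  /* count consecutive zeros */
        } else {
            runs[npairs] = run;     /* zeros before this coefficient */
            levels[npairs] = coef;  /* the non-zero value ("level")  */
            npairs++;
            run = 0;
        }
    }
    return npairs;  /* trailing zeros after the last pair are implicit */
}
```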
2.4 Video Coding Standards
ITU-T Video Coding Experts Group (VCEG) and ISO Motion Picture
Experts Group (MPEG) are two formal organizations that develop video
coding standards. These standards are designed for a variety of video
applications. ITU-T standards are called Recommendations; the H.26x series
(H.261 [8], H.262 [3], H.263 [2] and H.264 [9]) is designed for
applications such as video conferencing and video telephony. Meanwhile,
ISO/IEC MPEG is responsible for the MPEG-x series: MPEG-1 [10],
MPEG-2 [3], MPEG-4 [11], MPEG-7 [12] and MPEG-21 [13], which address
the problems of video storage, broadcast video and video streaming
over the internet and mobile networks.
2.4.1 H.263
H.263 [14], originally standardized by ITU-T in 1993, is a video coding
standard for low-bit-rate video communication over the Public Switched
Telephone Network (PSTN) and mobile networks, with transmission bitrates
of around 10-24 kbps or above.
2.4.2 H.263+ and H.263++
Following the wide use of the first version of the H.263 standard, new
negotiable options were added, leading to the second version of the standard,
known as H.263+ [15].
2.4.3 MPEG-4
MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture
Experts Group). MPEG-4 builds on the proven success of three fields:
(1) digital television; (2) interactive graphics applications (synthetic
content); and (3) interactive multimedia (the World Wide Web, and
distribution of and access to content).
2.4.4 Other MPEG standards
Other MPEG standardization efforts and approval dates are given in Table 2-1.
Table 2-1 MPEG approval dates
2.4.5 H.264
H.264, previously known as H.26L, was initiated by ITU-T VCEG in
1998. In 2001, VCEG and ISO MPEG established the Joint Video Team
(JVT) [16] to take responsibility for developing it into a standard. The
standard is now complete and officially called Advanced Video Coding
(AVC); it is also known as ITU-T H.264 and ISO MPEG-4 Part 10. The main
objective of the emerging H.264 is to improve coding performance and
efficiency with a simple syntax specification. The basic video coding
approach used in H.264 is very similar to that adopted in previous standards,
such as H.263. However, new features and enhanced prediction methods
enable it to provide low bit rates and low coding delay, albeit at higher
coding complexity. H.264 can be applied in a variety of video applications:
internet video streaming, mobile video, high definition TV, video storage on
DVD and so on. H.264 is discussed in depth in the following chapters.
Figure 2-4 summarizes the evolution of the ITU-T Recommendations
and the ISO/IEC MPEG standards.
2.4.5.1 Overview of H.264
The main objective behind the H.264 project was to develop a high-
performance video coding standard by adopting a “back to basics” [16]
approach, in which a simple and straightforward design built from well-known
building blocks is used. The emerging H.264 standard has a number of
advantages that distinguish it from existing standards, while at the same
time, sharing common features with other existing standards. The following
are some of the key advantages of H.264:
1. Up to 50% bit rate savings: Compared to H.263v2 (H.263+) or
MPEG-4 Simple Profile, H.264 permits a reduction in bit rate of up to 50%
for a similar degree of encoder optimization at most bit rates [17].
2. High quality video: H.264 offers consistently good video quality at high
and low bit rates.
3. Error resilience: H.264 provides the tools necessary to deal with packet
loss in packet networks and bit errors in error-prone wireless networks.
4. Network friendliness: Through the Network Adaptation Layer, H.264 bit
streams can be easily transported over different networks.
Figure 2-4 Progression of the ITU-T Recommendations and MPEG standards
2.4.5.2 Technical Description of H.264
The main objective of the emerging H.264 standard is to provide a means to
achieve substantially higher video quality than could be achieved using any
of the existing video coding standards. Nonetheless, the
underlying approach of H.264 is similar to that adopted in previous standards
such as H.263 and MPEG-4, and consists of the following four main stages:
1. Dividing each video frame into blocks of pixels so that processing of the
video frame can be conducted at the block level.
2. Exploiting the spatial redundancies that exist within the video frame by
coding some of the original blocks through transform, quantization and
entropy coding (or variable-length coding).
3. Exploiting the temporal dependencies that exist between blocks in
successive frames, so that only changes between successive frames need to
be encoded. This is accomplished by using motion estimation and
compensation. For any given block, a search is performed in one or more
previously coded frames to determine the motion vectors that are then used
by the encoder and the decoder to predict the subject block.
4. Exploiting any remaining spatial redundancies that exist within the video
frame by coding the residual blocks, i.e., the difference between the original
blocks and the corresponding predicted blocks, again through transform,
quantization and entropy coding.
2.4.5.3 Design Feature Highlights of H.264
The H.264/AVC design covers a Video Coding Layer (VCL), which is
designed to efficiently represent the video content, and a Network
Abstraction Layer (NAL), which formats the VCL representation of the
video and provides header information in a manner appropriate for
conveyance by a variety of transport layers or storage media (Figure 2-5).
Figure 2-5 Structure of H.264/AVC video encoder.
Relative to prior video coding methods, as exemplified by MPEG-2 video,
some highlighted features of the design that enable enhanced coding
efficiency include the following enhancements of the ability to predict the
values of the content of a picture to be encoded:
Variable Block-Size Motion Compensation with Small Block Sizes: This
standard supports more flexibility in the selection of motion compensation
block sizes and shapes than any previous standard, with a minimum luma
motion compensation block size as small as 4×4.
Decoupling Of Referencing Order From Display Order: In prior
standards, there was a strict dependency between the ordering of pictures for
motion compensation referencing purposes and the ordering of pictures for
display purposes. In H.264/AVC, these restrictions are largely removed,
allowing the encoder to choose the ordering of pictures for referencing and
display purposes with a high degree of flexibility constrained only by a total
memory capacity bound imposed to ensure decoding capability. Removing this
restriction also eliminates the extra delay previously associated with
bi-predictive coding.
Decoupling Of Picture Representation Methods From Picture
Referencing Capability: In prior standards, pictures encoded using some
encoding methods (namely bi-predictively-encoded pictures) could not be
used as references for prediction of other pictures in the video sequence. By
removing this restriction, the new standard provides the encoder more
flexibility and, in many cases, an ability to use a picture for referencing that
is a closer approximation to the picture being encoded.
Weighted Prediction: An innovation in H.264/AVC allows the motion-
compensated prediction signal to be weighted and offset by amounts specified
by the encoder. This can dramatically improve coding efficiency for scenes
containing fades, and can be used flexibly for other purposes as well.
In-The-Loop Deblocking Filtering: Block-based video coding produces
artifacts known as blocking artifacts. These can originate from both the
prediction and residual difference coding stages of the decoding process.
Application of an adaptive deblocking filter is a well-known method of
improving the resulting video quality, and when designed well, this can
improve both objective and subjective video quality. Building further on a
concept from an optional feature of H.263+, the deblocking filter in the
H.264/AVC design is brought within the motion-compensated prediction
loop, so that this improvement in quality can be used in inter-picture
prediction to improve the ability to predict other pictures as well.