Chapter 1
Introduction
1.1 Preface
This thesis is meant to be the final report of three years of research in the context of the doctoral curriculum Dottorato di Ricerca in Ingegneria dell'Informazione (XIV Ciclo) on the topic of real-time video analysis.
High-speed processing of videos is a key need for many fields. First, multimedia applications, in which videos are growing in relevance. Consider for instance the videos delivered through the web: standards such as MPEG-1, MPEG-2, MPEG-4 and the upcoming MPEG-7 are video co-decs (COmpressor-DECompressors) very frequently used to broadcast videos through the web. In fact, the bandwidth limitation of current web infrastructures prevents the transmission of a huge video as it is. Compressing the video before transmission and decompressing it on the other side is more efficient, since it consumes less bandwidth. In the MPEG standards (especially in the more recent ones) the main part of the co-dec algorithm is the shape coding of the objects moving in the scene: this is, indeed, a typical video analysis task.
A second very large field of application of video analysis is the pure extraction of information from the video itself. The level of the information to be extracted characterizes the video analysis application. Such applications range from shot detection (low level of information), through object detection and tracking (medium level), to scene understanding and modeling (high level). For example, the shot detection task is used to segment a video into scenes, where a scene is a sub-sequence of the video (i.e. a sequence of consecutive frames) with a homogeneous context. This is a very useful task for indexing videos and for context-based information retrieval from videos.
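As an illustrative sketch (ours, not from the thesis), shot detection is often implemented by comparing coarse intensity histograms of consecutive frames and declaring a cut where the distance jumps; the flat-frame format, bin count and threshold below are assumptions made for the example:

```python
def gray_histogram(frame, bins=16):
    """Coarse grayscale histogram, normalized to sum to 1."""
    hist = [0] * bins
    for pixel in frame:                      # frame: flat list of 0..255 values
        hist[pixel * bins // 256] += 1
    total = len(frame)
    return [h / total for h in hist]

def detect_shots(frames, threshold=0.5):
    """Return frame indices where a new shot starts (L1 histogram jump)."""
    cuts = []
    prev = gray_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = gray_histogram(frame)
        dist = sum(abs(a - b) for a, b in zip(cur, prev))
        if dist > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic clip: 5 dark frames, then 5 bright frames -> one cut at frame 5.
dark = [20] * 64
bright = [220] * 64
clip = [dark] * 5 + [bright] * 5
print(detect_shots(clip))   # -> [5]
```

A real detector would work on decoded frames and tune the threshold per context, but the histogram-difference core is the same low-level operation.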
Object detection and tracking from a sequence of images is probably the most widespread field of video analysis applications. It is a key process for video-based traffic analysis and management systems, for video-surveillance and security systems, for target detection and pointing in military applications, and for many others. Accordingly, most of the research on video analysis reported in the literature addresses this topic.
Lastly, the scene understanding and modeling task uses the information
from the lower levels to model the scene (and, typically, also the objects
present in the scene) in order to understand the behaviour of the objects or
to represent the scene with a higher level of description.
All the above-mentioned applications typically require real-time (or quasi real-time) execution and are characterized by a huge amount of data to be processed. Consider, for instance, the real-time processing of a PAL-standard video (25 frames/sec) at a low resolution of 320×240 pixels. If we have color images (that is, 3 channels per pixel using the RGB color space), each frame will require 320×240×3 = 230,400 bytes and must be processed within 40 msec. Even at this low resolution, a simple transfer of the data requires a bandwidth of 5.49 MB/sec. Studying and, consequently, improving the performance and the efficiency of such systems is one of the main topics of the research described in this thesis. The study has addressed both the hardware and the software point of view, trying to propose solutions that fit both an embedded specialized system and a general-purpose one.
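The arithmetic behind these figures can be checked directly; the sketch below (ours, not from the thesis) reproduces the 40 msec deadline and the 5.49 MB/sec bandwidth quoted above:

```python
width, height, channels = 320, 240, 3    # RGB frame, one byte per channel
frame_rate = 25                          # PAL frames per second

bytes_per_frame = width * height * channels          # 230,400 bytes per frame
deadline_ms = 1000 / frame_rate                      # 40 ms to process each frame
bandwidth_mb = bytes_per_frame * frame_rate / 2**20  # bytes/sec -> MB/sec

print(f"{bytes_per_frame} bytes/frame, {deadline_ms:.0f} ms deadline, "
      f"{bandwidth_mb:.2f} MB/sec")
```

Note that 5.49 here uses binary megabytes (2^20 bytes); in decimal units the same stream is 5.76 MB/sec.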
Besides improving the performance of video analysis applications, during this research new computational models and algorithms for video analysis have been analyzed and defined. In particular, this research has developed novel algorithms for motion detection and moving object segmentation in cluttered and hostile environments, such as outdoor scenes in which sudden changes of the lighting conditions, frequent occlusions of moving objects by buildings, poles, and so on, and the presence of shadows are very limiting factors.
Moreover, this research also covers motion analysis in high-speed videos, that is, videos in which the objects move at very high speed and in which noise often renders the images almost unusable. This last topic is very promising, and little research has been done on it so far by the computer vision community.
This interest in video analysis has resulted, in the last decades, in the wide diffusion of international journals, conferences and workshops on this and related topics. Moreover, substantial funding has been addressed to this topic. In fact, the research described in this thesis has been supported by the following funding:
• Funds for the Progetto di Ricerca Orientata "Estrazione di informazioni visuali complesse in tempo reale: modelli computazionali e tecniche di elaborazione di immagini" (a project for oriented research titled "Complex visual information extraction in real-time: computational models and image processing techniques")
• Financial support to young researchers for the research on "Analisi di sequenze di immagini per sorveglianza e controllo del traffico" ("Analysis of image sequences for surveillance and traffic control")
• A contract for the "Analysis of Camera Car Video of Formula 1", supported by Ferrari SpA - Gestione Sportiva
• Partial support from, and collaboration with, the Department of Electrical and Computer Engineering of the University of California, San Diego (UCSD), Computer Vision and Robotics Research (CVRR) laboratory, headed by Prof. Mohan M. Trivedi, for work on the project ATON (Autonomous Agents for On-Scene Networked Incident Management)
• Partial funding from the "PROGRAMMA STRATEGICO PER LA MOBILITÀ NELLE AREE METROPOLITANE - BOLOGNA" of the Italian Ministry of Public Works
• Funds for a Progetto di Ricerca di Interesse Nazionale, supported by MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca), with the title "Sistemi Web ad elevata qualità del servizio" (a national project on "Web systems with high quality of service")
1.2 Research Goals
Having in mind this preface, we can now outline the goals of this research. As already stated, the first goal is the definition of computational models able to satisfy the highly demanding requirements of real-time video analysis. The first solution studied exploits the natural speed of hardware systems in performing time-consuming operations. In fact, we initially explored the possible architectures able to improve the performance of certain algorithms, for example by parallelizing the computation. We studied and developed dedicated architectures for real-time (frame rate) video processing on an FPGA (Field Programmable Gate Array) board. The aim was to evaluate the hardware solution for a vision-based traffic control system that must satisfy real-time constraints.
This dedicated solution has been discarded for two main reasons: the first is the still high cost of reconfigurable devices (such as FPGAs) and the lack of availability of such resources at our lab; the second is the availability nowadays of cheap and powerful general-purpose systems able to reach almost the same performance as specialized systems. For this reason, we focused on models able to improve the performance on general-purpose systems. It is well known that in such systems the bottleneck is represented by the memory hierarchy, which delays the CPU execution. In particular, in the computer architecture community much effort has been devoted to studying, modeling and improving the performance of cache memories. For this reason, this work will present a comprehensive study of the locality, the obtainable performance and the possible improvements of a cache for image processing and multimedia applications, with particular focus on video processing applications.
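To make the role of locality concrete, the following toy sketch (our illustration; the cache geometry of 64 lines × 64 bytes is an arbitrary assumption, not the configuration studied in this thesis) simulates a direct-mapped cache and compares a row-major scan of a frame, which reuses each fetched line, with a column-major scan, whose large stride defeats the cache:

```python
def hit_rate(addresses, num_lines=64, line_size=64):
    """Simulate a tiny direct-mapped cache; return the fraction of hits."""
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        line = addr // line_size      # which cache line the byte belongs to
        index = line % num_lines      # direct-mapped placement
        tag = line // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag         # miss: the line is fetched and installed
    return hits / len(addresses)

W, H = 320, 240   # one 8-bit grayscale frame, stored row-major

# Row-major scan: consecutive addresses, each 64-byte line serves 64 accesses.
row_major = [y * W + x for y in range(H) for x in range(W)]

# Column-major scan: stride of W bytes, every access lands on a new line.
col_major = [y * W + x for x in range(W) for y in range(H)]

print(f"row-major hit rate:    {hit_rate(row_major):.3f}")
print(f"column-major hit rate: {hit_rate(col_major):.3f}")
```

With these parameters the row-major scan hits about 98% of the time, while the column-major scan misses on every access: exactly the kind of gap that locality analysis and prefetching for image processing workloads aim to exploit.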
In the general-purpose context, besides the performance analysis and improvement, novel algorithms and approaches for motion detection have been studied. The goal is to develop new techniques for moving object segmentation and for object and feature tracking that can result in a further improvement of both efficiency and efficacy. To this end, a complete system, called Sakbot (Statistical And Knowledge-Based Object Tracker), has been developed and deeply tested in many different contexts and applications, from traffic analysis to video-surveillance.
1.3 Video Analysis Requirements
Real-time video analysis applications are necessarily performed on-line, that is, with the images fed directly (live) from a camera. In this context we can have, basically, the two situations reported in Fig. 1.1. The setup sketched in Fig. 1.1(a) is the case in which the camera is directly connected to the computer by means of a frame grabber or another acquisition device. In this case the real-time constraints are due to the video standard used and to the speed of the acquisition device. In the second case (Fig. 1.1(b)) the data acquired by the camera are processed by a video server that has the task of, if necessary, compressing/decompressing the video data, performing some pre-processing and providing a user-friendly interface for the application. The video server then sends the video data through a web architecture to client computers, where they are visualized or further processed. With these premises, in the second situation the performance is also degraded by the web's bandwidth.
Figure 1.1: Local vs remote processing in a video analysis application: (a) local processing; (b) distributed processing.

We can summarize the factors that drive the requirements for video analysis applications into four classes:
1) data type: whether we have color images or not, and at which resolution and frame rate, is relevant information for the application we are going to study and develop. In the case of video analysis this implies a huge amount of data and a large required bandwidth. Moreover, in the case of distributed applications the data type will influence the co-dec functioning too;
2) hardware available: which hardware is available in the system is very important. Besides being of great relevance for the performance of the system (see the above considerations on the acquisition device), the hardware can be used to improve the efficiency of the system. Consider for instance the MPEG decoder/encoder boards that are currently spreading in home PCs. In conclusion, the requirements of the application can be relaxed by devolving some processing to specialized hardware;
3) local/distributed processing: as reported in Fig. 1.1, the video analysis task has different requirements depending on the type of processing;
4) type of the information to be extracted: as already stated, the level, the amount and the complexity of the information to be extracted by the application (i.e. the final aim of the application) are key factors in evaluating the required computational load.
1.4 Structure of the Thesis
This thesis has been divided into three main parts, in accordance with the steps in which this research has been conducted. The first part will describe the study of embedded, special-purpose systems as a solution to real-time video analysis. We will first describe the development of a Real Time Convolver (RTC) with a parallelized systolic architecture. The system has been improved by functionally partitioning it onto a multi-FPGA device. The performance and the limits of the proposal will be discussed.
Moreover, the FPGA solution is proposed for a UTC (Urban Traffic Control) system called VTTS (Vehicular Traffic Tracking System). In this case, the low-level module for daytime conditions is described in depth, proposing a multi-FPGA implementation and reporting its performance on our prototypal board.
The second part of the thesis will focus on the multimedia cache research. In this part, which has been the main part of this three-year research, a cache tuned to multimedia and image processing applications has been studied. An a-priori feasibility analysis by means of a locality study has enabled the comprehensive development of novel prefetching techniques able to improve the overall performance of the cache by up to 140%. The cache has been tested on a complete benchmark including both multimedia and image processing algorithms.
The last part is, indeed, the largest, since it includes two of the most profitable topics of our research: motion detection and the study of shadow detection algorithms. In the first chapter of this part the already-mentioned Sakbot system will be described in depth, with particular focus on the shadow detection algorithm. The second chapter will, instead, partly summarize the previous one to present a comprehensive empirical evaluation and comparison of the state of the art on moving shadow detection. A two-layer taxonomy will be introduced and more than 21 papers dealing with this topic will be classified. Four of them (including the one that we developed for Sakbot) have been implemented in software and compared by means of novel quantitative and qualitative metrics.
Lastly, preliminary results of high-speed video analysis, with the aim of computing the steering-wheel angle of a Formula 1 car, will be presented in the last chapter of this part. This topic is, indeed, very new and only preliminary results are available. Nonetheless, it seems very promising and will hopefully be a very relevant topic for future research.
Part I
Architectures and Models with
Embedded Systems
Chapter 2
Introduction
2.1 Preface
Dedicated hardware solutions and reconfigurable/user-dedicated CCMs (Custom Computing Machines) are the topic of intense worldwide research. In particular, great effort has been made to implement dedicated architectures for image processing algorithms [1][2][3]. This is due to the high computational load, the large amount of resources needed and (sometimes) the real-time constraints typical of these applications. Several different solutions have been proposed, for example dedicated VLSI chips (e.g. Plessey's PDSP 16488) or DSPs optimized for image processing (e.g. Texas Instruments' TMS320C80). In this research, an FPGA (Field Programmable Gate Array)-based solution is adopted (as in [1][2][4] and many others), since it seems the most promising choice in applications where the processing speed typical of dedicated solutions has to be matched with low-cost, flexible systems capable of performing several different tasks. Moreover, FPGAs are ISPDs (In-System Programmable Devices), since re-programmability is assured at run-time (or quasi run-time). Finally, FPGAs do not suffer from the parallel-processing scalability problems of DSPs. Recently, the rapid development of FPGAs has made possible the implementation of many real-time image processing algorithms in a single FPGA chip. For example, Greenbaum and Baxter in [2] enhance the work exposed in [1] to bring 2-D block motion estimation into a single Xilinx XC4013 FPGA with several off-chip memories. Furthermore, the rapid growth of FPGA complexity and device size (e.g. the Xilinx Virtex family [5]) and the contemporary decrease of device prices make FPGA solutions more and more attractive and suitable.
The work presented in this chapter is one of the results of a research activity aimed at implementing dedicated architectures for image processing. In particular, the research activity focuses on reconfigurable devices like Field Programmable Gate Arrays (FPGAs) [6] as the most promising ones in applications where the processing speed typical of dedicated solutions has to be matched with low-cost, flexible systems capable of performing several different tasks.
The first part of this chapter will present our proposal for the implementation of a Real-Time Convolver (RTC) on an FPGA board. In particular, we focused our attention on the 2-D convolution, where an input image of size M × N has to be convolved with a K × R kernel to obtain an output image in which each pixel depends on a K × R window of neighboring pixels in the input image [7]. Results have been presented in [8].
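As a software reference for the operation the RTC implements in hardware, the following sketch (ours; a plain 'valid'-region correlation with the kernel flip omitted, so it matches true convolution only for symmetric kernels) computes each output pixel from its K × R window of input neighbors:

```python
def convolve2d(image, kernel):
    """Direct 2-D windowed sum ('valid' region only, pure Python reference).

    image: M x N list of lists; kernel: K x R list of lists.
    Output pixel (i, j) is the weighted sum of the K x R window of
    neighboring input pixels, i.e. the per-pixel job of the convolver.
    """
    M, N = len(image), len(image[0])
    K, R = len(kernel), len(kernel[0])
    out = []
    for i in range(M - K + 1):
        row = []
        for j in range(N - R + 1):
            acc = 0
            for u in range(K):
                for v in range(R):
                    acc += kernel[u][v] * image[i + u][j + v]
            row.append(acc)
        out.append(row)
    return out

# 3x3 box kernel on a 4x4 ramp image.
img = [[c + 4 * r for c in range(4)] for r in range(4)]
box = [[1] * 3 for _ in range(3)]
print(convolve2d(img, box))   # -> [[45, 54], [81, 90]]
```

The four nested loops make the cost per output pixel explicit (K × R multiply-accumulates), which is precisely the work that the systolic architecture of the next sections parallelizes.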
The second part of this chapter will instead address the study of a dedicated hardware (embedded) solution to the problem of traffic management. Traffic flow monitoring based on computer vision aims to extract information about the traffic flow from traffic scenes acquired with cameras. This information is required to substantially support traffic management policies with regularly updated data, such as the number of vehicles passing on a road per time unit, vehicle turning rates at intersections, queue length measurements, and many others. Results are described in [9] and in [10].
2.2 Prototypal Board
Both the systems mentioned in the preface have been developed using the VHDL language. The final prototype has been simulated and implemented on a multi-FPGA board designed for rapid prototyping [11]. For this purpose, this section describes the main characteristics of this board, in order to refer to specific implementation issues in the following sections. In particular, the characteristics that limit the degrees of freedom in the mapping and routing of the prototype will be highlighted.
The prototypal board we used is the GigaOps G800 Spectrum board [11], sketched in Fig. 2.1.

Figure 2.1: Block diagram of the GigaOps G800 prototypal board

The main blocks of this board are:
• Computation modules, called XMODs, each containing a pair of Xilinx XC4010E FPGAs that perform the actual computation: in Fig. 2.1 four modules (MOD0 through MOD3) are shown. The two FPGAs in each module are called YPGA and XPGA (from the name of the bus they are connected to). Both these FPGAs have two memory ports: one connected only to a 2 MByte DRAM and one connected both to a 2 MByte DRAM and to a 128 KByte SRAM device. XPGA and YPGA communicate through a bus switch on the first memory port. This switch works on two virtual busses: a 16-bit data bus and a 10-bit address bus. It is important to stress that only the YPGAs are connected to the YBUS, i.e. to the input/output data bus
• A module called SCVIDMOD (S-VIDEO, COMPOSITE VIDEO MODULE), which decodes/encodes video signals (PAL or NTSC). This module interfaces to the YBUS for data input and output
• An input FPGA (here called VLPGA) connected to the VESA local bus of the PC hosting the board. The VLPGA is interfaced with the HBUS and the YBUS. It contains all the registers needed for correct board operation (e.g. the CLKMODE register, which sets the frequencies of the clocks distributed on the board)
• An output FPGA (here called VMC) connected to the SCVIDMOD. This is an additional FPGA, directly interfaced with the video output and the XBUS
• Three main busses that allow connections among the various blocks of the board. These busses are:
– YBUS, a 32-bit I/O bus connected with the VLPGA, the VMC and the YPGAs of the XMODs
– HBUS, a 16-bit bus used to configure and to load the FPGAs
– XBUS, a 64-bit bus normally used as four 16-bit data busses. Each of these busses is connected only to the XPGAs and to the VMC.
The main data path of our applications is the following: pixels generated by the video decoder are passed through the YBUS both to the VLPGA and to the YPGAs of the XMODs. These modules process the data and pass the results either to the VMC through the YBUS or to the XPGAs through the bus switches. In the latter case, the XPGAs can perform a further computation or simply pass the results to the VMC through the 64-bit XBUS. In both cases, the VMC outputs the results of its processing on the data coming from the XBUS or the YBUS.