PART II
2. Motivations
Many network management tasks, such as flow prioritization, traffic
policing and diagnostic monitoring, increasingly require accurate
identification and categorization of network traffic according to the
type of application that has generated it [2][2].
The identification, which can be packet-, flow- or session-based, is
becoming a fundamental prerequisite for numerous other network activities,
such as granting an adequate level of QoS (e.g. differentiated services,
priority queuing, minimum bit-rate, …) or managing ISPs' billing policies
[3][4]; moreover, it can help in solving network engineering problems such
as workload characterization and modelling, capacity planning and route
provisioning.
A reliable traffic characterization could also be a good starting point
for network administrators, both to investigate sudden changes in traffic
dynamics and to counter possible security attacks.
There are (see [4]) at least three categories of application identification
methods: Session-based, Content-based and Constraint-based.
Figure 1: Application Traffic Identification Methods
The traditionally used classification methods, such as well-known port
identification or exhaustive packet payload analysis, belong, respectively,
to the first and second category; they are becoming obsolete and
ineffective in the face of emerging peer-to-peer applications and of
mechanisms such as tunnelling and encryption, used mainly to avoid
detection or violate security policies.
A detailed description of the features and drawbacks of these methods
follows.
2.1 Well-Known Port Based Methods
Known-port methods rely on the observation of TCP or UDP port numbers,
which are divided into three ranges: the Well Known Ports (0-1023), the
Registered Ports (1024-49151) and the Dynamic and/or Private Ports
(49152-65535). A typical TCP connection starts with a SYN, SYN-ACK, ACK
handshake between client and server; the client addresses its initial SYN
packet to the well-known server port of a particular application. The
source port number of the packet is typically chosen dynamically by the
client. UDP uses ports similarly to TCP, but in a connectionless way. All
successive packets in either a TCP or UDP session use the same pair of
ports to identify the client and the server side of the session;
therefore, in principle, the TCP or UDP server port number can be used to
recognize the higher-layer application, simply by identifying which port
is the server port and mapping it to an application using the IANA
(Internet Assigned Numbers Authority) list of registered ports [6].
However, these methods are often unusable because of some limitations
[1][7]:
• First, the mapping from ports to applications is not always well
defined: many implementations of TCP use client ports in the registered
range; some applications, such as P2P applications (e.g. Kazaa, Napster),
have no standard port numbers and began using dynamic ports or disguising
themselves behind the port numbers of commonly used protocols such as
HTTP and FTP; there are also ambiguities in the port registrations, and
so on.
• A second limitation is that a port can be used by a single application
to transmit traffic with different QoS requirements; for example, Lotus
Notes transmits both email and database transaction traffic using the
same ports, and scp (secure copy), a file transfer protocol, runs over
SSH (secure shell), which is also used interactively on the same port
(TCP port 22) by remote shell applications.
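
The port-based approach and its limits can be illustrated with a minimal
Python sketch. The port table is only a small, illustrative subset of the
IANA assignments, and the function names and the "try the lower-numbered
port first" heuristic are assumptions of this sketch rather than part of
any cited method.

```python
# Illustrative subset of IANA port assignments (not the full registry).
WELL_KNOWN_APPS = {
    21: "ftp",
    22: "ssh",
    25: "smtp",
    53: "dns",
    80: "http",
    443: "https",
}


def port_range(port: int) -> str:
    """Return the IANA range that a TCP/UDP port number falls into."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port number: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Guess the application of a flow from its port pair.

    Assumption of this sketch: the lower-numbered port is tried first,
    because servers usually listen on the low port of the pair.
    """
    for port in sorted((src_port, dst_port)):
        if port in WELL_KNOWN_APPS:
            return WELL_KNOWN_APPS[port]
    return "unknown"
```

With this table, a flow towards port 22 is labelled "ssh" whether it
carries an interactive shell or an scp file transfer, and a P2P flow on a
dynamic port is simply reported as "unknown", which mirrors the two
limitations listed above.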
2.2 Payload-based Analysis Methods
The aforementioned disadvantages of port-based classification led to
several payload-based analysis techniques [2], which search the payload
for characteristic signatures of known applications. These techniques
completely avoid the reliance on fixed port numbers [7].
In the so-called 'Signature Matching Method', a portion of payload data
that is static, unique and distinguishable, referred to as the signature
of the application, is searched for in the traffic of all applications,
regardless of their protocol. This method tries to identify the
application by comparing every packet payload with pre-determined
signatures. Many Network Intrusion Detection Systems (NIDS) rely on
signature-based techniques to recognize known attack patterns on standard
service ports. These methods are chosen for their speed and their
efficiency in recognizing known attacks without generating too many false
alarms ([10], [11]).
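
A minimal Python sketch of signature matching follows. It simplifies the
method to a prefix comparison at the start of the payload; the signature
list is a small illustrative set, and the names SIGNATURES and
match_signature are assumptions of this sketch.

```python
from typing import Optional

# Illustrative payload prefixes only; a real classifier needs a much
# larger set that is updated regularly.
SIGNATURES = [
    ("http", b"GET "),
    ("http", b"POST "),
    ("http", b"HTTP/1."),
    ("ssh", b"SSH-"),
    ("bittorrent", b"\x13BitTorrent protocol"),
]


def match_signature(payload: bytes) -> Optional[str]:
    """Return the first application whose signature starts the payload."""
    for app, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return app
    return None  # no known signature: the payload stays unclassified
```

A payload that matches none of the stored prefixes, for instance because
it is encrypted, is left unclassified by this sketch.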
The Protocol Matching Method shares the concept of signature matching,
but it needs to be aware of the complete protocol format. Ethereal [9],
which will be used in the practical part of this thesis, is a monitoring
tool that offers protocol matching functionality.
Despite some benefits, these payload-based mechanisms require an
exhaustive search in advance and frequent updates of the signature
information in order to maintain high accuracy; these operations impose
significant complexity and processing load on the traffic identification
device [7].
Moreover, they become useless in the face of tunnelling and encryption
mechanisms, as the following sections show.
2.2.1 Tunnelling Techniques
The application-level payload of at least two protocols (HTTP and DNS)
can in principle be used to encapsulate packets generated by other
protocols and to carry them, hidden, in and out of a given network. By
exploiting this feature and the fact that network administrators normally
let HTTP and DNS traffic cross their network boundaries, one can install
entry and exit points in different places on the Internet and thereby
bypass any security policy enforced by firewalls or proxies [12].
A popular, open source package capable of tunnelling any application-level
protocol into HTTP is [13]. It provides two daemons, htc and hts, running
at the two ends of the tunnel: htc listens for incoming TCP connections on
a given port and, when a connection is established, opens a pair of HTTP
sessions towards hts, which runs at the opposite side of the tunnel. For
example, if SMTP (port number 25) is tunnelled into HTTP (port number 80),
hts will forward any incoming connection on port 80 to port 25, while htc
will redirect any request to port 80 of the server.
The packets of the tunnelled flows are encoded so that they can be
incorporated in a regular, semantically valid HTTP session; an analysis of
the TCP payloads, even if performed by means of pattern matching, could
not reveal any difference between the htc/hts traffic and a true HTTP
flow.
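
The effect on a payload-based classifier can be illustrated with a small
Python sketch. It is not the encoding used by the package in [13]: the
Base64 body, the POST request layout, the host name tunnel.example and
the function names are all assumptions made for illustration.

```python
import base64


def wrap_in_http(payload: bytes, host: str = "tunnel.example") -> bytes:
    """Hide an arbitrary payload in the body of a well-formed HTTP POST."""
    body = base64.b64encode(payload)
    head = (
        "POST /index.html HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/octet-stream\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def unwrap_from_http(message: bytes) -> bytes:
    """Recover the original payload at the other end of the tunnel."""
    _head, _sep, body = message.partition(b"\r\n\r\n")
    return base64.b64decode(body)
```

An SMTP command wrapped in this way starts with the bytes "POST " and no
longer contains the SMTP keywords in clear text, so a signature matcher
that inspects the payload would label the packet as HTTP.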
2.2.2 Encrypted Traffic Examples
A practical example of the ineffectiveness of signature-based methods when
cryptography is employed is Skype traffic [14]. Skype is a very popular
VoIP application whose protocols and algorithms are undisclosed: it
follows a closed-source, proprietary design that relies on strong
encryption mechanisms, so it is very difficult even to detect the presence
of Skype traffic in a traffic aggregate. Only a few pieces of information
about how Skype messages are built are available: a Codec encodes the
voice, a Framer multiplexes some encoded blocks into a single Skype frame,
a Cypher encrypts each frame once it has been created, and finally an
additional unencrypted header (Start of Message, SoM) may be added. The
result is a Skype message. A payload-based classifier can be used only if
the SoM is present. Although Payload Based Classification (PBC) is made
difficult by both obfuscation and cryptographic techniques such as the AES
and RSA algorithms [15], Skype flows that employ UDP must use the SoM,
because packets may be reordered or dropped (UDP is unreliable). Without
this optional SoM, encryption would make any PBC infeasible; moreover, a
PBC performs best when it is used together with complementary tools [14].
All these reasons led us towards the third category of identification
methods, the Constraint-based ones, and in particular towards stochastic
identification.
3. Statistical Types of Identification
These methods belong to the third category depicted in Figure 1: the
Constraint-based methods. This is actually a sub-category of session-based
identification; what characterizes these methods is that they borrow
concepts generally used in statistics and normally do not require any
application-level protocol information [4].
3.1 Previous Work
The idea of using the statistical properties of network traffic to
classify flows, or at least to describe their behaviour, is not new.
Pioneering studies by Paxson et al. on Internet traffic characterization
([17] and [18]) focus on the relationship between the observed statistical
properties of flows and the application protocols that generated them.
Although these papers show that analytical models describing random
variables can be suitable to express the behaviour of a few protocols,
they make no attempt to classify flows according to application-layer
protocols. This goal is reached by Mena et al. [19], who showed how Real
Audio flows may be identified among aggregates through a simple analysis
of packet lengths and inter-arrival times.
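
As an illustration of the kind of per-flow features involved, the
following Python sketch computes simple statistics of packet lengths and
inter-arrival times. The choice of mean and standard deviation, the packet
representation and the name flow_features are assumptions of this sketch,
not the analysis performed in [19].

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Packet = Tuple[float, int]  # (arrival time in seconds, length in bytes)


def flow_features(packets: List[Packet]) -> Dict[str, float]:
    """Summarize one flow by packet-length and inter-arrival statistics."""
    if len(packets) < 2:
        raise ValueError("at least two packets are needed")
    ordered = sorted(packets)  # sort by arrival time
    lengths = [length for _, length in ordered]
    # Inter-arrival time: gap between consecutive packet arrivals.
    gaps = [b[0] - a[0] for a, b in zip(ordered, ordered[1:])]
    return {
        "mean_length": mean(lengths),
        "std_length": pstdev(lengths),
        "mean_gap": mean(gaps),
        "std_gap": pstdev(gaps),
    }
```

A constant-rate stream with fixed-size packets yields standard deviations
close to zero for both quantities, whereas bursty or interactive traffic
produces much larger spreads.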
A similar approach has been used in [29] to analyze chat traffic. Stemming
from the observation that this kind of traffic is dominated by human
interactions, this work proved the feasibility of identifying chat flows,
whether they use their own transport protocol or are layered on top of
other application protocols such as HTTP. To overcome one of the key
issues with statistically trained classifiers, i.e. the lack of verifiable
reference data, this work was based on the statistical analysis of
Internet Relay Chat traffic traces, since such traffic flows are easily
identifiable even by payload analysis. This work, however, focuses
exclusively on a single class of applications.
Other approaches (see [22]) confirm the possibility of discriminating
between different application classes with the objective of supporting
service differentiation. A recent work by Bernaille et al. [20] proposes
the use of clustering techniques to achieve fine-grained classification
based on the size and direction of packets; in [21], Nilsson et al. also
focus on the statistical analysis of network traffic and show promising
results for fine-grained protocol classification.
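
To make the idea of clustering on packet size and direction concrete, the
following Python sketch groups flows with a plain k-means procedure. The
source only speaks of clustering techniques: the use of k-means, the
encoding of each flow as the signed sizes of its first packets (positive
from client to server, negative in the opposite direction) and the
deterministic initialisation are all assumptions of this sketch.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return one cluster label per point using plain k-means."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic initialisation: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    # Final assignment against the last computed centroids.
    return [nearest(p, centroids) for p in points]
```

For example, flows made of a small request followed by large responses,
such as [120, -1400, -1400], end up in a different cluster from flows of
small packets in both directions, such as [60, -60, 70].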