basic concepts since bioinformatics is a borderline area: a computer scientist
will find that biological notions are particularly useful and vice versa.
The second part of the thesis will first analyse the software instruments
applied in the effort to develop a procedure as fast and accurate as possible.
The variety of programs and scripts involved will be described in detail, as
well as the general strategy based on the idea of a two-step searching
method. Results achieved through the method will be presented and
discussed in the main part of the section.
An extensive bibliography is given for every major subject examined as
well as web references to resources and databanks.
Support material to this thesis can be retrieved from the web site
http://bio.lundberg.gu.se/srpscan/paper.html [117].
Part A: Biological and bioinformatics background
2. Biological background: noncoding RNA
2.1 Flow of genetic information
The genetic information stored in DNA is expressed through two
essential processes: transcription of DNA into mRNA and translation of
mRNA into proteins.
Messenger RNA thus serves as a template for protein synthesis:
triplets of bases in the mRNA code for amino acids in proteins following
specific rules known as the “genetic code”.
Other RNAs (transfer RNA, tRNA, and ribosomal RNA, rRNA) are also
involved in this process: tRNA plays a key role in the recognition of the
triplets, while rRNA is the major component of the ribosome, with both a
catalytic and a structural role in protein synthesis. All forms of cellular RNA
are synthesized by RNA polymerases, which use DNA as a template; proteins
are then synthesized by the ribosome using mRNA templates.
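
The two steps described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the example sequence and the deliberately partial codon table are hypothetical and are not part of the procedure described in this thesis. The codon assignments shown are taken from the standard genetic code.

```python
# Partial codon table (standard genetic code); only a few codons are listed.
# The table, the function names and the example sequence are illustrative assumptions.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand):
    """Return the mRNA sequence matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA in triplets from its first base and stop at a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from the partial table above.
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


mrna = transcribe("ATGTTTGGCGCTTAA")
print(mrna)             # AUGUUUGGCGCUUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly', 'Ala']
```

A complete implementation would need the full table of 64 codons and a search for the correct start codon, both of which are omitted here.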
2.2 Structure of nucleic acids
Despite the importance of nucleic acids in cellular function, their structure is
unexpectedly simple. RNA and DNA are long polymers built from only four
kinds of nucleotides, distinguished by their bases: adenine, guanine, cytosine
and thymine (uracil replaces thymine in RNA).
The nucleotide structure can be broken up into two parts: the sugar-
phosphate backbone and the base. All nucleotides share the sugar-phosphate
backbone. Nucleotide polymers are formed by linking the monomer units
together through the phosphate group of one nucleotide and a hydroxyl group
on the sugar of the next: the 3'-hydroxyl group on the sugar unit reacts with
the 5'-phosphate group of its neighbour, so that A, T (or U), G and C can be
joined into a long chain. The base on each
nucleotide is different, but the bases fall into two families: adenine (A) and
guanine (G) are purines, which share a two-ring structure and differ only in
the groups attached to the rings. Similarly, cytosine (C), thymine (T) and
uracil (U) are pyrimidines: they share a single-ring structure but differ in
their side groups.
If two strands of nucleic acid are adjacent to one another, the bases along
one strand can interact with complementary bases in the other strand.
Adenine pairs with thymine (or with uracil in RNA) through two hydrogen
bonds, while cytosine pairs with guanine through three.
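
As a small worked example of these pairing rules, the following Python sketch counts the hydrogen bonds in a short duplex. The function name and the input convention (the second strand is written so that position i pairs with position i of the first strand) are assumptions made for illustration only.

```python
# Hydrogen bonds per Watson-Crick pair, as stated in the text above.
H_BONDS = {frozenset("AT"): 2, frozenset("AU"): 2, frozenset("GC"): 3}


def duplex_hydrogen_bonds(strand, partner):
    """Sum the hydrogen bonds over position-wise pairs of two aligned strands.

    Assumption: `partner` is written so that partner[i] pairs with strand[i].
    """
    if len(strand) != len(partner):
        raise ValueError("strands must have equal length")
    total = 0
    for a, b in zip(strand.upper(), partner.upper()):
        pair = frozenset((a, b))
        if pair not in H_BONDS:
            raise ValueError(f"{a}-{b} is not a Watson-Crick pair")
        total += H_BONDS[pair]
    return total


print(duplex_hydrogen_bonds("GATC", "CTAG"))  # 10
```

The example duplex contains two G-C pairs (three bonds each) and two A-T pairs (two bonds each), which gives ten hydrogen bonds in total.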
2.2.1 DNA structure
Cellular DNA consists of two strands that are complementary to each other.
When correctly aligned, A can pair with T and G can pair with C: in
solution, the two strands will usually find each other and form a double
helix. This reaction is favourable because of the numerous hydrogen bonds
that can be formed between the complementary bases. The DNA molecule
can stretch for millions of base pairs and the DNA sizes of different
organisms can vary greatly.
2.2.2 RNA structure
RNA is similar in structure to DNA except that uracil takes the place of
thymine and that the sugar unit of each nucleotide (ribose) carries an
additional hydroxyl group at the 2' position. The RNA in most cells exists as
a single strand but, if complementary base sequences are present in the RNA,
it can fold back upon itself and base pair.
This secondary structure of RNA often results in loops and stems that
drastically affect the function of the molecule. RNA with an extensive
amount of secondary structure plays important physiological roles in
translation, in transcription and in DNA replication.
2.3 Functions of RNA molecules
A number of RNAs that do not function as mRNAs, transfer RNAs
(tRNAs), or ribosomal RNAs have been discovered. In the literature, the
non-mRNAs have been referred to in many different ways: the term small
RNAs (sRNAs) has been more common in bacteria, whereas the term
noncoding RNAs (ncRNAs) has been preferred in eukaryotes.
ncRNAs vary in size from 21 to 25 nt for the large family of microRNAs
(miRNAs) that modulate development in C. elegans, Drosophila, and
mammals [2], up to 200 nt for sRNAs commonly found as translational
regulators in bacterial cells [3] and to >10,000 nt for RNAs implicated in
gene silencing in higher eukaryotes [4]. The functions described for
ncRNAs so far are extremely varied.
The mechanisms of action for the characterized ncRNAs can be grouped
into several general categories. There are ncRNAs where base pairing
with another RNA or DNA molecule is central to function. The snoRNAs
that direct RNA modification and the bacterial RNAs that modulate
translation by forming base pairs with specific target mRNAs are examples
of this category.
Some ncRNAs resemble the structures of other nucleic acids: the 6S
RNA structure is reminiscent of an open bacterial promoter and the tmRNA
has characteristics of both tRNAs and mRNAs.
Other ncRNAs, such as the RNase P RNA, have catalytic functions.
Most ncRNAs are associated with proteins that augment their functions;
however, some ncRNAs, such as the snRNAs and the SRP RNA, serve key
structural roles in RNA-protein complexes. Several ncRNAs fit into more
than one mechanistic category.
The mechanisms of action for a number of ncRNAs are unknown, and it
is probable that some ncRNAs act in ways that have not yet been
established.
Some investigators have suggested that many ncRNAs are relics of a
world in which RNA carried out all of the functions of a primitive cell.
Nevertheless, given the versatility of RNA and the fact that the properties of
RNA provide advantages over peptides for some mechanisms, it is likely
that a number of ncRNAs have evolved more recently [5, 6].
2.4 Secondary structure of RNA
DNA molecules are usually encountered as two complementary strands,
forming a double helix over long stretches [7]. In contrast to DNA, RNA
prevails as a single strand. Owing to small self-complementary regions,
RNA commonly exhibits a complex secondary structure, consisting of
relatively short double-helical segments alternating with single-stranded
regions.
Many complex tertiary interactions fold the RNA in its final three-
dimensional form. The folded RNA molecule is stabilized by a variety of
interactions, the most important being hydrogen bonding of the bases and
stacking. Some of the commonly found structural elements are illustrated in
Fig. 2.1.
2.4.1 Base interactions in RNA
Canonical pairs. The bulk of secondary structure interactions is formed
by the normal (Watson-Crick) type of base pairing. These pairs are held
together by two hydrogen bonds between A and U, or three hydrogen bonds
between G and C.
Wobble pairs. The wobble hypothesis proposed that interactions other
than the canonical pairings are possible between the first base of the
anticodon and the third base of the codon. Of these 'wobble' interactions, the
non-canonical pairing between G and U is often found in RNA secondary
structure, where it usually appears to play an incidental role. G:U pairs occur
at a variety of positions in tRNAs, positions that are usually occupied by
canonical pairs in homologous tRNAs. This suggests that G:U pairs are
generally not characteristic fixed features of RNA structure, but rather behave
like canonical pairs. However, this is not always the case: in some
instances the wobble base pair is highly conserved and has, for example, been
shown to be a specific feature in the identity of alanine tRNA [8].
Other non-canonical pairs. Non-canonical base pairing has mainly
been demonstrated and studied in short DNA duplexes using X-ray
diffraction [9], NMR [10] and thermodynamic studies [11]. However,
non-canonical pairs have also been observed experimentally in RNA: for
example, the U-C pair has been detected in an X-ray diffraction study by
Holbrook et al. [12]. Several comparative and experimental results indicate
that G:A pairs are not rare in RNA structure.
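
The three classes of base interaction described in this section can be summarised in a short Python sketch. The function name and the three labels are hypothetical and serve only to restate the classification above; the sketch does not judge whether a given non-canonical pair is actually observed in nature.

```python
# Pair classes from section 2.4.1; the names and labels are illustrative assumptions.
CANONICAL_PAIRS = {frozenset("AU"), frozenset("GC")}
WOBBLE_PAIR = frozenset("GU")


def classify_pair(a, b):
    """Classify an RNA base pair as canonical, wobble or other non-canonical."""
    pair = frozenset((a.upper(), b.upper()))
    if pair in CANONICAL_PAIRS:
        return "canonical"
    if pair == WOBBLE_PAIR:
        return "wobble"
    return "non-canonical"


for a, b in [("A", "U"), ("G", "U"), ("U", "C"), ("G", "A")]:
    print(a, b, classify_pair(a, b))
```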
2.5 Elements of the RNA secondary structure
2.5.1 Duplexes
Duplex RNA consists of a right-handed double helix stabilized by
hydrogen bonds between the bases on opposite complementary strands and
by stacking between adjoining bases. X-ray diffraction studies of fibres and
crystals [13] have shown that the helices are of the A-form. The A-form
RNA helix has 11 bp per turn, as opposed to 10 bp per turn for the usual B-
form DNA helix.
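
The helical repeat translates directly into a rotation per base pair. The short Python sketch below works through this arithmetic using only the two values quoted above; the dictionary keys and function names are assumptions made for illustration.

```python
# Base pairs per helical turn, as quoted in the text above.
BP_PER_TURN = {"A-RNA": 11, "B-DNA": 10}


def twist_per_bp(form):
    """Average rotation per base pair in degrees (360 degrees per full turn)."""
    return 360.0 / BP_PER_TURN[form]


def helical_turns(n_bp, form):
    """Number of helical turns spanned by a duplex of n_bp base pairs."""
    return n_bp / BP_PER_TURN[form]


print(round(twist_per_bp("A-RNA"), 1))  # 32.7
print(round(twist_per_bp("B-DNA"), 1))  # 36.0
print(helical_turns(22, "A-RNA"))       # 2.0
```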
Fig. 2.1 Illustration of some of the structure elements found in rRNA. The RNA
backbone is depicted as a thick line, whereas bases are shown as thin lines: a =
hairpin, b = internal loop, c = bulge loop, d = junction, e = duplex (long range
interaction), f = pseudoknot. Picture adapted from [7].
2.5.2 Single stranded regions
Single stranded regions are formed by unpaired nucleotides. In the absence
of tertiary interactions to constrain the single stranded regions, they are
assumed to be roughly ordered by base stacking in a helical geometry.
2.5.3 Hairpins
A hairpin consists of a duplex bridged by a loop of unpaired nucleotides.
The smallest possible loop in a hairpin was originally thought to be three
nucleotides but there is growing evidence that in some sequences, two
unpaired nucleotides suffice [14]. Thermodynamic studies of hairpins with
loop sequences (U)n, (C)n and (A)n (n=3 to 9) showed that loops containing
four or five nucleotides are the most stable [15]. However, the stability of a
hairpin loop varies with loop sequence and size. Some tetraloops appear to
be particularly abundant in RNA structures. Two of these have been
shown to form unusually stable hairpins: UUCG [16] and GAAA [17].
NMR studies of the hairpin GGAC(UUCG)GUCC demonstrated that
interactions between loop bases and the sugar-phosphate backbone
contribute to this unusual stability [14]. The backbone angles of the
nucleotides in small hairpin loops differ significantly from A-form
geometry. In longer loops, some of the nucleotides have been shown to
stack in normal A-form geometry.
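
The hairpin GGAC(UUCG)GUCC mentioned above can be used for a small worked example. The Python sketch below pairs the two ends of a sequence inward for as long as the bases are Watson-Crick complementary and reports the remaining loop. It is a deliberately naive illustration, not a folding algorithm: the function name, the restriction to canonical pairs and the minimum loop size of three nucleotides are assumptions.

```python
# Canonical RNA pairs only; G-U wobble pairs are ignored in this simple sketch.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def hairpin_stem(seq, min_loop=3):
    """Pair the ends of `seq` inward while the bases are complementary.

    Returns (pairs, loop): `pairs` is a list of 0-based index pairs forming
    the stem and `loop` is the unpaired sequence that the stem encloses.
    """
    i, j = 0, len(seq) - 1
    pairs = []
    # Only accept a pair if at least `min_loop` bases remain between i and j.
    while j - i - 1 >= min_loop and (seq[i], seq[j]) in CANONICAL:
        pairs.append((i, j))
        i += 1
        j -= 1
    return pairs, seq[i:j + 1]


pairs, loop = hairpin_stem("GGACUUCGGUCC")
print(pairs)  # [(0, 11), (1, 10), (2, 9), (3, 8)]
print(loop)   # UUCG
```

For this sequence the sketch recovers a four-pair stem closed by the UUCG tetraloop, in agreement with the way the hairpin is written in the text.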
2.5.4 Bulge loops
Bulge loops are formed by unpaired nucleotides in one strand of a
double-stranded region, where the other strand has contiguous base pairing.
Single base bulges can intercalate into the helix or loop out of the helix
depending on the temperature, the identity of the bulged nucleotide and the
sequence of the surrounding duplex. Bulge loops can affect the long-range
structure of RNA by creating a bend in a duplex. Bending has been detected
by the altered mobility in non-denaturing gel electrophoresis of RNAs
containing bulge loops. Distortions due to bulges may extend into the
surrounding duplex region [18].
2.5.5 Internal loops
Internal loops are formed by nucleotides on both strands of a duplex that
are not capable of forming Watson-Crick base pairs. Symmetrical internal
loops contain an equal number of unpaired nucleotides in each strand.
Asymmetrical internal loops contain an unequal number.
2.5.6 Junctions
Junctions or multi-branched loops are formed where three or more
duplexes come together, separated by single stranded stretches with a
variable number of unpaired nucleotides. Different helical regions can stack
coaxially in these junctions. The conformation of the unpaired nucleotides
in the junctions has a great impact on the three-dimensional structure by
orienting the stem regions that meet. These nucleotides have also been
implicated in the catalysis of specific reactions [19].
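
The loop types defined in sections 2.5.3 to 2.5.6 can be distinguished mechanically once the base pairs of a structure are known. The Python sketch below assumes that the secondary structure is given in dot-bracket notation (matching parentheses for paired bases, dots for unpaired ones); this notation, the function names and the label "stacked pair" are assumptions introduced here for illustration and are not part of the text above.

```python
def pair_table(structure):
    """Map every paired position to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def classify_loops(structure):
    """Return (loop type, closing pair) for every loop closed by a base pair."""
    partner = pair_table(structure)
    loops = []
    for i in sorted(partner):
        j = partner[i]
        if j < i:
            continue  # visit each pair once, from its 5' base
        # Collect the base pairs directly enclosed by the closing pair (i, j).
        inner, k = [], i + 1
        while k < j:
            if k in partner:
                inner.append((k, partner[k]))
                k = partner[k] + 1  # jump over the enclosed helix
            else:
                k += 1
        if not inner:
            kind = "hairpin"
        elif len(inner) >= 2:
            kind = "junction"  # three or more helices meet, counting (i, j)
        else:
            p, q = inner[0]
            left, right = p - i - 1, j - q - 1  # unpaired bases on each strand
            if left == 0 and right == 0:
                kind = "stacked pair"
            elif left == 0 or right == 0:
                kind = "bulge"
            elif left == right:
                kind = "symmetrical internal loop"
            else:
                kind = "asymmetrical internal loop"
        loops.append((kind, (i, j)))
    return loops


print(classify_loops("(.((...))..)"))
print(classify_loops("((...)(...))"))
```

The first example contains an asymmetrical internal loop (one unpaired base on one strand, two on the other) enclosing a short stem with a hairpin; the second contains a junction where three helices meet.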
2.6 Tertiary structure interactions
Tertiary interactions can be defined in terms of chord crossing: the
RNA sequence is drawn around a circle in a plane, and interactions between
bases are depicted as straight lines (chords) connecting the bases. A
secondary structure can be represented without any line crossing; tertiary
interactions occur when lines do cross. In some cases, however, it is
difficult to discern which is the secondary and which is the tertiary
interaction.
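
The chord-crossing criterion can be checked directly on a list of base pairs: two chords (i, j) and (k, l) cross exactly when i < k < j < l. The following Python sketch applies this test to every combination of pairs. The function name and the use of 0-based sequence positions are assumptions made for illustration.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return all couples of base pairs whose chords cross.

    Each base pair is a tuple of two sequence positions. Two chords
    (i, j) and (k, l) with i < k cross exactly when i < k < j < l.
    """
    # Order each pair as (5' position, 3' position) and sort the list,
    # so that the first pair of every combination starts further 5'.
    ordered = sorted(tuple(sorted(p)) for p in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(ordered, 2):
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


# A nested stem has no crossing chords: it is a pure secondary structure.
print(crossing_pairs([(0, 9), (1, 8), (2, 7)]))  # []

# A pair whose chord crosses the stem is a candidate tertiary interaction.
print(crossing_pairs([(0, 6), (1, 5), (3, 9)]))
# [((0, 6), (3, 9)), ((1, 5), (3, 9))]
```

As noted in the text, the test only reports that two interactions cross; it does not decide which of them should be regarded as secondary and which as tertiary.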
2.6.1 Tertiary base pairing
There are several examples of RNA molecules containing tertiary
contacts between nucleotides in loop regions of the secondary structure,
called loop-loop interactions. These interactions can show conformations
uncommon in secondary structure, and even a parallel strand pair has been
identified.
A typical tertiary interaction is the pseudoknot. It is formed by the
interaction of bases in a hairpin loop with bases just outside this hairpin
structure. Pseudoknots have been found in an increasing number of
biological systems [20].
2.6.2 Other tertiary interactions
Intercalation of an individual base of one strand between two bases of an
adjacent strand has been demonstrated in tRNA. Base triples also occur in
tRNA. The third base of a base triple may bind to a Watson-Crick pair in
either the major or the minor groove. It is stabilized by hydrogen bonding
and stacking. Helix-helix interactions have also been observed in the crystal
structure of some RNA duplexes [13].
2.7 Three-dimensional structure of RNA
In solution, noncoding RNA molecules have a well-defined three-
dimensional structure that is critical for their physiological function. The
general architecture resembles that of proteins, with structured, rigid domains
connected by less structured and more flexible stretches [21].
Fig. 2.2 The crystal structure of yeast phenylalanine tRNA at 1.93 Å resolution
(PDB entry 1EHZ) [22].
Fig. 2.2 provides an example of an RNA three-dimensional structure:
yeast phenylalanine tRNA. Bases form both Watson-Crick pairs and
non-Watson-Crick pairs, which stack on one another to form stems. In tRNA,
four stems stack pairwise to make the two arms of the L-shaped tRNA. Bigger
RNAs, e.g. ribosomal RNA, can have much more complex structures.
23S ribosomal RNA, like other RNA molecules called ribozymes
[23], has an active catalytic function. The catalytic activity of RNA has also
encouraged the proposal that RNA predates DNA and proteins as the
information storage and catalytic component of life. In this view, the
presence of catalytically active RNA in the ribosome is readily explained by
its ancient and crucial function. The first molecules performing the
functions now carried out by the ribosome could have been mere RNAs,
which acquired the proteins during their evolution. The function of the
acquired proteins might then be to tune the RNA structure and functions, or
even to take over some of the functions [7].