1 Introduction

1.1 Motivations
As the range of information and services expands, the time and effort required to sort through them grows as well. As reported in [51], the effort required to browse through thousands, if not millions, of “pieces” of information and services becomes like searching for the proverbial needle in a haystack.
For these reasons, one of the most pervasive and bedeviling challenges is to disseminate “the right information to the right people at the right time” [241]. However, achieving this goal requires addressing many questions, most of which are significant (and still open) research problems. In particular, the following problems must be faced:
• In practical contexts, a user should be able to quickly access the information and services that are actually relevant to him. Instead, he is often obliged to manually contact and query a large number of databases to retrieve and merge the information he is interested in. Such a “manual” search can lead to network congestion since, the larger the number of users, the larger the number of messages exchanged over the network and, consequently, the traffic volume.
Besides the traffic congestion it causes, a manual search often yields inaccurate and incomplete results. To clarify this point, observe that a network of data sources and information providers is a complex and rapidly evolving system, since its contents are continuously updated; in such a scenario, expert users are rare, because their knowledge about how information is distributed quickly becomes out of date.
As a consequence, users inevitably submit generic (and, therefore, imprecise) queries; in this situation, the probability of producing enormous result sets, containing many false positives (i.e., useless answers that fail to fulfill user needs), is high. Such result sets are likely to overwhelm users with undesirable material.
In addition, the rapid evolution of the network of available data sources and information providers makes it extremely difficult to correctly identify the most promising ones with respect to user needs. As a consequence, there may be a significant number of useful answers that the system fails to deliver to users (false negatives).
• The data sources available to the users are characterized by a strong level of autonomy [230, 204]. This means that each data source: (i) exploits its own formalism for representing and managing data; (ii) controls the evolution of its information (e.g., it can decide whether it is necessary to update the data or to restructure its organization); (iii) controls the usage of its data by other, external consumers (e.g., it can decide to partially hide some information).
Clearly, the independence and autonomy of each data source reflect the
internal organization of the institution which constructs and manages it.
However, the autonomy of each data source leads to a scenario characterized by strong heterogeneity, since organizations operating in the same technological/economic environment may use different data models and formalisms to manage their own data [196].
The autonomy of an organization often conflicts with end user needs [230]. Specifically, end users are experiencing the need for the interconnection of existing, possibly heterogeneous, databases [230]; such an interconnection should allow: (i) human users to easily retrieve and handle data coming from a large network of autonomous databases; (ii) information providers to exchange their own data and achieve better consistency and coordination among data sources containing related information (for instance, think of the database of a manufacturing company, including data on employees’ activities, and a personnel database).
In the past, many approaches have been proposed to enhance a user's capability of retrieving useful information from a large network of heterogeneous data sources. The common feature of these approaches is the attempt to build a seamless connection between people and information sources. Early connectivity efforts focused on the “physical” aspect of the problem; to this purpose, suitable technologies, such as ODBC data gateways, message-oriented middleware, and software adapters [241], have been proposed.
While this has been a major achievement, connectivity obtained only at the physical level is not enough to satisfy user information needs, nor to maximize over time the value of the information owned by an organization. In other words, these technologies allow streams of data to be transferred between highly heterogeneous information systems, but they do not attach any meaning to the transferred flow.
Various researchers agree that this limitation can be overcome by defining a semantic model capable of: (i) capturing the exigencies and preferences of users, and (ii) making explicit the inner structure of a data source, as well as its relationships with the other ones.
The problem of constructing such semantic models is a key challenge in the Information Systems and Artificial Intelligence research areas. Specifically, the approaches for solving this problem can be classified as follows:
• In the field of e-service management, several approaches suggest a shift
towards a more “user-centered” paradigm for accessing information [107,
236].
Specifically, these approaches observe that a classical information source
(or service provider) is not able to capture the personal exigencies of a user; as a consequence, users having different preferences and exigencies might pose the same query to the same information source and will always
receive the same answer [154].
Many approaches suggest that the analysis of user preferences enhances
the querying capability of a system and helps alleviate the problems of false positives and false negatives outlined above [154, 155, 228].
In practical cases, existing approaches associate each user with a suitable data structure named user profile. A user profile records the preferences and exigencies of a user; each time a user submits a query or requests a service, the system integrates the submitted query with the information available in his profile and, then, processes it [153] (a minimal sketch of this idea is given after this list).
According to [86], the process of combining preferences with queries has many desirable properties; for instance, it is possible to efficiently deal with the network congestion problem, as well as to reduce the number of false positives described previously; specifically, the exploitation of user profiles allows irrelevant information to be cut and reduces the volume of data transferred over the network.
In addition, it also lowers the number of false negatives since, for answering a query, it takes into account not only the query itself but also the profile of the user formulating it. As a consequence, such approaches can adequately cope with overly selective queries (i.e., queries consisting of a few highly specialized terms) or ambiguous queries (i.e., queries containing terms that might have various interpretations) and can retrieve query answers that would usually be filtered out by a classic query processing technique.
For these reasons, a large variety of data structures has been proposed for representing user profiles, such as vectors [167], trees [152] and graphs [154]. As for the methods for learning and updating user profiles, existing approaches rely on techniques such as neural networks [220], genetic algorithms [185], nearest neighbor algorithms [199], Decision Trees [199], relevance feedback and the Rocchio algorithm [167].
Finally, interesting examples of the application of user profiles to query answering activities can be found in [50, 58, 75, 109, 124, 184].
• In the Information Systems research area, various approaches propose to
build tools that provide users with a uniform interface to a multitude
of data sources; this interface should free them from knowing the imple-
mentation details of the involved data sources and the specific mode of
interaction with each source [161, 162, 196].
These approaches construct a mediated schema, i.e., a virtual schema that captures the salient aspects characterizing the domain to which the input schemas refer [52, 162].
In addition, they allow users to easily pose queries against the mediated
schema; in other words, they provide a user with the illusion of managing
just one, global data source and define a suitable query language for the
mediated schema.
To answer queries, a set of semantic matchings between the mediated
schema and the local schemas of the data sources is required. These match-
ings are often known as interschema properties [78, 110, 171, 191, 192, 204],
and they represent terminological and structural relationships holding
among attributes and elements belonging to involved data sources.
Interschema properties are used to reformulate a user query into a set of
queries on the original data sources. Finally, each data source is provided
with suitable wrapper programs, which are in charge of transforming data
from the data model of the mediated schema to the data model of each
local data source.
A key bottleneck in the construction of the mediated schema is the ac-
quisition of semantic matchings. In the field of relational databases, some
approaches [63] point out that the semantics of a data source can be ex-
tracted from the knowledge of functional dependencies existing among
attributes. In other approaches [123, 180], these matchings were manually
provided by a set of domain experts; this methodology is clearly laborious
and error-prone.
Many researchers agree that the task of finding semantic matchings must be semi-automatic, i.e., it should require human intervention only for validating the semantic matchings discovered by a system or, sometimes, for providing an initial set of properties starting from which the system can learn more complex matchings.
To this purpose, many sophisticated approaches have been defined to semi-automatically find interschema properties; for instance, some of them rely on graph-matching algorithms [191], whereas others exploit tools like neural networks [163], Bayesian networks [101] or logic formalisms [183].
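To make the idea of profile-based query answering discussed in the first item above more concrete, the following fragment sketches one possible, deliberately simplified, way of representing a user profile as a weighted term vector and of merging it with a submitted query before processing. The names, weights and blending coefficient are illustrative assumptions of this sketch and do not reproduce any specific approach cited above.

```python
# A minimal, illustrative sketch: a user profile as a weighted term vector
# that is merged with the terms of a submitted query before processing.
# The profile structure and the blending weight are assumptions made for
# illustration; they do not reproduce any specific cited approach.

from collections import defaultdict

def augment_query(query_terms, user_profile, profile_weight=0.5):
    """Combine query terms with profile preferences into a weighted query."""
    weighted_query = defaultdict(float)
    for term in query_terms:
        weighted_query[term] += 1.0                         # explicit user intent
    for term, preference in user_profile.items():
        weighted_query[term] += profile_weight * preference  # implicit intent
    return dict(weighted_query)

# Example: a generic query is enriched with the user's known interests,
# which helps cutting false positives and recovering false negatives.
profile = {"jazz": 0.9, "vinyl": 0.7, "guitar": 0.4}
print(augment_query(["music", "shop"], profile))
```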
Even if the approaches mentioned above are quite sophisticated and complex, they do not take into account several relevant novelties. These are:
• The importance of device adaptivity. As previously pointed out, a large number of public institutions and businesses are willing to deliver their services and information to a wide range of devices. For this reason, the improvement of hardware and network technologies is one of the most relevant research efforts to be pursued to allow a larger and larger set of devices to exploit new and sophisticated services.
We can observe that hardware/network technologies are improving at an
impressive rate; in particular, from a hardware standpoint, the main focus
is to reduce device size, cost, and power requirements. From a network
standpoint, the emphasis is on improving network speed and on developing
more fault-tolerant protocols.
However, advances in the hardware and network research fields are not enough to allow users to easily receive and manage complex data flows and sophisticated services.
Specifically, in our opinion, a significant research theme consists in identifying mechanisms for information access that merge the knowledge of the personal exigencies of users with the knowledge of the technical features of the devices they exploit in their activities.
This requires coupling user profiles and device profiles; the latter are data structures storing device capabilities and characteristics. This introduces the concept of a twofold adaptivity policy, i.e., an intelligent system capable of adjusting its behaviour on the basis of both user needs/preferences and the technical features of the exploited devices (a minimal sketch of this coupling is given after this list).
• The importance of XML. XML is a well-established W3C language for data
representation and exchange. Specifically, W3C has defined a formalism
named XML Schema, which provides a means for defining the structure, the content and the semantics of an XML document.
XML is emerging as a standard syntax for sharing information among sources in various scenarios, like e-commerce, Web-oriented data integration, schema evolution and migration, data warehousing, Web site creation and management, and peer-to-peer systems [204]; for these reasons, XML fuels data sharing applications and makes it necessary to develop algorithms for acquiring interschema properties.
The problem of detecting interschema properties among XML data sources
has gained its own relevance as a research topic, since it introduces some challenging issues that were absent in the area of relational databases and in other formalisms for representing data. Specifically, we can recognize the
following key issues:
– XML supplies powerful tools for allowing a user to define complex
data types which are not usually present in other data representation
languages. For instance, XML provides two general ways, called com-
position and sub-classing [205], for defining new data types beginning
from already existing ones; the composition and sub-classing activities
can be recursively applied so that arbitrarily nested type hierarchies
are possible.
This feature is typical of XML and the exploitation of type informa-
tion plays a key role in the effective detection of interschema properties
between two Schemas. As a consequence, an efficient algorithm for de-
tecting semantic matchings between XML data sources should properly
deal with data types; by contrast, other formalisms for data representation often define a limited spectrum of generally simple data types and, hence, they require simple mechanisms for managing data type compatibility (e.g., a static table) [205].
– XML supplies further constructs, like XSD compositors (sequence, all, choice) and cardinality restrictions (minOccurs, maxOccurs), that can be exploited to effectively find interschema properties. Such a variety of constructs is not generally available in other data representation languages [205].
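As anticipated in the first item of the list above, the twofold adaptivity policy couples user profiles with device profiles. The following fragment is a minimal sketch of this idea, under the assumption that a device profile records a few technical capabilities and that adaptation simply filters and re-ranks candidate contents; the field names, the download-time criterion and the thresholds are illustrative assumptions.

```python
# A minimal sketch of "twofold adaptivity": the same candidate contents are
# filtered/ranked against both a user profile and a device profile.
# Field names and thresholds are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class DeviceProfile:
    screen_width: int       # pixels
    supports_video: bool
    bandwidth_kbps: int

@dataclass
class Content:
    topic: str
    kind: str               # "text", "image" or "video"
    size_kb: int

def adapt(contents, user_interests, device):
    """Keep only contents the device can render, ranked by user interest."""
    def renderable(c):
        if c.kind == "video" and not device.supports_video:
            return False
        return c.size_kb * 8 <= device.bandwidth_kbps * 10  # ~10 s download budget
    ranked = [c for c in contents if renderable(c)]
    ranked.sort(key=lambda c: user_interests.get(c.topic, 0.0), reverse=True)
    return ranked

device = DeviceProfile(screen_width=320, supports_video=False, bandwidth_kbps=256)
items = [Content("news", "video", 5000), Content("news", "text", 40),
         Content("sport", "image", 300)]
print(adapt(items, {"news": 0.8, "sport": 0.3}, device))
```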
This thesis aims at solving the problems outlined above, i.e., it defines a set of semantic approaches for: (i) defining a policy for accessing information and services that is sensitive to both the user’s exigencies and the features of the device he is exploiting; (ii) capturing the inner semantics of XML data sources; the latter is exploited for ensuring full interoperability among them.
The ideas and the algorithms outlined in this thesis have been concretized in the design and the implementation of two prototypes: the former is AESM (Adaptive E-service Manager) and is devoted to managing the “user-device adaptive” delivery of e-services; the latter is XIKE (XML Intensional Knowledge Extractor) and handles the interoperability among XML data sources.
1.2 Adaptive Management of e-services
AESM (Adaptive E-service Manager) is an XML-based multi-agent system for handling user accesses to e-services; AESM is capable of adapting its behaviour to both user and device profiles.
It comprises three types of agents, namely:
• Service Provider Agents. Each of them is associated with an e-service provider and is in charge of handling the proposals stored therein, as well as the interaction with the user.
• User-Device Agents. Each of them is associated with a specific user operating by means of a specific device, and aims at supporting him in his activities.
• Profile Agents. They store the profiles of both the involved users and devices.
The general behaviour of AESM is as follows. Whenever a user accesses
an e-service by means of a certain device, the corresponding Service Provider
Agent sends information about its proposals to the User-Device Agent asso-
ciated with him and the device he is exploiting. The User-Device Agent de-
termines similarities between the proposals presented by the provider and the
interests of the user. For each of these similarities, the Service Provider Agent and the User-Device Agent cooperate to present the user with suitable information that illustrates the proposal and is adapted to the device he is exploiting.
In AESM the device profile plays a central role. Indeed, the proposals of
a provider that are presented to a user, as well as their presentation formats,
depend on the technical features of the device he is presently exploiting. How-
ever, the AESM capability of adapting its behaviour to the device the user
is exploiting is not restricted to the presentation format of proposals; specif-
ically, the exploited device can also influence the computation of the degree of interest shown by a user in the proposals presented by each provider. In other words, AESM can associate each service with a numerical coefficient (rank) stating its relevance for the user and can filter out the services with the lowest ranks.
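The following fragment sketches, in a purely illustrative way, how such a rank might be computed by a User-Device Agent: here the proposals of a provider are scored against the user profile with a simple cosine similarity over term vectors and then filtered by a rank threshold. The scoring function, the proposal representation and the threshold are assumptions of this sketch, not the actual AESM coefficients.

```python
# Illustrative sketch of how a User-Device Agent might score a provider's
# proposals against the user profile: a cosine similarity between term
# vectors, followed by a rank threshold. Names and weights are assumptions.

import math

def cosine(a, b):
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in set(a) | set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_proposals(proposals, user_profile, min_rank=0.3):
    """Rank proposals by similarity with the user profile; drop low ranks."""
    ranked = [(cosine(terms, user_profile), name) for name, terms in proposals]
    return sorted([r for r in ranked if r[0] >= min_rank], reverse=True)

profile = {"travel": 0.9, "art": 0.6}
proposals = [("museum pass", {"art": 1.0, "travel": 0.2}),
             ("car insurance", {"finance": 1.0})]
print(select_proposals(proposals, profile))
```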
As previously pointed out, many agents are simultaneously active in
AESM; they strongly interact with each other and continuously exchange information. In this scenario, an efficient management of information exchange is crucial. One of the most promising solutions to this problem has been the adoption of XML, whose capabilities make it particularly well suited to agent research. In AESM, the role of XML is central;
indeed:
• The Support Data Structures of the involved agents are stored as XML documents; as a consequence, they are light, versatile, easy to exchange and can reside on different devices and software platforms.
• The agent communication language is ACML; this is the XML encoding
of the agent communication language defined by the Foundation for Intel-
ligent Physical Agents (FIPA).
• The extraction of information from the various data structures is car-
ried out by means of XQuery. This language has capabilities typical of
database query languages as well as features typical of document manage-
ment systems; moreover, it is provided with various high-level constructs
for simplifying querying over the Web.
• The manipulation of the Support Data Structures of the involved agents is performed by means of the Document Object Model (DOM). This makes it possible for programmers to write applications that properly work on all browsers and servers, as well as on a large variety of hardware and software platforms (see the sketch after this list).
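As an illustration of the last item above, the fragment below manipulates a toy XML profile document through the standard DOM interface (here via Python's xml.dom.minidom). The document structure and the element names are assumptions of this sketch and do not reproduce the actual Support Data Structures of AESM.

```python
# Illustrative DOM manipulation of a toy XML "support data structure".
# The element and attribute names are assumptions, not AESM's actual format.

from xml.dom.minidom import parseString

profile_xml = """<profile user="u42">
  <interest name="travel" weight="0.9"/>
  <interest name="art" weight="0.6"/>
</profile>"""

doc = parseString(profile_xml)

# Read the interests through the DOM API.
for node in doc.getElementsByTagName("interest"):
    print(node.getAttribute("name"), node.getAttribute("weight"))

# Update the document: add a new interest node and serialize it back,
# e.g. before sending it to another agent as a message payload.
new_interest = doc.createElement("interest")
new_interest.setAttribute("name", "music")
new_interest.setAttribute("weight", "0.4")
doc.documentElement.appendChild(new_interest)
print(doc.toxml())
```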
AESM has been successfully specialized in various application contexts
such as QoS management, Information Content (IC) management in telecom-
munications networks, e-commerce, e-learning, e-government and e-recruitment.
In the following subsections, we illustrate how AESM has been specialized for each of these contexts.
QoS management
A specialization of AESM has been conceived for handling the Quality of Ser-
vice (QoS) in telecommunications networks. To explain the AESM behaviour
in this case, assume that a given user is accessing network services and he is
receiving a certain amount of bandwidth; assume, now, that the bandwidth
assignment relative to the user's connection needs to be modified (e.g., due to a network fault).
According to the newly available bandwidth, the system modifies the “high-level” parameters (e.g., the horizontal video resolution) that specify the QoS guaranteed to the user, taking user exigencies and needs into account. To this purpose, our system determines the optimal distribution of video, audio and data bandwidth for maximizing user satisfaction by solving a suitable Linear Programming Problem.
Once the optimal video, audio and data bandwidth distribution has been
computed, the system must determine the combination of high-level QoS parameters that fits the bandwidth constraints and maximizes the QoS perceived by the user. To this purpose, it defines and solves a suitable quadratic programming problem.
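A minimal sketch of the Linear Programming step is reported below, assuming three decision variables (video, audio and data bandwidth), user-dependent preference weights, per-stream minimum guarantees and a total bandwidth constraint; the weights and bounds are illustrative, and the actual AESM formulation (as well as the subsequent quadratic program) is richer than this illustration.

```python
# Sketch of the bandwidth distribution step as a Linear Program:
# maximize w_v*x_v + w_a*x_a + w_d*x_d (weighted user satisfaction)
# subject to x_v + x_a + x_d <= B and per-stream minimum guarantees.
# Weights and bounds are illustrative assumptions.

from scipy.optimize import linprog

B = 1000.0                      # total available bandwidth (kbps)
weights = [0.6, 0.3, 0.1]       # user preference for video, audio, data
mins = [200.0, 64.0, 32.0]      # minimum bandwidth per stream (kbps)

res = linprog(
    c=[-w for w in weights],            # linprog minimizes, so negate the weights
    A_ub=[[1.0, 1.0, 1.0]],             # x_v + x_a + x_d <= B
    b_ub=[B],
    bounds=[(m, None) for m in mins],
    method="highs",
)
video, audio, data = res.x
print(f"video={video:.0f} kbps, audio={audio:.0f} kbps, data={data:.0f} kbps")
```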
Information Content management in telecommunications networks
A specialization of AESM has been conceived for selecting the most relevant
information for a user accessing network services.
Our system is activated whenever the bandwidth received by a given user
u must be modified. In this case, it retrieves information about the profile of u, along with the current bandwidth available for him; after this, it exploits this information for defining the set of Information Contents that maximizes user satisfaction.
Such an activity can be appropriately formalized as a 0/1 Knapsack Prob-
lem.
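The fragment below sketches this formalization as a classical 0/1 Knapsack solved by dynamic programming; the Information Contents, their sizes and their utility values are illustrative assumptions of the sketch.

```python
# 0/1 Knapsack sketch: choose the Information Contents that maximize the
# user's satisfaction without exceeding the available bandwidth.
# Sizes (kbps) and utilities are illustrative assumptions.

def knapsack(items, capacity):
    """items: list of (name, size, utility); returns (best utility, chosen names)."""
    best = [[0] * (capacity + 1) for _ in range(len(items) + 1)]
    for i, (_, size, utility) in enumerate(items, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if size <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - size] + utility)
    # Backtrack to recover the chosen contents.
    chosen, c = [], capacity
    for i in range(len(items), 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, size, _ = items[i - 1]
            chosen.append(name)
            c -= size
    return best[len(items)][capacity], chosen

contents = [("news video", 600, 8), ("weather", 50, 3), ("stock feed", 200, 5)]
print(knapsack(contents, capacity=700))
```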
E-commerce
AESM has been specialized for managing “user-device” adaptive e-commerce
activities. To illustrate its behaviour, assume that a customer is willing to
visit an e-commerce site E. In this case, our system analyzes the contents
of E and selects the subset of the products stored therein and appearing
to be interesting for the customer. To this purpose, it exploits a suitable
ranking function for discriminating relevant concepts from irrelevant ones;
such a function takes into account the information stored in user and device
profiles.
E-learning
AESM can be specialized to manage “user-device” adaptive e-learning activ-
ities. Specifically, each time a user needs or wants to acquire a new topic, he activates the system and specifies the topic of his interest, along with his profile
and the profile of the device he is currently exploiting. Our system constructs
a directed graph called Subject Dependency Graph. The nodes of this graph
represent the topics that are already known by the user, as well as those not
known by him but appearing to be necessary for acquiring the desired topic.
An arc connecting a pair of nodes 〈A,B〉 indicates that the topic associated with A is a prerequisite of the topic associated with B (i.e., it is necessary to know the topic associated with A before studying the one associated with B).
After this, our approach finds a path connecting a node representing a topic
already known by the user with the node associated with the topic he desires
to acquire. This path must satisfy some requirements specifying the exigencies
of the user, his past behaviour, as well as the technical characteristics of the
device he is currently exploiting. Its definition (which implies the construction of a learning program) is carried out by solving a suitable integer programming problem.
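A minimal sketch of the Subject Dependency Graph and of a simple search over it is shown below. The real system selects the learning path by solving an integer programming problem; here, purely for illustration, the prerequisite closure of the target topic is visited and returned in study order, and the topic names are assumptions.

```python
# Sketch of a Subject Dependency Graph: nodes are topics, an arc A -> B means
# "A is a prerequisite of B". The function returns an ordered list of the
# topics still to be studied before the target (the real approach selects the
# path via integer programming; this is only an illustration).

def learning_plan(prerequisites, known, target):
    """prerequisites: dict topic -> list of prerequisite topics."""
    plan, visiting = [], set()

    def visit(topic):
        if topic in known or topic in plan:
            return
        if topic in visiting:
            raise ValueError("cycle in the dependency graph")
        visiting.add(topic)
        for pre in prerequisites.get(topic, []):
            visit(pre)
        visiting.discard(topic)
        plan.append(topic)

    visit(target)
    return plan

graph = {"XQuery": ["XML"], "XML": ["markup basics"], "markup basics": []}
print(learning_plan(graph, known={"markup basics"}, target="XQuery"))
# -> ['XML', 'XQuery']
```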
E-government
AESM can be specialized in the field of e-government; the corresponding pro-
totype allows citizens to easily pose their queries to the government agencies’
databases.
In order to optimize the query answering activity, our system associates each citizen with a suitable profile, registering his preferences and exigencies; the information stored in citizen profiles is used to rank the available government agencies’ databases, i.e., to determine the aptitude of each database to correctly answer citizen queries. These ranks are then exploited for intelligently directing citizen queries to the most appropriate databases, so as to produce complete and precise results.
In addition, the system can be exploited to allow ubiquitous access to, and exploitation of, the data and services provided by government agencies.
E-recruitment
AESM can be specialized for supporting personalized recruitment services. It
acts as a recommender system; specifically, each time a user wants to retrieve
a list of job proposals, he submits a suitable query that is processed by means
of Information Retrieval algorithms.
After this, our system ranks the selected proposals on the basis of user
preferences and picks from this list a suitable set of them (seeds). This list is then expanded by taking into account the semantic affinities existing among job proposals, as well as the feedback provided by the users. The final list of job proposals suggested to the user contains at least the seed proposals; moreover, it could also include some job proposals relative to topics not particularly
future (since they are sufficiently similar to those he judged relevant in the
past).
1.3 Interoperability among XML Schemas
XIKE (XML Intensional Knowledge Extractor) is a system for ensuring the
interoperability among XML data sources.
It supplies two main functionalities: (i) the extraction of the interschema properties holding among a group of XML Schemas; (ii) their usage for constructing a global schema or for clustering the input XML Schemas.
In our approach, we assume the existence of a reference dictionary (e.g., WordNet or a domain-specific dictionary provided by an expert) that states the lexical synonymies existing among the elements of the input XML Schemas.
In order to uniformly handle the elements and the attributes of an XML Schema, we introduce the concept of x-component.
Informally, given an XML Schema S, we say that an x-component of S is
either an element or an attribute of S. An x-component is characterized by
its name, its typology (indicating whether it is a complex element, a simple element or an attribute) and its data type.
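The following fragment sketches how the notion of x-component could be encoded; the field values in the example are illustrative and refer to a hypothetical "book" Schema.

```python
# Minimal sketch of an x-component: a uniform handle for the elements and
# the attributes of an XML Schema, carrying name, typology and data type.

from dataclasses import dataclass
from enum import Enum

class Typology(Enum):
    COMPLEX_ELEMENT = "complex element"
    SIMPLE_ELEMENT = "simple element"
    ATTRIBUTE = "attribute"

@dataclass(frozen=True)
class XComponent:
    name: str
    typology: Typology
    data_type: str    # e.g. "xs:string", "xs:integer", or a complex type name

# Illustrative instances drawn from a hypothetical "book" Schema.
title = XComponent("title", Typology.SIMPLE_ELEMENT, "xs:string")
isbn = XComponent("isbn", Typology.ATTRIBUTE, "xs:string")
book = XComponent("book", Typology.COMPLEX_ELEMENT, "BookType")
print(title, isbn, book, sep="\n")
```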
In the following subsections, we illustrate in more detail the main steps
composing our approach.
Derivation of interschema properties
As previously pointed out, interschema properties [78, 110, 171, 191, 192, 204]
are terminological and structural relationships holding among attributes and
elements belonging to a group of XML Schemas. They can be roughly classified
into:
• Synonymies [77, 110, 123, 156, 171, 256]: a synonymy between two x-
components x1 and x2 indicates that they have the same type and the
same meaning, even if they might have different names.
• Homonymies [77, 110, 149, 156, 211, 106]: a homonymy between two x-
components x1 and x2 indicates that they have the same type and the
same name but different meanings.
• Hyponymies/Hypernymies [77, 17, 98, 123, 171]: given two x-components
x1 and x2, x1 is a hyponym of x2 (which is, in its turn, the hypernym of
x1) if x1 has a more specific meaning than x2.
• Overlappings [98, 17]: given two x-components x1 and x2, an overlapping
exists between them if there exist non-empty sets of attributes {A11, A12,
. . ., A1n} of x1 and {A21, A22, . . . , A2n} of x2 such that, for 1 ≤ i ≤ n, A1i
is a synonym of A2i.
• Sub-schema similarities [15, 98, 100, 171, 254]: we call sub-schema a self-contained XML sub-source such that a concept represented therein is connected to other concepts of the sub-source by means of at least one relationship. We call sub-schema similarity a similarity between two sub-schemas
belonging to different sources.
Our approach for extracting synonymies and homonymies basically con-
sists in “measuring” the semantic similarity of pairs of x-components.
In order to discover a similarity between two x-components x1 and x2,
we need to formally define the concept of neighborhood of an x-component.
Preliminarily, we introduce some boolean functions that allow the strength
of the relationship existing between two x-components xS and xT of an XML Schema S to be determined. These functions are:
• veryclose(xS, xT), that returns true if and only if: (i) xT = xS, or (ii) xT is an attribute of xS, or (iii) xT is a simple sub-element of xS;
• close(xS, xT), that returns true if and only if: (i) xT is a complex sub-element of xS, or (ii) xT is an element of S and there exists a keyref element stating that an attribute of xS refers to a key attribute of xT;
• near(xS, xT), that returns true if and only if either veryclose(xS, xT) = true or close(xS, xT) = true; in all the other cases it returns false;
• reachable(xS, xT), that returns true if and only if there exists a sequence of distinct x-components x1, x2, . . . , xn such that xS = x1, near(x1, x2) = near(x2, x3) = . . . = near(xn−1, xn) = true, xn = xT.
The exploitation of the functions introduced above allows each pair 〈xS, xT〉 of x-components of an XML Schema to be associated with a coefficient (named Connection Cost) specifying the correlation degree existing between xS and xT; in other words, the Connection Cost indicates how much the concept expressed by xS is semantically close to the concept represented by xT.
Finally, we can informally define the jth neighborhood of an x-component xS of an XML Schema S as the set of x-components of S having a Connection Cost from xS less than or equal to j; this set will be denoted as neighborhood(xS, j).
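Under the simplifying assumption that a Schema is represented by explicit "veryclose" and "close" links between x-component names, the fragment below sketches the near function and the neighborhood construction. In this sketch the Connection Cost accumulates 0 along veryclose links and 1 along close links, so that neighborhood(x, 0) contains exactly the veryclose components of x, consistently with the informal description above; this particular cost assignment is an assumption of the sketch.

```python
# Sketch of the neighborhood machinery. A Schema is modeled by two relations
# over x-component names: "veryclose" links (attributes and simple
# sub-elements) and "close" links (complex sub-elements and keyref targets).
# Connection Costs here accumulate 0 over veryclose links and 1 over close
# links; this weighting is an assumption made for illustration.

import heapq

def near(schema, xs, xt):
    return xt == xs or xt in schema["veryclose"].get(xs, set()) \
                    or xt in schema["close"].get(xs, set())

def neighborhood(schema, xs, j):
    """x-components whose Connection Cost from xs is less than or equal to j."""
    costs, frontier = {xs: 0}, [(0, xs)]
    while frontier:
        cost, x = heapq.heappop(frontier)
        if cost > costs[x]:
            continue
        steps = [(0, y) for y in schema["veryclose"].get(x, set())] + \
                [(1, y) for y in schema["close"].get(x, set())]
        for weight, y in steps:
            if cost + weight < costs.get(y, float("inf")):
                costs[y] = cost + weight
                heapq.heappush(frontier, (costs[y], y))
    return {x for x, c in costs.items() if c <= j}

schema = {
    "veryclose": {"book": {"title", "isbn"}, "author": {"name"}},
    "close": {"book": {"author"}},
}
print(neighborhood(schema, "book", 0))   # {'book', 'title', 'isbn'}
print(neighborhood(schema, "book", 1))   # also includes 'author' and 'name'
```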
Given two x-components x1 (belonging to an XML Schema S1) and x2
(belonging to an XML Schema S2), and an integer u (called severity level),
our approach first considers neighborhood(x1, 0) and neighborhood(x2, 0) and
determines if they are similar. This decision is made by computing the ob-
jective function associated with the maximum weight matching of a suitable
bipartite graph constructed from the x-components of neighborhood(x1, 0)
and neighborhood(x2, 0), their lexical synonymies (as stored in the reference
thesaurus), their cardinalities and their data types.
If neighborhood(x1, 0) and neighborhood(x2, 0) are similar, it is possible
to conclude that x1 and x2 are synonymous [78, 191]. However, observe that
neighborhood(x1, 0) (resp., neighborhood(x2, 0)) takes into account only at-
tributes and simple elements of x1 (resp., x2); therefore, it considers quite a
limited context.
If we need a more “severe” level of synonymy detection, it is necessary to require not only the similarity of neighborhood(x1, 0) and neighborhood(x2, 0) but also that of the other neighborhoods of x1 and x2. Specifically, we can generalize the above reasoning and say that x1 and x2 are
synonymous with severity level equal to u if neighborhood(x1, v) is similar to
neighborhood(x2, v), for each v less than or equal to u. As in the previous case,
in order to check if neighborhood(x1, v) is similar to neighborhood(x2, v), we
need to construct a specific bipartite graph obtained from the x-components of
neighborhood(x1, v) and neighborhood(x2, v) and, then, we need to compute
the objective function associated with the maximum weight matching relative
to this bipartite graph.
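A minimal sketch of the neighborhood similarity check is reported below: given a pairwise score between the x-components of the two neighborhoods (here reduced to name equality plus synonymy lookups in the reference dictionary), a maximum weight matching is computed on the corresponding bipartite graph and the value of the objective function is compared against a threshold. The scoring function, the Dice-like normalization and the 0.5 threshold are assumptions of this sketch, not the actual XIKE formulas.

```python
# Sketch of the neighborhood similarity check via maximum weight matching on
# a bipartite graph. Pairwise scores, the objective function (a Dice-like
# normalization) and the 0.5 threshold are assumptions of this sketch.

import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_score(x, y, synonyms):
    """Crude lexical score between two x-component names."""
    if x == y or (x, y) in synonyms or (y, x) in synonyms:
        return 1.0
    return 0.0

def neighborhoods_similar(nbh1, nbh2, synonyms, threshold=0.5):
    nbh1, nbh2 = sorted(nbh1), sorted(nbh2)
    weights = np.array([[pair_score(x, y, synonyms) for y in nbh2] for x in nbh1])
    rows, cols = linear_sum_assignment(weights, maximize=True)
    matching_weight = weights[rows, cols].sum()
    objective = 2.0 * matching_weight / (len(nbh1) + len(nbh2))
    return objective >= threshold

synonyms = {("isbn", "code"), ("writer", "author")}
print(neighborhoods_similar({"book", "title", "isbn"},
                            {"book", "title", "code"}, synonyms))   # True
```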
The computation of hyponymies and overlappings is performed in an anal-
ogous fashion.
As for the computation of sub-schema similarities, it is necessary to ob-
serve that the number of possible sub-schemas that could be derived from an
XML Schema is extremely high; in certain circumstances, it can even be exponential in the number of elements and attributes of the input Schemas.
In order to avoid handling huge numbers of pairs of sub-schemas, XIKE exploits a heuristic technique for singling out only the most promising ones. A pair of sub-schemas is considered “promising” if the sub-schemas at hand include a large number of pairs of concepts whose similarity has already been stated (i.e., a large number of pairs of concepts for which a synonymy, a hyponymy or an overlapping has already been derived). In this way, it is probable that the overall similarity of a promising pair of sub-schemas will be high.
After the most promising pairs of sub-schemas have been selected, they must be examined for detecting those that are really similar. The similarity degree associated with each pair of sub-schemas is determined by applying suitable matching functions, defined on suitable bipartite graphs constructed from the components of the sub-schemas under consideration.
Construction of a Global Schema
Interschema properties can be exploited by a suitable integration algorithm
for constructing a global schema.
Our algorithm receives two XML Schemas S1 and S2 and a severity level u
and returns the integrated XML Schema SG. It consists of two steps, namely:
(i) construction of a Merger Dictionary MD(u) and a Rename Dictionary
RD(u); (ii) exploitation of MD(u) and RD(u) for obtaining SG.
As for the first step, observe that an x-component of a Schema could be synonymous with more than one x-component of the other Schema. The integration algorithm we are proposing here requires each x-component of a
Schema to be synonymous with at most one x-component of the other Schema.
In order to satisfy this requirement, it is necessary to construct a Merger
Dictionary MD(u) and a Rename Dictionary RD(u) by suitably filtering
previously derived synonymies and homonymies.
The construction of MD(u) begins with the definition of a support bipartite graph SimG(u). The nodes of SimG(u) are associated with the complex elements of S1 and S2; in addition, an edge in SimG(u) connects an x-component x1 of S1 with an x-component x2 of S2 such that x1 is synonymous with x2. After this, a maximum weight matching is computed on SimG(u); it selects a subset SimASubSet(u) ⊆ SimASet(u) (the set of edges of SimG(u)) that maximizes a suitable objective function and inserts the edges of SimASubSet(u) into MD(u).
After MD(u) has been constructed, it is possible to derive RD(u). More
specifically, a pair of x-components 〈x1, x2〉 is inserted into RD(u) if x1 and
x2 are both elements or both attributes, have the same name and 〈x1, x2〉 ∉ MD(u).
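The fragment below sketches this filtering step: the derived synonymies between complex elements are arranged in a bipartite structure, a maximum weight matching retains at most one partner per x-component (yielding MD(u)), and same-name pairs not retained in MD(u) are placed in RD(u). The similarity scores and the element names are assumed inputs of this sketch, and the typology check on RD(u) pairs is omitted for brevity.

```python
# Sketch of the construction of the Merger Dictionary MD(u) and of the
# Rename Dictionary RD(u). The input synonymies (with similarity scores)
# are assumed to have been derived as described above.

import numpy as np
from scipy.optimize import linear_sum_assignment

def build_dictionaries(synonymies, names1, names2):
    """synonymies: dict (x1, x2) -> score, for complex elements of S1 and S2."""
    xs1 = sorted({x1 for x1, _ in synonymies})
    xs2 = sorted({x2 for _, x2 in synonymies})
    weights = np.array([[synonymies.get((a, b), 0.0) for b in xs2] for a in xs1])
    rows, cols = linear_sum_assignment(weights, maximize=True)
    md = {(xs1[r], xs2[c]) for r, c in zip(rows, cols) if weights[r, c] > 0}
    # Same-name pairs not merged are scheduled for renaming.
    rd = {(n, n) for n in set(names1) & set(names2) if (n, n) not in md}
    return md, rd

synonymies = {("customer", "client"): 0.9, ("customer", "buyer"): 0.6,
              ("order", "order"): 0.8}
md, rd = build_dictionaries(synonymies, ["customer", "order", "invoice"],
                            ["client", "buyer", "order", "invoice"])
print(md)   # e.g. {('customer', 'client'), ('order', 'order')}
print(rd)   # {('invoice', 'invoice')}
```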
After MD(u) and RD(u) have been derived, it is possible to exploit them
for constructing SG. A first, rough, version of SG can be obtained by con-
structing a list containing all constructs of S1 and S2; however, this version of
SG could present some redundancies and/or ambiguities. In order to remove