AgreementMaker: Efficient Matching for Large Real-World Schemas and Ontologies (Translation)


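This paper is a previous work of a previous work of the framework-based ontology matching paper I read a few days ago. It is arguably a bit basic, but it is still a useful reference, so I downloaded the source code and plan to read the corresponding paper alongside it. I personally prefer reading Chinese and English side by side: I skim the Chinese and skip the unimportant parts, then read the original at the key passages. That is why I have ended up with so many translated versions.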

Citation: Cruz I F, Antonelli F P, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1586-1589.





We present the AgreementMaker system for matching real world schemas and ontologies, which may consist of hundreds or even thousands of concepts. The end users of the system are sophisticated domain experts whose needs have driven the design and implementation of the system: they require a responsive, powerful, and extensible framework to perform, evaluate, and compare matching methods. The system comprises a wide range of matching methods addressing different levels of granularity of the components being matched (conceptual vs. structural), the amount of user intervention that they require (manual vs. automatic), their usage (stand-alone vs. composed), and the types of components to consider (schema only or schema and instances). Performance measurements (recall, precision, and runtime) are supported by the system, along with the weighted combination of the results provided by those methods. The AgreementMaker has been used and tested in practical applications and in the Ontology Alignment Evaluation Initiative (OAEI) competition. We report here on some of its most advanced features, including its extensible architecture that facilitates the integration and performance tuning of a variety of matching methods, its capability to evaluate, compare, and combine matching results, and its user interface with a control panel that drives all the matching methods and evaluation strategies.


1. Introduction


The issue of schema matching in databases [11], which has been investigated since the early 80’s, is fundamental to data integration, as is the closely related issue of ontology alignment or matching [12]. The matching problem consists of defining mappings among schema or ontology elements that are semantically related. Such mappings are typically defined between two schemas or two ontologies at a time, one being called the source and the other the target.


We have been developing the AgreementMaker matching system, whose name takes after agreement, the encoding of a mapping. The capabilities of our system have been driven by the real-world problems of end users who are sophisticated domain experts. We have considered a variety of domains and applications, including: geospatial [2], environmental [4], and biomedical [13]. The conceptual information for these applications is stored in the form of ontologies. However, as demonstrated by others, the same approach can be used for schema matching [1, 10]. To validate our approach, we competed against seven other systems in the biomedical track of the 2007 Ontology Alignment Evaluation Initiative (OAEI), to match ontologies describing the mouse adult anatomy of the Mouse Gene Expression Database Project (2744 classes) and the human anatomy of the National Cancer Institute (3304 classes). We came in third in terms of accuracy (F-measure) [5].


The AgreementMaker, which is currently in its third version, has been evolving to accommodate: (1) user requirements, as expressed by domain experts; (2) a wide range of input (ontology) and output (agreement file) formats; (3) a large choice of matching methods depending on the different granularity of the set of components being matched (local vs. global), on different features considered in the comparison (conceptual vs. structural), on the amount of intervention that they require from users (manual vs. automatic), on usage (stand-alone vs. composed), and on the types of components to consider (schema only or schema and instances); (4) improved performance, that is, accuracy (precision, recall, F-measure) and efficiency (execution time) for the automatic methods; (5) an extensible architecture to incorporate new methods easily and to tune their performance; (6) the capability to evaluate, compare, and combine different strategies and matching results; (7) a comprehensive user interface supporting both advanced visualization techniques and a control panel that drives all the matching methods and evaluation strategies.


In this demo paper, we focus on the most recent developments of the system, which has been almost completely redesigned in the last year. In particular, we describe: (1) the user interface with particular emphasis on the control panel and improved visualization and interaction capabilities; (2) the automatic matching methods and execution capabilities; and (3) the evaluation strategies for determining the efficiency of the matching methods and for performing the combination of results.




There are several notable systems related to ours, including Clio [6], COMA++ [1], Falcon-AO [7], and RiMOM [14] (just to mention a few). Clio stands apart because of its single focus on database-specific constraints and operators (e.g., foreign keys, joins) to infer the mappings, whereas constraints in ontologies (as implemented in the other three systems and in AgreementMaker) are of a different nature [12]. This different emphasis also permeates the remaining components of the various systems, as those that also support ontology matching implement a rich toolbox of string-similarity and structure-based techniques and focus on performance. Consequently, some of these systems do not focus on user interaction: for example, Falcon-AO and RiMOM provide simple interfaces that offer limited user interaction (e.g., no manual manipulation of the ontologies). However, what separates AgreementMaker from these other systems (including from COMA++, which has a more sophisticated user interface than the other two) is the degree to which it integrates the evaluation of the quality of the obtained mappings with the graphical user interface and therefore with the iterative matching process. This tight integration emerged from our work with domain experts, who required that the evaluation be an integral part of the matching process, not an “add on” capability.

(Translator's note on the integration point above: the gist seems to be that improvements in the evaluation results can be seen directly during the iterative matching process?)



The AgreementMaker supports a wide variety of methods or matchers. Our architecture (see Figure 1) allows for serial and parallel composition where, respectively, the output of one or more methods can be used as input to another one, or several methods can be used on the same input and then combined. A set of mappings may therefore be the result of a sequence of steps, called layers.


The matching process of a generic matcher (see Figure 2) can be divided into two main modules: (1) similarity computation, in which each concept of the source ontology is compared with all the concepts of the target ontology, thus producing two similarity matrices (one for classes and the other one for properties), which contain a value for each pair of concepts; (2) mapping selection, in which the matrix is scanned to select only the best mappings according to a given threshold and to the cardinality of the correspondences, for example, 1-1, 1-N, N-1, or M-N.
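
The two modules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AgreementMaker code: the function names `similarity_matrix` and `select_mappings` are hypothetical, the string similarity is the standard library's `difflib.SequenceMatcher` (a stand-in for whatever measure a concrete matcher uses), and the 1-1 selection here is a simple greedy pass. The sketch handles one list of labels; in the setting described above it would be run once for classes and once for properties.

```python
from difflib import SequenceMatcher


def similarity_matrix(source, target):
    """Module 1: compare every source label with every target label."""
    return [[SequenceMatcher(None, s.lower(), t.lower()).ratio() for t in target]
            for s in source]


def select_mappings(matrix, threshold, one_to_one=True):
    """Module 2: scan the matrix and keep the best cells.

    Cells below the threshold are discarded. With one_to_one=True a greedy
    pass keeps at most one mapping per row and per column (1-1 cardinality);
    otherwise every remaining cell is returned (M-N cardinality).
    """
    candidates = [(sim, i, j)
                  for i, row in enumerate(matrix)
                  for j, sim in enumerate(row)
                  if sim >= threshold]
    # Highest similarity first; ties broken by position for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    if not one_to_one:
        return [(i, j, sim) for sim, i, j in candidates]
    used_rows, used_cols, selected = set(), set(), []
    for sim, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            selected.append((i, j, sim))
            used_rows.add(i)
            used_cols.add(j)
    return selected


source_labels = ["Heart", "Lung", "Blood vessel"]
target_labels = ["heart", "lungs", "vessel"]
matrix = similarity_matrix(source_labels, target_labels)
for i, j, sim in select_mappings(matrix, threshold=0.6):
    print(source_labels[i], "->", target_labels[j], round(sim, 3))
```

Keeping the two steps separate means the same matrix can be rescanned with a different threshold or cardinality without recomputing any similarity.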


To enable extensibility, we adopted the object-oriented template pattern by defining the skeleton of the matching process in a generic matcher, which defers only a few operations to the concrete matcher extensions (see Figure 3). This abstraction minimizes development effort by completely decoupling the structure of a single method from the architecture of the whole system, thus allowing reuse or any possible composition of matching modules.
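
As a rough illustration of the template pattern described here, the sketch below fixes the skeleton of the matching process in a base class and defers only the similarity function to subclasses. All class and method names (`AbstractMatcher`, `ExactLabelMatcher`, `TokenOverlapMatcher`, `match`, `similarity`, `select`) are hypothetical and do not come from the AgreementMaker source.

```python
from abc import ABC, abstractmethod


class AbstractMatcher(ABC):
    """Hypothetical generic matcher that fixes the skeleton of the process."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold

    def match(self, source, target):
        # Template method: the order of the two steps never changes.
        matrix = [[self.similarity(s, t) for t in target] for s in source]
        return self.select(source, target, matrix)

    @abstractmethod
    def similarity(self, s, t):
        """The one operation deferred to concrete matchers."""

    def select(self, source, target, matrix):
        # Default selection: every pair at or above the threshold.
        return [(s, t, matrix[i][j])
                for i, s in enumerate(source)
                for j, t in enumerate(target)
                if matrix[i][j] >= self.threshold]


class ExactLabelMatcher(AbstractMatcher):
    def similarity(self, s, t):
        return 1.0 if s.lower() == t.lower() else 0.0


class TokenOverlapMatcher(AbstractMatcher):
    def similarity(self, s, t):
        # Jaccard overlap of the whitespace-separated tokens.
        a, b = set(s.lower().split()), set(t.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0


print(ExactLabelMatcher().match(["Heart", "Lung"], ["heart", "Kidney"]))
print(TokenOverlapMatcher(threshold=0.5).match(["blood vessel"],
                                               ["vessel", "blood vessel wall"]))
```

A new matcher only has to supply `similarity`; the rest of the system can call `match` without knowing which concrete method is behind it, which is what makes composition of matchers straightforward.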


A first layer matcher produces the similarity matrices, while the second and third layer matchers extend the first layer matchers. In particular, a second layer matcher improves on the results of a first layer matcher using conceptual or structural information, depending on whether it considers one concept alone or a concept and its neighbors. Finally, a third layer matcher combines the results of two or more matchers from the previous layers, in order to obtain a final matching or alignment, that is, a set of mappings.




The source and target ontologies (in XML, RDFS, OWL, or N3) are visualized side by side using the familiar outline tree paradigm (see Figure 4). Agreements can be exported in different formats (e.g., XML, Excel). Because all the matching operations and their results are managed by this interface, we gave special consideration to its design [4]. Next, we describe two new features of the interface: the control panel and the visualization of non-hierarchical ontologies (e.g., due to multiple inheritance in OWL). The latter feature allows for specific subtrees to be visually duplicated. Because we adopt the Model-View-Control pattern, this duplication does not affect the underlying data structures.

The control panel (see Figure 5) allows users to run and manage matching methods and their results. Users can select parameters common to all methods (such as threshold and cardinality) and method-specific parameters. When a method has run, a new row is dynamically added to the table that is part of the control panel, at the same time that lines depicting the mappings between the concepts are added (see Figure 4). Each row is color coded and can be selected so that the corresponding mappings (of the same color) can be compared visually. Each row also displays the performance values for the associated method, thus allowing for comparison with those of other rows. In addition, users can modify the method parameters at runtime by directly changing their values in the table or by selecting previously calculated matchings as input to the methods to be applied next. Multiple matchings can also be combined manually or with an automatic combination matcher.




First layer matchers compare concept features (e.g., label, comments, annotations, and instances) and use a variety of methods, including syntactic and lexical comparison algorithms as well as a lexicon such as WordNet. Of those methods, some were proposed by others (e.g., edit distance, Jaro-Winkler) and some were devised by us, including a substring-based comparison that favors the length of the common substrings and a concept document-based comparison containing a wide range of features. Those features are represented as TF-IDF vectors and use a cosine similarity metric (see Figure 6).
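
To make the document-based comparison concrete, here is a minimal sketch of TF-IDF vectors compared with cosine similarity. It is not the authors' implementation: the paper does not give the exact weighting, so the smoothed IDF used here is an assumption, and the helper names `tfidf_vectors` and `cosine` as well as the toy "concept documents" are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (a dict) per tokenized document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        vector = {}
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n) / (1 + doc_freq[term])) + 1.0
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Each "concept document" concatenates the words of a concept's features
# (label, comments, and so on); these three are made up.
docs = [
    "heart organ pumps blood".split(),     # source concept
    "heart pumps blood body".split(),      # target concept A
    "lung organ breathing".split(),        # target concept B
]
v_source, v_a, v_b = tfidf_vectors(docs)
print(round(cosine(v_source, v_a), 3), round(cosine(v_source, v_b), 3))
```

The source concept shares three terms with target A and only one with target B, so the first similarity comes out higher than the second.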


Second layer matchers use structural properties of the ontologies. Our own methods include the Descendant’s Similarity Inheritance (DSI) and the Sibling’s Similarity Contribution (SSC) matchers [3].


Finally, third layer matchers combine the results of two or more matchers so as to obtain a unique final matching in two steps. In the first step, a similarity matrix is built for each pair of concepts, using our Linear Weighted Combination (LWC) matcher, which processes the weighted average for the different similarity results (see Figure 7). Weights can be assigned manually or automatically, the latter assignment being determined using our evaluation methods. The second step uses that similarity matrix and takes into account a threshold value and the desired cardinality. When the cardinality is 1-1, we adopt the Shortest Augmenting Path algorithm [9] to find the optimal solution for this optimization problem (namely the assignment problem reduced to the maximum weight matching in a bipartite graph) in polynomial time.
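
The two steps can be sketched as follows. The first function is a cell-wise weighted average in the spirit of the LWC matcher. The second finds an optimal 1-1 selection by exhaustive search, which is a deliberately simple stand-in for the Shortest Augmenting Path algorithm [9]: it gives the same kind of optimal answer on tiny matrices but runs in factorial time. Function names and the example matrices are assumptions made for illustration.

```python
from itertools import permutations


def linear_weighted_combination(matrices, weights):
    """Step 1: cell-wise weighted average of same-shaped similarity matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
             for j in range(cols)]
            for i in range(rows)]


def best_one_to_one(matrix, threshold):
    """Step 2: 1-1 selection maximizing the total similarity.

    Exhaustive search, only viable for tiny inputs; assumes rows <= cols.
    """
    rows, cols = len(matrix), len(matrix[0])
    # Cells below the threshold contribute no weight to the assignment.
    weight = [[v if v >= threshold else 0.0 for v in row] for row in matrix]
    best_score, best_perm = -1.0, ()
    for perm in permutations(range(cols), rows):
        score = sum(weight[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return [(i, j) for i, j in enumerate(best_perm) if matrix[i][j] >= threshold]


# Results of two hypothetical matchers on the same 2x2 set of concept pairs.
matcher_a = [[1.0, 0.6], [0.6, 0.0]]
matcher_b = [[0.8, 1.0], [1.0, 0.2]]
combined = linear_weighted_combination([matcher_a, matcher_b], weights=[1, 1])
print(combined)
print(best_one_to_one(combined, threshold=0.5))
```

In this example a greedy selection would take the single best cell (0, 0) and then find nothing above the threshold for the second row, whereas the optimal assignment keeps two mappings, (0, 1) and (1, 0), with a higher total similarity. That difference is the reason for solving the assignment problem rather than picking cells one at a time.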




The design of optimal methods to find correct and complete mappings between real-world ontologies is a hard task for several reasons. First of all, an algorithm may be effective for a given scenario, but not for others. Even within the same scenario, the use of different parameters can significantly change the outcome. Moreover, in interviewing domain experts in the geospatial domain, we discovered that they do not trust automatic methods unless quality metrics are associated with the matching results. These observations have motivated a variety of evaluation techniques that determine runtime and accuracy (precision, recall, and F-measure).


The most effective evaluation technique compares the mappings found by the system between the two ontologies with a reference matching or “gold standard,” which is a set of correct and complete mappings as built by domain experts. When a reference matching is available, the AgreementMaker can determine the quality of the found matching analytically or visually. A reference matching can also be used to tune algorithms by using a feedback mechanism provided by a succession of runs.
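
For reference, the comparison against a gold standard reduces to standard set arithmetic. The sketch below uses the usual definitions of precision, recall, and F-measure; the function name `evaluate` and the example mappings are made up for illustration.

```python
def evaluate(found, reference):
    """Return (precision, recall, F-measure) of found mappings vs. a reference."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    # Precision: share of found mappings that are correct.
    precision = correct / len(found) if found else 0.0
    # Recall: share of reference mappings that were found.
    recall = correct / len(reference) if reference else 0.0
    # F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


found = {("Heart", "heart"), ("Lung", "lungs"), ("Vein", "artery")}
reference = {("Heart", "heart"), ("Lung", "lungs"), ("Vein", "vein"), ("Skin", "skin")}
precision, recall, f_measure = evaluate(found, reference)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

Here two of the three found mappings are correct (precision 2/3) and two of the four reference mappings are recovered (recall 1/2), giving an F-measure of 4/7.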


When a gold standard is not available, “inherent” quality measures need to be considered. Quality measures can be defined at two levels as associated with the two main modules of a matcher (see Figure 2): similarity or selection level. We can consider local quality as associated with a correspondence at the similarity level (or mapping at the selection level) or global quality as associated with all the correspondences at the similarity level (or with all possible mappings at the selection level). We have incorporated in our system a global-selection quality measure proposed by others [8] and a local-similarity quality measure that we have devised. Experiments have shown that our quality measure is usually effective in defining weights for the LWC matcher.




Our demo focuses on the matching methods and evaluation strategies for determining the efficiency of ontology matching methods. Due to the tight integration of the evaluation strategies with the graphical user interface, a unique feature of our system, all the steps will be performed through the interface. Users will start by uploading their own ontologies, loading ours, or downloading ontologies from the web, thus taking advantage of the several standard formats supported. Users can then explore the interface freely or follow a walk-through, consisting of browsing the ontologies, expanding and contracting nodes, and customizing the display. They have access to the information associated with each concept to be aligned, including descriptions, annotations, and (context) relations, and they can use this information to visually detect mappings.




