10 Corpus tools

10.1 Toolset: The subset of tools formed by Pepper, Atomic and ANNIS constitutes a complete corpus workflow toolchain in itself, with all three tools built on Salt.


Note that although every tool in the set can be used independently, their interoperability means that users benefit most from corpus-tools.org when they use all of the tools together.

10.2 A common generic data model: Salt

Salt is a generic, graph-based meta model for linguistic data, implemented as an open source Java API for storing, manipulating and representing data. Salt is text-based. 

A syntactically annotated sentence modeled in Salt is given in Figure 2.


Nodes and edges are placeholders. All nodes and edges belonging to morphological annotation, syntactic annotation or information structure annotation can be bundled in separate layers.
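To make the idea of a layered annotation graph more concrete, here is a minimal sketch in Python. It is not the Salt Java API: the classes `Node`, `Edge` and `Graph`, the method names, and the layer and annotation names (`morphology`, `syntax`, `pos`, `cat`) are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Graph:
    """Hypothetical annotation graph; not the real Salt classes."""
    nodes: dict = field(default_factory=dict)   # node id -> Node
    edges: list = field(default_factory=list)   # all edges, in insertion order
    layers: dict = field(default_factory=dict)  # layer name -> set of element ids

    def add_node(self, node_id, layer=None, **annotations):
        self.nodes[node_id] = Node(node_id, dict(annotations))
        if layer is not None:
            self.layers.setdefault(layer, set()).add(node_id)
        return self.nodes[node_id]

    def add_edge(self, source, target, layer=None, **annotations):
        # Both endpoints must already exist in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("unknown node")
        edge = Edge(source, target, dict(annotations))
        self.edges.append(edge)
        if layer is not None:
            self.layers.setdefault(layer, set()).add(f"{source}->{target}")
        return edge

    def elements_in_layer(self, layer):
        # Sorted ids of all nodes and edges bundled in the given layer.
        return sorted(self.layers.get(layer, set()))


graph = Graph()
graph.add_node("tok1", layer="morphology", pos="DT")
graph.add_node("tok2", layer="morphology", pos="NN")
graph.add_node("np1", layer="syntax", cat="NP")
graph.add_edge("np1", "tok1", layer="syntax", relation="dominance")
graph.add_edge("np1", "tok2", layer="syntax", relation="dominance")
```

Here the two tokens sit in a morphology layer, while the phrase node and its two dominance edges sit in a syntax layer, so each kind of annotation can be inspected on its own.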

10.3 Creating/migrating corpus resources for annotation: Pepper

Corpora and annotations exist in a multitude of different formats. In order to prepare them for further annotation, it is necessary to convert them into a format the annotation software can process. This can be done via Pepper, a platform-independent, modular framework for converting and processing linguistic data.

Pepper supplies three types of modules: importers, manipulators and exporters, an unrestricted number of which can be combined into a single conversion workflow. Pepper is multithreaded, which greatly reduces conversion times.
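The import, manipulate, export pattern can be sketched in a few lines of Python. This is not Pepper's API or workflow format: every function name below is hypothetical, the data is reduced to a list of token records, and for simplicity the sketch uses a single importer and runs sequentially rather than in multiple threads.

```python
def import_plain_text(text):
    """Hypothetical importer: turn raw text into token records."""
    return [{"text": word} for word in text.split()]


def lowercase_manipulator(tokens):
    """Hypothetical manipulator: add a normalized form to every token."""
    return [{**tok, "norm": tok["text"].lower()} for tok in tokens]


def length_manipulator(tokens):
    """Hypothetical manipulator: add the character length of every token."""
    return [{**tok, "length": len(tok["text"])} for tok in tokens]


def export_tab_separated(tokens):
    """Hypothetical exporter: one token per line, fields separated by tabs."""
    return "\n".join(
        f'{tok["text"]}\t{tok["norm"]}\t{tok["length"]}' for tok in tokens
    )


def run_workflow(source, importer, manipulators, exporter):
    # Import once, apply any number of manipulators in order, then export.
    data = importer(source)
    for manipulate in manipulators:
        data = manipulate(data)
    return exporter(data)


result = run_workflow(
    "Salt and Pepper",
    import_plain_text,
    [lowercase_manipulator, length_manipulator],
    export_tab_separated,
)
print(result)
```

Because every module shares one intermediate representation, any importer can in principle be paired with any exporter, which is the role the common data model plays in the toolset.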

In order to build multi-layer corpora, we need to combine different kinds of annotation. Salt and Pepper allow the output of any set of importers to be combined in a merging step. Pepper can also extract metadata, structural and annotation-related information from existing corpora. Thanks to its plugin-based architecture, newly developed modules can easily be added to Pepper at any time.

Pepper comes in two flavours: as an interactive standalone command line tool and as an API library, which can be integrated in other software. 

10.4 Annotation: Atomic

To facilitate the creation of corpora, Atomic provides some basic preprocessing tools (a tokenizer and a partitioning tool) and, more importantly, extension points for further custom preprocessing steps. Any corpus processing step can thus be implemented as an Eclipse plugin and added to Atomic dynamically via the respective extension point. Atomic is therefore in principle an annotation platform rather than simply an annotation tool.
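The extension-point idea can be illustrated with a small registry in Python. This is only an analogy: Atomic's actual extension points are Eclipse plugin mechanisms in Java, and the registry, the decorator and both preprocessing functions below are hypothetical. In particular, treating the partitioning step as a naive sentence splitter is an assumption.

```python
# Hypothetical registry standing in for an extension point.
PREPROCESSORS = {}


def register_preprocessor(name):
    """Return a decorator that registers a preprocessing step under a name."""
    def decorator(func):
        PREPROCESSORS[name] = func
        return func
    return decorator


@register_preprocessor("tokenizer")
def whitespace_tokenizer(text):
    # Naive tokenizer: split on whitespace only.
    return text.split()


@register_preprocessor("partitioner")
def sentence_partitioner(text):
    # Naive partitioner (assumption): split on full stops, drop empty parts.
    return [part.strip() for part in text.split(".") if part.strip()]


def preprocess(name, text):
    # The host application only knows the registry, not the individual steps.
    return PREPROCESSORS[name](text)
```

A new step is added by registering one more function; the host code in `preprocess` does not change, which is what makes the design a platform rather than a fixed tool.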


10.5 Query and analysis: ANNIS

ANNIS provides a browser-based search and visualization architecture for complex multi-layer corpora.

ANNIS also uses Salt as its data model. It provides the native query language AQL for complex search queries, as well as different visualizations for corpus data, such as KWIC views, dependency trees and coreference views, among others. It can be extended with new plugins.
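For readers unfamiliar with KWIC (keyword in context) views, the following Python sketch shows the basic idea of listing each hit with a window of neighbouring tokens. It is not how ANNIS implements its visualizer; the function `kwic` and its `window` parameter are assumptions for illustration.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, hit, right context) for each match of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


tokens = "the tools share one data model and the model is a graph".split()
hits = kwic(tokens, "model")
for left, hit, right in hits:
    # Right-align the left context so that the hits line up in one column.
    print(f"{left:>12} [{hit}] {right}")
```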


Conclusion:

This toolset facilitates a complete workflow for multi-layer corpora, from creation and annotation to analysis and release.


Source: corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora.
