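Lei Zou (Peking University): RDF Data Management (Lecture 3)

These notes are from Lecture 3 of the knowledge graph lecture series given by Professor Lei Zou of Peking University.

RDF Graph Data Management

Video: https://www.bilibili.com/video/BV1yg4y1v7aD

Outline: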

  • RDF Database Systems
    • Background
    • Existing Solutions
    • Graph-Based Approaches
  • Distributed System
    • Existing Solutions
    • gStore: a distributed SPARQL query engine

RDF and Semantic Web

  • RDF (Resource Description Framework) is a language for the conceptual modeling of information about web resources
  • A building block of the Semantic Web
    • Facilitates exchange of information
    • Search engines can retrieve more relevant information
    • Facilitates data integration (mashups)
  • Machine understandable
    • Machines can understand the information on the web and the interrelationships among resources

1. RDF Database Systems

RDF Introduction

  • Everything is a uniquely named resource
  • Namespaces can be used to scope the names
  • Properties of resources can be defined
  • Relationships with other resources can be defined
  • Resources can be contributed by different people / groups and can be located anywhere in the web
    • Integrated web "database"

RDF Data Model

(Figure: RDF Data Model)

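An RDF dataset is essentially a set of triples (subject, predicate, object), which can be represented as a graph.
(Figure: RDF Graph)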
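
To make the triple-set view concrete, here is a minimal Python sketch. The entity and predicate names (`ex:Abraham_Lincoln`, `ex:bornIn`, and so on) are made up for illustration and are not data from the lecture. It stores a few triples and derives the labeled directed graph from them: each subject and object becomes a vertex, each predicate a labeled edge.

```python
from collections import defaultdict

# Hypothetical example triples (subject, predicate, object); not from the lecture.
triples = {
    ("ex:Abraham_Lincoln", "ex:hasName", '"Abraham Lincoln"'),
    ("ex:Abraham_Lincoln", "ex:bornIn", "ex:Hodgenville"),
    ("ex:Abraham_Lincoln", "ex:diedIn", "ex:Washington_DC"),
    ("ex:Washington_DC", "ex:hasName", '"Washington D.C."'),
}


def build_graph(triples):
    """Turn a set of triples into adjacency lists: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for s, p, o in sorted(triples):  # sorted only to make the order deterministic
        graph[s].append((p, o))
    return dict(graph)


graph = build_graph(triples)
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
```

Nothing is lost in the conversion: the same triples can be read back off the edges, which is why the set view and the graph view are interchangeable.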

RDF Query Model

(Figure: RDF Query Model)
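
The WHERE clause of a SPARQL query is a set of triple patterns in which some positions are variables, so the query itself forms a small graph. The sketch below is a naive illustration of that idea, not how any real engine evaluates queries. The data, the `ex:` names, and the helper functions `match_pattern` and `evaluate` are all hypothetical. It extends variable bindings one triple pattern at a time.

```python
# Hypothetical data; the names are illustrative only.
triples = [
    ("ex:Lincoln", "ex:bornIn", "ex:Hodgenville"),
    ("ex:Lincoln", "ex:diedIn", "ex:Washington_DC"),
    ("ex:Reagan", "ex:bornIn", "ex:Tampico"),
    ("ex:Washington_DC", "ex:locatedIn", "ex:USA"),
]

# Roughly: SELECT ?p ?c WHERE { ?p ex:diedIn ?c . ?c ex:locatedIn ex:USA . }
query = [
    ("?p", "ex:diedIn", "?c"),
    ("?c", "ex:locatedIn", "ex:USA"),
]


def is_var(term):
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on conflict."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate(query, triples):
    """Evaluate the triple patterns one after another, joining on shared variables."""
    bindings = [{}]
    for pattern in query:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


results = evaluate(query, triples)
print(results)
```

The shared variable `?c` is what ties the two patterns together; in the graph view it is the vertex that both query edges touch.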

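Traditional storage and query approach:

  • Store the triples in a relational database
  • Translate SPARQL queries into SQL queries
  • The drawback is that query performance becomes very poor when the data volume is large, because each additional triple pattern typically adds another self-join over the triple table.
  • Existing optimizations:
    • Property Tables
    • Binary Tables
    • Exhaustive Indexing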
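
The relational approach can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names: the table `triples(s, p, o)`, the `ex:` identifiers, and the hand-written SQL translation are hypothetical, not taken from the lecture. A two-pattern SPARQL query becomes a self-join of the triple table. The loop that creates one index per ordering of (s, p, o) only mimics the spirit of exhaustive indexing with ordinary SQLite indexes.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical data, for illustration only.
        ("ex:Lincoln", "ex:bornIn", "ex:Hodgenville"),
        ("ex:Lincoln", "ex:diedIn", "ex:Washington_DC"),
        ("ex:Reagan", "ex:bornIn", "ex:Tampico"),
        ("ex:Washington_DC", "ex:locatedIn", "ex:USA"),
    ],
)

# In the spirit of exhaustive indexing: one index per permutation of (s, p, o).
for perm in itertools.permutations("spo"):
    index_name = "idx_" + "".join(perm)
    conn.execute(f"CREATE INDEX {index_name} ON triples ({', '.join(perm)})")

# SPARQL: SELECT ?p ?c WHERE { ?p ex:diedIn ?c . ?c ex:locatedIn ex:USA . }
# Each triple pattern becomes one scan of the triple table (t1, t2);
# the shared variable ?c becomes the join condition t1.o = t2.s.
sql = """
SELECT t1.s AS person, t1.o AS city
FROM triples AS t1
JOIN triples AS t2 ON t1.o = t2.s
WHERE t1.p = 'ex:diedIn'
  AND t2.p = 'ex:locatedIn'
  AND t2.o = 'ex:USA'
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

A query with n triple patterns needs n - 1 such joins over one very large table, which is where the cost comes from as the data grows.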

Graph-Based Approaches

gStore

  • We work directly on the RDF graph and the SPARQL query graph
    • Answering SPARQL query = subgraph matching
    • Subgraph matching is computationally expensive
  • Use a signature-based encoding of each entity and class vertex to speed up matching
  • Filter-and-evaluate
    • Use a false-positive algorithm to prune nodes and obtain a set of candidates; then evaluate those candidates in detail
  • We develop an index over the data signature graph for efficient pruning
    Key idea: answer SPARQL queries by subgraph matching
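
The filter-and-evaluate idea can be illustrated with a deliberately simplified sketch. It is not gStore's actual encoding: the 16-bit signature width, the use of `zlib.crc32` as the hash, the restriction to outgoing edges, and all names and data are assumptions made for this example, and the index over the signature graph is omitted. Each vertex gets a bitstring built from its adjacent edges. A data vertex can only match a query vertex if it has every bit of the query signature set, so a cheap bitwise test prunes most vertices before the exact check runs.

```python
import zlib

SIG_BITS = 16  # tiny on purpose, so hash collisions (false positives) are possible


def bit_for(item):
    """Map a string to one bit position using a stable hash (crc32)."""
    return 1 << (zlib.crc32(item.encode("utf-8")) % SIG_BITS)


def signature(out_edges):
    """OR together one bit per predicate and one bit per (predicate, neighbor) pair."""
    sig = 0
    for predicate, neighbor in out_edges:
        sig |= bit_for(predicate)
        if neighbor is not None:  # None stands for a variable neighbor in the query
            sig |= bit_for(predicate + "|" + neighbor)
    return sig


def exact_match(query_edges, data_edges):
    """Detailed evaluation: every query edge must have a matching data edge."""
    return all(
        any(p == dp and (n is None or n == dn) for dp, dn in data_edges)
        for p, n in query_edges
    )


# Hypothetical data graph: vertex -> outgoing (predicate, neighbor) edges.
data = {
    "ex:Lincoln": [("ex:bornIn", "ex:Hodgenville"), ("ex:diedIn", "ex:Washington_DC")],
    "ex:Reagan": [("ex:bornIn", "ex:Tampico"), ("ex:diedIn", "ex:Los_Angeles")],
    "ex:Washington_DC": [("ex:locatedIn", "ex:USA")],
}
data_sigs = {v: signature(edges) for v, edges in data.items()}

# Query vertex ?x with: ?x ex:diedIn ex:Washington_DC . ?x ex:bornIn ?y .
query_edges = [("ex:diedIn", "ex:Washington_DC"), ("ex:bornIn", None)]
query_sig = signature(query_edges)

# Filter: keep vertices whose signature covers every bit of the query signature.
candidates = [v for v in sorted(data) if data_sigs[v] & query_sig == query_sig]
# Evaluate: run the exact (more expensive) check on the surviving candidates only.
answers = [v for v in candidates if exact_match(query_edges, data[v])]
print(candidates, answers)
```

A true match always survives the filter, because its signature is built from a superset of the query's edges. A non-match can slip through only when hashes collide, and the evaluation step then removes it.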

Subgraph Isomorphism

  • Edge-Join Strategy
  • Vertex-Join Strategy
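
As a rough illustration, and assuming that a vertex-join strategy means growing a partial match one query vertex at a time, the sketch below performs backtracking subgraph matching on a toy graph. The data, the function name `vertex_join`, and the treatment of every query term as a variable are assumptions for this example. The mapping is kept injective to follow the isomorphism definition; SPARQL matching itself does not require distinct variables to bind to distinct vertices, so that check could be dropped. Under the same reading, an edge-join strategy would extend partial matches by one query edge per step instead.

```python
# Hypothetical data graph as a set of labeled edges (source, label, target).
data_edges = {
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("a", "worksAt", "x"),
    ("c", "worksAt", "x"),
}

# Query graph: ?u knows ?v . ?u worksAt ?w  (all endpoints are variables here)
query_edges = [("?u", "knows", "?v"), ("?u", "worksAt", "?w")]


def vertex_join(query_edges, data_edges):
    """Bind one query vertex per step and backtrack on the first violated edge."""
    query_vertices = []
    for s, _, o in query_edges:
        for v in (s, o):
            if v not in query_vertices:
                query_vertices.append(v)
    data_vertices = sorted({v for s, _, o in data_edges for v in (s, o)})
    results = []

    def consistent(mapping):
        # Every query edge whose endpoints are both bound must exist in the data.
        return all(
            (mapping[s], p, mapping[o]) in data_edges
            for s, p, o in query_edges
            if s in mapping and o in mapping
        )

    def extend(i, mapping):
        if i == len(query_vertices):
            results.append(dict(mapping))
            return
        qv = query_vertices[i]
        for dv in data_vertices:
            if dv in mapping.values():  # keep the mapping injective
                continue
            mapping[qv] = dv
            if consistent(mapping):
                extend(i + 1, mapping)
            del mapping[qv]

    extend(0, {})
    return results


matches = vertex_join(query_edges, data_edges)
print(matches)
```

The exhaustive search over data vertices is what makes subgraph matching expensive in general; pruning candidates before this step is where filtering techniques pay off.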

Distributed System

RDF Data Volumes

  • RDF data volumes are growing fast
    • Linked data cloud currently consists of 325 datasets with > 25B triples
    • Size almost doubling every year
  • Linking Open Data cloud diagram

Large RDF Datasets

  • RDF datasets keep getting larger
  • Examples: YAGO, Freebase, DBpedia
  • Massive volumes of RDF data are growing beyond the capacity of RDF database systems operating on a single machine.

gStore: a distributed SPARQL query engine

System Framework: Overview