These notes are from Professor Lei Zou's (Peking University) knowledge graph lecture series (Lecture 3)
RDF Graph Data Management
Video: https://www.bilibili.com/video/BV1yg4y1v7aD
Outline:
- RDF Database Systems
  - Background
  - Existing Solutions
  - Graph-Based Approaches
- Distributed System
  - Existing Solutions
  - gStore: a distributed SPARQL query engine
RDF and Semantic Web
- RDF (Resource Description Framework) is a language for the conceptual modeling of information about web resources
- A building block of the Semantic Web
- Facilitates exchange of information
- Search engines can retrieve more relevant information
- Facilitates data integration (mashups)
- Machine understandable
- Understand the information on the web and the interrelationships among them
1. RDF Database Systems
RDF Introduction
- Everything is a uniquely named resource
- Namespaces can be used to scope the names
- Properties of resources can be defined
- Relationships with other resources can be defined
- Resources can be contributed by different people / groups and can be located anywhere in the web
- Integrated web "database"
RDF Data Model
RDF is essentially a set of (subject, predicate, object) triples, which can be represented as a graph
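To make the "set of triples = graph" view concrete, here is a minimal sketch in Python. The `ex:` names and the `build_graph` helper are made up for illustration; they are not from the lecture. Each subject and object becomes a vertex, and each predicate becomes a labeled, directed edge.

```python
from collections import defaultdict

# Hypothetical example data: each triple is (subject, predicate, object).
triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Alice", "ex:name", '"Alice"'),
    ("ex:Bob", "ex:worksAt", "ex:PKU"),
]

def build_graph(triples):
    """Return an adjacency map: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)

graph = build_graph(triples)
```

Here `graph["ex:Alice"]` holds the two outgoing edges of the `ex:Alice` vertex. Vertices that only appear as objects (such as `ex:PKU`) have no entry because they have no outgoing edges in this data.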
RDF Graph
RDF Query Model
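Traditional storage and query approach:
- Store the triples in a relational database
- Translate SPARQL queries into SQL queries for execution
- The drawback is that query performance becomes very poor when the data is large.
- Existing optimizations:
- Property Tables
- Binary Tables
- Exhaustive Indexing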
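The relational approach above can be sketched with Python's built-in `sqlite3` module. The table name `triples`, its columns, and the `ex:` data are assumptions for illustration only. The point is that each SPARQL triple pattern becomes one scan of the same table, and each shared variable becomes a join condition, so a query with several patterns turns into several self-joins.

```python
import sqlite3

# One big three-column table holding every triple (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:Alice", "ex:knows", "ex:Bob"),
        ("ex:Bob", "ex:worksAt", "ex:PKU"),
        ("ex:Carol", "ex:knows", "ex:Dave"),
        ("ex:Dave", "ex:worksAt", "ex:MIT"),
    ],
)

# SPARQL: SELECT ?x ?org WHERE { ?x ex:knows ?y . ?y ex:worksAt ?org . }
# The shared variable ?y becomes the join condition t1.o = t2.s.
sql = """
SELECT t1.s AS x, t2.o AS org
FROM triples AS t1
JOIN triples AS t2 ON t1.o = t2.s
WHERE t1.p = 'ex:knows' AND t2.p = 'ex:worksAt'
ORDER BY x
"""
rows = conn.execute(sql).fetchall()
```

With only two patterns this is one self-join. Longer queries join the same large table with itself again for each extra pattern, which is where the cost on large datasets comes from.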
Graph-Based Approaches
gStore
- We work directly on the RDF graph and the SPARQL query graph (SPARQL queries are answered with graph techniques)
- Answering SPARQL query = subgraph matching
- Subgraph matching is computationally expensive
- Use a signature-based encoding of each entity and class vertex to speed up matching
- Filter-and-evaluate
- Use a filter that allows false positives (it may keep non-matching nodes but never drops a true match) to prune nodes and obtain a set of candidates; then do more detailed evaluation on those
- We develop an index over the data signature graph for efficient pruning
Core idea: answer SPARQL queries by subgraph matching
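The filter-and-evaluate idea can be sketched as follows. This is a simplified illustration, not gStore's actual encoding or index: the 16-bit width, the `crc32` hash, and the choice to encode only outgoing predicates are all assumptions made for this example. Each vertex gets a bitstring signature; a data vertex can only match a query vertex if its signature covers every bit of the query signature.

```python
import zlib

SIG_BITS = 16  # assumed signature width; small on purpose so collisions can happen

def bit_for(label):
    """Map a label to one bit position with a deterministic hash."""
    return 1 << (zlib.crc32(label.encode("utf-8")) % SIG_BITS)

def signature(predicates):
    """OR together one bit per predicate."""
    sig = 0
    for p in predicates:
        sig |= bit_for(p)
    return sig

# Hypothetical data: vertex -> set of outgoing predicates.
data = {
    "ex:Alice": {"ex:knows", "ex:name"},
    "ex:Bob": {"ex:worksAt", "ex:name"},
    "ex:PKU": {"ex:locatedIn"},
}
data_sigs = {v: signature(ps) for v, ps in data.items()}

def candidates(query_predicates):
    """Filter step: keep vertices whose signature covers the query signature."""
    q = signature(query_predicates)
    return [v for v, sig in data_sigs.items() if (sig & q) == q]

def evaluate(query_predicates):
    """Evaluate step: an exact check removes false positives left by the filter."""
    wanted = set(query_predicates)
    return [v for v in candidates(query_predicates) if wanted <= data[v]]
```

Because two different predicates can hash to the same bit, `candidates` may return vertices that do not really match, but a vertex that has all the query predicates always passes the bit test. The cheap bitwise filter shrinks the search space, and the exact check in `evaluate` is only run on the survivors.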
Subgraph Isomorphism
- Edge-Join Strategy
- Vertex-Join Strategy
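As a rough illustration of matching a query graph against a data graph, the sketch below extends partial matches one query edge (triple pattern) at a time. Treating this as the spirit of an edge-at-a-time join is my reading of the slide, not something the notes spell out; a vertex-at-a-time variant would instead bind one query vertex per step and check all of its edges to already-bound vertices. The `match` function and the `ex:` data are hypothetical.

```python
def match(patterns, triples):
    """Return all variable bindings under which every pattern is a data triple."""
    results = [{}]  # start with one empty partial match
    for pattern in patterns:
        extended = []
        for binding in results:
            for triple in triples:
                new = dict(binding)
                ok = True
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Variable: bind it, or check it agrees with an earlier binding.
                        if new.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        # Constant: must equal the data value exactly.
                        ok = False
                        break
                if ok:
                    extended.append(new)
        results = extended
    return results

triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:worksAt", "ex:PKU"),
    ("ex:Carol", "ex:knows", "ex:Dave"),
]
query = [("?x", "ex:knows", "?y"), ("?y", "ex:worksAt", "?org")]
answers = match(query, triples)
```

This naive version scans every triple for every partial match, which shows why subgraph matching is expensive without pruning. It also lets two different variables bind to the same vertex, which matches SPARQL semantics rather than strict subgraph isomorphism.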
Distributed System
RDF Data Volumes
- RDF data volumes are growing fast
- The Linked Data cloud currently consists of 325 datasets with > 25B triples
- Size almost doubling every year
- Linking Open Data cloud diagram
Large RDF Datasets
- RDF datasets keep getting larger
- Yago, Freebase, DBpedia
- Massive volumes of RDF data are growing beyond the capacity of RDF database systems operating on a single machine.
gStore: a distributed SPARQL query engine
System Framework: Overview