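This post is a repost. Original link: https://blog.csdn.net/hagen666/article/details/72801732
Freebase Metadata
As noted in the previous post, A First Look at the Freebase Data Dump Structure (Part 1), the Freebase dump file is stored as triples, and which predicates a given MID can carry is determined by its types. The dump also contains a set of special edges that describe Freebase properties, types, domains, and namespaces. This information is organized as triples too, and it "controls" the rest of the data in Freebase. It is called "metadata", that is, data about data.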
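To make the triple layout concrete, here is a minimal Python sketch that splits one line of the abbreviated form used in this article into its parts. The regular expression and the name parse_triple are illustrative assumptions, not part of any Freebase tooling; the sketch handles only <...> terms and quoted literals with an optional language tag, and it does not cover typed literals or the full-URI form of the real dump.

```python
import re

# Matches: <subject> <predicate> object [.]
# The object is either another <...> term or a quoted literal with an optional @lang tag.
TRIPLE_RE = re.compile(
    r'^<(?P<s>[^>]+)>\s+<(?P<p>[^>]+)>\s+'
    r'(?:<(?P<o_ref>[^>]+)>|"(?P<o_lit>(?:[^"\\]|\\.)*)"(?:@(?P<lang>[A-Za-z0-9-]+))?)'
    r'\s*\.?\s*$'
)

def parse_triple(line):
    """Return (subject, predicate, object, lang), or None if the line does not match."""
    m = TRIPLE_RE.match(line.strip())
    if m is None:
        return None
    obj = m.group("o_ref") if m.group("o_ref") is not None else m.group("o_lit")
    return m.group("s"), m.group("p"), obj, m.group("lang")

print(parse_triple('<m.04nt> <type.object.id> "/people/person/nationality" .'))
print(parse_triple('<m.04nt> <freebase.property_hints.display_orientation> "horizontal"@en .'))
```

The trailing period is optional in the pattern because some listings in this article omit it.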
Freebase Property Metadata
Let's first look at the metadata for Freebase properties (predicates).
A predicate in Freebase, such as <people.person.nationality>, is defined as follows:
<people.person.nationality> <type.object.type> <type.property> .
<m.04nt> <type.object.type> <freebase.property_hints> .
<m.04nt> <type.object.type> <type.property> .
<m.04nt> <type.object.id> "/people/person/nationality" .
It therefore carries the properties of both <type.property> and <freebase.property_hints>:
<m.04nt> <type.property.unique> "false" .
<m.04nt> <type.property.schema> <m.04kr> .
<m.04nt> <http://www.w3.org/2000/01/rdf-schema#domain> <m.04kr> .
<m.04nt> <type.property.expected_type> <m.01mp> .
<m.04nt> <http://www.w3.org/2000/01/rdf-schema#range> <m.01mp> .
<m.04nt> <freebase.property_hints.disambiguator> "true" .
<m.04nt> <freebase.property_hints.display_none> "false" .
<m.04nt> <freebase.property_hints.deprecated> "false" .
<m.04nt> <freebase.property_hints.display_orientation> "horizontal"@en .
<m.04nt> <freebase.property_hints.inverse_description> "{name}: Nationality"@en .
<people.person.nationality> <type.property.unique> "false" .
<people.person.nationality> <type.property.expected_type> <location.country> .
<people.person.nationality> <http://www.w3.org/2000/01/rdf-schema#range> <location.country> .
<people.person.nationality> <type.property.schema> <people.person> .
<people.person.nationality> <http://www.w3.org/2000/01/rdf-schema#domain> <people.person> .
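These properties constrain how <people.person.nationality> behaves: its object must be of type <location.country>, its subject must be of type <people.person>, and the unique flag says whether it is limited to a single value. <m.04nt> can be regarded, roughly, as the MID of <people.person.nationality>, and it carries its own set of property-related information. Which properties a type has is determined by triples like the following, shown here for the type /type/property (<m.02h>):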
<m.02h> <type.type.properties> <m.02k6vs2> .
<m.02h> <type.type.properties> <m.03hd3j6> .
<m.02h> <type.type.properties> <m.03p3sqb> .
<m.02h> <type.type.properties> <m.03p3srk> .
<m.02h> <type.type.properties> <m.04dyr7w> .
<m.02h> <type.type.properties> <m.075> .
<m.02h> <type.type.properties> <m.07v> .
<m.02h> <type.type.properties> <m.08h> .
<m.02h> <type.type.properties> <m.094> .
<m.02h> <type.type.properties> <m.0gf> .
<m.02h> <type.type.properties> <m.0lcdm_h> .
<m.02h> <type.object.id> "/type/property" .
<m.075> <type.object.id> "/type/property/expected_type" .
<m.07v> <type.object.id> "/type/property/master_property" .
<m.08h> <type.object.id> "/type/property/schema" .
<m.094> <type.object.id> "/type/property/unique" .
<m.0gf> <type.object.id> "/type/property/reverse_property" .
<m.0lcdm_h> <type.object.id> "/type/property/authorities" .
<m.02k6vs2> <type.object.id> "/type/property/unit" .
<m.03hd3j6> <type.object.id> "/type/property/enumeration" .
<m.03p3sqb> <type.object.id> "/type/property/delegated" .
<m.03p3srk> <type.object.id> "/type/property/links" .
<m.04dyr7w> <type.object.id> "/type/property/requires_permission" .
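As we can see, a predicate is subject to a number of constraints: the type it belongs to (schema), the type it expects for its object (expected_type), its inverse predicate (master_property or reverse_property), and whether it is limited to a single object per subject (unique; "false" means multiple objects are allowed).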
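The constraints listed above can be gathered into one record per predicate. The following is a small sketch under the assumption that the triples have already been parsed into (subject, predicate, object) tuples; PropertySchema and collect_property_schemas are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PropertySchema:
    schema: Optional[str] = None         # type the property belongs to
    expected_type: Optional[str] = None  # type expected for the object
    unique: Optional[bool] = None        # True if at most one object per subject

def collect_property_schemas(triples):
    """Build {property: PropertySchema} from (subject, predicate, object) tuples."""
    schemas = {}
    for s, p, o in triples:
        if p == "type.property.schema":
            schemas.setdefault(s, PropertySchema()).schema = o
        elif p == "type.property.expected_type":
            schemas.setdefault(s, PropertySchema()).expected_type = o
        elif p == "type.property.unique":
            schemas.setdefault(s, PropertySchema()).unique = (o == "true")
    return schemas

# Triples taken from the listing earlier in this article.
triples = [
    ("people.person.nationality", "type.property.unique", "false"),
    ("people.person.nationality", "type.property.expected_type", "location.country"),
    ("people.person.nationality", "type.property.schema", "people.person"),
]
schemas = collect_property_schemas(triples)
print(schemas["people.person.nationality"])
```

Fields stay None when the dump has no corresponding triple, which keeps "not stated" distinct from "false".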
And for <freebase.property_hints>:
<m.04sf> <type.type.properties> <m.04tf> .
<m.04sf> <type.type.properties> <m.04tp> .
<m.04sf> <type.type.properties> <m.04ty> .
<m.04sf> <type.type.properties> <m.050z1jy> .
<m.04sf> <type.type.properties> <m.0jvdq3y> .
<m.04sf> <type.type.properties> <m.01y195q> .
<m.04sf> <type.type.properties> <m.0hgdg4h> .
<m.04sf> <type.type.properties> <m.0116_z7d> .
<m.04sf> <type.type.properties> <m.0kvwnmb> .
<m.04sf> <type.type.properties> <m.04ch6qm> .
<m.04sf> <type.type.properties> <m.01170cgw> .
<m.04sf> <type.object.id> "/freebase/property_hints" .
<m.04tf> <type.object.id> "/freebase/property_hints/display_orientation" .
<m.04tp> <type.object.id> "/freebase/property_hints/disambiguator" .
<m.04ty> <type.object.id> "/freebase/property_hints/display_none" .
<m.050z1jy> <type.object.id> "/freebase/property_hints/inverse_description" .
<m.0jvdq3y> <type.object.id> "/freebase/property_hints/special_edit" .
<m.01y195q> <type.object.id> "/freebase/property_hints/enumeration" .
<m.0hgdg4h> <type.object.id> "/freebase/property_hints/deprecated" .
<m.0116_z7d> <type.object.id> "/freebase/property_hints/cardinality_two" .
<m.0kvwnmb> <type.object.id> "/freebase/property_hints/valid_bare_property" .
<m.04ch6qm> <type.object.id> "/freebase/property_hints/dont_display_in_weblinks" .
<m.01170cgw> <type.object.id> "/freebase/property_hints/required" .
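I have not yet figured out what some of these properties mean.
For this control information, the dump refers to things by MID (e.g. m.02h) rather than by ID (e.g. /type/property). The entity that stands for "property" has the ID <type.property> and the MID <m.02h>. For example, the triple: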
<m.04sf> <type.type.properties> <m.04tf> .
can be read as
<freebase.property_hints> <type.type.properties> <freebase.property_hints.display_orientation> .
That is, if a MID has the type <freebase.property_hints>, it may carry the predicate <freebase.property_hints.display_orientation>.
Together, these pieces of information make up Freebase's rules about properties.
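The MID-to-ID reading described above can be automated with a lookup table built from the <type.object.id> triples. This is only a sketch: the helper names are made up for illustration, and it assumes the triples are already available as (subject, predicate, object) tuples.

```python
def id_to_term(freebase_id):
    """Convert "/freebase/property_hints" to "freebase.property_hints"."""
    return freebase_id.strip("/").replace("/", ".")

def build_mid_index(triples):
    """Map each MID to its ID-style term using <type.object.id> triples."""
    return {s: id_to_term(o) for s, p, o in triples if p == "type.object.id"}

def rewrite(triple, mid_index):
    """Replace known MIDs in subject and object position; leave the rest untouched."""
    s, p, o = triple
    return mid_index.get(s, s), p, mid_index.get(o, o)

# Triples taken from the <freebase.property_hints> listing in this article.
triples = [
    ("m.04sf", "type.object.id", "/freebase/property_hints"),
    ("m.04tf", "type.object.id", "/freebase/property_hints/display_orientation"),
    ("m.04sf", "type.type.properties", "m.04tf"),
]
index = build_mid_index(triples)
print(rewrite(triples[2], index))
# ('freebase.property_hints', 'type.type.properties', 'freebase.property_hints.display_orientation')
```

MIDs without a <type.object.id> triple pass through unchanged, so ordinary entities keep their MID form.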
Freebase Type Metadata
Likewise, a type carries a good deal of control information. Take <people.person>, whose MID is <m.04kr>. Some of its information is shown below:
<m.04kr> <type.object.key> "/people/person" .
<m.04kr> <type.object.id> "/people/person" .
<m.04kr> <freebase.object_hints.best_hrid> "/people/person" .
<m.04kr> <freebase.type_hints.mediator> "false" .
<m.04kr> <freebase.type_profile.instance_count> "2225136" .
<m.04kr> <freebase.type_profile.property_count> "4127013" .
<m.04kr> <type.object.type> <type.type> .
<m.04kr> <type.object.type> <freebase.type_profile> .
<m.04kr> <type.type.domain> <m.01z0kpp> .
<m.04kr> <type.type.properties> <m.025d7wc> .
<m.04kr> <type.type.properties> <m.025d7w3> .
<m.04kr> <type.type.properties> <m.04m8> .
<m.04kr> <type.type.properties> <m.04nt> .
<m.04kr> <type.type.expected_by> <m.028xmhx> .
<m.04kr> <type.type.expected_by> <m.01z0kvv> .
<m.04kr> <freebase.type_hints.included_types> <m.01c5> .
<m.04kr> <freebase.type_profile.strict_included_types> <m.01c5> .
<m.04kr> <freebase.type_profile.published> <m.02hqglv> .
<m.04kr> <freebase.type_profile.equivalent_topic> <m.01g317> .
<m.04kr> <freebase.type_profile.strict_included_types> <m.0rhbwmv> .
<m.04kr> <freebase.type_profile.strict_included_types> <m.0rhbs7t> .
<m.04kr> <freebase.type_profile.strict_included_types> <m.0rhbqq0> .
<m.04kr> <base.ontologies.ontology_class.equivalent_classes> <m.04gf613> .
<m.04kr> <freebase.type_profile.kind> <m.06whrm1> .
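The dump also contains some triples in which <people.person> itself appears as the subject or object; these are mostly used to look up instances of <people.person>.
As for its own type information: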
<people.person> <type.object.type> <type.type> .
<m.04kr> <type.object.type> <type.type> .
<m.04kr> <type.object.type> <freebase.type_profile> .
Let's look at which properties "/type/type" and "/freebase/type_profile" each have.
/type/type:
<m.0j> <type.object.id> "/type/type" .
<m.0j> <type.type.properties> <m.098> .
<m.0j> <type.type.properties> <m.071> .
<m.0j> <type.type.properties> <m.0h9> .
<m.0j> <type.type.properties> <m.06v> .
<m.0j> <type.type.properties> <m.0hs> .
<m.0j> <type.type.properties> <m.0h0> .
<m.098> <type.object.id> "/type/type/extends" .
<m.071> <type.object.id> "/type/type/domain" .
<m.0h9> <type.object.id> "/type/type/instance" .
<m.06v> <type.object.id> "/type/type/default_property" .
<m.0hs> <type.object.id> "/type/type/expected_by" .
<m.0h0> <type.object.id> "/type/type/properties" .
/freebase/type_profile:
<m.01xxvxg> <type.object.id> "/freebase/type_profile" .
<m.01xxvxg> <type.type.properties> <m.075d9tr> .
<m.01xxvxg> <type.type.properties> <m.01z0rqz> .
<m.01xxvxg> <type.type.properties> <m.01xxvxx> .
<m.01xxvxg> <type.type.properties> <m.011nd5wd> .
<m.01xxvxg> <type.type.properties> <m.01z0c18> .
<m.01xxvxg> <type.type.properties> <m.02h8bpm> .
<m.01xxvxg> <type.type.properties> <m.0r61f7m> .
<m.01xxvxg> <type.type.properties> <m.06wrfpx> .
<m.01xxvxg> <type.type.properties> <m.042z6f1> .
<m.01xxvxg> <type.type.properties> <m.02ht4r9> .
<m.01xxvxg> <type.type.properties> <m.01xxvy4> .
<m.01xxvxg> <type.type.properties> <m.04kb3fc> .
<m.075d9tr> <type.object.id> "/freebase/type_profile/equivalent_topic" .
<m.01z0rqz> <type.object.id> "/freebase/type_profile/instance_count" .
<m.01xxvxx> <type.object.id> "/freebase/type_profile/featured_topics" .
<m.011nd5wd> <type.object.id> "/freebase/type_profile/ownership" .
<m.01z0c18> <type.object.id> "/freebase/type_profile/work_needed" .
<m.02h8bpm> <type.object.id> "/freebase/type_profile/published" .
<m.0r61f7m> <type.object.id> "/freebase/type_profile/strict_included_types" .
<m.06wrfpx> <type.object.id> "/freebase/type_profile/kind" .
<m.042z6f1> <type.object.id> "/freebase/type_profile/property_count" .
<m.02ht4r9> <type.object.id> "/freebase/type_profile/tasks" .
<m.01xxvy4> <type.object.id> "/freebase/type_profile/suggested_properties" .
<m.04kb3fc> <type.object.id> "/freebase/type_profile/delegated_type" .
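The main pieces of type metadata are: which properties the type has (<type.type.properties>), which predicates the type can serve as the object of (<type.type.expected_by>), which domain the type belongs to (<type.type.domain>), which instances the type has (<type.type.instance>, the inverse edge of <type.object.type>), and which types subsume it (<freebase.type_profile.strict_included_types>; in Freebase's terms these are the type's included types).
For example, <m.04kr> (<people.person>) belongs to the domain <m.01z0kpp> (<people>), has the property <m.04nt> (<people.person.nationality>), is subsumed by <m.01c5> (<common.topic>), and entities of this type can be the object of <m.01z0kvv> (<people.marriage.spouse>).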
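Since <type.type.properties> ties each type to the predicates it allows, one way to use this metadata is to compute which predicates an entity may carry from the types assigned to it. The sketch below assumes pre-parsed (subject, predicate, object) tuples; the function names are hypothetical.

```python
from collections import defaultdict

def build_type_properties(triples):
    """Map each type MID to the set of property MIDs listed via <type.type.properties>."""
    props = defaultdict(set)
    for s, p, o in triples:
        if p == "type.type.properties":
            props[s].add(o)
    return props

def allowed_predicates(entity_types, type_properties):
    """Union of the properties of every type the entity has."""
    allowed = set()
    for t in entity_types:
        allowed |= type_properties.get(t, set())
    return allowed

# MIDs taken from the listings in this article:
# m.04kr is <people.person>, m.04sf is <freebase.property_hints>.
triples = [
    ("m.04kr", "type.type.properties", "m.04nt"),
    ("m.04kr", "type.type.properties", "m.04m8"),
    ("m.04sf", "type.type.properties", "m.04tf"),
]
type_properties = build_type_properties(triples)
print(sorted(allowed_predicates(["m.04kr"], type_properties)))
# ['m.04m8', 'm.04nt']
```

The same pattern would apply to <type.type.expected_by> if the object side of a predicate needs checking instead.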
Freebase Domain Metadata
Above types sit domains. <m.01z0kpp> (<people>), for example, is a domain. Some of its information is shown below:
<m.01z0kpp> <type.object.id> "/people"
<m.01z0kpp> <type.object.key> "/people"
<m.01z0kpp> <type.object.type> <type.domain>
<m.01z0kpp> <type.object.type> <freebase.domain_profile>
<m.01z0kpp> <freebase.domain_profile.candidates> "true"
<m.01z0kpp> <freebase.domain_profile.category> <m.0hmw4b_>
<m.01z0kpp> <freebase.domain_profile.category> <m.021yptf>
<m.01z0kpp> <freebase.domain_profile.category> <m.01xxw9n>
<m.01z0kpp> <freebase.domain_profile.expert_group> <m.0432scl>
<m.01z0kpp> <freebase.domain_profile.featured_topic> <m.0k_s>
<m.01z0kpp> <freebase.domain_profile.featured_views> <m.04nsmt3>
<m.01z0kpp> <freebase.domain_profile.featured_views> <m.04nsmv8>
<m.01z0kpp> <freebase.domain_profile.users> <m.0ngz0f1>
<m.01z0kpp> <freebase.domain_profile.users> <m.0j12d02>
<m.01z0kpp> <type.domain.owners> <m.01z0kpj>
<m.01z0kpp> <type.domain.types> <m.03bqmw0>
<m.01z0kpp> <type.domain.types> <m.03bqndy>
<m.01z0kpp> <type.domain.types> <m.01z0kxy>
<m.01z0kpp> <type.domain.types> <m.04kr>
<m.01z0kpp> <type.domain.types> <m.02t8590>
<m.01z0kpp> <type.domain.types> <m.0_hlq_b>
<m.01z0kpp> <type.domain.types> <m.063dfd6>
<m.01z0kpp> <type.domain.types> <m.0kps54>
<m.01z0kpp> <type.domain.types> <m.063dfg8>
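Only part of the information is excerpted here. Domains and types are linked mainly through the predicate <type.domain.types>. This is of little use in practice, so it is not analyzed further.
Supplement: Namespace, ID, Key, Name, Alias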
MID
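Anyone new to Freebase is bound to get confused by the way MIDs stand for entities: what does this MID mean, and what about that one? Leaving entity merges and splits aside, entities and MIDs in Freebase are in one-to-one correspondence. Once merges and splits are taken into account, several MIDs may refer to the same entity, but only one of them is the master; the others point to it through a special predicate (<dataworld.gardening_hint.replaced_by>) [1].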
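Resolving a MID to its master can be sketched as a lookup that follows <dataworld.gardening_hint.replaced_by> edges. The MIDs below are made up for illustration, and following more than one hop is a defensive assumption rather than something the article states about the dump.

```python
def resolve_master(mid, replaced_by, max_hops=100):
    """Follow replaced_by edges from mid until reaching a MID with no outgoing edge.

    replaced_by maps an old MID to the MID that replaced it.
    """
    seen = {mid}
    for _ in range(max_hops):
        nxt = replaced_by.get(mid)
        if nxt is None:
            return mid  # no replaced_by edge: treat this MID as the master
        if nxt in seen:
            raise ValueError(f"replaced_by cycle detected at {nxt}")
        seen.add(nxt)
        mid = nxt
    raise ValueError("too many replaced_by hops")

# Hypothetical MIDs, for illustration only.
replaced_by = {"m.old1": "m.old2", "m.old2": "m.master"}
print(resolve_master("m.old1", replaced_by))    # m.master
print(resolve_master("m.master", replaced_by))  # m.master
```

The cycle check and hop limit are there only so that malformed input fails loudly instead of looping.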
Key
Freebase gives users a human-readable ID. In the dump, <type.object.key> associates a MID with a key. A MID may have zero or more keys, for example:
<m.01jzhl> <type.object.key> "/en/yao_ming"
<m.01jzhl> <type.object.key> "/wikipedia/zh-cn_title/姚明"
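Both refer to the basketball player Yao Ming.
With case sensitivity taken into account, a key uniquely identifies an entity; that is, no two entities share the same key value.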
Namespace
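Every key belongs to a namespace. For example, the namespace of "/en/yao_ming" is "/en", and "/people/person/nationality" ultimately hangs off the root namespace "/" (following the official description quoted at the end of this post, its immediate namespace is "/people/person" and its key is nationality). Some Freebase users have namespaces that start with "/user/${username}".
In the actual dump, Freebase stores non-English characters in keys in a Unicode-escaped form, which makes them easier to store.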
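As a rough illustration of undoing that escaping, the sketch below assumes the escape form is a dollar sign followed by four hexadecimal digits of the code point (for example $59DA for 姚). That format is not shown anywhere in this article, so treat it as an assumption to verify against the dump; decode_key is a hypothetical helper name.

```python
import re

# Assumed escape form: "$" followed by exactly four hex digits of a Unicode code point.
_ESCAPE_RE = re.compile(r"\$([0-9A-Fa-f]{4})")

def decode_key(key):
    """Replace each $XXXX escape in a key with the character it encodes."""
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), key)

print(decode_key("/wikipedia/zh-cn_title/$59DA$660E"))  # /wikipedia/zh-cn_title/姚明
print(decode_key("/en/yao_ming"))                       # /en/yao_ming
```

Keys without escapes pass through unchanged, so the function can be applied to every <type.object.key> value.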
ID
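For some entities (especially those that stand for a domain, type, or predicate), Freebase picks one value from the keys to serve as the entity's ID; the predicate linking the MID to the ID is <type.object.id>. Notations such as <people.person> and <people.person.nationality> are Freebase's ID representation of these entities.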
Name
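In addition, every entity can have a name, expressed through the predicate <type.object.name>; there are other name-bearing predicates as well, which are not listed here. Note that the unique property of this predicate is true, yet the entity may have one name per language tag on the object. For Yao Ming, the name information in Freebase includes: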
<m.01jzhl> <type.object.name> "Yao Ming"@en
<m.01jzhl> <type.object.name> "Yao Ming"@lt
<m.01jzhl> <type.object.name> "Jao Mins"@lv
<m.01jzhl> <type.object.name> "יאו מינג"@iw
<m.01jzhl> <type.object.name> "Yao Ming"@id
<m.01jzhl> <type.object.name> "야오밍"@ko
<m.01jzhl> <type.object.name> "Yao Ming"@tr
<m.01jzhl> <type.object.name> "Јао Минг"@sr
<m.01jzhl> <type.object.name> "Γιάο Μινγκ"@el
<m.01jzhl> <type.object.name> "Diêu Minh"@vi
<m.01jzhl> <type.object.name> "Yao Ming"@it
<m.01jzhl> <type.object.name> "Yao Ming"@ca
<m.01jzhl> <type.object.name> "Yao Ming"@hr
<m.01jzhl> <type.object.name> "Yao Ming"@no
<m.01jzhl> <type.object.name> "Yao Ming"@fi
<m.01jzhl> <type.object.name> "Jao Ming"@cs
<m.01jzhl> <type.object.name> "Yao Ming"@nl
<m.01jzhl> <type.object.name> "Yao Ming"@pt
<m.01jzhl> <type.object.name> "Yao Ming"@sv
<m.01jzhl> <type.object.name> "Yao Ming"@ms
<m.01jzhl> <type.object.name> "姚明"@zh-Hant
<m.01jzhl> <type.object.name> "Yao Ming"@et
<m.01jzhl> <type.object.name> "เหยา หมิง"@th
<m.01jzhl> <type.object.name> "Яо Мин"@ru
<m.01jzhl> <type.object.name> "یائو مینگ"@fa
<m.01jzhl> <type.object.name> "ياو مينغ"@ar
<m.01jzhl> <type.object.name> "Yao Ming"@pl
<m.01jzhl> <type.object.name> "Yao Ming"@da
<m.01jzhl> <type.object.name> "Yao Ming"@es
<m.01jzhl> <type.object.name> "Yao Ming"@fr
<m.01jzhl> <type.object.name> "姚明"@zh
<m.01jzhl> <type.object.name> "Yao Ming"@de
<m.01jzhl> <type.object.name> "Jao Ming"@hu
<m.01jzhl> <type.object.name> "姚明"@ja
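Because there is one name per language tag, picking a display name usually comes down to a language preference order. The following is a minimal sketch that assumes the names have already been extracted as (text, lang) pairs; pick_name and its default preference order are illustrative choices.

```python
def pick_name(names, preferred=("zh", "en")):
    """names: iterable of (text, lang) pairs; return the first match in preference order."""
    by_lang = {}
    for text, lang in names:
        by_lang.setdefault(lang, text)  # keep the first name seen per language
    for lang in preferred:
        if lang in by_lang:
            return by_lang[lang]
    return None

# A few of the names listed above for <m.01jzhl>.
names = [("Yao Ming", "en"), ("姚明", "zh"), ("야오밍", "ko"), ("Яо Мин", "ru")]
print(pick_name(names))                # 姚明
print(pick_name(names, ("fr", "en")))  # Yao Ming
print(pick_name(names, ("xx",)))       # None
```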
Alias
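There is also "Also known as" (<common.topic.alias>) and similar information for expressing an entity's names.
See also the description of Namespace, Key, and Topic ID in the official documentation at https://developers.google.com/freebase/guide/basic_concepts: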
The file directory-like hierarchy of domain, type, and property IDs is just one application of a more general concept: namespaces and keys. A namespace is like a file directory, and a key is like a file name. Just as all file names within a particular file directory must be unique among themselves, all keys within a particular namespace must also be unique among themselves.
As a more specific example, /business is the namespace corresponding to the Business domain. Within it, Business-related types are given keys (e.g., company) that are unique among themselves. Each type's ID is formed by appending its key to the namespace's ID (e.g., /business/company).
There are several kinds of namespaces beside namespaces that correspond to domains and types. Most important and frequently encountered is the /en namespace. This is the English namespace in which most well-known topics can be given unique keys to form human-readable English IDs. For example, the prolific Bob Dylan is so well-known that his topic in Freebase is given the key bob_dylan in the /en namespace, and so the topic's ID is /en/bob_dylan. This ID allows you to access his topic in the web client with the simple URL
That's all for now. The next part will mainly cover Freebase's basic types, as well as CVTs (Compound Value Types) and the magical Mediator.
References
[1] Machine ID - Freebase: http://wiki.freebase.com/wiki/Machine_ID (web snapshot)
[2] Basic Concepts: https://developers.google.com/freebase/guide/basic_concepts