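1. Applying relational third normal form (and its alternative, redundant data) in Elasticsearch
Third normal form --> split each data entity into its own table and link the tables through primary/foreign key relationships --> ensure there is no redundant data.
Redundant data means storing the fields you are likely to search on together with the data you want to retrieve, in a single doc.
Pros and cons of no redundant data
Pros: no duplicated data, easy to maintain
Cons: joins happen in the application layer; when there is a lot of related data, the second query becomes very large and performance is poor
Pros and cons of redundant data
Pros: high performance, no need to run two searches
Cons: data is duplicated and maintenance cost is high --> every time a username changes, both the user type and the blog type must be updated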
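The trade-off above can be sketched in plain Python, without a real cluster. This is only an illustration under stated assumptions: the dictionaries stand in for the users and blogs types, and the function names are hypothetical, not part of any Elasticsearch client.

```python
# In-memory stand-ins for the "users" and "blogs" types (illustrative data only).
users = {1: {"name": "黄药师"}}
normalized_blogs = [{"title": "我是黄药师", "userId": 1}]
denormalized_blogs = [
    {"title": "我是黄药师", "userInfo": {"userId": 1, "userName": "黄药师"}}
]


def blogs_by_user_normalized(name):
    # Application-side join: first find the user ids, then find their blogs.
    ids = {uid for uid, user in users.items() if user["name"] == name}
    return [b["title"] for b in normalized_blogs if b["userId"] in ids]


def blogs_by_user_denormalized(name):
    # Single lookup: the user name is stored redundantly inside each blog.
    return [b["title"] for b in denormalized_blogs
            if b["userInfo"]["userName"] == name]


def rename_user(uid, new_name):
    # The cost of redundancy: every copy of the name has to be updated.
    users[uid]["name"] = new_name
    updated = 0
    for blog in denormalized_blogs:
        if blog["userInfo"]["userId"] == uid:
            blog["userInfo"]["userName"] = new_name
            updated += 1
    return updated
```

Both lookups return the same titles; the difference is the number of steps at query time versus the number of docs touched when a name changes.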
(1) Build more test data
PUT /website/users/3
{
"name": "黄药师",
"email": "huangyaoshi@sina.com",
"birthday": "1970-10-24"
}
PUT /website/blogs/3
{
"title": "我是黄药师",
"content": "我是黄药师啊,各位同学们!!!",
"userInfo": {
"userId": 3,
"userName": "黄药师"
}
}
PUT /website/users/2
{
"name": "花无缺",
"email": "huawuque@sina.com",
"birthday": "1980-02-02"
}
PUT /website/blogs/4
{
"title": "花无缺的身世揭秘",
"content": "大家好,我是花无缺,所以我的身世是。。。",
"userInfo": {
"userId": 2,
"userName": "花无缺"
}
}
(2) Group blogs by the user who published them
GET /website/blogs/_search
{
"size": 0,
"aggs": {
"group_by_username": {
"terms": {
"field": "userInfo.userName.keyword"
},
"aggs": {
"top_blogs": {
"top_hits": {
"_source": {
"include": "title"
},
"size": 5
}
}
}
}
}
}
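2. Modeling data with multi-level hierarchy, such as a file system
(1) path_hierarchy: tokenizes text that has the form of a file path into its directory prefixes
Example: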
PUT /fs
{
"settings": {
"analysis": {
"analyzer": {
"paths":{
"tokenizer":"path_hierarchy"
}
}
}
}
}
Test:
GET /fs/_analyze
{
"analyzer": "paths",
"text": "a/b/c"
}
Result:
{
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "a/b",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "a/b/c",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
}
]
}
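To make the token output above easier to reason about, here is a rough Python sketch of what a path-hierarchy style tokenizer does. It is a simplified stand-in written for these notes, not the Lucene implementation, and the function name path_hierarchy_tokens is hypothetical.

```python
def path_hierarchy_tokens(text, delimiter="/"):
    """Return every directory prefix of a path, shortest first."""
    tokens = []
    start = 0
    while True:
        idx = text.find(delimiter, start)
        if idx == -1:
            # No more delimiters: the whole text is the last token.
            if text:
                tokens.append(text)
            break
        if idx > 0:
            # Everything before this delimiter is one prefix token.
            tokens.append(text[:idx])
        start = idx + 1
    return tokens
```

Calling path_hierarchy_tokens("a/b/c") yields the three tokens a, a/b, and a/b/c, matching the analyze result shown above.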
(2) Worked example
PUT /fs/_mapping/file
{
"properties": {
"name": {
"type": "keyword"
},
"path": {
"type": "keyword",
"fields": {
"tree": {
"type": "text",
"analyzer": "paths"
}
}
}
}
}
PUT /fs/file/1
{
"name": "README.txt",
"path": "/workspace/projects/helloworld",
"contents": "这是我的第一个elasticsearch程序"
}
PUT /fs/file/2
{
"name": "README.txt",
"path": "/workspace/projects/helloworld2",
"contents": "这是我的第一个elasticsearch程序"
}
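File search requirement: find the files whose contents include elasticsearch and that sit in the /workspace/projects/helloworld directory. The query below matches path as an exact, un-analyzed keyword value.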
GET /fs/file/_search
{
"query": {
"bool": {
"must": [
{"match": {
"contents": "elasticsearch"
}},
{"constant_score": {
"filter": {
"term": {
"path": "/workspace/projects/helloworld"
}
}
}}
]
}
}
}
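Search for all files under /workspace whose contents include elasticsearch. Reusing the query above with path changed to /workspace returns nothing, because the keyword field only matches the full value. Querying the path.tree sub-field, which is analyzed with path_hierarchy, does find them.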
GET /fs/file/_search
{
"query": {
"bool": {
"must": [
{"match": {
"contents": "elasticsearch"
}},
{"constant_score": {
"filter": {
"term": {
"path.tree": "/workspace"
}
}
}}
]
}
}
}
This returns both files.
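The difference between the two term filters can be mimicked in a few lines of Python. This is a sketch under simplifying assumptions: docs holds only the path of each file, tree_tokens approximates the path_hierarchy output for absolute paths, and both helper names are hypothetical.

```python
# Paths of the two example files, keyed by doc id.
docs = {
    1: "/workspace/projects/helloworld",
    2: "/workspace/projects/helloworld2",
}


def tree_tokens(path):
    # Approximation of path_hierarchy for absolute paths: all directory prefixes.
    parts = path.strip("/").split("/")
    return {"/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)}


def term_on_path(value):
    # keyword field: the whole path is a single term, so only exact matches hit.
    return sorted(doc_id for doc_id, path in docs.items() if path == value)


def term_on_path_tree(value):
    # path.tree field: any indexed prefix token can match.
    return sorted(doc_id for doc_id, path in docs.items()
                  if value in tree_tokens(path))
```

With this data, term_on_path("/workspace") finds nothing, while term_on_path_tree("/workspace") finds both docs.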
3. Pessimistic concurrency control with a global lock, using the _create syntax
PUT /fs/lock/global/_create
{}
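fs: the index you want to lock
lock: a type you designate for holding the global lock on this index
global: the id of the doc that represents the global lock
_create: forces the operation to be a create; if the doc /fs/lock/global already exists, the create fails with an error
If another thread tries to acquire the lock at the same time, it gets an error: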
PUT /fs/lock/global/_create
{}
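Pros and cons of the global lock
Pros: very simple to operate, very easy to use, low cost
Cons: you lock the entire index, so operations on every doc in the index are blocked, and the concurrency of the whole system becomes very low
It is a convenient choice when locking and unlocking are infrequent and the work done while holding the lock is short.
After the lock is taken, another client can still operate on the index. Is that a problem?
This lock only affects _create on the lock doc itself. It does nothing to stop updates or inserts on other docs; unless every client acquires the lock first, those can only be controlled through version.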
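The behaviour described above (create-only succeeds once, and nothing else is blocked) can be sketched with an in-memory stand-in. The class FakeIndex and the helper with_global_lock are hypothetical names for illustration; releasing the lock by deleting the lock doc is an assumption of this sketch, since the notes above do not show the release step.

```python
class LockConflict(Exception):
    """Raised when a create-only write hits an existing doc."""


class FakeIndex:
    """Tiny in-memory stand-in for an index; not a real client."""

    def __init__(self):
        self.docs = {}

    def create(self, doc_id, body):
        # Mirrors _create semantics: fail if the doc already exists.
        if doc_id in self.docs:
            raise LockConflict(doc_id)
        self.docs[doc_id] = body

    def index(self, doc_id, body):
        # A plain write never looks at the lock doc.
        self.docs[doc_id] = body

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


def with_global_lock(index, work):
    # Cooperative protocol: take the lock, do the work, always release.
    index.create("lock/global", {})
    try:
        return work()
    finally:
        index.delete("lock/global")
```

The lock is purely advisory: a client that calls index() directly, without going through with_global_lock, is not stopped by anything.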
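4. Pessimistic concurrency control with document locks
(1) A document lock is acquired through a script
As the name suggests, a document lock locks only the docs you are about to create, update, or delete. Once a doc is locked, other threads cannot create, update, or delete it.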
POST /fs/lock/1/_update
{
"upsert": { "process_id": 123 },
"script": "if ( ctx._source.process_id != process_id ) { assert false }; ctx.op = 'noop';",
"params": {
"process_id": 123
}
}
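/fs/lock is fixed: the lock type under the fs index is dedicated to locking
/fs/lock/id, for example 1: the id is the id of the doc you want to lock, and it identifies the lock (itself a doc) for that doc
params: contains process_id, the unique id of the process that wants to create, update, or delete. This matters because the id of the locking process is stored in the lock doc, so other processes that come along can tell the data is already locked by someone else
assert false: throws an exception if the lock is not held by the current process
ctx.op = 'noop': makes no modification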
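The decision made by the upsert plus script can be written out as ordinary Python. This is a sketch of the logic only, assuming a plain dict in place of the lock type; acquire_doc_lock is a hypothetical helper, not an Elasticsearch API.

```python
def acquire_doc_lock(locks, doc_id, process_id):
    """Mimic the upsert + script behaviour for one lock doc.

    locks maps a doc id to its lock doc, e.g. {"1": {"process_id": 123}}.
    """
    current = locks.get(doc_id)
    if current is None:
        # upsert branch: no lock doc yet, so create one owned by this process.
        locks[doc_id] = {"process_id": process_id}
        return "created"
    if current["process_id"] != process_id:
        # Corresponds to "assert false": someone else holds the lock.
        raise RuntimeError("doc %s is locked by another process" % doc_id)
    # Corresponds to ctx.op = 'noop': we already hold the lock, change nothing.
    return "noop"
```

A process that already holds the lock can call it again safely, while a different process gets an exception.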
(2) Full walkthrough of the document lock
Acquire the lock:
POST /fs/lock/1/_update
{
"upsert": {"process_id":321},
"script": {
"lang": "groovy",
"file": "judge-lock",
"params": {"process_id":321}
}
}
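Release the lock:
POST /fs/_refresh
Note that _refresh on its own only makes the lock docs visible to search; the lock docs still have to be deleted for the lock to actually be released.
This lock appears to be of limited use: it only prevents a concurrent _update on the same lock doc, and other threads can still create, update, or delete the data docs unless they also acquire the lock first.
A shared lock uses the same _update syntax; only the data written to the lock doc differs.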
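5. Modeling the nesting of blogs and comments with nested objects
(1) Why nested objects are needed
Modeling with redundant data actually relies on the object type. Here we introduce another kind of object type: the nested object.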
PUT /website/blogs/6
{
"title": "花无缺发表的一篇帖子",
"content": "我是花无缺,大家要不要考虑一下投资房产和买股票的事情啊。。。",
"tags": [ "投资", "理财" ],
"comments": [
{
"name": "小鱼儿",
"comment": "什么股票啊?推荐一下呗",
"age": 28,
"stars": 4,
"date": "2016-09-01"
},
{
"name": "黄药师",
"comment": "我喜欢投资房产,风险大收益也大",
"age": 31,
"stars": 5,
"date": "2016-10-22"
}
]
}
Example: find blogs that have a comment written by 黄药师 whose age is 28
GET /website/blogs/_search
{
"query": {
"bool": {
"must": [
{"match": {
"comments.name": "黄药师"
}},
{
"match": {
"comments.age": 28
}
}
]
}
}
}
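By rights this blog should not be returned (黄药师 is 31; the 28-year-old commenter is 小鱼儿), yet it is.
(2) How the object type is stored under the hood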
{
"title": [ "花无缺", "发表", "一篇", "帖子" ],
"content": [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ],
"tags": [ "投资", "理财" ],
"comments.name": [ "小鱼儿", "黄药师" ],
"comments.comment": [ "什么", "股票", "推荐", "我", "喜欢", "投资", "房产", "风险", "收益", "大" ],
"comments.age": [ 28, 31 ],
"comments.stars": [ 4, 5 ],
"comments.date": [ 2016-09-01, 2016-10-22 ]
}
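Under the hood, the object type flattens the objects of a JSON array into one value list per field, so the link between the name and the age of each individual comment is lost. That is why the search above returns the whole post.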
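The flattening effect is easy to reproduce in Python. The sketch below assumes a reduced comment structure with only name and age; flatten and object_match are hypothetical helpers that imitate the flattened storage, not Elasticsearch code.

```python
comments = [
    {"name": "小鱼儿", "age": 28},
    {"name": "黄药师", "age": 31},
]


def flatten(comment_list):
    # Collapse an array of objects into one value list per dotted field name.
    flat = {}
    for comment in comment_list:
        for key, value in comment.items():
            flat.setdefault("comments." + key, []).append(value)
    return flat


def object_match(comment_list, name, age):
    # Each condition is checked against the whole list, independently.
    flat = flatten(comment_list)
    return name in flat["comments.name"] and age in flat["comments.age"]
```

Here object_match(comments, "黄药师", 28) is true even though no single comment has that name and that age, which is the false positive described above.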
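(3) Introducing the nested object type to fix the problem caused by the object type's underlying data structure
Change the mapping so that comments is of type nested instead of object. The type of an existing field cannot be changed in place, so the index has to be recreated with this mapping and the data indexed again.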
PUT /website
{
"mappings": {
"blogs": {
"properties": {
"comments": {
"type": "nested",
"properties": {
"name": { "type": "text" },
"comment": { "type": "text" },
"age": { "type": "short" },
"stars": { "type": "short" },
"date": { "type": "date" }
}
}
}
}
}
}
Under the hood, each nested object is stored separately:
{
"comments.name": [ "小鱼儿" ],
"comments.comment": [ "什么", "股票", "推荐" ],
"comments.age": [ 28 ],
"comments.stars": [ 4 ],
"comments.date": [ 2016-09-01 ]
}
{
"comments.name": [ "黄药师" ],
"comments.comment": [ "我", "喜欢", "投资", "房产", "风险", "收益", "大" ],
"comments.age": [ 31 ],
"comments.stars": [ 5 ],
"comments.date": [ 2016-10-22 ]
}
{
"title": [ "花无缺", "发表", "一篇", "帖子" ],
"content": [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ],
"tags": [ "投资", "理财" ]
}
GET /website/blogs/_search
{
"query": {
"bool": {
"must": [
{"match": {
"title": "花无缺"
}},
{
"nested": {
"path": "comments",
"query": {
"bool": {
"must": [
{"match": {
"comments.name":"黄药师"
}},
{
"match": {
"comments.age": 28
}
}
]
}
}
}
}
]
}
}
}
With the nested query, the blog is no longer returned.
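The per-object matching of a nested query can be contrasted with the flattened behaviour in a short Python sketch. The data is reduced to name and age, and nested_match is a hypothetical helper that only imitates the semantics.

```python
comments = [
    {"name": "小鱼儿", "age": 28},
    {"name": "黄药师", "age": 31},
]


def nested_match(comment_list, name, age):
    # Both conditions must hold inside the same comment object.
    return any(c["name"] == name and c["age"] == age for c in comment_list)
```

nested_match(comments, "黄药师", 28) is false, while nested_match(comments, "黄药师", 31) is true, because each comment is evaluated on its own.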
(4) Aggregation requirement 1: bucket the comments by date, then get the average star rating of the comments in each month
GET /website/blogs/_search
{
"size": 0,
"aggs": {
"comments_path": {
"nested": {
"path": "comments"
},
"aggs": {
"group_by_date": {
"date_histogram": {
"field": "comments.date",
"interval": "month",
"format": "yyyy-MM-dd"
},
"aggs": {
"avg_stars": {
"avg": {
"field": "comments.stars"
}
}
}
}
}
}
}
}
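What the nested date_histogram plus avg combination computes can be approximated in Python. This sketch assumes dates are ISO strings and uses the first day of the month as the bucket key; avg_stars_by_month is a hypothetical helper and the output shape is not the real response format.

```python
from collections import defaultdict

comments = [
    {"stars": 4, "date": "2016-09-01"},
    {"stars": 5, "date": "2016-10-22"},
]


def avg_stars_by_month(comment_list):
    # Bucket by calendar month, then average the stars within each bucket.
    buckets = defaultdict(list)
    for comment in comment_list:
        month_key = comment["date"][:7] + "-01"
        buckets[month_key].append(comment["stars"])
    return {key: sum(values) / len(values)
            for key, values in sorted(buckets.items())}
```

The aggregation runs over the individual comments, not over the blogs, which is why the nested aggregation has to wrap the histogram.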
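(5) reverse_nested: after aggregating on nested fields, step back out and aggregate on fields of the outer doc. The example buckets comments by commenter age in steps of 10, then counts the tags of the blogs in each bucket.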
GET /website/blogs/_search
{
"size": 0,
"aggs": {
"comments_path": {
"nested": {
"path": "comments"
},
"aggs": {
"group_age": {
"histogram": {
"field": "comments.age",
"interval": 10
},
"aggs": {
"reverse_path": {
"reverse_nested": {},
"aggs": {
"group_tags": {
"terms": {
"field": "tags.keyword"
}
}
}
}
}
}
}
}
}
}
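The same idea, going from nested comment buckets back to fields of the enclosing blog, can be sketched in Python. The data is a reduced copy of the example blog, and tags_by_commenter_age is a hypothetical helper; it counts each blog once per age bucket.

```python
blogs = [
    {
        "tags": ["投资", "理财"],
        "comments": [{"age": 28}, {"age": 31}],
    }
]


def tags_by_commenter_age(blog_list, interval=10):
    buckets = {}
    for blog in blog_list:
        # Inner step: which age buckets do this blog's comments fall into?
        age_buckets = {(c["age"] // interval) * interval
                       for c in blog["comments"]}
        # Outer step: jump back to the blog and count its tags per bucket.
        for bucket in age_buckets:
            counts = buckets.setdefault(bucket, {})
            for tag in blog["tags"]:
                counts[tag] = counts.get(tag, 0) + 1
    return buckets
```

For the single example blog, both the 20 and the 30 bucket end up with one count for each of its two tags.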
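6. Parent-child relationship data modeling
Modeling with object and nested object has a drawback: it works like redundant data, putting several entities together in one doc, so the maintenance cost is fairly high.
Parent-child modeling resembles third normal form in a relational database: the entities are split apart and linked to each other through a parent-child relationship. The data does not all have to live together, and updating a parent doc or a child doc does not affect the other.
(1) Case background: R&D center employee management. An IT company has several R&D centers, and each R&D center has several employees.
Create the mapping: the core of parent-child modeling is that types have a parent-child relationship, declared with _parent pointing at the parent type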
PUT /company
{
"mappings": {
"rd_center": {},
"employee": {
"_parent": {
"type": "rd_center"
}
}
}
}
POST /company/rd_center/_bulk
{ "index": { "_id": "1" }}
{ "name": "北京研发总部", "city": "北京", "country": "中国" }
{ "index": { "_id": "2" }}
{ "name": "上海研发中心", "city": "上海", "country": "中国" }
{ "index": { "_id": "3" }}
{ "name": "硅谷人工智能实验室", "city": "硅谷", "country": "美国" }
For shard routing, the rd_center doc with id=1 is by default routed by its own id to some shard.
PUT /company/employee/1?parent=1
{
"name": "张三",
"birthday": "1970-10-24",
"hobby": "爬山"
}
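The key to maintaining the parent-child relationship is parent=1, which specifies the id of this doc's parent doc. The child doc is then routed by the parent's id, so it lands on the same shard as its parent.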
POST /company/employee/_bulk
{ "index": { "_id": 2, "parent": "1" }}
{ "name": "李四", "birthday": "1982-05-16", "hobby": "游泳" }
{ "index": { "_id": 3, "parent": "2" }}
{ "name": "王二", "birthday": "1979-04-01", "hobby": "爬山" }
{ "index": { "_id": 4, "parent": "3" }}
{ "name": "赵五", "birthday": "1987-05-11", "hobby": "骑马" }
(2) Verification
Search for R&D centers that have employees born in 1980 or later
GET /company/rd_center/_search
{
"query": {
"has_child": {
"type": "employee",
"query": {
"range": {
"birthday": {
"gte": "1980-01-01"
}
}
}
}
}
}
Search for R&D centers that have an employee named 张三
GET /company/rd_center/_search
{
"query": {
"has_child": {
"type": "employee",
"query": {
"match": {
"name":"张三"
}
}
}
}
}
Search for R&D centers that have at least 2 employees
GET /company/rd_center/_search
{
"query": {
"has_child": {
"type": "employee",
"min_children": 2,
"query": {
"match_all": {}
}
}
}
}
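The effect of has_child with min_children can be reproduced over the sample employees in Python. This is only a sketch of the counting logic; centers_with_min_children is a hypothetical helper and the employee list keeps just the name and parent id.

```python
employees = [
    {"name": "张三", "parent": "1"},
    {"name": "李四", "parent": "1"},
    {"name": "王二", "parent": "2"},
    {"name": "赵五", "parent": "3"},
]


def centers_with_min_children(employee_list, min_children,
                              predicate=lambda e: True):
    # Count matching children per parent id, then keep parents with enough.
    counts = {}
    for employee in employee_list:
        if predicate(employee):
            parent = employee["parent"]
            counts[parent] = counts.get(parent, 0) + 1
    return sorted(p for p, n in counts.items() if n >= min_children)
```

With the sample data, only R&D center 1 has two employees, so it is the only one returned for min_children=2.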
Search for employees of R&D centers located in 中国 (China)
GET /company/employee/_search
{
"query": {
"has_parent": {
"parent_type": "rd_center",
"query": {
"term": {
"country.keyword": {
"value": "中国"
}
}
}
}
}
}
Count how many employees each country has and which hobbies they have
GET /company/rd_center/_search
{
"size": 0,
"aggs": {
"group_country": {
"terms": {
"field": "country.keyword"
},
"aggs": {
"group_employee": {
"children": {
"type": "employee"
},
"aggs": {
"group_hobby": {
"terms": {
"field": "hobby.keyword"
}
}
}
}
}
}
}
}
7. Data modeling and search for a three-level grandparent-parent-child relationship
PUT /company
{
"mappings": {
"country": {},
"rd_center": {
"_parent": {
"type": "country"
}
},
"employee": {
"_parent": {
"type": "rd_center"
}
}
}
}
country -> rd_center -> employee: a three-level data model of grandparent, parent, and child
POST /company/country/_bulk
{ "index": { "_id": "1" }}
{ "name": "中国" }
{ "index": { "_id": "2" }}
{ "name": "美国" }
POST /company/rd_center/_bulk
{ "index": { "_id": "1", "parent": "1" }}
{ "name": "北京研发总部" }
{ "index": { "_id": "2", "parent": "1" }}
{ "name": "上海研发中心" }
{ "index": { "_id": "3", "parent": "2" }}
{ "name": "硅谷人工智能实验室" }
PUT /company/employee/1?parent=1&routing=1
{
"name": "张三",
"dob": "1970-10-24",
"hobby": "爬山"
}
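About the routing parameter: it must be the same as the grandparent's, otherwise things go wrong.
A country doc is routed by its own id. An rd_center doc, given a parent, is routed by the country's id. An employee doc, if it only specifies a parent, is routed by the rd_center's id, which means the three levels may not end up on the same shard. For the grandchild level you therefore have to specify routing manually, set to the id of the grandparent doc.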
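The routing rule above can be illustrated with a small Python model. It is a sketch under stated assumptions: zlib.crc32 is used as a stand-in hash (Elasticsearch itself uses a murmur3 hash of the routing value), NUM_SHARDS is an arbitrary value, and routing_key and shard_for are hypothetical helpers.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards


def routing_key(doc_id, parent=None, routing=None):
    # Explicit routing wins, then the parent id, then the doc's own id.
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id


def shard_for(key, num_shards=NUM_SHARDS):
    # Stand-in for hash(routing) % number_of_primary_shards.
    return zlib.crc32(key.encode("utf-8")) % num_shards


# country 2 -> rd_center 3 (parent=2) -> employee 4 (parent=3)
country_key = routing_key("2")
center_key = routing_key("3", parent="2")
employee_key_default = routing_key("4", parent="3")
employee_key_fixed = routing_key("4", parent="3", routing="2")
```

The rd_center inherits the country's routing value through parent, but the employee's default routing value is the rd_center's id. Only with routing set to the country's id is the employee guaranteed to land on the same shard as its grandparent.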
Search for countries that have employees whose hobby is 爬山 (hiking)
GET /company/country/_search
{
"query": {
"has_child": {
"type": "rd_center",
"query": {
"has_child": {
"type": "employee",
"query": {
"match": {
"hobby": "爬山"
}
}
}
}
}
}
}