Deduplicating an Elasticsearch field and saving the result to a file

Background

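The task: query the ip field from an ES cluster, deduplicate it, and save the clean IPs to a file.
Deduplicating on a single field is essentially a wordcount problem: group the documents by the field's value and count each group. The group keys are the deduplicated values.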
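To make the wordcount analogy concrete, here is a minimal Python sketch that does not involve Elasticsearch at all. It counts how often each value occurs and treats the distinct keys as the deduplicated result. The sample list and the function name `dedupe_with_counts` are illustrative assumptions, not part of the original task.

```python
from collections import Counter


def dedupe_with_counts(values):
    """Return (sorted distinct values, Counter of occurrences per value)."""
    counts = Counter(values)
    return sorted(counts), counts


# Hypothetical sample: three records, two distinct IPs
sample = ["192.168.100.1", "192.168.100.2", "192.168.100.1"]
unique_ips, counts = dedupe_with_counts(sample)
print(unique_ips)
print(counts["192.168.100.1"])
```

The terms aggregation used later plays the same role on the server side: each bucket key is a distinct value and `doc_count` is its count.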

1. First, generate sample data with Python

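# -*- coding: utf-8 -*-
# Build the list of IP addresses: 192.168.100.1 through 192.168.100.49
ip = []
for i in range(1, 50):
    # str(i) works on both Python 2 and 3; bytes(i) only works on Python 2
    ip.append("192.168.100." + str(i))

# Write the sample data to data.json in bulk-API format:
# one action line followed by one document line, 100 documents per IP
with open("data.json", "w") as f:
    i = 1
    for ipp in ip:
        for j in range(i, i + 100):
            line = '{"index":{"_index":"data","_type":"log","_id":'+str(j)+'}}\n{"color":"green","state":"open","address":"'+ipp+'","time":"2018-06-11"}\n'
            f.write(line)
        i = i + 100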

Part of the sample data:

{"index":{"_index":"data","_type":"log","_id":1}}
{"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
{"index":{"_index":"data","_type":"log","_id":2}}
{"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
{"index":{"_index":"data","_type":"log","_id":3}}
{"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
{"index":{"_index":"data","_type":"log","_id":4}}
{"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
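As a hedged alternative to concatenating strings by hand, the same two-line bulk entries can be produced with `json.dumps`, which takes care of quoting and escaping. The helper names `bulk_lines` and `address_for_id` are hypothetical. They only mirror the layout of the sample data above, where every address owns 100 consecutive ids.

```python
import json


def bulk_lines(doc_id, address):
    """Build the action line and the document line for one bulk entry."""
    action = {"index": {"_index": "data", "_type": "log", "_id": doc_id}}
    source = {"color": "green", "state": "open",
              "address": address, "time": "2018-06-11"}
    # Compact separators reproduce the spacing-free layout of the sample data
    return (json.dumps(action, separators=(",", ":")) + "\n"
            + json.dumps(source, separators=(",", ":")) + "\n")


def address_for_id(doc_id):
    """Map a 1-based document _id to its IP: 100 consecutive ids per address."""
    return "192.168.100." + str((doc_id - 1) // 100 + 1)


print(bulk_lines(101, address_for_id(101)), end="")
```

Under this layout, ids 1 to 100 belong to 192.168.100.1, ids 101 to 200 to 192.168.100.2, and so on.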

2. Bulk-import the sample data into ES

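# Import the data through the bulk API
# (Elasticsearch 6.0 and later also require -H 'Content-Type: application/x-ndjson')
curl -X POST localhost:9200/_bulk --data-binary @data.json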

The sample data is now in ES:

curl -X GET localhost:9200/data/log/101
###
{"_index":"data","_type":"log","_id":"101","_version":1,"found":true,"_source":{"color":"green","state":"open","address":"192.168.100.2","time":"2018-06-11"}}
###

3. Deduplicate in ES and save the result to a file

There are two ways to process the result:

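  • Use the jq tool to save the result as a CSV file
# wordcount and save results as csv
# "field" selects the address field. In the terms aggregation, "size": 0 returns
# all buckets (Elasticsearch 5.0 removed this; set an explicit size there).
# The comments sit outside the request body because JSON does not allow them.
curl -X GET 'http://localhost:9200/data/log/_search' -d '
{
    "size": 0,
    "aggs": {
        "group_by_state": {
            "terms": {
                "field": "address",
                "size": 0
            }
        }
    }
}' | jq -r '.aggregations|.group_by_state|.buckets[]|[.key, .doc_count]|@csv' >> result.csv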

Sample result

"192.168.100.1",100
"192.168.100.10",100
"192.168.100.11",100
etc ...
  • Use a grep regular expression to parse the result
# wordcount and save results as txt
curl -X GET 'http://localhost:9200/data/log/_search' -d '
{
    "size": 0,
    "aggs": {
        "group_by_state": {
            "terms": {
                "field": "address",
                "size": 0
            }
        }
    }
}' | grep -Po 'key[" :]+\K[^"]+' >> result

Sample result

192.168.100.1
192.168.100.10
192.168.100.11
192.168.100.12
192.168.100.13
192.168.100.14
192.168.100.15
192.168.100.16
192.168.100.17
192.168.100.18
192.168.100.19
etc ...
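Both post-processing approaches can also be sketched in Python, which avoids depending on jq or on grep's `-P` option. The sketch below assumes the search response has already been fetched as text and that the aggregation is named `group_by_state` as in the requests above. The function names and the trimmed `sample_response` are illustrative assumptions, not output captured from a real cluster.

```python
import csv
import io
import json


def buckets_from_response(response_text, agg_name="group_by_state"):
    """Extract (key, doc_count) pairs from a terms-aggregation response."""
    body = json.loads(response_text)
    buckets = body["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


def to_csv(pairs):
    """Render pairs as CSV text: strings quoted, numbers bare, like jq's @csv."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(pairs)
    return out.getvalue()


def to_plain(pairs):
    """Render only the keys, one per line, like the grep variant."""
    return "".join(key + "\n" for key, _ in pairs)


# Hypothetical, trimmed response used only for illustration
sample_response = json.dumps({
    "aggregations": {"group_by_state": {"buckets": [
        {"key": "192.168.100.1", "doc_count": 100},
        {"key": "192.168.100.10", "doc_count": 100},
    ]}}
})

pairs = buckets_from_response(sample_response)
print(to_csv(pairs), end="")
print(to_plain(pairs), end="")
```

To save the results, write the returned strings to `result.csv` and `result` with `open(..., "w")`. Parsing the JSON rather than matching it with a regular expression also keeps working if a key ever contains a quote character.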