Character filters preprocess the stream of characters before it is passed to the tokenizer.
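A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, a character filter could be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.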
Elasticsearch has a number of built-in character filters that can be used to build custom analyzers.
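To make the role of character filters concrete, here is a minimal Python sketch of where they sit in analysis: each filter rewrites the raw text, in the configured order, before the tokenizer sees it. This is an illustration only, not Elasticsearch code. The names `analyze`, `to_latin_digits`, and `keyword_tokenizer` are hypothetical, and token filters are left out.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for this sketch only.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]

# Maps the ten Arabic-Indic digits (U+0660..U+0669) to ASCII digits.
DIGITS = {0x0660 + d: str(d) for d in range(10)}


def to_latin_digits(text: str) -> str:
    """Toy character filter: changes characters in the stream."""
    return text.translate(DIGITS)


def keyword_tokenizer(text: str) -> List[str]:
    """Stand-in for a tokenizer that emits the whole text as one token."""
    return [text]


def analyze(text: str, char_filters: Sequence[CharFilter],
            tokenizer: Tokenizer) -> List[str]:
    """Apply every character filter in order, then tokenize the result."""
    for char_filter in char_filters:
        text = char_filter(text)
    return tokenizer(text)
```

Calling `analyze(text, [to_latin_digits], keyword_tokenizer)` returns a single token whose digits have been converted, which mirrors how the filters described below act before tokenization.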
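- HTML strip character filter
The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp;.
The following analyze API request uses the html_strip filter with the keyword tokenizer:

    GET /_analyze
    {
      "tokenizer": "keyword",
      "char_filter": [ "html_strip" ],
      "text": "<p>I'm so <b>happy</b>!</p>"
    }

The filter produces the following text:

    [ \nI'm so happy!\n ]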
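The following create index request configures a custom analyzer whose html_strip filter, my_custom_html_strip_char_filter, uses the escaped_tags parameter to leave <b> elements in place:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [ "my_custom_html_strip_char_filter" ]
            }
          },
          "char_filter": {
            "my_custom_html_strip_char_filter": {
              "type": "html_strip",
              "escaped_tags": [ "b" ]
            }
          }
        }
      }
    }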
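The effect of stripping tags while skipping an escaped list can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not how Elasticsearch implements html_strip: it uses a simple regular expression rather than a real HTML parser, it does not insert the newlines that the real filter emits for block elements such as <p>, and the function name `strip_html` is hypothetical.

```python
import html
import re

# Matches an opening or closing tag and captures its element name.
# A simplification: comments, CDATA, and malformed markup are not handled.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_html(text: str, escaped_tags=()) -> str:
    """Remove HTML tags except those named in escaped_tags, then decode entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match: re.Match) -> str:
        # Leave the tag untouched if its element name is escaped.
        return match.group(0) if match.group(1).lower() in keep else ""

    return html.unescape(TAG_RE.sub(replace_tag, text))
```

With no escaped tags, `strip_html("<p>I'm so <b>happy</b>!</p>")` drops both the <p> and <b> elements; passing `escaped_tags=["b"]` keeps the <b> markup, which is the behavior the custom filter above configures.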
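- Mapping character filter
The mapping character filter replaces any occurrences of the specified strings with the specified replacements.
The following analyze API request uses the mapping filter to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789):

    GET /_analyze
    {
      "tokenizer": "keyword",
      "char_filter": [
        {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      ],
      "text": "My license plate is ٢٥٠١٥"
    }

The filter produces the following text:

    [ My license plate is 25015 ]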
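The following create index request configures a custom analyzer with a custom mapping filter, my_mappings_char_filter, that replaces the emoticons :) and :( with text equivalents:

    PUT /my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [ "my_mappings_char_filter" ]
            }
          },
          "char_filter": {
            "my_mappings_char_filter": {
              "type": "mapping",
              "mappings": [
                ":) => _happy_",
                ":( => _sad_"
              ]
            }
          }
        }
      }
    }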
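The following analyze API request uses my_mappings_char_filter to replace :( with _sad_:

    GET /my-index-000001/_analyze
    {
      "tokenizer": "keyword",
      "char_filter": [ "my_mappings_char_filter" ],
      "text": "I'm delighted about it :("
    }

The filter produces the following text:

    [ I'm delighted about it _sad_ ]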
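The mapping examples above can be mimicked with a small Python sketch that parses "key => value" rules and applies them in a single pass, preferring the longest key that matches at each position. This is an approximation for illustration, not the Elasticsearch implementation. The helper name `build_mapping_filter` is hypothetical, and the rule parsing is simplified: it assumes exactly one " => " separator with single spaces around it.

```python
import re
from typing import Callable, Iterable


def build_mapping_filter(mappings: Iterable[str]) -> Callable[[str], str]:
    """Build a function that replaces each mapped key with its value."""
    table = {}
    for rule in mappings:
        # Simplified parsing: split on the first " => " only.
        key, _, value = rule.partition(" => ")
        table[key] = value

    if not table:
        # With no rules, the filter leaves the text unchanged.
        return lambda text: text

    # Try longer keys first so the longest match at a position wins.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)


# The emoticon rules from the custom filter above.
emoticon_filter = build_mapping_filter([":) => _happy_", ":( => _sad_"])
```

Because all keys are matched in one pass, the output of one rule is never fed back into another rule, which keeps the replacements independent of each other.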
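- Pattern replace character filter
The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement.
The following create index request configures a pattern_replace filter, my_char_filter, that replaces a hyphen between digits with an underscore, and the analyze API request then applies it through my_analyzer:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [ "my_char_filter" ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
          }
        }
      }
    }

    POST my-index-000001/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My credit card is 123-456-789"
    }

The analyzer produces the following terms:

    [ My, credit, card, is, 123_456_789 ]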
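The regular expression in the example above can be tried outside Elasticsearch. The sketch below uses Python's `re` module as a stand-in for the Java regular expressions that the filter accepts, so the replacement is written as `\g<1>_` instead of `$1_`. The function name `pattern_replace` is hypothetical, and tokenization is not reproduced here.

```python
import re

# One or more digits, then a hyphen, but only when another digit follows.
# The lookahead (?=\d) checks for the digit without consuming it, so the
# next group of digits can start a fresh match.
PATTERN = re.compile(r"(\d+)-(?=\d)")

# Python spelling of the "$1_" replacement: captured digits plus an underscore.
REPLACEMENT = r"\g<1>_"


def pattern_replace(text: str) -> str:
    """Replace each hyphen that sits between digits with an underscore."""
    return PATTERN.sub(REPLACEMENT, text)
```

Applied to the sample text, `pattern_replace("My credit card is 123-456-789")` rewrites the number as 123_456_789, while a trailing hyphen with no digit after it, as in "123-", is left alone because the lookahead fails.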