Elasticsearch 8.1 -- 24. Text Analysis: Character Filters

Character filters preprocess the stream of characters before it is passed to the tokenizer.

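A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters. For example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or strip HTML elements such as <b> from the stream.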
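
The ordering described above (character filters first, tokenizer second) can be sketched locally in Python. This is only an illustration of the idea, not how Elasticsearch is implemented; the names analyze, digits_to_latin, and whitespace_tokenizer are hypothetical helpers invented for this sketch.

```python
from typing import Callable, List, Sequence

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]

# Arabic-Indic digits occupy U+0660..U+0669; map each one to its ASCII digit.
ARABIC_INDIC_TO_LATIN = {0x0660 + i: str(i) for i in range(10)}


def digits_to_latin(text: str) -> str:
    """Hypothetical character filter: rewrite Arabic-Indic digits as 0-9."""
    return text.translate(ARABIC_INDIC_TO_LATIN)


def whitespace_tokenizer(text: str) -> List[str]:
    """Crude stand-in for a tokenizer: split on runs of whitespace."""
    return text.split()


def analyze(text: str, char_filters: Sequence[CharFilter], tokenizer: Tokenizer) -> List[str]:
    """Apply every character filter in order, then hand the result to the tokenizer."""
    for char_filter in char_filters:
        text = char_filter(text)
    return tokenizer(text)


if __name__ == "__main__":
    # Build the sample text from code points to avoid bidirectional-text issues.
    plate = "plate " + "".join(chr(0x0660 + d) for d in (2, 5, 0, 1, 5))
    print(analyze(plate, [digits_to_latin], whitespace_tokenizer))
```

The tokenizer only ever sees the already-filtered text, which is why a character filter can change what tokens come out.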

Elasticsearch has a number of built-in character filters that can be used to build custom analyzers:

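  • HTML strip character filter
    The html_strip character filter strips HTML elements such as <b> and decodes HTML entities such as &amp;. The following analyze API request applies it to a short HTML fragment: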
    GET /_analyze
    {
      "tokenizer": "keyword",
      "char_filter": [
        "html_strip"
      ],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }
    
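    The filter produces the following text:

    [ \nI'm so happy!\n ]

    To keep selected tags, define a custom html_strip filter with the escaped_tags parameter, an array of HTML element names written without angle brackets. The request below creates an analyzer whose filter leaves <b> elements in place: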
    
    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [
                "my_custom_html_strip_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_custom_html_strip_char_filter": {
              "type": "html_strip",
              "escaped_tags": [
                "b"
              ]
            }
          }
        }
      }
    }
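
A rough local approximation of this behaviour in Python, for intuition only. It is not the Lucene implementation: the function name html_strip_sketch, the regular expression, and the small BLOCK_TAGS set are assumptions made for this sketch, and real HTML parsing handles many cases this ignores.

```python
import html
import re
from typing import Iterable

# Assumption: a small, incomplete set of tags that are replaced by a newline.
BLOCK_TAGS = {"p", "div", "br"}

# Matches an opening or closing tag and captures the tag name.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def html_strip_sketch(text: str, escaped_tags: Iterable[str] = ()) -> str:
    """Strip tags (except escaped ones), then decode HTML entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match: "re.Match[str]") -> str:
        name = match.group(1).lower()
        if name in keep:
            return match.group(0)  # escaped tag: leave it untouched
        return "\n" if name in BLOCK_TAGS else ""

    return html.unescape(TAG.sub(replace_tag, text))


if __name__ == "__main__":
    sample = "<p>I&apos;m so <b>happy</b>!</p>"
    print(repr(html_strip_sketch(sample)))
    print(repr(html_strip_sketch(sample, escaped_tags=["b"])))
```

With no escaped tags the sketch returns "\nI'm so happy!\n"; with escaped_tags=["b"] the <b> element survives, which is what the custom filter above is configured to do.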
    
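  • Mapping character filter
    The mapping character filter accepts a map of keys and values. Whenever it encounters a string of characters that matches a key, it replaces that string with the value associated with the key. Matching is greedy: the longest key that matches at a given position wins. The following request converts Hindu-Arabic numerals to their Arabic-Latin equivalents: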

      GET /_analyze
      {
        "tokenizer": "keyword",
        "char_filter": [
          {
            "type": "mapping",
            "mappings": [
              "٠ => 0",
              "١ => 1",
              "٢ => 2",
              "٣ => 3",
              "٤ => 4",
              "٥ => 5",
              "٦ => 6",
              "٧ => 7",
              "٨ => 8",
              "٩ => 9"
            ]
          }
        ],
        "text": "My license plate is ٢٥٠١٥"
      }
    
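      [ My license plate is 25015 ]

    Mapping filters can also be defined in the index settings. The request below creates an analyzer whose filter replaces the emoticons :) and :( with text equivalents. If my-index-000001 already exists from an earlier example, delete it first, because creating an index that already exists fails.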
    
      PUT /my-index-000001
      {
        "settings": {
          "analysis": {
            "analyzer": {
              "my_analyzer": {
                "tokenizer": "standard",
                "char_filter": [
                  "my_mappings_char_filter"
                ]
              }
            },
            "char_filter": {
              "my_mappings_char_filter": {
                "type": "mapping",
                "mappings": [
                  ":) => _happy_",
                  ":( => _sad_"
                ]
              }
            }
          }
        }
      }
    
    GET /my-index-000001/_analyze
    {
      "tokenizer": "keyword",
      "char_filter": [ "my_mappings_char_filter" ],
      "text": "I'm delighted about it :("
    }
    
    [ I'm delighted about it _sad_ ]
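
The replacement logic can be approximated in a few lines of Python. This is a simplified sketch, not the Elasticsearch implementation: parse_rules and mapping_sketch are hypothetical names, and the rule parsing here (split on "=>" and trim whitespace) is an assumption that ignores escape sequences.

```python
from typing import Dict, List


def parse_rules(rules: List[str]) -> Dict[str, str]:
    """Split each "key => value" rule into a dictionary entry."""
    mapping: Dict[str, str] = {}
    for rule in rules:
        key, value = rule.split("=>", 1)
        mapping[key.strip()] = value.strip()
    return mapping


def mapping_sketch(text: str, rules: List[str]) -> str:
    """Scan left to right, replacing the longest key that matches at each position."""
    mapping = parse_rules(rules)
    # Try longer keys first so the longest match at a position wins.
    keys = sorted(mapping, key=len, reverse=True)
    out: List[str] = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)  # replaced text is not scanned again
                break
        else:
            out.append(text[i])  # no key matches here: copy the character
            i += 1
    return "".join(out)


if __name__ == "__main__":
    rules = [":) => _happy_", ":( => _sad_"]
    print(mapping_sketch("I'm delighted about it :(", rules))
```

Sorting the keys by length is one simple way to get longest-match behaviour; with the rules "a => 1" and "ab => 2", the input "abc" becomes "2c" rather than "1bc".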
    
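  • Pattern replace character filter
    The pattern_replace character filter uses a regular expression to match characters and replaces them with the specified replacement string. The pattern is a Java regular expression, and the replacement can refer to capture groups as $1 to $9. The example below replaces each hyphen that sits between digits with an underscore, so the standard tokenizer no longer splits the number. If my-index-000001 already exists from an earlier example, delete it before running the PUT request.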

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
          }
        }
      }
    }
    
    POST my-index-000001/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My credit card is 123-456-789"
    }
    
    [ My, credit, card, is, 123_456_789 ]
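
The same pattern can be tried locally with Python's re module to see what the lookahead does. This is a sketch under stated assumptions: pattern_replace_sketch is a hypothetical helper, Python's regex dialect differs from Java's in places, and splitting on whitespace is only a crude stand-in for the standard tokenizer.

```python
import re
from typing import List

# Digits, then a hyphen that is followed by another digit.
# The lookahead (?=\d) checks the next digit without consuming it,
# so that digit can start the next match.
PATTERN = re.compile(r"(\d+)-(?=\d)")


def pattern_replace_sketch(text: str) -> str:
    """Replace hyphens between digits with underscores."""
    # Java's "$1" group reference is written "\g<1>" in Python.
    return PATTERN.sub(r"\g<1>_", text)


def whitespace_tokens(text: str) -> List[str]:
    """Crude stand-in for the standard tokenizer: split on whitespace."""
    return text.split()


if __name__ == "__main__":
    filtered = pattern_replace_sketch("My credit card is 123-456-789")
    print(filtered)
    print(whitespace_tokens(filtered))
```

Because the lookahead does not consume the digit after the hyphen, both hyphens in 123-456-789 are replaced, while a trailing hyphen such as in "12-" is left alone.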
    