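An analyzer breaks an input character stream into tokens. This usually happens on two occasions:

At indexing time, when the index is being built
At search time, when the text being searched for is analyzed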

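What is analysis?

Analysis is the process Elasticsearch performs on the body of a document before the document is added to the inverted index. Before adding a document to the index, Elasticsearch runs several steps for each analyzed field: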

Character filtering: transforms characters using character filters

Breaking text into tokens: splits the text into a set of one or more tokens

Token filtering: transforms each token using token filters

Token indexing: stores the tokens in the index
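The four steps above can be sketched in plain Python. This is only a toy model, not how Elasticsearch is implemented: the function names, the "&" mapping, and the regex-based tokenizer are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from typing import DefaultDict, List, Set


def char_filter(text: str) -> str:
    # Character filtering: rewrite characters before tokenizing.
    # The "&" -> " and " mapping is an illustrative assumption.
    return text.replace("&", " and ")


def tokenize(text: str) -> List[str]:
    # Breaking text into tokens: split on runs of non-word characters.
    return [t for t in re.split(r"\W+", text) if t]


def token_filter(tokens: List[str]) -> List[str]:
    # Token filtering: lowercase every token.
    return [t.lower() for t in tokens]


def index_document(index: DefaultDict[str, Set[int]], doc_id: int, text: str) -> None:
    # Token indexing: record which documents contain each token.
    for token in token_filter(tokenize(char_filter(text))):
        index[token].add(doc_id)


inverted_index: DefaultDict[str, Set[int]] = defaultdict(set)
index_document(inverted_index, 1, "Quick & Brown Foxes")
index_document(inverted_index, 2, "quick dogs")
```

After these two calls, the entry for "quick" points at both documents, which is what lets a lowercase search term find text that was originally capitalized.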

The standard analyzer is Elasticsearch's default analyzer:

No char filter
Uses the standard tokenizer
Lowercases the text and can optionally remove stop words. By default the stop words setting is _none_, meaning no stop words are filtered out.
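A rough Python approximation of that behavior is shown here. The real standard tokenizer follows Unicode text segmentation rules, so the regex is a simplifying assumption, and `standard_like_analyze` is a hypothetical helper name rather than anything from Elasticsearch.

```python
import re
from typing import FrozenSet, List


def standard_like_analyze(text: str, stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    # Simplified stand-in for the standard tokenizer: runs of word characters.
    tokens = re.findall(r"\w+", text)
    # Lowercase token filter.
    tokens = [t.lower() for t in tokens]
    # Stop token filter; the empty default mirrors the _none_ setting.
    return [t for t in tokens if t not in stopwords]


print(standard_like_analyze("The QUICK brown fox"))
# ['the', 'quick', 'brown', 'fox']
print(standard_like_analyze("The QUICK brown fox", frozenset({"the"})))
# ['quick', 'brown', 'fox']
```

With the default empty stop word set, "the" survives as a token; it is dropped only when a stop word list is supplied.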

In general, an analyzer is made up of the following parts:

Zero or more character filters
Exactly one tokenizer
Zero or more token filters
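One way to picture this composition is a small data structure in which the tokenizer is required and both filter lists are optional. The `Analyzer` class below is a hypothetical sketch for illustration, not an Elasticsearch API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


@dataclass
class Analyzer:
    tokenizer: Tokenizer  # exactly one, required
    char_filters: List[CharFilter] = field(default_factory=list)  # zero or more
    token_filters: List[TokenFilter] = field(default_factory=list)  # zero or more

    def analyze(self, text: str) -> List[str]:
        # Character filters run first, in order, on the raw text.
        for char_filter in self.char_filters:
            text = char_filter(text)
        # The single tokenizer turns the text into tokens.
        tokens = self.tokenizer(text)
        # Token filters run last, in order, on the token list.
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


# A tokenizer alone is already a valid analyzer.
whitespace_only = Analyzer(tokenizer=str.split)

# A fuller analyzer with one char filter and one token filter.
custom = Analyzer(
    tokenizer=str.split,
    char_filters=[lambda s: s.replace("-", " ")],
    token_filters=[lambda ts: [t.lower() for t in ts]],
)
```

Here `whitespace_only.analyze("Hello World")` leaves case untouched, while `custom.analyze("Wi-Fi Router")` splits on the hyphen and lowercases the result.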

Analyze API

GET /_analyze
POST /_analyze
GET /<index>/_analyze
POST /<index>/_analyze
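As a small sketch of how a client might assemble such a request, the helper below builds the path and JSON body without sending anything. The function name `build_analyze_request` is an assumption for illustration; the `analyzer` and `text` body fields are the ones used in the console examples in this post.

```python
import json
from typing import Optional, Tuple


def build_analyze_request(text: str, analyzer: str, index: Optional[str] = None) -> Tuple[str, str]:
    # Path is /_analyze, or /<index>/_analyze when an index is given.
    path = f"/{index}/_analyze" if index else "/_analyze"
    # ensure_ascii=False keeps non-ASCII text readable in the body.
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return path, body


path, body = build_analyze_request("Hello World", "standard")
print(path)
print(body)
```

Passing an index name selects the index-scoped form, which is needed when the analyzer is defined only in that index's settings.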

Installing the IK Chinese analyzer

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.2/elasticsearch-analysis-ik-7.6.2.zip

List the installed plugins
elasticsearch-plugin list

After installation, restart Elasticsearch so that the plugin can be loaded.

Installing the ICU analysis plugin
elasticsearch-plugin install analysis-icu

# Create the index
PUT chinese

# Specify the analyzer
PUT /chinese/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
    }
  }
}

# Test tokenization
GET /chinese/_analyze
{
  "text": "我爱北京天安门",
  "analyzer": "ik_max_word"
}

    PUT /chinese/_doc/1
    {
      "content":"我爱北京天安门"
    }
     
    PUT /chinese/_doc/2
    {
      "content": "北京，你好"
    }
    
    GET /chinese/_search
    {
      "query": {
        "match": {
          "content": "北京"
        }
      }
    }
    
# Test search
GET /chinese/_search
{
  "query": {
    "match": {
      "content": "天安门"
    }
  }
}

    GET /chinese/_search
    {
      "query": {
        "match": {
          "content": "北京天安门"
        }
      }
    }
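To see why these three queries can return different sets of documents, here is a toy model of a match query over an inverted index. The token lists are hypothetical; the actual output of ik_max_word and ik_smart may differ, so check it with the _analyze request. The sketch assumes the match query's default OR behavior between query terms and ignores relevance scoring.

```python
from typing import Dict, List, Set

# Hypothetical index-time tokens for the two documents (an assumption,
# not the verified output of the IK analyzer).
DOC_TOKENS: Dict[int, List[str]] = {
    1: ["我", "爱", "北京", "天安门"],
    2: ["北京", "你好"],
}

# Build the inverted index: token -> ids of the documents containing it.
INVERTED: Dict[str, Set[int]] = {}
for doc_id, tokens in DOC_TOKENS.items():
    for token in tokens:
        INVERTED.setdefault(token, set()).add(doc_id)


def match(query_tokens: List[str]) -> Set[int]:
    # OR semantics: a document matches if it contains any query token.
    hits: Set[int] = set()
    for token in query_tokens:
        hits |= INVERTED.get(token, set())
    return hits


print(match(["北京"]))
print(match(["天安门"]))
print(match(["北京", "天安门"]))
```

Under these assumed tokens, "北京" matches both documents, "天安门" matches only document 1, and "北京天安门" matches both because either term is enough; in Elasticsearch, document 1 would typically rank higher since it contains both terms.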

