
elasticsearch - Completion suggester on an existing field in ElasticSearch

Reposted. Author: 行者123. Updated: 2023-12-04 07:41:13

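In my elasticsearch index, I have indexed a bunch of jobs. For simplicity, let's just say they are a bunch of job titles. When people type a job title into my search engine, I want to "autocomplete" possible matches.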

I have looked into the completion suggester here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html

However, all the examples I have found involve creating a new field on the index and populating it manually at indexing time or via a river.

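Is there a way to have a completion suggester on an existing field? It's fine even if that means reindexing the data. For example, when I want to keep the original not_analyzed text, I can do the following in the mapping: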

"JobTitle": {
"type": "string",
"fields": {
"Original": {
"type": "string",
"index": "not_analyzed"
}
}
}

Is something like this possible with suggesters?

If not, is it possible to do a non-whitespace-tokenized / N-Gram search against these fields? It would be slower, but I assume it would work.

1 Answer

OK, here is the simple way, using prefix queries, which may or may not scale.

I'll create an index using the "fields" technique you mentioned, along with some handy job description data I found here:

DELETE /test_index

PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}

PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"experienced bra fitter", "desc":"I bet they had trouble finding candidates for this one."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"PlayStation Brand Ambassador", "desc":"please report to your residence in the United States of Nintendo."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Eyebrow Threading", "desc":"I REALLY hope this has something to do with dolls."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Administraive/ Secretary", "desc":"ok, ok, we get it. It’s clear where you need help."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Finish Carpenter", "desc":"for when the Start Carpenter gets tired."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Helpdesk Technician @ Pentagon", "desc":"“Uh, hello? I’m having a problem with this missile…”"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Nail Tech", "desc":"so nails can be pretty complicated…"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Remedy Engineer", "desc":"aren’t those called “doctors”?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Saltlick Cashier", "desc":"new trend in the equestrian industry. Ok, enough horsing around."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Molecular Biologist II", "desc":"when Molecular Biologist I gets promoted."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Breakfast Sandwich Maker", "desc":"we also got one of these recently."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Hotel Housekeepers", "desc":"why can’t they just say ‘hotelkeepers’?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Preschool Teacher #4065", "desc":"either that’s a really big school or they’ve got robot teachers."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"glacéau drop team", "desc":"for a new sport at the Winter Olympics: ice-water spilling."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"PLUMMER/ELECTRICIAN", "desc":"get a dictionary/thesaurus first."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"DoodyCalls Technician", "desc":"they really shouldn’t put down janitors like that."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Golf Staff", "desc":"and here I thought they were called clubs."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Pressure Washers", "desc":"what’s next, heat cleaners?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Sandwich Artist", "desc":"another “Jesus in my food” wannabe."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Self Storage Manager", "desc":"this is for self storage?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Qualified Infant Caregiver", "desc":"too bad for all the unqualified caregivers on the list."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Ground Support", "desc":"but there’s just more dirt under there."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Gymboree Teacher", "desc":"the hardest part is not burning your hands sliding down the pole."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"COMMERCIAL space hunter", "desc":"so they did find animals further out in the cosmos? Who knew."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"JOB COACH", "desc":"if they’re unemployed when they get to you, what does that say about them?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"KIDS KAMP INSTRUCTOR!", "desc":"no spelling ability required."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"POOLS SUPERVISOR", "desc":"“yeah, they’re still wet…”"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"HOUSE MANAGER/TEEN SUPERVISOR", "desc":"see the dictionary under P, for Parent."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Licensed Seamless Gutter Contractor", "desc":"just sounds bad."}

Then I can easily run a prefix query:
POST /test_index/_search
{
"query": {
"prefix": {
"title": {
"value": "san"
}
}
}
}
...
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "mcRfqtwzTyWE7ZNsKFvwEg",
"_score": 1,
"_source": {
"title": "Breakfast Sandwich Maker",
"desc": "we also got one of these recently."
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "fIYV0WOWRe6gfpYy_u2jlg",
"_score": 1,
"_source": {
"title": "Sandwich Artist",
"desc": "another “Jesus in my food” wannabe."
}
}
]
}
}

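Or, if I want to be more careful about matching, I can use the non-analyzed field. The prefix value is not analyzed, so against "title.raw" it has to match the stored case ("San", not "san"):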
POST /test_index/_search
{
"query": {
"prefix": {
"title.raw": {
"value": "San"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "fIYV0WOWRe6gfpYy_u2jlg",
"_score": 1,
"_source": {
"title": "Sandwich Artist",
"desc": "another “Jesus in my food” wannabe."
}
}
]
}
}
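
To see why the two prefix queries return different hits, here is a rough Python sketch that does not talk to Elasticsearch at all. It assumes the analyzed "title" field behaves roughly like "lowercase, then split on non-alphanumeric characters" (a simplification of the default analyzer), and that "title.raw" stores the whole title as one case-sensitive term. The function names and the small TITLES list are made up for illustration.

```python
import re

# A few titles taken from the sample data above.
TITLES = [
    "Breakfast Sandwich Maker",
    "Sandwich Artist",
    "Saltlick Cashier",
    "Nail Tech",
]


def analyzed_terms(text):
    """Approximate the analyzed field: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def prefix_on_analyzed(prefix, titles):
    """A prefix query on the analyzed field matches if ANY term starts with it."""
    return [
        title
        for title in titles
        if any(term.startswith(prefix) for term in analyzed_terms(title))
    ]


def prefix_on_raw(prefix, titles):
    """A prefix query on the not_analyzed field sees the whole title as one term."""
    return [title for title in titles if title.startswith(prefix)]


print(prefix_on_analyzed("san", TITLES))  # terms in the middle of a title can match
print(prefix_on_raw("San", TITLES))       # only titles that begin with "San"
print(prefix_on_analyzed("San", TITLES))  # indexed terms are lowercase, so no match
```

So the analyzed field gives "starts with, for any word in the title", while the raw field gives "the title itself starts with", which is usually closer to what an autocomplete box wants.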

That's the simplest way. Ngrams are a little more involved, but not difficult. I'll add that in another answer in a bit.

Here is the code I used:

http://sense.qbox.io/gist/4e066d051d7dab5fe819264b0f4b26d958d115a9

EDIT: Ngram version

Borrowing the analyzer from this blog post (shameless plug), I can set up the index as follows:
DELETE /test_index

PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}

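Notice that I use a different analyzer for indexing and for searching; that's important, because if the search query were also broken up into ngrams, we would likely get a lot more hits than we want.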
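
To make that concrete, here is a small Python sketch (again, not Elasticsearch itself) of the two analysis chains. It assumes whitespace tokenization, lowercasing, and ngrams of length 2 to 20, and it leaves out asciifolding for brevity. All function names and the short TITLES list are hypothetical, and matching is modeled as "any query term appears among the indexed terms", which is how a match query with the default OR operator behaves.

```python
# A few titles taken from the sample data above.
TITLES = [
    "Ground Support",
    "POOLS SUPERVISOR",
    "HOUSE MANAGER/TEEN SUPERVISOR",
    "Pressure Washers",
    "Nail Tech",
]


def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of token whose length is between min_gram and max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def ngram_terms(text):
    """Index-time chain: whitespace tokenizer -> lowercase -> ngram filter."""
    terms = set()
    for token in text.lower().split():
        terms |= ngrams(token)
    return terms


def whitespace_terms(text):
    """Search-time chain: whitespace tokenizer -> lowercase (no ngrams)."""
    return set(text.lower().split())


def match(titles, query_terms):
    """A title matches if it shares at least one term with the query."""
    return [t for t in titles if ngram_terms(t) & query_terms]


# Query analyzed with the whitespace chain: the only query term is "sup".
print(match(TITLES, whitespace_terms("sup")))

# Query analyzed with the ngram chain: the terms are "su", "up" and "sup",
# so "Pressure Washers" now matches too, via the "su" in "pressure".
print(match(TITLES, ngram_terms("sup")))
```

With the whitespace search analyzer only the three titles containing "sup" come back; with ngrams on the query side, anything sharing a two-letter fragment sneaks in.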

After populating the index with the same data set used above, I can query with a simple match query to get the results I expect:
POST /test_index/_search
{
"query": {
"match": {
"title": "sup"
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.8631258,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "4pcAOmPNSYupjz7lSes8jw",
"_score": 1.8631258,
"_source": {
"title": "Ground Support",
"desc": "but there’s just more dirt under there."
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "DVFOC6DsTa6eH_a-RtbUUw",
"_score": 1.8631258,
"_source": {
"title": "POOLS SUPERVISOR",
"desc": "“yeah, they’re still wet…”"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "klleY_bnQ4uFmCPF94sLOw",
"_score": 1.4905007,
"_source": {
"title": "HOUSE MANAGER/TEEN SUPERVISOR",
"desc": "see the dictionary under P, for Parent."
}
}
]
}
}

Here is the code:

http://sense.qbox.io/gist/b0e77bb7f05a4527de5ab4345749c793f923794c

Regarding "elasticsearch - Completion suggester on an existing field in ElasticSearch", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28312421/
