Lucene suggest: implementing search suggestions (code snippets)

zp-uestc  2023-03-03

1. First, add the dependency

<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-suggest -->
<!-- search suggestions -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-suggest</artifactId>
    <version>7.2.1</version>
</dependency>

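2. To offer suggestions, we build a dedicated suggest index for the data that suggestions are drawn from, rather than reusing the existing data index. Building that index requires a data source, which we describe with a class that implements InputIterator. First, here is the InputIterator interface itself: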

public interface InputIterator extends BytesRefIterator {
    InputIterator EMPTY = new InputIterator.InputIteratorWrapper(BytesRefIterator.EMPTY);

    long weight();

    BytesRef payload();

    boolean hasPayloads();

    Set<BytesRef> contexts();

    boolean hasContexts();

    public static class InputIteratorWrapper implements InputIterator {
        private final BytesRefIterator wrapped;

        public InputIteratorWrapper(BytesRefIterator wrapped) {
            this.wrapped = wrapped;
        }

        public long weight() {
            return 1L;
        }

        public BytesRef next() throws IOException {
            return this.wrapped.next();
        }

        public BytesRef payload() {
            return null;
        }

        public boolean hasPayloads() {
            return false;
        }

        public Set<BytesRef> contexts() {
            return null;
        }

        public boolean hasContexts() {
            return false;
        }
    }
}

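weight(): returns the weight of the current term; the higher the weight, the higher the suggestion ranks.

payload(): binary metadata attached to each suggestion. To carry an object, or one of its properties, convert it to a BytesRef; the suggester then returns the payload with each result of lookup.

hasPayloads(): whether the iterator supplies payloads.

contexts(): returns the contexts of the current term, which are used to filter suggestions; returns null if the entry has no contexts.

hasContexts(): whether the iterator supplies contexts.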

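Lucene suggest ships with several InputIterator implementations:

BufferedInputIterator: buffers the binary entries of another iterator and iterates over them;
DocumentInputIterator: iterates over stored fields read from an index;
FileIterator: reads a file one line at a time, with fields separated by \t (at most two \t per line);
HighFrequencyIterator: iterates over the terms of an index field, skipping terms whose frequency falls below a configured threshold;
InputIteratorWrapper: wraps a BytesRefIterator; entries carry no payload and every weight is 1;
SortedInputIterator: iterates over binary entries sorted by the given comparator.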
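
To make the line format read by FileIterator concrete, here is a minimal sketch in plain Java that splits one line into a term, a weight, and a payload. It is not Lucene's own parsing code: the class name SuggestLine, the default weight of 1, and the sample values are assumptions made for illustration.

```java
import java.util.Optional;

/** Toy parser for a "term<TAB>weight<TAB>payload" line (illustration only, not Lucene code). */
class SuggestLine {
    final String term;
    final long weight;
    final Optional<String> payload;

    SuggestLine(String term, long weight, Optional<String> payload) {
        this.term = term;
        this.weight = weight;
        this.payload = payload;
    }

    static SuggestLine parse(String line) {
        // Limit -1 keeps trailing empty fields so the field count is exact.
        String[] fields = line.split("\t", -1);
        // Two tabs at most means three fields at most.
        if (fields.length > 3) {
            throw new IllegalArgumentException("more than 3 fields in line: " + line);
        }
        String term = fields[0];
        // Assumption for this sketch: a missing weight defaults to 1.
        long weight = fields.length > 1 ? Long.parseLong(fields[1].trim()) : 1L;
        Optional<String> payload =
                fields.length > 2 ? Optional.of(fields[2]) : Optional.empty();
        return new SuggestLine(term, weight, payload);
    }
}
```

A line such as "electric guitar", a tab, "100", a tab, "img-1" would parse into the term, the weight 100, and the payload "img-1"; a line holding only the term gets weight 1 and no payload.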

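3. With the data source defined, the next step is to build the suggest index

RAMDirectory indexDir = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer();
AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(indexDir, analyzer);

// Build the index. The concrete InputIterator implementation determines the data source
// and how the entries are indexed. InputIterator is an interface, so pass an implementation,
// for example the ProductIterator from step 5 (products is assumed to be a List<Product>
// prepared by the caller).
suggester.build(new ProductIterator(products.iterator()));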

4. Once the index is built, you can query it. Given a partial input, Lucene suggest returns the matching suggestions according to how the index was built.

private static void lookup(AnalyzingInfixSuggester suggester, String name,
                           String region) throws IOException {
    HashSet<BytesRef> contexts = new HashSet<BytesRef>();
    // Filter the suggestions by the contexts field
    contexts.add(new BytesRef(region.getBytes("UTF8")));
    // Argument 3 (num) caps the number of results, argument 4 says whether every
    // term of the query must match, argument 5 whether to highlight the matches
    List<Lookup.LookupResult> results = suggester.lookup(name, contexts, 2, true, false);
    System.out.println("-- \"" + name + "\" (" + region + "):");
    for (Lookup.LookupResult result : results) {
        // result.key holds the suggestion matched against the user's input
        System.out.println(result.key);
    }
}
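
To make the roles of the lookup arguments easier to see, here is a toy model in plain Java of what the call does conceptually: keep the entries that carry the requested context, require every query token to be a prefix of some word in the key, order by weight from high to low, and return at most num keys. This is only an illustration under those assumptions, not the algorithm AnalyzingInfixSuggester actually uses; ToySuggester and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Conceptual model of context filtering, infix matching and weight ordering (not Lucene code). */
class ToySuggester {
    static final class Entry {
        final String key;
        final long weight;
        final Set<String> contexts;

        Entry(String key, long weight, String... contexts) {
            this.key = key;
            this.weight = weight;
            this.contexts = new HashSet<>(Arrays.asList(contexts));
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String key, long weight, String... contexts) {
        entries.add(new Entry(key, weight, contexts));
    }

    List<String> lookup(String query, String context, int num) {
        String[] prefixes = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return entries.stream()
                .filter(e -> e.contexts.contains(context))      // context filter
                .filter(e -> matchesAll(e.key, prefixes))       // all terms required
                .sorted(Comparator.comparingLong((Entry e) -> e.weight).reversed())
                .limit(num)                                     // at most num results
                .map(e -> e.key)
                .collect(Collectors.toList());
    }

    /** True when every query token is a prefix of at least one word of the key. */
    private static boolean matchesAll(String key, String[] prefixes) {
        String[] words = key.toLowerCase(Locale.ROOT).split("\\s+");
        for (String prefix : prefixes) {
            boolean found = false;
            for (String word : words) {
                if (word.startsWith(prefix)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

With the entries "Electric Guitar" (weight 100, contexts US and CA) and "Acoustic Guitar" (weight 20, contexts US and ZA), a lookup of "gu" in context US returns both keys with "Electric Guitar" first, while the same query in context CA returns only "Electric Guitar".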

5. The following example walks through the complete process of building and querying a suggest index
Entity class

package com.cfh.study.lucence_test6;

import java.io.Serializable;

/**
 * @Author: cfh
 * @Date: 2018/9/17 10:18
 * @Description: POJO used to test the suggest feature
 */
public class Product implements Serializable {
    /** Product name */
    private String name;
    /** Product image */
    private String image;
    /** Regions where the product is sold */
    private String[] regions;
    /** Number of units sold */
    private int numberSold;

    public Product() {
    }

    public Product(String name, String image, String[] regions, int numberSold) {
        this.name = name;
        this.image = image;
        this.regions = regions;
        this.numberSold = numberSold;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getImage() {
        return image;
    }

    public void setImage(String image) {
        this.image = image;
    }

    public String[] getRegions() {
        return regions;
    }

    public void setRegions(String[] regions) {
        this.regions = regions;
    }

    public int getNumberSold() {
        return numberSold;
    }

    public void setNumberSold(int numberSold) {
        this.numberSold = numberSold;
    }
}

Define the data source. Here it is an iterator over a collection of Product objects passed in by the caller; depending on your situation, the source could instead be a file, a database, and so on.

package com.cfh.study.lucence_test6;

import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.util.BytesRef;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/**
 * @Author: cfh
 * @Date: 2018/9/17 10:21
 * @Description: The core class. It determines how the index is built, which
 * suggestions are eventually returned, and how they are ordered.
 */
public class ProductIterator implements InputIterator {
    private Iterator<Product> productIterator;
    private Product currentProduct;

    ProductIterator(Iterator<Product> productIterator) {
        this.productIterator = productIterator;
    }

    /**
     * Whether the contexts field is enabled
     * @return
     */
    public boolean hasContexts() {
        return true;
    }

    /**
     * Whether payloads are supplied
     */
    public boolean hasPayloads() {
        return true;
    }

    public Comparator<BytesRef> getComparator() {
        return null;
    }

    /**
     * The values returned by next() are the candidate suggestions that may come back
     * to us (LookupResult.key). Here we use the product name.
     */
    public BytesRef next() {
        if (productIterator.hasNext()) {
            currentProduct = productIterator.next();
            try {
                // Return the name of the current Product and use it as the key
                return new BytesRef(currentProduct.getName().getBytes("UTF8"));
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException("Couldn't convert to UTF-8", e);
            }
        } else {
            return null;
        }
    }

    /**
     * Serialize the Product object into the payload.
     * [This is only an example and not a recommended practice: storing the whole object
     * in the payload makes the index large and wastes disk space.]
     */
    public BytesRef payload() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bos);
            out.writeObject(currentProduct);
            out.close();
            return new BytesRef(bos.toByteArray());
        } catch (IOException e) {
            throw new RuntimeException("Well that's unfortunate.", e);
        }
    }

    /**
     * Store the product's sales regions as contexts. Contexts can hold any custom data
     * and are typically used for filtering. A TermQuery is created for each element of
     * the Set; you only supply the Set, and Lucene's lower-level API creates the
     * TermQuery objects, but you should understand what happens underneath.
     */
    public Set<BytesRef> contexts() {
        try {
            Set<BytesRef> regions = new HashSet<BytesRef>();
            for (String region : currentProduct.getRegions()) {
                regions.add(new BytesRef(region.getBytes("UTF8")));
            }
            return regions;
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("Couldn't convert to UTF-8", e);
        }
    }

    /**
     * Return the weight, which affects ordering.
     * Here the number of units sold is the weight, so it becomes the weight of each
     * suggestion in the returned list. How you compute the weight is up to you.
     */
    public long weight() {
        return currentProduct.getNumberSold();
    }
}

Finally, test the suggest results. The suggestions are matched on the product name and filtered by the product's region.

private static void lookup(AnalyzingInfixSuggester suggester, String name,
                           String region) throws IOException {
    HashSet<BytesRef> contexts = new HashSet<BytesRef>();
    // Restrict the suggestions to the given region, then match on the product name
    contexts.add(new BytesRef(region.getBytes("UTF8")));
    // Argument 3 (num) caps the number of results, argument 4 says whether every
    // term of the query must match, argument 5 whether to highlight the matches
    List<Lookup.LookupResult> results = suggester.lookup(name, contexts, 2, true, false);
    System.out.println("-- \"" + name + "\" (" + region + "):");
    for (Lookup.LookupResult result : results) {
        // result.key holds the suggestion matched against the user's input
        System.out.println(result.key);
        // Deserialize the Product from the payload (in production the payload usually
        // holds much less, to keep memory usage down)
        BytesRef bytesRef = result.payload;
        // Read only the valid slice of the backing array, not the whole array
        ObjectInputStream is = new ObjectInputStream(
                new ByteArrayInputStream(bytesRef.bytes, bytesRef.offset, bytesRef.length));
        Product product = null;
        try {
            product = (Product) is.readObject();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        if (product == null) {
            continue; // deserialization failed, nothing more to print for this result
        }
        System.out.println("product-Name:" + product.getName());
        // Arrays.toString prints the region names instead of the array reference
        System.out.println("product-regions:" + java.util.Arrays.toString(product.getRegions()));
        System.out.println("product-image:" + product.getImage());
        System.out.println("product-numberSold:" + product.getNumberSold());
    }
    System.out.println();
}
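
The payload handling above can be tried without Lucene. A BytesRef exposes a backing array (bytes) together with an offset and a length, and the valid data may be only a slice of that array. Here is a minimal sketch of the serialize and deserialize round trip that reads just that slice. PayloadCodec and its method names are hypothetical helpers for illustration, not part of Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Round trip of a Serializable value through a byte slice (illustration only). */
class PayloadCodec {
    /** Serializes the value with standard Java serialization. */
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Deserializes from bytes[offset, offset + length), ignoring the rest of the array. */
    static Object deserialize(byte[] bytes, int offset, int length) {
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes, offset, length))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("payload class not on the classpath", e);
        }
    }
}
```

With a Lucene result, the call would look like PayloadCodec.deserialize(payload.bytes, payload.offset, payload.length). As the comments in the iterator note, a compact identifier is usually a better payload than a whole serialized object.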

You can also refer to the original author's GitHub repository: study-lucene

Original post: https://blog.csdn.net/m0_37556444/article/details/82734959
