Data mining: extracting film and TV information from Baidu Zhidao Q&A (code snippet)

mengrennwpu  mengrennwpu  2022-11-24  418

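1. Background

We crawled a set of film- and TV-related Q&A pairs from Baidu Zhidao. To improve film and TV search later on, the relevant film and TV information has to be extracted from the answers of those Q&A pairs.

2. Workflow

A basic media asset video library already exists. The video names in that library are used to build a segmentation dictionary backed by an Aho-Corasick double-array trie (the "AC double array"), which is then used to segment the Baidu Q&A text. The segmentation results can afterwards be filtered by video popularity and rating.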
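The filtering step at the end of this workflow is not shown in the post. A minimal sketch of how it might look, assuming every matched title can be looked up in the media library together with a popularity score and a rating. `CandidateFilter`, `VideoMeta`, the thresholds, and the sample numbers are all hypothetical and only illustrate the idea:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CandidateFilter {

    /** Hypothetical metadata for one title in the media library. */
    public static final class VideoMeta {
        public final String name;
        public final long popularity; // assumed: a play count or similar heat score
        public final double rating;   // assumed: a 0-10 scale

        public VideoMeta(String name, long popularity, double rating) {
            this.name = name;
            this.popularity = popularity;
            this.rating = rating;
        }
    }

    /** Keeps only matched titles whose popularity and rating both reach the thresholds. */
    public static List<String> filter(List<String> matchedTitles,
                                      Map<String, VideoMeta> library,
                                      long minPopularity,
                                      double minRating) {
        List<String> kept = new ArrayList<>();
        for (String title : matchedTitles) {
            VideoMeta meta = library.get(title);
            // titles missing from the library are dropped as well
            if (meta != null && meta.popularity >= minPopularity && meta.rating >= minRating) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // made-up sample values, for illustration only
        Map<String, VideoMeta> library = new HashMap<>();
        library.put("Title A", new VideoMeta("Title A", 900_000L, 8.0));
        library.put("Hello", new VideoMeta("Hello", 120L, 5.0));

        List<String> matched = Arrays.asList("Title A", "Hello");
        System.out.println(filter(matched, library, 1_000L, 6.0));
    }
}
```

A filter of this kind matters because a video name can also be an everyday word or phrase; a dictionary match alone cannot tell the two apart, while a low popularity or rating makes an accidental match more likely.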

3. Code implementation

(1) Build the segmenter from a text file (one video name per line) using the AC double array

package com.test.model.act;

import com.google.common.collect.Lists;
import com.test.util.IOUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static com.test.model.act.AhoCorasickDoubleArrayTrie.*;
import java.io.*;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

/**
 * @author test
 * @date 2018/11/1
 */
public class Act {

    private static final Logger logger = LoggerFactory.getLogger(Act.class);

    private static Act instance = null;
    private static final String path = "act";
    private final AhoCorasickDoubleArrayTrie<Resource> act = new AhoCorasickDoubleArrayTrie<>();

    // synchronized so that concurrent callers cannot build the trie twice
    public static synchronized Act getInstance() throws IOException, ClassNotFoundException {
        if (null == instance) {
            instance = new Act();
        }
        return instance;
    }

    public Act() throws IOException, ClassNotFoundException {
        this.initTrie();
    }

    /**
     * Initializes the AC automaton: loads the serialized trie if the file exists,
     * otherwise builds it from videoNames.txt and saves it.
     * @throws IOException
     * @throws ClassNotFoundException
     */
    private void initTrie() throws IOException, ClassNotFoundException {
        if (new File(path).exists()) {
            long curTime = System.currentTimeMillis();
            // try-with-resources closes the stream even if load() throws
            try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(path))) {
                act.load(ois);
            }
            logger.info("load act cost: {} ms", System.currentTimeMillis() - curTime);
        } else {
            TreeMap<String, Resource> treeMap = new TreeMap<>();
            List<String> datas = IOUtil.getPreprocessedData("videoNames.txt");
            for (String data : datas) {
                data = data.trim();
                // skip blank lines so that the empty string is not added as a keyword
                if (!data.isEmpty() && !treeMap.containsKey(data)) {
                    Resource resource = new Resource(data);
                    treeMap.put(data, resource);
                }
            }
            long curTime = System.currentTimeMillis();
            act.build(treeMap);
            logger.info("build act cost: {} ms", System.currentTimeMillis() - curTime);

            curTime = System.currentTimeMillis();
            try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(path))) {
                act.save(oos);
            }
            logger.info("save act cost: {} ms", System.currentTimeMillis() - curTime);
        }
    }

    /**
     * Longest-match segmentation with the AC trie
     * @param queryText the text to segment
     * @return the matched terms
     */
    public List<Term<Resource>> parse(String queryText) {
        final List<Term<Resource>> terms = Lists.newArrayList();
        act.parseText(queryText, new AhoCorasickDoubleArrayTrie.IHit<Resource>() {
            @Override
            public void hit(int begin, int end, Resource value) {
                Iterator<Term<Resource>> iterator = terms.iterator();
                int length = end - begin;
                boolean isSubStr = false;
                while (iterator.hasNext()) {
                    Term<Resource> current = iterator.next();
                    // overlaps the new hit and is shorter than it: remove
                    // (end is exclusive, so two terms overlap only if current.end > begin)
                    if (current.end > begin && length > current.getLength()) {
                        iterator.remove();
                    }
                    // the new hit is a substring of a term that was already collected
                    if (current.getValue().getValue().contains(value.getValue())) {
                        isSubStr = true;
                    }
                }
                if (!isSubStr) {
                    terms.add(new Term<Resource>(begin, end, value));
                }
            }
        });
        return terms;
    }

    /**
     * Drops a term when the preceding term already ends with it; the order of the text is kept.
     */
    public List<String> neatSplitResult(List<Term<Resource>> terms) {
        List<String> results = Lists.newArrayList();
        if (terms.isEmpty()) {
            return results; // nothing matched, and terms.get(0) below would throw
        }

        List<String> dupResults = Lists.newArrayList();
        for (int j = terms.size() - 1; j > 0; j--) {
            String termJ = terms.get(j).getValue().getValue();
            if (!terms.get(j - 1).getValue().getValue().endsWith(termJ)) {
                dupResults.add(termJ);
            }
        }
        dupResults.add(terms.get(0).getValue().getValue());

        // dupResults was filled back to front; reverse it to restore the text order
        for (int j = dupResults.size() - 1; j >= 0; j--) {
            results.add(dupResults.get(j));
        }
        return results;
    }
}
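The `Act` class relies on two types, `Resource` and `Term`, that the post does not show. A minimal sketch of what they might look like, inferred only from how `Act` uses them (`new Resource(String)`, `getValue()`, `new Term<>(begin, end, value)`, `end`, `getLength()`); the wrapper class `ActSupportTypes` and every field beyond those calls are assumptions, and the author's real classes may differ:

```java
import java.io.Serializable;

public class ActSupportTypes {

    /**
     * Hypothetical value object stored in the trie. It is Serializable because
     * the trie's save() writes the value array with an ObjectOutputStream.
     */
    public static class Resource implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String value;

        public Resource(String value) {
            this.value = value;
        }

        public String getValue() {
            return value;
        }

        @Override
        public String toString() {
            return value;
        }
    }

    /** Hypothetical match record: [begin, end) in the text plus the matched value. */
    public static class Term<V> {
        public final int begin; // inclusive
        public final int end;   // exclusive
        private final V value;

        public Term(int begin, int end, V value) {
            this.begin = begin;
            this.end = end;
            this.value = value;
        }

        public int getLength() {
            return end - begin;
        }

        public V getValue() {
            return value;
        }
    }

    public static void main(String[] args) {
        Term<Resource> term = new Term<>(2, 6, new Resource("abcd"));
        System.out.println(term.getValue().getValue() + " length=" + term.getLength());
    }
}
```

In the original project these would presumably be top-level classes in `com.test.model.act`; they are nested here only so that the sketch compiles as a single file.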

(2) The referenced AhoCorasickDoubleArrayTrie
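As orientation for the listing that follows, here is a deliberately simplified Aho-Corasick matcher. It is a sketch, not the double-array implementation: it stores transitions in plain hash maps instead of the compact `base`/`check` arrays, and the class name `SimpleAhoCorasick` is made up for this example. It shows the same three ingredients, though: success transitions, failure links built by a breadth-first pass, and outputs reported with an inclusive `begin` and an exclusive `end`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class SimpleAhoCorasick {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>(); // success transitions
        Node fail;                                         // failure link
        final List<String> outputs = new ArrayList<>();    // keywords ending here
    }

    private final Node root = new Node();

    public SimpleAhoCorasick(List<String> keywords) {
        for (String keyword : keywords) {
            if (keyword.isEmpty()) {
                continue; // the empty string is not a useful keyword
            }
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        buildFailureLinks();
    }

    /** Breadth-first pass: depth-1 nodes fail to the root, deeper nodes follow their parent's link. */
    private void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> entry : current.next.entrySet()) {
                char c = entry.getKey();
                Node target = entry.getValue();
                Node trace = current.fail;
                while (trace != root && !trace.next.containsKey(c)) {
                    trace = trace.fail;
                }
                Node candidate = trace.next.get(c);
                target.fail = (candidate != null) ? candidate : root;
                // a node also emits everything its failure target emits
                target.outputs.addAll(target.fail.outputs);
                queue.add(target);
            }
        }
    }

    /** Returns every hit formatted as keyword[begin,end), with end exclusive. */
    public List<String> parse(String text) {
        List<String> hits = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            Node nextState = state.next.get(c);
            state = (nextState != null) ? nextState : root;
            int end = i + 1;
            for (String keyword : state.outputs) {
                hits.add(keyword + "[" + (end - keyword.length()) + "," + end + ")");
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SimpleAhoCorasick ac = new SimpleAhoCorasick(Arrays.asList("he", "she", "his", "hers"));
        // prints [she[1,4), he[2,4), hers[2,6)]
        System.out.println(ac.parse("ushers"));
    }
}
```

The double-array version trades this readability for speed and memory: a transition becomes a pair of array lookups, and the failure and output tables are indexed by state number rather than held in node objects.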

package com.test.model.act;

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.*;
import java.util.concurrent.LinkedBlockingDeque;

/**
 * An implementation of Aho Corasick algorithm based on Double Array Trie
 *
 * @author hankcs
 */
public class AhoCorasickDoubleArrayTrie<V> implements Serializable {
    /**
     * check array of the Double Array Trie structure
     */
    protected int check[];
    /**
     * base array of the Double Array Trie structure
     */
    protected int base[];
    /**
     * fail table of the Aho Corasick automata
     */
    protected int fail[];
    /**
     * output table of the Aho Corasick automata
     */
    protected int[][] output;
    /**
     * outer value array
     */
    protected V[] v;

    /**
     * the length of every key
     */
    protected int[] l;

    /**
     * the size of base and check array
     */
    protected int size;

    /**
     * Parse text
     *
     * @param text The text
     * @return a list of outputs
     */
    public List<Hit<V>> parseText(String text) {
        int position = 1;
        int currentState = 0;
        List<Hit<V>> collectedEmits = new LinkedList<Hit<V>>();
        for (int i = 0; i < text.length(); ++i) {
            currentState = getState(currentState, text.charAt(i));
            storeEmits(position, currentState, collectedEmits);
            ++position;
        }

        return collectedEmits;
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(String text, IHit<V> processor) {
        int position = 1;
        int currentState = 0;
        for (int i = 0; i < text.length(); ++i) {
            currentState = getState(currentState, text.charAt(i));
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    processor.hit(position - l[hit], position, v[hit]);
                }
            }
            ++position;
        }
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(String text, IHitCancellable<V> processor) {
        int currentState = 0;
        for (int i = 0; i < text.length(); i++) {
            final int position = i + 1;
            currentState = getState(currentState, text.charAt(i));
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    boolean proceed = processor.hit(position - l[hit], position, v[hit]);
                    if (!proceed) {
                        return;
                    }
                }
            }
        }
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(char[] text, IHit<V> processor) {
        int position = 1;
        int currentState = 0;
        for (char c : text) {
            currentState = getState(currentState, c);
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    processor.hit(position - l[hit], position, v[hit]);
                }
            }
            ++position;
        }
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(char[] text, IHitFull<V> processor) {
        int position = 1;
        int currentState = 0;
        for (char c : text) {
            currentState = getState(currentState, c);
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    processor.hit(position - l[hit], position, v[hit], hit);
                }
            }
            ++position;
        }
    }


    /**
     * Save
     *
     * @param out An ObjectOutputStream object
     * @throws IOException Some IOException
     */
    public void save(ObjectOutputStream out) throws IOException {
        out.writeObject(base);
        out.writeObject(check);
        out.writeObject(fail);
        out.writeObject(output);
        out.writeObject(l);
        out.writeObject(v);
    }

    /**
     * Load
     *
     * @param in An ObjectInputStream object
     * @throws IOException
     * @throws ClassNotFoundException
     */
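    @SuppressWarnings("unchecked") // the cast to V[] cannot be checked at runtime
    public void load(ObjectInputStream in) throws IOException, ClassNotFoundException {
        // read the arrays in the same order in which save() wrote them
        base = (int[]) in.readObject();
        check = (int[]) in.readObject();
        fail = (int[]) in.readObject();
        output = (int[][]) in.readObject();
        l = (int[]) in.readObject();
        v = (V[]) in.readObject();
    }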

    /**
     * Get value by a String key, just like a map.get() method
     *
     * @param key The key
     * @return
     */
    public V get(String key) {
        int index = exactMatchSearch(key);
        if (index >= 0) {
            return v[index];
        }

        return null;
    }

    /**
     * Pick the value by index in the value array <br>
     * Note that, for efficiency, this method does NOT check the parameter
     *
     * @param index The index
     * @return The value
     */
    public V get(int index) {
        return v[index];
    }

    /**
     * Processor handles the output when hit a keyword
     */
    public interface IHit<V> {
        /**
         * Hit a keyword, you can use some code like text.substring(begin, end) to get the keyword
         *
         * @param begin the beginning index, inclusive.
         * @param end   the ending index, exclusive.
         * @param value the value assigned to the keyword
         */
        void hit(int begin, int end, V value);
    }

    /**
     * Processor handles the output when hit a keyword, with more detail
     */
    public interface IHitFull<V> {
        /**
         * Hit a keyword, you can use some code like text.substring(begin, end) to get the keyword
         *
         * @param begin the beginning index, inclusive.
         * @param end   the ending index, exclusive.
         * @param value the value assigned to the keyword
         * @param index the index of the value assigned to the keyword, you can use the integer as a perfect hash value
         */
        void hit(int begin, int end, V value, int index);
    }

    /**
     * Callback that allows to cancel the search process.
     */
    public interface IHitCancellable<V> {
        /**
         * Hit a keyword, you can use some code like text.substring(begin, end) to get the keyword
         *
         * @param begin the beginning index, inclusive.
         * @param end   the ending index, exclusive.
         * @param value the value assigned to the keyword
         * @return Return true for continuing the search and false for stopping it.
         */
        boolean hit(int begin, int end, V value);
    }

    /**
     * A result output
     *
     * @param <V> the value type
     */
    public class Hit<V> {
        /**
         * the beginning index, inclusive.
         */
        public final int begin;
        /**
         * the ending index, exclusive.
         */
        public final int end;
        /**
         * the value assigned to the keyword
         */
        public final V value;

        public Hit(int begin, int end, V value) {
            this.begin = begin;
            this.end = end;
            this.value = value;
        }

        @Override
        public String toString() {
            return String.format("[%d:%d]=%s", begin, end, value);
        }
    }

    /**
     * transmit state, supports failure function
     *
     * @param currentState
     * @param character
     * @return
     */
    private int getState(int currentState, char character) {
        // first try the success transition
        int newCurrentState = transitionWithRoot(currentState, character);
        // if that fails, follow the failure links until a transition succeeds
        while (newCurrentState == -1) {
            currentState = fail[currentState];
            newCurrentState = transitionWithRoot(currentState, character);
        }
        return newCurrentState;
    }

    /**
     * store output
     *
     * @param position
     * @param currentState
     * @param collectedEmits
     */
    private void storeEmits(int position, int currentState, List<Hit<V>> collectedEmits) {
        int[] hitArray = output[currentState];
        if (hitArray != null) {
            for (int hit : hitArray) {
                collectedEmits.add(new Hit<V>(position - l[hit], position, v[hit]));
            }
        }
    }

    /**
     * transition of a state
     *
     * @param current
     * @param c
     * @return
     */
    protected int transition(int current, char c) {
        int b = current;
        int p;

        p = b + c + 1;
        if (b == check[p]) {
            b = base[p];
        } else {
            return -1;
        }

        p = b;
        return p;
    }

    /**
     * transition of a state, if the state is root and it failed, then returns the root
     *
     * @param nodePos
     * @param c
     * @return
     */
    protected int transitionWithRoot(int nodePos, char c) {
        int b = base[nodePos];
        int p;

        p = b + c + 1;
        if (b != check[p]) {
            if (nodePos == 0) {
                return 0;
            }
            return -1;
        }

        return p;
    }


    /**
     * Build an AhoCorasickDoubleArrayTrie from a map
     *
     * @param map a map containing key-value pairs
     */
    public void build(Map<String, V> map) {
        new Builder().build(map);
    }


    /**
     * match exactly by a key
     *
     * @param key the key
     * @return the index of the key, you can use it as a perfect hash function
     */
    public int exactMatchSearch(String key) {
        return exactMatchSearch(key, 0, 0, 0);
    }

    /**
     * match exactly by a key
     *
     * @param key
     * @param pos
     * @param len
     * @param nodePos
     * @return
     */
    private int exactMatchSearch(String key, int pos, int len, int nodePos) {
        if (len <= 0) {
            len = key.length();
        }
        if (nodePos <= 0) {
            nodePos = 0;
        }

        int result = -1;

        char[] keyChars = key.toCharArray();

        int b = base[nodePos];
        int p;

        for (int i = pos; i < len; i++) {
            p = b + (int) (keyChars[i]) + 1;
            if (b == check[p]) {
                b = base[p];
            } else {
                return result;
            }
        }

        p = b;
        int n = base[p];
        if (b == check[p] && n < 0) {
            result = -n - 1;
        }
        return result;
    }

    /**
     * match exactly by a key
     *
     * @param keyChars the char array of the key
     * @param pos      the begin index of char array
     * @param len      the length of the key
     * @param nodePos  the starting position of the node for searching
     * @return the value index of the key, minus indicates null
     */
    private int exactMatchSearch(char[] keyChars, int pos, int len, int nodePos) {
        int result = -1;

        int b = base[nodePos];
        int p;

        for (int i = pos; i < len; i++) {
            p = b + (int) (keyChars[i]) + 1;
            if (b == check[p]) {
                b = base[p];
            } else {
                return result;
            }
        }

        p = b;
        int n = base[p];
        if (b == check[p] && n < 0) {
            result = -n - 1;
        }
        return result;
    }

    /**
     * Get the size of the keywords
     *
     * @return
     */
    public int size() {
        return v.length;
    }

    /**
     * A builder to build the AhoCorasickDoubleArrayTrie
     */
    private class Builder {
        /**
         * the root state of trie
         */
        private State rootState = new State();
        /**
         * whether the position has been used
         */
        private boolean used[];
        /**
         * the allocSize of the dynamic array
         */
        private int allocSize;
        /**
         * a parameter controls the memory growth speed of the dynamic array
         */
        private int progress;
        /**
         * the next position to check unused memory
         */
        private int nextCheckPos;
        /**
         * the size of the key-pair sets
         */
        private int keySize;

        /**
         * Build from a map
         *
         * @param map a map containing key-value pairs
         */
        @SuppressWarnings("unchecked")
        public void build(Map<String, V> map) {
            // keep the values
            v = (V[]) map.values().toArray();
            l = new int[v.length];
            Set<String> keySet = map.keySet();
            // build the intermediate trie (a tree of State nodes)
            addAllKeyword(keySet);
            // build the double-array trie on top of that trie
            buildDoubleArrayTrie(keySet.size());
            used = null;
            // construct the failure table and merge the output table
            constructFailureStates();
            rootState = null;
            loseWeight();
        }

        /**
         * fetch siblings of a parent node
         *
         * @param parent   parent node
         * @param siblings parent node's child nodes, i.e. the siblings
         * @return the amount of the siblings
         */
        private int fetch(State parent, List<Map.Entry<Integer, State>> siblings) {
            if (parent.isAcceptable()) {
                // this node is a child of parent and also carries parent's output
                State fakeNode = new State(-(parent.getDepth() + 1));
                fakeNode.addEmit(parent.getLargestValueId());
                siblings.add(new AbstractMap.SimpleEntry<Integer, State>(0, fakeNode));
            }
            for (Map.Entry<Character, State> entry : parent.getSuccess().entrySet()) {
                siblings.add(new AbstractMap.SimpleEntry<Integer, State>(entry.getKey() + 1, entry.getValue()));
            }
            return siblings.size();
        }

        /**
         * add a keyword
         *
         * @param keyword a keyword
         * @param index   the index of the keyword
         */
        private void addKeyword(String keyword, int index) {
            State currentState = this.rootState;
            for (Character character : keyword.toCharArray()) {
                currentState = currentState.addState(character);
            }
            currentState.addEmit(index);
            l[index] = keyword.length();
        }

        /**
         * add a collection of keywords
         *
         * @param keywordSet the collection holding keywords
         */
        private void addAllKeyword(Collection<String> keywordSet) {
            int i = 0;
            for (String keyword : keywordSet) {
                addKeyword(keyword, i++);
            }
        }

        /**
         * construct failure table
         */
        private void constructFailureStates() {
            fail = new int[size + 1];
            fail[1] = base[0];
            output = new int[size + 1][];
            Queue<State> queue = new LinkedBlockingDeque<State>();

            // step 1: set the failure of every depth-1 node to the root
            for (State depthOneState : this.rootState.getStates()) {
                depthOneState.setFailure(this.rootState, fail);
                queue.add(depthOneState);
                constructOutput(depthOneState);
            }

            // step 2: build the failure table for nodes of depth > 1; this is a BFS
            while (!queue.isEmpty()) {
                State currentState = queue.remove();

                for (Character transition : currentState.getTransitions()) {
                    State targetState = currentState.nextState(transition);
                    queue.add(targetState);

                    State traceFailureState = currentState.failure();
                    // (the listing is cut off at this point in the source)