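1. Background
Some Baidu Zhidao (百度知道) Q&A pairs about films and TV shows were crawled from the web. To improve video search later on, the relevant film and TV information has to be extracted from the answers in these Q&A pairs.
2. Process
A basic media-asset video library is already available. The video names in that library are used to build a segmentation dictionary, the dictionary is loaded into an Aho-Corasick automaton backed by a double-array trie, and the automaton is then used to segment the Baidu Q&A text. The terms matched during segmentation can afterwards be filtered by video popularity and rating.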
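The text does not say how popularity and rating are combined, so the following is only a minimal sketch of one possible filter. The `Candidate` class, its fields, and the two thresholds are hypothetical names introduced for illustration, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the post-segmentation filter; all names here are assumptions. */
class CandidateFilter {

    /** One matched video name plus metadata looked up in the media library. */
    static final class Candidate {
        final String name;
        final long heat;      // assumed popularity figure, e.g. a play count
        final double rating;  // assumed score, e.g. on a 0-10 scale

        Candidate(String name, long heat, double rating) {
            this.name = name;
            this.heat = heat;
            this.rating = rating;
        }
    }

    /** Keeps only the names whose heat and rating both reach the thresholds. */
    static List<String> filter(List<Candidate> candidates, long minHeat, double minRating) {
        List<String> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.heat >= minHeat && c.rating >= minRating) {
                kept.add(c.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = new ArrayList<>();
        candidates.add(new Candidate("A", 5000L, 8.1));
        candidates.add(new Candidate("B", 10L, 9.0));   // too obscure
        candidates.add(new Candidate("C", 9000L, 4.0)); // rated too low
        System.out.println(filter(candidates, 1000L, 6.0));
    }
}
```

A simple conjunction of two thresholds is used here because it is easy to tune; a weighted score would be an equally plausible choice.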
3. Code implementation
(1) Build the segmenter from a text file (one video name per line) using the Aho-Corasick double-array trie
package com.test.model.act;

import com.google.common.collect.Lists;
import com.test.util.IOUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static com.test.model.act.AhoCorasickDoubleArrayTrie.*;

import java.io.*;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

/**
 * @author test
 * @date 2018/11/1
 */
public class Act {
    private static Logger logger = LoggerFactory.getLogger(Act.class);
    private static Act instance = null;
    private static String path = "act";
    private AhoCorasickDoubleArrayTrie<Resource> act = new AhoCorasickDoubleArrayTrie<>();

    // synchronized so that concurrent callers cannot build the trie twice
    public static synchronized Act getInstance() throws IOException, ClassNotFoundException {
        if (null == instance) {
            instance = new Act();
        }
        return instance;
    }

    public Act() throws IOException, ClassNotFoundException {
        this.initTrie();
    }

    /**
     * Initialise the AC automaton: load the serialized file if it exists,
     * otherwise build it from the video names and save it.
     * @throws IOException
     * @throws ClassNotFoundException
     */
    private void initTrie() throws IOException, ClassNotFoundException {
        if (new File(path).exists()) {
            long curTime = System.currentTimeMillis() / 1000;
            // try-with-resources closes the streams even if loading fails
            try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(path))) {
                act.load(ois);
            }
            logger.info("load act cost: " + (System.currentTimeMillis() / 1000 - curTime));
        } else {
            TreeMap<String, Resource> treeMap = new TreeMap<>();
            List<String> datas = IOUtil.getPreprocessedData("videoNames.txt");
            for (String data : datas) {
                data = data.trim();
                if (!treeMap.containsKey(data)) {
                    Resource resource = new Resource(data);
                    treeMap.put(data, resource);
                }
            }
            long curTime = System.currentTimeMillis() / 1000;
            act.build(treeMap);
            logger.info("build act cost: " + (System.currentTimeMillis() / 1000 - curTime));
            curTime = System.currentTimeMillis() / 1000;
            try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(path))) {
                act.save(oos);
            }
            logger.info("save act cost: " + (System.currentTimeMillis() / 1000 - curTime));
        }
    }

    /**
     * Longest-match segmentation with the AC trie
     * @param queryText
     * @return
     */
    public List<Term<Resource>> parse(String queryText) {
        final List<Term<Resource>> terms = Lists.newArrayList();
        act.parseText(queryText, new AhoCorasickDoubleArrayTrie.IHit<Resource>() {
            @Override
            public void hit(int begin, int end, Resource value) {
                Iterator<Term<Resource>> iterator = terms.iterator();
                int length = end - begin;
                boolean isSubStr = false;
                while (iterator.hasNext()) {
                    Term<Resource> current = iterator.next();
                    // intersects the new hit and is shorter than it: remove
                    if (current.end >= begin && length > current.getLength()) {
                        iterator.remove();
                    }
                    // the new hit is a substring of a term already seen: skip it
                    if (current.getValue().getValue().contains(value.getValue())) {
                        isSubStr = true;
                    }
                }
                if (!isSubStr) {
                    terms.add(new Term<Resource>(begin, end, value));
                }
            }
        });
        return terms;
    }

    public List<String> neatSplitResult(List<Term<Resource>> terms) {
        List<String> results = Lists.newArrayList();
        // nothing matched: avoid terms.get(0) on an empty list
        if (terms.isEmpty()) {
            return results;
        }
        List<String> dupResults = Lists.newArrayList();
        // walk backwards, dropping a term that is a suffix of the term before it
        for (int j = terms.size() - 1; j > 0; j--) {
            String termJ = terms.get(j).getValue().getValue();
            if (!terms.get(j - 1).getValue().getValue().endsWith(termJ)) {
                dupResults.add(termJ);
            }
        }
        dupResults.add(terms.get(0).getValue().getValue());
        // restore the original left-to-right order
        for (int j = dupResults.size() - 1; j >= 0; j--) {
            results.add(dupResults.get(j));
        }
        return results;
    }
}
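The idea behind `parse`, keeping the longest of several overlapping dictionary hits, can be tried out in isolation. The sketch below is a simplification under stated assumptions: it finds hits with a naive scan instead of the automaton, and it decides overlap purely by position, whereas `parse` above also compares the matched strings. `LongestMatchDemo`, `Span`, and `offer` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Keeps the longest dictionary hit among overlapping ones (illustrative only). */
class LongestMatchDemo {

    /** A dictionary hit covering text[begin, end). */
    static final class Span {
        final int begin;
        final int end;
        final String word;

        Span(int begin, int end, String word) {
            this.begin = begin;
            this.end = end;
            this.word = word;
        }

        int length() {
            return end - begin;
        }

        boolean overlaps(Span other) {
            return begin < other.end && other.begin < end;
        }
    }

    static List<String> segment(String text, List<String> dictionary) {
        List<Span> kept = new ArrayList<>();
        // Enumerate hits by end position, the order in which an
        // Aho-Corasick automaton would report them.
        for (int end = 1; end <= text.length(); end++) {
            for (String word : dictionary) {
                int begin = end - word.length();
                if (word.isEmpty() || begin < 0 || !text.startsWith(word, begin)) {
                    continue;
                }
                offer(kept, new Span(begin, end, word));
            }
        }
        List<String> words = new ArrayList<>();
        for (Span span : kept) {
            words.add(span.word);
        }
        return words;
    }

    /** Adds the hit unless an overlapping kept span is at least as long. */
    private static void offer(List<Span> kept, Span hit) {
        for (Span k : kept) {
            if (k.overlaps(hit) && k.length() >= hit.length()) {
                return;
            }
        }
        // Every remaining overlapping span is shorter, so the new hit replaces them.
        kept.removeIf(k -> k.overlaps(hit));
        kept.add(hit);
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("earth", "the wandering earth", "wolf warrior");
        String text = "is the wandering earth better than wolf warrior";
        System.out.println(segment(text, dictionary));
        // prints [the wandering earth, wolf warrior]
    }
}
```

Here "earth" is dropped because the longer title "the wandering earth" covers the same characters, which is the effect the dictionary-based segmenter is after.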
(2) The AhoCorasickDoubleArrayTrie class referenced above
package com.test.model.act;

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.*;
import java.util.concurrent.LinkedBlockingDeque;

/**
 * An implementation of Aho Corasick algorithm based on Double Array Trie
 *
 * @author hankcs
 */
public class AhoCorasickDoubleArrayTrie<V> implements Serializable {
    /**
     * check array of the Double Array Trie structure
     */
    protected int check[];
    /**
     * base array of the Double Array Trie structure
     */
    protected int base[];
    /**
     * fail table of the Aho Corasick automata
     */
    protected int fail[];
    /**
     * output table of the Aho Corasick automata
     */
    protected int[][] output;
    /**
     * outer value array
     */
    protected V[] v;
    /**
     * the length of every key
     */
    protected int[] l;
    /**
     * the size of base and check array
     */
    protected int size;

    /**
     * Parse text
     *
     * @param text The text
     * @return a list of outputs
     */
    public List<Hit<V>> parseText(String text) {
        int position = 1;
        int currentState = 0;
        List<Hit<V>> collectedEmits = new LinkedList<Hit<V>>();
        for (int i = 0; i < text.length(); ++i) {
            currentState = getState(currentState, text.charAt(i));
            storeEmits(position, currentState, collectedEmits);
            ++position;
        }
        return collectedEmits;
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(String text, IHit<V> processor) {
        int position = 1;
        int currentState = 0;
        for (int i = 0; i < text.length(); ++i) {
            currentState = getState(currentState, text.charAt(i));
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    processor.hit(position - l[hit], position, v[hit]);
                }
            }
            ++position;
        }
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(String text, IHitCancellable<V> processor) {
        int currentState = 0;
        for (int i = 0; i < text.length(); i++) {
            final int position = i + 1;
            currentState = getState(currentState, text.charAt(i));
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    boolean proceed = processor.hit(position - l[hit], position, v[hit]);
                    if (!proceed) {
                        return;
                    }
                }
            }
        }
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(char[] text, IHit<V> processor) {
        int position = 1;
        int currentState = 0;
        for (char c : text) {
            currentState = getState(currentState, c);
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    processor.hit(position - l[hit], position, v[hit]);
                }
            }
            ++position;
        }
    }

    /**
     * Parse text
     *
     * @param text      The text
     * @param processor A processor which handles the output
     */
    public void parseText(char[] text, IHitFull<V> processor) {
        int position = 1;
        int currentState = 0;
        for (char c : text) {
            currentState = getState(currentState, c);
            int[] hitArray = output[currentState];
            if (hitArray != null) {
                for (int hit : hitArray) {
                    processor.hit(position - l[hit], position, v[hit], hit);
                }
            }
            ++position;
        }
    }

    /**
     * Save
     *
     * @param out An ObjectOutputStream object
     * @throws IOException Some IOException
     */
    public void save(ObjectOutputStream out) throws IOException {
        out.writeObject(base);
        out.writeObject(check);
        out.writeObject(fail);
        out.writeObject(output);
        out.writeObject(l);
        out.writeObject(v);
    }

    /**
     * Load
     *
     * @param in An ObjectInputStream object
     * @throws IOException
     * @throws ClassNotFoundException
     */
    public void load(ObjectInputStream in) throws IOException, ClassNotFoundException {
        base = (int[]) in.readObject();
        check = (int[]) in.readObject();
        fail = (int[]) in.readObject();
        output = (int[][]) in.readObject();
        l = (int[]) in.readObject();
        v = (V[]) in.readObject();
    }

    /**
     * Get value by a String key, just like a map.get() method
     *
     * @param key The key
     * @return
     */
    public V get(String key) {
        int index = exactMatchSearch(key);
        if (index >= 0) {
            return v[index];
        }
        return null;
    }

    /**
     * Pick the value by index in value array <br>
     * Note that, for efficiency, this method does NOT check the parameter
     *
     * @param index The index
     * @return The value
     */
    public V get(int index) {
        return v[index];
    }

    /**
     * Processor handles the output when hit a keyword
     */
    public interface IHit<V> {
        /**
         * Hit a keyword, you can use some code like text.substring(begin, end) to get the keyword
         *
         * @param begin the beginning index, inclusive.
         * @param end   the ending index, exclusive.
         * @param value the value assigned to the keyword
         */
        void hit(int begin, int end, V value);
    }

    /**
     * Processor handles the output when hit a keyword, with more detail
     */
    public interface IHitFull<V> {
        /**
         * Hit a keyword, you can use some code like text.substring(begin, end) to get the keyword
         *
         * @param begin the beginning index, inclusive.
         * @param end   the ending index, exclusive.
         * @param value the value assigned to the keyword
         * @param index the index of the value assigned to the keyword, you can use the integer as a perfect hash value
         */
        void hit(int begin, int end, V value, int index);
    }

    /**
     * Callback that allows to cancel the search process.
     */
    public interface IHitCancellable<V> {
        /**
         * Hit a keyword, you can use some code like text.substring(begin, end) to get the keyword
         *
         * @param begin the beginning index, inclusive.
         * @param end   the ending index, exclusive.
         * @param value the value assigned to the keyword
         * @return Return true for continuing the search and false for stopping it.
         */
        boolean hit(int begin, int end, V value);
    }

    /**
     * A result output
     *
     * @param <V> the value type
     */
    public class Hit<V> {
        /**
         * the beginning index, inclusive.
         */
        public final int begin;
        /**
         * the ending index, exclusive.
         */
        public final int end;
        /**
         * the value assigned to the keyword
         */
        public final V value;

        public Hit(int begin, int end, V value) {
            this.begin = begin;
            this.end = end;
            this.value = value;
        }

        @Override
        public String toString() {
            return String.format("[%d:%d]=%s", begin, end, value);
        }
    }

    /**
     * transmit state, supports failure function
     *
     * @param currentState
     * @param character
     * @return
     */
    private int getState(int currentState, char character) {
        // first try the success transition
        int newCurrentState = transitionWithRoot(currentState, character);
        // if that fails, follow the failure links
        while (newCurrentState == -1) {
            currentState = fail[currentState];
            newCurrentState = transitionWithRoot(currentState, character);
        }
        return newCurrentState;
    }

    /**
     * store output
     *
     * @param position
     * @param currentState
     * @param collectedEmits
     */
    private void storeEmits(int position, int currentState, List<Hit<V>> collectedEmits) {
        int[] hitArray = output[currentState];
        if (hitArray != null) {
            for (int hit : hitArray) {
                collectedEmits.add(new Hit<V>(position - l[hit], position, v[hit]));
            }
        }
    }

    /**
     * transition of a state
     *
     * @param current
     * @param c
     * @return
     */
    protected int transition(int current, char c) {
        int b = current;
        int p;
        p = b + c + 1;
        if (b == check[p]) {
            b = base[p];
        } else {
            return -1;
        }
        p = b;
        return p;
    }

    /**
     * transition of a state, if the state is root and it failed, then returns the root
     *
     * @param nodePos
     * @param c
     * @return
     */
    protected int transitionWithRoot(int nodePos, char c) {
        int b = base[nodePos];
        int p;
        p = b + c + 1;
        if (b != check[p]) {
            if (nodePos == 0) {
                return 0;
            }
            return -1;
        }
        return p;
    }

    /**
     * Build a AhoCorasickDoubleArrayTrie from a map
     *
     * @param map a map containing key-value pairs
     */
    public void build(Map<String, V> map) {
        new Builder().build(map);
    }

    /**
     * match exactly by a key
     *
     * @param key the key
     * @return the index of the key, you can use it as a perfect hash function
     */
    public int exactMatchSearch(String key) {
        return exactMatchSearch(key, 0, 0, 0);
    }

    /**
     * match exactly by a key
     *
     * @param key
     * @param pos
     * @param len
     * @param nodePos
     * @return
     */
    private int exactMatchSearch(String key, int pos, int len, int nodePos) {
        if (len <= 0) {
            len = key.length();
        }
        if (nodePos <= 0) {
            nodePos = 0;
        }
        int result = -1;
        char[] keyChars = key.toCharArray();
        int b = base[nodePos];
        int p;
        for (int i = pos; i < len; i++) {
            p = b + (int) (keyChars[i]) + 1;
            if (b == check[p]) {
                b = base[p];
            } else {
                return result;
            }
        }
        p = b;
        int n = base[p];
        if (b == check[p] && n < 0) {
            result = -n - 1;
        }
        return result;
    }

    /**
     * match exactly by a key
     *
     * @param keyChars the char array of the key
     * @param pos      the begin index of char array
     * @param len      the length of the key
     * @param nodePos  the starting position of the node for searching
     * @return the value index of the key, minus indicates null
     */
    private int exactMatchSearch(char[] keyChars, int pos, int len, int nodePos) {
        int result = -1;
        int b = base[nodePos];
        int p;
        for (int i = pos; i < len; i++) {
            p = b + (int) (keyChars[i]) + 1;
            if (b == check[p]) {
                b = base[p];
            } else {
                return result;
            }
        }
        p = b;
        int n = base[p];
        if (b == check[p] && n < 0) {
            result = -n - 1;
        }
        return result;
    }

    /**
     * Get the size of the keywords
     *
     * @return
     */
    public int size() {
        return v.length;
    }

    /**
     * A builder to build the AhoCorasickDoubleArrayTrie
     */
    private class Builder {
        /**
         * the root state of trie
         */
        private State rootState = new State();
        /**
         * whether the position has been used
         */
        private boolean used[];
        /**
         * the allocSize of the dynamic array
         */
        private int allocSize;
        /**
         * a parameter controls the memory growth speed of the dynamic array
         */
        private int progress;
        /**
         * the next position to check unused memory
         */
        private int nextCheckPos;
        /**
         * the size of the key-pair sets
         */
        private int keySize;

        /**
         * Build from a map
         *
         * @param map a map containing key-value pairs
         */
        @SuppressWarnings("unchecked")
        public void build(Map<String, V> map) {
            // keep the values
            v = (V[]) map.values().toArray();
            l = new int[v.length];
            Set<String> keySet = map.keySet();
            // build the binary-search trie
            addAllKeyword(keySet);
            // build the double-array trie on top of the binary-search trie
            buildDoubleArrayTrie(keySet.size());
            used = null;
            // build the failure table and merge the output table
            constructFailureStates();
            rootState = null;
            loseWeight();
        }

        /**
         * fetch siblings of a parent node
         *
         * @param parent   parent node
         * @param siblings parent node's child nodes, i.e. the siblings
         * @return the amount of the siblings
         */
        private int fetch(State parent, List<Map.Entry<Integer, State>> siblings) {
            if (parent.isAcceptable()) {
                // this node is a child of parent and also carries parent's output
                State fakeNode = new State(-(parent.getDepth() + 1));
                fakeNode.addEmit(parent.getLargestValueId());
                siblings.add(new AbstractMap.SimpleEntry<Integer, State>(0, fakeNode));
            }
            for (Map.Entry<Character, State> entry : parent.getSuccess().entrySet()) {
                siblings.add(new AbstractMap.SimpleEntry<Integer, State>(entry.getKey() + 1, entry.getValue()));
            }
            return siblings.size();
        }

        /**
         * add a keyword
         *
         * @param keyword a keyword
         * @param index   the index of the keyword
         */
        private void addKeyword(String keyword, int index) {
            State currentState = this.rootState;
            for (Character character : keyword.toCharArray()) {
                currentState = currentState.addState(character);
            }
            currentState.addEmit(index);
            l[index] = keyword.length();
        }

        /**
         * add a collection of keywords
         *
         * @param keywordSet the collection holding keywords
         */
        private void addAllKeyword(Collection<String> keywordSet) {
            int i = 0;
            for (String keyword : keywordSet) {
                addKeyword(keyword, i++);
            }
        }

        /**
         * construct failure table
         */
        private void constructFailureStates() {
            fail = new int[size + 1];
            fail[1] = base[0];
            output = new int[size + 1][];
            Queue<State> queue = new LinkedBlockingDeque<State>();
            // step 1: set the failure of every depth-1 node to the root
            for (State depthOneState : this.rootState.getStates()) {
                depthOneState.setFailure(this.rootState, fail);
                queue.add(depthOneState);
                constructOutput(depthOneState);
            }
            // step 2: build the failure table for nodes of depth > 1; this is a BFS
            while (!queue.isEmpty()) {
                State currentState = queue.remove();
                for (Character transition : currentState.getTransitions()) {
                    State targetState = currentState.nextState(transition);
                    queue.add(targetState);
                    State traceFailureState = currentState.failure();
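The two steps of `constructFailureStates` (depth-1 nodes fail to the root, deeper nodes are resolved breadth-first) are easier to follow in a plain map-based automaton. The sketch below is not the double-array implementation and does not reconstruct the missing part of the listing; `SimpleAhoCorasick`, `Node`, and `findAll` are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Map-based Aho-Corasick automaton (illustrative; no double-array layout). */
class SimpleAhoCorasick {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();

    SimpleAhoCorasick(List<String> keywords) {
        // Insert every keyword into an ordinary trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        buildFailureLinks();
    }

    private void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        // Step 1: every depth-1 node fails back to the root.
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        // Step 2: breadth-first over deeper nodes.
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> entry : current.next.entrySet()) {
                char c = entry.getKey();
                Node target = entry.getValue();
                // Walk the parent's failure chain until some state accepts c.
                Node f = current.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node candidate = f.next.get(c);
                target.fail = (candidate != null) ? candidate : root;
                // Inherit the keywords that end at the failure state.
                target.outputs.addAll(target.fail.outputs);
                queue.add(target);
            }
        }
    }

    /** Returns every keyword occurrence, ordered by where the match ends. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node state = root;
        for (char c : text.toCharArray()) {
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            Node nextState = state.next.get(c);
            state = (nextState != null) ? nextState : root;
            hits.addAll(state.outputs);
        }
        return hits;
    }

    public static void main(String[] args) {
        SimpleAhoCorasick ac = new SimpleAhoCorasick(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(ac.findAll("ushers"));
        // prints [she, he, hers]
    }
}
```

The double-array version stores the same transitions and failure links in flat `base`, `check`, and `fail` arrays, which trades this readability for compact memory and fast lookups.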