Forum title web scraper

Posted: 2017-01-04 02:26:50

Question:

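I am writing a simple web scraper that extracts the thread title, username, and last-post time from a forum.

The problem is that the scraper only extracts the last entry in the table.

For example, if the table is structured like this: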

<tbody>
<tr class="">
  <td class="title">    
    <a href="/forums/marketplace/8827" title="View full post details">Title number 1</a>
  </td>
  <td class="author"><a href="/members/pursu" title="View member, pursu">pursu</a></td>
  <td class="count">0</td>
  <td class="last_post">9 minutes ago</td>
</tr>
<tr class="color2">
  <td class="title">

    <a href="/forums/marketplace/8826" title="View full post details">Title number 2</a>
  </td>
  <td class="author"><a href="/members/colinatx" title="View member, colinatx">colinatx</a></td>
  <td class="count">0</td>
  <td class="last_post">9 minutes ago</td>
</tr>
<tr class="">
  <td class="title">    
    <a href="/forums/marketplace/8785" title="View full post details">Title number 3</a>
  </td>
  <td class="author"><a href="/members/Object117" title="View member, Object117">Object117</a></td>
  <td class="count">11</td>
  <td class="last_post">about 1 hour ago</td>
</tr>
</tbody>

The result written to the .json output file is this:

{
    "title": "Title number 3",
    "author": "Object117",
    "lastpost": "about 1 hour ago"
}

It should look like this:

{
    "title": "Title number 1",
    "author": "pursu",
    "lastpost": "9 minutes ago"
}
{
    "title": "Title number 2",
    "author": "colinatx",
    "lastpost": "9 minutes ago"
}
{
    "title": "Title number 3",
    "author": "Object117",
    "lastpost": "about 1 hour ago"
}

My JavaScript:

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){

    //This is the URL to pull data from
    url = 'http://www.pedalroom.com/forums/marketplace';

    // The first parameter is our URL

    // The callback function takes 3 parameters, an error, response status code and the html
    request(url, function(error, response, html){
        if(!error){

            //pulling HTML
            var $ = cheerio.load(html);

            //Variables that capture data
            var title, author, lastpost;
            var json = { title : "", author : "", lastpost : ""};

            $('.title').filter(function(){

                var data = $(this);

                title = data.children().first().text();

                json.title = title;
            })
            $('.author').filter(function(){

                var data = $(this);

                author = data.children().first().text();

                json.author = author;
            })
            $('.last_post').filter(function(){

                var data = $(this);

                lastpost = data.text();

                json.lastpost = lastpost;
            })
        }

        fs.writeFile('output.json', JSON.stringify(json, null, 4), function(err){

            console.log('File successfully written! - Check your project directory for the output.json file');

        })

        // Finally, we'll just send out a message to the browser reminding you that this app does not have a UI.
        res.send('Check your console!')

    });
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;

Do I need to loop over the code somehow, or is it something else?

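Answer 1:

In your code, each of the three callbacks writes into the same `json` object, so every matched cell overwrites the previous one and only a single entry survives. You are not looping over the rows and collecting one object per row.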
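The effect can be reproduced without cheerio or a network request. The following is only a minimal sketch: `rows`, `collectIntoSharedObject`, and `collectPerRow` are hypothetical names introduced for illustration, and the row values are copied from the sample HTML in the question.

```javascript
// Hypothetical stand-ins for the three table rows in the question's sample HTML.
const rows = [
  { title: 'Title number 1', author: 'pursu', lastpost: '9 minutes ago' },
  { title: 'Title number 2', author: 'colinatx', lastpost: '9 minutes ago' },
  { title: 'Title number 3', author: 'Object117', lastpost: 'about 1 hour ago' }
];

// Pattern from the question: every iteration writes into the same object,
// so each row replaces the values stored by the previous one.
function collectIntoSharedObject(items) {
  const json = { title: '', author: '', lastpost: '' };
  items.forEach(function (item) {
    json.title = item.title;
    json.author = item.author;
    json.lastpost = item.lastpost;
  });
  return json;
}

// Row-wise pattern: build a fresh object per row and append it to an array.
function collectPerRow(items) {
  const data = [];
  items.forEach(function (item) {
    data.push({ title: item.title, author: item.author, lastpost: item.lastpost });
  });
  return data;
}

console.log(collectIntoSharedObject(rows).title); // Title number 3
console.log(collectPerRow(rows).length);          // 3
```

The shared-object version ends up holding only the last row, which matches the output described in the question; the per-row version keeps all three.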

Here is the working code:

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res){

    //This is the URL to pull data from
    var url = 'http://www.pedalroom.com/forums/marketplace';

    // The first parameter is our URL

    // The callback function takes 3 parameters, an error, response status code and the html
    request(url, function(error, response, html){

        // Declared before the error check so that an empty array is written if the request fails
        var data = [];

        if(!error){

            //pulling HTML
            var $ = cheerio.load(html);

            /**
             * New code starts here
             */
            // For each row of the table
            $('.topics tr').each(function(index, element){

                // If title is present on this line, write it into the json
                if($(this).find('.title a').length > 0){
                    data.push({
                        title: $(this).find('.title a').html(),
                        author: $(this).find('.author a').html(),
                        lastpost: $(this).find('.last_post').html()
                    });
                }
            });
            /**
             * Ends here :D
             */
        }

        fs.writeFile('output.json', JSON.stringify(data, null, 4), function(err){

            console.log('File successfully written! - Check your project directory for the output.json file');

        })

        // Finally, we'll just send out a message to the browser reminding you that this app does not have a UI.
        res.send('Check your console!')

    });
})

app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;
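Note that serializing an array produces a single JSON array (`[ {...}, {...}, {...} ]`) rather than the separate objects shown in the question, which is what keeps the file valid JSON. The `fs.writeFile` callback above also logs success without looking at `err`. A hedged sketch of how the write step could check for failure follows; `serializeRows` and `saveRows` are hypothetical helper names, not part of the answer, and the temporary output path is an assumption chosen so the example does not touch a project directory.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Serialize scraped rows as one JSON array, indented by 4 spaces
// to match the JSON.stringify(data, null, 4) call in the answer.
function serializeRows(rows) {
  if (!Array.isArray(rows)) {
    throw new TypeError('rows must be an array');
  }
  return JSON.stringify(rows, null, 4);
}

// Write the rows to filePath and report the outcome through a
// Node-style callback: callback(err) on failure, callback(null) on success.
function saveRows(rows, filePath, callback) {
  let text;
  try {
    text = serializeRows(rows);
  } catch (err) {
    return callback(err);
  }
  fs.writeFile(filePath, text, function (err) {
    callback(err || null);
  });
}

// Usage: write one sample row to a file in the OS temp directory (assumed location).
const target = path.join(os.tmpdir(), 'output.json');
saveRows([{ title: 'Title number 1', author: 'pursu', lastpost: '9 minutes ago' }], target, function (err) {
  if (err) {
    console.error('Could not write ' + target + ': ' + err.message);
  } else {
    console.log('File successfully written: ' + target);
  }
});
```

Inside the route handler, the same check would let the success message be logged only when the file was actually written.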
