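Recently I wanted to scrape some completed novels from an ACG-style fiction site. The site is fairly niche: unlike the big sites, where copies turn up everywhere, the sites I found offer no txt download, so I used a crawler to save the novels locally. This post uses two sites as examples: trxs.org / trxs.cc and www.xklxsw.com.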

Scrapy

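Installation

Reference: https://scrapy-docs.readthedocs.io/zh/latest/intro/install.html

In short: set up an Anaconda environment and install with conda create -n webspider -c conda-forge scrapy (the name after -n is the environment, here webspider; scrapy is the package to install). Installing directly is not only tedious and prone to all sorts of odd problems, it also clobbers PATH, and uninstalling has to be done carefully.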

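Creating the crawler

There are two ways to create a spider with Scrapy: create a project and then create the spider inside it, or write a standalone spider directly. I chose the project route, but for a small job like scraping a novel a standalone spider is enough; see https://docs.scrapy.org/en/latest/topics/spiders.html. After parsing, the result can be written to a file in the spider's closed() method.

Run scrapy startproject novel on the command line, cd into the novel folder, and run scrapy genspider trxs www.trxs.cc (genspider takes the spider name followed by the site's domain). That creates the spider.

Then, in ./novel/novel/items.py, change class NovelItem to: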

class NovelItem(scrapy.Item):
    name = scrapy.Field()
    chapter = scrapy.Field()
    id = scrapy.Field()
    content = scrapy.Field()

In ./novel/novel/settings.py, uncomment USER_AGENT, ROBOTSTXT_OBEY, and ITEM_PIPELINES (the last one is required; without it the pipeline never runs).

./novel/novel/spiders/trxs.py:
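As a rough illustration, the three settings could end up looking like this. The User-Agent string is an assumed example, and the `ROBOTSTXT_OBEY` value is simply the one the project template ships with; the post does not say which values were actually used.

```python
# novel/novel/settings.py (excerpt)

# Assumed value: any realistic browser User-Agent string can go here.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Value from the project template; the post does not say whether it was changed.
ROBOTSTXT_OBEY = True

# Without this entry Scrapy never calls NovelPipeline, so nothing gets written.
# The number (0-1000) sets the order when several pipelines are enabled.
ITEM_PIPELINES = {
    "novel.pipelines.NovelPipeline": 300,
}
```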

import scrapy
from .. import items  # a plain "import items" fails: items.py lives in the parent package (novel), which is not on sys.path by itself
import re

class TrxsSpider(scrapy.Spider):
    name = 'trxs'
    #allowed_domains = ['']
    start_urls = ['https://www.trxs.cc/tongren/****.html']
    # Chapter numbers on trxs.org are not the real chapter numbers; "Chapter N" is only a sequence index
    def parse(self, response):
        chapter_list = response.xpath('//div[@class="book_list clearfix"]/ul/li/a')
        for i in chapter_list:
            url = 'https://www.trxs.cc' + i.xpath('./@href').extract_first()
            chapter = i.xpath('string(.)').extract_first()
            request = scrapy.Request(url, callback=self.parse_content,dont_filter=True)
            #request.meta["name"]=''
            # Requests run concurrently and responses come back out of order, so each chapter needs an id
            request.meta["id"]=int(re.findall(r'\d+',chapter)[0])
            request.meta["chapter"]=chapter
            yield request
    
    def parse_content(self, response):
        content_list = response.xpath('//div[@class="read_chapterDetail"]/text()').extract()
        content = '\n'.join(content_list)
        item = items.NovelItem()
        item['name'] = ''
        item['id']=response.meta['id']
        item['chapter']=response.meta['chapter']
        item['content']=content
        yield item
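One fragile spot in the spider is `re.findall(r'\d+', chapter)[0]`, which raises `IndexError` when a link text contains no digits (an afterword, for instance). A slightly more defensive way to derive the id could look like this; `chapter_id` and its `fallback` parameter are hypothetical helpers, not part of Scrapy.

```python
import re


def chapter_id(title, fallback):
    """Return the first integer found in a chapter title, or `fallback` if there is none."""
    match = re.search(r"\d+", title)
    return int(match.group()) if match else fallback


# The link's position on the index page can serve as the fallback id.
titles = ["第1章 开端", "第2章", "后记"]
ids = [chapter_id(t, fallback=pos) for pos, t in enumerate(titles, start=1)]
print(ids)  # [1, 2, 3]
```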

./novel/novel/pipelines.py:

from itemadapter import ItemAdapter

class NovelPipeline:
    def open_spider(self, spider):
        # chapter id -> chapter text; filled in arrival order, written out sorted
        self.m = {}
        self.file = open('pipeline.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Iterate over the sorted ids so that a missing or non-contiguous id cannot raise KeyError
        for i in sorted(self.m):
            self.file.write(self.m[i])
        self.file.close()

    def process_item(self, item, spider):
        id = int(item["id"])
        self.m[id] = item["content"]
        return item
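The ordering idea can be checked outside Scrapy. A minimal sketch, assuming items arrive as plain dicts with the same `id`, `chapter`, and `content` keys as `NovelItem`; `merge_chapters` is a hypothetical helper, and unlike the pipeline it also writes each chapter title.

```python
def merge_chapters(items):
    """Join chapters by ascending id, tolerating gaps and out-of-order arrival."""
    by_id = {int(item["id"]): item for item in items}
    parts = []
    for key in sorted(by_id):
        item = by_id[key]
        parts.append(f'{item["chapter"]}\n{item["content"]}\n')
    return "\n".join(parts)


# Items arrive in whatever order the responses come back.
arrived = [
    {"id": 3, "chapter": "Chapter 3", "content": "c"},
    {"id": 1, "chapter": "Chapter 1", "content": "a"},
]
print(merge_chapters(arrived))
```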

Then run scrapy crawl trxs on the command line to start the spider.

Web Scraper

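xklxsw.com has anti-scraping measures; crawling it with Scrapy returns 403 straight away. If you would rather not debug and fight those measures, you can use Web Scraper, which runs inside the browser and behaves just like a human visitor.

Installation

Search for it in the Chrome Web Store and install it.

Creating the crawler

Press F12 and find Web Scraper in the top bar of the developer tools. On the novel's chapter page (the one listing the chapter links), choose Create new sitemap, then Add new selector; set the type to Link, tick Multiple, and click Select element under Selector to pick the links. A nice touch is that Web Scraper does not require typing XPath or CSS selectors by hand: you just click elements on the page, and once you pick two elements of the same kind it selects all the matching ones automatically.

After creating the selector, click into it and create a child selector of type Text, then select the novel text on a chapter's content page. Next, open the Sitemap menu at the top, choose Scrape, and wait. When it finishes, use Export to save the data as CSV, which can then be handed to Python to parse, strip ads, and merge.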
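A possible post-processing step, sketched with the standard library only. The column names `chapter` and `content` are assumptions (the exported columns are named after the selector ids you chose), and the ad pattern is a made-up example.

```python
import csv
import io
import re


def merge_export(csv_text, ad_patterns, title_col="chapter", content_col="content"):
    """Sort exported rows by the number in the chapter title, drop ad lines, join."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def order(row):
        # Rows whose title has no number are placed at the end.
        match = re.search(r"\d+", row[title_col])
        return int(match.group()) if match else float("inf")

    chapters = []
    for row in sorted(rows, key=order):
        kept = [
            line for line in row[content_col].splitlines()
            if not any(re.search(p, line) for p in ad_patterns)
        ]
        chapters.append(row[title_col] + "\n" + "\n".join(kept))
    return "\n\n".join(chapters)


# Tiny in-memory stand-in for an exported file.
sample = (
    "chapter,content\n"
    '第2章,"second\nvisit example.com for more"\n'
    "第1章,first\n"
)
print(merge_export(sample, ad_patterns=[r"example\.com"]))
```

For a real export, read the file's text first (for example with open(path, encoding="utf-8", newline="")) and pass it in as csv_text.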

Zrj's blog

Notes on Using Scrapy and Web Scraper
