
[Python Web Scraping Study Notes] 3: A Simple Example with the Scrapy Framework


1. Website Analysis

Target task: scrape the quotes, authors, and tags from http://quotes.toscrape.com/

Site structure:

As the page markup shows, the structure is fairly simple: select each div node whose class is quote, then read its child nodes to extract the text.
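Before writing the spider, the selectors can be sanity-checked in a Scrapy shell session, for example:

scrapy shell "http://quotes.toscrape.com/"
>>> quote = response.xpath('//div[@class="quote"]')[0]
>>> quote.xpath('.//span[@class="text"]/text()').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'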

2. Building the Project

Building on the basic project skeleton from the previous post (Basic Usage of the Scrapy Framework), add the following steps.

Define the item container in items.py:

import scrapy

class ScrapyexampleItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Extract the data in the custom spider file spidertest.py:

import scrapy
from scrapyExample.items import ScrapyexampleItem

class SpidertestSpider(scrapy.Spider):
    # spider name, used with "scrapy crawl"
    name = 'spidertest'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for i in quotes:
            # create a fresh item for each quote
            item = ScrapyexampleItem()
            # prefix the path with "." so it is evaluated relative to the
            # current node; otherwise it searches from the document root
            item['text'] = i.xpath('.//span[@class="text"]/text()')[0].extract()
            item['author'] = i.xpath('.//small[@class="author"]/text()')[0].extract()
            item['tags'] = i.xpath('.//div[@class="tags"]//a/text()').extract()
            # hand the item to the pipeline; yielding does not end the loop
            yield item

Be very careful with the XPath here: you must add the leading ".", or the expression will search the whole document instead of the current node.
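To see why the dot matters, compare the two forms inside the loop (same markup as above):

for i in response.xpath('//div[@class="quote"]'):
    i.xpath('.//span[@class="text"]/text()')  # relative: the text of this quote only
    i.xpath('//span[@class="text"]/text()')   # no dot: every quote's text on the page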

Set the pipeline priority in settings.py (values range from 0 to 1000; lower numbers run earlier):

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapyExample.pipelines.ScrapyexamplePipeline': 300,
}

Set up the pipeline in pipelines.py, saving the data locally as JSON:

import json

class ScrapyexamplePipeline:
    # open a file to hold the scraped data; an explicit encoding
    # ensures non-ASCII text is written correctly
    def __init__(self):
        self.f = open("quote.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()

Run the spider from the command line, using the spider's name:

scrapy crawl spidertest
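As an aside, Scrapy's built-in feed exports can also write items to a file without any custom pipeline; the -o option produces a valid JSON array on its own:

scrapy crawl spidertest -o quotes.json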

The saved file looks like this:

{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}, {"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}, {"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]}, {"text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]}, {"text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]}, {"text": "“Try not to become a man of success. Rather become a man of value.”", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]}, {"text": "“It is better to be hated for what you are than to be loved for what you are not.”", "author": "André Gide", "tags": ["life", "love"]}, {"text": "“I have not failed. I've just found 10,000 ways that won't work.”", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]}, {"text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]}, {"text": "“A day without sunshine is like, you know, night.”", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]},

3. Saving to MongoDB at the Same Time via a Pipeline

First, install pymongo:

pip install pymongo
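Before wiring up the pipeline, you can verify the driver reaches your server with a quick check (host and port are the same values used in the settings below):

import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
print(client.server_info()['version'])  # raises ServerSelectionTimeoutError if unreachable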

Configure the MongoDB connection details and register the new pipeline's priority.

In settings.py:

# Configure item pipelines
ITEM_PIPELINES = {
    'scrapyExample.pipelines.ScrapyexamplePipeline': 300,
    'scrapyExample.pipelines.MongoPipeline': 200,
}
# Connection parameters
MONGO_HOST = '127.0.0.1'  # IP address
MONGO_PORT = 27017        # port
MONGO_DB = 'spiderTest'   # database name
MONGO_USER = 'user'
MONGO_PSW = '123456'
MONGO_COLL = 'quotes'     # collection name (assumed value; the pipeline below expects this key)

Then add the pipeline in pipelines.py:

import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class MongoPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient(
            host=settings['MONGO_HOST'],
            port=settings['MONGO_PORT'],
            username=settings['MONGO_USER'],
            password=settings['MONGO_PSW'],  # key must match settings.py (MONGO_PSW, not MONGO_PWD)
        )
        self.db = client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']]

    def process_item(self, item, spider):
        post_item = dict(item)
        self.coll.insert_one(post_item)  # insert one record into the collection
        return item
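For reference, Scrapy's documentation favors pulling settings in via from_crawler and opening the connection in open_spider rather than at import time. A minimal sketch under the same setting names:

import pymongo

class MongoPipelineAlt:
    """Variant of MongoPipeline using Scrapy's recommended lifecycle hooks."""

    @classmethod
    def from_crawler(cls, crawler):
        # read settings through the crawler instead of get_project_settings()
        obj = cls()
        obj.settings = crawler.settings
        return obj

    def open_spider(self, spider):
        s = self.settings
        self.client = pymongo.MongoClient(
            host=s['MONGO_HOST'], port=s['MONGO_PORT'],
            username=s['MONGO_USER'], password=s['MONGO_PSW'])
        self.coll = self.client[s['MONGO_DB']][s['MONGO_COLL']]

    def process_item(self, item, spider):
        self.coll.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()  # release the connection when the spider finishes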

Run the spider again from the command line:

scrapy crawl spidertest

View the results in Robo 3T:

Caveats (pitfalls along the way)

This article uses pymongo 4.0; with other versions, where the connection parameters are configured may differ.

I ran into quite a few problems while adapting other people's code.

The authenticate method is not supported in 4.0 and raises the following error:

TypeError: 'Collection' object is not callable. If you meant to call the 'authenticate' method on a 'Collection' object it is failing because no such method exists.

Fix: pass the credentials directly to MongoClient:

client = pymongo.MongoClient(
    host=settings['MONGO_HOST'],
    port=settings['MONGO_PORT'],
    username=settings['MONGO_USER'],
    password=settings['MONGO_PSW'],
)

The insert method

TypeError: 'Collection' object is not callable. If you meant to call the 'insert' method on a 'Collection' object it is failing because no such method exists.

Version 4.0 removed the insert method.

Fix: use insert_one() or insert_many() to insert data, as in the sketch below.
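Both replacements exist on Collection objects; a quick sketch (connection details and documents are placeholders):

import pymongo

coll = pymongo.MongoClient('127.0.0.1', 27017)['spiderTest']['quotes']
coll.insert_one({'author': 'Albert Einstein', 'tags': ['value']})          # single document
coll.insert_many([{'author': 'Jane Austen'}, {'author': 'Steve Martin'}])  # batch insert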

Reading configuration from settings.py

No error was raised, but no data was being inserted; after some debugging, the database name turned out to be wrong.

The official documentation uses the attribute style client.test directly. If you write the settings lookup that way, however, the database ends up named after the variable itself rather than the variable's value.
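Concretely (the names mirror the settings above):

db = client.MONGO_DB                # wrong: uses a database literally named "MONGO_DB"
db = client['MONGO_DB']             # equally wrong, same literal name
db = client[settings['MONGO_DB']]   # right: resolves to the configured value, 'spiderTest'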
