Scrapy crawler framework: scraping the Douban Top 250 movies

Posted: 2021-11-08 03:06:54


1. Create a new project

Change into the directory where you want to store the code and run the following on the command line:

scrapy startproject douban
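
This generates a project skeleton roughly like the following (the exact file list varies a little between Scrapy versions):

douban/
    scrapy.cfg            # deploy configuration
    douban/               # the project's Python package
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (variant 2 below)
        settings.py       # settings such as USER_AGENT and ITEM_PIPELINES
        spiders/          # spiders go in this package (step 3)
            __init__.py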

2. Define the Item

import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # stores the movie title
    star = scrapy.Field()   # stores the movie rating
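
Items behave like dictionaries, which is how the spider fills them in later. A quick interactive sketch (the values are taken from the first scraped result further below):

>>> item = DoubanItem()
>>> item['title'] = u'肖申克的救赎'
>>> item['star'] = u'9.6'
>>> dict(item)
{'star': u'9.6', 'title': u'\u8096\u7533\u514b\u7684\u6551\u8d4e'}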

3. Create the spider

import scrapy
#from douban.items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = "douban"  # the spider's name, used by scrapy crawl
    allowed_domains = ["douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="info"]'):
            # without [0] you get the whole list of Unicode strings
            title = sel.xpath('div[@class="hd"]/a/span/text()').extract()[0]
            star = sel.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            print star, title
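
Before running the full spider, it can help to test the XPath expressions in Scrapy's interactive shell (if Douban answers 403 here, set the User-Agent first as described in step 4; the output shown is illustrative):

$ scrapy shell "https://movie.douban.com/top250"
>>> sel = response.xpath('//div[@class="info"]')[0]
>>> sel.xpath('div[@class="hd"]/a/span/text()').extract()[0]
u'\u8096\u7533\u514b\u7684\u6551\u8d4e'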

4. Keep the crawler from being blocked: spoof the User-Agent

Add the following to settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

Otherwise the crawl fails with 403 errors:

-12-31 23:20:16 [scrapy] INFO: Spider opened
-12-31 23:20:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
-12-31 23:20:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
-12-31 23:20:17 [scrapy] DEBUG: Crawled (403) <GET /robots.txt> (referer: None)
-12-31 23:20:17 [scrapy] DEBUG: Crawled (403) <GET /top250> (referer: None)
-12-31 23:20:17 [scrapy] DEBUG: Ignoring response <403 /top250>: HTTP status code is not handled or not allowed
-12-31 23:20:17 [scrapy] INFO: Closing spider (finished)
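
While editing settings.py, two related settings may be worth a look. These are standard Scrapy settings, but the values below are illustrative, not from the original project:

DOWNLOAD_DELAY = 2    # be polite: pause between requests
# The log above shows robots.txt being fetched; if your project has
# ROBOTSTXT_OBEY enabled and you want to skip that check:
# ROBOTSTXT_OBEY = False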

5. Run the crawl

From the project's root directory, start the crawl with:

scrapy crawl douban

The command-line output looks like this:

-12-31 22:56:42 [scrapy] INFO: Spider opened
-12-31 22:56:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
-12-31 22:56:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
-12-31 22:56:43 [scrapy] DEBUG: Crawled (200) <GET /robots.txt> (referer: None)
-12-31 22:56:43 [scrapy] DEBUG: Crawled (200) <GET /top250> (referer: None)
9.6 肖申克的救赎
9.4 这个杀手不太冷
9.5 霸王别姬
9.4 阿甘正传
9.5 美丽人生
9.2 千与千寻
9.4 辛德勒的名单
9.2 泰坦尼克号
9.2 海上钢琴师
9.2 盗梦空间
9.3 机器人总动员
9.1 三傻大闹宝莱坞
9.2 放牛班的春天
9.2 忠犬八公的故事
9.1 大话西游之大圣娶亲
9.1 龙猫
9.2 教父
9.2 乱世佳人
9.0 楚门的世界
9.1 天堂电影院
8.9 当幸福来敲门
9.0 搏击俱乐部
9.1 触不可及
9.3 十二怒汉
9.1 指环王3:王者无敌
-12-31 22:56:43 [scrapy] INFO: Closing spider (finished)
-12-31 22:56:43 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 612, ...

Variant 1: store title and star in the Item

Modify the spider file as follows:

import scrapy
from douban.items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="info"]'):
            item = DoubanItem()
            # Mixing tabs and spaces here raises 'unindent does not match any
            # outer indentation level'; indent every line the same way.
            item['title'] = sel.xpath('div[@class="hd"]/a/span/text()').extract()[0]
            item['star'] = sel.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            yield item

The result:

-12-31 23:39:05 [scrapy] INFO: Spider opened
-12-31 23:39:05 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
-12-31 23:39:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
-12-31 23:39:05 [scrapy] DEBUG: Crawled (200) <GET /robots.txt> (referer: None)
-12-31 23:39:06 [scrapy] DEBUG: Crawled (200) <GET /top250> (referer: None)
-12-31 23:39:06 [scrapy] DEBUG: Scraped from <200 /top250>
{'star': u'9.6', 'title': u'\u8096\u7533\u514b\u7684\u6551\u8d4e'}
-12-31 23:39:06 [scrapy] DEBUG: Scraped from <200 /top250>
{'star': u'9.4', 'title': u'\u8fd9\u4e2a\u6740\u624b\u4e0d\u592a\u51b7'}
-12-31 23:39:06 [scrapy] DEBUG: Scraped from <200 /top250>
{'star': u'9.5', 'title': u'\u9738\u738b\u522b\u59ec'}
-12-31 23:39:06 [scrapy] DEBUG: Scraped from <200 /top250>
{'star': u'9.4', 'title': u'\u963f\u7518\u6b63\u4f20'}
...
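
Note that this still only covers the 25 movies on the first page. A sketch of how to walk all ten pages, assuming the "next page" link sits inside a span with class next (worth verifying against the live markup):

    def parse(self, response):
        for sel in response.xpath('//div[@class="info"]'):
            item = DoubanItem()
            item['title'] = sel.xpath('div[@class="hd"]/a/span/text()').extract()[0]
            item['star'] = sel.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            yield item
        # follow the pagination link until the last page
        next_page = response.xpath('//span[@class="next"]/a/@href').extract()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0]), callback=self.parse)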

6. Save the scraped data

Scrapy's built-in feed export can write the collected items straight to a JSON file:

scrapy crawl douban -o items.json
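
The resulting items.json should look roughly like this; the Python 2 JSON exporter escapes non-ASCII characters, matching the Unicode escapes in the log above:

[{"star": "9.6", "title": "\u8096\u7533\u514b\u7684\u6551\u8d4e"},
{"star": "9.4", "title": "\u8fd9\u4e2a\u6740\u624b\u4e0d\u592a\u51b7"},
......]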

Variant 2: process the scraped data in an item pipeline

Edit pipelines.py:

# -*- coding: utf-8 -*-
import sys

# Python 2 hack so Unicode titles can be written out as UTF-8
reload(sys)
sys.setdefaultencoding('utf8')

class DoubanPipeline(object):
    def __init__(self):
        self.file = open('douban_top250.txt', mode='wb')

    def process_item(self, item, spider):
        line = 'the top250 movie list: ' + item['title'] + ' ' + item['star'] + '\n'
        self.file.write(line)
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        self.file.close()

Next, register the pipeline in settings.py so Scrapy knows to run it. Add the following line (300 is the pipeline's priority; pipelines with lower numbers run first):

ITEM_PIPELINES = {'douban.pipelines.DoubanPipeline': 300,}
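
Run scrapy crawl douban again; every scraped item now passes through process_item, which appends one line to douban_top250.txt.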

The final contents of douban_top250.txt:

the top250 movie list: 肖申克的救赎 9.6
the top250 movie list: 这个杀手不太冷 9.4
the top250 movie list: 霸王别姬 9.5
the top250 movie list: 阿甘正传 9.4
the top250 movie list: 美丽人生 9.5
the top250 movie list: 千与千寻 9.2
the top250 movie list: 辛德勒的名单 9.4
the top250 movie list: 泰坦尼克号 9.2
the top250 movie list: 盗梦空间 9.2
the top250 movie list: 海上钢琴师 9.2
the top250 movie list: 机器人总动员 9.3
the top250 movie list: 三傻大闹宝莱坞 9.1
the top250 movie list: 放牛班的春天 9.2
the top250 movie list: 忠犬八公的故事 9.2
the top250 movie list: 大话西游之大圣娶亲 9.1
the top250 movie list: 龙猫 9.1
the top250 movie list: 教父 9.2
the top250 movie list: 乱世佳人 9.2
the top250 movie list: 楚门的世界 9.0
the top250 movie list: 天堂电影院 9.1
the top250 movie list: 当幸福来敲门 8.9
the top250 movie list: 搏击俱乐部 9.0
the top250 movie list: 触不可及 9.1
the top250 movie list: 十二怒汉 9.3
the top250 movie list: 指环王3:王者无敌 9.1
