
Scrapy crawler: scraping JD mechanical keyboard comment counts and plotting the results

Posted: 2020-02-07 11:06:28


Introduction

I recently wanted to get a better sense of mechanical keyboards, so I used Scrapy to scrape JD's mechanical keyboard search results and then used Python to analyse and plot comment counts by shop name.

Analysis

Before writing the crawler, we need to analyse how JD's mechanical keyboard listings are requested.

1. Open JD and search for 机械键盘 (mechanical keyboard)

# Page URL
/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=fdac35af19ef4c7bbe23defb205b1b59

2. View the page source

Looking at the source, only 30 items are rendered by default; once you scroll past those 30 items in the browser, the page automatically loads another 30 via AJAX.

Checking the requests in the developer tools, we can find the URL that this asynchronous AJAX call hits:

# Next 30 items
/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=2&s=27&scrolling=y&log_id=1517196404.59517&tpl=1_M&show_items=3378484,6218105,3204859,2629440,3491212,2991278,1832316,4103095,5028795,2694404,3034311,1543721098,3606368,1792545,4911552,10494209225,2818591,2155852,1882111,3491218,584773,2942614,4285176,4873773,4106737,3204891,1495945,5259880,12039586866,3093295

Note:

The URL contains page=2.

The show_items value in the URL is the list of data-sku values of the first 30 items in the page source (see the short sketch below).

Once the AJAX call has loaded the remaining 30 items, the page's content is fully loaded.
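As a quick illustration of that last point, here is a minimal sketch (my addition, not from the original post) that builds a show_items string from the data-sku attributes of a locally saved copy of the search page. page.html is a hypothetical local file, and parsel is the selector library that ships with Scrapy:

# minimal sketch: build show_items from the data-sku attributes of a saved page
from parsel import Selector

with open('page.html') as f:
    sel = Selector(text=f.read().decode('utf-8'))

# one data-sku per product <li>; the AJAX URL expects them comma-separated
skus = sel.xpath('//li[@class="gl-item"]/@data-sku').extract()
show_items = ','.join(skus)
print show_items

The spider further down does the same extraction with response.xpath inside Scrapy itself.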

3. Analyse pagination

Click through to page 2 and check the URLs:

# Page 2, first 30 items
/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=3&s=57&click=0
# Page 2, next 30 items
/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=4&s=84&scrolling=y&log_id=1517225828.64245&tpl=1_M&show_items=14689611523,1365181,3890366,3086129,5455802,4237668,3931658,3491228,1654797409,2361918,5442762,4237678,5225170,4960228,4237662,3931616,3491188,5009394,10151123711,4838698,4911578,1543721097,3093301,4838762,1836476,5910288,1135833,4277018,5028785,1324969

Click through to page 3 and check the URLs:

# Page 3, first 30 items
/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=5&s=110&click=0
# Page 3, next 30 items
/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=6&s=137&scrolling=y&log_id=1517225931.50937&tpl=1_M&show_items=5965870,3093297,14758401114,4825074,1247140,4911566,3634890,3212216,2329142,5155156,5225170,1812788,613970,5391428,1836460,1771658520,1308971,2512327,15428123588,2512333,3176567,6039820,10048750474,3093303,3724961,338871,10235508261,2144773,1939376,1543721095

From these URLs we can see that the /Search requests (first 30 items of each page) use odd page values (1, 3, 5, ...), while the AJAX requests for the next 30 items use even page values (2, 4, 6, ...).

That gives us our crawling plan: scrape the first 30 items on the current page, collect their data-sku values, then simulate the AJAX request to fetch the remaining 30 items; once the current page is done, move to the next page in the same way until the last page, as sketched below.
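To make the arithmetic explicit, here is a minimal sketch (illustration only, not part of the spider) mapping a logical results page n to the two page values used by the /Search and /s_new.php requests:

# page-number scheme described above, for logical results page n
def page_params(n):
    search_page = 2 * n - 1   # odd: page value of /Search (first 30 items)
    ajax_page = 2 * n         # even: page value of /s_new.php (next 30 items)
    return search_page, ajax_page

for n in range(1, 4):
    print n, page_params(n)   # -> 1 (1, 2)   2 (3, 4)   3 (5, 6)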

Implementation

1. Define the item

vim items.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst


# Convert the comment count from a string to a float, expanding "万" (10,000)
# into its numeric value so the figures are easier to analyse later.
def filter_comment(x):
    value = x.strip('+')
    if value[-1] == u'万':
        return float(value[:-1]) * 10000
    else:
        return float(value)


class KeyboardItem(scrapy.Item):
    # shop name
    shopname = scrapy.Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=TakeFirst()
    )
    # product name
    band = scrapy.Field(output_processor=TakeFirst())
    # price
    price = scrapy.Field(output_processor=TakeFirst())
    # comment count
    comment = scrapy.Field(
        input_processor=MapCompose(filter_comment),
        output_processor=TakeFirst()
    )

In this item definition:

The filter_comment function converts the comment count from a string to a float, expanding 万 (10,000) into its numeric value to make later analysis easier, because some comment counts are given in units of 万, e.g. 1.5万.

MapCompose(unicode.strip) strips surrounding whitespace.

output_processor=TakeFirst() keeps only the first extracted value; without it, shopname, price, band and comment would each come back as lists.

Without this processing, the generated JSON file would look like this:

[{"comment": [1.2万+], "band": ["新盟游戏", "机械键盘"], "price": ["129.00"], "shopname": [罗技G官方旗舰店"]},......]

After processing:

[{"comment": 120000.0, "band": "新盟游戏", "price": "129.00", "shopname": 罗技G官方旗舰店"},......]

This format is much easier to work with when we do the numeric analysis with pandas later on.
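As a quick sanity check of filter_comment (my addition, assuming the snippet is run from the Scrapy project root so that jingdong.items is importable):

# -*- coding: utf-8 -*-
# quick check of the comment-count conversion defined in items.py
from jingdong.items import filter_comment

print filter_comment(u'1.5万+')  # -> 15000.0
print filter_comment(u'8100+')   # -> 8100.0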

Spider implementation

1. Write the spider

vim keyboard.py

# -*- coding: utf-8 -*-
# JD search: mechanical keyboards
import scrapy
from jingdong.items import KeyboardItem
from scrapy.loader import ItemLoader


class KeyboardSpider(scrapy.Spider):
    name = 'keyboard'
    allowed_domains = ['']
    #start_urls = ['/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=361c7116408b4a10b5e769e3fd25bbbf']
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"}

    def start_requests(self):
        # Overridden so that headers and the page number (via meta) are attached.
        yield scrapy.Request(url='/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=361c7116408b4a10b5e769e3fd25bbbf',
                             meta={'pagenum': 1}, headers=self.headers, callback=self.parse_first30)

    def parse_first30(self, response):
        # Scrape the first 30 items of the page.
        pagenum = response.meta['pagenum']
        print '进入机械键盘第' + str(pagenum) + '页,显示前30条'
        for eachitem in response.xpath('//li[@class="gl-item"]'):
            load = ItemLoader(item=KeyboardItem(), selector=eachitem)
            info = load.nested_xpath('div')
            info.add_xpath('shopname', 'div[@class="p-shop"]/span/a/@title')
            info.add_xpath('band', 'div[@class="p-name p-name-type-2"]/a/em/text()')
            info.add_xpath('price', 'div[@class="p-price"]/strong/i/text()')
            info.add_xpath('comment', 'div[@class="p-commit"]/strong/a/text()')
            yield load.load_item()
        # Collect the data-sku values of the first 30 items.
        skulist = response.xpath('//li[@class="gl-item"]/@data-sku').extract()
        skustring = ','.join(skulist)
        # The AJAX-loaded second half uses an even page number.
        pagenum_more = pagenum * 2
        baseurl = '/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&&s=28&scrolling=y&log_id=1517052655.49883&tpl=1_M&'
        # URL of the AJAX request for the next 30 items.
        ajaxurl = baseurl + 'page=' + str(pagenum_more) + '&show_items=' + skustring.encode('utf-8')
        yield scrapy.Request(ajaxurl, meta={'pagenum': pagenum}, headers=self.headers, callback=self.parse_next30)

    def parse_next30(self, response):
        # Scrape the next 30 items.
        pagenum = response.meta['pagenum']
        print '进入机械键盘第' + str(pagenum) + '页,显示后30条'
        for eachitem in response.xpath('//li[@class="gl-item"]'):
            load = ItemLoader(item=KeyboardItem(), selector=eachitem)
            info = load.nested_xpath('div')
            info.add_xpath('shopname', 'div[@class="p-shop"]/span/a/@title')
            info.add_xpath('band', 'div[@class="p-name p-name-type-2"]/a/em/text()')
            info.add_xpath('price', 'div[@class="p-price"]/strong/i/text()')
            info.add_xpath('comment', 'div[@class="p-commit"]/strong/a/text()')
            yield load.load_item()
        # data-sku values of the next 30 items.
        skulist = response.xpath('//li[@class="gl-item"]/@data-sku').extract()
        pagenum = pagenum + 1
        # The next page's /Search URL uses an odd page number.
        nextreal_num = pagenum * 2 - 1
        # Next page URL.
        next_page = '/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&s=56&click=0&page=' + str(nextreal_num)
        yield scrapy.Request(next_page, meta={'pagenum': pagenum}, headers=self.headers, callback=self.parse_first30)

Note: the logical page number n is passed along via meta. For example:

Page 1: pagenum = 1 requests the first 30 items; pagenum_more = pagenum*2 = 2 is the page value used in the AJAX URL for the next 30 items.
Page 2: after pagenum is incremented to 2, nextreal_num = pagenum*2-1 = 3 is the page value used in the next page's /Search URL.

2. Run the spider

scrapy crawl keyboard -o keyboard.json

An excerpt of the resulting keyboard.json:

[{"comment": 120000.0, "band": "新盟游戏", "price": "129.00"},
{},
{},
{"comment": 15000.0, "band": "罗技(Logitech)G610 Cherry轴全尺寸背光", "price": "599.00", "shopname": "罗技G官方旗舰店"},
{"comment": 9900.0, "band": "ikbc c104 樱桃轴", "price": "389.00", "shopname": "ikbc京东自营旗舰店"},
{"comment": 11000.0, "band": "美商海盗船(USCorsair)Gaming系列 K70 LUX RGB 幻彩背光", "price": "1299.00", "shopname": "美商海盗船京东自营旗舰店"},
{"comment": 34000.0, "band": "达尔优(dareu)108键", "price": "199.00", "shopname": "达尔优京东自营旗舰店"},
{"comment": 74000.0, "band": "雷柏(Rapoo) V700S合金版 混光", "price": "189.00", "shopname": "雷柏京东自营官方旗舰店"},
{"comment": 8100.0, "band": "罗技(Logitech)G610 Cherry轴全尺寸背光", "price": "599.00", "shopname": "罗技G官方旗舰店"},
{"comment": 26000.0, "band": "雷蛇(Razer)BlackWidow X 黑寡妇蜘蛛X幻彩版 悬浮式游戏", "price": "799.00", "shopname": "雷蛇RAZER京东自营旗舰店"},
{"comment": 74000.0, "band": "雷柏(Rapoo) V500PRO 混光", "price": "169.00", "shopname": "雷柏京东自营官方旗舰店"},
{"comment": 150000.0, "band": "前行者游戏背光发光牧马人", "price": "65.00", "shopname": "敏涛数码专营店"},
{"comment": 11000.0, "band": "樱桃(Cherry)MX-BOARD 2.0 G80-3800 游戏办", "price": "389.00"},
{"comment": 12000.0, "band": "美商海盗船(USCorsair)STRAFE 惩戒者 ", "price": "699.00", "shopname": "美商海盗船京东自营旗舰店"},
{"comment": 6700.0, "band": "罗技(Logitech)G413", "price": "449.00", "shopname": "罗技G官方旗舰店"},
{"comment": 120000.0, "band": "新盟游戏", "price": "89.00", "shopname": "敏涛数码专营店"},
{"comment": 26000.0, "band": "雷蛇(Razer)BlackWidow X 黑寡妇蜘蛛X 竞技版87键 悬浮式游戏", "price": "299.00", "shopname": "雷蛇RAZER京东自营旗舰店"},
{"comment": 110000.0, "band": "达尔优(dareu)108键", "price": "199.00", "shopname": "达尔优京东自营旗舰店"},
{"comment": 61000.0, "band": "狼蛛(AULA)F混光跑马 ", "price": "129.00", "shopname": "狼蛛外设京东自营官方旗舰店"},
.......]

Scientific computing

With the data scraped by Scrapy, we can use Python's scientific-computing stack to analyse comment counts by shop name and plot the result.

vim keyboard_analyse.py

#!/home/yanggd/miniconda2/envs/science/bin/python
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame
import json

filename = 'keyboard.json'
# Build a DataFrame from the scraped JSON file.
with open(filename) as f:
    pop_data = json.load(f)
df = DataFrame(pop_data)
group_shopname = df.groupby('shopname')
group = group_shopname.mean()
#print group
# Font settings so the Chinese axis labels render correctly.
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['simhei']
plt.rcParams['axes.unicode_minus'] = False
# Bar chart of average comment count per shop.
group.plot(kind='bar')
plt.xlabel(u"店铺名")
plt.ylabel(u"评论量")
plt.show()

# Run it
python keyboard_analyse.py
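One optional tweak (my addition, not in the original script, assuming a pandas version that provides sort_values) is to order the shops by average comment count and save the chart to a file instead of opening a window; appended to the end of keyboard_analyse.py it would look like this:

# optional: sort shops by mean comment count and write the chart to a PNG
group_sorted = df.groupby('shopname')['comment'].mean().sort_values(ascending=False)
group_sorted.plot(kind='bar')
plt.xlabel(u"店铺名")
plt.ylabel(u"评论量")
plt.tight_layout()
plt.savefig('keyboard_comments.png', dpi=150)

Sorting just makes the bar chart easier to read; the underlying numbers are unchanged.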
