
Scrapy Crawler in Practice, Part 2: Fetching Weather Information

Time: 2024-05-08 20:21:43


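This project is written in Python 3.6 and uses the Scrapy framework for crawling.

It fetches the current weather for a given city together with the forecast for the following week, then saves the results to a .txt file and writes them to a MySQL database. Working with Scrapy is a bit like a fill-in-the-blank exercise: you only need to put the right content into the right files, and you do not even have to rename the files. The project's directory layout is: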

----weather
    ----spiders
        __init__.py
        wuhanSpider.py
    __init__.py
    items.py
    pipelines.py
    settings.py
scrapy.cfg

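In the listing above, entries without an extension are folders and entries with an extension are files. Only four files need to be edited: wuhanSpider.py, items.py, pipelines.py, and settings.py.

items.py defines which fields to scrape, wuhanSpider.py defines how to scrape them, settings.py decides which pipelines handle the scraped items, and pipelines.py defines how those items are processed. Here pipelines.py saves the scraped data to a .txt file. A second file, pipelines2mysql.py, is given later and saves the data to a MySQL database. You can either replace the contents of pipelines.py with those of pipelines2mysql.py, or keep both files side by side and refer to each one by its own module name in the settings.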

1. Choosing the fields to scrape: items.py

    # Decide which fields to scrape
    import scrapy


    class WeatherItem(scrapy.Item):
        cityDate = scrapy.Field()
        week = scrapy.Field()
        img = scrapy.Field()
        temperature = scrapy.Field()
        weather = scrapy.Field()
        wind = scrapy.Field()

2. Defining how to crawl: wuhanSpider.py

    # Define how to crawl
    import scrapy
    from weather.items import WeatherItem


    class WuHanSpider(scrapy.Spider):
        name = "wuHanSpider"
        allowed_domains = ['']
        citys = ['wuhan', 'shanghai']
        start_urls = []
        for city in citys:
            start_urls.append('http://' + city + '.')

        def parse(self, response):
            # Each forecast entry is one <div class="tqshow1"> block
            subSelector = response.xpath('//div[@class="tqshow1"]')
            items = []
            for sub in subSelector:
                item = WeatherItem()
                # The city and date are split over several text nodes
                cityDates = ''
                for cityDate in sub.xpath('./h3//text()').extract():
                    cityDates += cityDate
                item['cityDate'] = cityDates
                item['week'] = sub.xpath('./p//text()').extract()[0]
                item['img'] = sub.xpath('./ul/li[1]/img/@src').extract()[0]
                # The temperature range is also split over several text nodes
                temps = ''
                for temp in sub.xpath('./ul/li[2]//text()').extract():
                    temps += temp
                item['temperature'] = temps
                item['weather'] = sub.xpath('./ul/li[3]//text()').extract()[0]
                item['wind'] = sub.xpath('./ul/li[4]//text()').extract()[0]
                items.append(item)
            return items

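This is the core of the project. The site crawled here is / , and the data is extracted with XPath selectors. The information we want often lives at several URLs, in which case we need to find the pattern those URLs follow. A quick test shows that the Shanghai weather URL is: / . This article only crawls the weather for Wuhan and Shanghai, but you can add more cities to the citys list above.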
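The URL pattern described above, one subdomain per city, can be sketched as a small helper. This is only an illustration: `build_start_urls` is a hypothetical name, and `weather.example` is a placeholder domain, not the site used in this article.

```python
def build_start_urls(cities, domain):
    """Return one start URL per city, following 'http://<city>.<domain>/'."""
    return ['http://{}.{}/'.format(city, domain) for city in cities]


# 'weather.example' is a placeholder domain used only for illustration.
urls = build_start_urls(['wuhan', 'shanghai'], 'weather.example')
print(urls)
```

Adding a city then only means adding one more entry to the list passed in.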

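Open the page source, as shown in the figure below:

All the weather information sits under <div class="tqshow1"> tags. Pay particular attention to how the code drills down, level by level, to the data we want.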
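To make the level-by-level lookup easier to follow, here is a minimal sketch that uses only the standard library. It assumes a simplified, made-up snippet of markup (the real page is larger and may differ), and it uses xml.etree.ElementTree as a stand-in for Scrapy's selectors, so the calls are not the same as in the spider. The names `SAMPLE`, `text_of` and `parse_forecast` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<html><body>
  <div class="tqshow1">
    <h3>Wuhan <font>06-01</font></h3>
    <p>Thursday</p>
    <ul>
      <li><img src="http://img.example/sunny.png" /></li>
      <li><font>30</font>~<font>22</font></li>
      <li>Sunny</li>
      <li>Light breeze</li>
    </ul>
  </div>
</body></html>
"""


def text_of(node):
    """Join every piece of text under a node into one string."""
    return ''.join(node.itertext()).strip()


def parse_forecast(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    # Step 1: select every forecast block by its class attribute.
    for sub in root.findall(".//div[@class='tqshow1']"):
        # Step 2: inside each block, use paths relative to that block.
        rows.append({
            'cityDate': text_of(sub.find('./h3')),
            'week': text_of(sub.find('./p')),
            'img': sub.find('./ul/li[1]/img').get('src'),
            'temperature': text_of(sub.find('./ul/li[2]')),
            'weather': text_of(sub.find('./ul/li[3]')),
            'wind': text_of(sub.find('./ul/li[4]')),
        })
    return rows


rows = parse_forecast(SAMPLE)
print(rows[0])
```

The same two-step idea applies in the spider: first select the repeating container, then read each field with a path that starts at `./`.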

3.1. Saving the results to a .txt file: pipelines.py

    # Save the scraped results
    import time
    import os.path
    from urllib import request


    class WeatherPipeline(object):
        def process_item(self, item, spider):
            today = time.strftime('%Y-%m-%d', time.localtime())
            fileName = today + '.txt'
            imgName = os.path.basename(item['img'])
            with open(fileName, 'a') as fp:
                fp.write(item['cityDate'] + '\t')
                fp.write(item['week'] + '\t')
                fp.write(imgName + '\t')
                # Download the weather icon only once. A separate handle is
                # used so that the text file handle fp is not overwritten.
                if not os.path.exists(imgName):
                    with open(imgName, 'wb') as imgFp:
                        response = request.urlopen(item['img'])
                        imgFp.write(response.read())
                fp.write(item['temperature'] + '\t')
                fp.write(item['weather'] + '\t')
                fp.write(item['wind'] + '\t\n')
            time.sleep(1)
            return item
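The record-writing part of this pipeline can be tried on its own, without running a crawl. The sketch below is an illustration under assumptions: `format_record` and `append_record` are hypothetical helper names, the sample values are made up, the image download is left out, and the file is written to a temporary folder rather than the project root.

```python
import os
import tempfile
import time


def format_record(item):
    """Build one tab-separated line from the item's fields."""
    fields = [
        item['cityDate'],
        item['week'],
        os.path.basename(item['img']),  # keep only the image file name
        item['temperature'],
        item['weather'],
        item['wind'],
    ]
    return '\t'.join(fields) + '\n'


def append_record(item, directory):
    """Append the record to '<directory>/<YYYY-MM-DD>.txt' and return the path."""
    file_name = time.strftime('%Y-%m-%d', time.localtime()) + '.txt'
    path = os.path.join(directory, file_name)
    with open(path, 'a', encoding='utf-8') as fp:
        fp.write(format_record(item))
    return path


# Made-up sample values, for illustration only.
sample = {
    'cityDate': 'Wuhan 06-01', 'week': 'Thursday',
    'img': 'http://img.example/sunny.png', 'temperature': '30~22',
    'weather': 'Sunny', 'wind': 'Light breeze',
}
out_dir = tempfile.mkdtemp()
path = append_record(sample, out_dir)
with open(path, encoding='utf-8') as fp:
    line = fp.read()
print(repr(line))
```

Passing `encoding='utf-8'` explicitly is a choice made here so that Chinese text is written the same way on Windows and Linux.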

3.2. Saving the results to a MySQL database

The MySQL database used in this project is named scrapyDB. The table is created with:

    create table weather(
        id int auto_increment,
        cityDate char(24),
        week char(6),
        img char(20),
        temperature char(12),
        weather char(20),
        wind char(20),
        PRIMARY KEY(id)
    );

The code for pipelines2mysql.py is:

    import pymysql
    import os.path


    class WeatherPipeline(object):
        def process_item(self, item, spider):
            cityDate = item['cityDate']
            week = item['week']
            img = os.path.basename(item['img'])
            temperature = item['temperature']
            weather = item['weather']
            wind = item['wind']

            # Replace 'yourPassword' with your own MySQL password
            conn = pymysql.connect(host='localhost', port=3306, user='root',
                                   passwd='yourPassword', db='scrapyDB',
                                   charset='utf8')
            cur = conn.cursor()
            # Parameterized insert: pymysql fills in the %s placeholders
            cur.execute("insert into weather(cityDate,week,img,temperature,weather,wind) values (%s,%s,%s,%s,%s,%s)",
                        (cityDate, week, img, temperature, weather, wind))
            cur.close()
            conn.commit()
            conn.close()
            return item
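The pipeline above opens and closes a connection for every item. One possible variation is to open the connection once in `open_spider` and close it in `close_spider`, two optional hooks that Scrapy calls on item pipelines at the start and end of a crawl. The sketch below shows that shape only. To keep it runnable without a MySQL server it swaps in the standard library's sqlite3 with an in-memory database, so the SQL types and the `?` placeholders differ from the pymysql version. `WeatherSqlitePipeline` is a hypothetical name and the sample item is made up.

```python
import os.path
import sqlite3


class WeatherSqlitePipeline(object):
    """Pipeline-shaped class; sqlite3 is only a stand-in for MySQL here."""

    def open_spider(self, spider):
        # Called once when the crawl starts: open the connection a single time.
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute(
            'create table weather('
            'id integer primary key autoincrement, cityDate text, week text, '
            'img text, temperature text, weather text, wind text)'
        )

    def process_item(self, item, spider):
        # Parameterized insert; sqlite3 uses '?' where pymysql uses '%s'.
        self.conn.execute(
            'insert into weather(cityDate, week, img, temperature, weather, wind) '
            'values (?, ?, ?, ?, ?, ?)',
            (item['cityDate'], item['week'], os.path.basename(item['img']),
             item['temperature'], item['weather'], item['wind'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.conn.close()


# Drive the pipeline by hand with a made-up item; no crawl is needed.
pipeline = WeatherSqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({
    'cityDate': 'Wuhan 06-01', 'week': 'Thursday',
    'img': 'http://img.example/sunny.png', 'temperature': '30~22',
    'weather': 'Sunny', 'wind': 'Light breeze',
}, spider=None)
stored = pipeline.conn.execute('select cityDate, img from weather').fetchall()
print(stored)
pipeline.close_spider(spider=None)
```

With pymysql the same structure would apply, with `pymysql.connect(...)` in `open_spider` and `%s` placeholders in the insert statement.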

4. Dispatching the work: settings.py

    BOT_NAME = 'weather'

    SPIDER_MODULES = ['weather.spiders']
    NEWSPIDER_MODULE = 'weather.spiders'

    ITEM_PIPELINES = {
        'weather.pipelines.WeatherPipeline': 1,
        'weather.pipelines2mysql.WeatherPipeline': 2,
    }

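A note on ITEM_PIPELINES: the numbers only set the order. Any integer works (by convention they are kept in the 0-1000 range), and pipelines with smaller numbers run first.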
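The ordering rule can be illustrated in plain Python by sorting the pipeline paths by their numbers. This mirrors the rule only; it is not Scrapy's internal code, and `run_order` is a name chosen for this sketch.

```python
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2mysql.WeatherPipeline': 2,
}

# Sort the class paths by their assigned number, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these values each item goes through the .txt pipeline first and the MySQL pipeline second.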

5. The configuration file: scrapy.cfg

    [settings]
    default = weather.settings

    [deploy]
    project = weather

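The configuration file states the project name and points to the default settings module. The two __init__.py files in the project are both empty; they are kept so that the folders containing them can be imported as Python packages.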
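scrapy.cfg is an ordinary INI-style file, so its two entries can be inspected with the standard library's configparser. The sketch below reads the same content from a string instead of from disk. The variable names are chosen for this illustration, and this is not a description of how Scrapy itself loads the file.

```python
import configparser

# Same content as scrapy.cfg, held in a string for this illustration.
CFG_TEXT = """
[settings]
default = weather.settings

[deploy]
project = weather
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

settings_module = parser.get('settings', 'default')  # dotted path to settings.py
project_name = parser.get('deploy', 'project')
print(settings_module, project_name)
```

The value `weather.settings` is a dotted module path, which is why the weather folder needs its `__init__.py`.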

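6. How to run it

Open cmd and cd into the project folder, that is, the folder that contains scrapy.cfg in the directory layout above, then run: scrapy crawl wuHanSpider

When the crawl finishes, a file named after the current date, such as "-06-01.txt", appears in the project root. It holds the weather forecast for the coming week. The matching weather icons are downloaded as well, and the data is also saved to the MySQL database. The wuHanSpider in the command is the value of name="wuHanSpider" in the WuHanSpider class; if you change name, the command changes accordingly.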

This post draws on the book 《Python网络爬虫实战》 (Python Web Crawler in Practice), which uses Python 2.x on Linux. Readers running Python 3.x on Windows can follow this post instead.
