
python3 [Web Scraping for Beginners in Practice] A First Try at Scraping the 传智播客 Instructor List with Scrapy

Time: 2024-01-24 21:21:19

Thoughts:

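I kept putting off learning Scrapy. At first I worked from an illustrated tutorial, read it two or three times, and stumbled through it one step at a time. It was only after spending all of last evening watching part of the 黑马程序员 course on the Scrapy framework that I slowly came to understand how to use Scrapy to scrape data from the web. My advice: if the written material really isn't clicking, watch a bit of video; it is easier to follow.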

First, a screenshot of the crawl results:

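Overall, learning Scrapy has been pretty hard going for a brain like mine, and this is only the beginning; there are plenty of pitfalls still ahead.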

What gets scraped: name, instructor type, instructor description

Working from the data we want to scrape, there are two steps.

Step 1: in items.py, write a class that inherits from scrapy.Item,
as follows:

import scrapy


class TeacherItem(scrapy.Item):
    # define the fields for your item here like:
    # Instructor name
    teacherName = scrapy.Field()
    # Instructor type (senior, etc.)
    teacherType = scrapy.Field()
    # Instructor description
    teacherDesc = scrapy.Field()
    # Instructor photo
    teacherImg = scrapy.Field()

Step 2: write the spider class, which inherits from scrapy.Spider. All of the main logic lives here.

The screenshot above shows the result, so take a look.

Here is the code:

#encoding=utf8
import scrapy
from myterminalproject.items import TeacherItem


class GuoKeSpider(scrapy.Spider):
    # Name used when launching the spider
    name = "myItcast"
    # Crawl scope
    allowed_domains = [""]  # only crawl within this domain
    # First URL the spider requests
    start_urls = ['/channel/teacher.shtml']
    # start_urls = ['/ask/hottest/?page={}'.format(n) for n in range(1, 8)] + [
    #     '/ask/highlight/?page={}'.format(m) for m in range(1, 101)]

    def parse(self, response):
        print(response.body)
        node_list = response.xpath("//div[@class='li_txt']")
        # Holds every item that gets built
        items = []
        for node in node_list:
            # Create one item per loop iteration to store the data
            item = TeacherItem()
            # .extract() converts the XPath selector results into a list of Unicode strings
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()
            item['teacherName'] = name[0]
            item['teacherType'] = title[0]
            item['teacherDesc'] = info[0]
            items.append(item)
        return items  # returned to the engine here
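To get a feel for what the loop in parse does without hitting the network, here is a small offline sketch. It is not Scrapy: it swaps in the standard library's xml.etree.ElementTree, which supports a limited XPath subset, and runs against a made-up snippet whose structure (div.li_txt containing h3, h4, and p) is only assumed to resemble the real page. The names SAMPLE_HTML and extract_teachers are hypothetical, and the teacher data is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup that mimics the assumed page structure.
SAMPLE_HTML = """
<div>
  <div class="li_txt">
    <h3>Teacher A</h3>
    <h4>Senior Instructor</h4>
    <p>Ten years of Java experience.</p>
  </div>
  <div class="li_txt">
    <h3>Teacher B</h3>
    <h4>Instructor</h4>
    <p>Focuses on Python and web scraping.</p>
  </div>
</div>
"""


def extract_teachers(markup):
    """Return one dict per div.li_txt node, mirroring the spider's fields."""
    root = ET.fromstring(markup)
    teachers = []
    # Same idea as response.xpath("//div[@class='li_txt']") in the spider.
    for node in root.findall(".//div[@class='li_txt']"):
        teachers.append({
            # findtext returns None when the child is missing, so fall back
            # to "" instead of raising the way name[0] would on an empty list.
            "teacherName": (node.findtext("h3") or "").strip(),
            "teacherType": (node.findtext("h4") or "").strip(),
            "teacherDesc": (node.findtext("p") or "").strip(),
        })
    return teachers


if __name__ == "__main__":
    for teacher in extract_teachers(SAMPLE_HTML):
        print(teacher)
```

One thing this makes visible: in the spider, name[0] raises IndexError if a node has no h3 text, so falling back to a default value as above is probably the safer habit.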

The item pipeline is not used here for now.

The settings file was left unchanged as well.
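For a later step, an item pipeline might look roughly like the sketch below. This is not part of the project above. It assumes the long-documented pipeline hooks open_spider, close_spider, and process_item(item, spider); the class name TeacherJsonLinesPipeline and the output file teachers.jl are made up for illustration. A pipeline is a plain Python class, so the sketch does not need to import scrapy.

```python
import json


class TeacherJsonLinesPipeline:
    """Hypothetical pipeline: writes each scraped item as one JSON line."""

    def __init__(self, path="teachers.jl"):
        # Output path is an assumption; pick whatever suits the project.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and plain dicts alike;
        # ensure_ascii=False keeps Chinese text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline
```

To take effect it would have to be registered under ITEM_PIPELINES in settings.py, which is exactly the file this walkthrough leaves untouched.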