Python High-Concurrency Crawler Benchmark (2): Multiprocessing, Multithreading, or Async Coroutines, Which Is Faster?

Posted: 2018-10-17 17:47:42

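In the article "Benchmarking httpx, the next-generation HTTP library for Python crawlers, and the parsel parsing library", we compared a synchronous requests crawler with an asynchronous httpx coroutine crawler by timing how long each took to scrape second-hand housing listings from Lianjia (580 records in total). In that test the synchronous httpx crawler took 16.1 seconds, while the asynchronous httpx crawler took only 2.5 seconds.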

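So here is the question. Async coroutines are not the only way to make a crawler highly concurrent: a traditional synchronous crawler combined with multiple processes or multiple threads can also be far more efficient. So which is actually faster: multiprocessing, multithreading, or async coroutines? Of course, for a real-world crawler, the faster you crawl, the more likely you are to get blocked. This benchmark uses httpx to scrape the same Lianjia data and ignores anti-scraping measures; the results may vary with your machine and with the site being crawled.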

Before we start, can you predict which crawler is faster? The result may overturn your expectations.

Traditional crawler vs. async coroutine crawler

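Traditional Python crawlers run in a single process and a single thread, and so does the httpx async coroutine crawler. If you are not clear on the difference between processes and threads, or on how Python does multiprocess and multithreaded programming, read the article below, which has been bookmarked more than 1,000 times on Zhihu.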

Understanding Python multiprocessing and multithreading in one article (a must-read for work, study, and interviews)

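A traditional web crawler typically works like this: it first fetches the maximum page number shown on the target page, then loops over every page, crawling and parsing each one in turn. The requests of a single-process, single-thread synchronous crawler are blocking: no new request is sent until the previous one has been fully handled, so a lot of time is wasted waiting.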
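To make the blocking pattern concrete, here is a minimal sketch rather than the original crawler. The names fetch_page and crawl_sync are made up for illustration, and time.sleep stands in for the network wait that a real httpx.get call would incur.

```python
import time


def fetch_page(page, delay=0.05):
    """Hypothetical stand-in for one blocking request plus parsing."""
    time.sleep(delay)  # simulated network wait; nothing else runs meanwhile
    return {"page": page}


def crawl_sync(max_page, delay=0.05):
    """Crawl pages 1..max_page one after another."""
    results = []
    for page in range(1, max_page + 1):
        # The next request is only sent after this one has fully finished.
        results.append(fetch_page(page, delay))
    return results


if __name__ == "__main__":
    start = time.perf_counter()
    pages = crawl_sync(5)
    elapsed = time.perf_counter() - start
    print("{} pages in {:.2f} seconds".format(len(pages), elapsed))
```

With this structure the total time grows roughly as the number of pages times the wait per page, which is the waste the concurrent versions try to remove.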

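Although the httpx async coroutine crawler also runs in a single process and a single thread, all of its async tasks are added to an event loop. Hundreds or even thousands of tasks can be active at once; as soon as one of them has to wait, the loop quickly switches to the next one, which is why async coroutines are so much faster.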

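To turn the synchronous crawler described above into an async coroutine crawler, we first use async to wrap the crawling and parsing of a single page as an asynchronous task, and send the requests asynchronously with the AsyncClient provided by httpx.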

Then, in the main function parse_page, we use asyncio to get the event loop, add the list of single-page async tasks to it, and run them.
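The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original code: fetch_page and crawl_async are hypothetical names, asyncio.sleep stands in for awaiting a request sent through httpx's AsyncClient, and asyncio.run with asyncio.gather is used here as a shorter way to create the event loop and run the task list on it.

```python
import asyncio


async def fetch_page(page, delay=0.05):
    """Hypothetical async task for one page (simulated request and parse)."""
    # While this task waits, the event loop switches to another task.
    await asyncio.sleep(delay)
    return {"page": page}


async def crawl_async(max_page, delay=0.05):
    """Schedule one task per page and wait for all of them."""
    tasks = [fetch_page(page, delay) for page in range(1, max_page + 1)]
    # gather returns the results in the same order as the tasks.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs the coroutine, and closes the loop.
    results = asyncio.run(crawl_async(5))
    print("{} pages crawled".format(len(results)))
```

Because every task spends its time waiting, the whole batch takes about as long as one wait, not the sum of all of them.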

Multiprocess crawler

For the multiprocess crawler, we first define a synchronous task that crawls and parses a single page.

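Next, in the main function parse_page, we use the process pool Pool from the multiprocessing library to manage the tasks. The usual recommendation is to set the number of processes in the pool to the number of CPU cores, because each process runs on one core and extra processes add no further parallel computing capacity. We create the tasks with the map method; you can also add them with apply_async. Once the tasks are created, the pool manages when they start and finish, and you do not need to do anything else. With the pool size set to 4, four processes run at the same time and four requests can be handled at once.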
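As a rough illustration of the two ways of submitting work to a pool, here is a small sketch. It is not the crawler itself: fetch_page is a hypothetical task that sleeps to simulate a request, and the helper names crawl_with_map and crawl_with_apply_async are made up for this example.

```python
import time
from multiprocessing import Pool


def fetch_page(page, delay=0.05):
    """Hypothetical single-page task (simulated blocking request)."""
    time.sleep(delay)
    return {"page": page}


def crawl_with_map(pages, processes=4):
    """Submit all pages at once; map blocks until every result is back."""
    with Pool(processes=processes) as pool:
        return pool.map(fetch_page, pages)


def crawl_with_apply_async(pages, processes=4):
    """Submit pages one by one; each call returns an AsyncResult immediately."""
    with Pool(processes=processes) as pool:
        pending = [pool.apply_async(fetch_page, args=(page,)) for page in pages]
        # get() must be called before the pool is closed by the with block.
        return [result.get() for result in pending]


if __name__ == "__main__":
    pages = list(range(1, 9))
    print(crawl_with_map(pages))
    print(crawl_with_apply_async(pages))
```

The task function is defined at module level so that it can be pickled and sent to the worker processes, and the pool is only started under the __main__ guard.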

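So how long did this multiprocess crawler take to scrape the 580 Lianjia records? The answer is 7.6 seconds, a good deal faster than the 16.1 seconds of the single-process, single-thread synchronous httpx crawler.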

The complete code of the project is shown below:

from fake_useragent import UserAgent
import csv
import json
import re
import time
from parsel import Selector
import httpx
from multiprocessing import Pool, cpu_count, Manager


class HomeLinkSpider(object):
    def __init__(self):
        # Processes do not share memory, so a Manager queue is used for communication.
        # Every process puts the data it scrapes into this queue; a self.data list would not work.
        # Child processes cannot get the self.headers variable, so headers are generated per request.
        # self.ua = UserAgent()
        # self.headers = {"User-Agent": self.ua.random}
        self.q = Manager().Queue()
        self.path = "浦东_三房_500_800万.csv"
        # The host part of the URL is missing in the source text.
        self.url = "/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = httpx.get(self.url, headers={"User-Agent": UserAgent().random})
        if response.status_code == 200:
            # Create a Selector instance
            selector = Selector(response.text)
            # Use a CSS selector to get the div box that holds the max page number
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # page-data is a JSON string; parse it with json.loads instead of eval
            max_page = json.loads(a[0].xpath('//@page-data').get())["totalPage"]
            print("Max page number: {}".format(max_page))
            return max_page
        else:
            print("Request failed, status: {}".format(response.status_code))
            return None

    # Parse a single page; takes the URL of that page
    def parse_single_page(self, url):
        print("Child process starts crawling: {}".format(url))
        response = httpx.get(url, headers={"User-Agent": UserAgent().random})
        selector = Selector(response.text)
        ul = selector.css('ul.sellListContent')[0]
        li_list = ul.css('li')
        for li in li_list:
            detail = dict()
            detail['title'] = li.css('div.title a::text').get()

            # 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
            house_info = li.css('div.houseInfo::text').get()
            house_info_list = house_info.split(" | ")

            detail['bedroom'] = house_info_list[0]
            detail['area'] = house_info_list[1]
            detail['direction'] = house_info_list[2]

            floor_pattern = re.compile(r'\d{1,2}')
            # re.search matches anywhere in the string
            match1 = re.search(floor_pattern, house_info_list[4])
            if match1:
                detail['floor'] = match1.group()
            else:
                detail['floor'] = "unknown"

            # Match the construction year
            year_pattern = re.compile(r'\d{4}')
            match2 = re.search(year_pattern, house_info_list[5])
            if match2:
                detail['year'] = match2.group()
            else:
                detail['year'] = "unknown"

            # 文兰小区 - 塘桥: extract the community name and the area
            position_info = li.css('div.positionInfo a::text').getall()
            detail['house'] = position_info[0]
            detail['location'] = position_info[1]

            # 650万: match 650
            price_pattern = re.compile(r'\d+')
            total_price = li.css('div.totalPrice span::text').get()
            detail['total_price'] = re.search(price_pattern, total_price).group()

            # 单价64182元/平米: match 64182
            unit_price = li.css('div.unitPrice span::text').get()
            detail['unit_price'] = re.search(price_pattern, unit_price).group()

            self.q.put(detail)

    def parse_page(self):
        max_page = self.get_max_page()
        if max_page is None:
            return
        print("CPU cores: {}".format(cpu_count()))

        # Use a process pool to manage the multiprocess tasks
        with Pool(processes=4) as pool:
            urls = ['/ershoufang/pudong/pg{}a3p5/'.format(i) for i in range(1, max_page + 1)]
            # pool.apply_async(self.parse_single_page, args=(url,)) also works
            pool.map(self.parse_single_page, urls)

    def write_csv_file(self):
        head = ["Title", "Community", "Rooms", "Area", "Orientation", "Floor", "Year", "Location",
                "Total price (10k CNY)", "Unit price (CNY/m²)"]
        keys = ["title", "house", "bedroom", "area", "direction", "floor", "year", "location",
                "total_price", "unit_price"]

        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)

                # While the queue is not empty, write one row per item
                while not self.q.empty():
                    item = self.q.get()
                    if item:
                        row_data = []
                        for k in keys:
                            row_data.append(item[k])
                        writer.writerow(row_data)

                print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("Elapsed: {} seconds".format(end - start))

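Note: processes do not share memory. To share data between them, for example to store what different processes have scraped in one place, use the Manager().Queue() provided by Python's multiprocessing module.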
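The point of the note can be isolated in a small sketch, separate from the crawler. The function names worker and drain are hypothetical; the only parts taken from the article are the Manager queue and the process pool.

```python
from multiprocessing import Manager, Pool


def worker(queue, page):
    """Hypothetical task: put one scraped record into the shared queue."""
    queue.put({"page": page})


def drain(queue):
    """Collect everything currently in the queue into a list."""
    items = []
    while not queue.empty():
        items.append(queue.get())
    return items


if __name__ == "__main__":
    with Manager() as manager:
        # The queue lives in the manager process; workers receive a proxy to it.
        shared_queue = manager.Queue()
        with Pool(processes=2) as pool:
            pool.starmap(worker, [(shared_queue, page) for page in range(1, 5)])
        # Items arrive in whatever order the workers finished, so sort for display.
        print(sorted(item["page"] for item in drain(shared_queue)))
```

Appending to an ordinary list inside worker would only change the copy owned by that child process, which is why the parent needs the queue to see the results.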

Multithreaded crawler

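The function that crawls and parses a single page is the same as in the multiprocess crawler. The difference is that the main function parse_page creates the tasks with the threading module: it creates one Thread per page URL, starts them all, and then joins them.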

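We also do not need a Queue() to store the data scraped by each thread, because threads share memory. The multithreaded synchronous crawler scraped the 580 records in just 2.2 seconds, almost instantly, and even faster than the httpx async coroutine crawler!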
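Stripped of the scraping details, the thread-per-page pattern looks roughly like the sketch below. The names fetch_page and crawl_threaded are made up for illustration, and time.sleep again stands in for the real request.

```python
import threading
import time


def fetch_page(page, results, delay=0.05):
    """Hypothetical single-page task that writes into a shared list."""
    time.sleep(delay)  # simulated network wait
    # Threads share memory, so every thread appends to the same list object.
    # A single list.append call is safe across threads in CPython.
    results.append({"page": page})


def crawl_threaded(max_page, delay=0.05):
    """Start one thread per page, then wait for all of them to finish."""
    results = []
    threads = [
        threading.Thread(target=fetch_page, args=(page, results, delay))
        for page in range(1, max_page + 1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    records = crawl_threaded(5)
    # Completion order is not guaranteed, so sort before printing.
    print(sorted(r["page"] for r in records))
```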

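Why does it turn out this way? It is not hard to understand. For a task like crawling, most of the elapsed time is spent waiting, and the CPU has nothing to do while waiting, so giving the job two or four CPU cores through multiprocessing does not help much. Then why does multithreading help a crawler? Because when a Python thread hits a blocking wait, it releases the GIL immediately so that another thread can run, which gives fast switching between threads. This is the same idea as switching between async coroutine tasks, except that thread switching is done by the operating system while coroutine switching is done by the event loop.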
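The effect is easy to reproduce without any network access. The sketch below is an assumption-laden toy, not a benchmark: time.sleep plays the role of a blocking request (it releases the GIL while sleeping), and run_sequential and run_threaded are hypothetical helper names.

```python
import threading
import time


def wait_for_io(delay):
    """Simulated blocking I/O; the GIL is released while sleeping."""
    time.sleep(delay)


def run_sequential(n, delay):
    """Perform n waits one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        wait_for_io(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    """Perform n waits in n threads and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=wait_for_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential: {:.2f} s".format(run_sequential(5, 0.1)))
    print("threaded:   {:.2f} s".format(run_threaded(5, 0.1)))
```

The sequential version pays for every wait in full, while the threaded version overlaps them, so its total is close to a single wait. No extra CPU cores are involved in that gain.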

The complete multithreaded code is shown below:

from fake_useragent import UserAgent
import csv
import json
import re
import time
from parsel import Selector
import httpx
import threading


class HomeLinkSpider(object):
    def __init__(self):
        # Threads share memory, so a plain list can collect the results
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        # The host part of the URL is missing in the source text.
        self.url = "/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = httpx.get(self.url, headers={"User-Agent": UserAgent().random})
        if response.status_code == 200:
            # Create a Selector instance
            selector = Selector(response.text)
            # Use a CSS selector to get the div box that holds the max page number
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # page-data is a JSON string; parse it with json.loads instead of eval
            max_page = json.loads(a[0].xpath('//@page-data').get())["totalPage"]
            print("Max page number: {}".format(max_page))
            return max_page
        else:
            print("Request failed, status: {}".format(response.status_code))
            return None

    # Parse a single page; takes the URL of that page
    def parse_single_page(self, url):
        print("Thread starts crawling: {}".format(url))
        response = httpx.get(url, headers={"User-Agent": UserAgent().random})
        selector = Selector(response.text)
        ul = selector.css('ul.sellListContent')[0]
        li_list = ul.css('li')
        for li in li_list:
            detail = dict()
            detail['title'] = li.css('div.title a::text').get()

            # 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
            house_info = li.css('div.houseInfo::text').get()
            house_info_list = house_info.split(" | ")

            detail['bedroom'] = house_info_list[0]
            detail['area'] = house_info_list[1]
            detail['direction'] = house_info_list[2]

            floor_pattern = re.compile(r'\d{1,2}')
            # re.search matches anywhere in the string
            match1 = re.search(floor_pattern, house_info_list[4])
            if match1:
                detail['floor'] = match1.group()
            else:
                detail['floor'] = "unknown"

            # Match the construction year
            year_pattern = re.compile(r'\d{4}')
            match2 = re.search(year_pattern, house_info_list[5])
            if match2:
                detail['year'] = match2.group()
            else:
                detail['year'] = "unknown"

            # 文兰小区 - 塘桥: extract the community name and the area
            position_info = li.css('div.positionInfo a::text').getall()
            detail['house'] = position_info[0]
            detail['location'] = position_info[1]

            # 650万: match 650
            price_pattern = re.compile(r'\d+')
            total_price = li.css('div.totalPrice span::text').get()
            detail['total_price'] = re.search(price_pattern, total_price).group()

            # 单价64182元/平米: match 64182
            unit_price = li.css('div.unitPrice span::text').get()
            detail['unit_price'] = re.search(price_pattern, unit_price).group()

            self.data.append(detail)

    def parse_page(self):
        max_page = self.get_max_page()
        if max_page is None:
            return

        # Create one thread per page, start them all, then wait for all to finish
        thread_list = []
        for i in range(1, max_page + 1):
            url = '/ershoufang/pudong/pg{}a3p5/'.format(i)
            t = threading.Thread(target=self.parse_single_page, args=(url,))
            thread_list.append(t)

        for t in thread_list:
            t.start()

        for t in thread_list:
            t.join()

    def write_csv_file(self):
        head = ["Title", "Community", "Rooms", "Area", "Orientation", "Floor", "Year", "Location",
                "Total price (10k CNY)", "Unit price (CNY/m²)"]
        keys = ["title", "house", "bedroom", "area", "direction", "floor", "year", "location",
                "total_price", "unit_price"]

        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
                print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("Elapsed: {} seconds".format(end - start))

Conclusion

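Multiprocessing, multithreading, and async coroutines can all make a Python crawler more efficient. For work like crawling that is not compute-intensive, multiprocessing improves efficiency less than multithreading or async coroutines do. An async crawler is not always the fastest: a synchronous crawler with multithreading can be just as fast, and sometimes even faster.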

httpx sync + parsel: 16.1 seconds

httpx async + parsel: 2.5 seconds

httpx sync, multiprocess + parsel: 7.6 seconds

httpx sync, multithreaded + parsel: 2.2 seconds
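To put the four timings on one scale, the short sketch below divides the synchronous baseline by each result. The dictionary keys and the function name speedups are just labels chosen for this example; the numbers are the ones reported above.

```python
# Timings in seconds, as reported in this benchmark.
TIMINGS = {
    "sync": 16.1,
    "async": 2.5,
    "multiprocess": 7.6,
    "multithreaded": 2.2,
}


def speedups(timings, baseline_key):
    """Return how many times faster each run is than the baseline run."""
    baseline = timings[baseline_key]
    return {name: round(baseline / seconds, 1) for name, seconds in timings.items()}


if __name__ == "__main__":
    for name, factor in speedups(TIMINGS, "sync").items():
        print("{}: {}x".format(name, factor))
```

These ratios only describe this one run on one machine and one site, so they should be read as a rough ordering, not as fixed constants.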

Are you happy with these results? Feel free to leave a comment!

大江狗

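Recommended reading

A must-read: what is the difference between synchronous Python and asynchronous Python?

Benchmarking httpx, the next-generation HTTP library for Python crawlers, and the parsel parsing library

Understanding Python multiprocessing and multithreading in one article (a must-read for work, study, and interviews)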