
A Simple Python Crawler: Scraping the Top 50 Douyu Streamers by Popularity

Posted: 2018-07-21 02:13:51


1. URL Analysis

I chose the Douyu live-streaming category page for Honor of Kings (王者荣耀): /directory/game/wzry

I play Honor of Kings myself and occasionally watch the streams.

2. Fetching the Page

First, import two modules (search online for installation instructions; installing through PyCharm is much easier):

from bs4 import BeautifulSoup
import urllib.request

Then give the request a URL:

url = '/directory/game/wzry'

Because the site blocks crawlers, we also need to fake a request header (you can look up your own User-Agent in the browser):

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
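Before making real requests, it can help to confirm offline that the header is actually attached. A minimal sketch, assuming a placeholder domain (the post only gives the relative path):

```python
import urllib.request

# placeholder domain: the post only gives the relative path /directory/game/wzry
url = 'https://example.com/directory/game/wzry'
header = {'User-Agent': 'Mozilla/5.0 (test)'}
req = urllib.request.Request(url, headers=header)

# urllib capitalizes header names internally, so the key is stored as 'User-agent'
print(req.get_header('User-agent'))  # Mozilla/5.0 (test)
```

No network call is made until `urlopen` runs, so this check is safe to run anywhere.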

The complete page-fetching code:

import urllib.request
from bs4 import BeautifulSoup

def __fetch_content(url):
    # url = '/directory/game/wzry'  # the URL is passed in from the main block, so this is commented out
    print(url)
    # the site blocks crawlers, so build a plausible HTTP request header
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    req = urllib.request.Request(url, headers=header)
    # fetch the page content
    r = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(r, 'html.parser')

3. Extracting the Information We Need

Find the streamers' names, popularity counts, and room links:

divList = soup.findAll("span", attrs={"class": "dy-name ellipsis fl"})  # streamer names
name = soup.findAll("span", attrs={"class": "dy-num fr"})               # popularity counts
link = soup.findAll("a", attrs={"class": "play-list-link"})             # room links
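To see what these three findAll calls return without hitting Douyu, here is a sketch against a tiny hand-written snippet that imitates the list markup. The HTML structure is an assumption; only the class names come from the post:

```python
from bs4 import BeautifulSoup

# hand-written HTML imitating Douyu's room list (structure assumed, class names from the post)
html = """
<ul>
  <li><a class="play-list-link" href="/room1">
      <span class="dy-name ellipsis fl">StreamerA</span>
      <span class="dy-num fr">120.5万</span></a></li>
  <li><a class="play-list-link" href="/room2">
      <span class="dy-name ellipsis fl">StreamerB</span>
      <span class="dy-num fr">98万</span></a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
names = soup.findAll("span", attrs={"class": "dy-name ellipsis fl"})
nums = soup.findAll("span", attrs={"class": "dy-num fr"})
links = soup.findAll("a", attrs={"class": "play-list-link"})

print([n.string for n in names])  # ['StreamerA', 'StreamerB']
print(links[0].get("href"))       # /room1
```

Note that passing a space-separated string in `attrs={"class": ...}` matches the exact class attribute value, which is why it pairs well with class strings copied straight from the page source.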

Use a for loop to print the top 50 streamers by popularity:

for i in range(0, 50):
    print(divList[i].string)
    print(name[i].string)
    print("" + link[i].get("href"))
    print("-------------")
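The page already lists rooms in popularity order, so taking the first 50 works. Note that Douyu displays large counts with the 万 (10,000) suffix; if you ever need to sort numerically yourself, a small hypothetical converter could look like this:

```python
def popularity_to_int(s):
    # Douyu-style counts: '120.5万' means 120.5 * 10,000; smaller counts are plain numbers
    s = s.strip()
    if s.endswith('万'):
        return int(float(s[:-1]) * 10000)
    return int(float(s))

print(popularity_to_int('120.5万'))  # 1205000
print(popularity_to_int('9876'))     # 9876
```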

4. Save the data as a text file:

with open('D:\\douyu.txt', mode='a', encoding='utf-8') as jb:
    jb.write(divList[i].string)
    jb.write("\n")
    jb.write(name[i].string)
    jb.write("\n")
    jb.write("" + link[i].get("href"))
    jb.write("\n")
    jb.write("\n")
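A plain text file works, but the same data is easier to reuse later as CSV. A sketch of that alternative, using made-up sample rows and a portable temp path instead of the hard-coded D:\douyu.txt:

```python
import csv
import os
import tempfile

# sample rows standing in for the scraped (name, popularity, room link) triples
rows = [("StreamerA", "120.5万", "/room1"), ("StreamerB", "98万", "/room2")]
path = os.path.join(tempfile.gettempdir(), "douyu.csv")

with open(path, "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f)
    w.writerow(["name", "popularity", "room"])  # header row
    w.writerows(rows)

with open(path, encoding="utf-8") as f:
    print(f.read().splitlines()[1])  # StreamerA,120.5万,/room1
```

`newline=""` is the documented way to open CSV files on Windows so the csv module controls line endings itself.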

5. Full Code

import urllib.request
from bs4 import BeautifulSoup

def __fetch_content(url):
    print(url)
    # the site blocks crawlers, so build a plausible HTTP request header
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    req = urllib.request.Request(url, headers=header)
    # fetch the page content
    r = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(r, 'html.parser')
    # find the streamers' names, popularity counts, and room links
    divList = soup.findAll("span", attrs={"class": "dy-name ellipsis fl"})
    name = soup.findAll("span", attrs={"class": "dy-num fr"})
    link = soup.findAll("a", attrs={"class": "play-list-link"})
    # print the top fifty streamers and their room links
    for i in range(0, 50):
        print(divList[i].string)
        print(name[i].string)
        print("" + link[i].get("href"))
        print("-------------")
        # append the data to a text file
        with open('D:\\douyu.txt', mode='a', encoding='utf-8') as jb:
            jb.write(divList[i].string)
            jb.write("\n")
            jb.write(name[i].string)
            jb.write("\n")
            jb.write("" + link[i].get("href"))
            jb.write("\n")
            jb.write("\n")

if __name__ == "__main__":
    url = '/directory/game/wzry'
    __fetch_content(url)
