
Scraping Attraction Data from Qunar (去哪儿网) with a Python Crawler to See Where the National Day Holiday Is Most Crowded

Time: 2023-12-18 12:44:14


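Abstract

This article shows how to use Python to scrape attraction information from Qunar, parse it with BeautifulSoup to extract each attraction's name, ticket sales, star rating, popularity and other data, write the results to Excel with the xlrd, xlwt and xlutils libraries, visualize the data with matplotlib, and finally generate a heat map with Baidu's heatmap.js.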



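The rest of the article walks through the implementation step by step.

Preparing the province list

The site is searched province by province, so we need a list of every province in the country. I have already prepared one: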

北京市,天津市,上海市,重庆市,河北省,山西省,辽宁省,吉林省,黑龙江省,江苏省,浙江省,安徽省,福建省,江西省,山东省,河南省,湖北省,湖南省,广东省,海南省,四川省,贵州省,云南省,陕西省,甘肃省,青海省,台湾省,内蒙古自治区,广西壮族自治区,西藏自治区,宁夏回族自治区,新疆维吾尔自治区,香港,澳门

Save this data as a TXT file, all on one line.

Then load it with Python:

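def ProvinceInfo(province_path):
    """Read the comma-separated province names from a UTF-8 text file."""
    tlist = []
    with open(province_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Strip the trailing newline and skip empty entries; extend instead
            # of overwriting so a file with several lines also works.
            tlist.extend(name.strip() for name in line.split(',') if name.strip())
    return tlist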

Building the URL

The URL is generated from the name of the place being searched:

from urllib.parse import quote

site_name = quote(province_name)  # percent-encode the Chinese characters
url1 = '/ticket/list.htm?keyword='
url2 = '&region=&from=mps_search_suggest&page='
url = url1 + site_name + url2
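To make the encoding step concrete, here is a minimal sketch of what `quote` does to a Chinese keyword and how the page number is appended. The helper name `build_page_url` is invented for this example, and the path stays relative because the host is not given above.

```python
from urllib.parse import quote

URL_PREFIX = '/ticket/list.htm?keyword='
URL_SUFFIX = '&region=&from=mps_search_suggest&page='


def build_page_url(province_name, page):
    """Hypothetical helper: percent-encode the keyword, then append the page number."""
    return URL_PREFIX + quote(province_name) + URL_SUFFIX + str(page)


# quote() encodes the keyword as UTF-8 bytes and writes each byte as %XX.
encoded = quote('北京市')
print(encoded)
print(build_page_url('北京市', 1))
```

Each Chinese character becomes three percent-encoded bytes, so the keyword can be placed safely in the query string.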

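This is not yet the final URL. A search returns many pages of results, so we have to address a specific page, which in turn means finding out how many pages there are. That is covered further down.

The function that fetches one page of results: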

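import logging
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


# Fetch the attraction entries on one page
def GetPageSite(url):
    try:
        page = urlopen(url)
    except URLError:  # urlopen reports network and HTTP failures as URLError
        logging.info('Fetch failed!')
        return 'ERROR'
    try:
        bs_obj = BeautifulSoup(page.read(), 'lxml')
        result_list = bs_obj.find('div', {'class': 'result_list'})
        # The page does not exist
        if len(result_list.contents) <= 0:
            logging.info('The current page has no data!')
            return 'NoPage'
        else:
            page_site_info = result_list.children
    except AttributeError:
        # find() returned None: the result list is missing, so we were blocked
        logging.info('Access denied!')
        return None
    return page_site_info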

Getting the number of pages

import re


# Get the number of result pages
def GetPageNumber(url):
    try:
        page = urlopen(url)
    except URLError:
        logging.info('Fetch failed!')
        return 'ERROR'
    try:
        bs_obj = BeautifulSoup(page.read(), 'lxml')
        # The page does not exist
        if len(bs_obj.find('div', {'class': 'result_list'}).contents) <= 0:
            logging.info('The current page has no data!')
            return 'NoPage'
        else:
            page_site_info = bs_obj.find('div', {'class': 'pager'}).get_text()
    except AttributeError:
        logging.info('Access denied!')
        return None
    # Extract the page count: the first number after the last '...' in the pager
    page_num = re.findall(r'\d+\.?\d*', page_site_info.split('...')[-1])
    return int(page_num[0])
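The extraction step at the end of `GetPageNumber` can be tried in isolation. The sketch below assumes the pager text looks like the invented `sample` string (the real markup may differ), uses a simpler `\d+` pattern than the original, and returns `None` when no number is found. `parse_page_count` is a name made up for this example.

```python
import re


def parse_page_count(pager_text):
    """Return the first number after the last '...' in the pager text, or None."""
    tail = pager_text.split('...')[-1]
    numbers = re.findall(r'\d+', tail)
    if not numbers:
        return None
    return int(numbers[0])


# Invented sample: previous-page link, pages 1 to 5, an ellipsis, the last page.
sample = '上一页12345...79下一页'
page_count = parse_page_count(sample)
print(page_count)
```

Note that when the pager contains no `...` (only a few pages), the digits of all page links run together in the text, so this approach, like the original, only suits long result lists.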

Parsing the fetched data to extract the fields of interest

# Extract the fields of interest from one attraction element
def GetItem(site_info):
    site_items = {}  # attraction info
    site_info1 = site_info.attrs
    site_items['name'] = site_info1['data-sight-name']  # name
    site_items['position'] = site_info1['data-point']  # longitude,latitude
    # location: district followed by street address
    site_items['address'] = site_info1['data-districts'] + ' ' + site_info1['data-address']
    site_items['salenumber'] = site_info1['data-sale-count']  # tickets sold

    site_level = site_info.find('span', {'class': 'level'})
    if site_level:
        site_level = site_level.get_text()

    site_hot = site_info.find('span', {'class': 'product_star_level'})
    if site_hot:
        site_hot = site_hot.em.get_text()
        site_hot = site_hot.split(' ')[1]  # keep the value after the label

    site_price = site_info.find('span', {'class': 'sight_item_price'})
    if site_price:
        site_price = site_price.em.get_text()

    site_items['level'] = site_level
    site_items['site_hot'] = site_hot
    site_items['site_price'] = site_price
    return site_items

Fetching every page of data for one province, using the functions above

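import random
import time
from collections.abc import Iterable


# Fetch all attractions of one province
def GetProvinceSite(province_name):
    site_name = quote(province_name)  # percent-encode the Chinese characters
    url1 = '/ticket/list.htm?keyword='
    url2 = '&region=&from=mps_search_suggest&page='
    url = url1 + site_name + url2
    NAME = []        # attraction names
    POSITION = []    # coordinates
    ADDRESS = []     # addresses
    SALE_NUM = []    # tickets sold
    SALE_PRI = []    # prices
    STAR = []        # star ratings
    SITE_LEVEL = []  # popularity
    i = 0  # current page
    page_num = GetPageNumber(url + str(i + 1))  # number of pages
    logging.info('%s has %s pages' % (province_name, page_num))
    flag = True  # False when the crawl ends abnormally
    if not isinstance(page_num, int):
        # GetPageNumber returned 'ERROR', 'NoPage' or None
        return False, NAME, POSITION, ADDRESS, SALE_NUM, SALE_PRI, STAR, SITE_LEVEL
    while i < page_num:  # walk through the pages
        i = i + 1
        # Pause 1 to 5 seconds at random so the server does not block us
        time.sleep(1 + 4 * random.random())
        url_full = url + str(i)
        site_info = GetPageSite(url_full)
        # When access is denied, wait a while before retrying
        while site_info is None:
            wait_time = 60 + 540 * random.random()
            while wait_time >= 0:
                time.sleep(1)
                logging.info('Access denied, retrying in %s seconds' % wait_time)
                wait_time = wait_time - 1
            site_info = GetPageSite(url_full)
        if site_info == 'NoPage':  # finished
            logging.info('%s finished, stopping' % province_name)
            break
        elif site_info == 'ERROR':  # failed
            logging.info('%s failed, stopping' % province_name)
            flag = False
            break
        else:
            # Check that the returned object is usable
            if not isinstance(site_info, Iterable):
                logging.info('Page object is not iterable, skipping page %s' % i)
                continue
            else:
                # Collect the data of every attraction on the page
                for site in site_info:
                    info = GetItem(site)
                    NAME.append(info['name'])
                    POSITION.append(info['position'])
                    ADDRESS.append(info['address'])
                    SALE_NUM.append(info['salenumber'])
                    SITE_LEVEL.append(info['site_hot'])
                    SALE_PRI.append(info['site_price'])
                    STAR.append(info['level'])
                    logging.info('%s, page %s: got %s' % (province_name, i, info['name']))
    return flag, NAME, POSITION, ADDRESS, SALE_NUM, SALE_PRI, STAR, SITE_LEVEL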
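The wait-and-retry part of the loop above can be separated out and exercised without touching the network. The following is only a sketch under a few assumptions: `fetch_with_backoff` and `fake_fetch` are names invented here, a `None` result is treated as "blocked" as in `GetPageSite`, and, unlike the original loop, the number of retries is capped.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_retries=3, sleep=time.sleep, rng=random.random):
    """Call fetch(url); while it returns None, wait 60 to 600 seconds and retry.

    Returns the first non-None result, or None once max_retries is used up.
    """
    result = fetch(url)
    retries = 0
    while result is None and retries < max_retries:
        wait_time = 60 + 540 * rng()  # same range as the loop above
        sleep(wait_time)
        retries += 1
        result = fetch(url)
    return result


# Demo with a fake fetcher that is "blocked" twice before succeeding.
calls = []


def fake_fetch(url):
    calls.append(url)
    return None if len(calls) < 3 else ['page data']


waits = []  # record the waits instead of actually sleeping
result = fetch_with_backoff(fake_fetch, 'page-1', sleep=waits.append)
print(result, len(calls), len(waits))
```

Passing `sleep` and `rng` as parameters is a design choice for testability; in the crawler itself the defaults would simply be used.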

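The last step is writing the data to Excel. The data set is large and is written once per province, after that province has been fetched. Since the crawl can be interrupted for all sorts of reasons, each run checks the Excel file to see whether the current province has already been scraped.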

import os

import xlrd
import xlwt
from xlutils.copy import copy


# Create the Excel file with one sheet per province and a header row
def CreateExcel(path, sheets, title):
    try:
        logging.info('Creating Excel file: %s' % path)
        book = xlwt.Workbook()
        for sheet_name in sheets:
            sheet = book.add_sheet(sheet_name, cell_overwrite_ok=True)
            for index, item in enumerate(title):
                sheet.write(0, index, item, set_style('Times New Roman', 220, True))
        book.save(path)
    except IOError:
        return 'Failed to create the Excel file!'


# Build the cell style
def set_style(name, height, bold=False):
    style = xlwt.XFStyle()  # initialize the style
    font = xlwt.Font()  # create a font for the style
    font.name = name  # e.g. 'Times New Roman'
    font.bold = bold
    font.color_index = 4
    font.height = height
    style.font = font
    return style


# Load the Excel file and return a writable copy
def LoadExcel(path):
    logging.info('Loading Excel file: %s' % path)
    book = xlrd.open_workbook(path)
    copy_book = copy(book)
    return copy_book


# Check whether a sheet already holds data beyond the header row
def ExistContent(book, sheet_name):
    sheet = book.get_sheet(sheet_name)
    return len(sheet.get_rows()) >= 2


# Write to Excel and save; content is a list of columns
def WriteToTxcel(book, sheet_name, content, path):
    logging.info('Writing %s data to (%s-%s)' % (sheet_name, os.path.basename(path), sheet_name))
    sheet = book.get_sheet(sheet_name)
    for index, item in enumerate(content):
        for sub_index, sub_item in enumerate(item):
            sheet.write(sub_index + 1, index, sub_item)
    book.save(path)
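The resume logic comes down to skipping every province whose sheet already has rows below the header. A minimal sketch of that decision, without the Excel libraries: `pending_provinces` is an invented name, and the `rows_per_sheet` dictionary is a stand-in for what `ExistContent` reads from the workbook.

```python
def pending_provinces(all_provinces, rows_per_sheet):
    """Return the provinces whose sheet holds only the header row (or nothing)."""
    return [p for p in all_provinces if rows_per_sheet.get(p, 0) < 2]


# Invented state: one province finished, one with just a header, one untouched.
rows_per_sheet = {'北京市': 120, '天津市': 1}
todo = pending_provinces(['北京市', '天津市', '上海市'], rows_per_sheet)
print(todo)
```

The threshold of 2 mirrors `ExistContent`: one row is the header, so anything beyond it counts as scraped data.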

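Data analysis and visualization

With the steps above complete, the scraping work is done and the data is ready to be visualized. This stage reads the Excel file, loads each sheet (one per province), removes duplicate records, and then plots the data as required. The map visualization uses Baidu's heatmap.js, which first needs the longitude, latitude and related data of each attraction in JSON format.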

def GenerateJson(ExcelPath, JsonPath, SalePath, TransPos=False):
    try:
        if os.path.exists(JsonPath):
            os.remove(JsonPath)
        if os.path.exists(SalePath):
            os.remove(SalePath)
        sale_file = open(SalePath, 'a', encoding='utf-8')
        json_file = open(JsonPath, 'a', encoding='utf-8')
        book = xlrd.open_workbook(ExcelPath)
    except IOError as e:
        return e
    sheets = book.sheet_names()
    sumSale = {}  # total sales per province
    for sheet_name in sheets:
        sheet = book.sheet_by_name(sheet_name)
        # End indexes in xlrd are exclusive, so they are omitted here to keep
        # the last column and the last row.
        row_0 = sheet.row_values(0)  # header row
        # The header strings must match the titles written to the Excel file.
        level = sheet.col_values(row_0.index('销售量'), 1)  # sales column
        site_name = sheet.col_values(row_0.index('名称'), 1)  # name column
        if not TransPos:
            pos = sheet.col_values(row_0.index('经纬度'), 1)  # coordinates
            temp_sale = 0  # running sales total for this sheet
            for i, p in enumerate(pos):
                if int(level[i]) > 0:
                    lng = p.split(',')[0]
                    lat = p.split(',')[1]
                    lev = str(level[i])
                    temp_sale += int(level[i])
                    sale_temp = sheet_name + site_name[i] + ',' + lev
                    json_temp = ('{"lng":' + str(lng) + ',"lat":' + str(lat)
                                 + ',"count":' + lev + '},')
                    json_file.write(json_temp + '\n')
                    sale_file.write(sale_temp + '\n')
            sumSale[sheet_name] = temp_sale
        else:
            pass
    json_file.close()
    sale_file.close()
    return sumSale
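The deduplication step mentioned earlier is not shown in the function above. One possible way to do it, sketched here with the standard `json` module instead of string concatenation, is to keep a set of already seen (name, position) pairs. The function name `build_points`, the tuple layout and the sample rows are all assumptions made for illustration.

```python
import json


def build_points(rows):
    """rows: iterable of (name, 'lng,lat', sales) tuples; returns heat map points.

    Rows with no sales and repeated (name, position) pairs are dropped.
    """
    seen = set()
    points = []
    for name, position, sales in rows:
        count = int(sales)
        key = (name, position)
        if count <= 0 or key in seen:
            continue
        seen.add(key)
        lng, lat = (float(v) for v in position.split(','))
        points.append({'lng': lng, 'lat': lat, 'count': count})
    return points


# Invented sample rows: a duplicate and an entry without sales are filtered out.
rows = [
    ('Site A', '116.403347,39.922148', '19962'),
    ('Site A', '116.403347,39.922148', '19962'),
    ('Site B', '116.03293,40.369733', '0'),
]
points = build_points(rows)
print(json.dumps(points))
```

Serializing with `json.dumps` also avoids the trailing comma that the line-by-line approach leaves after every record.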

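The original function also plotted a chart of attraction sales, but here we look first at what happens after the JSON text is generated. Running the program above creates a file named LngLat.json in the path you specify. Open it in a text editor and copy its contents into the data section of heatmap.html. To keep the listing short, most of the data has been removed and only a sample is left. Save the code below as an HTML file, paste the generated JSON into var points =[], and open the file in a browser to see the heat map: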

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="initial-scale=1.0, user-scalable=no" />
<script type="text/javascript" src="http://api./api?v=2.0&ak=x2ZTlRkWM2FYoQbvGOufPnFK3Fx4vFR1"></script>
<script type="text/javascript" src="http://api./library/Heatmap/2.0/src/Heatmap_min.js"></script>
<title>Heat map example</title>
<style type="text/css">
ul, li { list-style: none; margin: 0; padding: 0; float: left; }
html { height: 100% }
body { height: 100%; margin: 0px; padding: 0px; font-family: "微软雅黑"; }
#container { height: 500px; width: 100%; }
#r-result { width: 100%; }
</style>
</head>
<body>
<div id="container"></div>
<div id="r-result">
<input type="button" onclick="openHeatmap();" value="Show heat map" />
<input type="button" onclick="closeHeatmap();" value="Hide heat map" />
</div>
<script type="text/javascript">
var map = new BMap.Map("container"); // create the map instance
var point = new BMap.Point(105.418261, 35.921984);
map.centerAndZoom(point, 5); // initialize the map: center point and zoom level
map.enableScrollWheelZoom(); // allow zooming with the scroll wheel
var points = [
{"lng":116.403347,"lat":39.922148,"count":19962},
{"lng":116.03293,"lat":40.369733,"count":3026},
{"lng":116.276887,"lat":39.999497,"count":3778},
{"lng":116.393097,"lat":39.942341,"count":668},
{"lng":116.314607,"lat":40.01629,"count":1890},
{"lng":116.03,"lat":40.367229,"count":2190},
{"lng":116.404015,"lat":39.912729,"count":904},
{"lng":116.398287,"lat":39.94015,"count":392},
{"lng":89.215713,"lat":42.94202,"count":96},
{"lng":89.212779,"lat":42.941938,"count":83},
{"lng":90.222236,"lat":42.850153,"count":71},
{"lng":80.931218,"lat":44.004188,"count":82},
{"lng":89.087234,"lat":42.952765,"count":40},
{"lng":86.866582,"lat":47.707518,"count":54},
{"lng":85.741271,"lat":48.36813,"count":4},
{"lng":87.556853,"lat":43.894646,"count":83},
{"lng":89.699515,"lat":42.862384,"count":81},
{"lng":80.903663,"lat":44.286633,"count":53},
{"lng":89.254534,"lat":43.025333,"count":50},
{"lng":86.1271,"lat":41.789203,"count":63},
{"lng":84.537278,"lat":43.314894,"count":81},
{"lng":84.282954,"lat":41.286104,"count":94},
{"lng":77.181601,"lat":37.397422,"count":32},
{"lng":82.666502,"lat":41.611567,"count":64},
{"lng":89.577441,"lat":44.008065,"count":57},
{"lng":83.056664,"lat":41.862089,"count":79},
{"lng":82.639664,"lat":41.588593,"count":53},
{"lng":89.537959,"lat":42.888903,"count":61},
{"lng":89.52734,"lat":42.876443,"count":95},
{"lng":87.11464,"lat":48.310173,"count":86},
{"lng":80.849732,"lat":44.238021,"count":6},
{"lng":89.488521,"lat":42.991858,"count":59},
{"lng":89.550783,"lat":42.882572,"count":92},
{"lng":88.055115,"lat":44.13238,"count":61},
{"lng":77.100143,"lat":39.095865,"count":63},
{"lng":78.992124,"lat":41.103398,"count":42},
{"lng":77.699877,"lat":39.013786,"count":62},
{"lng":81.912557,"lat":43.222123,"count":61},
{"lng":87.526264,"lat":47.75415,"count":33},
{"lng":87.556853,"lat":43.894632,"count":110},
{"lng":87.622686,"lat":43.820354,"count":10},
];
if (!isSupportCanvas()) {
    alert('The heat map only works in browsers with canvas support; your browser cannot display it.');
}
// For the full list of parameters, see the heatmap.js documentation: /pa7/heatmap.js/blob/master/README.md
// Parameters:
/* visible   whether the heat map is shown, default true
 * opacity   opacity of the heat map, 1-100
 * radius    radius of each point on the heat map
 * gradient  {JSON} color gradient of the heat map, for example
 *   {.2:'rgb(0, 255, 255)', .5:'rgb(0, 110, 255)', .8:'rgb(100, 0, 255)'}
 *   where each key is an interpolation position from 0 to 1 and each value is a color.
 */
var heatmapOverlay = new BMapLib.HeatmapOverlay({"radius": 20});
map.addOverlay(heatmapOverlay);
heatmapOverlay.setDataSet({data: points, max: 10000});
// Show or hide the heat map
function openHeatmap() {
    heatmapOverlay.show();
}
function closeHeatmap() {
    heatmapOverlay.hide();
}
closeHeatmap();
function setGradient() {
    /* Format: {0:'rgb(102, 255, 0)', .5:'rgb(255, 170, 0)', 1:'rgb(255, 0, 0)'} */
    var gradient = {};
    var colors = document.querySelectorAll("input[type='color']");
    colors = [].slice.call(colors, 0);
    colors.forEach(function (ele) {
        gradient[ele.getAttribute("data-key")] = ele.value;
    });
    heatmapOverlay.setOptions({"gradient": gradient});
}
// Check whether the browser supports canvas
function isSupportCanvas() {
    var elem = document.createElement('canvas');
    return !!(elem.getContext && elem.getContext('2d'));
}
</script>
</body>
</html>
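Copying the generated lines into the page by hand can also be scripted. The sketch below assumes the file holds one object per line with a trailing comma, as written by GenerateJson, and turns those lines into a single `var points = ...;` statement. `lines_to_js_array` is a name invented for this example.

```python
import json


def lines_to_js_array(lines):
    """Turn lines like '{"lng":1,"lat":2,"count":3},' into 'var points = [...];'."""
    points = []
    for line in lines:
        text = line.strip().rstrip(',')  # drop the newline and trailing comma
        if text:
            points.append(json.loads(text))
    return 'var points = ' + json.dumps(points) + ';'


# Two lines in the format GenerateJson writes, taken from the sample data above.
sample_lines = [
    '{"lng":116.403347,"lat":39.922148,"count":19962},\n',
    '{"lng":116.03293,"lat":40.369733,"count":3026},\n',
]
snippet = lines_to_js_array(sample_lines)
print(snippet)
```

In practice the lines would come from reading LngLat.json, and the resulting statement would replace the hand-pasted `var points` block in the HTML file.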
