
Python crawler: scraping Douban movie information

Date: 2023-06-14 17:34:15

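So happy! After a whole day of work and a lot of reading up on Python basics, I finally finished my first and simplest crawler: it scrapes the title, rating, number of ratings, and one-line quote of every movie in the Douban Top 250.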

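The code is as follows:

# My first and simplest crawler.
# It scrapes the title, rating, number of ratings, and one-line quote of every movie in the Douban Top 250.
#
# Looking at the Douban Top 250 page, you can see that:
# the movie entries sit inside an ol tag whose class attribute is grid_view.
# 1. Each movie's information is in its own li tag.
# 2. The title is in the first span with class "title" under the first div with class "hd".
# 3. The rating is in a span with class "rating_num" inside that li.
# 4. The number of ratings is the last number in the div with class "star" inside that li.
# 5. The one-line quote is in a span with class "inq" inside that li.
# 6. Apart from the first page, the page URLs are /top250?start=X&filter= where X runs from 25 to 225 in steps of 25.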

from lxml import etree
import requests

def get_info(url):
    movie_info = ''
    # Fetch the page with a GET request
    html = requests.get(url)
    selector = etree.HTML(html.text)
    content = selector.xpath('//ol[@class="grid_view"]/li')
    # Loop over every movie on the page
    for r in content:
        # Movie title
        movie_name = r.xpath('./div[@class="item"]/div[@class="info"]/div[@class="hd"]/a/span[@class="title"][1]/text()')[0]
        # Rating
        movie_score = r.xpath('./div[@class="item"]/div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]
        # Number of ratings
        people_num = r.xpath('./div[@class="item"]/div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[last()]/text()')[0]
        # One-line quote. A few movies have no quote, which raises IndexError,
        # so try...except is used to work around it.
        try:
            movie_quote = r.xpath('./div[@class="item"]/div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()')[0]
        except IndexError:
            print('No quote for this movie:', movie_name)
            movie_quote = ''
        # Append one line per movie, with the fields separated by spaces
        movie_info = movie_info + movie_name + ' ' + movie_score + ' ' + people_num + ' ' + 'Quote: ' + movie_quote + '\n'
    return movie_info

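# Save the collected information to a file
def save_info(movie_info):
    # 'a+' is append mode: the file pointer is placed at the end of the file.
    # Be sure to use utf-8 here. Otherwise you get "'gbk' codec can't encode character",
    # which means a Unicode character could not be encoded as GBK;
    # most likely the text contains characters that GBK cannot represent.
    with open('movie_info.txt', 'a+', encoding='utf-8') as f:
        f.write(movie_info)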

if __name__ == "__main__":
    url_1 = '/top250'
    first_page_info = get_info(url_1)
    save_info(first_page_info)
    # Remaining pages: start = 25, 50, ..., 225
    for i in range(25, 226, 25):
        url_2 = '/top250?start={}&filter='.format(i)
        other_page_info = get_info(url_2)
        save_info(other_page_info)
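The XPath expressions can be checked without touching the site by running them against a small HTML snippet that imitates the structure noted in the comments. The sketch below is only an illustration: SAMPLE_HTML is made-up markup (not copied from the real page), and parse_items and page_offsets are hypothetical helper names. It uses shorter relative paths than the script and pulls the number out of the rating-count text with re, as item 4 of the notes suggests.

```python
import re
from lxml import etree

# Made-up markup that imitates the structure described in the notes;
# it is NOT copied from the real page.
SAMPLE_HTML = """
<ol class="grid_view">
  <li><div class="item"><div class="info">
    <div class="hd"><a><span class="title">Movie A</span><span class="title">Alias A</span></a></div>
    <div class="bd">
      <div class="star"><span class="rating_num">9.0</span><span>12345人评价</span></div>
      <p class="quote"><span class="inq">A short comment.</span></p>
    </div>
  </div></div></li>
  <li><div class="item"><div class="info">
    <div class="hd"><a><span class="title">Movie B</span></a></div>
    <div class="bd">
      <div class="star"><span class="rating_num">8.5</span><span>678人评价</span></div>
    </div>
  </div></div></li>
</ol>
"""


def page_offsets():
    """Values of the start parameter for the pages after the first: 25, 50, ..., 225."""
    return list(range(25, 226, 25))


def parse_items(html_text):
    """Return one (title, rating, rating_count, quote) tuple per li element."""
    selector = etree.HTML(html_text)
    rows = []
    for li in selector.xpath('//ol[@class="grid_view"]/li'):
        name = li.xpath('.//div[@class="hd"]//span[@class="title"][1]/text()')[0]
        score = li.xpath('.//span[@class="rating_num"]/text()')[0]
        count_text = li.xpath('.//div[@class="star"]/span[last()]/text()')[0]
        # Keep only the digits of a string such as "12345人评价"
        count = int(re.search(r'\d+', count_text).group())
        # xpath() returns a list; it is empty when the movie has no quote
        quotes = li.xpath('.//span[@class="inq"]/text()')
        quote = quotes[0] if quotes else ''
        rows.append((name, score, count, quote))
    return rows


for row in parse_items(SAMPLE_HTML):
    print(row)
print(page_offsets())
```

Returning tuples instead of one long string keeps the fields separate, so they can later be written out in whatever format is needed.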

A few pitfalls I ran into; each one is also marked in the code comments:

1. Some movies have no Douban quote! In the code, this line: movie_quote = r.xpath('./div[@class="item"]/div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()')[0]

raises IndexError for them; try...except is used to work around it.
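An alternative to try...except is to check whether the list returned by xpath() is empty before indexing it. A minimal sketch, where first_or_default is a hypothetical helper name (it is not part of lxml) that works on any list:

```python
def first_or_default(results, default=''):
    """Return the first element of a list, or `default` if the list is empty."""
    return results[0] if results else default


# A query that matched one text node, and one that matched nothing
print(first_or_default(['A classic.']))
print(repr(first_or_default([])))
```

Inside get_info this would read movie_quote = first_or_default(r.xpath(...)), which avoids raising and catching an exception for a case that is expected to happen.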

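2. Some movie information raises an error when it is encoded:

UnicodeEncodeError: 'gbk' codec can't encode character '\u22ef' in position 775: illegal multibyte sequence

This happens when a Unicode character is encoded as GBK: most likely the text contains characters that GBK cannot represent.

Specifying the encoding when opening the file fixes it:

open('movie_info.txt', 'a+', encoding='utf-8')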
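The error can be reproduced without any scraping. The sketch below takes the character from the error message, '\u22ef', and shows that the gbk codec rejects it while utf-8 accepts it. The file name demo.txt and the temporary directory are only for illustration. That the original run defaulted to GBK is an assumption: it is what open() typically does on Windows with a Chinese locale when no encoding is given.

```python
import os
import tempfile

ch = '\u22ef'  # the character named in the error message

# Encoding to GBK fails for this character
try:
    ch.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# UTF-8 can represent every Unicode character
utf8_bytes = ch.encode('utf-8')

# Writing with an explicit encoding works regardless of the system default
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')
    with open(path, 'a+', encoding='utf-8') as f:
        f.write('quote' + ch + '\n')
    with open(path, encoding='utf-8') as f:
        read_back = f.read()

print(gbk_ok, utf8_bytes, read_back == 'quote' + ch + '\n')
```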

This first crawler is still rather crude. Later I plan to add multithreading and POST requests, and to try scraping images and using Scrapy.
