
Python Web Scraping: Scraping User Movie Reviews from Dianping (大众点评) with CSS Font Obfuscation

Date: 2023-07-03 07:13:12

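I've been learning web scraping lately and have just reached the anti-scraping part. A friend told me that Dianping's anti-scraping measures are tough, but after analyzing the site I found an old acquaintance: font-based obfuscation, the same idea used by Maoyan Movies (猫眼电影). If you're interested, see my Maoyan post for details. Unlike Maoyan, however, Dianping applies the obfuscation through a CSS file. Let's work through it together.

Target URL

Page scraped in this post: Hualian Cinema, Pinggu branch (华联影城（平谷店）)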

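How the font obfuscation works

This is how a review appears on the page:

Right-click and choose Inspect to open the developer tools. You can see that some characters in a review are replaced by tags, and each tag is resolved to a character through its CSS class selector.

Since the obfuscation is CSS-based, the first thing to find is the CSS file. Right-click and view the page source; the stylesheet linked there is the obfuscation CSS file we are looking for.

Open the file and you'll see a dense list of rules. Each class maps to the coordinates of one character.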
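The class-to-coordinate mapping described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual data: the CSS excerpt, the class names (`abc12`, `xyz9k`), and the helper `parse_offsets` are all made up, and the rule format is assumed to match the pattern used by the regex later in this post.

```python
import re

# Hypothetical CSS excerpt; class names, offsets and the URL are invented.
SAMPLE_CSS = (
    ".abc12{background:-14.0px -120.0px;}"
    ".xyz9k{background:-280.0px -2035.0px;}"
    'svgmtsi[class^="ab"]{margin-top: -9px;'
    "background-image: url(//example.com/a.svg);}"
)

# One rule per obfuscated character: class name, x offset, y offset.
RULE_RE = re.compile(r"\.(\w+)\{background:-(\d+)\.0px -(\d+)\.0px;\}")


def parse_offsets(css_text):
    """Return {class_name: (x_offset, y_offset)} with positive int offsets."""
    return {name: (int(x), int(y)) for name, x, y in RULE_RE.findall(css_text)}


offsets = parse_offsets(SAMPLE_CSS)
print(offsets)  # {'abc12': (14, 120), 'xyz9k': (280, 2035)}
```

Storing the offsets as integers up front avoids repeated `int()` conversions when the coordinates are compared later.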

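With the coordinates in hand, we still need the font files they index into. Their links are hidden in the same CSS file: press Ctrl+F and search for .svg. There are 3 files, and all of them are used for the obfuscation.

What one of them looks like when opened:

Its source:

Here we mainly need to pay attention to the y-axis. The x position is obtained directly by dividing by the font size, but the y value can differ slightly, so we take the closest value to determine which character it is.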
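The lookup rule above (x divided by the font size, y matched to the nearest row) might look like this in code. It is only a sketch under assumptions: the helper `lookup_glyph` is hypothetical, the rows use Latin letters as stand-ins for the real Chinese characters, and the font size of 14 px is taken from the `i*14` step in the code later in this post.

```python
FONT_SIZE = 14  # assumed glyph width in px (the article's code steps by 14)


def lookup_glyph(rows, x_offset, y_offset, font_size=FONT_SIZE):
    """Find the character at (x_offset, y_offset).

    rows is a list of (y, text) pairs, one per text row of the SVG file.
    The row is the one whose y is closest to y_offset; the column is
    x_offset divided by the font size.
    """
    _, text = min(rows, key=lambda row: abs(row[0] - y_offset))
    index = x_offset // font_size
    return text[index] if index < len(text) else None


# Made-up rows standing in for the contents of one SVG file.
rows = [(30, "ABCD"), (70, "EFGH")]
print(lookup_glyph(rows, 28, 33))  # C  (row y=30, column 28 // 14 = 2)
print(lookup_glyph(rows, 0, 68))   # E  (row y=70, column 0)
```

Picking the nearest row rather than requiring an exact match is what absorbs the small y differences mentioned above.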

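Code

With the analysis done, let's start scraping. There isn't much data to collect, so we use the requests module directly. I'll skip the trivial code and go straight to the important parts.

Fetching the 3 font files

Download the .css file, then fetch the 3 .svg files it references one by one, and finally build the mapping dictionaries and store them.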

    import re
    import requests

    def get_encryption_font(self, html):
        """Download the obfuscation CSS and build the mapping dictionaries."""
        font_href = "http:" + html.xpath("//link[@rel='stylesheet'][2]/@href")[0]
        # Download the CSS file
        encryption_font = requests.get(font_href, headers=self.headers).text
        # Map each class name to its (x, y) background offsets
        encryption_font_list = re.findall(
            r"(\w+)\{background:-(\d+)\.0px -(\d+)\.0px;\}", encryption_font)
        self.encryption_font_map_dict = {
            font_tuple[0]: (font_tuple[1], font_tuple[2])
            for font_tuple in encryption_font_list
        }
        # Find the 3 font files (margin-top value, SVG URL)
        svg_font_url_list = re.findall(
            r"margin-top: (-?\d+?)px;background-image: url\((.+?\.svg)\)",
            encryption_font)
        # Request each font file and build its mapping
        for svg_font_url in svg_font_url_list:
            svg_font = requests.get("http:" + svg_font_url[1], headers=self.headers).text
            # First layout: each row is a <text y="..."> element
            text_list = re.findall(r"""y="(\d+?)">(.+?)</text>""", svg_font)
            if not text_list:
                # Second layout: y comes from <path>, characters from <textPath>
                y_list = re.findall(r"""d="M0 (\d+?) H600"/>""", svg_font)
                font_list = re.findall(r""">(.+?)</textPath>""", svg_font)
                # Make sure the two lists line up
                if len(y_list) == len(font_list):
                    text_list = list(zip(y_list, font_list))
                else:
                    return "extraction failed"
            # Key: (x offset as string, adjusted y); value: the character
            font_map_dict = {
                (str(i * 14), int(x[0]) + int(svg_font_url[0]) - 9): font
                for x in text_list
                for i, font in enumerate(x[1])
            }
            self.font_map_dict.update(font_map_dict)
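To make the two SVG layouts handled above easier to see, here is a small standalone sketch that parses both from inline strings. The sample SVG snippets and the helper `parse_svg_rows` are invented for illustration (with Latin letters in place of Chinese characters); real files may differ in attributes and whitespace. The `textPath` pattern uses `[^<>]+` so it also works when the markup has no line breaks.

```python
import re


def parse_svg_rows(svg_text):
    """Return a list of (y, text) rows from either SVG layout."""
    # Layout 1: <text x="0" y="30">ABCD</text>
    rows = re.findall(r'y="(\d+)">(.+?)</text>', svg_text)
    if not rows:
        # Layout 2: y values live in <path d="M0 30 H600"/>,
        # characters live in <textPath>...</textPath>
        ys = re.findall(r'd="M0 (\d+) H600"/>', svg_text)
        texts = re.findall(r">([^<>]+)</textPath>", svg_text)
        if len(ys) != len(texts):
            raise ValueError("path and textPath counts do not match")
        rows = list(zip(ys, texts))
    return [(int(y), text) for y, text in rows]


# Made-up samples of the two layouts.
SVG_A = '<svg><text x="0" y="30">ABCD</text><text x="0" y="70">EFGH</text></svg>'
SVG_B = (
    '<svg><defs><path id="1" d="M0 30 H600"/><path id="2" d="M0 70 H600"/></defs>'
    '<text><textPath xlink:href="#1" textLength="56">ABCD</textPath>'
    '<textPath xlink:href="#2" textLength="56">EFGH</textPath></text></svg>'
)

print(parse_svg_rows(SVG_A))  # [(30, 'ABCD'), (70, 'EFGH')]
print(parse_svg_rows(SVG_B))  # [(30, 'ABCD'), (70, 'EFGH')]
```

Raising an exception on a mismatch, instead of returning a string, is a design choice of this sketch so that a failed parse cannot be mistaken for data.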

Extracting user info and reviews

Use XPath to pull the fields we want from each review into a dictionary, and append each dictionary to a list.

    import re
    from lxml import etree

    def get_comments(self, html):
        """Extract user info and reviews."""
        # Select the review divs
        user_div_list = html.xpath("//div[@class='main-review']")
        for div in user_div_list:
            # Collect the basic fields of each review and append them to the list
            self.user_info_list.append({
                # User name
                "user_name": div.xpath(
                    "normalize-space(.//div[@class='dper-info']/a/text())"),
                # Overall scores and per-person spend
                "comprehensive_score": div.xpath(
                    "normalize-space(.//div[@class='review-rank']/span[@class='score'])"),
                # Time of posting
                "published_time": div.xpath(
                    "normalize-space(.//div[@class='misc-info clearfix']/span[@class='time'])"),
                # Venue being reviewed
                "comments_on_site": div.xpath(
                    "normalize-space(.//div[@class='misc-info clearfix']/span[@class='shop'])"),
                # Star rating: the number in the span's class attribute, divided by 10
                "star_num": int(re.search(r"\d+", etree.tostring(
                    div.xpath(".//div[@class='review-rank']/span[@class][1]")[0],
                    encoding="utf-8").decode()).group()) // 10,
                # Review body, kept as raw HTML so the obfuscated tags survive
                "comment": "".join([
                    etree.tostring(comment, encoding="utf-8").decode()
                    for comment in div.xpath(
                        ".//div[@class='review-words Hide']|.//div[@class='review-words']")
                ]),
            })

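Decoding the reviews

Finally, the key step of this post: decoding the reviews. Because a review mixes plain text with tags, we extract the pieces with a regular expression and then substitute the characters using the mapping dictionaries built earlier.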

    import re
    from pprint import pprint

    def decryption_comment(self):
        """Decode the reviews."""
        for user_info_dict in self.user_info_list:
            # Split the review into plain text pieces and obfuscated class names
            comments = re.findall(
                r""">\n(.+?)<|>(\w*?.*?\w*?)<|svgmtsi class="(.+?)"/|>(\w+)\s+<""",
                user_info_dict["comment"])
            new_comment_list = list()
            for comment_tuple in comments:
                comment_tuple = list(comment_tuple)
                # Skip empty matches
                if comment_tuple:
                    # Group 2 holds a class name that needs decoding
                    if comment_tuple[2]:
                        font_key = self.encryption_font_map_dict.get(comment_tuple[2])
                        # Unknown classes are left as-is instead of raising
                        if font_key is not None:
                            # Find the character whose coordinates are close enough
                            for cos, font in self.font_map_dict.items():
                                if (abs(int(font_key[0]) - int(cos[0])) <= self.difference
                                        and abs(int(font_key[1]) - int(cos[1])) <= self.difference):
                                    comment_tuple[2] = font
                                    break
                    new_comment_list.append("".join(comment_tuple))
            # Strip tabs, spaces and the "收起评论" (collapse review) link text
            user_info_dict["comment"] = re.sub(
                r"\t| |收起评论", "", "".join(new_comment_list))
        pprint(self.user_info_list)
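As an alternative way to picture the substitution step, the sketch below replaces each obfuscated tag in place with `re.sub` and a callback, instead of splitting the review into pieces first. This is not the method used above, just an illustration: the helper `decode_comment`, the class names, and the sample HTML are hypothetical, and it assumes a ready-made class-to-character dictionary (that is, the coordinate lookup has already been done).

```python
import re

# Matches <svgmtsi class="..."/> and <svgmtsi class="..."></svgmtsi>
TAG_RE = re.compile(r'<svgmtsi class="(\w+)"\s*/?>(?:</svgmtsi>)?')


def decode_comment(raw_html, class_to_char):
    """Replace every obfuscated tag with its character.

    Tags whose class is not in class_to_char are left untouched.
    """
    def replace(match):
        return class_to_char.get(match.group(1), match.group(0))

    return TAG_RE.sub(replace, raw_html)


# Made-up mapping and review text, with Latin letters as stand-ins.
class_to_char = {"abc12": "C", "xyz9k": "E"}
raw = 'AB<svgmtsi class="abc12"/>D <svgmtsi class="xyz9k"></svgmtsi>FGH'
print(decode_comment(raw, class_to_char))  # ABCD EFGH
```

Leaving unknown tags untouched makes missing mappings easy to spot in the output.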