
Scraping Dianping's Font-Obfuscated Ratings for Popular Restaurants in Each City

Date: 2023-09-20 22:24:56


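I have written two Dianping (大众点评) scrapers before, but Dianping has long since changed its anti-scraping strategy. I recently looked at the new mechanism and worked out a small way around it. First, a recap of how Dianping used to obfuscate text on the front end: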

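The page loaded an SVG image and used CSS to set the background coordinates of that sprite image, so that each character shown on the page was a slice of the SVG. To decode it, you had to work backwards from the coordinates to the characters in the SVG; once that mapping was known, the text could be recovered. For details, see "Scraping Dianping reviews: full extraction of obfuscated review text" (大众点评评论抓取-加密评论信息完整抓取).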

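I also wrote two articles analysing Dianping's rankings. They target the older "popular rankings" feature. I can no longer find an entry point to it on the site, but the old ranking pages still work, although only for a small number of cities nationwide. For reference: "Scraping popular restaurant data for each Dianping city" (大众点评各城市热门餐厅数据爬虫抓取) and "Scraping and analysing Dianping's popular restaurants" (大众点评热门餐厅抓取与数据分析).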

This time: scraping and decoding Dianping's per-city list pages

First, a look at how the obfuscation works:

Taking Dianping's Xi'an site as the example, we scrape the food category.

1. The home page;

2. The food list page. Note how the link is built: /city pinyin/ch10 (food)/p1 (page number);

[Figure: the list page to scrape]
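The path scheme from step 2 can be sketched as a small helper. This is only an illustration: the function name `build_list_path` and its parameters are hypothetical, and only the `/city pinyin/ch10/pN` pattern and the `xian` example come from the article.

```python
def build_list_path(city_pinyin: str, page: int, channel: str = "ch10") -> str:
    """Return the list-page path /<city pinyin>/<channel>/p<page>.

    "ch10" is the food channel according to the article; other channel
    codes are not covered here.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"/{city_pinyin}/{channel}/p{page}"


# Paths for the first three food pages of Xi'an.
paths = [build_list_path("xian", page) for page in range(1, 4)]
print(paths)  # ['/xian/ch10/p1', '/xian/ch10/p2', '/xian/ch10/p3']
```

The site's scheme and host still have to be prepended before making a request.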

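3. On the page, the review count, the shop address, and the individual scores are all obfuscated. Inspecting the tags, every such character displays as "口". The CSS shows that Dianping applies a custom font, and the far right of the style panel links to its stylesheet, whose contents we look at next;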

[Figure: obfuscated fields on the page being scraped]

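4. In the page source, each "口" is really a character reference such as &#xe89c; and all of them begin with "&#x". On their own these tell us nothing, but together with the CSS shown above it is clear that the page loads its own font;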

[Figure: the obfuscated content in the page source]
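To see which codes a page actually uses, the "&#x" references from step 4 can be collected with a regular expression. A minimal sketch, assuming four-hex-digit references like the `&#xe89c;` example; the sample HTML and the function name are made up for illustration.

```python
import re

# Matches references such as &#xe89c; and captures the hex part.
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]{4});")


def find_obfuscated_codes(html_text: str) -> list[str]:
    """Return the distinct hex codes found in html_text, in first-seen order."""
    seen = {}
    for code in ENTITY_RE.findall(html_text):
        seen.setdefault(code.lower(), None)
    return list(seen)


# Hypothetical fragment; the real markup may differ.
sample = '<b>&#xe89c;<i>&#xf0a1;</i>&#xe89c;</b>'
print(find_obfuscated_codes(sample))  # ['e89c', 'f0a1']
```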

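5. The stylesheet lists the obfuscated font link for each tag class. In Chrome there is a custom font with a .woff suffix, and that is the link we take. Every one of these font files has to be parsed before the text can be decoded;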

[Figure: the obfuscated font links in the CSS]
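Pulling the .woff links out of the stylesheet text from step 5 could look like the sketch below. The sample CSS, its font-family names, and its paths are invented for illustration; only the `s3plus.meituan.net` host comes from the author's code. The pattern here is a stricter variant chosen for this example (it stops at quotes, parentheses, and whitespace), not the author's regular expression.

```python
import re

# Stops at quotes, parentheses and whitespace so one match cannot run
# from one url(...) into the next.
WOFF_RE = re.compile(r"""//s3plus\.meituan\.net/[^"')\s]+?\.woff""")


def extract_woff_urls(css_text: str) -> list[str]:
    """Return the distinct .woff URLs in css_text, with a scheme added."""
    return sorted({"http:" + url for url in WOFF_RE.findall(css_text)})


# Hypothetical stylesheet fragment.
SAMPLE_CSS = (
    '@font-face{font-family:"demo-shopNum";'
    'src:url("//s3plus.meituan.net/v1/demo/aaaa.eot");'
    'src:url("//s3plus.meituan.net/v1/demo/aaaa.woff");}'
    '@font-face{font-family:"demo-address";'
    'src:url("//s3plus.meituan.net/v1/demo/bbbb.woff");}'
)

for url in extract_woff_urls(SAMPLE_CSS):
    print(url)
```

Deduplicating through a set mirrors the `set(woff_url)` step in the author's code, since the same font can be referenced more than once.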

6. The font files themselves show up under Font in the Network panel; clicking a link downloads the file;

[Figure: the obfuscated woff font file]

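7. Opening the font file reveals the actual characters. As for how the font is built: it stores a large number of coordinate points, so each character is drawn point by point from its outline, and redrawing the outlines ourselves would be a lot of work. Note the label above each character: it is exactly what is left of the code in the page source after removing "&#x". That is the breakthrough: once we have the mapping between labels and characters, we can replace the codes and decode the page. The one remaining difficulty is that the characters are drawn outlines and Dianping keeps regenerating the font, so although we can see the contents we cannot simply copy and paste them into a fixed table. Hence the following approach: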

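1) Read the woff file with TTFont;
2) Draw the glyphs onto an image with the Image module, which amounts to taking a screenshot of the font;
3) Read the characters from the image with OCR;
4) Look up each character through the resulting key/value mapping.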

    font = TTFont(file_name)  # open the obfuscated font file
    codeList = font.getGlyphOrder()[2:]
    # draw on a canvas
    im = Image.new("RGB", (1800, 1000), (255, 255, 255))
    dr = ImageDraw.Draw(im)
    font = ImageFont.truetype(file_name, 40)
    count = 15
    list_img = numpy.array_split(codeList, count)  # split the list into 15 parts so the glyphs are drawn on separate rows
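The bookkeeping around the OCR step can be sketched without any imaging libraries. The helper names below are hypothetical. `split_rows` imitates how `numpy.array_split` divides a list into nearly equal consecutive chunks, and `build_mapping` pairs glyph names such as `unie89c` with the recognized characters. The length check is an addition for this sketch, because a plain `zip` would silently misalign the mapping if the OCR dropped or added a character.

```python
def split_rows(codes: list, rows: int) -> list:
    """Split codes into `rows` consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(codes), rows)
    chunks, start = [], 0
    for i in range(rows):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more
        chunks.append(codes[start:start + size])
        start += size
    return chunks


def glyph_name_to_entity(glyph_name: str) -> str:
    """Turn a glyph name such as 'unie89c' into the reference '&#xe89c;'."""
    if not glyph_name.startswith("uni"):
        raise ValueError(f"unexpected glyph name: {glyph_name!r}")
    return "&#x" + glyph_name[3:] + ";"


def build_mapping(glyph_names: list, ocr_text: str) -> dict:
    """Pair each glyph name, in drawing order, with the character the OCR read."""
    chars = [c for c in ocr_text if not c.isspace()]
    if len(chars) != len(glyph_names):
        raise ValueError(
            f"OCR returned {len(chars)} characters for {len(glyph_names)} glyphs"
        )
    return {glyph_name_to_entity(n): c for n, c in zip(glyph_names, chars)}


print(split_rows(["a", "b", "c", "d", "e", "f", "g"], 3))
print(build_mapping(["unie89c", "unif0a1"], "4 8\n"))
```

When the counts disagree, it is probably safer to re-render the font at a larger size or inspect the saved image than to trust a shifted mapping.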

[Figure: the obfuscated font]

[Figure: the characters after recognition]

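8. The last step is to map each recognized character to its obfuscated code and replace the codes in the page source, after which the data can be extracted.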
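The replacement in step 8 can be sketched as a single pass over the page source. This is a variation for illustration, not the author's code, which calls `str.replace` once per key. The mapping format `{'&#xe105;': '2'}` follows the author's docstring; the function name and the sample values are made up.

```python
import re

ENTITY_RE = re.compile(r"&#x[0-9a-fA-F]{4};")


def decode_html(html_text: str, mapping: dict) -> str:
    """Replace every known reference with its character; leave unknown ones untouched."""
    return ENTITY_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), html_text)


mapping = {"&#xe89c;": "4", "&#xf0a1;": "8"}  # hypothetical decoded font
print(decode_html("<b>&#xe89c;.&#xf0a1;&#xffff;</b>", mapping))  # <b>4.8&#xffff;</b>
```

Leaving unknown references in place makes it easy to spot codes that belong to a font that has not been decoded yet.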

The shop data finally scraped from the first page:

[
{"ID":"jfP5B6BhSEijjpRY","shopLi":"/shop/jfP5B6BhSEijjpRY","shopName":"蓝田印象","shopStar":"4.86","shopRecommend":["蓝田饸饹","桂花糯米糕","油饼"],"shopTotal":"871 ","shopAvg":"43","shopTag":"其他美食","shopArea":"北宝路","shopAddress":"北环路蓝丰家园对面","shopTaste":"4.86","shopEnvironment":"4.85","shopServer":"4.84"},
{"ID":"l8VFgvYfgSIaQvz7","shopLi":"/shop/l8VFgvYfgSIaQvz7","shopName":"长安壹号","shopStar":"4.78","shopRecommend":["长安葫芦鸡","麻什","太宗吊烧肉"],"shopTotal":"2474","shopAvg":"222","shopTag":"陕菜","shopArea":"省体育场","shopAddress":"长安北路1号","shopTaste":"4.68","shopEnvironment":"4.89","shopServer":"4.81"},
{"ID":"G9VBBvcEp6puNgUy","shopLi":"/shop/G9VBBvcEp6puNgUy","shopName":"旺顺阁鱼头泡饼(悦荟广场店)","shopStar":"4.64","shopRecommend":["经典鱼头泡饼","手工现烙饼","芝士焗红薯"],"shopTotal":"803","shopAvg":"116","shopTag":"京菜","shopArea":"民可园","shopAddress":"解放路116号悦荟广场L606a","shopTaste":"4.61","shopEnvironment":"4.86","shopServer":"4.82"},
{"ID":"H3pvCM708cM764Z5","shopLi":"/shop/H3pvCM708cM764Z5","shopName":"荣宴·中餐厅","shopStar":"4.87","shopRecommend":["鲁式葱烧海参","国宴开水白菜","佛跳墙"],"shopTotal":"151 ","shopAvg":"1319","shopTag":"创校菜","shopArea":"当新路沿乐","shopAddress":"高新二路与科技二路什字东丹轩梓园北门(农业银行二楼)","shopTaste":"4.85","shopEnvironment":"4.89","shopServer":"4.9"},
{"ID":"l1uuIVrie5SV9nAh","shopLi":"/shop/l1uuIVrie5SV9nAh","shopName":"糊涂记(新城广场店)","shopStar":"4.82","shopRecommend":["葫芦鸡","高陵油饼","关中四宝"],"shopTotal":"3820","shopAvg":"64","shopTag":"陕菜","shopArea":"钟楼/鼓楼","shopAddress":"南新街8号路西","shopTaste":"4.84","shopEnvironment":"4.82","shopServer":"4.69"},
{"ID":"H1Lvg3y9EaOTV086","shopLi":"/shop/H1Lvg3y9EaOTV086","shopName":"张老板的店(民乐园店)","shopStar":"4.87","shopRecommend":["麻麻面","霸气双拼披萨","椒麻牛肚"],"shopTotal":"2093","shopAvg":"89","shopTag":"特色菜","shopArea":"民可园","shopAddress":"解放路111号民乐园万达步行街11号楼10101铺","shopTaste":"4.86","shopEnvironment":"4.89","shopServer":"4.88"},
{"ID":"Ga60yrRErPAWOD8D","shopLi":"/shop/Ga60yrRErPAWOD8D","shopName":"长安大牌档之长安集市(赛格旗舰店)","shopStar":"4.70","shopRecommend":["长安葫芦鸡","豆皮涮牛肚锅","醪糟冰淇淋"],"shopTotal":"35020","shopAvg":"81","shopTag":"陕菜","shopArea":"小寨","shopAddress":"小寨东路赛格国际购物中心6楼西北角","shopTaste":"4.61","shopEnvironment":"4.84","shopServer":"4.62"},
{"ID":"lazoXjXc4sGBvSPp","shopLi":"/shop/lazoXjXc4sGBvSPp","shopName":"和悦和牛火锅(迈科中心店)","shopStar":"4.93","shopRecommend":["5A三角牛腩和牛粒","5A和牛上脑","招牌松茸菌汤底"],"shopTotal":"213","shopAvg":"598","shopTag":"打边炉/港式火锅","shopArea":"丈八","shopAddress":"锦业路12号迈科中心A座1楼","shopTaste":"4.93","shopEnvironment":"4.93","shopServer":"4.93"},
{"ID":"Eg7Os5JRBOLW8Xnu","shopLi":"/shop/Eg7Os5JRBOLW8Xnu","shopName":"胖子甑糕","shopStar":"4.79","shopRecommend":["甑糕","枣泥","蜜枣"],"shopTotal":"2574","shopAvg":"8","shopTag":"小吃","shopArea":"莲湖公园","shopAddress":"洒金桥路与劳武巷交叉口杨天玉腊牛羊肉店旁","shopTaste":"4.69","shopEnvironment":"4.16","shopServer":"4.67"},
{"ID":"k4MQcT0ou69m0Ult","shopLi":"/shop/k4MQcT0ou69m0Ult","shopName":"爱骅裤带面馆(总店)","shopStar":"4.89","shopRecommend":["biangbiang面","油泼面","蘸水面"],"shopTotal":"1609","shopAvg":"16","shopTag":"面馆","shopArea":"钟楼/鼓楼","shopAddress":"东木头市19号(秦豫肉夹馍东隔壁)","shopTaste":"4.9","shopEnvironment":"4.58","shopServer":"4.85"},
{"ID":"l8qbUQaSQNjSLD2i","shopLi":"/shop/l8qbUQaSQNjSLD2i","shopName":"陕拾叁(鼓楼店)","shopStar":"4.87","shopRecommend":["醪糟味冰淇淋","秦酥","豆腐冰淇淋"],"shopTotal":"9418","shopAvg":"32","shopTag":"冰淇淋","shopArea":"钟楼/鼓楼","shopAddress":"北院门270号","shopTaste":"4.86","shopEnvironment":"4.87","shopServer":"4.87"},
{"ID":"k4PFL1AksZDcU3a8","shopLi":"/shop/k4PFL1AksZDcU3a8","shopName":"爷们儿泥炉烤肉","shopStar":"4.88","shopRecommend":["品厚切五花肉","秘制梅花肉","调味澳洲肥牛"],"shopTotal":"762","shopAvg":"85","shopTag":"融合烤肉","shopArea":"钟楼/鼓楼","shopAddress":"东县门与饮马池十字路东","shopTaste":"4.88","shopEnvironment":"4.81","shopServer":"4.9"},
{"ID":"H6oZsmKfP21fMtVy","shopLi":"/shop/H6oZsmKfP21fMtVy","shopName":"阳坊大都涮羊肉","shopStar":"3.48","shopRecommend":["苏尼特肥羊","软切羊肉","大都招牌肉"],"shopTotal":"21 ","shopAvg":"135","shopTag":"老北京火锅","shopArea":"丈八","shopAddress":"高新六路CROSS万象汇8号楼2层","shopTaste":"3.72","shopEnvironment":"3.92","shopServer":"3.77"},
{"ID":"l24zZ7Ak8q6L2dhU","shopLi":"/shop/l24zZ7Ak8q6L2dhU","shopName":"醉长安(钟楼旗舰店)","shopStar":"4.83","shopRecommend":["老陕葫芦鸡","晾衣毛肚","妙笔生花"],"shopTotal":"8462","shopAvg":"83","shopTag":"陕菜","shopArea":"钟楼/鼓楼","shopAddress":"竹笆市鼓楼向南200米美豪丽致酒店1楼2楼","shopTaste":"4.75","shopEnvironment":"4.88","shopServer":"4.87"},
{"ID":"G7L8e4Z2Oph1epU7","shopLi":"/shop/G7L8e4Z2Oph1epU7","shopName":"莲花餐饮(朱雀店)","shopStar":"4.88","shopRecommend":["紫阳蒸盆子","安康吊炉芝麻烧饼","清蒸鸭嘴鱼"],"shopTotal":"2519","shopAvg":"109","shopTag":"陕菜","shopArea":"省体育场","shopAddress":"朱雀大街中段1号","shopTaste":"4.85","shopEnvironment":"4.89","shopServer":"4.88"}
]
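Every value in the scraped records is a string, and some counts carry a trailing space (for example "871 " and "21 "). A small post-processing sketch, assuming the field names shown in the output above; the `normalize` helper is hypothetical, and the two sample records are trimmed copies of records from that output. A trailing space might also mean the OCR missed a character, so the cleaned counts are best treated as approximate.

```python
import json

# Two records from the scraped output, reduced to the numeric fields.
RAW = """[
 {"shopName": "蓝田印象", "shopStar": "4.86", "shopTotal": "871 ", "shopAvg": "43"},
 {"shopName": "阳坊大都涮羊肉", "shopStar": "3.48", "shopTotal": "21 ", "shopAvg": "135"}
]"""


def normalize(record: dict) -> dict:
    """Return a copy of record with the numeric fields converted from strings."""
    cleaned = dict(record)
    cleaned["shopStar"] = float(record["shopStar"].strip())
    cleaned["shopTotal"] = int(record["shopTotal"].strip())
    cleaned["shopAvg"] = int(record["shopAvg"].strip())
    return cleaned


shops = [normalize(r) for r in json.loads(RAW)]
shops.sort(key=lambda s: s["shopStar"], reverse=True)  # best rated first
for shop in shops:
    print(shop["shopName"], shop["shopStar"], shop["shopTotal"], shop["shopAvg"])
```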

Source code

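Notes on the source: Dianping's font files change dynamically, so new fonts have to be requested continually; you could also study the change pattern and refresh on a schedule. The code has one flaw: when decoding, I replace the obfuscated codes globally across the whole page. Some codes should instead be replaced according to the CSS class they appear under (for example tagName). This code focuses on the decoding itself and does not do that fine-grained replacement.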
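The class-scoped replacement described in the note could be sketched as follows. This is not the author's implementation. The element name `svgmtsi`, the `address` class, and all mapping values are assumptions made for illustration (only `tagName` is named in the article), and the sketch expects one decoded mapping per CSS class.

```python
import re

# One obfuscated character wrapped in an element that carries a class name.
TAG_RE = re.compile(r'<(\w+) class="(\w+)">(&#x[0-9a-fA-F]{4};)</\1>')


def decode_by_class(html_text: str, mappings: dict) -> str:
    """Replace each wrapped reference using the mapping that belongs to its class."""
    def replace(match):
        css_class, entity = match.group(2), match.group(3)
        char = mappings.get(css_class, {}).get(entity)
        # Keep the original markup when the class or the code is unknown.
        return match.group(0) if char is None else char

    return TAG_RE.sub(replace, html_text)


# The same code decodes differently depending on the class it appears under.
mappings = {
    "tagName": {"&#xe89c;": "菜"},
    "address": {"&#xe89c;": "路"},
}
html = (
    '<svgmtsi class="tagName">&#xe89c;</svgmtsi> '
    '<svgmtsi class="address">&#xe89c;</svgmtsi>'
)
print(decode_by_class(html, mappings))  # 菜 路
```

A global replacement would have given both positions the same character, which is the flaw the note describes.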

    #!/usr/bin/python3
    # encoding: utf-8
    """
    @version: v1.0
    @author: W_H_J
    @license: Apache Licence
    @contact: 415900617@
    @software: PyCharm
    @file: dazhongFoodList.py
    @time: /6/17 10:12
    @describe: Food listings from Dianping's per-city list pages.
               To go beyond page 10 you must log in and add the cookie by hand.
    """
    import json
    import os
    import random
    import re
    import sys

    import numpy
    import pytesseract
    import requests
    from fontTools.ttLib import TTFont
    from PIL import Image, ImageDraw, ImageFont
    from pyquery import PyQuery as pq

    sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))
    sys.path.append("..")

    doc_path = "./secretDoc"  # folder where downloaded woff fonts are stored
    if not os.path.exists(doc_path):
        os.mkdir(doc_path)


    class DaZhongFoodList:
        def __init__(self):
            self.USER_AGENT_LIST = [
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
                "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            ]

        def get(self, url):
            head = {
                'User-Agent': random.choice(self.USER_AGENT_LIST),  # pick one at random
                'Host': '',
                'Cookie': (
                    'navCtgScroll=0; navCtgScroll=0; '
                    '_lxsdk_cuid=171f22248aac8-0997e5681a77d9-376b4502-1fa400-171f22248abc8; '
                    '_lxsdk=171f22248aac8-0997e5681a77d9-376b4502-1fa400-171f22248abc8; '
                    '_hc.v=e8d6b5d2-6fac-becb-d1e9-8f2d9e0a75e3.1588905266; cye=xian; '
                    '_dp.ac.v=205fd5cc-9ba6-4b29-932d-cc13b3e6244f; ua=dpuser_5832767585; '
                    'ctu=a9f247ab89a4ee779f162d4b6923fc08fb12285c5ae4076901c1d499b665bee2; '
                    's_ViewType=10; fspop=test; '
                    'Hm_lvt_602b80cf8079ae6591966cc70a3940e7=1591178871,1591179190,1591595589,1592359696; '
                    '_lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; '
                    'logan_session_token=4zjh7es2zv6d3liq9tvn; logan_custom_report=; '
                    'default_ab=shopList%3AA%3A5; cityid=17; '
                    'pvhistory="6L+U5ZuePjo8L3N1Z2dlc3QvZ2V0SnNvbkRhdGE/Y2FsbGJhY2s9anNvbnBfMTU5MjM2Mjg5MTgyMl80ODIyNz46PDE1OTIzNjI4OTA0MzFdX1s="; '
                    'm_flash2=1; PHOENIX_ID=0a49a8ba-172c03ac912-85fcc; _tr.u=msMl25dgCdFtGAsS; '
                    '_tr.s=oLyNfoSoVWA7xm17; cy=17; _lxsdk_s=172c13e1aa7-a9a-f22-225%7C%7C20; '
                    'Hm_lpvt_602b80cf8079ae6591966cc70a3940e7=1592379974'
                ),
            }
            html = requests.get(url, headers=head)
            html.encoding = "UTF-8"
            print("STATUS:==>", html)
            # print(html.headers)
            page_url = html.url
            if 'verify' in page_url:
                print("CAPTCHA page encountered; please complete the verification")
                print(page_url)
                return False
            # find the stylesheet that declares the obfuscated fonts
            r1 = r'<link rel="stylesheet" type="text/css" href="(.*?)">'
            jia_mi_font_link = [x for x in re.findall(r1, html.text, re.S) if 'svgtextcss' in x]
            dict_secret_key_value = {}
            if jia_mi_font_link:
                jia_mi_font_link_href = "http:" + jia_mi_font_link[0]
                jia_mi_css_text = requests.get(jia_mi_font_link_href).text  # request the font stylesheet
                # collect the obfuscated font files
                woff_url = re.findall(r'(//s3plus\.meituan\.net/.{,100}?woff)', jia_mi_css_text)
                secret_href = ["http:" + x for x in set(woff_url)]
                print("Obfuscated font files ==>", secret_href)
                list_secret = []
                for x in secret_href:
                    file_name = x[x.rfind("/") + 1:]  # font file name
                    print("000--------", file_name)
                    if os.path.exists(doc_path + "/" + file_name):
                        print("111--------", file_name)  # already downloaded
                    else:
                        content = requests.get(x).content  # download the font
                        with open(doc_path + "/" + file_name, "wb") as f:
                            f.write(content)
                    list_secret.append(self.font_convert(doc_path + "/" + file_name))  # decode it
                for x in list_secret:
                    print("==>", x)
                    dict_secret_key_value.update(x)  # final code-to-character dictionary
                print(dict_secret_key_value)
            print()
            str_html_base = html.text
            for k, v in dict_secret_key_value.items():
                str_html_base = str_html_base.replace(k, v)  # replace obfuscated codes with decoded characters
            # print(str_html_base)
            print()
            doc = pq(str_html_base)
            div_li = doc("#shop-all-list > ul > li").items()
            list_shop_msg = []
            for x in div_li:
                shop_li = x("div.txt > div.tit > a").attr("href")  # shop link
                shop_id = x("div.txt > div.tit > a").attr("data-shopid")  # shop ID: /shop/<shop ID>
                shop_name = x("div.txt > div.tit > a").attr("title")  # shop name
                shop_star = x("div.txt > ment > div > div.star_score.star_score_sml").text()  # star rating
                shop_recommend_temp = x("div.txt > div.recommend").text()  # recommended dishes
                # default to an empty list so shops without recommendations do not raise NameError
                shop_recommend = shop_recommend_temp.replace("推荐菜: ", "").split(" ") if shop_recommend_temp else []
                shop_total = x("div.txt > ment > a.review-num").text().replace("\n", "").replace("条点评", "")  # number of reviews
                shop_avg = x("div.txt > ment > a.mean-price").text().replace("\n", "").replace("人均 ¥", "")  # average spend per person
                shop_tag = x("div.txt > div.tag-addr > a:nth-child(1) > span.tag").text().replace("\n", "")  # category
                shop_area = x("div.txt > div.tag-addr > a:nth-child(3) > span").text().replace("\n", "")  # business district
                shop_address = x("div.txt > div.tag-addr > span").text().replace("\n", "")  # shop address
                shop_taste = x("div.txt > span > span:nth-child(1)").text().replace("\n", "").replace("口味", "")  # taste score
                shop_environment = x("div.txt > span > span:nth-child(2)").text().replace("\n", "").replace("环境", "")  # environment score
                shop_server = x("div.txt > span > span:nth-child(3)").text().replace("\n", "").replace("服务", "")  # service score
                dict_shop = {
                    "ID": shop_id,
                    "shopLi": shop_li,
                    "shopName": shop_name,
                    "shopStar": shop_star,
                    "shopRecommend": shop_recommend,
                    "shopTotal": shop_total,
                    "shopAvg": shop_avg,
                    "shopTag": shop_tag,
                    "shopArea": shop_area,
                    "shopAddress": shop_address,
                    "shopTaste": shop_taste,
                    "shopEnvironment": shop_environment,
                    "shopServer": shop_server,
                }
                msg = json.dumps(dict_shop, ensure_ascii=False)
                list_shop_msg.append(dict_shop)
                print(msg)
            print("-" * 50)
            print(json.dumps(list_shop_msg, ensure_ascii=False))

        def font_convert(self, file_name):
            """
            Parse a font file downloaded from the site and return the mapping
            between its codes and the characters they display.
            :param file_name: the obfuscated woff font file
            :return: {'&#xe105;': '2'}
            """
            tt_font = TTFont(file_name)  # open the obfuscated font file
            codeList = tt_font.getGlyphOrder()[2:]
            # draw on a canvas
            im = Image.new("RGB", (1800, 1000), (255, 255, 255))
            dr = ImageDraw.Draw(im)
            image_font = ImageFont.truetype(file_name, 40)
            count = 15
            list_img = numpy.array_split(codeList, count)  # split the list into 15 parts so the glyphs are drawn on separate rows
            for t in range(count):
                newList = [i.replace("uni", "\\u") for i in list_img[t]]
                text = "".join(newList)
                text = text.encode('utf-8').decode('unicode_escape')
                dr.text((0, 50 * t), text, font=image_font, fill="#000000")
            im.save(file_name.replace(".woff", "") + ".jpg")  # save the image locally so it can be opened and checked by hand
            im = Image.open(file_name.replace(".woff", "") + ".jpg")
            # Tesseract data path; not needed if the path is already in the system environment variables
            testdata_dir_config = '--tessdata-dir "D:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
            result = pytesseract.image_to_string(im, config=testdata_dir_config, lang="chi_sim")  # chi_sim = Simplified Chinese
            # print("===>", result)
            result = result.replace(" ", "").replace("\n", "")  # the OCR output contains spaces and newlines
            codeList = [i.replace("uni", "&#x") + ";" for i in codeList]  # Dianping's rule: "uni" in the glyph name becomes "&#x"
            return dict(zip(codeList, list(result)))  # mapping such as {'&#xe105;': '2'}

        def run(self, page_num: int):
            for i in range(1, page_num + 1):
                # city link format: /city pinyin/ch10 (food)/p1 (page number)
                self.get("/xian/ch10/p" + str(i))


    if __name__ == '__main__':
        dzr = DaZhongFoodList()
        dzr.run(1)

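The code is for study and reference only and must not be used commercially; otherwise the user alone bears the consequences. Please credit the source when reposting.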
