The page I scraped is encoded as "gb2312", so my first instinct was to use that same encoding for reading and decoding it:
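    import re
    from lxml import etree
    import html

    # First attempt: read the saved page as gbk
    with open('test.html', 'r', encoding='gbk') as f:
        c = f.read()

    s = re.sub(r'\n', ' ', c)  # newlines replaced by spaces (s is not used below)
    tree = etree.HTML(c)       # the tree is built from the original text c
    rows = tree.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for row in rows:
        boards = {}
        # etree.tostring() returns bytes, so decode them back to str
        s1 = etree.tostring(row).decode('gbk')
        # turn character references such as &#20013; back into readable text
        s1 = html.unescape(s1)
        print(s1)
        break  # only inspect the first <li>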
Since "gbk" is a superset of "gb2312", I used "gbk"; in practice the result is the same either way.
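A minimal sketch of the superset relationship, using only Python's built-in codecs. The sample strings are my own illustration and are not taken from the scraped page: characters that GB2312 covers encode to the same bytes under both codecs, while a character that only GBK added (here "镕") is rejected by "gb2312".

```python
# Characters covered by GB2312 encode to identical bytes under both codecs.
common = "中文网页"  # illustrative sample text, not from the scraped page
same_bytes = common.encode("gb2312") == common.encode("gbk")
print(same_bytes)

# "镕" is one of the characters GBK added on top of GB2312,
# so the stricter gb2312 codec cannot encode it.
rare = "镕"
try:
    rare.encode("gb2312")
    gb2312_ok = True
except UnicodeEncodeError:
    gb2312_ok = False

# GBK handles it and round-trips cleanly.
gbk_roundtrip = rare.encode("gbk").decode("gbk")
print(gb2312_ok, gbk_roundtrip)
```

This is why choosing "gbk" for a page that declares "gb2312" is usually the safer option: it decodes everything "gb2312" would, plus the extra characters.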
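After going through a lot of blog posts, I found this advice:
Whatever encoding a scraped page uses, convert it to utf-8 before storing it.
I am still not clear on the exact reason, so I will fill that in later.
But for pages in gbk or gb2312, converting to utf-8 when it comes to storage is the right thing to do.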
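The conversion step could look roughly like the sketch below. It assumes the raw bytes of the page are already in hand and that the source encoding is known; `save_as_utf8`, the sample markup, and the temporary file path are hypothetical names I made up for illustration, not part of the original scraper.

```python
import os
import tempfile


def save_as_utf8(raw: bytes, source_encoding: str, path: str) -> None:
    """Decode raw page bytes with their source encoding and store them as utf-8."""
    text = raw.decode(source_encoding)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Stand-in for the bytes a gbk-encoded page would deliver (made-up content).
raw_page = "<li>图书畅销榜</li>".encode("gbk")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")
    save_as_utf8(raw_page, "gbk", path)

    # Once stored as utf-8, the file is always read back with encoding='utf-8'.
    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()

print(restored)
```

Decoding once at the point of download and storing everything as utf-8 means later scripts never have to guess which encoding a given file was saved in.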
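    import re
    from lxml import etree
    import html

    # Same script, but the saved page is now read as utf-8
    with open('test.html', 'r', encoding='utf-8') as f:
        c = f.read()

    s = re.sub(r'\n', ' ', c)  # newlines replaced by spaces (s is not used below)
    tree = etree.HTML(c)       # the tree is built from the original text c
    rows = tree.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for row in rows:
        boards = {}
        # etree.tostring() returns bytes, so decode them back to str
        s1 = etree.tostring(row).decode('utf-8')
        # turn character references back into readable text
        s1 = html.unescape(s1)
        print(s1)
        break  # only inspect the first <li>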
This time the output displays correctly.
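One plausible explanation, which is only my assumption since I have not checked how test.html was actually written to disk, is that the file had already been saved as utf-8, so reading it as gbk misinterprets the bytes. The sketch below reproduces that kind of mismatch with a made-up sample string.

```python
text = "中文"  # illustrative sample, not taken from test.html
utf8_bytes = text.encode("utf-8")

# Decoding utf-8 bytes with the wrong codec gives garbled text
# (errors="replace" keeps the example from raising on invalid byte pairs).
garbled = utf8_bytes.decode("gbk", errors="replace")

# Decoding with the codec that matches how the bytes were written restores the text.
correct = utf8_bytes.decode("utf-8")

print(repr(garbled), repr(correct))
```

The takeaway is that the encoding passed to `open()` has to match how the file was written, which is not necessarily the encoding declared by the original web page.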