1. Page analysis
URL: /#/search/m/?s=许嵩&type=1
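The URL above carries the search keyword 许嵩 in raw form, while the scraping script later passes it percent-encoded as %E8%AE%B8%E5%B5%A9. A minimal sketch of that conversion using only the standard library (the variable names are illustrative, not from the original post):

```python
from urllib.parse import quote, unquote, urlencode

keyword = "许嵩"

# Percent-encode the keyword as UTF-8 bytes
encoded = quote(keyword)
print(encoded)

# Build the same query string the scraper uses after the "#" fragment
fragment = "/#/search/m/?" + urlencode({"s": keyword, "type": 1})
print(fragment)

# Decoding recovers the original keyword
print(unquote(encoded))
```

Building the query with urlencode avoids hand-typing the encoded form when you want to search for a different artist.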
Inspecting the page shows that all of the song information sits under the div tag with class="srchsongst".
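To make the structure concrete, here is a small offline sketch of selecting the child divs of the class="srchsongst" container. It uses the standard library's xml.etree.ElementTree instead of lxml, and the SAMPLE markup is a simplified, hypothetical stand-in, not the real page source, which is more deeply nested.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup for illustration only
SAMPLE = """
<html><body>
  <div class="srchsongst">
    <div><a href="/song?id=111"><b>Song A</b></a></div>
    <div><a href="/song?id=222"><b>Song B</b></a></div>
  </div>
</body></html>
"""


def extract_songs(markup):
    """Return (song_id, title) pairs, one per child div of the container."""
    root = ET.fromstring(markup.strip())
    rows = root.findall(".//div[@class='srchsongst']/div")
    songs = []
    for row in rows:
        link = row.find(".//a")
        # The id is whatever follows the last "=" in the href
        song_id = link.get("href").split("=")[-1]
        title = link.find("b").text
        songs.append((song_id, title))
    return songs


print(extract_songs(SAMPLE))
```

ElementTree only accepts well-formed XML and supports a limited XPath subset, so for the real page a tolerant HTML parser such as lxml's etree.HTML is the better fit.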
2. Scraping the song information
If installing selenium raises an error, see: /weixin_43746433/article/details/95237254
from selenium import webdriver
from lxml import etree
import csv


def get_info(url):
    # Path to your chromedriver.exe
    chrome_driver = r"D:\Python\Anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe"
    # Selenium 3 style calls; newer Selenium 4 releases use Service(...)
    # and find_elements(By.TAG_NAME, 'iframe') instead
    driver = webdriver.Chrome(executable_path=chrome_driver)
    driver.maximize_window()
    driver.get(url)
    driver.implicitly_wait(10)
    # The search results are rendered inside the first iframe
    iframe = driver.find_elements_by_tag_name('iframe')[0]
    driver.switch_to.frame(iframe)
    html = etree.HTML(driver.page_source)
    driver.quit()  # the page source is captured, so the browser can be closed
    # Each child div of the container holds one song
    infos = html.xpath('//div[@class="srchsongst"]/div')
    for info in infos:
        song_id = info.xpath('div[2]/div/div/a/@href')[0].split('=')[-1]
        song = info.xpath('div[2]/div/div/a/b/text()')[0]
        # string(.) joins the text of all nested nodes into one singer string
        singer1 = info.xpath('div[4]/div/a')[0]
        singer = singer1.xpath('string(.)')
        album = info.xpath('div[5]/div/a/@title')[0]
        print(song_id, song, singer, album)
        writer.writerow([song_id, song, singer, album])


if __name__ == '__main__':
    # The with block closes music.csv once scraping finishes
    with open('music.csv', 'w', newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(['song_id', 'song', 'singer', 'album'])
        url = '/#/search/m/?s=%E8%AE%B8%E5%B5%A9&type=1'
        get_info(url)
Preview of the resulting file
3. Scraping the lyrics
Each song's lyrics are fetched through the lyrics API URL, using the song id and name read from the CSV file scraped in the previous step.
import os
import re
import json
import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}


def get_info(song_id):
    # Request the lyrics for one song id
    res = requests.get('/api/song/lyric?id={}&lv=1&kv=1&tv=-1'.format(song_id), headers=headers)
    json_data = json.loads(res.text)
    lyric = json_data['lrc']['lyric']
    # Remove the bracketed [..] timestamp prefix on each line
    lyric = re.sub(r'\[.*\]', '', lyric)
    return str(lyric)


def txt():
    data = pd.read_csv('music.csv')
    # Make sure the output folder exists before writing into it
    os.makedirs('歌词', exist_ok=True)
    for i in range(len(data['song_id'])):
        # One text file per song, named after the song
        with open(r'歌词/{}.txt'.format(data['song'][i]), 'w', encoding='utf-8') as fp:
            fp.write(get_info(data['song_id'][i]))


txt()
Scraping succeeded!
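The timestamp-stripping step in get_info can be tried offline without calling the API. The sketch below is an assumption-laden variant, not the script's own regex: it matches only [mm:ss.xx]-style tags instead of the greedy \[.*\], and it also drops lines that become empty. The sample string is made-up placeholder text, not real lyrics.

```python
import re

# Matches tags such as [00:12.34], [00:12.345] or [00:12]
TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?:[.:]\d{1,3})?\]")


def strip_timestamps(lrc_text):
    """Remove LRC time tags and drop lines left empty afterwards."""
    lines = []
    for line in lrc_text.splitlines():
        text = TIMESTAMP.sub("", line).strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


# Placeholder LRC-style text for illustration
sample = (
    "[00:00.000] credits line\n"
    "[00:12.34]first line\n"
    "[00:15.10][00:45.10]second line\n"
    "[00:20.00]\n"
)
print(strip_timestamps(sample))
```

Unlike the greedy pattern, this one leaves any other bracketed text in a line untouched, which may or may not be what you want for the later word cloud.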
4. Data analysis
4.1 Basic overview of the data
The scrape returned 175 songs by Xu Song (许嵩) in total, a truly prolific original singer-songwriter~
4.2 Number of tracks per album
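Early in his career Xu Song was an internet singer, so the songs from that period are all grouped under 许嵩单曲集 (his singles collection). The albums he released afterwards, 苏格拉没有底 and 寻雾启示, are both excellent.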
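The per-album counts can be computed from the album column of music.csv. The following is only a rough sketch with the standard library; the post does not show how its chart was produced, and the in-memory rows below are hypothetical placeholders standing in for the real file.

```python
import csv
import io
from collections import Counter


def count_tracks_per_album(csv_file):
    """Count rows per value of the 'album' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["album"] for row in reader)


# Hypothetical rows with the same header as music.csv
sample = io.StringIO(
    "song_id,song,singer,album\n"
    "1,Track A,Artist,Album X\n"
    "2,Track B,Artist,Album X\n"
    "3,Track C,Artist,Album Y\n"
)
counts = count_tracks_per_album(sample)

# Albums sorted from most to fewest tracks
for album, n in counts.most_common():
    print(album, n)
```

For the real data, pass open('music.csv', newline='', encoding='utf-8') instead of the StringIO object.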
4.3 Word cloud
For how to draw the word cloud, see: /weixin_43746433/article/details/89856014