Python in Practice, Scraping Bilibili | Tracing Detective Conan's Main Storyline Through Its Danmaku

Date: 2022-08-17 01:50:59

Original title: Python in Practice, Scraping Bilibili | Tracing Detective Conan's Main Storyline Through Its Danmaku

Author: 皖渝 (Wan Yu)

Source: 凹凸数据 (Aotu Data)

https://mp./s/kVsQmTIh-okzH9WZRBC0FA

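Scraping Overview

Capturing traffic with Chrome's developer tools shows that Bilibili stores each video's danmaku (bullet comments) as an XML document, as shown below (3,000 live danmaku in total).

Its URL is:

As the URL shows, the CID corresponds to the ID of each video, so the next step is to extract it with a regular expression.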
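As a rough illustration of this step, the sketch below pulls `cid` values out of a page with a regular expression and reads the comment text out of a danmaku XML document. It works on inline sample strings rather than live requests. The `"cid": <number>` pattern and the `<i><d p="...">text</d></i>` layout are assumptions about the page source and the XML file, and the helper names `extract_cids` and `parse_danmaku` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET


def extract_cids(page_source):
    """Return the distinct cid values in a page, in order of first appearance."""
    seen = []
    # Assumed pattern: the page embeds JSON fields such as "cid":12345
    for cid in re.findall(r'"cid":\s*(\d+)', page_source):
        if cid not in seen:
            seen.append(cid)
    return seen


def parse_danmaku(xml_text):
    """Return the text of every <d> element in a danmaku XML document."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter('d') if d.text]


# Inline samples standing in for a real page and a real danmaku file
sample_page = '{"episodes":[{"cid":111,"title":"1"},{"cid": 222,"title":"2"}]}'
sample_xml = '<i><d p="1.0,1,25">真相只有一个</d><d p="2.5,1,25">新一!</d></i>'

cids = extract_cids(sample_page)
comments = parse_danmaku(sample_xml)
print(cids)
print(comments)
```

In a real run, each extracted `cid` would be substituted into the danmaku URL and the response saved as one text file per episode.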

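All of the resulting danmaku files end up in the "柯南" folder on the desktop:

Note: 980 danmaku files were scraped in total. [On Bilibili, Detective Conan jumps from episode 941 straight to episode 994 (available to premium members only). Although the series is listed as updated through episode 1032, there are not actually 1032 episodes of content, as the figure below shows.]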

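Danmaku Visualization

1. Total mentions of the main characters

Counting mentions per character

Note: role.txt is the file of main character names. Danmaku rarely refer to a character by full name and mostly use nicknames, so the file needs to list those nicknames; otherwise the counts may differ considerably from reality.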
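To make the nickname point concrete, here is a minimal sketch of one possible way to fold several nicknames into a single canonical name while counting. It is not what the article's code does (that code counts each name listed in role.txt separately). The `ALIASES` table and the function name `count_characters` are hypothetical, and the input is assumed to be already segmented into tokens.

```python
from collections import Counter

# Hypothetical nickname table: every alias maps to one canonical name.
ALIASES = {
    '新一': '新一',
    '工藤新一': '新一',
    '灰原哀': '灰原哀',
    '哀酱': '灰原哀',
    '小哀': '灰原哀',
}


def count_characters(tokenized_lines, aliases=ALIASES):
    """Count character mentions, merging nicknames into canonical names."""
    counts = Counter()
    for tokens in tokenized_lines:
        for token in tokens:
            canonical = aliases.get(token)
            if canonical is not None:
                counts[canonical] += 1
    return counts


# Toy input: three already-segmented danmaku lines
sample = [['哀酱', '好', '可爱'], ['小哀', '和', '新一'], ['工藤新一', '回来', '了']]
result = count_characters(sample)
print(result)
```

Without the alias table, 哀酱 and 小哀 would be counted as two unrelated words and the total for 灰原哀 would be understated.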

import os

import jieba
import pandas as pd

os.chdir('C:/Users/dell/Desktop')
jieba.load_userdict('role.txt')  # keep character names intact during segmentation

# Character names and nicknames, one per line
with open('role.txt', 'r', encoding='utf-8') as f:
    role = [line.strip() for line in f if line.strip()]

txt_all = os.listdir('./柯南/')
txt_all.sort(key=lambda x: int(x.split('.')[0]))  # sort by episode number

def role_count():
    df = pd.DataFrame()
    for count, chapter in enumerate(txt_all, start=1):
        names = {}
        with open('./柯南/{}'.format(chapter), 'r', encoding='utf-8') as f:
            for line in f:
                for word in jieba.cut(line):
                    if word in role:
                        names[word] = names.get(word, 0) + 1
        # One column per episode, one row per character
        df_new = pd.DataFrame.from_dict(names, orient='index', columns=['{}'.format(count)])
        df = pd.concat([df, df_new], axis=1)
        print('第{}集人物统计完毕'.format(count))
    # Transpose so that each row is an episode; the index column is labeled
    # 'episode' because the later steps call set_index('episode')
    df.T.to_csv('role_count.csv', encoding='gb18030', index_label='episode')

role_count()

Visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['font.sans-serif'] = ['kaiti']  # a font that can render the Chinese labels
plt.style.use('ggplot')

# Read with the same encoding that was used to write the file
df = pd.read_csv('role_count.csv', encoding='gb18030')
df = df.fillna(0).set_index('episode')

plt.figure(figsize=(10, 5))
# Total mentions per character across all episodes, highest first
role_sum = df.sum().to_frame().sort_values(by=0, ascending=False)
g = sns.barplot(x=role_sum.index, y=role_sum[0], palette='Set3', alpha=0.8)

# Write the total above each bar
index = np.arange(len(role_sum))
for x, value in zip(index, role_sum[0]):
    g.text(x, value + 50, int(value), ha='center', va='bottom')

plt.title('B站名侦探柯南弹幕——主要人物讨论总次数分布')
plt.ylabel('讨论次数')
plt.show()

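Although he is the eternal elementary schooler, Conan does occasionally turn back into Shinichi, and the plot is not only "find the culprit, catch the culprit". Next, we use the data to dig out some of the standout episodes.

2. Episodes in which Conan turns back into Shinichi

Because Shinichi appears only in flashbacks in some episodes, the mention threshold is set to 250 to reduce bias, which gives the distribution below:

The mention counts and episode titles are listed in the following table:

Interested readers may want to bookmark these: except for episode 235, all of them are episodes in which Conan turns back into Shinichi. The relevant code is as follows: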

# Continues the session above (pd and plt are already imported)
df = pd.read_csv('role_count.csv', encoding='gb18030')
df = df.fillna(0).set_index('episode')

xinyi = df[df['新一'] >= 250]['新一'].to_frame()
print(xinyi)  # episodes in which Shinichi appears

plt.figure(figsize=(10, 5))
plt.plot(df.index, df['新一'], label='新一', color='blue', alpha=0.6)

# Annotate three of the peaks (episode number, mention count)
plt.annotate('集数:50,讨论次数:309',
             xy=(50, 309),
             xytext=(40, 330),
             arrowprops=dict(color='red', headwidth=8, headlength=8))
plt.annotate('集数:206,讨论次数:263',
             xy=(206, 263),
             xytext=(195, 280),
             arrowprops=dict(color='red', headwidth=8, headlength=8))
plt.annotate('集数:571,讨论次数:290',
             xy=(571, 290),
             xytext=(585, 310),
             arrowprops=dict(color='red', headwidth=8, headlength=8))

# Dashed line marking the threshold of 250 mentions
plt.hlines(xmin=df.index.min(), xmax=df.index.max(), y=250, linestyles='--', colors='red')
plt.legend(loc='best', frameon=False)
plt.xlabel('集数')
plt.ylabel('讨论次数')
plt.title('工藤新一讨论次数分布图')
plt.show()
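The annotation coordinates in the listing above are typed in by hand. As a hedged alternative, the sketch below derives the (episode, count) pairs at or above a threshold from the data itself. It uses a plain dictionary with toy numbers, not the article's results, and the function name `episodes_above` is made up for this example.

```python
def episodes_above(counts, threshold):
    """Return (episode, count) pairs at or above threshold, highest count first."""
    hits = [(episode, n) for episode, n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: episode number -> mention count
sample_counts = {1: 120, 2: 260, 3: 40, 4: 310, 5: 250}
peaks = episodes_above(sample_counts, threshold=250)

# Build the same label text that the plot annotations use
labels = ['集数:{},讨论次数:{}'.format(episode, n) for episode, n in peaks]
for label in labels:
    print(label)
```

With the real data, `df['新一'].to_dict()` yields a dictionary of the same shape, and each pair can be passed to `plt.annotate` as `xy=(episode, n)`.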

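Taking episode 572, the most discussed one, we draw a word cloud (with the high-frequency word "新一" removed so that it does not crowd out other information), shown below:

The figure shows that high-frequency words include 整容 (plastic surgery), 服部 (Hattori), 声音 (voice), and 爱情 (love). (It seems the culprit had surgery to look like Shinichi before committing the crime, and there is some Shinichi and Ran romance as well, so it is worth watching.)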
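The article does not show how the word cloud input was prepared. A minimal sketch of the counting step, under the assumption that the danmaku have already been segmented into tokens, could look like this. The function name `word_frequencies` and the `min_length` cutoff are illustrative choices, not taken from the original.

```python
from collections import Counter


def word_frequencies(tokens, exclude=(), min_length=2):
    """Count tokens, skipping excluded words and very short tokens."""
    excluded = set(exclude)
    return Counter(
        token for token in tokens
        if len(token) >= min_length and token not in excluded
    )


# Toy token stream standing in for one episode's segmented danmaku
tokens = ['新一', '整容', '新一', '服部', '整容', '的', '声音', '新一']
freqs = word_frequencies(tokens, exclude=['新一'])
print(freqs.most_common(3))
```

If the `wordcloud` package is used for rendering, a mapping like `freqs` can be handed to `WordCloud.generate_from_frequencies`; a font that supports Chinese has to be supplied for the words to display.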

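3. Analysis of the main-plot episodes

The main plot revolves around the members of the organization: Gin (琴酒), Vodka (伏特加), and Vermouth (贝尔摩德). Their mention counts are plotted as follows: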

# Continues the session above (df, pd and plt are already defined)
plt.figure(figsize=(10, 5))
names = ['琴酒', '伏特加', '贝姐']
colors = ['#090707', '#004e66', '#EC7357']
alphas = [0.8, 0.7, 0.6]

for name, color, alpha in zip(names, colors, alphas):
    plt.plot(df.index, df[name], label=name, color=color, alpha=alpha)
plt.legend(loc='best', frameon=False)

# Annotate the episode in which 贝姐 (Vermouth) is mentioned most
plt.annotate('集数:{},讨论次数:{}'.format(df['贝姐'].idxmax(), int(df['贝姐'].max())),
             xy=(df['贝姐'].idxmax(), df['贝姐'].max()),
             xytext=(df['贝姐'].idxmax() + 30, df['贝姐'].max()),
             arrowprops=dict(color='red', headwidth=8, headlength=8))

plt.xlabel('集数')
plt.ylabel('讨论次数')
plt.title('酒厂成员讨论次数分布图')
# Dashed line marking the threshold of 200 mentions
plt.hlines(xmin=df.index.min(), xmax=df.index.max(), y=200, linestyles='--', colors='red')
plt.ylim(0, 400)
plt.show()

# Print the main-plot episodes (Vodka's counts are negligible and are ignored)
mainline = set(list(df[df['贝姐'] >= 200].index) + list(df[df['琴酒'] >= 200].index))
print(mainline)

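The plot above shows that mentions of the organization's members rise and fall largely together. Of the three, Vermouth (nicknamed 贝姐) is the most popular, especially in episode 375 (part of the "Head-to-Head Confrontation with the Black Organization" arc), where her mention count reaches 379.

In addition, the episodes in which the mention count exceeds 200 are as follows:

Taking episode 375, which has the highest mention count, we draw a word cloud (with the high-frequency word "贝姐" removed so that it does not crowd out other information), shown below:

The figure shows that words such as 天使 (angel), 琴酒 (Gin), 干妈 (godmother), 心疼 (heartache), and 狙击手 (sniper) appear frequently. The lower-frequency words 败北 (defeat) and 主线 (main plot) suggest that this operation by the organization, which fans call the "liquor factory" (酒厂), ended in failure.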

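Character Image Network Analysis

1. Merging the txt files

The goal is to capture how danmaku viewers describe each character as fully as possible. Since each episode has 3,000 danmaku, only the 20 episodes in which a given character is mentioned most are merged and analyzed, which keeps the run time manageable.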

import os

import pandas as pd

df = pd.read_csv('role_count.csv', encoding='gb18030')
df = df.fillna(0).set_index('episode')

# The 20 episodes in which 灰原哀 (Haibara Ai) is mentioned most
huiyuan_ep = list(df.sort_values(by='灰原哀', ascending=False).index[:20])

mergefiledir = 'C:/Users/dell/Desktop/柯南'

count = 0
with open('txt_all.txt', 'w', encoding='UTF-8') as file:
    for filename in huiyuan_ep:
        filepath = os.path.join(mergefiledir, str(filename) + '.txt')
        with open(filepath, encoding='UTF-8') as f:
            for line in f:
                file.write(line)
        file.write('\n')  # keep the last line of one file apart from the next file
        count += 1
        print('第{}集写入完毕'.format(count))

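2. Visualizing character images

This step borrows the idea of a co-occurrence matrix: whenever two specified words appear in the same sentence, the count for that pair goes up by 1.

With the starting node (Source) set to 灰原哀 (Haibara Ai), the code is as follows: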

import codecs
import csv

import jieba

linesName = []     # segmented danmaku, one list of words per line
names = {}         # word -> number of occurrences
relationship = {}  # word -> {co-occurring word: count}

jieba.load_userdict('role.txt')
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    txt = {line.strip() for line in f}
with open('role.txt', 'r', encoding='utf-8') as f:
    name_list = [line.strip() for line in f if line.strip()]

def base(path):
    """Segment every line of the merged file and register each non-stopword."""
    with codecs.open(path, 'r', 'UTF-8') as f:
        for line in f:
            line = line.rstrip('\r\n')
            poss = jieba.cut(line)
            linesName.append([])
            for word in poss:
                if word in txt:
                    continue
                linesName[-1].append(word)
                if names.get(word) is None:
                    names[word] = 0
                    relationship[word] = {}
                names[word] += 1
    return linesName, relationship

def relationships(linesName, relationship, name_list):
    """Count, for each character name, the words that share a line with it."""
    for line in linesName:
        for name1 in line:
            if name1 in name_list:
                for name2 in line:
                    if name1 == name2:
                        continue
                    if relationship[name1].get(name2) is None:
                        relationship[name1][name2] = 1
                    else:
                        relationship[name1][name2] += 1
    return relationship

def write_csv(relationship):
    """Write edges with a weight above 10 in the Source/Target/Weight layout."""
    with open('edges.csv', 'w', encoding='gb18030', newline='') as csv_writer2:
        writer = csv.writer(csv_writer2, delimiter=',', lineterminator='\n')
        writer.writerow(['Source', 'Target', 'Weight'])
        for name, edges in relationship.items():
            for k, v in edges.items():
                if v > 10:
                    writer.writerow([name, k, v])

if __name__ == '__main__':
    linesName, relationship = base('txt_all.txt')
    data = relationships(linesName, relationship, name_list)
    # The original listing is cut off partway through the call above; writing
    # the edge list with write_csv is the presumed final step.
    write_csv(data)

Note: stopwords.txt is the stop-word file, and role.txt is the file of character nicknames.
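To show the co-occurrence counting on something small enough to check by hand, here is a minimal sketch on toy data. It assumes the lines are already segmented and stop words already removed. The function name `cooccurrence_edges` and the `min_weight` parameter are illustrative only.

```python
from collections import defaultdict


def cooccurrence_edges(tokenized_lines, sources, min_weight=1):
    """Return (source, target, weight) edges for words sharing a line with a source."""
    weights = defaultdict(int)
    for tokens in tokenized_lines:
        for source in tokens:
            if source not in sources:
                continue
            for target in tokens:
                if target != source:
                    weights[(source, target)] += 1
    return [(s, t, w) for (s, t), w in weights.items() if w >= min_weight]


# Toy input: four segmented danmaku lines
lines = [['灰原哀', '可爱'], ['灰原哀', '心疼'], ['可爱', '灰原哀'], ['琴酒', '变态']]
edges = cooccurrence_edges(lines, sources={'灰原哀'})
for edge in edges:
    print(edge)
```

Each tuple corresponds to one row of the Source, Target, Weight layout. A higher `min_weight` drops rare pairs, which is the role the `v > 10` check plays in the full listing.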

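Importing the generated file into Gephi gives the following character image graph:

The thicker an edge, the more strongly that trait is associated with the character. It is easy to see that viewers mainly describe Ai-chan (哀酱) as 美腻 (gorgeous), 可爱 (cute), and 心疼 (heart-aching).

Here is another one for Gin:

Haha, viewers' descriptions of Gin are rather more comical: 变态 (pervert), 痴汉 (creep), 聪明 (smart), and everything in between. The Gin you imagine versus the Gin you actually get (insert funny face).

That is all for this Python walkthrough!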
