Python English Word Pronunciation: Building Your Own English Word List with a Python Crawler

Date: 2018-06-15 17:21:55

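Why build your own word list

I have used all sorts of vocabulary apps, and I always end up with someone else's word list or the one bundled with the app: you take whatever is on offer and have almost no control over it. Recently I came across a 20,000-word list extracted from COCA by word frequency, but it has no matching word database. 知米 (Zhimi) does let you import English words directly and fills in the phonetic transcription, pronunciation, example sentences and definitions for you, but you cannot export the matched results.

In particular, I have recently been planning to use AnkiDroid for memorizing words, so I need a word list of my own. Going one step further, I might even write my own vocabulary app. "The first step of the Long March is to build the word list"; I will take it one step at a time.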

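Requirements for the word list

Based on these needs, the word list should contain the following:

English: the English word itself

Phonetics and pronunciation: American phonetic transcription and audio, British phonetic transcription and audio

Part of speech and Chinese meaning: each sense of the word with its part of speech and Chinese translation

Example sentences: sentences that use the word

Mnemonics: word roots or other notes that help with memorization

The output is simply a text file, which is easy to process in other ways later.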
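
The requirements above can be sketched as a small record type. This is only an illustration: the class name `WordEntry`, its field names and the trailing mnemonic column are assumptions, while the `||` and `||||` separators and the tab-separated layout follow the crawler code later in the article.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class WordEntry:
    """One vocabulary record; the class and field names are illustrative."""
    word: str
    # (phonetic transcription, audio URL) pairs, e.g. US first, then UK
    phonetics: List[Tuple[str, str]] = field(default_factory=list)
    # (part of speech, Chinese meaning) pairs
    senses: List[Tuple[str, str]] = field(default_factory=list)
    # (English sentence, Chinese translation) pairs
    examples: List[Tuple[str, str]] = field(default_factory=list)
    # Assumed extra column; the crawler in this article does not fill it
    mnemonic: str = ""

    def to_line(self) -> str:
        """Serialize the record to one tab-separated line of text."""
        def pack(pairs):
            return "||||".join("%s||%s" % pair for pair in pairs)
        return "\t".join([self.word, pack(self.phonetics),
                          pack(self.senses), pack(self.examples),
                          self.mnemonic])


entry = WordEntry(
    word="apple",
    phonetics=[("US [x]", "us.mp3"), ("UK [y]", "uk.mp3")],
    senses=[("n.", "苹果")],
)
print(entry.to_line())
```

Keeping one record per line means the file can later be split on tabs and converted to whatever a tool such as AnkiDroid expects.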

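Choosing the technology

Word information can currently be obtained from Baidu Translate, Youdao Translate, Bing Translator, Google Translate, iCIBA (金山词霸) and others. After weighing the options, I chose to get the data from the Bing dictionary.

For crawling, Python is currently the most popular and a relatively mature choice (and it is the only language I know), so Python it is.

There are generally two ways to crawl data with Python: python+selenium+chrome and python+urllib+lxml. This is a small project and my capacity for learning is limited, so I did not consider the scrapy framework. I would probably only look at it if I needed to crawl large amounts of varied content on a regular basis. Everything has to be learned, and learning has a cost.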

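python+selenium+chrome

Simulates browser actions, which effectively solves the problem of crawling pages that load their data via ajax

Makes browser-based testing easy

You have to get past a series of selenium pitfalls, so the learning cost is relatively high

python+urllib+lxml

Relatively low learning cost

Inconvenient for ajax and other dynamic pages

Either way you need some skill with regular expressions. Since the Bing dictionary pages are essentially static, I chose python+urllib+lxml, the second of the two options above.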
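
To show why a static page is easy to handle, here is a minimal sketch of pulling fields out of markup with path expressions. To stay dependency-free it swaps in the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet; the markup and class names are assumptions. For real-world HTML, which is rarely well-formed, `lxml.etree.HTML` as used in this article is the better fit, and it offers full XPath on top of a largely ElementTree-compatible API.

```python
import xml.etree.ElementTree as ET

# A made-up static page standing in for a dictionary result.
SAMPLE = """
<html><body>
  <div class="entry">
    <h1>apple</h1>
    <ul>
      <li><span class="pos">n.</span><span class="def">a round fruit</span></li>
      <li><span class="pos">adj.</span><span class="def">apple-flavoured</span></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# All the data is already in the markup, so one path lookup per field is enough.
headword = root.find('.//div[@class="entry"]/h1').text
senses = [
    (li.find('span[@class="pos"]').text, li.find('span[@class="def"]').text)
    for li in root.findall('.//div[@class="entry"]/ul/li')
]
print(headword, senses)
```

With an ajax page the same lookups would come back empty, because the data is only filled in after the browser runs the page's scripts; that is the case selenium addresses.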

Implementation

1. Installing Python and the related environment:

I used Anaconda to set up the whole environment. The details are skipped here; see /p/f452f71860ab

Core code walkthrough

Constructing the URL

The basic structure is simple: /dict/search?q= followed by the word
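
A slightly more defensive way to build the lookup URL is to percent-encode the query, so that phrases with spaces or non-ASCII characters survive. This is a sketch only: `BASE_HOST` is a hypothetical placeholder (the host is not given in the path above) and `build_search_url` is an illustrative name.

```python
from urllib.parse import urlencode

# Hypothetical placeholder host; substitute the real dictionary site.
BASE_HOST = "https://dictionary.example"
SEARCH_PATH = "/dict/search"


def build_search_url(word: str) -> str:
    """Return the lookup URL for a word, percent-encoding the query value."""
    return "%s%s?%s" % (BASE_HOST, SEARCH_PATH, urlencode({"q": word}))


print(build_search_url("apple"))
print(build_search_url("ice cream"))
```

`urlencode` encodes a space as `+` and non-ASCII characters as UTF-8 percent escapes, which plain string concatenation does not do.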

Fetching the page: define a function that takes a word, fetches the corresponding page with urllib, and returns it.

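def get_page(myword):
    # Search path; prepend the dictionary site's scheme and host
    basurl = '/dict/search?q='
    searchurl = basurl + myword
    # Fetch the page and return its raw bytes
    response = urllib.request.urlopen(searchurl)
    html = response.read()
    return html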
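
`get_page` makes a single attempt, so one network hiccup stops a long crawl. A possible remedy, sketched here under assumptions, is a small retry wrapper. The name `fetch_with_retry` and its parameters are illustrative, and `flaky_fetch` is a stand-in used only so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, word, retries=3, delay=1.0):
    """Call fetch(word); on OSError wait and try again, up to `retries` attempts."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(word)
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            last_error = err
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error


# Stand-in for get_page: fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(word):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary failure")
    return b"<html>%s</html>" % word.encode("utf-8")


page = fetch_with_retry(flaky_fetch, "apple", retries=3, delay=0.0)
print(page, calls["n"])
```

In the crawler this would be called as `fetch_with_retry(get_page, word)`.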

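Parsing the page: this mainly uses lxml, with XPath to extract the content. Getting the phonetic transcription is shown below as an example; the other fields are handled similarly.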

def get_yingbiao(html_selector):
    yingbiao = []
    yingbiao_xpath = '/html/body/div[1]/div/div/div[1]/div[1]/div[1]/div[2]/div'  # xpath
    bbb = r"(https\:.*?mp3)"  # regular expression that captures the pronunciation MP3 URL
    reobj1 = re.compile(bbb, re.I | re.M | re.S)
    nodes = html_selector.xpath(yingbiao_xpath)
    for item in nodes:
        it = item.xpath('div')
        if len(it) >= 4:  # skip entries without phonetics or audio
            # it[0] and it[2] hold the phonetic text, it[1] and it[3] the audio links
            ddd = reobj1.findall(it[1].xpath('a')[0].get('onmouseover', ''))
            yingbiao.append("%s||%s" % (it[0].text, ddd[0]))
            ddd = reobj1.findall(it[3].xpath('a')[0].get('onmouseover', ''))
            yingbiao.append("%s||%s" % (it[2].text, ddd[0]))
    if len(yingbiao) > 0:  # join everything into one string, separated by four vertical bars
        return reduce(lambda x, y: "%s||||%s" % (x, y), yingbiao)
    else:
        return ""
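
The regular-expression step can be tried on its own. The sketch below applies the same pattern to a made-up `onmouseover` value (the real attribute text on the page is an assumption here) and guards against a missing attribute or a value with no mp3 link. `extract_mp3` is an illustrative helper name.

```python
import re

# Same pattern as in get_yingbiao: the shortest run from "https:" to "mp3".
MP3_URL = re.compile(r"(https:.*?mp3)", re.I | re.S)


def extract_mp3(attr_value):
    """Return the first mp3 URL found in an attribute string, or '' if none."""
    if not attr_value:  # attribute missing or empty
        return ""
    found = MP3_URL.findall(attr_value)
    return found[0] if found else ""


# Made-up attribute value for illustration only.
sample = "playAudio(this, 'https://audio.example/us/apple.mp3'); return false;"
print(extract_mp3(sample))
print(repr(extract_mp3(None)))
```

The non-greedy `.*?` matters: a greedy `.*` would run on to the last `mp3` in the string if the attribute held more than one link.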

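Batch input and output: the input is a file of English words, one per line; the output is a file containing the word, phonetics, definitions and example sentences, also one word per line.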

filename = 'words.txt'  # input file
with open(filename, "r", encoding="utf-8") as f:
    words = f.readlines()

filename2 = 'words_jieguo.txt'  # output file
with open(filename2, "w", encoding="utf-8") as f:
    for i, word in enumerate(words):
        time.sleep(0.25)  # pause between requests so Bing does not block the crawler
        print(word.rstrip(), i)
        word_line = get_word(word.rstrip())  # look up everything about the word
        f.write("%s\n" % word_line)  # write one line per word
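
Since the point of a text file is later processing, here is a sketch of reading one output line back into structured data. It assumes the layout produced by `get_word` in the full code (word, phonetics, definitions, examples, separated by tabs), and that field text is non-empty and contains no `|` character. `unpack` and `parse_line` are illustrative names.

```python
def unpack(field):
    """Split 'a||b||||c||d' into [('a', 'b'), ('c', 'd')]."""
    if not field:
        return []
    return [tuple(item.split("||", 1)) for item in field.split("||||")]


def parse_line(line):
    """Turn one line of the output file into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    fields = (fields + [""] * 4)[:4]  # pad short lines, ignore extra columns
    word, phonetics, senses, examples = fields
    return {
        "word": word,
        "phonetics": unpack(phonetics),
        "senses": unpack(senses),
        "examples": unpack(examples),
    }


# Made-up sample line in the crawler's output format.
sample = ("apple\t"
          "US [x]||https://audio.example/us.mp3||||UK [y]||https://audio.example/uk.mp3\t"
          "n.||苹果\t"
          "I ate an apple.||我吃了一个苹果。\n")
record = parse_line(sample)
print(record["word"], record["phonetics"][0][1])
```

From a dictionary like this it is a short step to writing whatever import format a flashcard tool expects.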

Full code: a Python 3 implementation; running it under Python 2 needs some minor adjustments.

import urllib.request
from lxml import etree
import re
import time
from functools import reduce


# Fetch the page data
def get_page(myword):
    # Search path; prepend the dictionary site's scheme and host
    basurl = '/dict/search?q='
    searchurl = basurl + myword
    response = urllib.request.urlopen(searchurl)
    html = response.read()
    return html

# Get the definitions of the word
def get_chitiao(html_selector):
    chitiao = []
    hanyi_xpath = '/html/body/div[1]/div/div/div[1]/div[1]/ul/li'
    get_hanyi = html_selector.xpath(hanyi_xpath)
    for item in get_hanyi:
        it = item.xpath('span')
        # it[0] is the part of speech, it[1] wraps the Chinese meaning
        chitiao.append('%s||%s' % (it[0].text, it[1].xpath('span')[0].text))
    if len(chitiao) > 0:
        return reduce(lambda x, y: "%s||||%s" % (x, y), chitiao)
    else:
        return ""

# Get the phonetics and the pronunciation links
def get_yingbiao(html_selector):
    yingbiao = []
    yingbiao_xpath = '/html/body/div[1]/div/div/div[1]/div[1]/div[1]/div[2]/div'
    bbb = r"(https\:.*?mp3)"  # captures the pronunciation MP3 URL
    reobj1 = re.compile(bbb, re.I | re.M | re.S)
    nodes = html_selector.xpath(yingbiao_xpath)
    for item in nodes:
        it = item.xpath('div')
        if len(it) >= 4:  # skip entries without phonetics or audio
            ddd = reobj1.findall(it[1].xpath('a')[0].get('onmouseover', ''))
            yingbiao.append("%s||%s" % (it[0].text, ddd[0]))
            ddd = reobj1.findall(it[3].xpath('a')[0].get('onmouseover', ''))
            yingbiao.append("%s||%s" % (it[2].text, ddd[0]))
    if len(yingbiao) > 0:
        return reduce(lambda x, y: "%s||||%s" % (x, y), yingbiao)
    else:
        return ""

# Get the example sentences
def get_liju(html_selector):
    liju = []
    get_liju_e = html_selector.xpath('//*[@class="val_ex"]')
    get_liju_cn = html_selector.xpath('//*[@class="bil_ex"]')
    # Pair each English sentence with its translation; zip stops at the shorter list
    for en, cn in zip(get_liju_e, get_liju_cn):
        liju.append("%s||%s" % (en.text, cn.text))
    if len(liju) > 0:
        return reduce(lambda x, y: "%s||||%s" % (x, y), liju)
    else:
        return ""

def get_word(word):
    # Fetch the page
    pagehtml = get_page(word)
    selector = etree.HTML(pagehtml.decode('utf-8'))
    # Definitions
    chitiao = get_chitiao(selector)
    # Phonetics and pronunciation
    yingbiao = get_yingbiao(selector)
    # Example sentences
    liju = get_liju(selector)
    return "%s\t%s\t%s\t%s" % (word, yingbiao, chitiao, liju)

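filename = '5.txt'  # input file, one word per line
with open(filename, "r", encoding="utf-8") as f:
    words = f.readlines()

filename2 = '5_jieguo.txt'  # output file
# Text mode with an explicit encoding: writing str to a file opened with "wb" fails in Python 3
with open(filename2, "w", encoding="utf-8") as f:
    for i, word in enumerate(words):
        time.sleep(0.2)
        print(word.rstrip(), i)
        word_line = get_word(word.rstrip())
        f.write("%s\n" % word_line)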

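There are not that many words and time is plentiful, so I did not convert the script to multithreading. With 3600 seconds in an hour and 4-5 words crawled per second, that is up to roughly 14,400-18,000 words an hour, so multithreading would not gain much.

Finally, here are the 10,000 words I crawled, together with the corresponding mp3 files.

Audio (access code: 1386)

Words (access code: 7678)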
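
The back-of-the-envelope estimate can be written down as a tiny helper. This is a sketch: `crawl_hours` and its `request_seconds` parameter are illustrative, and the 0.3 second request time in the second call is a made-up figure, not a measurement.

```python
def crawl_hours(word_count, delay_seconds, request_seconds=0.0):
    """Estimated hours for a sequential crawl making one request per word."""
    return word_count * (delay_seconds + request_seconds) / 3600.0


# Lower bound for 10,000 words with the 0.2 s sleep, ignoring network time
print(round(crawl_hours(10000, 0.2), 2))       # 0.56
# 20,000 words, assuming each request also takes about 0.3 s
print(round(crawl_hours(20000, 0.2, 0.3), 2))  # 2.78
```

Even with a pessimistic request time, the whole list finishes in a few hours, which supports leaving the script single-threaded.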
