Handling tables in Python: how do you handle tables, lists, headings, etc.?

Date: 2023-08-31 08:27:44

You can use a tool such as python-goose, which is designed to extract articles from HTML pages.

In addition, I wrote the following small program, which works quite well:

from html5lib import parse

# Parse the page into an lxml tree without XHTML namespaces on the tags.
with open('page.html') as f:
    doc = parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]


def sanitize(element):
    """Return all the text contained in an element as a single line.

    This must only be called on blocks whose children are all inline
    elements.
    """
    # Join all the text fragments and remove newlines.
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # Collapse runs of whitespace into a single space.
    out = ' '.join(out.split())
    return out


def iter_blocks(element):
    # These elements can contain other blocks.
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            # No direct text: descend into the children.
            for child in element:
                yield from iter_blocks(child)
        else:
            yield sanitize(element)
    # These elements are "guaranteed" to contain only inline elements.
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        # Everything else (tables, scripts, comments, ...) is skipped.
        print('> ignored', element.tag)


# Print only the blocks longer than 80 characters.
for text in iter_blocks(body):
    if len(text) > 80:
        print(text)
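The program above reports table elements as ignored, so table cells never reach the output. As a rough sketch of how tables could be covered by the same block-by-block idea, here is a variant that uses only the standard library's html.parser. It is not the original author's method: the names BLOCK_TAGS, BlockTextParser and extract_blocks are hypothetical, and the choice of which tags count as blocks is an assumption made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as text blocks; adjust as needed.
BLOCK_TAGS = {'p', 'li', 'td', 'th', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class BlockTextParser(HTMLParser):
    """Collect (tag, text) pairs for paragraphs, headings, list items and table cells."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (tag, text) pairs, in document order
        self._tag = None   # block tag currently open, if any
        self._parts = []   # text fragments seen inside the open block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            # A new block starts: store whatever the previous one collected.
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._flush()

    def handle_data(self, data):
        # Text outside any block tag is dropped.
        if self._tag is not None:
            self._parts.append(data)

    def _flush(self):
        # Collapse newlines and runs of whitespace into single spaces.
        text = ' '.join(''.join(self._parts).split())
        if self._tag is not None and text:
            self.blocks.append((self._tag, text))
        self._tag, self._parts = None, []

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html_text):
    """Return a list of (tag, text) pairs found in html_text."""
    parser = BlockTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.blocks
```

Keeping the tag next to each piece of text lets the caller decide what to do with it, for example printing only the 'td' and 'th' entries to see the table contents, or applying the length filter only to 'p' blocks. Text that follows a nested block inside the same parent (such as the tail of a list item after a nested list) is dropped in this sketch.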
