Scraping specific data from HTML with Python: fetching web page content, parsing it with regular expressions, and storing it in a MySQL database...

Date: 2019-08-05 19:49:50

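Firefox with the HttpFox add-on can be used to inspect the POST form.

First, enter your username and password at the address /.

After they are submitted, the form is POSTed to the following URL: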

/PLogin.do

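# renren.py (Python 2)
import urllib
import urllib2
import cookielib

# Keep cookies between requests so the session survives the login.
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Form fields sent in the POST body.
postdata = urllib.urlencode({
    'email': '',     # your account
    'password': '',  # your password
})

# Passing data makes urllib2 send a POST instead of a GET.
req = urllib2.Request(
    url='/PLogin.do',
    data=postdata
)

result = opener.open(req)
print result.read()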

At this point you are logged in to Renren.

The output is the HTML source of the logged-in page.
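
For readers on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the same login flow might look roughly like the sketch below. The host example.com, the helper name build_login, and the sample credentials are placeholders introduced for illustration and are not part of the original script. Only the field names email and password come from the code above.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login(login_url, email, password):
    """Return (opener, request) for a cookie-aware form login.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # The cookie jar stores whatever session cookies the server sets.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # In Python 3 the POST body must be bytes, so encode the form string.
    body = urllib.parse.urlencode(
        {'email': email, 'password': password}
    ).encode('utf-8')
    # Supplying data turns the request into a POST.
    request = urllib.request.Request(login_url, data=body)
    return opener, request


# Placeholder URL and credentials (assumptions, replace with real values).
opener, request = build_login(
    'http://example.com/PLogin.do', 'user@example.com', 'secret'
)
print(request.get_method())

# To actually send the request:
# with opener.open(request) as response:
#     print(response.read().decode('utf-8', errors='replace'))
```

Reusing the same opener for later requests sends the stored cookies along, which is what keeps the session logged in.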

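2. Fetching a page and extracting the information you need

Here we use the stock site seekingalpha as an example (sorry, no offense intended). Open SA and get ready to scrape: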

import urllib2

# A plain GET is enough here; no login or cookies are needed.
content = urllib2.urlopen('/symbol/GOOGL?s=googl').read()
print content

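This prints the page for the GOOGL stock.

* Note that no POST is used here, because this site can be viewed without logging in.

Next, work out the regular expression:

Write the regular expression: pattern = re.compile(r'href="/article.*sasource')

This finds every link that points to a commentary page. Printed for GOOG, the results look like this: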

/article/2250373-energetic-moves-for-google

/article/2249173-google-bringing-satellite-internet-to-the-world

/article/2247383-what-googles-self-driving-car-says-about-the-company

/article/2238623-europe-tries-to-censor-google

/article/2236283-google-is-reportedly-mulling-expansion-in-outer-space

/article/2234863-what-will-googles-30-billion-in-foreign-acquisitions-do

/article/2229953-in-defense-of-google-glass

/article/2229163-android-fragmentation-and-the-cloud

/article/2227963-everything-you-need-to-know-about-twitch-tv-and-why-company-could-be-a-steal-for-google

/article/2226203-google-adds-quest-visual-to-its-portfolio-m-and-a-overview

/article/2223103-goog-vs-googl-a-classic-pairs-trade

/article/2222373-google-or-apple-which-is-the-better-long-term-bet

/article/2220023-a-look-at-everything-thats-wrong-with-google-glass

/article/2198683-analysis-of-oral-argument-in-vringo-vs-google-patent-infringement-appeal

/article/2193673-google-investors-can-expect-upside-potential

/article/2191843-google-is-a-stock-to-own-for-the-long-term

/article/2187033-google-7-different-insiders-have-sold-shares-during-the-last-30-days

/article/2169973-google-facing-some-problems-in-the-mobile-advertising-market

/article/2168773-google-strikes-deal-with-buffett-backed-wind-generator

/article/2165243-why-google-has-upside-to-nearly-650

/article/2251473-what-wwdc-says-about-apples-new-products

/article/2251063-how-apples-iphones-might-become-an-indispensable-piece-of-equipment-again

/article/2250973-will-apple-outsmart-google-in-the-internet-of-things

/article/2249683-demand-medias-c-and-m-business-prospects-boosted-by-new-google-search-algorithm-changes

/article/2248843-googles-satellites-pose-threat-to-sirius-xm

/article/2248193-facebook-battling-google-for-eyeballs

/article/2248143-wall-street-breakfast-must-know-news

/article/2246013-apple-something-extraordinary-is-certain

/article/2245693-why-you-shouldnt-believe-the-himax-google-break-up-rumor

/article/2244133-dividends-role-in-wealth-creation-sector-analysis

/article/2242083-the-defensive-portfolio-focusing-on-competitive-advantage

/article/224-vringos-q1-report-shows-mixed-results-is-a-secondary-offering-just-around-the-corner

/article/2241533-is-facebook-at-the-peak-of-its-share-price

/article/2240663-wall-street-breakfast-must-know-news

/article/2240493-blackberry-z3-seems-too-late-to-the-party

/article/2238893-why-apple-beats-partnership-will-change-competitive-landscape-for-music-streaming

/article/2238073-apples-split-what-you-need-to-know

/article/2236983-lady-liberty-rescues-vringo-google-royalty-tab-to-exceed-1_8-billion

/article/2236893-high-time-for-investors-to-buy-into-samsung

/article/2231733-lenovo-making-the-right-strategic-moves-to-build-value
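
The pattern, together with the slicing step the full script applies to each match, can be tried on a small sample without touching the network. The sketch below assumes each link looks like href="/article/..." followed by a space and a sasource attribute, which is what the slice offsets (6 characters at the front, 10 at the end) imply. The sample HTML, the attribute value "demo", and the function names are made up for illustration. A variant with a capturing group is shown as an alternative, because the greedy .* in the original pattern would run through to the last sasource on a line that holds several links.

```python
import re

# Hypothetical sample; real pages will differ.
SAMPLE_HTML = '''
<a href="/article/2250373-energetic-moves-for-google" sasource="demo">A</a>
<a href="/article/2229953-in-defense-of-google-glass" sasource="demo">B</a>
<a href="/symbol/GOOG" sasource="demo">GOOG</a>
'''

# The pattern from the article.
ORIGINAL = re.compile(r'href="/article.*sasource')

# Alternative: capture only the path, stopping at the closing quote.
CAPTURING = re.compile(r'href="(/article/[^"]*)"\s+sasource')


def extract_by_slicing(html):
    """Mimic the article: strip 'href="' (6 chars) and '" sasource' (10 chars)."""
    return [m[6:len(m) - 10] for m in ORIGINAL.findall(html)]


def extract_by_group(html):
    """Let the regex group return the path directly, with no slicing."""
    return CAPTURING.findall(html)


for path in extract_by_group(SAMPLE_HTML):
    print(path)
```

On this sample both functions return the same two /article/ paths. The capturing version does not depend on fixed character offsets, so it keeps working if the markup around the link changes slightly.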

Here is the complete Python code:

# Table commenturl:
# CREATE TABLE `commenturl` (
#   `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
#   `object` varchar(30) DEFAULT NULL,
#   `url` varchar(1024) DEFAULT NULL,
#   PRIMARY KEY (`id`)
# ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# TRUNCATE TABLE commenturl;  -- empties the table and resets AUTO_INCREMENT to 1

import re
import urllib2

import MySQLdb

# Send a browser-like User-Agent so the site does not answer 403 Forbidden.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; '
                         'en-US; rv:1.9.1.6) Gecko/1201 Firefox/3.5.6'}
req = urllib2.Request(url='/symbol/GOOG?s=goog', headers=headers)
content = urllib2.urlopen(req).read()

# Each match starts with href=" and ends with sasource.
links = re.findall(r'href="/article.*sasource', content)

try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306)
    cur = conn.cursor()
    conn.select_db('usr')
    for url in links:
        ct = len(url)
        url = url[6:(ct - 10)]  # drop the leading href=" and the last 10 characters
        url = '' + url          # URL prefix (an empty string in this listing)
        print url
        # Pass the value as a parameter so the driver escapes it.
        cur.execute("INSERT INTO commenturl(object, url) VALUES('GOOG', %s)", (url,))
    conn.commit()
    cur.close()
    conn.close()
except MySQLdb.Error as e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])

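Note: to keep crawlers out, the site may respond with Error 403 Forbidden. In that case the request has to imitate a browser by sending a User-Agent header, like this: req = urllib2.Request(url='/symbol/GOOG?s=goog', headers=headers)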
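
In Python 3 the same idea, attaching a browser-like User-Agent to the request, could be sketched as follows. The User-Agent string is the one from the script above. The host example.com and the names BROWSER_HEADERS, build_request, and fetch are placeholders introduced here, and whether any given site accepts this header is not something this sketch can promise.

```python
import urllib.request

# Same User-Agent string as in the article's script.
BROWSER_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows; U; Windows NT 6.1; '
                   'en-US; rv:1.9.1.6) Gecko/1201 Firefox/3.5.6'),
}


def build_request(url, headers=None):
    """Build a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers or BROWSER_HEADERS)


def fetch(url, timeout=10):
    """Download a page and return it as text (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode('utf-8', errors='replace')


# Placeholder host (assumption); no request is sent until fetch() is called.
req = build_request('http://example.com/symbol/GOOG?s=goog')
print(req.get_header('User-agent'))
```

Note that urllib.request stores header names capitalized as 'User-agent', which is why get_header is called with that spelling.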

In short, the listing above is the complete source code, including the MySQL CREATE TABLE statement.
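
The storage step can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module as a stand-in, so it is not the article's MySQLdb code: the table definition is simplified to SQLite types, the placeholder is ? where MySQLdb uses %s, and the function name store_urls is made up for illustration. What carries over is the pattern of passing values as parameters and committing once after all rows are inserted.

```python
import sqlite3


def store_urls(conn, symbol, urls):
    """Insert one row per URL and return the total row count."""
    # Simplified SQLite version of the commenturl table.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS commenturl ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, '
        'object TEXT, '
        'url TEXT)'
    )
    # "with conn" commits if the block succeeds and rolls back on error.
    with conn:
        conn.executemany(
            'INSERT INTO commenturl(object, url) VALUES (?, ?)',
            [(symbol, url) for url in urls],
        )
    return conn.execute('SELECT COUNT(*) FROM commenturl').fetchone()[0]


# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(':memory:')
count = store_urls(conn, 'GOOG', [
    '/article/2250373-energetic-moves-for-google',
    '/article/2229953-in-defense-of-google-glass',
])
print(count)
```

Passing the symbol as a parameter as well, rather than hard-coding 'GOOG' in the SQL, lets the same function be reused for other tickers.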
