Scraping Administrative Divisions with Python and Storing Them in a MySQL Database

Date: 2022-05-09 02:36:48

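Scraping the site directly is strongly discouraged. If frequent requests disrupt the service, you may be held responsible, so proceed with caution. A better option is to buy the corresponding data table on Taobao. This article is for learning purposes only.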
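If you do run a scraper for learning purposes, spacing out the requests is the simplest way to reduce load on the server. The sketch below is not part of the original script: `Throttle` is a hypothetical helper name, and the one-second interval used in the note after it is an arbitrary assumption, not a limit published by the site.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

One way to use it would be to create `throttle = Throttle(1.0)` once and call `throttle.wait()` immediately before every `req.get(...)` in the scraper.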

Source URL for the administrative divisions:

http://www./tjsj/tjbz/tjyqhdmhcxhfdm//index.html

MySQL table structure

CREATE TABLE `region` (
  `code` varchar(32) NOT NULL COMMENT 'administrative code',
  `name` varchar(128) NOT NULL COMMENT 'name',
  `parent_code` varchar(32) NOT NULL COMMENT 'parent administrative code',
  PRIMARY KEY (`code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='administrative divisions';
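Because every row stores its parent's code, the table forms a tree that can be walked upward from any region. The sketch below shows the idea on an in-memory dictionary rather than on MySQL. The function name `full_path` and the placeholder names are assumptions for illustration; the codes `41`, `4109` and `410923` are the ones that appear in the script's commented-out test calls, and `'0'` is the parent code the script assigns to provinces.

```python
def full_path(code, regions):
    """Return the names from the top-level region down to `code`.

    `regions` maps code -> (name, parent_code), mirroring the table columns.
    """
    names = []
    seen = set()  # guards against accidental cycles in the data
    while code != '0' and code in regions and code not in seen:
        seen.add(code)
        name, parent = regions[code]
        names.append(name)
        code = parent
    return list(reversed(names))


# Placeholder rows: the names are made up, only the code structure matters.
regions = {
    '41': ('province-41', '0'),
    '4109': ('city-4109', '41'),
    '410923': ('county-410923', '4109'),
}

print(full_path('410923', regions))
```

With real data, the dictionary would be filled from a `select code, name, parent_code from region` query.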

Code

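The run may stall partway through, so the script queries the database to check what has already been processed. This avoids duplicate work, and after a mid-run failure the script can simply be rerun to continue where it left off.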
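Besides resuming from the database, a stalled request can also be retried in place. The helper below is a minimal sketch of retrying with exponential backoff. It is not in the original script: `with_retries` and its parameters are hypothetical names, and the delays are arbitrary defaults.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

In the scraper this could wrap each download, for example `with_retries(lambda: req.get(url, timeout=10))`, where the `timeout` argument makes a hung connection raise an error instead of blocking forever. Catching `requests.RequestException` instead of `Exception` would be a narrower choice.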

import re

import requests as req
from bs4 import BeautifulSoup

from region_reptile.region_db import RegionDB

city_Prefix = 'http://www./tjsj/tjbz/tjyqhdmhcxhfdm//'
region_db = RegionDB()


# Fetch the province list from the index page
def province_list():
    data = req.get(f'{city_Prefix}index.html')
    data.encoding = 'utf-8'
    html = data.text
    soup = BeautifulSoup(html, 'html.parser').find('table', 'provincetable')
    provinces = soup.select('tr.provincetr > td')
    array = []
    for province in provinces:
        print(province.text)
        c_array = [province.find('a').get('href').replace('.html', ''), province.find('a').text, '0']
        array.append(c_array)
        # Skip regions whose children are already stored, so a rerun resumes.
        # The code is quoted because parent_code is a varchar column.
        if region_db.query(f"select code from region where parent_code='{c_array[0]}'") is not None:
            continue
        city_list(c_array[0], province.find('a').get('href'))
    region_db.save_all("insert ignore into region values(%s, %s, %s)", array)


# Fetch the city list for a province
def city_list(province_id, province_href):
    data = req.get(f'{city_Prefix}{province_href}')
    data.encoding = 'utf-8'
    html = data.text
    soup = BeautifulSoup(html, 'html.parser').find('table', 'citytable')
    citys = soup.select('tr.citytr > td:nth-of-type(2)')
    array = []
    for city in citys:
        print(city.text)
        c_array = [city.find('a').get('href').split('/')[1].replace('.html', ''), city.find('a').text, province_id]
        array.append(c_array)
        if region_db.query(f"select code from region where parent_code='{c_array[0]}'") is not None:
            continue
        county_list(c_array[0], build_href(province_href, city.find('a').get('href')))
    region_db.save_all("insert ignore into region values(%s, %s, %s)", array)


# Fetch the county/district list for a city
def county_list(city_id, city_href):
    data = req.get(f'{city_Prefix}{city_href}')
    data.encoding = 'utf-8'
    html = data.text
    soup = BeautifulSoup(html, 'html.parser').find('table', 'countytable')
    countys = soup.select('tr.countytr > td:nth-of-type(2)')
    array = []
    for county in countys:
        print(county.text)
        # Rows without a link have no child page and are skipped
        if county.find('a') is None:
            continue
        c_array = [county.find('a').get('href').split('/')[1].replace('.html', ''), county.find('a').text, city_id]
        array.append(c_array)
        if region_db.query(f"select code from region where parent_code='{c_array[0]}'") is not None:
            continue
        town_list(c_array[0], build_href(city_href, county.find('a').get('href')))
    region_db.save_all("insert ignore into region values(%s, %s, %s)", array)


# Fetch the town/township list for a county
def town_list(country_id, country_href):
    data = req.get(f'{city_Prefix}{country_href}')
    data.encoding = 'utf-8'
    html = data.text
    soup = BeautifulSoup(html, 'html.parser').find('table', 'towntable')
    towns = soup.select('tr.towntr > td:nth-of-type(2)')
    array = []
    for town in towns:
        print(town.text)
        c_array = [town.find('a').get('href').split('/')[1].replace('.html', ''), town.find('a').text, country_id]
        array.append(c_array)
        if region_db.query(f"select code from region where parent_code='{c_array[0]}'") is not None:
            continue
        village_list(c_array[0], build_href(country_href, town.find('a').get('href')))
    region_db.save_all("insert ignore into region values(%s, %s, %s)", array)


# Fetch the village/neighborhood committee list for a town
def village_list(town_id, town_href):
    data = req.get(f'{city_Prefix}{town_href}')
    data.encoding = 'utf-8'
    html = data.text
    soup = BeautifulSoup(html, 'html.parser').find('table', 'villagetable')
    villages = soup.select('tr.villagetr')
    array = []
    for village in villages:
        print(village.text)
        # Column 1 holds the full code, column 3 the name
        c_array = [village.select('td:nth-of-type(1)')[0].text, village.select('td:nth-of-type(3)')[0].text, town_id]
        array.append(c_array)
    region_db.save_all("insert ignore into region values(%s, %s, %s)", array)


# Replace the file name at the end of the parent href with the child's relative href
def build_href(p_href, c_href):
    return re.sub(r'[0-9]*\.html$', '', p_href) + c_href


# city_list('41', '41.html')
# county_list('4109', '41/4109.html')
# town_list('410923', '41/09/410923.html')
# village_list('410923206', '41/09/23/410923206.html')
# town_list('110101', '11/01/110102.html')
province_list()
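The listing above imports `RegionDB` from `region_reptile.region_db`, but the article does not show that module. The sketch below is only a guess at what it might look like, inferred from the two calls the script makes: `query(sql)`, whose result is compared with `None`, and `save_all(sql, rows)`. The use of PyMySQL and the connection settings are assumptions and placeholders, not the author's actual configuration.

```python
class RegionDB:
    """Hypothetical stand-in for region_reptile.region_db.RegionDB."""

    def __init__(self, connection=None):
        # Any DB-API connection can be passed in; by default open one with PyMySQL.
        self.conn = connection if connection is not None else self._connect()

    @staticmethod
    def _connect():
        import pymysql  # imported lazily so another driver can be injected instead

        # Placeholder settings: replace with your own host, user and database.
        return pymysql.connect(host='localhost', user='root', password='',
                               database='test', charset='utf8mb4')

    def query(self, sql, params=None):
        """Return the first matching row, or None when nothing matches."""
        cur = self.conn.cursor()
        try:
            cur.execute(sql, params)
            return cur.fetchone()
        finally:
            cur.close()

    def save_all(self, sql, rows):
        """Insert many rows in one batch and commit."""
        if not rows:
            return
        cur = self.conn.cursor()
        try:
            cur.executemany(sql, rows)
            self.conn.commit()
        finally:
            cur.close()
```

Since `query` accepts parameters, the resume check could also be written as `region_db.query("select code from region where parent_code=%s", (code,))`, which avoids building SQL with string formatting.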

Output

北京市
天津市
市辖区
和平区
...
造甲城镇
1710921造甲城村委会
1710920大王台村委会
17109203220冯家台村委会
17109204220付家台村委会
17109205220东小王台村委会
17109206220西小王台村委会
17109207220田辛庄村委会
17109208220赵温庄村委会
