Python Web Scraping in Practice + Data Analysis + Data Visualization (Autohome, 汽车之家)

Time: 2021-01-21 08:05:38

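With economic growth and technological progress, a car has become a must-have means of transport for nearly every family. On top of that, owning a car and a home is now widely treated as a precondition for marriage, which quietly adds to the pressure on men, many of whom need a car urgently. The used car market has also boomed in recent years and gives buyers another route to owning a car. So in this post I run a detailed visual analysis of the used cars listed on Autohome (汽车之家) for Jiangsu Province, to offer some guidance to prospective buyers.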

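I. The Crawler

Crawler notes:

1. The crawler code is organized in an object-oriented style.

2. The scraped data is stored in a MongoDB database (a converted .xlsx file is also provided).

3. The crawler code is commented in detail.

4. The crawler uses used cars in Jiangsu Province as its example.

Code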

import re

import requests
from lxml import etree
from pymongo import MongoClient


class CarHomeSpider(object):
    def __init__(self):
        self.start_url = '/jiangsu/list/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
        }
        self.url_temp = '/jiangsu/{}/a0_0msdgscncgpi1ltocsp{}exx0/?pvareaid=102179#currengpostion'
        self.client = MongoClient()
        self.collection = self.client['test']['car_home']

    def get_url_list(self, sign, total_count):
        # Build the URL of every page in one price category
        url_list = [self.url_temp.format(sign, i) for i in range(1, int(total_count) + 1)]
        return url_list

    def parse(self, url):
        resp = requests.get(url, headers=self.headers)
        return resp.text

    def get_content_list(self, raw_html):
        resp_html = etree.HTML(raw_html)
        car_list = resp_html.xpath('//ul[@class="viewlist_ul"]/li')
        for car in car_list:
            item = {}
            # Title of the listing
            card_name = car.xpath('.//h4[@class="card-name"]/text()')
            card_name = card_name[0] if len(card_name) > 0 else ''
            car_series = re.findall(r'(.*?) \d{4}款', card_name)
            item['car_series'] = car_series[0].replace(' ', '') if len(car_series) > 0 else ''
            car_time_style = re.findall(r'.*? (\d{4})款', card_name)
            item['car_time_style'] = car_time_style[0] if len(car_time_style) > 0 else ''
            car_detail = re.findall(r'\d{4}款 (.*)', card_name)
            item['car_detail'] = car_detail[0].replace(' ', '') if len(car_detail) > 0 else ''
            # Detail line of the listing: four fields separated by '/'
            card_unit = car.xpath('.//p[@class="cards-unit"]/text()')
            card_unit = card_unit[0].split('/') if len(card_unit) > 0 else []
            # Pad to four fields so a short or missing detail line does not raise IndexError
            card_unit += [''] * (4 - len(card_unit))
            item['car_run'] = card_unit[0]
            item['car_push'] = card_unit[1]
            item['car_place'] = card_unit[2]
            item['car_rank'] = card_unit[3]
            # Price of the car
            car_price = car.xpath('./@price')
            item['car_price'] = car_price[0] if len(car_price) > 0 else ''
            print(item)
            self.save(item)

    def save(self, item):
        # Collection.insert() was removed in PyMongo 4; insert_one() replaces it
        self.collection.insert_one(item)

    def run(self):
        # Request the start page first to get the category links
        rest = self.parse(self.start_url)
        rest_html = etree.HTML(rest)
        # Use the price categories, e.g. under 3万, 3-5万, 5-8万, 8-10万, 10-15万,
        # 15-20万, 20-30万, 30-50万, over 50万 (1万 = 10,000 yuan)
        price_area_list = rest_html.xpath('//div[contains(@class,"condition-price")]//div[contains(@class,"screening-base")]/a')
        if price_area_list:
            for price_area in price_area_list:
                price_area_text = price_area.xpath('./text()')[0]
                price_area_link = '' + price_area.xpath('./@href')[0]
                # Request each category URL and read its total page count
                rest_ = self.parse(price_area_link)
                rest_html_ = etree.HTML(rest_)
                total_count = rest_html_.xpath('//div[@id="listpagination"]/a[last()-1]/text()')[0]
                # Extract the unique identifier of the category URL
                sign = re.findall(r'jiangsu/(.*?)/#pvareaid', price_area_link)[0]
                # Generate the URLs of all pages in this category
                url_list = self.get_url_list(sign, total_count)
                for url in url_list:
                    raw_html = self.parse(url)
                    self.get_content_list(raw_html)


if __name__ == '__main__':
    car_home = CarHomeSpider()
    car_home.run()
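To make the title parsing in get_content_list easier to follow, here is a minimal standalone sketch of the same three regular expressions. The function name parse_card_name and the sample title are made up for illustration; the real titles on the site may be formatted differently.

```python
import re


def parse_card_name(card_name):
    """Split a listing title into series, model year and trim details."""
    # Everything before " <4 digits>款" is treated as the series name
    series = re.findall(r'(.*?) \d{4}款', card_name)
    # The four digits in front of 款 are the model year
    year = re.findall(r'.*? (\d{4})款', card_name)
    # Everything after "<4 digits>款 " is the trim description
    detail = re.findall(r'\d{4}款 (.*)', card_name)
    return {
        'car_series': series[0].replace(' ', '') if series else '',
        'car_time_style': year[0] if year else '',
        'car_detail': detail[0].replace(' ', '') if detail else '',
    }


# Hypothetical title, not taken from the scraped data
print(parse_card_name('宝马5系 2018款 530Li 领先型'))
```

A title that does not contain a "款" model year falls through to empty strings for all three fields, which is why the analysis step later has to drop incomplete rows.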

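II. Data Analysis and Visualization

Notes on the data analysis and visualization:

1. This post uses the Flask framework to serve the analysis results and the visualizations.

2. The project structure is shown in a diagram (the image is not reproduced here).

Code

Data analysis code (analysis.py)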

import numpy as np
import pandas as pd
import pymysql
from pymongo import MongoClient  # only needed when reading directly from MongoDB


def pre_process(df):
    """
    Pre-process the data.
    :param df: DataFrame
    :return: DataFrame
    """
    # Strip the unit 万公里 (10,000 km) from the mileage, e.g. '1.2万公里' -> 1.2
    df['car_run'] = pd.to_numeric(df['car_run'].str.replace('万公里', '', regex=False), errors='coerce')
    # Mark rows whose car_push is 未上牌 (not yet registered) as missing
    df['car_push'] = df['car_push'].apply(lambda x: x if not x == "未上牌" else np.nan)
    # Drop every row that contains NaN
    df.dropna(inplace=True)
    return df


def car_brand_count_top10(df):
    """
    Count listings per brand and keep the top ten.
    :param df: DataFrame
    :return: list of [brand, count]
    """
    # Group by car series
    grouped = df.groupby('car_series')['car_run'].count().reset_index().sort_values(by="car_run", ascending=False)[:10]
    data = [[i['car_series'], i['car_run']] for i in grouped.to_dict(orient="records")]
    print(data)
    return data


def car_use_year_count(df):
    """
    Count used cars by how long they have been in use.
    :param df: DataFrame
    :return: list of [usage bucket, count]
    """
    # Extract the year from car_push
    date = pd.to_datetime(df['car_push'])
    df['car_push_year'] = pd.DatetimeIndex(date).year
    # Convert to int (np.int was removed in NumPy 1.24, so use the built-in int)
    df['car_time_style'] = df['car_time_style'].astype(int)
    df['car_push_year'] = df['car_push_year'].astype(int)
    df['cae_use_year'] = df['car_push_year'] - df['car_time_style']
    # Count the cars for each usage time
    grouped = df.groupby('cae_use_year')['car_series'].count().reset_index()
    # Drop negative usage times, then bucket:
    # 0 -> '<1 year', 1-2 -> '1-3 years', 3 and above -> '>3 years'
    grouped = grouped.query('cae_use_year>=0').copy()
    grouped['cae_use_year'] = grouped['cae_use_year'].apply(
        lambda x: '<1 year' if x == 0 else ('1-3 years' if x < 3 else '>3 years'))
    # Sum the counts within each bucket
    grouped_use_year = grouped.groupby('cae_use_year')['car_series'].sum().reset_index()
    data = [[i['cae_use_year'], i['car_series']] for i in grouped_use_year.to_dict(orient="records")]
    print(data)
    return data


def car_place_count(df):
    """
    Count used cars per region.
    :param df: DataFrame
    :return: list of [region, count]
    """
    grouped = df.groupby('car_place')['car_series'].count().reset_index()
    data = [[i['car_place'], i['car_series']] for i in grouped.to_dict(orient="records")]
    print(data)
    return data


def car_month_count(df):
    """
    Count used cars per month.
    :param df: DataFrame
    :return: list of [month, count]
    """
    # Extract the month from car_push
    date = pd.to_datetime(df['car_push'])
    df['car_push_month'] = pd.DatetimeIndex(date).month
    # Group by month
    grouped = df.groupby('car_push_month')['car_series'].count().reset_index()
    data = [[i['car_push_month'], i['car_series']] for i in grouped.to_dict(orient="records")]
    print(data)
    return data


def save(cursor, sql, data):
    result = cursor.executemany(sql, data)
    if result:
        print('Insert succeeded')


if __name__ == '__main__':
    # Option 1: read the data from MongoDB
    # client = MongoClient()
    # collections = client['test']['car_home']
    # cars = collections.find({}, {'_id': 0})
    # Option 2: read the xlsx file (the MongoDB data already converted to xlsx)
    cars = pd.read_excel('./carhome.xlsx', engine='openpyxl')
    df = pd.DataFrame(cars)
    print(df.info())
    print(df.head())
    # Pre-process the data
    df = pre_process(df)
    data1 = car_brand_count_top10(df)  # top 10 brands
    data2 = car_use_year_count(df)     # usage-time buckets
    data3 = car_place_count(df)        # listings per region
    data4 = car_month_count(df)        # listings per month
    # Create the MySQL connection
    conn = pymysql.connect(user='root', password='123456', host='localhost', port=3306, database='car_home', charset='utf8')
    try:
        with conn.cursor() as cursor:
            # Top 10 brands
            sql1 = 'insert into db_car_brand_top10(brand,count) values(%s,%s)'
            save(cursor, sql1, data1)
            # Usage-time buckets. Note that they are stored in db_car_area,
            # the table that the showAreaBar page reads
            sql2 = 'insert into db_car_area(area,count) values(%s,%s)'
            save(cursor, sql2, data2)
            # Listings per region. Note that they are stored in db_car_use_year,
            # the table that the showPie page reads
            sql3 = 'insert into db_car_use_year(year_area,count) values(%s,%s)'
            save(cursor, sql3, data3)
            # Listings per month
            sql4 = 'insert into db_car_month(month,count) values(%s,%s)'
            save(cursor, sql4, data4)
        conn.commit()
    except pymysql.MySQLError as error:
        print(error)
        conn.rollback()
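The usage-time bucketing in car_use_year_count is the least obvious step of the analysis, so here is a minimal pure-Python sketch of the same rule without pandas. The helper names use_year_bucket and count_buckets and the sample year pairs are made up for illustration, and the English bucket labels are an assumption about how the result should be displayed.

```python
from collections import Counter


def use_year_bucket(years):
    """Map a usage time in whole years to a bucket label, or None if negative."""
    if years < 0:
        return None          # model year later than the car_push year: dropped
    if years == 0:
        return '<1 year'
    if years < 3:
        return '1-3 years'   # covers 1 and 2
    return '>3 years'        # covers 3 and above, as in the original code


def count_buckets(pairs):
    """pairs: iterable of (car_push year, model year) integers."""
    labels = (use_year_bucket(push - model) for push, model in pairs)
    return Counter(label for label in labels if label is not None)


# Hypothetical (car_push year, model year) pairs
sample = [(2019, 2019), (2020, 2018), (2021, 2016), (2017, 2018)]
print(count_buckets(sample))
```

Note that a car with exactly three years of use lands in the last bucket, which matches the thresholds in analysis.py even though the label reads "more than three years".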

Data conversion script: MongoDB data to xlsx (to_excle.py)

import pandas as pd
from pymongo import MongoClient


def export_excel(export):
    # Convert the list of dicts into a DataFrame
    df = pd.DataFrame(list(export))
    # Write the sheet; the context manager saves and closes carhome.xlsx on exit
    # (ExcelWriter.save() and the encoding argument were removed in pandas 2.0)
    with pd.ExcelWriter('carhome.xlsx') as writer:
        df.to_excel(writer, index=False)


if __name__ == '__main__':
    # Export the MongoDB data to an xlsx file
    client = MongoClient()
    connection = client['test']['car_home']
    ret = connection.find({}, {'_id': 0})
    data_list = list(ret)
    export_excel(data_list)

Database models (models.py)

from . import db


class BaseModel(object):
    id = db.Column(db.Integer, autoincrement=True, primary_key=True)
    count = db.Column(db.Integer)


# Top 10 brands by number of listings
class CarBrandTop10(BaseModel, db.Model):
    __tablename__ = 'db_car_brand_top10'
    brand = db.Column(db.String(32))


# Usage time of the used cars
class CarUseYear(BaseModel, db.Model):
    __tablename__ = 'db_car_use_year'
    year_area = db.Column(db.String(32))


# Number of used cars per region
class CarArea(BaseModel, db.Model):
    __tablename__ = 'db_car_area'
    area = db.Column(db.String(32))


# Number of used cars per month
class CarMonth(BaseModel, db.Model):
    __tablename__ = 'db_car_month'
    month = db.Column(db.Integer)

Configuration file (config.py)

# Base configuration
class Config(object):
    SECRET_KEY = 'msqaidyq1314'
    SQLALCHEMY_DATABASE_URI = "mysql://root:123456@localhost:3306/car_home"
    SQLALCHEMY_TRACK_MODIFICATIONS = True


class DevelopmentConfig(Config):
    DEBUG = True


class ProductConfig(Config):
    pass


# Map configuration names to configuration classes
config_map = {
    'develop': DevelopmentConfig,
    'product': ProductConfig
}
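As a small illustration of how a name-to-class mapping like config_map can be used, the sketch below picks a configuration by name and fails loudly on an unknown name. The helper pick_config and the environment variable CAR_HOME_ENV are hypothetical and not part of the original project, and the classes are trimmed to the DEBUG flag to keep the sketch self-contained.

```python
import os


class Config(object):
    DEBUG = False


class DevelopmentConfig(Config):
    DEBUG = True


class ProductConfig(Config):
    pass


config_map = {'develop': DevelopmentConfig, 'product': ProductConfig}


def pick_config(name=None):
    """Return the config class for name.

    Falls back to the (hypothetical) CAR_HOME_ENV variable, then to 'develop'.
    """
    name = name or os.environ.get('CAR_HOME_ENV', 'develop')
    try:
        return config_map[name]
    except KeyError:
        raise ValueError('unknown config name: %r' % name)


print(pick_config('product').DEBUG)
```

Raising a ValueError with the offending name is one possible design choice; a bare KeyError from the dictionary lookup would also stop the application but is harder to read in a log.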

Main package code (api_1_0/__init__.py)

from flask import Flask
from flask_sqlalchemy import SQLAlchemy
import pymysql
from config import config_map

pymysql.install_as_MySQLdb()
db = SQLAlchemy()


def create_app(config_name='develop'):
    # Initialize the app object
    app = Flask(__name__)
    config = config_map[config_name]
    app.config.from_object(config)
    # Bind the database to the app
    db.init_app(app)
    # Register the blueprint
    from . import api_1_0
    app.register_blueprint(api_1_0.api, url_prefix="/show")
    return app

Main program file (manager.py)

from car_home import create_app, db
# MigrateCommand and Flask-Script support require Flask-Migrate < 3.0
from flask_migrate import Migrate, MigrateCommand
from flask_script import Manager
from flask import render_template

app = create_app()
manager = Manager(app)
Migrate(app, db)
manager.add_command('db', MigrateCommand)


@app.route('/')
def index():
    return render_template('index.html')


if __name__ == '__main__':
    manager.run()

View files (api_1_0/views/__init__.py, show.py)

__init__.py

from flask import Blueprint
from car_home import models

api = Blueprint('api_1_0', __name__)

# Imported last so that show.py can import api from this package
from . import show

show.py

from . import api
from car_home.models import CarArea, CarUseYear, CarBrandTop10, CarMonth
from flask import render_template


# Top 10 brands by number of listings
@api.route('/showBrandBar')
def showBrandBar():
    car_brand_top10 = CarBrandTop10.query.all()
    brand = [i.brand for i in car_brand_top10]
    count = [i.count for i in car_brand_top10]
    print(brand)
    print(count)
    return render_template('showBrandBar.html', **locals())


# Data read from the db_car_use_year table
@api.route('/showPie')
def showPie():
    car_use_year = CarUseYear.query.all()
    data = [{'name': i.year_area, 'value': i.count} for i in car_use_year]
    return render_template('showPie.html', **locals())


# Data read from the db_car_area table
@api.route('/showAreaBar')
def showAreaBar():
    car_area = CarArea.query.all()
    area = [i.area for i in car_area]
    count = [i.count for i in car_area]
    return render_template('showAreaBar.html', **locals())


# Number of used cars per month
@api.route('/showLine')
def showLine():
    car_month = CarMonth.query.all()
    month = [i.month for i in car_month]
    count = [i.count for i in car_month]
    return render_template('showLine.html', **locals())
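The views hand two different shapes of data to the templates: a list of name/value dictionaries for the pie chart, and two parallel lists for the bar and line charts. The sketch below shows both shapes without a database. The Row class, the helper names, and the sample rows are stand-ins made up for illustration, and json.dumps only approximates what the tojson filter writes into the page.

```python
import json
from dataclasses import dataclass


@dataclass
class Row:
    """Stand-in for a model instance with a label column and a count column."""
    label: str
    count: int


def to_pie_data(rows):
    # Pie series take a list of {'name': ..., 'value': ...} objects
    return [{'name': r.label, 'value': r.count} for r in rows]


def to_axis_data(rows):
    # Bar and line charts take the categories and the values as parallel lists
    return [r.label for r in rows], [r.count for r in rows]


# Hypothetical rows, not taken from the scraped data
rows = [Row('A', 3), Row('B', 5)]
print(json.dumps(to_pie_data(rows), ensure_ascii=False))
print(to_axis_data(rows))
```

Passing these values explicitly to render_template, rather than through **locals(), would make it clearer which names each template depends on.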

Home page (index.html)

The home page simply creates four hyperlinks that point to the corresponding charts.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Autohome Visual Analysis</title>
    <style>
        ul {
            width: 800px;
            height: 600px;
            {#list-style: none;#}
            line-height: 60px;
            padding: 40px;
            margin: auto;
        }
        ul li {
            margin-bottom: 20px;
        }
    </style>
</head>
<body>
<ul>
    <li><a href="{{ url_for('api_1_0.showBrandBar') }}"><h3>Top 10 brands by number of listings</h3></a></li>
    <li><a href="{{ url_for('api_1_0.showPie') }}"><h3>Usage time of the used cars</h3></a></li>
    <li><a href="{{ url_for('api_1_0.showAreaBar') }}"><h3>Number of used cars per region</h3></a></li>
    <li><a href="{{ url_for('api_1_0.showLine') }}"><h3>Number of used cars per month</h3></a></li>
</ul>
</body>
</html>

Template files (showAreaBar.html, showBrandBar.html, showLine.html, showPie.html)

showPie.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Number of used cars by region</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="width: 800px;height: 600px;margin: auto"></div>
<script>
    var MyCharts = echarts.init(document.querySelector('.cart'), 'vintage')
    var data = {{data|tojson }}
    var option = {
        title: {
            text: 'Number of used cars by region',
            textStyle: {fontSize: 21, fontFamily: '楷体'},
            left: 10,
            top: 10
        },
        legend: {name: ['Region'], left: 10, bottom: 10, orient: 'vertical'},
        tooltip: {
            trigger: 'item',
            triggerOn: 'mousemove',
            formatter: function (arg) {
                return 'Region: ' + arg.name + "<br>" + "Count: " + arg.value + "<br>" + "Share: " + arg.percent + "%"
            }
        },
        series: [{
            type: 'pie',
            data: data,
            name: 'Usage time',
            label: {show: true},
            radius: ['50%', '80%'],
            {#roseType:'radius'#}
            itemStyle: {borderWidth: 2, borderRadius: 10, borderColor: '#fff'},
            selectedMode: 'multiple',
            selectedOffset: 20
        }]
    }
    MyCharts.setOption(option)
</script>
</body>
</html>

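Conclusion: The pie chart shows that Suzhou has the most used cars for sale in Jiangsu Province, followed by Nanjing. This suggests that the more economically developed a city is, the larger its used car market.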

showBrandBar.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Top 10 brands by number of listings</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="height: 600px;width: 800px;margin: auto"></div>
<script>
    var MyCharts = echarts.init(document.querySelector('.cart'), 'vintage')
    var brand = {{brand|tojson }}
    var count = {{count|tojson }}
    var option = {
        title: {
            text: 'Top 10 brands by number of listings',
            textStyle: {fontSize: 21, fontFamily: '楷体'},
            left: 10,
            top: 10
        },
        xAxis: {
            type: 'category',
            data: brand,
            axisLabel: {interval: 0, rotate: 30, margin: 20}
        },
        legend: {name: ['Car brand']},
        yAxis: {type: 'value', scale: true},
        tooltip: {
            trigger: 'item',
            triggerOn: 'mousemove',
            formatter: function (arg) {
                return 'Brand: ' + arg.name + '<br>' + 'Count: ' + arg.value
            }
        },
        series: [{
            type: 'bar',
            data: count,
            name: 'Car brand',
            label: {show: true, position: 'top', rotate: true},
            showBackground: true,
            backgroundStyle: {color: 'rgba(180,180,180,0.2)'}
        }]
    }
    MyCharts.setOption(option)
</script>
</body>
</html>

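Conclusion: The bar chart shows that used cars in Jiangsu Province are mainly BMW, Mercedes-Benz, and Audi. BMW has the most listings, with the BMW 5 Series and the BMW 3 Series in first and second place.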

showLine.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Number of used car listings per month</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="width: 800px;height: 600px;margin: auto"></div>
<script>
    var MyCharts = echarts.init(document.querySelector('.cart'), 'vintage')
    var month = {{month|tojson }}
    var count = {{count|tojson }}
    var option = {
        title: {
            text: 'Number of used car listings per month',
            textStyle: {fontSize: 21, fontFamily: '楷体'},
            left: 10,
            top: 10
        },
        xAxis: {
            type: 'category',
            data: month,
            axisLabel: {interval: 0, rotate: 30, margin: 20}
        },
        legend: {name: ['Count']},
        tooltip: {
            trigger: 'axis',
            triggerOn: 'mousemove',
            formatter: function (arg) {
                return 'Month: ' + arg[0].name + "<br>" + 'Count: ' + arg[0].value
            }
        },
        yAxis: {type: 'value', scale: true},
        series: [{
            type: 'line',
            name: 'Count',
            data: count,
            label: {show: true},
            showBackground: true,
            backgroundStyle: {color: 'rgba(180,180,180,0.2)'},
            markPoint: {
                data: [
                    {
                        name: 'Max',
                        type: 'max',
                        symbolSize: [40, 40],
                        symbolOffset: [0, -20],
                        label: {show: true, formatter: function (arg) { return arg.name }}
                    },
                    {
                        name: 'Min',
                        type: 'min',
                        symbolSize: [40, 40],
                        symbolOffset: [0, -20],
                        label: {show: true, formatter: function (arg) { return arg.name }}
                    }
                ]
            },
            markLine: {
                data: [{
                    type: "average",
                    name: 'Average',
                    label: {show: true, formatter: function (arg) { return arg.name + ':\n' + arg.value }}
                }]
            }
        }]
    }
    MyCharts.setOption(option)
</script>
</body>
</html>

Conclusion: The line chart shows that January has the most used car listings and February the fewest, and that most months fall below the average.
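The same reading can be checked numerically instead of by eye. The sketch below computes the average, the months below it, and the busiest and quietest months from a month-to-count mapping. The function name summarize_months is hypothetical and the sample counts are invented purely to show the arithmetic; they are not the scraped results.

```python
def summarize_months(counts):
    """counts: dict mapping month number -> number of listings."""
    average = sum(counts.values()) / len(counts)
    return {
        'average': average,
        # Months whose count is strictly below the average
        'below_average': sorted(m for m, c in counts.items() if c < average),
        'max_month': max(counts, key=counts.get),
        'min_month': min(counts, key=counts.get),
    }


# Invented counts for four months, for illustration only
sample = {1: 90, 2: 10, 3: 30, 4: 30}
print(summarize_months(sample))
```

One very large month pulls the average up, which is how most months can end up below it, as in the chart.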

showAreaBar.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Usage time of the used cars</title>
    <script src="../static/js/echarts.min.js"></script>
    <script src="../static/js/vintage.js"></script>
</head>
<body>
<div class="cart" style="width: 800px;height: 600px;margin: auto"></div>
<script>
    var MyCharts = echarts.init(document.querySelector('.cart'), 'vintage')
    var area = {{area|tojson }}
    var count = {{count|tojson }}
    var option = {
        title: {
            text: 'Usage time of the used cars',
            textStyle: {fontSize: 21, fontFamily: '楷体'}
        },
        xAxis: {
            type: 'category',
            data: area,
            axisLabel: {interval: 0, rotate: 30, margin: 10}
        },
        legend: {name: ['Car brand']},
        yAxis: {type: 'value', scale: true},
        tooltip: {
            trigger: 'item',
            triggerOn: 'mousemove',
            formatter: function (arg) {
                return 'Years: ' + arg.name + "<br>" + 'Count: ' + arg.value
            }
        },
        series: [{
            type: 'bar',
            data: count,
            name: 'Car brand',
            label: {show: true, position: 'top', rotate: 30, distance: 15},
            barWidth: '40%',
            showBackground: true,
            backgroundStyle: {color: 'rgba(180,180,180,0.2)'}
        }]
    }
    MyCharts.setOption(option)
</script>
</body>
</html>

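Conclusion: The bar chart shows that most used cars in Jiangsu Province have been in use for less than one year, while relatively few have been in use for more than three years. If you find you don't like the car you bought, sell it early!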

The project source code is below. I hope it helps; if you have any questions, leave a comment below.

Flask project code link
