
Supermarket Retail Data Analysis: A Big Data Project

Posted: 2020-02-05 11:58:33


[Experiment Data]

The data used in this experiment are one full year of transaction records from a supermarket in China, running from one August 1 to the following August 1, and cover 812,847 transactions, 2,893,385 individual items sold, and 20,154 customers.

The data comprise three datasets.

Transaction Overview

The dataset is stored on HDFS at “/data/13/2/sales_head/sales_head.csv”, with fields separated by tabs. It is also available in Hive as the table “bigdata_cases.retail_sales_head”.

The fields are defined as follows:

BillId: transaction identifier
CustId: customer membership identifier (empty for non-members)
Quantity: number of distinct items in the transaction
TransTime: transaction time
OrigTotalPrice: original total price of the transaction; may include fen (0.01 yuan)
Pay: amount paid by the customer
Change: change given back
ActualTotalPrice: actual total price of the transaction, rounded to the jiao (0.1 yuan)

The first 5 rows of the dataset:

student1@master:~$ hdfs dfs -cat /data/13/2/sales_head/sales_head.csv | head -5
00034121002436593	60018	3	-08-01 07:46:10	8.84	10	1.2	8.8
00034121002436594		1	-08-01 07:46:53	19.59	20	0.5	19.5
00034121002436595		2	-08-01 07:47:35	7	7		7
00034121002436596	60018	3	-08-01 07:48:30	28.01	103	75	28
00034121002436597	600799218	2	-08-01 07:49:32	19.9	20	0.1	19.9
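As an aside: if the R session used later in this experiment ran on a machine with the hdfs command on its PATH (an assumption; the original workflow instead copies the result files to a local directory), such a file could be streamed straight into R via pipe(), along these lines:

> # Hypothetical alternative to copying files locally: stream the data from HDFS.
> # Requires the 'hdfs' CLI to be usable by the R process.
> head5 <- read.table(pipe("hdfs dfs -cat /data/13/2/sales_head/sales_head.csv | head -5"),
+     sep = "\t", stringsAsFactors = FALSE)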

Transaction Details

The dataset is stored on HDFS at “/data/13/2/sales_detail/sales_detail.csv”, with fields separated by tabs. It is also available in Hive as the table “bigdata_cases.retail_sales_detail”.

The fields are defined as follows:

BillId: transaction identifier
RowNo: position of the item within the transaction
TransTime: transaction time
GoodId: item identifier
Barcode: item barcode
GoodName: item name
Unit: sales unit
Quantity: quantity of the item
OrigUnitPrice: original unit price of the item
OrigTotalPrice: original total price of the item
ActualUnitPrice: actual unit price of the item
ActualTotalPrice: actual total price of the item

The first 5 rows of the dataset:

student1@master:~$ hdfs dfs -cat /data/13/2/sales_detail/sales_detail.csv | head -5
00034121002436593	1	-08-01 07:45:38	5440483	2186463	苦瓜(一级)	公斤	0.262	4	1.048	3.6	0.94
00034121002436593	2	-08-01 07:45:39	5440483	2186463	苦瓜(一级)	公斤	0.192	4	0.768	3.6	0.69
00034121002436593	3	-08-01 07:45:45	5440466	2186359	南瓜(一级)	公斤	4.052	1.98	8.023	1.78	7.21
00034121002436594	1	-08-01 07:45:26	5110324	6934665081392	蒙牛益生菌酸牛奶(原味)1.2kg	桶	1	19.59	19.59	19.59	19.59
00034121002436595	1	-08-01 07:47:18	5110467	6901209206146	光明酸牛奶(红枣味)180g	盒	2	3.5	7	3.5	7

Item Information

The dataset is stored on HDFS at “/data/13/2/good/good.csv”, with fields separated by tabs. It is also available in Hive as the table “bigdata_cases.retail_good”.

The fields are defined as follows:

GoodId: item identifier
Category1Name: name of the item's top-level category
Category2Name: name of the item's second-level category
Category3Name: name of the item's third-level category
Category4Name: name of the item's fourth-level category
BrandName: item brand name
GoodName: item name

The first 5 rows of the dataset:

student1@master:~$ hdfs dfs -cat /data/13/2/good/good.csv | head -5
1010226	烟酒纯鲜奶	烟	香烟	进口烟	爱喜	爱喜薄荷香烟20支
1010281	烟酒纯鲜奶	烟	香烟	进口烟	万宝路	万宝路(软红)
1010280	烟酒纯鲜奶	烟	香烟	进口烟	万宝路	万宝路(硬白)
1010151	烟酒纯鲜奶	烟	香烟	杭产烟	阳光利群	利群(阳光)
1010400	烟酒纯鲜奶	烟	香烟	杭产烟	利群	利群(软老板)

[Experiment Step Hints]

In the hint steps below, Step 1 uses Hive for data analysis and data preparation, with all code executed on the big-data compute cluster; Step 2 uses the R language for data visualization.

Data analysis and preparation with Hive

a. Hourly revenue and transaction counts on a weekend day and a weekday

Compute the revenue and the number of transactions for each hour on a weekend day and a weekday; in the data, August 4 falls on a Sunday and August 6 on a Monday.

hive -e "select substr(TransTime, 1, 10) as Date, hour(TransTime) as Hour,
    sum(ActualTotalPrice) as Revenue, count(1) as Transaction
from bigdata_cases.retail_sales_head
where TransTime like '-08-04%' or TransTime like '-08-06%'
group by substr(TransTime, 1, 10), hour(TransTime);" \
> 1.csv

The first 10 rows of the result:

Date	Hour	Revenue	Transaction
-08-04	7	1585.4000000000003	58
-08-04	8	15329.099999999984	395
-08-04	9	12841.179999999991	329
-08-04	10	10570.569999999996	225
-08-04	11	3532.960000000001	112
-08-04	12	3795.7900000000004	76
-08-04	13	3080.2999999999993	57
-08-04	14	6073.070000000002	116
-08-04	15	7045.820000000001	167
-08-04	16	8614.669999999995	207

b. Daily revenue and transaction counts over the year

Compute the revenue and the number of transactions for each day of the year, in preparation for the visualization step.

hive -e "select substr(TransTime, 1, 10) as Date,
    sum(ActualTotalPrice) as Revenue, count(1) as Transaction
from bigdata_cases.retail_sales_head
group by substr(TransTime, 1, 10);" \
> 2.csv

The first 10 rows of the result:

Date	Revenue	Transaction
-08-01	112761.48999999973	2542
-08-02	93819.77000000012	2250
-08-03	118865.15999999989	2570
-08-04	121632.47999999982	2715
-08-05	119397.88999999996	2678
-08-06	109638.58999999944	2358
-08-07	96941.17999999986	2331
-08-08	12885.159999999996	333
-08-09	111507.0099999997	2379
-08-10	127085.07999999999	2539

c. Daily revenue and transaction counts per category over the year

Compute the revenue and the number of transactions for each day and each top-level item category over the year.

hive -e "select substr(a.TransTime, 1, 10) as Date, b.Category1Name as Category,
    sum(ActualTotalPrice) as Revenue, count(1) as Transaction
from bigdata_cases.retail_sales_detail a
inner join bigdata_cases.retail_good b on a.GoodId = b.GoodId
where b.Category1Name <> '#N/A'
group by substr(a.TransTime, 1, 10), b.Category1Name;" \
> 3.csv

The first 10 rows of the result:

Date	Category	Revenue	Transaction
-08-02	干货	16929.53999999998	1070
-08-02	文玩	1906.6899999999998	86
-08-02	蔬菜	20.27	3
-08-03	烟酒	8616.389999999994	591
-08-03	粮油	20934.70000000007	1008
-08-03	纸品	5170.299999999996	469
-08-03	鞋子	1116.11	40
-08-03	饼干	10314.989999999967	1812
-08-04	家电	1846.2400000000002	33
-08-04	服装	2480.0600000000018	65

d. Recency, frequency, and monetary value for all members

For every member, compute the number of days since the most recent purchase (recency), the number of transactions (frequency), and the total amount spent (monetary value).

hive -e "select CustId,
    min(round((unix_timestamp('-08-01 00:00:00') - unix_timestamp(TransTime)) / 3600 / 24)) as Recency,
    count(1) as Frequency, sum(coalesce(ActualTotalPrice, 0)) as Monetary
from bigdata_cases.retail_sales_head
where CustId is not null
group by CustId;" \
> 4.csv

The first 10 rows of the result:

CustId	Recency	Frequency	Monetary
60000309	42.0	1	1232.4000000000003
6000004303	44.0	1	11.7
6000009511	46.0	2	0.0
60000098	14.0	9	204.49
600000984	7.0	31	1159.7600000000002
6000057062	18.0	1	29.3
600008753	100.0	44	1238.8000000000002
600008756	59.0	13	1140.68
600008757	0.0	248	18182.710000000006
600008758	34.0	74	4434.0

Data visualization with R

a. Load the required packages

Load the required packages. Copy the result files produced by Hive to a path accessible to R, such as “D:\workspace\”.

> library(ggplot2)
> library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
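The masking notice is expected: dplyr deliberately overrides filter() and lag() from stats. If the startup chatter is unwanted, it can be silenced; a minimal sketch:

> suppressPackageStartupMessages(library(dplyr))  # same imports, no attach messages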

b. Plot hourly transaction counts for the weekend day and the weekday

Plot smoothed curves of the hourly transaction counts for the weekend day and the weekday. The x-axis is the hour of day, the y-axis is the number of transactions, and the grey band shows the 95% confidence interval of the smoothed curve.

> data1 <- read.table("D:/workspace/1.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data1) <- c("Date", "Hour", "Revenue", "Transaction")
> ggplot(data1, aes(x = Hour, y = Transaction, group = Date)) + geom_smooth(aes(colour = Date),
+     size = 2)
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
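A note for newer installations (an aside, not part of the original transcript): since ggplot2 3.4.0 the size aesthetic for lines is deprecated in favour of linewidth, so on a current setup the call would be written as:

> ggplot(data1, aes(x = Hour, y = Transaction, group = Date)) + geom_smooth(aes(colour = Date),
+     linewidth = 2)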

c. Plot daily transaction counts over the year

Plot a smoothed curve of the daily transaction count over the year. The x-axis is the date, the y-axis is the number of transactions, and the grey band shows the 95% confidence interval of the smoothed curve.

> data2 <- read.table("D:/workspace/2.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data2) <- c("Date", "Revenue", "Transaction")
> data2$Date <- as.POSIXct(data2$Date)
> ggplot(data2, aes(x = Date, y = Transaction)) + geom_smooth(colour = "red",
+     size = 2)
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

d. Plot daily transaction counts per category over the year

Plot smoothed curves of the daily transaction counts over the year for five categories: 粮油 (grain and oil), 干货 (dried goods), 烟酒 (tobacco and liquor), 日化 (household and personal care), and 饼干 (biscuits). The x-axis is the date, the y-axis is the number of transactions, and the grey band shows the 95% confidence interval of each smoothed curve.

> data3 <- read.table("D:/workspace/3.csv", sep = "\t", fileEncoding = "UTF-8")
> names(data3) <- c("Date", "Category", "Revenue", "Transaction")
> data3$Date <- as.POSIXct(data3$Date)
> data3 <- filter(data3, Category == "粮油" | Category == "干货" | Category ==
+     "烟酒" | Category == "日化" | Category == "饼干")
> ggplot(data3, aes(x = Date, y = Transaction, group = Category)) + geom_smooth(aes(colour = Category),
+     size = 1)
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
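The chain of == comparisons can be written more compactly with %in%; the following filter is equivalent:

> data3 <- filter(data3, Category %in% c("粮油", "干货", "烟酒", "日化", "饼干"))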

e. Cluster all customers into 3 groups with the RFM model

Cluster all customers into 3 groups with the k-means algorithm, using recency, frequency, and monetary value as features. Before clustering, take the logarithm of each of the three fields and standardize it.

> data4 <- read.table("D:/workspace/4.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data4) <- c("CustId", "Recency", "Frequency", "Monetary")
> data4 <- filter(data4, Recency >= 0 & Frequency <= 150 & Monetary <= 10000)
> data4$CustId <- as.character(data4$CustId)
> data4.scale <- scale(log1p(data4[, -1]))
> clust <- kmeans(data4.scale, 3)
> expm1(t(t(clust$centers) * attr(data4.scale, "scaled:scale") + attr(data4.scale,
+     "scaled:center")))
     Recency Frequency   Monetary
1 196.915198  2.293729   76.29807
2 121.647601 12.065080  657.18697
3   4.366358 28.047474 1485.28480
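One caveat worth noting: kmeans() initializes its centres randomly, so the cluster numbering, and sometimes the solution itself, can differ between runs. For a reproducible result one would typically seed the RNG and take the best of several restarts, along these lines (a sketch; the seed value is arbitrary):

> set.seed(42)                                  # fix the random initialisation
> clust <- kmeans(data4.scale, 3, nstart = 25)  # keep the best of 25 random starts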

Plot every customer coloured by cluster, with the cluster centres drawn as large hollow circles.

> plot(rbind(as.data.frame(expm1(t(t(data4.scale) * attr(data4.scale, "scaled:scale") +
+     attr(data4.scale, "scaled:center")))), expm1(t(t(clust$centers) * attr(data4.scale,
+     "scaled:scale") + attr(data4.scale, "scaled:center")))), col = c(clust$cluster,
+     1:3), pch = c(rep(1, dim(data4)[1]), rep(10, 3)), cex = c(rep(0.1, dim(data4)[1]),
+     rep(3, 3)))
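The repeated expm1(t(t(x) * attr(..., "scaled:scale") + attr(..., "scaled:center"))) idiom in the two snippets above simply undoes scale() and then log1p(), mapping standardized values back to the original units. A self-contained check on toy data (illustrative only, not from the original):

> # scale() stores its per-column centre and scale as attributes, so the
> # transform can be inverted column-wise; expm1() then undoes log1p().
> x <- matrix(rexp(30), ncol = 3)
> x.scale <- scale(log1p(x))
> x.back <- expm1(t(t(x.scale) * attr(x.scale, "scaled:scale") +
+     attr(x.scale, "scaled:center")))
> all.equal(c(x.back), c(x))  # TRUE: the round trip recovers the raw values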
