UCAS — Data Mining (Course 0812) — Homework Assignments


Foreword: this post records the three homework assignments from the Fall 2024 Data Mining course; the answers are for reference only.

Homework 1

1

Suppose a data warehouse contains four dimensions: date, product, vendor, and location; and two measures: sales_volume and sales_cost.

1) Draw the star schema diagram for this data warehouse.
(figure omitted)
2) Starting from the base cuboid [date, product, vendor, location], list the sales_volume of each vendor in Los Angeles for each year.

roll up on product from basic(key) to all
roll up on location from basic(key) to city
roll up on date from basic(key) to year
slice for location = 'Los Angeles'
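The roll-up and slice steps above can be sketched in plain Python. The fact rows below are hypothetical (the base cuboid's actual data is not given in the assignment); the point is only the shape of the aggregation.

```python
from collections import defaultdict

# Hypothetical fact rows (date, product, vendor, city, sales_volume);
# the real base-cuboid data is not provided in the assignment.
facts = [
    ("1997-03-02", "TV",    "V1", "Los Angeles", 10),
    ("1997-07-15", "Radio", "V1", "Los Angeles",  5),
    ("1998-01-20", "TV",    "V2", "Los Angeles",  8),
    ("1997-05-11", "TV",    "V1", "Chicago",      7),
]

# Slice on location = 'Los Angeles', roll date up to year,
# and roll product up to 'all' (i.e. drop it from the group-by key).
volume = defaultdict(int)
for date, product, vendor, city, sales in facts:
    if city == "Los Angeles":            # slice
        year = date[:4]                  # roll up date -> year
        volume[(year, vendor)] += sales  # product rolled up to all

print(dict(volume))  # {('1997', 'V1'): 15, ('1998', 'V2'): 8}
```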

3) Bitmap indexes are useful for data warehouses. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure.
(figure omitted)

2

Design a data warehouse for a regional weather bureau. The weather bureau has about 1000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central station, which has collected such data for over 10 years. Your design should facilitate efficient querying and online analytical processing, and derive general weather patterns in multidimensional space. (note: please present the schema, the fact table(s) and the dimension tables with concept hierarchy)

(figure omitted)

3

Below are the sales data (in hundreds of yuan) of two supermarket products, A and B, over 20 consecutive months:
A: 21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26.
B: 38, 24, 38, 45, 46, 44, 42, 34, 40, 30, 31, 40, 40, 32, 36, 42, 50, 47, 46, 50.

1)Calculate the mean, median, and standard deviation of the sales data.

Mean = 21.5; median = 21.5; standard deviation ≈ 3.22.

2)Draw the boxplot.

Min = 16, Q1 = 19, median = 21.5, Q3 = 24, Max = 27.
(figure omitted)
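The figures above can be checked with a short Python sketch. Quartile conventions differ between tools; the stated Q1 = 19 and Q3 = 24 match the nearest-rank convention, which is assumed below.

```python
import statistics

A = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26]

mean = statistics.mean(A)      # 21.5
median = statistics.median(A)  # 21.5
std = statistics.pstdev(A)     # population standard deviation, ~3.22

# Five-number summary for the boxplot, using the nearest-rank quartiles
# that the course answer (Q1 = 19, Q3 = 24) implies:
s = sorted(A)
q1 = s[4]    # ceil(0.25 * 20) = 5th smallest value
q3 = s[14]   # ceil(0.75 * 20) = 15th smallest value
print(mean, median, round(std, 2), s[0], q1, q3, s[-1])
```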
3) Normalize the values based on min-max normalization.
(figure omitted)
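Min-max normalization maps each value v to (v - min) / (max - min), here onto the default [0, 1] range:

```python
A = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26]

lo, hi = min(A), max(A)                   # 16 and 27
norm = [(v - lo) / (hi - lo) for v in A]  # maps [16, 27] onto [0, 1]
print([round(v, 3) for v in norm])
```

For example, the first value 21 becomes (21 - 16) / 11 ≈ 0.455.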

4) Suppose the sales data of product B over the same 20 consecutive months (in hundreds of yuan) are: 38, 24, 38, 45, 46, 44, 42, 34, 40, 30, 31, 40, 40, 32, 36, 42, 50, 47, 46, 50.
Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these products positively or negatively correlated?

The computed correlation coefficient is 0.831, indicating that products A and B are positively correlated.
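The Pearson coefficient can be verified directly from its definition, r = cov(A, B) / (σ_A σ_B):

```python
from math import sqrt

A = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26]
B = [38, 24, 38, 45, 46, 44, 42, 34, 40, 30, 31, 40, 40, 32, 36, 42, 50, 47, 46, 50]

n = len(A)
ma, mb = sum(A) / n, sum(B) / n
cov = sum((a - ma) * (b - mb) for a, b in zip(A, B))
r = cov / sqrt(sum((a - ma) ** 2 for a in A) * sum((b - mb) ** 2 for b in B))
print(round(r, 3))  # 0.831: the products are positively correlated
```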

5)Draw the scatter plot for the sales data of the two products.

(figure omitted)

4

Below are the sales data of supermarket product A (in hundreds of yuan) over 20 consecutive months: 21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26. Smooth the noise in these data using equal-depth binning with a depth of 5.

Answer:

First sort the 20 values: 16, 16, 17, 18, 19, 20, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 26, 26, 27. With equal-depth binning at depth 5, the bins are:
Bin1: 16, 16, 17, 18, 19;
Bin2: 20, 20, 20, 21, 21;
Bin3: 22, 22, 23, 23, 24;
Bin4: 24, 25, 26, 26, 27;

1) Smooth using bin medians.

The median of Bin1 is 17, so after smoothing Bin1 becomes: 17, 17, 17, 17, 17;
The median of Bin2 is 20, so after smoothing Bin2 becomes: 20, 20, 20, 20, 20;
The median of Bin3 is 23, so after smoothing Bin3 becomes: 23, 23, 23, 23, 23;
The median of Bin4 is 26, so after smoothing Bin4 becomes: 26, 26, 26, 26, 26;

2) Smooth using bin boundaries.

After smoothing:
Bin1: 16, 16, 16, 19, 19;
Bin2: 20, 20, 20, 21, 21;
Bin3: 22, 22, 22, 22, 24 or 22, 22, 24, 24, 24 (23 is equidistant from both boundaries, so either is acceptable);
Bin4: 24, 24, 27, 27, 27;
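Both smoothing schemes above can be sketched in a few lines; ties in the boundary scheme are broken toward the lower boundary here, matching the first of the two acceptable Bin3 results:

```python
import statistics

data = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26]
s = sorted(data)
bins = [s[i:i + 5] for i in range(0, len(s), 5)]  # equal-depth binning, depth 5

# 1) smoothing by bin medians: every value becomes its bin's median
by_median = [[statistics.median(b)] * len(b) for b in bins]

# 2) smoothing by bin boundaries: each value snaps to the nearer of the
# bin's min/max; ties (e.g. 23 in Bin3) go to the lower boundary here
by_boundary = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
               for b in bins]
print(by_median)
print(by_boundary)
```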

Homework 2

1

Given a data set below for attributes {Height, Hair, Eye} and two classes {C1, C2}.
(figure omitted)
1)Compute the Information Gain for Height, Hair and Eye.
(figures omitted)
2)Construct a decision tree with Information Gain.

(figures omitted)
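Since the data table itself is in an omitted figure, here is only the general computation the answer relies on, with hypothetical columns for illustration: an attribute that separates the classes perfectly has gain equal to the class entropy, and an uninformative one has gain 0.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_column, labels):
    n = len(labels)
    remainder = sum(
        len(sub) / n * entropy(sub)
        for v in set(attr_column)
        for sub in [[l for a, l in zip(attr_column, labels) if a == v]]
    )
    return entropy(labels) - remainder

# Hypothetical columns (the assignment's actual table is in the omitted figure):
labels = ["C1", "C1", "C2", "C2"]
height = ["Tall", "Short", "Tall", "Short"]  # splits classes evenly: gain 0
hair   = ["Blond", "Blond", "Red", "Red"]    # separates classes: gain = H = 1
print(info_gain(height, labels), info_gain(hair, labels))
```

The decision tree is then built by repeatedly splitting on the attribute with the highest gain.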

2

Classify the unknown sample Z based on the training data set in Q1:

Z = (Height = Short, Hair = blond, Eye = brown). How would a naïve Bayesian classifier classify Z?

(figure omitted)
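The hand computation behind the answer picks the class maximizing P(C) · Π P(x_i | C). A minimal unsmoothed sketch, on hypothetical rows (the actual Q1 table is in the omitted figure):

```python
from collections import Counter

def nb_classify(train, x):
    """train: list of (feature_tuple, label); x: feature tuple to classify.
    Unsmoothed naive Bayes, matching the usual hand computation."""
    priors = Counter(y for _, y in train)
    n = len(train)
    scores = {}
    for c, nc in priors.items():
        p = nc / n  # prior P(C)
        for i, xi in enumerate(x):
            # conditional P(x_i | C) estimated by counting
            p *= sum(1 for f, y in train if y == c and f[i] == xi) / nc
        scores[c] = p
    return max(scores, key=scores.get)

# Hypothetical training rows; the sample Z from the assignment is
# (Short, Blond, Brown):
train = [(("Short", "Blond", "Blue"),  "C1"),
         (("Tall",  "Red",   "Blue"),  "C1"),
         (("Tall",  "Blond", "Brown"), "C2"),
         (("Short", "Blond", "Brown"), "C2")]
print(nb_classify(train, ("Short", "Blond", "Brown")))
```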

3

Note: the answer to this question may contain some errors.
1)Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.
(figure omitted)
2) Using the neural network obtained above, show the weight values after one iteration of the back propagation algorithm, given the training instance "(Tall, Red, Brown)". Indicate your initial weight values and biases and the learning rate used.
(figures omitted)

4

Consider the data set shown in Table 1(min_sup = 60%, min_conf=70%).
(figure omitted)

1)Find all frequent itemsets using Apriori by treating each transaction ID as a market basket.
(figure omitted)
2)Use the results in part (a) to compute the confidence for the association rules {a, b}->{c} and {c}->{a, b}. Is confidence a symmetric measure?
(figure omitted)
3) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers and item_i denotes a variable representing items (e.g. "A", "B", etc.)
(figures omitted)
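Since Table 1 is in an omitted figure, here is the general procedure on toy baskets. The sketch also shows why confidence is not symmetric: conf(X→Y) = sup(X∪Y) / sup(X) divides by the support of the antecedent, so swapping sides changes the denominator.

```python
# Toy baskets for illustration (the assignment's Table 1 is in the omitted figure)
tx = [frozenset(t) for t in
      [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]

def support(items):
    return sum(1 for t in tx if items <= t) / len(tx)

def apriori(min_sup):
    """Level-wise search: prune infrequent candidates, join survivors."""
    frequent, k = {}, 1
    current = [frozenset([i]) for i in sorted(set().union(*tx))]
    while current:
        current = [c for c in current if support(c) >= min_sup]
        frequent.update((c, support(c)) for c in current)
        current = list({a | b for a in current for b in current
                        if len(a | b) == k + 1})  # join step
        k += 1
    return frequent

freq = apriori(0.6)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

# Confidence is NOT symmetric: the denominator is the antecedent's support.
print(confidence(frozenset("ab"), frozenset("c")))  # 2/3
print(confidence(frozenset("c"), frozenset("ab")))  # 1/2
```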

5

Assume a supermarket would like to promote pasta. Use the data in “transactions” as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy pasta or not.

Build a decision tree using data set “transactions” that predicts pasta as a function of the other fields. Set the “type” of each field to “Flag”, set the “direction” of “pasta” as “out”, set the “type” of COD as “Typeless”, select “Expert” and set the “pruning severity” to 65, and set the “minimum records per child branch” to be 95. Hand-in: A figure showing your tree.

(figure omitted)

6

Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the “rollout” data to determine whether the customer would buy pasta.
1)Hand-in: your prediction for each of the 20 customers. (10 points)

(figure omitted)
2)Hand-in: rules for positive (yes) prediction of pasta purchase identified from the decision tree (up to the fifth level. The root is considered as level 1). (10 points)
(figure omitted)

Homework 3

1

Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:

A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2),C2(1,4,6), C3(9,1,7), C4(5,6,7)

The distance function is Euclidean distance. Suppose we initially assign A2, B2, and C2 as the center of each cluster, respectively. Use the K-Means algorithm to show only:

1)The three cluster’s centers after the first round execution

(figure omitted)
2)The final three clusters
(figure omitted)
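One K-Means round on the given points can be sketched directly. Note that A1 is equidistant from the B2- and C2-centers; ties go to the lower-indexed center below, so a hand answer using the other tie-break would group A1 differently.

```python
def dist2(p, q):
    # squared Euclidean distance; ordering is the same as for the distance itself
    return sum((a - b) ** 2 for a, b in zip(p, q))

points = {"A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7), "B1": (1, 1, 1),
          "B2": (2, 3, 2), "B3": (3, 6, 9), "C1": (11, 9, 2), "C2": (1, 4, 6),
          "C3": (9, 1, 7), "C4": (5, 6, 7)}
centers = [points["A2"], points["B2"], points["C2"]]  # initial centers

# Assignment step: each point goes to its nearest center (ties -> lowest index).
clusters = [[] for _ in centers]
for name, p in points.items():
    i = min(range(3), key=lambda k: dist2(p, centers[k]))
    clusters[i].append(name)

# Update step: each new center is the mean of its cluster's points.
new_centers = [tuple(sum(points[n][d] for n in c) / len(c) for d in range(3))
               for c in clusters]
print(clusters)
print(new_centers)
```

Iterating these two steps until the assignments stop changing gives the final clusters asked for in part 2.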

2

Table 2 gives a User-Product rating matrix.
(figure omitted)
1) List the top 3 most similar users to User 2 based on Cosine Similarity
(figure omitted)
2)Predict User 2’s rating for Product 2
(figure omitted)
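The two steps above combine cosine similarity with a similarity-weighted average of the neighbors' ratings. A sketch on hypothetical rating rows (the actual Table 2 is in the omitted figure; 0 marks an unrated product):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def predict(sims, neighbor_ratings):
    """Similarity-weighted average of the neighbors' ratings for one product."""
    return sum(s * r for s, r in zip(sims, neighbor_ratings)) / sum(sims)

# Hypothetical ratings for illustration; 0 = Product 2 unrated by User 2.
u2 = (5, 0, 3, 4)
u1 = (4, 3, 3, 5)
u3 = (5, 4, 2, 4)
sims = [cosine(u2, u1), cosine(u2, u3)]
print(round(predict(sims, [3, 4]), 2))  # predicted rating for Product 2
```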

3

The goal of this assignment is to introduce churn management using decision trees, logistic regression and neural network. You will try different combinations of the parameters to see their impacts on the accuracy of your models for this specific data set. This data set contains summarized data records for each customer for a phone company. Our goal is to build a model so that this company can predict potential churners.

Two data sets are available, churn_training.txt and churn_validation.txt. Each data set has 21 variables. They are:

State:

Account_length: how long this person has been in this plan

Area_code:

Phone_number:

International_plan: this person has international plan=1, otherwise=0

Voice_mail_plan: this person has voice mail plan=1, otherwise=0

Number_vmail_messages: number of voice mails

Total_day_minutes:

Total_day_calls:

Total_day_charge:

Total_eve_minutes:

Total_eve_calls:

Total_eve_charge:

Total_night_minutes:

Total_night_calls:

Total_night_charge:

Total_intl_minutes:

Total_intl_calls:

Total_intl_charge:

Number_customer_service_calls:

Class: churn=1, did not churn=0

Each row in “churn_training” represents one customer record. The training data contains 2000 rows and the validation data contains 1033 records.

1)Perform decision tree classification on training data set. Select all the input variables except state, area_code, and phone_number (since they are only informative for this analysis). Set the “Direction” of class as “out”, “type” as “Flag”. Then, specify the “minimum records per child branch” as 40, “pruning severity” as 70, click “use global pruning”. Hand-in the confusion matrices for validation data.

The confusion matrix for the validation data, computed in Clementine with the decision tree algorithm under the settings above, is shown in the figure below.
(figure omitted)
2)Perform neural network on training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in the confusion matrix for validation data.

The confusion matrix for the validation data, computed in Clementine with the neural network algorithm under the settings above, is shown in the figure below.
(figure omitted)

3)Perform logistic regression on training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in the confusion matrix for validation data.

The confusion matrix for the validation data, computed in Clementine with the logistic regression algorithm under the settings above, is shown in the figure below.
(figure omitted)

4) Hand-in your observations on the model quality for decision tree, neural network and logistic regression using the confusion matrices.
(figure omitted)
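Comparisons like the one asked for here usually reduce each confusion matrix to a few summary metrics. A minimal sketch, with hypothetical counts (the actual matrices produced by Clementine are in the omitted figures):

```python
def metrics(tp, fp, fn, tn):
    """Summary metrics from one binary confusion matrix."""
    total = tp + fp + fn + tn
    return {"accuracy":  (tp + tn) / total,
            "precision": tp / (tp + fp),
            "recall":    tp / (tp + fn)}

# Hypothetical counts read off a validation confusion matrix:
m = metrics(tp=50, fp=10, fn=5, tn=35)
print(m)
```

For churn prediction, recall on the churn class is often the most important of the three, since missed churners are costly.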

4

Learn the use of market basket analysis for the purpose of making product purchase recommendations to the customers.

The data set contains transactions from a large supermarket. Each transaction is made by someone holding the loyalty card. We limited the total number of categories in this supermarket data to 20 categories for simplicity. The field value for a certain product in the transaction basket is 1 if the customer has bought it and 0 if he/she has not. The file named “Transactions” has data for 46243 transactions.

The data are available from the class web page.

Your submission should consist only of those deliverables marked “Hand-in”.

The objective of market basket analysis is to discover individual products, or groups of products, that tend to occur together in transactions. The knowledge obtained from a market basket analysis can be used by a business to recognize products frequently sold together, in order to determine recommendations and cross-sell and up-sell opportunities. It can also be used to improve the efficiency of a promotional campaign.

Run Apriori on “transaction” data set. Set the “Type” of “COD” as “Typeless”, set the “direction” of all the other 20 categories as “Both”, set their “Type” as “Flag”. Set “Minimum antecedent support” to be 7%, “Minimum confidence” to be 45%, and “Maximum number of antecedents” to be 4 in the modeling node (Apriori node). In general you should explore by trying different values of these parameters to see what type of rules you get.

· Hand-in: The list of association rules generated by the model.
(figure omitted)

Sort the rules by lift, support, and confidence, respectively to see the rules identified. Hand-in: For each case, choose top 5 rules (note: make sure no redundant rules in the 5 rules) and give 2-3 lines comments. Many of the rules will be logically redundant and therefore will have to be eliminated after you think carefully about them.

Sorting by lift, support, and confidence respectively in Clementine yields the association rules shown in the figure above; the selected top-5 rules are listed in the table below.
(figure omitted)
1) Lift: from panel (a) of the figure above, we first pick out the top-5 rules:

  a) tomato source → pasta: people who buy tomato sauce also buy pasta, which is fairly reasonable;
  b) coffee, milk → pasta: people who buy coffee and milk also buy pasta, which is not very reasonable, so this rule is excluded;
  c) biscuits, pasta → milk: people who buy biscuits and pasta also buy milk, fairly reasonable;
  d) pasta, water → milk: people who buy pasta and water also buy milk, fairly reasonable;
  e) juices → milk: people who buy juice also buy milk, fairly reasonable;

Since rule b) is not very reasonable, we drop it and take the next rule in line:

  f) yoghurt → milk: people who buy yoghurt also buy milk, fairly reasonable.

The selected top-5 rules are therefore shown in the second column of the table above.

2) Support: from panel (b) of the figure above, the top-5 rules are pasta → milk, water → milk, biscuits → milk, brioches → milk, and yoghurt → milk. All five are fairly reasonable; for example, shoppers buying water, yoghurt, or other drinks often buy milk as well. These five rules are therefore selected, as shown in the third column of the table above.

3) Confidence: from panel (c) of the figure above, the top-5 rules are biscuits, pasta → milk; water, pasta → milk; juices → milk; tomato source → pasta; and yoghurt → milk. These also accord with common sense; for example, shoppers buying tomato sauce are very likely to buy pasta. These five rules are therefore selected, as shown in the fourth column of the table above.
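The three ranking measures used above are related: support counts how often a rule's items appear together, confidence divides that by the antecedent's support, and lift divides confidence by the consequent's support (lift > 1 indicates positive association). A sketch on toy baskets (the real "transactions" data is not reproduced here):

```python
# Toy baskets for illustration only
tx = [frozenset(t) for t in
      [{"pasta", "milk"}, {"pasta", "milk", "water"}, {"milk"}, {"water"}]]

def measures(lhs, rhs):
    n = len(tx)
    sup_lhs  = sum(1 for t in tx if lhs <= t) / n
    sup_rhs  = sum(1 for t in tx if rhs <= t) / n
    sup_rule = sum(1 for t in tx if lhs | rhs <= t) / n
    conf = sup_rule / sup_lhs
    lift = conf / sup_rhs  # > 1 means the rule beats buying rhs at random
    return sup_rule, conf, lift

s, c, l = measures(frozenset({"pasta"}), frozenset({"milk"}))
print(s, c, round(l, 3))
```

Because lift normalizes away a popular consequent like milk, ranking by lift surfaces different top rules than ranking by raw support or confidence, which is exactly the contrast seen in the three columns above.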

