Data Engineering Question

Problem 1 (20 points): This problem illustrates classification with decision trees using the Lupus data (you can download the data file “sledata”, sledata.txt, from the D2L site). The data consist of 300 patient records. Each record contains 12 elements: the first 11 stand for different symptoms, and the final element indicates the diagnosis. Build a decision tree and report:

1) The decision tree and the criteria used to build it: how the best split is chosen and what the stopping condition is (for example, which impurity measure, the minimum number of cases per parent and child node, etc.);

2) How many nodes the final tree has, and how many of them are terminal nodes;

3) What are the three most important Lupus data features in building the tree? Explain your answer.
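One common impurity measure for choosing the best split is the Gini index. The sketch below is only an illustration of that idea, not the required method: it assumes binary (0/1) features, uses a tiny made-up data set rather than sledata.txt, and the helper names gini and split_gain are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(rows, labels, feature):
    """Impurity decrease obtained by splitting on one binary (0/1) feature."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy records (made up for illustration, NOT the Lupus data):
# each row holds two binary symptoms, labels hold the diagnosis.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]

gains = [split_gain(rows, labels, f) for f in range(2)]
best_feature = max(range(2), key=lambda f: gains[f])
print(gains, best_feature)
```

A feature that yields large impurity decreases near the top of the tree is one reasonable notion of an "important" feature; most tree packages report a similar quantity.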

Problem 2 (30 points): This problem illustrates the effect of class imbalance on the accuracy of decision trees. Download the red wine quality data from the UCI machine learning repository at:

http://archive.ics.uci.edu/ml/datasets/Wine+Qualit…

1. Report how many classes there are (treat each quality level as a different class) and what the distribution of these classes is for the red wine data.

2. Repeat Problem 1 on the red wine data.

3. Now bin the class variable so that the data are less imbalanced with respect to it. Repeat Problem 1 on the wine data with the smaller number of classes (the binned class variable).

4. How does the accuracy of the best classification model on the original class variable compare with the accuracy of the best classification model on the binned class variable?
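The class distribution and the binning step can be sketched as follows. This is a minimal illustration under stated assumptions: the quality scores are made up rather than read from the wine file, and the two-way cut at 6 in the hypothetical helper bin_quality is just one possible choice of bins.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: (count, fraction)} with classes in sorted order."""
    counts = Counter(labels)
    n = len(labels)
    return {k: (counts[k], counts[k] / n) for k in sorted(counts)}

def bin_quality(q, threshold=6):
    """Assumed binning: 'high' if q >= threshold, otherwise 'low'."""
    return "high" if q >= threshold else "low"

# Made-up quality scores for illustration (NOT the real wine data).
quality = [5, 5, 6, 6, 5, 7, 4, 6, 5, 8]

original = class_distribution(quality)
binned = class_distribution([bin_quality(q) for q in quality])
print(original)
print(binned)
```

Comparing the two distributions shows how merging rare quality levels gives each class more examples, which is the situation parts 3 and 4 ask you to evaluate.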

Problem 3 (5 points): Given the decision tree in Figure 1, show how the new examples in Table 1 would be classified by filling in the last column in the table. If an example cannot be classified, enter UNKNOWN in the last column. For each example, explain your answer by writing down the path from the root to the leaf that corresponds to that specific example.
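Tracing an example from the root to a leaf can be sketched as below. Figure 1 and Table 1 are not reproduced here, so the tree, its attribute names, and the examples are all hypothetical; the nested-dictionary representation and the function classify are assumptions made only for this illustration.

```python
def classify(node, example, path=None):
    """Walk the tree; return (label, path), or ('UNKNOWN', path) if no branch matches."""
    path = [] if path is None else path
    if "label" in node:  # reached a leaf
        return node["label"], path
    attr = node["attribute"]
    value = example.get(attr)
    path.append(f"{attr}={value}")
    child = node["branches"].get(value)
    if child is None:  # the tree has no branch for this attribute value
        return "UNKNOWN", path
    return classify(child, example, path)

# Hypothetical tree (NOT the one in Figure 1).
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": {"label": "stay"}, "no": {"label": "play"}},
        },
        "rainy": {"label": "stay"},
    },
}

result_a = classify(tree, {"outlook": "sunny", "windy": "no"})
result_b = classify(tree, {"outlook": "overcast", "windy": "no"})
print(result_a)
print(result_b)
```

The returned path lists the tests taken from the root downward, which is the form of explanation the problem asks for.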
