Regression

(2) You should have found some suspicious datapoints in the first part
of this question. Create two copies of the dataset: one with the suspicious points removed, the other with them retained. Try training
both LASSO and standard regression models on the training dataset
using the numerical covariates, tuning the parameters by cross-validation on the optimization dataset. In the end you should have
four models: standard and LASSO regression, with and without suspicious points. Test their accuracy on the test dataset.
(3) Which of the models in the previous question is best? Was it OK to
remove the suspicious datapoints from the training dataset? Were the
error estimates from the optimization dataset accurate?
(4) Some of the features, such as “description” and “amenities,” are text
fields or text lists. Choose a collection of words W that appear somewhere in these variables. For each word $w \in W$, create a {0, 1}-valued
dummy variable that indicates the presence of a given word in a given
listing’s text field. For a word w, denote by p(w) the percentage of
listings containing that word. Denote by n the number of listings in
the training set. Ensure that your collection of words W satisfies:
(a) $\sum_{w \in W} p(w) \geq \frac{n}{2}$ (that is, the words appear in some nontrivial
fraction of all listings).
(b) $|W| \geq 12$ (that is, there are many words in the collection).
(c) Very common words such as “the” or “and” should not appear in
the collection.
In order to extract words from a text field in R, you may find the
command grepl useful.
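
As an illustration of the dummy-variable construction in part (4), here is a minimal sketch in base R. The toy `listings` data frame, the five-word collection `W`, and the names `dummies`, `check_a` and `check_b` are all hypothetical and are not taken from the assignment's dataset. The checks assume that the inequalities in (a) and (b) read "at least", and they take (a) literally, with p(w) measured as a percentage.

```r
# Toy listings (hypothetical); the real text fields come from the assignment's dataset.
listings <- data.frame(
  description = c("Cozy loft near the beach", "Quiet garden studio",
                  "Beach house with garden", "Modern loft downtown"),
  amenities = c("wifi, kitchen", "wifi, parking",
                "kitchen, parking, pool", "wifi, pool"),
  stringsAsFactors = FALSE
)

# Hypothetical word collection W (deliberately shorter than the 12 words required).
W <- c("beach", "garden", "loft", "wifi", "pool")

# Combine both text fields and lower-case them so matching ignores case.
text <- tolower(paste(listings$description, listings$amenities))

# One {0, 1}-valued column per word; \\b restricts the match to whole words.
dummies <- sapply(W, function(w) {
  as.integer(grepl(paste0("\\b", w, "\\b"), text, perl = TRUE))
})

# p(w): percentage of listings containing each word.
p <- 100 * colMeans(dummies)
n <- nrow(listings)

# Conditions (a) and (b), read literally (an assumption about the intended inequalities).
check_a <- sum(p) >= n / 2
check_b <- length(W) >= 12

# Attach the dummy variables to the original covariates.
listings_w <- cbind(listings, dummies)
print(p)
```

On the real data, W would be chosen by inspecting word frequencies first, and very common words such as "the" would be left out by hand, as condition (c) requires.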
(5) Repeat model-training for both standard and LASSO regression on
this larger collection of covariates (with outliers removed or not per
the results of the previous part of the question). How do the results
change?
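
The model comparison described in parts (2) and (5) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the three "suspicious" points are planted artificially, the lambda grid is arbitrary, and LASSO is fitted with a small hand-written coordinate-descent routine so that the example runs in base R (in practice a dedicated package such as glmnet would be the usual choice). All function names here are hypothetical. The LASSO penalty is tuned by its error on a held-out optimization set, which is one reading of the assignment's "cross-validation with the optimization dataset".

```r
set.seed(1)

# Synthetic stand-in for the real data: three numerical covariates and a response.
make_data <- function(n) {
  X <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
  list(X = X, y = as.vector(2 * X[, 1] - X[, 2] + rnorm(n, sd = 0.5)))
}
train <- make_data(200); opt <- make_data(100); test <- make_data(100)

# Plant three artificial "suspicious" points in the training data.
suspicious <- seq_len(200) <= 3
train$y[suspicious] <- train$y[suspicious] + 50
train_clean <- list(X = train$X[!suspicious, ], y = train$y[!suspicious])

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Standard (least-squares) regression via lm.
ols_fit <- function(d) lm(y ~ ., data = data.frame(d$X, y = d$y))
ols_predict <- function(fit, X) as.vector(predict(fit, newdata = data.frame(X)))

# LASSO by coordinate descent on standardized covariates Xs; minimizes
# (1 / (2n)) * sum((y - b0 - Xs %*% b)^2) + lambda * sum(abs(b)).
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)
lasso_fit <- function(X, y, lambda, n_iter = 200) {
  mu <- colMeans(X); s <- apply(X, 2, sd)
  Xs <- scale(X, center = mu, scale = s)
  n <- nrow(Xs); b0 <- mean(y); b <- rep(0, ncol(Xs))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(b)) {
      # Partial residual: leave covariate j out of the current fit.
      r <- y - b0 - as.vector(Xs[, -j, drop = FALSE] %*% b[-j])
      b[j] <- soft_threshold(sum(Xs[, j] * r) / n, lambda) / (sum(Xs[, j]^2) / n)
    }
  }
  list(b0 = b0, b = b, mu = mu, s = s)
}
lasso_predict <- function(fit, X) {
  as.vector(fit$b0 + scale(X, center = fit$mu, scale = fit$s) %*% fit$b)
}

# Pick lambda by its error on the optimization dataset (arbitrary grid).
lambdas <- c(0.001, 0.01, 0.05, 0.1, 0.5)
tune_lasso <- function(d) {
  errs <- sapply(lambdas, function(l) {
    rmse(opt$y, lasso_predict(lasso_fit(d$X, d$y, l), opt$X))
  })
  lasso_fit(d$X, d$y, lambdas[which.min(errs)])
}

# Four models: {standard, LASSO} x {suspicious points retained, removed}.
results <- sapply(list(retained = train, removed = train_clean), function(d) {
  c(standard = rmse(test$y, ols_predict(ols_fit(d), test$X)),
    lasso    = rmse(test$y, lasso_predict(tune_lasso(d), test$X)))
})
print(round(results, 3))
```

The resulting 2-by-2 table of test errors has one row per method and one column per version of the training set. For part (5), the same code would apply with the word dummy variables appended as extra columns of `X`.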

