Calculations on: Convexity, Gradient decent and Newton’s method and Biast and variance of random forest
Homework 2, Machine Learning, Fall 2022
*IMPORTANT* Homework Submission Instructions
1. All homeworks must be submitted in one PDF file to Gradescope.
2. Please make sure to select the corresponding HW pages on Gradescope for each question.
3. For all theory problems, please type the answer and proof using Latex or Markdown, and export the
tex or markdown into a PDF file.
4. For all coding components, complete the solutions with a Jupyter notebook/Google Colab, and export
the notebook (including both code and outputs) into a PDF file. Concatenate the theory solutions
PDF file with the coding solutions PDF file into one PDF file which you will submit.
5. Failure to adhere to the above submission format may result in penalties.
All homework assignments must be your independent work product, no collaboration is allowed.
You may check your assignment with others or the internet after you have done it, but you must
actually do it by yourself. Please copy the following statements at the top of your assignment file:
Agreement 1) This assignment represents my own work. I did not work on this assignment with others.
All coding was done by myself.
Agreement 2) I understand that if I struggle with this assignment that I will reevaluate whether this is
the correct class for me to take. I understand that the homework only gets harder.
1
to
1 Convexity (Chudi)
Convexity is very important for optimization. In this question, we will explore the convexity of some
functions.
1.1
Let h(x) = f(x) + g(x) where f(x) and g(x) are both convex functions. Show that h(x) is also convex.
1.2
In mathematics, a concave function is the negative of a convex function. Let g1(x), g2(x), …., gk(x) be concave
functions and gi(x) ! 0. Show that log(
Qk
i=1 gi(x)) is concave. (Note the base of log function is e.)
Hint 1: the sum of two concave function is concave.
Hint 2: log(a ⇥ b) = log(a) + log(b).
Hint 3: log function is concave and monotonically increasing in R+.
1.3
A line, represented by y = ax + b, is both convex and concave. Draw four lines with di↵erent values of a
and b and highlight their maximum and minimum in two di↵erent colors. What does this reveal about the
convexity of the maximum and minimum of these lines respectively?
1.4
Now we consider the general case. Let f(x) and g(x) be two convex functions. Show that h(x) =
max(f(x), g(x)) is also convex or give a counterexample.
1.5
Let f(x) and g(x) be two convex functions. Show that h(x) = min(f(x), g(x)) is also convex or give a
counterexample.
1.6
Since convex optimization is much more desirable then non-convex optimization, many commonly used risk
functions in machine learning are convex. In the following questions, we will explore some of these empirical
risk functions.
Suppose we have a dataset {X, y} = {xi, yi}ni
=1 that we will perform classification on, where xi 2 Rp is a
p-dimensional feature vector, and yi 2 {−1, 1} is the binary label. Please identify and show by proof whether
the following empirical risk functions are convex with respect to weight parameter vector !. In the model
below, ! = [!1,!2, …,!p]T 2 Rp, b 2 R are the weight and bias. We would like to show convexity with
respect to ! and b.
(a)
L(!, b) =
Xn
i=1
log(1 + e−yi(!T xi+b))
(b)
L(!, b,C) =
1
2||!||22
+ C
Xn
i=1
max(0, 1 − yi(!Txi + b)), C! 0
Page 2
1.7
Consider the lasso regression problem. Given a dataset {X, y} = {xi, yi}ni
=1 where yi 2 R, please identify and
show by proof whether the following empirical risk function is convex with respect to the weight parameter
vector !.
L(!, b,C) = ky − (X! + b1n)k22
+ Ck!k1, C! 0
2 Gradient Descent and Newton’s Method (Chudi)
Optimization algorithms are important in machine learning. In this problem, we will explore ideas behind
gradient descent and Newton’s method.
Suppose we want to minimize a convex function given by
f(x) = 2×21
+ 3×22
− 4×1 − 6×2 + 2x1x2 + 13.
2.1
(a) Find the expression for rf(x), the first derivative of f(x).
(b) Find the expression for Hf(x), the Hessian of f(x).
(c) Compute analytically the value of x⇤, at which the optimum is achieved for f(x). Hint: use the first
two parts.
2.2
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a di↵erentiable
function. Since f(x) is strictly convex and di↵erentiable, gradient descent can be used to find the
unique optimal solution.
Recall that gradient descent takes repeated steps in the opposite direction of the gradient of the function
at the current point until convergence. The algorithm is shown below. Given an initial point x0 and a step
size ↵,
• t = 0
• while xt is not a minimum
– xt+1 = xt − ↵rf(xt)
– t = t + 1
• end
For each iteration, xt+1 = xt −↵rf(xt). Suppose ↵ = 0.1. Show that xt+1 converges to x⇤ obtained in
2.1 as t!1.
Hint 1: Let A be a square matrix. I + A + A2 + . ..Ak = (I − A)−1(I − Ak+1).
Hint 2: Let A be a square matrix. Ak converges to zero matrix if all eigenvalues of matrix A have absolute
value smaller than 1.
2.3
Will xt+1 converge to x⇤ when ↵ grows larger? Hint: the previous question’s calculations should provide
you the answer.
Page 3
2.4
Gradient descent only uses the first derivative to make updates. If a function is twice di↵erentiable, is the
second derivative able to help find optimal solutions? Newton’s method takes the second derivative into
consideration and the key update step is modified as xt+1 = xt − (Hf(xt))−1rf(xt). The algorithm is
shown below.
• t = 0
• while xt is not a minimum
– xt+1 = xt − (Hf(xt))−1rf(xt)
– t = t + 1
• end
(a) Using the initial point x0 = [1, 2]T , give the value of x1 that Newton’s method would find in one
iteration. Does this method converge in one iteration? If not, give the value of x2 that Newton’s
method would find and report whether x2 is equal to x⇤.
(b) Using the initial point x0 = [1, 2]T and a fixed step size ↵ = 0.1, give the value of x1 that gradient
descent would find in one iteration. Does this method converge in one iteration? If not, give the value
of x2 that gradient descent would find and report whether x2 is equal to x⇤.
(c) According to the results above, which method converges faster? Please briefly explain why it happens.
3 Bias and Variance of Random Forest
In this question, we will analyze the bias and variance of the random forest algorithm. Bias is the amount
that a model’s prediction di↵ers from the target value in the training data. An underfitted model can lead
to high bias. Variance indicates how much the estimate of the target function will alter if di↵erent training
data were used. An overfitted model can lead to high variance.
3.1
Let {X, y} = {(xi, yi)}ni
=1 be the training dataset, where xi is a feature vector and yi 2 {−1, 1} is the binary
label. Suppose all overlapping samples are consistently labeled, i.e. 8i 6= j, yi = yj if xi = xj . Show that
100% training accuracy tree can be achieved if there is no depth limit on the trees in the forest. Hint: the
proof is short. One way to proof it is by induction.
3.2
Let D1,D2, …,DT be random variables that are identically distributed with positive pairwise correlation ⇢
and V ar(Di) = $2, 8i 2 {1, …,T}. Let ¯D be the average of D1, …,DT , i.e. ¯D = 1
T (
PT
i=1 Di). Show that
V ar(¯D ) =
1 − ⇢
T
$2 + ⇢$2. (1)
3.3
In the random forest algorithm, we can view the decision tree grown in each iteration as approximately Di
in Problem 3.2. Given this approximation, briefly explain how we expect the variance of the average of the
decision trees to change as T increases.
Page 4
3.4
As T becomes large, the first term in Eq(1) goes away. But the second term does not. What parameter(s)
in random forest control ⇢ in that second term?
4 Coding for SVM (Henry)
4.1
Generate a training set of size 100 (50 samples for each label) with 2D features (X) drawn at random as
follows:
• Xneg ⇠ N([-5, -5], 5·I2) and correspond to negative labels (-1)
• Xpos ⇠ N([5, 5], 5·I2) and correspond to positive labels (+1)
Accordingly, X = [Xpos
Xneg ] is a 100 ⇥ 2 array, Y is a 100 ⇥ 1 array of values 2 {−1, 1}. Draw a scatter plot of
the full training dataset with the points colored according to their labels.
4.2
Train a linear support vector machine on the data with C = 1 and draw the decision boundary line that
separates positives and negatives. Mark the support vectors separately (e.g., circle around the point).
4.3
Draw decision boundaries for 9 di↵erent C values across a wide range of values between 0 and infinity. Plot
the number of support vectors vs. C (plot on a log scale if that’s useful). How does the number of support
vectors change as C increases and why does it change like that?
4.4
Now, try to rescale the data to the [0,1] range and repeat the steps of the previous question (4.3) and over
the same range of C values. Are the decision boundaries di↵erent from those in the previous question? What
does this imply about (a) the geometric margin and (b) the relative e↵ect of each feature on the predictions
of the trained model?
4.5
Try using boosted decision trees (using any boosted decision tree algorithm you can find) with and without
changing the scaling of the dataset and plot both decision boundaries on one plot. Do the plots di↵er? Is
that what you expected? (Make sure you use the same random seed for both runs of your algorithm.)
5 Coding for Logistic Regression (Henry)
Download the Movie Review data file from Sakai named moviereview.tsv. The data file is in tab-separatedvalue
(.tsv) format. The data is from the Movie Review Polarity data set (for more details, see http:
//www.cs.cornell.edu/people/pabo/movie-review-data/). Currently, the original data was distributed
as a collection of separate files (one movie review per file). The data we are using converted this to have
one line per example, with the label 0 or 1 in the first column (the review is negative or positive) followed
by all of the words in the movie review (with none of the line breaks) in the second column. We provide
a dictionary file to limit the vocabulary to be considered in this assignment. The dictionary.txt uses the
format: [word, index], which represents the kth word (so it has the index k) in the dictionary.
Page 5
5.1
In this section, your goal is to derive the update formula for the logistic regression when optimizing with
gradient descent. Gradient descent is as follows:
To optimize J(✓, x, y)
choose learning rate ⌘
t as current time, T as max time
✓0 ! initial value
while t < T: do
Update: ✓j ! ✓j − ⌘ · @J(✓,x,y)
@✓j
end while
Assume you are given a data set with n training examples and p features. We first want to find the
negative conditional log-likelihood of the training data in terms of the design matrix X, the labels y, and the
parameter vector ✓. This will be your objective function J(✓) for gradient descent. Here we assume that each
feature vector xi,· contains a bias feature, that is, xi,1 = 1 8i 2 1, …,n. Thus, xi,· is a (p + 1)-dimensional
vector where xi,1 =1. As such, the bias parameter is folded into our parameter vector ✓.
(a) Derive the negative log-likelihood of p(y|X, ✓). (Hint: look at the course notes.)
(b) Then, derive the partial derivative of the negative log-likelihood J(✓) with respect to ✓j , j 2 1, …,p + 1.
(c) The update rule is ✓j ! ✓j − ⌘ · @J(✓)
@✓j
. Based on the update rule, write the gradient descent update
for parameter element ✓j .
We will implement this momentarily.
5.2
In this section, you are going to do some preprocessing. You are going to create something similar to a
bag-of-words feature representation, where the feature vector xi,v for movie review i and vocabulary word v
in the dictionary is set to 1 if movie review i contains at least one occurrence of vocabulary v. An example
of the output from your pre-processing for one movie review might be as follows: 20:1 22:1 30:1 35:1 45:1
……., which expresses that vocabulary words 20, 22, 30, 35, 45, etc. are in the movie review. The feature
vector ignores words not in the vocabulary of dict.txt. You can use csv and math packages in python.
5.3
In this section, implement a sentiment polarity analyzer using binary logistic regression. Specifically, you
will learn the parameters of a binary logistic regression model that predicts a sentiment polarity (i.e., the
label) for the corresponding feature vector of each movie review using gradient descent, as follows:
• Initialize all model parameters to 0.
• Use gradient descent to optimize the parameters for a binary logistic regression model. The number
of times gradient descent loops through all of the training data (number of epochs). Set your learning
rate as a constant ⌘=0.1.
• Perform gradient descent updates on the training data for 30 epochs.
Do not hard-code any aspects of the data sets into your code. You should only use the packages that are
provided. You may use sklearn for creating the train/test split. You can use csv and math packages in python.
Page 6
Let us provide a note about efficient computation of dot-products. In linear models like logistic regression,
the computation is often dominated by the dot-product ✓T x of the parameters ✓ 2 Rp with the feature vector
x 2 Rp. When a standard representation of x is used, the dot-product requires O(p) computation, because it
requires a sum over all entries in the vector: ✓T x =
Pp
j=1 ✓jx·,j However, if our feature vector is represented
sparsely, we can observe that the only elements of the feature vector that will contribute a non-zero value to
the sum are those where x·,j 6= 0, since this would allow ✓jx·,j to be nonzero. This requires only computation
proportional to the number of non-zero entries in x, which is generally very small for model compared to
the size of the vocabulary.
6 Boosted Decision Trees and Random Forest (Henry)
In this assignment, you are going to fit and tune random forest and boosted decision trees to predict whether
a given passenger survived the sinking of the Titanic based on the variables provided in the data set. You
may use sklearn and gridsearch.
6.1
First, download the Titanic.csv file from Sakai, and drop the columns that have “na” values from the data
set. Convert the gender variable into a binary variable with 0 representing male, and 1 representing female.
Then split 80% of the data into train and 20% of the data into test set.
6.2
Fit a random forest and boosted decision tree model on the training set with default parameters, and
estimate the time it required to fit each of the models. Which model required less time to train? Why do
you think that is? (Hint: look at the default values of the parameters.)
6.3
Choose a range of parameter values for each of the algorithms (tell us what parameters you played with),
and tune on the training set over 5 folds. Then, draw the ROC and provide the AUC for each of the
tuned models on the whole training set (in one figure) and test set (in another figure). Comment on the
similarities/di↵erences between the two models (if any).
7 Generalized Additive Models
In this question, you are going to fit a generalized additive model (GAM) to predict the species of penguins on
a given dataset. You can download the dataset penguins trunc.csv from Sakai. The dataset is preprocessed
and you can use it directly.
Split 80% of the data into a training set and 20% of the data into a test set. Please use one of the
following methods to fit a GAM model: (1) logistic regression with `1 penalty, (2) Explainable Boosting
Machine, (3) FastSparse, (4) pyGAM. Report training and test accuracy and plot the shape function for
each feature on a separate plot.
Hint 1: For logistic regression with `1 penalty, you can use sklearn packages. You need to first transform
each continuous feature into a set of binary features by splitting on the mid-point of every two consecutive
values realized in the dataset, i.e., CulmenLength 32.6, CulmenLength 33.3. You can then fit regularized
logistic regression to these indicator variables and construct the component function for variable j by
weighting the indicator variables that pertain to feature j by their learned coefficients and adding them up.
You can use sklearn LogisticRegression. Please remember to set penalty to “`1” and algorithm to “liblinear”.
Hint 2: Explainable Boosting Machine is a tree-based, cyclic gradient boosting generalized additive modeling
package. It can be pip installed for Python 3.6+ Linux, Mac, Windows. More details about the algorithm and
Page 7
usage are available in https://github.com/interpretml/interpret and https://interpret.ml/docs/
ebm.html.
Hint 3: FastSparse implements fast classification techniques for sparse generalized linear and additive models
in R. To install this package, please follow the instruction here https://github.com/jiachangliu/
fastSparse/tree/main/installation. Example code about how to run FastSparse is available https:
//github.com/jiachangliu/fastSparse/tree/main/application_and_usage and example code to visualize
shape functions is https://github.com/jiachangliu/fastSparse/tree/main/visualization.
Hint 4: You may also use pygam package to fit a GAM model which is available here https://github.com/
dswah/pyGAM.
Page