## Calculations on: Convexity, Gradient decent and Newton’s method and Biast and variance of random forest

Homework 2, Machine Learning, Fall 2022

*IMPORTANT* Homework Submission Instructions

1. All homeworks must be submitted in one PDF file to Gradescope.

2. Please make sure to select the corresponding HW pages on Gradescope for each question.

3. For all theory problems, please type the answer and proof using Latex or Markdown, and export the

tex or markdown into a PDF file.

4. For all coding components, complete the solutions with a Jupyter notebook/Google Colab, and export

the notebook (including both code and outputs) into a PDF file. Concatenate the theory solutions

PDF file with the coding solutions PDF file into one PDF file which you will submit.

5. Failure to adhere to the above submission format may result in penalties.

All homework assignments must be your independent work product, no collaboration is allowed.

You may check your assignment with others or the internet after you have done it, but you must

actually do it by yourself. Please copy the following statements at the top of your assignment file:

Agreement 1) This assignment represents my own work. I did not work on this assignment with others.

All coding was done by myself.

Agreement 2) I understand that if I struggle with this assignment that I will reevaluate whether this is

the correct class for me to take. I understand that the homework only gets harder.

1

to

1 Convexity (Chudi)

Convexity is very important for optimization. In this question, we will explore the convexity of some

functions.

1.1

Let h(x) = f(x) + g(x) where f(x) and g(x) are both convex functions. Show that h(x) is also convex.

1.2

In mathematics, a concave function is the negative of a convex function. Let g1(x), g2(x), …., gk(x) be concave

functions and gi(x) ! 0. Show that log(

Qk

i=1 gi(x)) is concave. (Note the base of log function is e.)

Hint 1: the sum of two concave function is concave.

Hint 2: log(a ⇥ b) = log(a) + log(b).

Hint 3: log function is concave and monotonically increasing in R+.

1.3

A line, represented by y = ax + b, is both convex and concave. Draw four lines with di↵erent values of a

and b and highlight their maximum and minimum in two di↵erent colors. What does this reveal about the

convexity of the maximum and minimum of these lines respectively?

1.4

Now we consider the general case. Let f(x) and g(x) be two convex functions. Show that h(x) =

max(f(x), g(x)) is also convex or give a counterexample.

1.5

Let f(x) and g(x) be two convex functions. Show that h(x) = min(f(x), g(x)) is also convex or give a

counterexample.

1.6

Since convex optimization is much more desirable then non-convex optimization, many commonly used risk

functions in machine learning are convex. In the following questions, we will explore some of these empirical

risk functions.

Suppose we have a dataset {X, y} = {xi, yi}ni

=1 that we will perform classification on, where xi 2 Rp is a

p-dimensional feature vector, and yi 2 {−1, 1} is the binary label. Please identify and show by proof whether

the following empirical risk functions are convex with respect to weight parameter vector !. In the model

below, ! = [!1,!2, …,!p]T 2 Rp, b 2 R are the weight and bias. We would like to show convexity with

respect to ! and b.

(a)

L(!, b) =

Xn

i=1

log(1 + e−yi(!T xi+b))

(b)

L(!, b,C) =

1

2||!||22

+ C

Xn

i=1

max(0, 1 − yi(!Txi + b)), C! 0

Page 2

1.7

Consider the lasso regression problem. Given a dataset {X, y} = {xi, yi}ni

=1 where yi 2 R, please identify and

show by proof whether the following empirical risk function is convex with respect to the weight parameter

vector !.

L(!, b,C) = ky − (X! + b1n)k22

+ Ck!k1, C! 0

2 Gradient Descent and Newton’s Method (Chudi)

Optimization algorithms are important in machine learning. In this problem, we will explore ideas behind

gradient descent and Newton’s method.

Suppose we want to minimize a convex function given by

f(x) = 2×21

+ 3×22

− 4×1 − 6×2 + 2x1x2 + 13.

2.1

(a) Find the expression for rf(x), the first derivative of f(x).

(b) Find the expression for Hf(x), the Hessian of f(x).

(c) Compute analytically the value of x⇤, at which the optimum is achieved for f(x). Hint: use the first

two parts.

2.2

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a di↵erentiable

function. Since f(x) is strictly convex and di↵erentiable, gradient descent can be used to find the

unique optimal solution.

Recall that gradient descent takes repeated steps in the opposite direction of the gradient of the function

at the current point until convergence. The algorithm is shown below. Given an initial point x0 and a step

size ↵,

• t = 0

• while xt is not a minimum

– xt+1 = xt − ↵rf(xt)

– t = t + 1

• end

For each iteration, xt+1 = xt −↵rf(xt). Suppose ↵ = 0.1. Show that xt+1 converges to x⇤ obtained in

2.1 as t!1.

Hint 1: Let A be a square matrix. I + A + A2 + . ..Ak = (I − A)−1(I − Ak+1).

Hint 2: Let A be a square matrix. Ak converges to zero matrix if all eigenvalues of matrix A have absolute

value smaller than 1.

2.3

Will xt+1 converge to x⇤ when ↵ grows larger? Hint: the previous question’s calculations should provide

you the answer.

Page 3

2.4

Gradient descent only uses the first derivative to make updates. If a function is twice di↵erentiable, is the

second derivative able to help find optimal solutions? Newton’s method takes the second derivative into

consideration and the key update step is modified as xt+1 = xt − (Hf(xt))−1rf(xt). The algorithm is

shown below.

• t = 0

• while xt is not a minimum

– xt+1 = xt − (Hf(xt))−1rf(xt)

– t = t + 1

• end

(a) Using the initial point x0 = [1, 2]T , give the value of x1 that Newton’s method would find in one

iteration. Does this method converge in one iteration? If not, give the value of x2 that Newton’s

method would find and report whether x2 is equal to x⇤.

(b) Using the initial point x0 = [1, 2]T and a fixed step size ↵ = 0.1, give the value of x1 that gradient

descent would find in one iteration. Does this method converge in one iteration? If not, give the value

of x2 that gradient descent would find and report whether x2 is equal to x⇤.

(c) According to the results above, which method converges faster? Please briefly explain why it happens.

3 Bias and Variance of Random Forest

In this question, we will analyze the bias and variance of the random forest algorithm. Bias is the amount

that a model’s prediction di↵ers from the target value in the training data. An underfitted model can lead

to high bias. Variance indicates how much the estimate of the target function will alter if di↵erent training

data were used. An overfitted model can lead to high variance.

3.1

Let {X, y} = {(xi, yi)}ni

=1 be the training dataset, where xi is a feature vector and yi 2 {−1, 1} is the binary

label. Suppose all overlapping samples are consistently labeled, i.e. 8i 6= j, yi = yj if xi = xj . Show that

100% training accuracy tree can be achieved if there is no depth limit on the trees in the forest. Hint: the

proof is short. One way to proof it is by induction.

3.2

Let D1,D2, …,DT be random variables that are identically distributed with positive pairwise correlation ⇢

and V ar(Di) = $2, 8i 2 {1, …,T}. Let ¯D be the average of D1, …,DT , i.e. ¯D = 1

T (

PT

i=1 Di). Show that

V ar(¯D ) =

1 − ⇢

T

$2 + ⇢$2. (1)

3.3

In the random forest algorithm, we can view the decision tree grown in each iteration as approximately Di

in Problem 3.2. Given this approximation, briefly explain how we expect the variance of the average of the

decision trees to change as T increases.

Page 4

3.4

As T becomes large, the first term in Eq(1) goes away. But the second term does not. What parameter(s)

in random forest control ⇢ in that second term?

4 Coding for SVM (Henry)

4.1

Generate a training set of size 100 (50 samples for each label) with 2D features (X) drawn at random as

follows:

• Xneg ⇠ N([-5, -5], 5·I2) and correspond to negative labels (-1)

• Xpos ⇠ N([5, 5], 5·I2) and correspond to positive labels (+1)

Accordingly, X = [Xpos

Xneg ] is a 100 ⇥ 2 array, Y is a 100 ⇥ 1 array of values 2 {−1, 1}. Draw a scatter plot of

the full training dataset with the points colored according to their labels.

4.2

Train a linear support vector machine on the data with C = 1 and draw the decision boundary line that

separates positives and negatives. Mark the support vectors separately (e.g., circle around the point).

4.3

Draw decision boundaries for 9 di↵erent C values across a wide range of values between 0 and infinity. Plot

the number of support vectors vs. C (plot on a log scale if that’s useful). How does the number of support

vectors change as C increases and why does it change like that?

4.4

Now, try to rescale the data to the [0,1] range and repeat the steps of the previous question (4.3) and over

the same range of C values. Are the decision boundaries di↵erent from those in the previous question? What

does this imply about (a) the geometric margin and (b) the relative e↵ect of each feature on the predictions

of the trained model?

4.5

Try using boosted decision trees (using any boosted decision tree algorithm you can find) with and without

changing the scaling of the dataset and plot both decision boundaries on one plot. Do the plots di↵er? Is

that what you expected? (Make sure you use the same random seed for both runs of your algorithm.)

5 Coding for Logistic Regression (Henry)

Download the Movie Review data file from Sakai named moviereview.tsv. The data file is in tab-separatedvalue

(.tsv) format. The data is from the Movie Review Polarity data set (for more details, see http:

//www.cs.cornell.edu/people/pabo/movie-review-data/). Currently, the original data was distributed

as a collection of separate files (one movie review per file). The data we are using converted this to have

one line per example, with the label 0 or 1 in the first column (the review is negative or positive) followed

by all of the words in the movie review (with none of the line breaks) in the second column. We provide

a dictionary file to limit the vocabulary to be considered in this assignment. The dictionary.txt uses the

format: [word, index], which represents the kth word (so it has the index k) in the dictionary.

Page 5

5.1

In this section, your goal is to derive the update formula for the logistic regression when optimizing with

gradient descent. Gradient descent is as follows:

To optimize J(✓, x, y)

choose learning rate ⌘

t as current time, T as max time

✓0 ! initial value

while t < T: do

Update: ✓j ! ✓j − ⌘ · @J(✓,x,y)

@✓j

end while

Assume you are given a data set with n training examples and p features. We first want to find the

negative conditional log-likelihood of the training data in terms of the design matrix X, the labels y, and the

parameter vector ✓. This will be your objective function J(✓) for gradient descent. Here we assume that each

feature vector xi,· contains a bias feature, that is, xi,1 = 1 8i 2 1, …,n. Thus, xi,· is a (p + 1)-dimensional

vector where xi,1 =1. As such, the bias parameter is folded into our parameter vector ✓.

(a) Derive the negative log-likelihood of p(y|X, ✓). (Hint: look at the course notes.)

(b) Then, derive the partial derivative of the negative log-likelihood J(✓) with respect to ✓j , j 2 1, …,p + 1.

(c) The update rule is ✓j ! ✓j − ⌘ · @J(✓)

@✓j

. Based on the update rule, write the gradient descent update

for parameter element ✓j .

We will implement this momentarily.

5.2

In this section, you are going to do some preprocessing. You are going to create something similar to a

bag-of-words feature representation, where the feature vector xi,v for movie review i and vocabulary word v

in the dictionary is set to 1 if movie review i contains at least one occurrence of vocabulary v. An example

of the output from your pre-processing for one movie review might be as follows: 20:1 22:1 30:1 35:1 45:1

……., which expresses that vocabulary words 20, 22, 30, 35, 45, etc. are in the movie review. The feature

vector ignores words not in the vocabulary of dict.txt. You can use csv and math packages in python.

5.3

In this section, implement a sentiment polarity analyzer using binary logistic regression. Specifically, you

will learn the parameters of a binary logistic regression model that predicts a sentiment polarity (i.e., the

label) for the corresponding feature vector of each movie review using gradient descent, as follows:

• Initialize all model parameters to 0.

• Use gradient descent to optimize the parameters for a binary logistic regression model. The number

of times gradient descent loops through all of the training data (number of epochs). Set your learning

rate as a constant ⌘=0.1.

• Perform gradient descent updates on the training data for 30 epochs.

Do not hard-code any aspects of the data sets into your code. You should only use the packages that are

provided. You may use sklearn for creating the train/test split. You can use csv and math packages in python.

Page 6

Let us provide a note about efficient computation of dot-products. In linear models like logistic regression,

the computation is often dominated by the dot-product ✓T x of the parameters ✓ 2 Rp with the feature vector

x 2 Rp. When a standard representation of x is used, the dot-product requires O(p) computation, because it

requires a sum over all entries in the vector: ✓T x =

Pp

j=1 ✓jx·,j However, if our feature vector is represented

sparsely, we can observe that the only elements of the feature vector that will contribute a non-zero value to

the sum are those where x·,j 6= 0, since this would allow ✓jx·,j to be nonzero. This requires only computation

proportional to the number of non-zero entries in x, which is generally very small for model compared to

the size of the vocabulary.

6 Boosted Decision Trees and Random Forest (Henry)

In this assignment, you are going to fit and tune random forest and boosted decision trees to predict whether

a given passenger survived the sinking of the Titanic based on the variables provided in the data set. You

may use sklearn and gridsearch.

6.1

First, download the Titanic.csv file from Sakai, and drop the columns that have “na” values from the data

set. Convert the gender variable into a binary variable with 0 representing male, and 1 representing female.

Then split 80% of the data into train and 20% of the data into test set.

6.2

Fit a random forest and boosted decision tree model on the training set with default parameters, and

estimate the time it required to fit each of the models. Which model required less time to train? Why do

you think that is? (Hint: look at the default values of the parameters.)

6.3

Choose a range of parameter values for each of the algorithms (tell us what parameters you played with),

and tune on the training set over 5 folds. Then, draw the ROC and provide the AUC for each of the

tuned models on the whole training set (in one figure) and test set (in another figure). Comment on the

similarities/di↵erences between the two models (if any).

7 Generalized Additive Models

In this question, you are going to fit a generalized additive model (GAM) to predict the species of penguins on

a given dataset. You can download the dataset penguins trunc.csv from Sakai. The dataset is preprocessed

and you can use it directly.

Split 80% of the data into a training set and 20% of the data into a test set. Please use one of the

following methods to fit a GAM model: (1) logistic regression with `1 penalty, (2) Explainable Boosting

Machine, (3) FastSparse, (4) pyGAM. Report training and test accuracy and plot the shape function for

each feature on a separate plot.

Hint 1: For logistic regression with `1 penalty, you can use sklearn packages. You need to first transform

each continuous feature into a set of binary features by splitting on the mid-point of every two consecutive

values realized in the dataset, i.e., CulmenLength 32.6, CulmenLength 33.3. You can then fit regularized

logistic regression to these indicator variables and construct the component function for variable j by

weighting the indicator variables that pertain to feature j by their learned coefficients and adding them up.

You can use sklearn LogisticRegression. Please remember to set penalty to “`1” and algorithm to “liblinear”.

Hint 2: Explainable Boosting Machine is a tree-based, cyclic gradient boosting generalized additive modeling

package. It can be pip installed for Python 3.6+ Linux, Mac, Windows. More details about the algorithm and

Page 7

usage are available in https://github.com/interpretml/interpret and https://interpret.ml/docs/

ebm.html.

Hint 3: FastSparse implements fast classification techniques for sparse generalized linear and additive models

in R. To install this package, please follow the instruction here https://github.com/jiachangliu/

fastSparse/tree/main/installation. Example code about how to run FastSparse is available https:

//github.com/jiachangliu/fastSparse/tree/main/application_and_usage and example code to visualize

shape functions is https://github.com/jiachangliu/fastSparse/tree/main/visualization.

Hint 4: You may also use pygam package to fit a GAM model which is available here https://github.com/

dswah/pyGAM.

Page