## IBM SPSS Modeler Lab

MODELER LAB ON SUPPORT VECTOR MACHINES (SVM)

Input Data: WineTrainValidate.xlsx

SETUP

Open the file WineTrainValidate.xlsx and examine the fields. These are numeric measurements taken early in the winemaking process; they are used to classify each wine as type A, B, or C.

There are two data sets, Train and Validate, and a worksheet with descriptive data. Make sure to close Excel before the next step.

PART 1

Set up your standard trio of nodes: Excel (Sources) – Type (Field Ops) – Table (Output). Attach the WineTrainValidate.xlsx file to the Excel node and set it to read the Train dataset.

Right-click on this node and choose Copy Node to make a copy of it. Set the copy to read the Validation dataset. Name each node after its data set (under Annotations).

Note that the Excel node named Validation will not be connected to any other node.

Instantiate the data. Open the Type node and set Type as the Target and all other fields as Input.

PART 2

Attach three SVM nodes to the Type node. A support vector machine model is built on a specific type of underlying function, and each type of function (called the kernel) gives a different kind of separation of the target data. Sometimes, a kernel will just not work on a specific data set. If you notice that your model has been running for a while and is not finishing, it may be that the kernel type cannot separate the data set. Here, we will try each of the three kernel types.

On the Expert tab of each SVM node, select Expert mode, then set one node to use an RBF kernel, one to Sigmoid, and the third to Polynomial of degree 3. On the Analyze tab, check the box to show the variable importance in each one. On the Annotations tab, name each model after the type of function it is using so that each one has a different name. We are going to run the training data through each type of kernel to see which one does best on our dataset.
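To make the idea of a kernel concrete, here is a minimal sketch of the three kernel functions in plain Python. The formulas are the common textbook forms; the parameter names (`gamma`, `bias`, `degree`), their default values, and the two sample vectors are assumptions for illustration and are not taken from the Modeler settings or the wine data.

```python
import math


def dot(x, y):
    """Dot product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel: exp(-gamma * squared distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def polynomial_kernel(x, y, gamma=1.0, bias=0.0, degree=3):
    """Polynomial kernel: (gamma * x.y + bias) ** degree."""
    return (gamma * dot(x, y) + bias) ** degree


def sigmoid_kernel(x, y, gamma=1.0, bias=0.0):
    """Sigmoid kernel: tanh(gamma * x.y + bias)."""
    return math.tanh(gamma * dot(x, y) + bias)


# Two hypothetical, already-scaled measurement vectors (not real wine records).
a = [0.2, 0.5, 0.1]
b = [0.4, 0.1, 0.3]

for name, kernel in [("RBF", rbf_kernel),
                     ("Polynomial", polynomial_kernel),
                     ("Sigmoid", sigmoid_kernel)]:
    print(name, round(kernel(a, b), 4))
```

Each function turns a pair of records into a single similarity score, and the SVM finds its separating boundary in terms of those scores. That is why changing the kernel can change which records end up on which side.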

Now run each SVM model to get a generated SVM model for each kernel type.

If we Browse the generated models on the right, we can see how they view the importance of the inputs. Here we have the RBF, followed by the Polynomial, then the Sigmoid.

The three variables with the greatest Predictor Importance values here are Proline, OD280_OD315, and Alcohol.

The three variables with the greatest Predictor Importance values here are Proline, OD280_OD315, and Flavanoids.

The three variables with the greatest Predictor Importance values here are Alcohol, Color_intensity, and Ash.

Note that these models do not have the same Predictor Importance rankings for the variables.

To see how well each model has done, we will attach a Matrix node to each generated model in the stream and set it to show Type vs. $S-Type. Also attach a Plot node to each and set it to graph Type vs. $S-Type (predicted type). Choose a color overlay of the most important predictor variable for that model. Here is how the stream should look now.

Next, we want to run each matrix node and observe the results.

Left to right, we have RBF, Polynomial, then Sigmoid. We see that the RBF and Polynomial functions divided the data set perfectly while the Sigmoid did not.
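The Matrix node output is a cross-tabulation of the actual Type against the predicted $S-Type. As a rough sketch of what that table contains, the following plain Python builds the same kind of matrix. The function name `confusion_matrix` and the short label lists are hypothetical and are not the lab data.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


labels = ["A", "B", "C"]

# Hypothetical actual and predicted wine types, for illustration only.
actual = ["A", "A", "B", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "C", "B", "C", "C"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

A model that divides the data perfectly has nonzero counts only on the diagonal. Any count off the diagonal is a record whose predicted type differs from its actual type.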

Now, let’s run the Plot nodes attached to these generated models.

Here is the Plot node for the model using the Sigmoid function.

Notice that the highest values for Alcohol seem to be in Types A and C, while Type B has low to moderate values. We also see that there are fewer incorrect predictions for C than for A and B.

Here is the Plot node for the model using the Polynomial function.

Here is the Plot node for the model using the RBF function.

We see that the Plot nodes for the RBF and Polynomial are identical since they both gave perfect predictions, and both used the same variable as most important. In these graphs, we have the A cluster with the highest values of Proline.

You can try other variables to see how their values are spread across the wine types.

PART 3

Next let’s see how the kernels do on the validation set.

Change the data set being read to the Validation set by disconnecting the Excel Train node from the Type node and connecting the Excel Validation node to the Type node.

Let’s rename the three Matrix nodes and the three Plot nodes using the kernel function type of their corresponding model (RBF, Polynomial, Sigmoid).

Next, run the three Matrix nodes. (Be sure you do not retrain the models. Do not run the SVM model nodes; run only the Matrix nodes attached to the generated models.)

Here are the matrix node outputs.

The RBF model has every prediction correct on the validation set. The polynomial model missed 1. The sigmoid model has 9 incorrect.
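The error counts above come from reading the off-diagonal cells of each matrix. A small sketch of that arithmetic follows. The 3x3 matrix is a made-up example with one miss, not the actual validation output, and `count_errors` is a hypothetical helper name.

```python
def count_errors(matrix):
    """Return (number of misclassified records, accuracy) for a square matrix
    whose rows are actual classes and whose columns are predicted classes."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return total - correct, correct / total


# Hypothetical matrix: one Type B record predicted as Type A.
example = [
    [10, 0, 0],
    [1, 12, 0],
    [0, 0, 8],
]

errors, accuracy = count_errors(example)
print("incorrect:", errors)
print("accuracy:", round(accuracy, 3))
```

Comparing this count across the three kernels on the validation set, rather than on the training set, is what shows how well each model generalizes.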

Let’s also look at the Plot nodes on the validation data. First, the RBF graph. We see the greatest amounts of Proline in A, the lowest in B, and moderate amounts in C.

Now the Polynomial graph. We see a similar spread of Proline values, with the incorrect value having a Proline amount of 550 (hover your mouse pointer over the spot).

For the Sigmoid plot, we see that the lowest amounts of Alcohol occur in Type B, but the predictions for B have the greatest number of errors.

Given the results on the validation set, if we had to pick only one, the RBF would be the winner.

Disconnect the validation set and reconnect the training set.

DELIVERABLE

Save your stream and attach it to this Sakai assignment. Your stream should include the following:

The Excel–Type–Table nodes with the file WineTrainValidate attached and set to the Train data.

An Excel node with the Validation set data.

The Type node settings as specified.

Three SVM models, one with an RBF kernel, one with a Sigmoid kernel, and the last with a Polynomial kernel.

Generated models for each of the SVM models.

Matrix and Plot nodes attached to each generated model with appropriate settings.