# CSCE 421 :: Machine Learning :: Texas A&M University :: Fall 2021

# Programming Assignment 3 (PA 3)
**Name:**  
**UIN:**   

# Support Vector Machines

- **100 points**
- **Due Tuesday, Nov 16, 11:59 pm**

In this assignment, you'll be training support vector machines for classification.

### Instructions
- Download the datasets `satimage_train.csv` and `satimage_test.csv` from the course [webpage](https://people.engr.tamu.edu/guni/csce421/assignments.html) or from Canvas. **Place these files in the same directory as this notebook.**
- You are allowed to use machine learning libraries such as `scikit-learn` for this assignment. A few of the basic library methods have been already imported for you. Feel free to import any additional methods that you need.
- You are required to complete the functions defined in the code blocks following each question. Fill out sections of the code marked `"YOUR CODE HERE"`.
- You are free to add any number of additional code blocks that you deem necessary. 
- Once you've filled out your solutions, submit the notebook on Canvas following the instructions [here](https://people.engr.tamu.edu/guni/csce421/assignments.html).
- Do **NOT** forget to type in your name and UIN at the beginning of the notebook.

In [None]:
# importing libraries
import sys
import pickle

import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_absolute_error

import matplotlib.pyplot as plt
%matplotlib inline

## Question 1 (10 points)

## Data Preprocessing

For this assignment, we will use the Statlog dataset. This database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number. The attributes are numerical, in the range 0 to 255. More information about the database can be found [here](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Landsat+Satellite%29).

In [None]:
# Read the data
train_df = pd.read_csv('satimage_train.csv')
test_df = pd.read_csv('satimage_test.csv')

In [None]:
train_df.head().T

### To-do steps
1. Remove rows with `NaN` values from `train_df` and `test_df`.
2. Create `X_train` and `X_test` by selecting columns `X1` through `X36`. Create `y_train` and `y_test` by selecting column `Class`.
3. Normalize `X_train` using `MinMaxScaler` from scikit-learn. Then normalize `X_test` using the normalization parameters derived from `X_train`.
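
The three steps can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the `toy_*` names, the two feature columns, and all values are hypothetical and are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames standing in for the real data (hypothetical values).
toy_train = pd.DataFrame({"X1": [0.0, 10.0, np.nan, 20.0],
                          "X2": [5.0, 15.0, 25.0, 35.0],
                          "Class": [1, 6, 6, 2]})
toy_test = pd.DataFrame({"X1": [5.0, 30.0],
                         "X2": [10.0, 20.0],
                         "Class": [6, 1]})

# Step 1: drop every row that contains a NaN.
toy_train = toy_train.dropna()
toy_test = toy_test.dropna()

# Step 2: separate the feature columns from the label column.
feature_cols = ["X1", "X2"]
toy_X_train, toy_y_train = toy_train[feature_cols], toy_train["Class"]
toy_X_test, toy_y_test = toy_test[feature_cols], toy_test["Class"]

# Step 3: learn min/max on the training features only, then reuse them for test.
scaler = MinMaxScaler()
toy_X_train_scaled = scaler.fit_transform(toy_X_train)
toy_X_test_scaled = scaler.transform(toy_X_test)
```

Because the scaler is fitted on the training features only, scaled test values may fall outside the range 0 to 1 (here a test value of 30 in `X1` maps to 1.5). That is expected and is not a sign of an error.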

In [None]:
# Step 1: Drop NaN values
######################
#   YOUR CODE HERE   #
######################


# Step 2: Create train and test data
######################
#   YOUR CODE HERE   #
######################


# Step 3: Normalize data
######################
#   YOUR CODE HERE   #
######################


## Question 2 (30 points) 

## Hyperparameter Tuning 

Consider the binary classification problem of distinguishing class 6 from all other data points. Use SVMs with polynomial kernels to solve this problem. For each value of the polynomial degree, $d$ = 1, 2, 3, 4, plot the average 5-fold cross-validation error plus or minus one standard deviation as a function of $C$ on the training data. Leave the other parameters of the polynomial kernel at their default values.

**Report the best value of the trade-off constant $C$, as measured by internal cross-validation on the training data.**
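
As a rough illustration of the procedure, the sketch below estimates the cross-validation error for a single $(C, d)$ pair on synthetic data. The helper `cv_error`, the `X_demo`/`y_demo` arrays, and the choice to shuffle with a fixed seed are assumptions made for this example. On the real data, the binary labels could be formed with something like `(y_train == 6).astype(int)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic binary data standing in for the "class 6 vs. rest" labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

def cv_error(X, y, C, degree, n_folds=5, seed=0):
    """Return (mean, std) of the misclassification rate across the folds."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, val_idx in skf.split(X, y):
        model = SVC(kernel="poly", degree=degree, C=C)
        model.fit(X[train_idx], y[train_idx])
        # score() returns accuracy, so the error is its complement.
        fold_errors.append(1.0 - model.score(X[val_idx], y[val_idx]))
    return np.mean(fold_errors), np.std(fold_errors)

demo_mean, demo_std = cv_error(X_demo, y_demo, C=1.0, degree=2)
print(f"error = {demo_mean:.3f} +/- {demo_std:.3f}")
```

Repeating this call over a grid of $C$ and $d$ values would fill one entry of the mean and standard-deviation arrays per pair.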

In [None]:
def cross_validation_score(X, y, c_vals, n_folds, d_vals):
    """
    Calculates the cross validation error and returns its mean and standard deviation.
    
    Args:
        X: features
        y: labels
        c_vals: list of C values
        n_folds: number of cross-validation folds
        d_vals: list of degrees of the polynomial kernel
    
    Returns:
        Tuple of (error_mean, error_std), each an array of shape (len(c_vals), len(d_vals))
    """
    
    error_mean = np.zeros((len(c_vals),len(d_vals)))
    error_std = np.zeros((len(c_vals),len(d_vals)))
    
    ######################
    #   YOUR CODE HERE   #
    ######################
    
            
    return error_mean, error_std

In [None]:
######################
#   YOUR CODE HERE   #
######################
n_folds = 5
d_vals= [1, 2, 3, 4]
c_vals = # Provide a list of C values

In [None]:
error_mean, error_std = cross_validation_score(X_train, y_train, c_vals, n_folds, d_vals)

**Plot the average cross validation error.**

In [None]:
for i,d_val in enumerate(d_vals):
    plt.rcParams.update({'font.size': 12})
    plt.figure(figsize = (8,5)) 
    plt.bar(range(len(c_vals)), error_mean[:,i], 
            yerr=error_std[:,i],
            align='center',
            alpha=0.5,
            ecolor='k',
            capsize=10,
            label = "Average Error")
    plt.suptitle('Error vs C for d={} Kernel'.format(d_val), fontsize=15)
    plt.xlabel('C Values', fontsize=10)
    plt.xticks(range(len(c_vals)), c_vals, rotation='vertical')
    plt.ylabel('Average Error Across {} Folds'.format(n_folds), fontsize=10)

**Plot $(C, d)$ pairs with their corresponding cross validation errors.**

In [None]:
plt.rcParams.update({'font.size': 12})
plt.figure(figsize = (8,5))
colors = ['r', 'g', 'b', 'c', 'y']
for i,d_val in enumerate(d_vals):
    plt.plot(c_vals, error_mean[:,i],  # plot the mean error, matching the axis label
             marker='o', 
             color=colors[i%5], 
             alpha=1 - 0.2 * d_val/len(d_vals), 
             label = "d={}".format(d_val))
plt.suptitle('Error vs C for all d values', fontsize=15)
plt.xlabel('C values', fontsize=10)
plt.ylabel('Average Error Across {} Folds'.format(n_folds), fontsize=10)
plt.legend()

## Question 3 (30 points) 

## Model Training and Testing

**For each $d$ value, build a model on the full training data using the best $C$ value you found previously with 5-fold cross validation.**

In [None]:
def build_model(X_train, y_train, X_test, y_test, c_vals, d_vals):
    """
    Trains model on a dataset for given values of C and d. Returns the error on the test data,
    the number of support vectors, the number of margin violations, and the margin size.
    
    Args:
        X_train: features in training data
        y_train: train labels
        X_test: features in test data
        y_test: test labels
        c_vals: list of C values
        d_vals: list of degrees of the polynomial kernel
    
    Returns:
        Tuple of (error_test, support_vectors, margin_violations, margin_size)       
    """
    error_test = np.zeros(len(d_vals))
    support_vectors = np.zeros(len(d_vals))
    margin_violations = np.zeros(len(d_vals))
    margin_size = np.zeros(len(d_vals))
    
    ######################
    #   YOUR CODE HERE   #
    ######################
    
    
    return error_test, support_vectors, margin_violations, margin_size

In [None]:
######################
#   YOUR CODE HERE   #
######################
d_vals= [1,2,3,4] # List of degrees
c_vals = # Provide a list of corresponding best C values

In [None]:
error_test, support_vectors, margin_violations, margin_size = build_model(X_train, y_train, 
                                                                          X_test, y_test, 
                                                                          c_vals, d_vals)

**Plot the test errors for each model, as a function of $d$.**

In [None]:
plt.rcParams.update({'font.size': 12})
plt.figure(figsize = (8,5)) 
plt.plot(d_vals, error_test ,marker='o', color='b')
plt.suptitle('Test Error vs d values', fontsize=20)
plt.xlabel('d values', fontsize=10)
plt.ylabel('Test Error', fontsize=10);

## Question 4 (10 points) 

## Number of support vectors

**Plot the number of support vectors obtained as a function of $d$.**
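
For reference, a fitted `SVC` exposes its support vectors through the attributes `n_support_`, `support_`, and `support_vectors_`. The sketch below reads them from a model trained on synthetic data. The `X_demo`, `y_demo`, and `demo_model` names and the parameter values are placeholders for this example.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = SVC(kernel="poly", degree=3, C=1.0).fit(X_demo, y_demo)

# n_support_ holds one count per class; the total is their sum.
n_sv_per_class = demo_model.n_support_
n_sv_total = int(n_sv_per_class.sum())
print(n_sv_per_class, n_sv_total)
```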

In [None]:
plt.rcParams.update({'font.size': 12})
plt.figure(figsize = (8,5)) 
plt.plot(d_vals, support_vectors, marker='o', color='b')
plt.suptitle('Number of Support Vectors vs d values', fontsize=20)
plt.xlabel('d values', fontsize=10)
plt.ylabel('Number of Support Vectors', fontsize=10);

## Question 5 (10 points)

## Number of Margin Violations

**Plot the number of support vectors that violate the margin hyperplanes as a function of $d$.**
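
One possible way to count margin violations is to look for support vectors whose functional margin $y_i f(x_i)$ is below 1, with labels mapped to $-1$/$+1$. The sketch below does this on synthetic data. The tolerance `tol`, the variable names, and the data are assumptions for this example, and other counting conventions (for instance, support vectors whose dual coefficient reaches $C$) are equally reasonable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = SVC(kernel="poly", degree=2, C=1.0).fit(X_demo, y_demo)

# Map labels to -1/+1 so a positive decision value corresponds to +1.
y_signed = np.where(y_demo == demo_model.classes_[1], 1.0, -1.0)

# Functional margin y_i * f(x_i) for each support vector.
sv_idx = demo_model.support_
functional_margin = y_signed[sv_idx] * demo_model.decision_function(X_demo[sv_idx])

tol = 1e-3  # assumed slack to absorb the solver's numerical tolerance
n_violations = int(np.sum(functional_margin < 1.0 - tol))
print(n_violations, "of", len(sv_idx), "support vectors violate the margin")
```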

In [None]:
plt.rcParams.update({'font.size': 12})
plt.figure(figsize = (8,5)) 
plt.plot(d_vals, margin_violations, marker='o', color='b')
plt.suptitle('Number of Support Vectors that Violate the Margin vs d values', fontsize=20)
plt.xlabel('d values', fontsize=10)
plt.ylabel('Number of Support Vectors that Violate the Margin ', fontsize=10);

## Question 6 (10 points) 

## Margin Size vs Support Vectors

**Explain how the parameter $d$ influences the model fit (plot the margin size as a function of $d$).**
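
If the margin size is taken to be $2/\lVert w \rVert$, the norm can be obtained in feature space from the dual coefficients: $\lVert w \rVert^2 = a^\top K a$, where $a$ holds the products $\alpha_i y_i$ and $K$ is the kernel matrix over the support vectors. The sketch below is one way to compute this under stated assumptions: the helper `margin_width` is hypothetical, and `gamma` is set to an explicit number so the kernel can be recomputed exactly (the assignment itself uses the default kernel parameters).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def margin_width(model):
    """Width 2/||w|| of a fitted binary poly-kernel SVC with a numeric gamma."""
    sv = model.support_vectors_
    a = model.dual_coef_[0]  # alpha_i * y_i for each support vector
    K = polynomial_kernel(sv, degree=model.degree,
                          gamma=model.gamma, coef0=model.coef0)
    w_norm_sq = a @ K @ a    # ||w||^2 in the kernel's feature space
    return 2.0 / np.sqrt(w_norm_sq)

# Synthetic binary data, scaled to [0, 1], used only for illustration.
X_raw, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_demo = MinMaxScaler().fit_transform(X_raw)

for d in [1, 2, 3]:
    demo_model = SVC(kernel="poly", degree=d, gamma=1.0, coef0=0.0, C=1.0)
    demo_model.fit(X_demo, y_demo)
    print(d, margin_width(demo_model))
```

Margins computed this way live in a different feature space for each $d$, so comparisons across degrees should be interpreted with care.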

In [None]:
plt.rcParams.update({'font.size': 12})
plt.figure(figsize = (8,5)) 
plt.plot(d_vals, margin_size, marker='o', color='b')
plt.suptitle('Hyperplane Margin Size vs d values', fontsize=20)
plt.xlabel('d Values', fontsize=12)
plt.ylabel('Hyperplane Margin Size', fontsize=12);

In [None]:
## TYPE YOUR ANSWER BELOW