CSCE 633 - Spring 2016
Project 2 - Neural Network
due date: Thurs, March 24 (by start of class, 3:55pm)
The goal of this assignment is to implement a multi-layer neural
network (in any programming language of your choice), and test it on
at least 5 datasets from the UCI Machine Learning Repository. Compare
the performance of your neural net to your Decision Tree program.
Your program should implement the BackProp algorithm.
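For concreteness, here is a very rough Python/NumPy sketch of the forward
and backward passes for a single hidden layer with sigmoid units and
squared-error deltas. All names, shapes, and constants are illustrative
assumptions, not a required design:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out = 4, 8, 3                        # illustrative sizes
    W1 = rng.normal(scale=0.1, size=(n_hid, n_in + 1))  # +1 column: bias weight
    W2 = rng.normal(scale=0.1, size=(n_out, n_hid + 1))

    def forward(x):
        x = np.append(x, 1.0)                  # bias input to hidden layer
        h = np.append(sigmoid(W1 @ x), 1.0)    # bias input to output layer
        y = sigmoid(W2 @ h)
        return x, h, y

    def backprop_step(x, target, eta=0.1):
        global W1, W2
        x, h, y = forward(x)
        # output deltas for squared error with sigmoid units
        d_out = (y - target) * y * (1.0 - y)
        # hidden deltas: propagate back through W2, skipping its bias column
        d_hid = (W2[:, :-1].T @ d_out) * h[:-1] * (1.0 - h[:-1])
        W2 -= eta * np.outer(d_out, h)
        W1 -= eta * np.outer(d_hid, x)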
You will want to use command-line flags to specify the number of hidden
layers and the number of nodes per layer. The program should be able to
handle at least 0, 1, or 2 hidden layers.
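In Python, for example, these could be exposed via argparse (the flag
names here are just one possible convention):

    import argparse

    parser = argparse.ArgumentParser(description="multi-layer neural net trainer")
    parser.add_argument("--hidden-layers", type=int, default=1, choices=[0, 1, 2],
                        help="number of hidden layers")
    parser.add_argument("--hidden-nodes", type=int, default=10,
                        help="number of nodes per hidden layer")
    parser.add_argument("--eta", type=float, default=0.1, help="learning rate")
    args = parser.parse_args()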
You will also have to handle discrete attributes. If a discrete
attribute has k values, you can try mapping them to k binary input
nodes, or log2(k) nodes using a binary encoding, or 1 node with k
different values (numeric assignment). Each approach has different
advantages, so you might want to try more than one and see which works best.
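As a rough illustration of the three encodings for a discrete attribute
with k = 4 values (the attribute and its values are made up):

    import math

    values = ["red", "green", "blue", "yellow"]
    index = {v: i for i, v in enumerate(values)}
    k = len(values)

    def one_hot(v):                    # k binary input nodes
        enc = [0.0] * k
        enc[index[v]] = 1.0
        return enc

    def binary_code(v):                # ceil(log2(k)) nodes, binary encoding
        bits = max(1, math.ceil(math.log2(k)))
        return [float((index[v] >> b) & 1) for b in range(bits)]

    def numeric(v):                    # 1 node with k different values
        return [index[v] / (k - 1)]    # scaled to [0, 1]

    print(one_hot("blue"), binary_code("blue"), numeric("blue"))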
If there are multiple class values, you should use multiple output
nodes (i.e. 1 output for each class value). During testing, the output
with the highest activation indicates the predicted class label.
In the special case of binary classification (2-class problems), you
may use 1 output node.
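The prediction step might then look something like this (the class names
are hypothetical):

    import numpy as np

    class_labels = ["setosa", "versicolor", "virginica"]

    def predict(outputs):
        # multi-class: the output node with the highest activation wins
        return class_labels[int(np.argmax(outputs))]

    def predict_binary(output, threshold=0.5):
        # special case: a single output node for 2-class problems
        return 1 if output >= threshold else 0

    print(predict([0.1, 0.7, 0.3]), predict_binary(0.62))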
During training, you will have to monitor the error (MSE) on a validation
set of examples. You will probably need to manually adjust the learning rate
(eta) so that training gradually converges to a minimum error in a
reasonable amount of time (though it might take thousands of epochs).
Optionally, you might want to experiment with adding momentum.
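One possible shape for such a training loop, shown here on a stand-in
single-layer model so the sketch is self-contained (the data, sizes, and
patience value are all made up):

    import numpy as np

    def mse(W, X, T):
        Y = 1.0 / (1.0 + np.exp(-(X @ W)))    # stand-in for your full net
        return float(np.mean((Y - T) ** 2))

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    T = rng.integers(0, 2, (100, 1)).astype(float)
    X_tr, T_tr, X_val, T_val = X[:80], T[:80], X[80:], T[80:]

    W = rng.normal(scale=0.1, size=(5, 1))
    V = np.zeros_like(W)                      # momentum term
    eta, alpha = 0.5, 0.9                     # learning rate, momentum
    best, patience, wait = np.inf, 50, 0

    for epoch in range(5000):
        Y = 1.0 / (1.0 + np.exp(-(X_tr @ W)))
        grad = X_tr.T @ ((Y - T_tr) * Y * (1 - Y)) / len(X_tr)
        V = alpha * V - eta * grad            # momentum smooths the updates
        W += V
        val = mse(W, X_val, T_val)
        if val < best:
            best, wait = val, 0               # validation MSE still improving
        else:
            wait += 1
            if wait >= patience:              # stop when validation MSE stalls
                break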
Use cross-validation to report accuracies and confidence intervals.
(That means testing on a different subset of examples than was used to
update the weights or monitor MSE during training.)
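For instance, a 95% confidence interval over the k fold accuracies might
be computed as below (using the normal approximation; a t critical value
is more precise for small k, and the fold accuracies shown are invented):

    import numpy as np

    def cross_val_ci(accs, z=1.96):
        accs = np.asarray(accs, dtype=float)
        mean = accs.mean()
        half = z * accs.std(ddof=1) / np.sqrt(len(accs))  # CI half-width
        return mean, mean - half, mean + half

    accs = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.90, 0.91, 0.88]
    mean, lo, hi = cross_val_ci(accs)
    print(f"accuracy {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")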
On a few datasets, experiment with different numbers of hidden layers
and hidden units, and report your findings about how many nodes and hidden
layers are optimal. Is there a dataset where using 2 layers is statistically
significantly better than 1? Does having 2 layers slow down training,
impede convergence, or cause overfitting?
When comparing the performance of algorithms, use T-tests.
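In Python, SciPy provides both the paired and the pooled variants; the
per-fold accuracies below are hypothetical:

    from scipy import stats

    nn_accs = [0.91, 0.88, 0.93, 0.90, 0.89]   # neural net, per fold
    dt_accs = [0.85, 0.87, 0.88, 0.84, 0.86]   # decision tree, per fold

    # paired t-test: valid only if both used the same cross-validation splits
    t_paired, p_paired = stats.ttest_rel(nn_accs, dt_accs)

    # simple pooled (independent-samples) t-test otherwise
    t_pooled, p_pooled = stats.ttest_ind(nn_accs, dt_accs)

    print(p_paired, p_pooled)   # p < 0.05 suggests a significant difference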
Hints:
- randomize the order of examples
- normalize continuous attributes (both sketched in code after these hints)
- monitor MSE on validation set
- evaluate accuracy on independent test set
- adjust learning rate to get good convergence in reasonable time
- don't forget the bias input to each node
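The first two hints might look like this in code (min-max scaling shown;
z-score standardization is another reasonable choice, and the data here
is random filler):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((150, 4)) * 10          # continuous attributes
    y = rng.integers(0, 3, 150)

    # randomize the order of examples
    perm = rng.permutation(len(X))
    X, y = X[perm], y[perm]

    # normalize continuous attributes to [0, 1]
    # (in practice, compute lo/hi on the training split only)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)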
What to Turn In
Submit your files through the CSNet submission page:
https://wiki.cse.tamu.edu/index.php/Turning_in_Assignments_on_CSNet
(you might need to be inside the TAMU firewall to access this)
Include the following:
- Your source code. Also include a description of how to compile and
run your program, sufficient so that it can be tested by the grader.
- A write-up that describes salient details of your implementation,
such as whether you use stochastic updates or momentum, how you handle
discrete inputs, how you normalize continuous inputs, your threshold
function, your stopping criterion, etc.
- A table of results. Include the confidence interval on accuracy
(from cross-validation) for the best version of your algorithm on at
least 5 datasets (include brief descriptions of what they are). Also
include details of the final number of layers, hidden nodes, learning
rate, and iterations (to convergence) used in each case.
- On at least 2 of the datasets, compare the performance with 0, 1,
and 2 hidden layers, and with different numbers of hidden nodes
(for example, 5,10,15,20 or 2,4,8,16, etc.). You might also want to
indicate the total number of parameters (weights) in each case; a small
sketch for counting these appears after this list.
- In the table, include confidence intervals for your decision tree,
and do a T-test to determine whether any of the differences are
statistically significant. (Note that a paired T-test would be
preferable because it should be more sensitive, but I realize it might be
difficult to run your programs on the same cross-validation splits, so
a simple pooled T-test would be sufficient, if you want.)
- Include a discussion of your results, i.e. on which datasets did
the neural network do better than the decision tree and why? What
impacted the performance of your network the most? Did having more
hidden layers and/or nodes increase accuracy, or take longer to converge?
What is optimal (in your opinion, based on your results)?
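As promised above, counting the total number of weights (including each
node's bias weight) is a one-liner over the layer sizes:

    def num_weights(layer_sizes):
        # layer_sizes = [inputs, hidden..., outputs]; +1 for each node's bias
        return sum((p + 1) * n for p, n in zip(layer_sizes, layer_sizes[1:]))

    print(num_weights([8, 10, 3]))   # 8 inputs, 10 hidden, 3 outputs -> 123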