A4 - Machine Learning

Description

In this assignment, you shall implement the Naïve Bayes machine learning algorithm and apply it to two datasets.
You can use any programming language you like.
You shall present your application and code at an oral examination.
You are not required to build a REST web service for this assignment.

Submission instructions

Requirements

Grade	Requirements
E	Implement the Naïve Bayes algorithm, using the code structure below (you are allowed to add more classes and methods if needed). Train the model on the Iris and Banknote authentication datasets (see Datasets page). Calculate classification accuracies for both datasets (use all data for both training and testing).
C-D	Implement code for generating confusion matrices, using the code structure below.
A-B	Implement code for n-fold cross-validation, using the code structure below. It shall be possible to use 3, 5 or 10 folds (it is okay if your implementation also supports other numbers of folds). Calculate the accuracy score for 5-fold cross-validation on both datasets.

Note! The purpose of this assignment is for you to learn how to implement Naïve Bayes, encode label strings as integers, calculate accuracy scores, perform cross-validation, and generate a confusion matrix. This functionality is often available in machine learning libraries such as Weka or Scikit-learn, which you are not allowed to use. You are allowed to use library functions for loading and shuffling the data, and for all necessary mathematical operations, data structures, etc.
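As an illustration of encoding label strings as integers, here is a minimal Python sketch. The function name encode_labels and the choice to number labels in order of first appearance are assumptions made for this example, not part of the required code structure.

```python
def encode_labels(labels):
    """Map each distinct label string to an integer id (order of first appearance)."""
    mapping = {}   # label string -> integer id
    encoded = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        encoded.append(mapping[label])
    return encoded, mapping


y, mapping = encode_labels(["setosa", "versicolor", "setosa", "virginica"])
print(y)        # [0, 1, 0, 2]
print(mapping)  # {'setosa': 0, 'versicolor': 1, 'virginica': 2}
```

Keeping the mapping around makes it possible to translate integer predictions back to the original label strings.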

Code structure requirements

NaiveBayes class
void fit ( X:float[][], y:int[] )	Trains the model on input examples X and labels y.
int[] predict ( X:float[][] )	Classifies examples X and returns a list of predictions.
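The class below is one possible way to fill in this structure, shown in Python as a sketch only. It assumes the Gaussian variant of Naïve Bayes (a natural fit for numeric attributes such as those in Iris and Banknote authentication), includes the class prior, and uses the population standard deviation with a small floor. All three choices, as well as the helper name _log_score, are assumptions for this example, and your results may differ slightly if you choose differently.

```python
import math


class NaiveBayes:
    """Gaussian Naive Bayes sketch: one normal distribution per class and feature."""

    def fit(self, X, y):
        # Group the training rows by their integer class label.
        rows_by_class = {}
        for row, label in zip(X, y):
            rows_by_class.setdefault(label, []).append(row)

        self.classes = sorted(rows_by_class)
        self.log_priors, self.means, self.stds = {}, {}, {}
        for label, rows in rows_by_class.items():
            n = len(rows)
            self.log_priors[label] = math.log(n / len(X))
            columns = list(zip(*rows))  # one tuple of values per feature
            self.means[label] = [sum(col) / n for col in columns]
            # Population standard deviation; the 1e-9 floor (an assumption)
            # avoids division by zero for constant features.
            self.stds[label] = [
                max(math.sqrt(sum((v - m) ** 2 for v in col) / n), 1e-9)
                for col, m in zip(columns, self.means[label])
            ]

    def _log_score(self, row, label):
        # Log prior plus the sum of per-feature Gaussian log densities.
        total = self.log_priors[label]
        for v, m, s in zip(row, self.means[label], self.stds[label]):
            total += -math.log(s * math.sqrt(2 * math.pi)) - (v - m) ** 2 / (2 * s ** 2)
        return total

    def predict(self, X):
        # Pick the class with the highest log score for each example.
        return [max(self.classes, key=lambda c: self._log_score(row, c)) for row in X]


# Toy usage with made-up data (not one of the assignment datasets).
X_train = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]]
y_train = [0, 0, 1, 1]
model = NaiveBayes()
model.fit(X_train, y_train)
print(model.predict([[1.1, 1.9], [5.1, 6.0]]))  # [0, 1]
```

Working with log probabilities rather than multiplying raw densities avoids numerical underflow when there are many attributes.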

Other methods
float accuracy_score ( preds:int[], y:int[] )	Calculates accuracy score for a list of predictions.
int[][] confusion_matrix ( preds:int[], y:int[] )	Generates a confusion matrix and returns it as an integer matrix.
int[] crossval_predict ( X:float[][], y:int[], folds:int )	Runs n-fold cross-validation and returns a list of predictions.

Input data (a float matrix with input variables as columns and examples as rows) is usually denoted X. The categories/labels (a list of integers) are usually denoted y. Predictions (a list of integers) shall be compared with the actual labels (y) when calculating the accuracy score (percentage of correct predictions) and generating the confusion matrix.
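The comparison between predictions and actual labels can be sketched in Python as follows. This example assumes that labels are encoded as integers 0..k-1, that the accuracy is returned as a percentage, and that the confusion matrix has actual classes as rows and predicted classes as columns. The orientation is a convention chosen for this example, so check which one your reference results use.

```python
def accuracy_score(preds, y):
    """Percentage of predictions that match the actual labels."""
    correct = sum(1 for p, t in zip(preds, y) if p == t)
    return 100.0 * correct / len(y)


def confusion_matrix(preds, y):
    """Integer matrix where matrix[actual][predicted] counts examples."""
    n_classes = max(max(preds), max(y)) + 1  # assumes labels are 0..k-1
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for p, t in zip(preds, y):
        matrix[t][p] += 1
    return matrix


# Made-up labels and predictions for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(accuracy_score(y_pred, y_true), 2))  # 66.67
print(confusion_matrix(y_pred, y_true))          # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

The diagonal of the matrix holds the correctly classified examples, so the accuracy equals the sum of the diagonal divided by the total number of examples.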

Test cases

You can verify your results against the results in Web ML Demonstrator. The Iris dataset is built into Web ML (click the Try Iris dataset button), and the Banknote authentication dataset can be uploaded from the CSV file. Note that the cross-validation results can differ slightly depending on how the data is split into folds, but the accuracy you get should be close to the accuracy in Web ML.
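To show why the fold split affects the result, here is a Python sketch of n-fold cross-validation. The extra parameters make_model and seed, and the stand-in classifier MajorityClass, are hypothetical additions that keep the example self-contained. In the assignment, the model would be your NaiveBayes class, and how Web ML splits the data is not specified here.

```python
import random


class MajorityClass:
    """Hypothetical stand-in classifier; replace with your NaiveBayes class."""

    def fit(self, X, y):
        self.label = max(sorted(set(y)), key=y.count)

    def predict(self, X):
        return [self.label for _ in X]


def crossval_predict(X, y, folds, make_model=MajorityClass, seed=0):
    """Return one prediction per example, each made by a model that did not
    see that example during training."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # a different seed gives different folds
    preds = [None] * len(X)
    for k in range(folds):
        test_idx = indices[k::folds]      # every folds-th shuffled index
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        model = make_model()              # fresh model for each fold
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_preds = model.predict([X[i] for i in test_idx])
        for i, p in zip(test_idx, fold_preds):
            preds[i] = p                  # store in original example order
    return preds


# Made-up data for illustration.
X_demo = [[float(i)] for i in range(10)]
y_demo = [0] * 7 + [1] * 3
print(crossval_predict(X_demo, y_demo, folds=5))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Because the predictions are returned in the original example order, they can be compared directly with y to compute the accuracy score or the confusion matrix. Changing the seed changes which examples end up in each fold, which is why cross-validation accuracies can vary slightly between runs and tools.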