Workshop 1

Before the workshop

Watch lectures 1 to 4 and read the corresponding chapters in the course literature
Prepare questions (if you have any) on the content of lectures 1 to 4

We recommend using a suitable Docker image for the workshop assignments; see the Docker page.

The aim of this workshop is to discuss the content of lectures 1 to 4 and to work through practical assignments on what you have learned in the lectures.

A1: Text classification in the Weka tool

Download and install Weka if you have not already done so
Download and unzip the Wikipedia_300 dataset from the Datasets page
Apply the StringToWordVector filter in the Preprocess tab, and select Articletype as the target attribute in the dropdown list in the Classify tab
Classify the dataset in Weka using the NaiveBayes and NaiveBayesMultinomial algorithms with 10-fold cross-validation. What are the differences between the two classifiers, and why do you think one performs better than the other?
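
To make the StringToWordVector step less opaque, the sketch below shows the general bag-of-words idea in plain Python: each document becomes a vector of word counts over a shared vocabulary. It is a conceptual illustration only, not Weka's implementation. The function names tokenize, build_vocabulary and to_count_vector, the simplistic regex tokenizer, and the two example sentences are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters/digits
    # (an assumed, deliberately simple tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    # Sorted list of every distinct token seen in the corpus.
    return sorted({token for doc in documents for token in tokenize(doc)})


def to_count_vector(text, vocabulary):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Two made-up documents, used only for illustration.
docs = [
    "The game engine renders the game world",
    "A programming language compiles to code",
]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]

for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Weka's filter has its own options for tokenization and for storing word counts versus word presence, so its output may differ from this sketch.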

A2: Text classification using the Weka library

Write Java code for classifying the Wikipedia_300 dataset using the NaiveBayesMultinomial algorithm with the Weka.jar library
Make sure you apply the StringToWordVector filter and set the correct index for the class label (it should be 0)
Read about how to use the Weka library from Java code here

A3: Classification of Iris dataset in the Weka tool

Classify the Iris dataset in Weka using the algorithms k-Nearest Neighbor (lazy/IBk), Decision Trees (trees/J48), and Naïve Bayes (bayes/NaiveBayes)
The Iris dataset can be found in the data folder in the Weka installation or can be downloaded from the Datasets page
Which algorithm gives the best result?

A4: Classification of Iris dataset using the Weka library

Write Java code for classifying the Iris dataset using the Weka.jar library. Select the algorithm based on your findings in assignment A3-1.

A5: Classification of Iris dataset in Scikit

Classify the Iris dataset in Scikit (scikit-learn) using the k-Nearest Neighbor algorithm
Experiment with different values for k. Which setting gives the best accuracy?
Test and compare the results when using a Decision Tree classifier instead. Which classifier gives the best accuracy?
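
As a possible starting point for this assignment, the sketch below compares kNN for a few values of k against a decision tree, using 10-fold cross-validated accuracy in scikit-learn. It assumes scikit-learn is installed and uses the copy of Iris bundled with the library rather than the file from the Datasets page. The list of k values, the use of 10 folds and the fixed random_state are arbitrary choices made for this example, not requirements of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Iris as bundled with scikit-learn: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Mean 10-fold cross-validated accuracy for a few values of k
# (the candidate values are an arbitrary choice).
knn_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn_scores[k] = cross_val_score(knn, X, y, cv=10).mean()

# Same evaluation for a decision tree; random_state is fixed
# only to make repeated runs give the same tree.
tree = DecisionTreeClassifier(random_state=0)
tree_score = cross_val_score(tree, X, y, cv=10).mean()

for k, score in knn_scores.items():
    print(f"kNN k={k}: {score:.3f}")
print(f"Decision tree: {tree_score:.3f}")
```

The exact numbers depend on how the folds are formed and on the library version, so treat them as indicative rather than as reference results.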

A6: Classification of Iris dataset in R

Classify the Iris dataset in R using the Decision Tree (CART) and k-Nearest Neighbor (kNN) algorithms
Which algorithm gives the best result?
Does the result from kNN match the results from Scikit and Weka? If not, what are the reasons for the differences in the results?

A7: Regression using the Weka library