A2 - Clustering

Description

  • In this assignment, you shall implement clustering on the blogs dataset containing 99 blogs.
  • You can use any programming language you like.
  • You can work alone or in a group of two students.
  • You shall present your application and code at an oral examination.

Submission instructions

See the Deadlines and Submissions page.

Requirements

Grade | Requirements
E
  • Implement K-means Clustering with Pearson similarity.
  • Run the algorithm on the blog dataset (see the Datasets page) with five clusters.
  • The algorithm shall stop after a specified number of iterations.
  • Present the result as a list of clusters and their assignments.
  • Implement the system as a REST web service where:
    1. The client sends a request to the server.
    2. The server responds with JSON data.
    3. The JSON data is decoded and presented in a client GUI.
C-D
  • Instead of stopping after a specified number of iterations, you shall implement functionality for stopping when no new assignments are made.
  • Each cluster must keep track of its previous assignment. After every iteration, check whether the new assignment of each cluster matches the previous one, and stop when nothing has changed.
A-B
  • Implement Hierarchical Clustering with Pearson similarity.
  • Run the algorithm on the blog dataset.
  • Present the result as an interactive tree in the client GUI (it shall be possible to expand/collapse branches).
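
Both algorithms in the requirements above rely on Pearson similarity. The following is a minimal sketch of how it could be computed, assuming each blog is represented as a list of word counts (an assumption about the dataset layout, which is not described here). The names `pearson` and `pearson_distance` are illustrative only, and returning 0.0 for a vector with no variation is a convention chosen for this sketch.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric vectors.

    Returns a value in [-1, 1]; 1 means the vectors rise and fall together.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Covariance and variances, all left unnormalised (the 1/n factors cancel).
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    den = sqrt(var_a * var_b)
    # Assumption for this sketch: a constant vector is treated as uncorrelated.
    return cov / den if den else 0.0


def pearson_distance(a, b):
    """Turn the similarity into a distance: 0 for perfectly correlated vectors."""
    return 1.0 - pearson(a, b)
```

Clustering code usually wants "smaller is closer", which is why the sketch also exposes the distance form `1 - r`.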

Test cases

K-means

K-means is not deterministic (the results differ between runs), but you usually find related blogs such as the Google and search engine blogs in one cluster as shown here:

resources/A2-Kmeans.png

Note that you can make some performance improvements to speed up cluster generation. For comparison, my Python implementation takes around 3 seconds to generate the clusters and build a JSON response.
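
The K-means requirements (Pearson similarity, a fixed number of clusters, an iteration limit, and stopping when no assignment changes) could be sketched roughly as follows. This is not the reference implementation: the function name `kmeans`, its parameters, the random initialisation of centroids within each word's observed range, and the use of one shared assignment list instead of per-cluster bookkeeping are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation; 0.0 if either vector has no variation."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return cov / den if den else 0.0


def kmeans(names, rows, k=5, max_iterations=20, seed=None):
    """Cluster rows (word-count vectors) and return k lists of blog names."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    rng = random.Random(seed)
    n_cols = len(rows[0])
    # Assumed initialisation: random centroids inside each column's value range.
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    centroids = [[rng.uniform(lows[j], highs[j]) for j in range(n_cols)]
                 for _ in range(k)]

    assignment = None  # previous assignment: index of the centroid for each row
    for _ in range(max_iterations):
        # Assign every row to the centroid with the smallest Pearson distance.
        new_assignment = [
            min(range(k), key=lambda c: 1.0 - pearson(centroids[c], row))
            for row in rows
        ]
        # Stop early when no row changed cluster since the last iteration.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    clusters = [[] for _ in range(k)]
    for i, c in enumerate(assignment):
        clusters[c].append(names[i])
    return clusters
```

The returned list of name lists can be passed directly to `json.dumps` when building the server response. Passing a fixed `seed` makes a run repeatable, which can help when debugging.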

Hierarchical

Hierarchical clustering always gives the same result. The tree is too large to show here, but if you get the branch shown below, it most likely works correctly:

resources/A2-Hierarchical.png

Note that there are many performance improvements you can make to speed up tree generation. For comparison, my Python implementation takes around 10 seconds to generate the tree and build a JSON response.
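
As a rough illustration of hierarchical clustering and of one possible performance improvement, the sketch below repeatedly merges the two closest clusters and caches pairwise distances so they are not recomputed in later passes. Several details are assumptions rather than requirements of the assignment: the names `hierarchical`, `to_tree`, and `leaf_names`, representing a merged cluster by the average of its two children's vectors, and the shape of the nested JSON structure.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation; 0.0 if either vector has no variation."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return cov / den if den else 0.0


def hierarchical(names, rows):
    """Agglomerative clustering; returns the root node of a binary tree."""
    clusters = [{"id": i, "name": n, "vec": list(r)}
                for i, (n, r) in enumerate(zip(names, rows))]
    next_id = len(clusters)
    cache = {}  # (id, id) -> Pearson distance, reused across iterations
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                key = (clusters[i]["id"], clusters[j]["id"])
                if key not in cache:
                    cache[key] = 1.0 - pearson(clusters[i]["vec"], clusters[j]["vec"])
                if best is None or cache[key] < best[0]:
                    best = (cache[key], i, j)
        dist, i, j = best
        left, right = clusters[i], clusters[j]
        merged = {
            "id": next_id,
            "left": left,
            "right": right,
            "distance": dist,
            # Assumption: the merged cluster is the average of its two children.
            "vec": [(x + y) / 2 for x, y in zip(left["vec"], right["vec"])],
        }
        next_id += 1
        del clusters[j]  # j > i, so remove the later index first
        del clusters[i]
        clusters.append(merged)
    return clusters[0]


def to_tree(node):
    """Strip the vectors and return a nested dict suitable for json.dumps."""
    if "name" in node:
        return {"name": node["name"]}
    return {"distance": node["distance"],
            "children": [to_tree(node["left"]), to_tree(node["right"])]}


def leaf_names(tree):
    """Collect all blog names below a node of the tree built by to_tree."""
    if "name" in tree:
        return [tree["name"]]
    return [n for child in tree["children"] for n in leaf_names(child)]
```

A client GUI could render each `children` list as an expandable branch. The same input always produces the same tree, so a known branch can be checked with `leaf_names`.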