A2 – Clustering

Description

  • In assignment 2 you shall implement clustering on the blog dataset, which contains 99 blogs
  • You can use any programming language you like
  • You can work alone or in a group of two students
  • You shall present your application and code at an oral examination

 

Requirements

Grade Requirements
E
  • Implement K-means Clustering with Pearson similarity
  • Run the algorithm on the blog dataset (see the Datasets page) with 5 clusters
  • The algorithm shall stop after a specified number of iterations
  • Present the result as a list of clusters and their assignments
  • Implement the system as a REST web service where:
     1) the client sends a request to the server
     2) the server responds with JSON data
     3) the JSON data is decoded and presented in a client GUI
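
As a rough illustration of the grade E requirements above, here is a minimal Python sketch, not a reference solution. It assumes each blog is a list of word counts, measures distance as 1 minus the Pearson correlation, starts the centroids at random points within the observed count ranges, and serves the result with the standard library's http.server. The toy data, the /kmeans route and all function names are hypothetical; the real blog data would be loaded from the file on the Datasets page.

```python
import json
import random
from http.server import BaseHTTPRequestHandler, HTTPServer
from math import sqrt

# Hypothetical toy data: blog name -> word counts (stand-in for the 99 blogs).
BLOGS = {
    "Search A": [9, 8, 1, 0],
    "Search B": [8, 9, 0, 1],
    "Cooking A": [0, 1, 9, 8],
    "Cooking B": [1, 0, 8, 9],
}


def pearson_distance(a, b):
    """Return 1 - Pearson correlation: 0.0 for perfectly correlated vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    den = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den == 0:
        return 1.0  # a constant vector has no defined correlation
    return 1.0 - sum(x * y for x, y in zip(da, db)) / den


def kmeans(blogs, k=5, iterations=20, seed=None):
    """Cluster blogs into k groups, stopping after a fixed number of iterations."""
    rng = random.Random(seed)
    columns = list(zip(*blogs.values()))  # one tuple of counts per word
    # Start each centroid at a random point inside the observed count ranges.
    centroids = [[rng.uniform(min(col), max(col)) for col in columns]
                 for _ in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for name, counts in blogs.items():
            nearest = min(range(k),
                          key=lambda c: pearson_distance(centroids[c], counts))
            clusters[nearest].append(name)
        # Move every non-empty centroid to the mean of its assigned blogs.
        for c, members in enumerate(clusters):
            if members:
                rows = [blogs[m] for m in members]
                centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return clusters


def build_response(blogs, k=5):
    """Encode the clusters as the JSON body the server sends back."""
    return json.dumps({"clusters": kmeans(blogs, k)}).encode("utf-8")


class ClusterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/kmeans":  # hypothetical route name
            self.send_error(404)
            return
        body = build_response(BLOGS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, one could start the server with HTTPServer(("localhost", 8000), ClusterHandler).serve_forever() and request http://localhost:8000/kmeans from the client, which then decodes the body with json.loads before showing it in the GUI. The GUI itself is not covered by this sketch.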
C-D
  • Instead of stopping after a specified number of iterations, you shall implement functionality for stopping when no new assignments are made
  • Each cluster must keep track of its previous assignment; after every iteration, check whether the new assignment of each cluster matches its previous one
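
One possible way to structure the stopping check described above is sketched below in Python. It assumes the same data layout as before (blog name mapped to a list of word counts). The Centroid class, its method names and the max_iterations safety cap are illustrative choices, not part of the assignment.

```python
import random
from math import sqrt


def pearson_distance(a, b):
    """Return 1 - Pearson correlation: 0.0 for perfectly correlated vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    den = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den == 0:
        return 1.0  # a constant vector has no defined correlation
    return 1.0 - sum(x * y for x, y in zip(da, db)) / den


class Centroid:
    """A cluster centre that remembers its previous assignment."""

    def __init__(self, values):
        self.values = values
        self.assignment = set()
        self.previous = set()

    def start_pass(self):
        # Keep the old assignment for comparison, then start an empty one.
        self.previous, self.assignment = self.assignment, set()

    def is_stable(self):
        return self.assignment == self.previous

    def recenter(self, blogs):
        # Move to the mean of the assigned blogs; empty clusters stay put.
        if self.assignment:
            rows = [blogs[name] for name in self.assignment]
            self.values = [sum(col) / len(rows) for col in zip(*rows)]


def kmeans_until_stable(blogs, k=5, max_iterations=100, seed=None):
    """Run K-means until no cluster's assignment changes between two passes."""
    rng = random.Random(seed)
    columns = list(zip(*blogs.values()))
    centroids = [Centroid([rng.uniform(min(col), max(col)) for col in columns])
                 for _ in range(k)]
    passes = 0
    # max_iterations is only a safety cap (an assumption, not a requirement).
    for passes in range(1, max_iterations + 1):
        for centroid in centroids:
            centroid.start_pass()
        for name, counts in blogs.items():
            nearest = min(centroids,
                          key=lambda c: pearson_distance(c.values, counts))
            nearest.assignment.add(name)
        if all(centroid.is_stable() for centroid in centroids):
            break  # no new assignments were made
        for centroid in centroids:
            centroid.recenter(blogs)
    return [sorted(c.assignment) for c in centroids], passes
```

The function returns both the clusters and the number of passes used, which can be handy when checking that the loop really stops early instead of running to the cap.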
A-B
  • Implement Hierarchical Clustering with Pearson similarity
  • Run the algorithm on the blog dataset
  • Present the result as an interactive tree in the client GUI (it shall be possible to expand/collapse branches)
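
The following Python sketch shows one way the hierarchical part could produce a tree that is easy to send as JSON. It assumes agglomerative (bottom-up) clustering in which a merged cluster's vector is the average of its two children; the assignment text does not prescribe this, and the function names and dictionary layout are hypothetical.

```python
import json
from math import sqrt


def pearson_distance(a, b):
    """Return 1 - Pearson correlation: 0.0 for perfectly correlated vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    den = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den == 0:
        return 1.0  # a constant vector has no defined correlation
    return 1.0 - sum(x * y for x, y in zip(da, db)) / den


def build_tree(blogs):
    """Repeatedly merge the two closest clusters until one root remains."""
    nodes = [{"name": name, "vector": list(counts), "children": []}
             for name, counts in blogs.items()]
    while len(nodes) > 1:
        # Naive scan over all pairs; caching distances would be much faster.
        i, j = min(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda p: pearson_distance(nodes[p[0]]["vector"],
                                           nodes[p[1]]["vector"]),
        )
        right = nodes.pop(j)  # pop the larger index first so i stays valid
        left = nodes.pop(i)
        nodes.append({
            "name": None,
            # Assumption: the merged vector is the average of both children.
            "vector": [(x + y) / 2
                       for x, y in zip(left["vector"], right["vector"])],
            "children": [left, right],
        })
    return nodes[0]


def to_json_tree(node):
    """Drop the word vectors, keeping only what an expandable tree view needs."""
    if not node["children"]:
        return {"name": node["name"]}
    return {"children": [to_json_tree(child) for child in node["children"]]}
```

Calling json.dumps(to_json_tree(build_tree(blogs))) gives a nested structure of branches and leaves that a client-side tree widget can expand and collapse. Because there is no random start, the same input produces the same tree on every run.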

 

Test cases

Here are some test cases you can use to verify that your system works correctly.

K-means:
K-means is not deterministic (the results differ between runs), but related blogs, such as the Google and search engine blogs, usually end up in the same cluster, as shown here:

[Figure: example cluster containing the Google and search engine blogs]


Note that there are some performance improvements you can make to speed up the cluster generation. For comparison, my implementation in Python takes around 3 seconds to generate the clusters and build the JSON response.

Hierarchical:
Hierarchical clustering always gives the same result. The tree is too large to show here, but if you get the branch shown below, your implementation most likely works correctly:

[Figure: example branch of the hierarchical clustering tree]

Note that there are many performance improvements you can make to speed up the tree generation. For comparison, my implementation in Python takes around 10 seconds to generate the tree and build the JSON response.
