- In assignment 2 you shall implement clustering on the blogs dataset containing 99 blogs
- You can use any programming language you like
- You can work alone or in a group of two students
- You shall present your application and code at an oral examination
Here are some test cases you can use to verify that your system works correctly.
K-means is not deterministic (the results differ between runs), but you usually find related blogs such as the Google and search engine blogs in one cluster as shown here:
Note that there are some performance improvements you can make to speed up the cluster generation. For comparison, my implementation in Python takes around 3 seconds to generate the clusters and build the JSON response.
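To see why results differ between runs, here is a minimal K-means sketch in which the only source of randomness is the initial centroid placement. It is an illustration under assumptions, not the reference implementation: the Pearson-based distance, the `max_iter` default, the `seed` parameter and all function names are choices made for this example and are not specified by the assignment.

```python
import math
import random


def pearson_distance(a, b):
    """Return 1 - Pearson correlation (assumed distance measure for this sketch)."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    var_a = max(sum(x * x for x in a) - sum_a ** 2 / n, 0.0)
    var_b = max(sum(x * x for x in b) - sum_b ** 2 / n, 0.0)
    cov = sum(x * y for x, y in zip(a, b)) - sum_a * sum_b / n
    den = math.sqrt(var_a * var_b)
    # A constant vector has no defined correlation; treat it as "not similar".
    return 1.0 if den == 0 else 1.0 - cov / den


def kmeans(rows, k, max_iter=20, seed=None):
    """Cluster rows (lists of word counts) and return one cluster index per row."""
    rng = random.Random(seed)  # seed=None gives a different start on every run
    n_cols = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(n_cols)]
    hi = [max(r[j] for r in rows) for j in range(n_cols)]
    # Random initial centroids inside the observed range of each column.
    centroids = [[rng.uniform(lo[j], hi[j]) for j in range(n_cols)]
                 for _ in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every row to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: pearson_distance(centroids[c], row))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # no row changed cluster, so the result is stable
        assignment = new_assignment
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

Passing a fixed `seed` makes a run repeatable, which can help when comparing your clusters against the test case; leaving it as `None` reproduces the run-to-run variation described above.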
Hierarchical clustering always gives the same result. The tree is too large to show here, but if you get the branch shown below it most likely works correctly:
Note that there are lots of performance improvements you can make to speed up the tree generation. For comparison, my implementation in Python takes around 10 seconds to generate the tree and build the JSON response.
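The sketch below shows why hierarchical clustering is repeatable: it involves no random choices, only repeated merging of the closest pair. It also illustrates one possible speed-up, caching pairwise distances by cluster id so they are not recomputed on every pass. This is a rough example under assumptions: the Pearson-based distance, the averaging of merged vectors, the dictionary layout and the cache are choices made here, not details given by the assignment.

```python
import math


def pearson_distance(a, b):
    """Return 1 - Pearson correlation (assumed distance measure for this sketch)."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    var_a = max(sum(x * x for x in a) - sum_a ** 2 / n, 0.0)
    var_b = max(sum(x * x for x in b) - sum_b ** 2 / n, 0.0)
    cov = sum(x * y for x, y in zip(a, b)) - sum_a * sum_b / n
    den = math.sqrt(var_a * var_b)
    return 1.0 if den == 0 else 1.0 - cov / den


def hierarchical(rows):
    """Build a binary merge tree bottom-up and return its root node."""
    # Leaves get ids 0..n-1; merged nodes get negative ids.
    clusters = [{"id": i, "vec": list(r), "left": None, "right": None}
                for i, r in enumerate(rows)]
    cache = {}    # (id_a, id_b) -> distance, reused across iterations
    next_id = -1
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                key = (clusters[i]["id"], clusters[j]["id"])
                if key not in cache:
                    cache[key] = pearson_distance(clusters[i]["vec"],
                                                  clusters[j]["vec"])
                # Strict "<" keeps the first closest pair, so ties resolve
                # the same way on every run.
                if best is None or cache[key] < best[0]:
                    best = (cache[key], i, j)
        _, i, j = best
        a, b = clusters[i], clusters[j]
        merged = {"id": next_id,
                  "vec": [(x + y) / 2 for x, y in zip(a["vec"], b["vec"])],
                  "left": a, "right": b}
        next_id -= 1
        del clusters[j]  # j > i, so delete the higher index first
        del clusters[i]
        clusters.append(merged)
    return clusters[0]


def leaves(node):
    """Return the leaf ids of a tree in left-to-right order."""
    if node["left"] is None:
        return [node["id"]]
    return leaves(node["left"]) + leaves(node["right"])
```

Calling `leaves(hierarchical(rows))` twice on the same data returns the same ordering, which is a quick way to check that your own tree builder is deterministic before comparing a branch against the test case.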