P2 – Clustering

This is one of the pre-defined project ideas you can choose for your project.

Clustering Wikipedia articles

Modify your clustering system from Assignment 2 to use Wikipedia articles (90 articles about Programming, 90 about Games). The dataset can be downloaded from the Datasets page.

To use the dataset for clustering, you need to select a set of words and count how often each of these words occurs in each Wikipedia article. Using every word from the articles is not recommended, because the similarity calculations would then take a long time. You can, for example, use the following words:

language, programming, computer, software, hardware, data, player, online, system, development, machine, console, developer, design, history, technology, standard, information, article, example

The article Arcade_game would then have the following frequencies, one value per word in the order listed above, separated by semicolons:

0;4;14;1;58;1;11;7;12;4;9;17;0;5;33;8;1;2;7;1
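
The counting step can be sketched roughly as follows. This is only a minimal illustration, not the method used to produce the numbers above: it assumes each article is available as a plain-text string, splits the text into lowercase alphabetic tokens, and counts exact matches only (so "players" is not counted as "player"). The names WORDS and word_frequencies are made up for this example, and a different tokenization may give different counts.

```python
import re
from collections import Counter

# The 20 example words from the list above, in the same order.
WORDS = [
    "language", "programming", "computer", "software", "hardware",
    "data", "player", "online", "system", "development",
    "machine", "console", "developer", "design", "history",
    "technology", "standard", "information", "article", "example",
]

def word_frequencies(text, words=WORDS):
    """Return how often each word in `words` occurs in `text`.

    Assumption: tokens are runs of the letters a-z after lowercasing,
    and only exact matches are counted.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in words]

# Hypothetical snippet of article text, only to show the output format.
sample = "The arcade console uses custom hardware. The player starts the game."
print(";".join(str(n) for n in word_frequencies(sample)))
```

Applying word_frequencies to every article gives one 20-value vector per article, which is the input to the clustering steps.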

Grading

Grade: Requirements
E
  • Read all articles about programming and games and convert each article to word frequencies using the word list above.
  • Perform k-means clustering on the 180 articles using two clusters.
  • Are the articles well separated into one cluster of gaming-related articles and one cluster of programming articles?
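
As a rough illustration of the grade E steps, the sketch below runs k-means with two clusters and then counts how many articles of each known topic end up in each cluster. It is not the Assignment 2 implementation: it assumes Euclidean distance and random initial centroids, which may differ from your own system, and the function names and toy vectors are made up for this example.

```python
import random
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(vectors, k=2, max_iter=100, seed=0):
    """Return a cluster index (0..k-1) for each vector.

    Assumption: initial centroids are k randomly chosen input vectors.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every vector to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # no vector changed cluster, so we have converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def cluster_composition(assignment, labels):
    """Count the known topic labels inside each cluster."""
    composition = {}
    for cluster, label in zip(assignment, labels):
        composition.setdefault(cluster, Counter())[label] += 1
    return composition

# Toy frequency vectors standing in for the 180 real articles.
vectors = [[12, 0, 1], [9, 1, 0], [11, 2, 1], [0, 10, 7], [1, 8, 9], [2, 11, 6]]
labels = ["programming"] * 3 + ["games"] * 3
print(cluster_composition(kmeans(vectors, k=2), labels))
```

If the separation is good, each cluster is dominated by one topic. Because the result depends on the random starting centroids, it can be worth trying several seeds.
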
C-D
  • Perform hierarchical clustering on the 180 articles.
  • Are articles about similar topics well separated into branches?
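
The hierarchical step for grades C-D could look roughly like the sketch below. It assumes bottom-up (agglomerative) clustering with Euclidean distance, where a merged cluster is represented by the size-weighted average of its two children. Your Assignment 2 system may use a different distance or merge rule. The names hierarchical, print_tree and leaves are made up for this example.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hierarchical(names, vectors):
    """Merge the two closest clusters until one tree remains.

    Leaves are article names; inner nodes are (left, right) tuples.
    """
    # Each entry is (tree, representative vector, number of articles).
    clusters = [(name, list(vec), 1) for name, vec in zip(names, vectors)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ti, vi, ni), (tj, vj, nj) = clusters[i], clusters[j]
        # Assumption: the merged vector is the size-weighted average.
        merged_vec = [(x * ni + y * nj) / (ni + nj) for x, y in zip(vi, vj)]
        merged = ((ti, tj), merged_vec, ni + nj)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

def print_tree(tree, depth=0):
    """Print the tree with one level of indentation per branch."""
    if isinstance(tree, tuple):
        print("  " * depth + "+")
        for child in tree:
            print_tree(child, depth + 1)
    else:
        print("  " * depth + tree)

def leaves(tree):
    """Return all article names below a node."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

# Toy vectors with made-up article names, only to show the output shape.
names = ["Prog_A", "Prog_B", "Game_A", "Game_B"]
vectors = [[10, 1], [9, 0], [1, 12], [0, 10]]
print_tree(hierarchical(names, vectors))
```

Printing the tree and checking which articles share a branch is one way to judge whether similar topics ended up together. The pairwise search is slow but should still be workable for 180 articles.
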
A-B
  • Generate your own word list of at least 100 words.
  • Repeat k-means and hierarchical clustering using the new word list.
  • Are the results better with the new word list?
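
How you build your own word list is up to you. One possible heuristic, sketched below, is to keep words that appear in some but not most of the articles, since words that occur everywhere (such as "the") do not help to tell topics apart. The thresholds, the minimum word length and the function name build_word_list are assumptions made for this example, not part of the assignment.

```python
import re
from collections import Counter

def build_word_list(texts, size=100, min_frac=0.1, max_frac=0.5, min_len=3):
    """Pick up to `size` words by document frequency.

    Assumption: a useful word appears in at least `min_frac` and at most
    `max_frac` of the articles. Candidates are ranked by total count.
    """
    doc_freq = Counter()  # number of articles containing each word
    total = Counter()     # total occurrences over all articles
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if len(t) >= min_len]
        total.update(tokens)
        doc_freq.update(set(tokens))
    n = len(texts)
    candidates = [w for w, df in doc_freq.items()
                  if min_frac <= df / n <= max_frac]
    # Most frequent first; ties are broken alphabetically.
    candidates.sort(key=lambda w: (-total[w], w))
    return candidates[:size]

# Tiny made-up texts standing in for the real articles.
texts = [
    "the game player",
    "the game console",
    "the code compiler",
    "the code language",
]
print(build_word_list(texts, size=4))
```

With the real dataset you would pass in the text of all 180 articles and keep the default size of 100 or more, then repeat both clustering methods with the new frequencies.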